EPISODE#28: How AIOps Tames Systems Complexity & Overcomes Talent Shortages
It’s been said that the enemy of execution is complexity. Nowhere has this rung truer than corporate data centers, where growing systems complexity threatens the execution of digital transformation strategies. How can enterprises successfully transition to this new digital business paradigm, without succumbing to the hazards that can jeopardize increasingly complicated IT environments? According to one expert, the answer is AIOps.
In this episode, we speak with Gadi Oren, Vice President of Technology Evangelism for LogicMonitor. His company recently conducted a global survey of 300 senior IT executives and found that 90% of them had experienced a serious outage in the past 3 years, a shocking figure that LogicMonitor directly attributes to the rising complexity of IT infrastructures. Gadi shares with us how the emerging field of AIOps can counteract this negative trend, mitigate industry talent shortages, and ultimately reshape the future of data centers.
Guy Nadivi: Welcome, everyone. My name is Guy Nadivi and I’m the host of Intelligent Automation Radio. Our guest today is Gadi Oren, Vice President of Technology Evangelism for LogicMonitor, a SaaS-based performance monitoring vendor that serves up 70 billion metrics per day to an extensive list of global clients. All that enterprise-grade monitoring also produces massive data sets that are fueling automation, artificial intelligence, and machine learning in the emerging field of AIOps. Since digital transformations often start in IT, many organizations increasingly view AIOps as a foundational component for their digital transformation strategy. Given Gadi’s extensive expertise in this domain, we invited him to come on the show and help us better understand how AIOps can add value to an enterprise and perhaps also make them more competitive. Gadi, welcome to Intelligent Automation Radio.
Gadi Oren: Guy, thank you for having me.
Guy Nadivi: Gadi, the first question I think our listeners would want me to ask you is, how do you define AIOps?
Gadi Oren: So that’s a great question. It’s actually a very large one. There are multiple definitions of what AIOps is. We used to work, at the beginning, with a definition that says “…any advanced algorithm that can be used to create a sustainable and stable value for the customer, in terms of making their operations work a lot better.” Any one of those algorithms is relevant. That also includes artificial intelligence and machine learning types of algorithms. However, there are a lot of people who just don’t understand what that means. So more recently, we’re trying to explain what we think AIOps is by just telling a story about what it is that we’re building and what it is that we do, using terms that help customers understand in a more simplified way.
So we tend to look at AIOps … certainly one of the main things that it will do, with LogicMonitor, is something we call an early warning system. So an early warning system is something that can understand when something is about to happen, something that may have a negative impact, and let you know about it and give you really strong information about what’s going to happen while you can still do something about it and prevent it or improve the situation. So the early warning system, we kind of break it into roughly three parts. One is understanding what’s going on and sifting out the signal from the noise, if you will. The signal could be anything, like metrics. It could be logs. It could be anything, but it’s about understanding the things that are important for you to understand versus the things that just are not material right now. Then these things will create alerts or incidents or things you need to know about.
Then the second part of it is making those alerts into more actionable type of information. When something is more actionable, it could be that you know enough in order to do something about it or it could be the system will propose what you can do in order to solve it or giving you contextual information that will help you understand what does it all mean. So that’s the second piece of it.
The last part of it is remediation or prevention. So that part is, once you know what’s going on and you have actionable information about it, what could this early warning system … What could it do in order to change it or prevent it from happening, far enough in advance to make a difference? So when we … We tend to … We get some good responses on the notion of an early warning system. It’s something that people tend to understand much better than a list of buzzwords and very technical descriptions.
Guy Nadivi: I imagine having a high quality signal to noise ratio requires sifting through a lot of data. So I’m curious, how much data, from logs and elsewhere, does an AIOps tool need to ingest and analyze before it can start churning out meaningful predictions?
Gadi Oren: Right, so that’s a great question. It’s almost a question of, if I buy into that, how quickly will I get the value? Well, there’s not one answer to that. It actually depends, because the field of AIOps is composed of so many discrete problems that are being solved that there are multiple answers to that. For some of the problems, AIOps or the implementation of AIOps will start giving you value very, very quickly. For example, part of what we do with AIOps is we build a topology of the IT and the relationships between different objects. That is actually being done very, very quickly. It would take minutes to discover that and learn what those relationships are, probably less than 15 minutes, and you will start seeing certain types of value as soon as that, within minutes to, let’s say, 30 minutes.
Other types of problems, let’s say things that relate to abnormal behavior or, as some people call it, dynamic thresholds. Those are things that will take longer to get into value and for the system to learn, because the system needs to look at the IT for a day or two in order to understand what is normal for those two days, and then on the third day it may be able to tell you, “Look, this is not what’s happening. This is significantly different than what happened in the last two days.” So there’s a whole category of problems where the more data we see, the better the predictions will be. You will probably get a reasonable prediction after a day or two, but after a week or after a month, those predictions will keep getting better. You also have to add on top of that the fact that some cycles, especially certain business cycles, are just very long. You have … The weekend backup happens over the weekend, so in order to tell you something intelligent about that, we have to wait for at least one or two weekends. The same goes for different business cycles that take … some take 24 hours, some take a week, some take a month. So there are certainly different timeframes to start seeing the value, but some of them will be very short, definitely minutes to hours, and then the rest of them will come through over time.
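[Editor’s illustration: the “dynamic threshold” idea Gadi describes can be sketched in a few lines of Python. This is not LogicMonitor’s algorithm, which the interview does not detail. It assumes a simple band of mean plus or minus k standard deviations learned from two days of hourly readings; the function names, the value of k, and the sample data are all hypothetical.]

```python
from statistics import mean, stdev


def dynamic_threshold(history, k=3.0):
    """Learn a (low, high) band from past samples: mean +/- k standard deviations.

    Assumption: a plain mean/stdev band stands in for whatever model a real
    AIOps product would use.
    """
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def is_anomalous(history, value, k=3.0):
    """Flag a new reading that falls outside the band learned from history."""
    low, high = dynamic_threshold(history, k)
    return value < low or value > high


# Hypothetical baseline: two days (48 samples) of hourly CPU readings, in percent,
# following the same daily ramp from 40% to 51.5% on both days.
baseline = [40 + (hour % 24) * 0.5 for hour in range(48)]

print(is_anomalous(baseline, 45.0))  # a reading inside the learned band
print(is_anomalous(baseline, 95.0))  # a reading far above anything seen so far
```

[With only two days of history, a band like this cannot know about a weekend backup or a monthly cycle, which is why the predictions improve as more cycles are observed.]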
Guy Nadivi: So there’s been a lot of buzz about the numerous areas where AIOps can provide benefits to organizations, but I’d like to hear what you think is the single biggest benefit AIOps delivers.
Gadi Oren: So, AIOps … So you’re right. It’s multiple things. The challenge here is to choose, what is the one thing? What is the biggest thing that will make the difference? I think that the value, for organizations that choose to do that and choose to adopt it, is that they will be able to be a lot more proactive. So they will have … Again, this analogy of an early warning system is very applicable here. They will know about problems well in advance, compared to other companies, and hopefully will be able to act on them and prevent them. The business value of such a capability is very, very high because you will have the opportunity to avoid issues. You will probably be able to use fewer people to manage more IT and more complexity if you are more proactive than you are today. That is the biggest value, I think, that AIOps will deliver to organizations.
Guy Nadivi: Okay. So can you please talk about some particularly interesting use cases where AIOps acted as an early warning system and made predictions that mitigated problems, big problems?
Gadi Oren: Yeah. I have a number of examples. One of them is obviously a simple situation of a signal that was measuring the behavior of a certain system within the IT. That system was basically starting to behave differently. That was different than the previous seven or eight days, so that company, they were looking at what could be the causes of that and they found that one of the hardware pieces, there was some redundancy in the system and one of the hardware pieces that was part of this redundancy was failing. That was just a much bigger … It was sort of a brownout situation that had the potential to turn into a blackout, which would cost a lot of money because the company was running trading over the network, which means that they would lose a lot of money for every hour of downtime. So that really prevented a huge financial loss. So that was definitely one example.
The other example that we had was a situation where there was … with a customer that’s an MSP, so it’s a service provider. The service provider has customers and they have a service level agreement between them and their customers. So one of the customers all of a sudden had what they called an alert storm, where there’s not just a single alert, but 3,500 alerts are coming in at the same time. Now, that becomes a big issue because now people just don’t know where to start. Which one of the 3,500… Are they all problems? It’s very unlikely. Usually they’re all interconnected somehow, but finding what caused this is like looking for a needle in a haystack, right? 3,500 alerts, the size of a haystack. So what LogicMonitor does in that situation is we’re able to categorize the alerts based on their dependency.
I mentioned before that we have topology and we map topology between devices. So we can extrapolate from that the dependency between alerts. In that type of a situation, there’s usually one or two or three root cause alerts, if you will. So we put them … We pull them aside and say, “These are the two or three things you need to look at right now and all the rest are likely to be artifacts of the first.” What that does is allow you to not waste any more time, but just focus on the problem. That in turn means that the MSP is not breaking their SLA with their customers, so there’s an actual financial incentive to use such a system, because otherwise everybody is suffering and there are business implications to that.
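[Editor’s illustration: the idea of using topology to separate root cause alerts from their downstream artifacts can be sketched as below. This is a minimal sketch and not LogicMonitor’s implementation. It assumes the topology is available as a mapping from each device to the devices it depends on, and it treats an alerting device as a likely root cause when nothing upstream of it is also alerting. The device names and the `root_cause_alerts` function are hypothetical.]

```python
def root_cause_alerts(depends_on, alerting):
    """Return alerting devices that have no alerting device upstream of them.

    depends_on: dict mapping a device to the list of devices it depends on.
    alerting:   set of devices that currently have an active alert.
    """

    def has_alerting_upstream(device):
        # Walk the dependency chain upward, guarding against cycles.
        seen = set()
        stack = list(depends_on.get(device, []))
        while stack:
            parent = stack.pop()
            if parent in seen:
                continue
            seen.add(parent)
            if parent in alerting:
                return True
            stack.extend(depends_on.get(parent, []))
        return False

    return sorted(d for d in alerting if not has_alerting_upstream(d))


# Hypothetical topology: servers depend on switches, switches on a core router.
depends_on = {
    "web-1": ["switch-a"],
    "web-2": ["switch-a"],
    "db-1": ["switch-b"],
    "switch-a": ["core-router"],
    "switch-b": ["core-router"],
}

# Hypothetical alert storm: four devices alerting at once.
alerting = {"web-1", "web-2", "switch-a", "db-1"}

print(root_cause_alerts(depends_on, alerting))
```

[In this toy case the alerts on web-1 and web-2 are set aside as likely artifacts of the switch-a alert, while switch-a and db-1 are surfaced as the things to look at first.]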
Guy Nadivi: So those are definitely interesting predictions, but like many trending buzzwords, we hear a lot of hype about AIOps, including its implication for the future of data centers. Do you think AIOps is the beginning of the end for data centers?
Gadi Oren: No, I do not. I think that AIOps will actually enable … Well, first of all, you have to ask yourself, what is a data center? Is a data center the cloud? Is it what we used to call a data center, on premises? I don’t know, it really depends on how you define that, but I think that, if anything, what AIOps and different implementations of it will do is allow you to manage more complexity and more devices and more network with the same number of people. So this is not the end of the data center. I think it’s the beginning of what the data center will be in the future. It will be different, bigger, more complex, and do a lot more for you. I don’t know if it will be exactly the same shape and form that it is today, but this is the beginning of the new data center.
Guy Nadivi: When I read about AIOps, it’s primarily positioned as being for IT operations, but I’m curious if it has, in your opinion, applicability in other non-IT areas of a business?
Gadi Oren: Right, so I think there’s definitely a crossover between IT and other parts of a business. First of all, it’s obvious that for more and more companies and their IT, their various … The boundary between what IT is and what the other parts of the business are is getting more and more gray, and they’re all blending together because the business is so dependent on IT working correctly for normal operations. There are also different types of activities that used to be considered strictly more on the business side, things like capacity management and planning. When you are running out of equipment or running out of a certain capability, and that capability is either very expensive or has a long lead time, right? So we can say that about, let’s say, storage systems, right? If it takes you a long time to buy the storage and bring it on premises, or it’s just very, very expensive and you need to manage it, then there is some activity of capacity planning, looking ahead and trying to understand how the business needs are aligned with all those things that you need to bring into the organization.
So I think that AIOps is definitely spilling out from strictly low-level IT, which is speeds and feeds and making sure that things are working, into other parts of the business and into aligning your just general business activities with how they map into what you have in your IT.
Guy Nadivi: So AIOps is heavily dependent on AI and machine learning, and the entire value proposition of AI and machine learning being able to get you to the point where you can predict failures before they happen and mitigate them in advance is all predicated on one particular skill, data science. If you don’t have the data scientists to build the algorithms to generate the predictions, then you can’t leverage AIOps. Right now there is a very big shortage of data scientists. The August 2018 LinkedIn Workforce Report stated that there was a nationwide deficit of over 150,000 data scientists. How will companies like LogicMonitor overcome this staggering talent shortage?
Gadi Oren: Yeah, that’s a great question. So it’s a very real problem. There are a couple things that I can tell you about that. First of all, it’s true that a lot of the machine learning side requires a lot of data and may require data scientists, but a lot of those systems have other parts to it. So first of all, we create a lot of value also without strictly the data science side. So we definitely need data scientists, but it’s not a black and white situation where if you need 10 data scientists and you have only eight, then everything is coming to a screeching halt because there are other types of things that you have to put together in order to have a really great product. So it’s not just data scientists. Some algorithms are really not about data scientists, but having said that, we do need data scientists and have them. When we had a shortage, then we used external companies that provided us with those capabilities, which, again, is confirming your statement. It’s not the solution, but so far we were able to hire the people we need, as well as augment it with external companies.
I think also that the one thing I always like to think is that people really want to work on very interesting problems. So I believe that, so far, the fact that we were able to hire anybody that … All the different parts, whether it’s computer science or data scientists or other positions, we were able to hire and sustain and have very rapid growth to the company because, once people understand what we’re doing and they look under the hood, it’s very, very interesting. That is one of the things that help us with hiring people. So I guess one answer would be, make sure you’re working on interesting things and your problems will probably still be there, but not as difficult as other companies.
Guy Nadivi: Gadi, for the CIOs, CTOs, and other IT executives listening in, what is your one big must-have piece of advice you’d like them to take away from our discussion with regards to implementing AIOps for their IT operation?
Gadi Oren: Right, so I think that it’s about understanding your goal. I think that the goal of implementing AIOps, you should really think about what are you trying to achieve by that? Maybe five, six years ago, there used to be companies that decided they’d move to the cloud. Moving to the cloud is not a solution. It’s a type of an activity that you do in order to achieve a certain goal. So understand what your goals are when you’re implementing AIOps. Try and have them measurable. I believe I can suggest a goal and that is really to make your organization a lot more proactive than it is today. That’s something that you can measure. You can measure the results of that. You’ll have less brown outs and less outages. I think this is a very important type of a goal that will help you get value from your AIOps.
We also recently published a study that we did, and I don’t remember the exact numbers, but we did it around the world with 300 people, roughly at the CIO, CTO, or VP of Infrastructure type of level. 90% of them said that they experienced some serious outage in the recent three years. This is a very, very high number. So because the complexity is constantly increasing, the size of IT that you have is constantly increasing, and usually the size of the team is not, then having AIOps will really help you go through this digital transformation and make sure that you are proactive in preventing issues rather than being constantly in a firefighting mode.
Guy Nadivi: I think you’ve just given any CIOs or CTOs that have experienced outages a great reason to consider deploying AIOps. All right, looks like that’s all the time we have for this episode of Intelligent Automation Radio. Gadi, thank you so much for joining us today and providing your expert opinion on the value and business benefits of AIOps. We’ve really enjoyed having you on.
Gadi Oren: Thank you, Guy, for having me on the show.
Guy Nadivi: Gadi Oren, Vice President of Technology Evangelism for LogicMonitor. Thank you for listening, everyone, and remember, don’t hesitate, automate.
Vice President of Technology Evangelism for LogicMonitor
In his role as Vice President of Technology Evangelism, Gadi is responsible for the company’s strategic vision and product initiatives. Previously, Gadi was the CEO and Co-Founder of ITculate, where he was responsible for developing world-class technology and product that created contextual monitoring by discovering and leveraging application topology. He has over 18 years of industry experience, ranging from IT operations management and monitoring to storage and cloud.
Gadi previously served as the CTO and Co-founder of Cloudoscope. Prior to Cloudoscope, Gadi was head of Product Management for the analytics and manageability BU at NetApp. He has served in leadership roles for some of the fastest growing and industry-transforming companies. Gadi has a management degree from MIT Sloan and a B.Sc. Eng. in computer science and electrical engineering from the Technion in Israel.
Gadi has been featured in a number of articles:
- Using predictive analytics to troubleshoot network issues: Fact or fiction?
- LogicMonitor wades into AIOps with anomaly detection
- Latest Kubernetes monitoring, AIOps features trace DevOps trends