Click here to close now.




















Welcome!

Industrial IoT Authors: Elizabeth White, Aruna Ravichandran, Trevor Parsons, Liz McMillan, Aruna Ravichandran

Related Topics: Industrial IoT, Java IoT, Microservices Expo

Industrial IoT: Article

Troubleshooting Response Time Problems

Why you cannot trust your system metrics

Production Monitoring is about ensuring the stability and health of our system, that also includes the application. A lot of times we encounter production systems that concentrate on System Monitoring, under the assumption that a stable system leads to stable and healthy applications. So let’s see what System Monitoring can tell us about our Application.

Let’s take a very simple two-tier Web Application:

A simple two tier web application

A simple two tier web application

This is a simple multi-tier eCommerce solution. Users are concerned about bad performance when they do a search. Let's see what we can find out about it if performance is not satisfactory. We start by looking at a couple of simple metrics.

CPU Utilization
The best known operating system metric is CPU utilization, but it is also the most misunderstood. This metric tells us how much time the CPU spent executing code in the last interval and how much more it could execute theoretically. Like all other utilization measures it tells us something about the capacity, but not about health, stability or even performance. Simply put: 99% CPU utilization can either be optimal or indicate impeding disaster depending on the application.

The CPU Usage of the two tiers

The CPU charts show no shortage on either tier

Let's look at our setup. We see that the CPU utilization is well below 100%, so we do have capacity left. But does that mean the machine or the application can be considered healthy? Let’s look at another measure that is better suited for the job, the Load Average  (System\Processor QueueLength on Windows). The Load Average tells us how many threads or processes are currently executed or waiting to get CPU time.

Unix Top Output: load average: 1.31, 1.13, 1.10

Linux systems display three sliding load averages for the last one, five and 15 minutes. The output above shows that in the last minute there were on average 1.3 processes that needed a CPU core at the same time.

If the Load Average is higher than the number of cores in the system we should either see near 100% CPU utilization, or the system has to wait for other resources and cannot max out the CPU. Examples would be Swapping or other I/O related tasks. So the Load Average tells us if we should trust the CPU usage on the one hand and if the machine is overloaded on the other. It does not tell us how well the application itself is performing, but whether the shortage of CPU might impact it negatively. If we do notice a problem we can identify the application that is causing the issue, but not why it is causing it.

In our case we see that neither the load average nor the CPU usage shines any light on our performance issue. If it were to show high CPU utilization or a high load average we could assume that the shortage in CPU is a problem, but we could not be certain.

Memory Usage
Used memory is monitored because the lack of memory will lead to system instability. An important fact to note is that Unix and Linux operating systems will most always show close to 100% memory utilization over time. They fill the memory up with buffers and caches which get discarded, as opposed to swapped out, if that memory is needed otherwise. In order to get the "real" memory usage we need subtract these. In Linux we can do by using the free command.

Memory Usage on the two systems

Memory Usage on the two systems, neither is suffering memory problems

If we do not have enough memory we can try to identify which application consumes the most by looking at the resident memory usage of a process. Once identified we will have to use other means to identify why the process uses up the memory and whether this is ok. When we look towards memory regarding Java/.NET performance we have to make sure that the application itself is never swapped out. This is especially important because  Java accesses all its memory in a random-access fashion and if a portion were to be swapped out it would have serve performance penalties. We can monitor this via swapping measures on the process itself. So what we can learn here is whether the shortage of memory has a negative impact on application performance. As this is not the case, we are tempted to ignore memory as the issue.

We could look at other measures like network or disk, but in all cases the same thing would be true, the shortage of a resource might have impact, but we cannot say for sure. And if we don't find a shortage it does not necessarily mean that everything is fine.

Database
An especially good example of this problem is the database. Very often the database is considered the source of all performance problems, at least by the application people. From a DBA's and operations point of view the database is often running fine though.  Their reasoning is simple enough, the database is not running out of any resources, there are no especially long running or CPU consuming statements or processes running and most statements execute quite fast. So the database can not be the problem.

Let's look at this from an application point of view

Looking at the Application
As users are reporting performance problems the first thing that we do is to look at the response time and its distribution within our system.

The overall distribution in our web application

The overall distribution in our system does not show any particular bottleneck

At first glance we don't see anything particularly interesting when looking at the whole system. As users are complaining about specific requests lets go ahead and look at these in particular:

Response time distribution of the search

The response time distribution of the specific request shows a bottleneck in the backend and a lot of database calls for each and every search request

We see that the majority of the response time lies in the backend and the database layer. That the database contributes a major portion to the response time does not mean however that the DBA was wrong. We see that every single search executes 416 statements on average! That means that every statement is executing in under one millisecond and this is fast enough from the database point of view. The problem really lies within the application and its usage of the database. Let's look at the backend next.

Heap usage and GC activity on the backend

The Heap Usage and GC activity chart shows a lot of GC runs, but does it have negative impact?

Looking at the JVM we immediately see that it does execute a lot of garbage collection (the red spikes), as you would probably see in every monitoring tool. Although this gives us a strong suspicion, we do not know how this is affecting our users. So let's look at that impact:

GC Runtime suspensions that have an impact on the search

These are the runtime suspensions that directly impact the search. It is considerable but still amounts to only 10% of the response time

A single transaction is hit by garbage collection several times and if we do the math we find out that garbage collection contributes 10% to the response time. While that is considerable it would not have made sense to spend a lot of time on tuning it just now. Even if we get it down to half it would only have saved us 5% of the response time. So while monitoring garbage collection is important, we should always analyze the impact before we jump to conclusions.

So let's take a deeper look at where that particular transaction is spending time on the backend. To do this we need to have application centric monitoring in place which we can then use to isolate the root cause.

Response time distribution of the search within the backend

The detailed response time distribution of the search within the backend shows two main problems: too many EJB calls and a very slow doPost method

With the right measure points within our application we immediately see the root causes of the response time problem. At first we see that the WebService call done by the search takes up a large portion of the response time. It is also the largest CPU hotspot within that call. So while the host is not suffering CPU problems, we are in fact consuming a lot of it in that particular transaction. Secondly we see that an awful lot of EJB calls are done which in turn leads to the many database calls that we have already noticed.

That means we have identified a small memory-related issue; although there are no memory problems noticeable if we were to look only at system monitoring. We also found that we have a CPU hotspot, but the machine itself does not have a CPU problem. And finally we found that the biggest issue is squarely within the application; too many database and EJB calls, which we cannot see on a system monitoring level at all.

Conclusion
System metrics do a very good job at describing the environment, after all that is what they are meant for. If the environment itself has resource shortages we can almost assume that this has a negative impact on the applications, but we cannot be sure. If there is no obvious shortage this does not, however, imply that the application is running smoothly. A healthy and stable environment does not guarantee a healthy, stable and performing application.

Similar to the system, the application needs to be monitored in detail and with application-specific metrics in order to ensure its health and stability. There is no universal rule as to what these metrics are, but they should enable us to describe the health, stability and performance of the application itself.

Related reading:

  1. The impact of Garbage Collection on Java performance // In my last post I explained what a major...
  2. Top 10 Performance Problems taken from Zappos, Monster, Thomson and Co For a recent edition of the Swiss Computerworld Magazine we...
  3. Top 10 Client-Side Performance Problems in Web 2.0 Inspired by the Top 10 Performance Problems post which focuses...
  4. Real Life Ajax Troubleshooting Guide One of our clients occasionally runs into the following problem...
  5. Presenting Top Web 2.0 Performance Problems at WebTechCon 2010 in Mainz, Germany WebTech Conference 2010 takes place in Mainz, Germany from October...

More Stories By Michael Kopp

Michael Kopp has over 12 years of experience as an architect and developer in the Enterprise Java space. Before coming to CompuwareAPM dynaTrace he was the Chief Architect at GoldenSource, a major player in the EDM space. In 2009 he joined dynaTrace as a technology strategist in the center of excellence. He specializes application performance management in large scale production environments with special focus on virtualized and cloud environments. His current focus is how to effectively leverage BigData Solutions and how these technologies impact and change the application landscape.

@ThingsExpo Stories
For IoT to grow as quickly as analyst firms’ project, a lot is going to fall on developers to quickly bring applications to market. But the lack of a standard development platform threatens to slow growth and make application development more time consuming and costly, much like we’ve seen in the mobile space. In his session at @ThingsExpo, Mike Weiner, Product Manager of the Omega DevCloud with KORE Telematics Inc., discussed the evolving requirements for developers as IoT matures and conducted a live demonstration of how quickly application development can happen when the need to comply wit...
Converging digital disruptions is creating a major sea change - Cisco calls this the Internet of Everything (IoE). IoE is the network connection of People, Process, Data and Things, fueled by Cloud, Mobile, Social, Analytics and Security, and it represents a $19Trillion value-at-stake over the next 10 years. In her keynote at @ThingsExpo, Manjula Talreja, VP of Cisco Consulting Services, discussed IoE and the enormous opportunities it provides to public and private firms alike. She will share what businesses must do to thrive in the IoE economy, citing examples from several industry sectors.
Growth hacking is common for startups to make unheard-of progress in building their business. Career Hacks can help Geek Girls and those who support them (yes, that's you too, Dad!) to excel in this typically male-dominated world. Get ready to learn the facts: Is there a bias against women in the tech / developer communities? Why are women 50% of the workforce, but hold only 24% of the STEM or IT positions? Some beginnings of what to do about it! In her Opening Keynote at 16th Cloud Expo, Sandy Carter, IBM General Manager Cloud Ecosystem and Developers, and a Social Business Evangelist, d...
Explosive growth in connected devices. Enormous amounts of data for collection and analysis. Critical use of data for split-second decision making and actionable information. All three are factors in making the Internet of Things a reality. Yet, any one factor would have an IT organization pondering its infrastructure strategy. How should your organization enhance its IT framework to enable an Internet of Things implementation? In his session at @ThingsExpo, James Kirkland, Red Hat's Chief Architect for the Internet of Things and Intelligent Systems, described how to revolutionize your archit...
The Internet of Things is not only adding billions of sensors and billions of terabytes to the Internet. It is also forcing a fundamental change in the way we envision Information Technology. For the first time, more data is being created by devices at the edge of the Internet rather than from centralized systems. What does this mean for today's IT professional? In this Power Panel at @ThingsExpo, moderated by Conference Chair Roger Strukhoff, panelists addressed this very serious issue of profound change in the industry.
Discussions about cloud computing are evolving into discussions about enterprise IT in general. As enterprises increasingly migrate toward their own unique clouds, new issues such as the use of containers and microservices emerge to keep things interesting. In this Power Panel at 16th Cloud Expo, moderated by Conference Chair Roger Strukhoff, panelists addressed the state of cloud computing today, and what enterprise IT professionals need to know about how the latest topics and trends affect their organization.
The Internet of Everything (IoE) brings together people, process, data and things to make networked connections more relevant and valuable than ever before – transforming information into knowledge and knowledge into wisdom. IoE creates new capabilities, richer experiences, and unprecedented opportunities to improve business and government operations, decision making and mission support capabilities.
There will be 150 billion connected devices by 2020. New digital businesses have already disrupted value chains across every industry. APIs are at the center of the digital business. You need to understand what assets you have that can be exposed digitally, what their digital value chain is, and how to create an effective business model around that value chain to compete in this economy. No enterprise can be complacent and not engage in the digital economy. Learn how to be the disruptor and not the disruptee.
SYS-CON Events announced today that HPM Networks will exhibit at the 17th International Cloud Expo®, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. For 20 years, HPM Networks has been integrating technology solutions that solve complex business challenges. HPM Networks has designed solutions for both SMB and enterprise customers throughout the San Francisco Bay Area.
Akana has released Envision, an enhanced API analytics platform that helps enterprises mine critical insights across their digital eco-systems, understand their customers and partners and offer value-added personalized services. “In today’s digital economy, data-driven insights are proving to be a key differentiator for businesses. Understanding the data that is being tunneled through their APIs and how it can be used to optimize their business and operations is of paramount importance,” said Alistair Farquharson, CTO of Akana.
Business as usual for IT is evolving into a "Make or Buy" decision on a service-by-service conversation with input from the LOBs. How does your organization move forward with cloud? In his general session at 16th Cloud Expo, Paul Maravei, Regional Sales Manager, Hybrid Cloud and Managed Services at Cisco, discusses how Cisco and its partners offer a market-leading portfolio and ecosystem of cloud infrastructure and application services that allow you to uniquely and securely combine cloud business applications and services across multiple cloud delivery models.
The enterprise market will drive IoT device adoption over the next five years. In his session at @ThingsExpo, John Greenough, an analyst at BI Intelligence, division of Business Insider, analyzed how companies will adopt IoT products and the associated cost of adopting those products. John Greenough is the lead analyst covering the Internet of Things for BI Intelligence- Business Insider’s paid research service. Numerous IoT companies have cited his analysis of the IoT. Prior to joining BI Intelligence, he worked analyzing bank technology for Corporate Insight and The Clearing House Payment...
It is one thing to build single industrial IoT applications, but what will it take to build the Smart Cities and truly society-changing applications of the future? The technology won’t be the problem, it will be the number of parties that need to work together and be aligned in their motivation to succeed. In his session at @ThingsExpo, Jason Mondanaro, Director, Product Management at Metanga, discussed how you can plan to cooperate, partner, and form lasting all-star teams to change the world and it starts with business models and monetization strategies.
In his keynote at 16th Cloud Expo, Rodney Rogers, CEO of Virtustream, discussed the evolution of the company from inception to its recent acquisition by EMC – including personal insights, lessons learned (and some WTF moments) along the way. Learn how Virtustream’s unique approach of combining the economics and elasticity of the consumer cloud model with proper performance, application automation and security into a platform became a breakout success with enterprise customers and a natural fit for the EMC Federation.
"Optimal Design is a technology integration and product development firm that specializes in connecting devices to the cloud," stated Joe Wascow, Co-Founder & CMO of Optimal Design, in this SYS-CON.tv interview at @ThingsExpo, held June 9-11, 2015, at the Javits Center in New York City.
SYS-CON Events announced today that CommVault has been named “Bronze Sponsor” of SYS-CON's 17th International Cloud Expo®, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. A singular vision – a belief in a better way to address current and future data management needs – guides CommVault in the development of Singular Information Management® solutions for high-performance data protection, universal availability and simplified management of data on complex storage networks. CommVault's exclusive single-platform architecture gives companies unp...
Electric Cloud and Arynga have announced a product integration partnership that will bring Continuous Delivery solutions to the automotive Internet-of-Things (IoT) market. The joint solution will help automotive manufacturers, OEMs and system integrators adopt DevOps automation and Continuous Delivery practices that reduce software build and release cycle times within the complex and specific parameters of embedded and IoT software systems.
"ciqada is a combined platform of hardware modules and server products that lets people take their existing devices or new devices and lets them be accessible over the Internet for their users," noted Geoff Engelstein of ciqada, a division of Mars International, in this SYS-CON.tv interview at @ThingsExpo, held June 9-11, 2015, at the Javits Center in New York City.
Internet of Things is moving from being a hype to a reality. Experts estimate that internet connected cars will grow to 152 million, while over 100 million internet connected wireless light bulbs and lamps will be operational by 2020. These and many other intriguing statistics highlight the importance of Internet powered devices and how market penetration is going to multiply many times over in the next few years.
SYS-CON Events announced today that Dyn, the worldwide leader in Internet Performance, will exhibit at SYS-CON's 17th International Cloud Expo®, which will take place on November 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. Dyn is a cloud-based Internet Performance company. Dyn helps companies monitor, control, and optimize online infrastructure for an exceptional end-user experience. Through a world-class network and unrivaled, objective intelligence into Internet conditions, Dyn ensures traffic gets delivered faster, safer, and more reliably than ever.