Welcome!

Industrial IoT Authors: Yeshim Deniz, Liz McMillan, Pat Romanski, Elizabeth White, SmartBear Blog

Related Topics: Industrial IoT, Java IoT, Microservices Expo

Industrial IoT: Article

Troubleshooting Response Time Problems

Why you cannot trust your system metrics

Production Monitoring is about ensuring the stability and health of our system, that also includes the application. A lot of times we encounter production systems that concentrate on System Monitoring, under the assumption that a stable system leads to stable and healthy applications. So let’s see what System Monitoring can tell us about our Application.

Let’s take a very simple two-tier Web Application:

A simple two tier web application

A simple two tier web application

This is a simple multi-tier eCommerce solution. Users are concerned about bad performance when they do a search. Let's see what we can find out about it if performance is not satisfactory. We start by looking at a couple of simple metrics.

CPU Utilization
The best known operating system metric is CPU utilization, but it is also the most misunderstood. This metric tells us how much time the CPU spent executing code in the last interval and how much more it could execute theoretically. Like all other utilization measures it tells us something about the capacity, but not about health, stability or even performance. Simply put: 99% CPU utilization can either be optimal or indicate impeding disaster depending on the application.

The CPU Usage of the two tiers

The CPU charts show no shortage on either tier

Let's look at our setup. We see that the CPU utilization is well below 100%, so we do have capacity left. But does that mean the machine or the application can be considered healthy? Let’s look at another measure that is better suited for the job, the Load Average  (System\Processor QueueLength on Windows). The Load Average tells us how many threads or processes are currently executed or waiting to get CPU time.

Unix Top Output: load average: 1.31, 1.13, 1.10

Linux systems display three sliding load averages for the last one, five and 15 minutes. The output above shows that in the last minute there were on average 1.3 processes that needed a CPU core at the same time.

If the Load Average is higher than the number of cores in the system we should either see near 100% CPU utilization, or the system has to wait for other resources and cannot max out the CPU. Examples would be Swapping or other I/O related tasks. So the Load Average tells us if we should trust the CPU usage on the one hand and if the machine is overloaded on the other. It does not tell us how well the application itself is performing, but whether the shortage of CPU might impact it negatively. If we do notice a problem we can identify the application that is causing the issue, but not why it is causing it.

In our case we see that neither the load average nor the CPU usage shines any light on our performance issue. If it were to show high CPU utilization or a high load average we could assume that the shortage in CPU is a problem, but we could not be certain.

Memory Usage
Used memory is monitored because the lack of memory will lead to system instability. An important fact to note is that Unix and Linux operating systems will most always show close to 100% memory utilization over time. They fill the memory up with buffers and caches which get discarded, as opposed to swapped out, if that memory is needed otherwise. In order to get the "real" memory usage we need subtract these. In Linux we can do by using the free command.

Memory Usage on the two systems

Memory Usage on the two systems, neither is suffering memory problems

If we do not have enough memory we can try to identify which application consumes the most by looking at the resident memory usage of a process. Once identified we will have to use other means to identify why the process uses up the memory and whether this is ok. When we look towards memory regarding Java/.NET performance we have to make sure that the application itself is never swapped out. This is especially important because  Java accesses all its memory in a random-access fashion and if a portion were to be swapped out it would have serve performance penalties. We can monitor this via swapping measures on the process itself. So what we can learn here is whether the shortage of memory has a negative impact on application performance. As this is not the case, we are tempted to ignore memory as the issue.

We could look at other measures like network or disk, but in all cases the same thing would be true, the shortage of a resource might have impact, but we cannot say for sure. And if we don't find a shortage it does not necessarily mean that everything is fine.

Database
An especially good example of this problem is the database. Very often the database is considered the source of all performance problems, at least by the application people. From a DBA's and operations point of view the database is often running fine though.  Their reasoning is simple enough, the database is not running out of any resources, there are no especially long running or CPU consuming statements or processes running and most statements execute quite fast. So the database can not be the problem.

Let's look at this from an application point of view

Looking at the Application
As users are reporting performance problems the first thing that we do is to look at the response time and its distribution within our system.

The overall distribution in our web application

The overall distribution in our system does not show any particular bottleneck

At first glance we don't see anything particularly interesting when looking at the whole system. As users are complaining about specific requests lets go ahead and look at these in particular:

Response time distribution of the search

The response time distribution of the specific request shows a bottleneck in the backend and a lot of database calls for each and every search request

We see that the majority of the response time lies in the backend and the database layer. That the database contributes a major portion to the response time does not mean however that the DBA was wrong. We see that every single search executes 416 statements on average! That means that every statement is executing in under one millisecond and this is fast enough from the database point of view. The problem really lies within the application and its usage of the database. Let's look at the backend next.

Heap usage and GC activity on the backend

The Heap Usage and GC activity chart shows a lot of GC runs, but does it have negative impact?

Looking at the JVM we immediately see that it does execute a lot of garbage collection (the red spikes), as you would probably see in every monitoring tool. Although this gives us a strong suspicion, we do not know how this is affecting our users. So let's look at that impact:

GC Runtime suspensions that have an impact on the search

These are the runtime suspensions that directly impact the search. It is considerable but still amounts to only 10% of the response time

A single transaction is hit by garbage collection several times and if we do the math we find out that garbage collection contributes 10% to the response time. While that is considerable it would not have made sense to spend a lot of time on tuning it just now. Even if we get it down to half it would only have saved us 5% of the response time. So while monitoring garbage collection is important, we should always analyze the impact before we jump to conclusions.

So let's take a deeper look at where that particular transaction is spending time on the backend. To do this we need to have application centric monitoring in place which we can then use to isolate the root cause.

Response time distribution of the search within the backend

The detailed response time distribution of the search within the backend shows two main problems: too many EJB calls and a very slow doPost method

With the right measure points within our application we immediately see the root causes of the response time problem. At first we see that the WebService call done by the search takes up a large portion of the response time. It is also the largest CPU hotspot within that call. So while the host is not suffering CPU problems, we are in fact consuming a lot of it in that particular transaction. Secondly we see that an awful lot of EJB calls are done which in turn leads to the many database calls that we have already noticed.

That means we have identified a small memory-related issue; although there are no memory problems noticeable if we were to look only at system monitoring. We also found that we have a CPU hotspot, but the machine itself does not have a CPU problem. And finally we found that the biggest issue is squarely within the application; too many database and EJB calls, which we cannot see on a system monitoring level at all.

Conclusion
System metrics do a very good job at describing the environment, after all that is what they are meant for. If the environment itself has resource shortages we can almost assume that this has a negative impact on the applications, but we cannot be sure. If there is no obvious shortage this does not, however, imply that the application is running smoothly. A healthy and stable environment does not guarantee a healthy, stable and performing application.

Similar to the system, the application needs to be monitored in detail and with application-specific metrics in order to ensure its health and stability. There is no universal rule as to what these metrics are, but they should enable us to describe the health, stability and performance of the application itself.

Related reading:

  1. The impact of Garbage Collection on Java performance // In my last post I explained what a major...
  2. Top 10 Performance Problems taken from Zappos, Monster, Thomson and Co For a recent edition of the Swiss Computerworld Magazine we...
  3. Top 10 Client-Side Performance Problems in Web 2.0 Inspired by the Top 10 Performance Problems post which focuses...
  4. Real Life Ajax Troubleshooting Guide One of our clients occasionally runs into the following problem...
  5. Presenting Top Web 2.0 Performance Problems at WebTechCon 2010 in Mainz, Germany WebTech Conference 2010 takes place in Mainz, Germany from October...

More Stories By Michael Kopp

Michael Kopp has over 12 years of experience as an architect and developer in the Enterprise Java space. Before coming to CompuwareAPM dynaTrace he was the Chief Architect at GoldenSource, a major player in the EDM space. In 2009 he joined dynaTrace as a technology strategist in the center of excellence. He specializes application performance management in large scale production environments with special focus on virtualized and cloud environments. His current focus is how to effectively leverage BigData Solutions and how these technologies impact and change the application landscape.

@ThingsExpo Stories
SYS-CON Events announced today that DXWorldExpo has been named “Global Sponsor” of SYS-CON's 21st International Cloud Expo, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Digital Transformation is the key issue driving the global enterprise IT business. Digital Transformation is most prominent among Global 2000 enterprises and government institutions.
SYS-CON Events announced today that Datera, that offers a radically new data management architecture, has been named "Exhibitor" of SYS-CON's 21st International Cloud Expo ®, which will take place on Oct 31 - Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Datera is transforming the traditional datacenter model through modern cloud simplicity. The technology industry is at another major inflection point. The rise of mobile, the Internet of Things, data storage and Big...
While the focus and objectives of IoT initiatives are many and diverse, they all share a few common attributes, and one of those is the network. Commonly, that network includes the Internet, over which there isn't any real control for performance and availability. Or is there? The current state of the art for Big Data analytics, as applied to network telemetry, offers new opportunities for improving and assuring operational integrity. In his session at @ThingsExpo, Jim Frey, Vice President of S...
DX World EXPO, LLC., a Lighthouse Point, Florida-based startup trade show producer and the creator of "DXWorldEXPO® - Digital Transformation Conference & Expo" has announced its executive management team. The team is headed by Levent Selamoglu, who has been named CEO. "Now is the time for a truly global DX event, to bring together the leading minds from the technology world in a conversation about Digital Transformation," he said in making the announcement.
"DX encompasses the continuing technology revolution, and is addressing society's most important issues throughout the entire $78 trillion 21st-century global economy," said Roger Strukhoff, Conference Chair. "DX World Expo has organized these issues along 10 tracks with more than 150 of the world's top speakers coming to Istanbul to help change the world."
"We provide IoT solutions. We provide the most compatible solutions for many applications. Our solutions are industry agnostic and also protocol agnostic," explained Richard Han, Head of Sales and Marketing and Engineering at Systena America, in this SYS-CON.tv interview at @ThingsExpo, held June 6-8, 2017, at the Javits Center in New York City, NY.
"We are focused on SAP running in the clouds, to make this super easy because we believe in the tremendous value of those powerful worlds - SAP and the cloud," explained Frank Stienhans, CTO of Ocean9, Inc., in this SYS-CON.tv interview at 20th Cloud Expo, held June 6-8, 2017, at the Javits Center in New York City, NY.
"We've been engaging with a lot of customers including Panasonic, we've been involved with Cisco and now we're working with the U.S. government - the Department of Homeland Security," explained Peter Jung, Chief Product Officer at Pulzze Systems, in this SYS-CON.tv interview at @ThingsExpo, held June 6-8, 2017, at the Javits Center in New York City, NY.
In the enterprise today, connected IoT devices are everywhere – both inside and outside corporate environments. The need to identify, manage, control and secure a quickly growing web of connections and outside devices is making the already challenging task of security even more important, and onerous. In his session at @ThingsExpo, Rich Boyer, CISO and Chief Architect for Security at NTT i3, discussed new ways of thinking and the approaches needed to address the emerging challenges of security i...
What sort of WebRTC based applications can we expect to see over the next year and beyond? One way to predict development trends is to see what sorts of applications startups are building. In his session at @ThingsExpo, Arin Sime, founder of WebRTC.ventures, discussed the current and likely future trends in WebRTC application development based on real requests for custom applications from real customers, as well as other public sources of information.
The financial services market is one of the most data-driven industries in the world, yet it’s bogged down by legacy CPU technologies that simply can’t keep up with the task of querying and visualizing billions of records. In his session at 20th Cloud Expo, Karthik Lalithraj, a Principal Solutions Architect at Kinetica, discussed how the advent of advanced in-database analytics on the GPU makes it possible to run sophisticated data science workloads on the same database that is housing the rich...
SYS-CON Events announced today that Massive Networks will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Massive Networks mission is simple. To help your business operate seamlessly with fast, reliable, and secure internet and network solutions. Improve your customer's experience with outstanding connections to your cloud.
Internet of @ThingsExpo, taking place October 31 - November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with 21st Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. The Internet of Things (IoT) is the most profound change in personal and enterprise IT since the creation of the Worldwide Web more than 20 years ago. All major researchers estimate there will be tens of billions devic...
Everything run by electricity will eventually be connected to the Internet. Get ahead of the Internet of Things revolution and join Akvelon expert and IoT industry leader, Sergey Grebnov, in his session at @ThingsExpo, for an educational dive into the world of managing your home, workplace and all the devices they contain with the power of machine-based AI and intelligent Bot services for a completely streamlined experience.
"The Striim platform is a full end-to-end streaming integration and analytics platform that is middleware that covers a lot of different use cases," explained Steve Wilkes, Founder and CTO at Striim, in this SYS-CON.tv interview at 20th Cloud Expo, held June 6-8, 2017, at the Javits Center in New York City, NY.
With tough new regulations coming to Europe on data privacy in May 2018, Calligo will explain why in reality the effect is global and transforms how you consider critical data. EU GDPR fundamentally rewrites the rules for cloud, Big Data and IoT. In his session at 21st Cloud Expo, Adam Ryan, Vice President and General Manager EMEA at Calligo, will examine the regulations and provide insight on how it affects technology, challenges the established rules and will usher in new levels of diligence...
SYS-CON Events announced today that Calligo, an innovative cloud service provider offering mid-sized companies the highest levels of data privacy and security, has been named "Bronze Sponsor" of SYS-CON's 21st International Cloud Expo ®, which will take place on Oct 31 - Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Calligo offers unparalleled application performance guarantees, commercial flexibility and a personalised support service from its globally located cloud plat...
"MobiDev is a Ukraine-based software development company. We do mobile development, and we're specialists in that. But we do full stack software development for entrepreneurs, for emerging companies, and for enterprise ventures," explained Alan Winters, U.S. Head of Business Development at MobiDev, in this SYS-CON.tv interview at 20th Cloud Expo, held June 6-8, 2017, at the Javits Center in New York City, NY.
SYS-CON Events announced today that DXWorldExpo has been named “Global Sponsor” of SYS-CON's 21st International Cloud Expo, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Digital Transformation is the key issue driving the global enterprise IT business. Digital Transformation is most prominent among Global 2000 enterprises and government institutions.
In his opening keynote at 20th Cloud Expo, Michael Maximilien, Research Scientist, Architect, and Engineer at IBM, discussed the full potential of the cloud and social data requires artificial intelligence. By mixing Cloud Foundry and the rich set of Watson services, IBM's Bluemix is the best cloud operating system for enterprises today, providing rapid development and deployment of applications that can take advantage of the rich catalog of Watson services to help drive insights from the vast t...