Troubleshooting Response Time Problems

Why you cannot trust your system metrics

Production Monitoring is about ensuring the stability and health of our system, and that includes the application. Yet we often encounter production environments that concentrate on System Monitoring, under the assumption that a stable system leads to stable and healthy applications. So let’s see what System Monitoring can tell us about our Application.

Let’s take a very simple two-tier Web Application:

A simple two tier web application

This is a simple multi-tier eCommerce solution. Users are complaining about bad performance when they do a search. Let's see what we can find out about the cause. We start by looking at a couple of simple metrics.

CPU Utilization
The best known operating system metric is CPU utilization, but it is also the most misunderstood. This metric tells us how much time the CPU spent executing code in the last interval and how much more it could theoretically execute. Like all other utilization measures it tells us something about capacity, but not about health, stability or even performance. Simply put: 99% CPU utilization can either be optimal or indicate impending disaster, depending on the application.

The CPU Usage of the two tiers

The CPU charts show no shortage on either tier

Let's look at our setup. We see that the CPU utilization is well below 100%, so we do have capacity left. But does that mean the machine or the application can be considered healthy? Let’s look at another measure that is better suited for the job, the Load Average (System\Processor Queue Length on Windows). The Load Average tells us how many threads or processes are currently executing or waiting to get CPU time.

Unix Top Output: load average: 1.31, 1.13, 1.10

Linux systems display three sliding load averages for the last one, five and 15 minutes. The output above shows that in the last minute there were on average 1.3 processes that needed a CPU core at the same time.

If the Load Average is higher than the number of cores in the system we should either see near 100% CPU utilization, or the system has to wait for other resources and cannot max out the CPU. Examples would be Swapping or other I/O related tasks. So the Load Average tells us if we should trust the CPU usage on the one hand and if the machine is overloaded on the other. It does not tell us how well the application itself is performing, but whether the shortage of CPU might impact it negatively. If we do notice a problem we can identify the application that is causing the issue, but not why it is causing it.
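The rule of thumb above can be sketched in a few lines of Python. This is a minimal illustration, not a monitoring tool: the function names, the 90% threshold standing in for "near 100%" and the core counts used in the usage note are assumptions made for this example. Only the load average string is taken from the top output shown earlier.

```python
import re

def parse_load_average(top_line):
    """Extract the 1, 5 and 15 minute load averages from a top/uptime line."""
    match = re.search(r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", top_line)
    if match is None:
        raise ValueError("no load average found in: %r" % top_line)
    return tuple(float(value) for value in match.groups())

def classify_load(load_1min, cores, cpu_utilization, busy_threshold=0.9):
    """Interpret load average and CPU utilization together.

    busy_threshold is an assumed cut-off for "near 100% CPU utilization".
    """
    if load_1min <= cores:
        # No more runnable processes than cores: the CPU is not the bottleneck.
        return "not overloaded"
    if cpu_utilization >= busy_threshold:
        # More runnable processes than cores and the CPU is maxed out.
        return "cpu-bound"
    # Processes are queuing but the CPU is idle: swapping or other I/O.
    return "waiting on other resources"

load_1min, load_5min, load_15min = parse_load_average(
    "load average: 1.31, 1.13, 1.10")
```

On a machine with two cores (an assumed value), `classify_load(load_1min, 2, 0.4)` reports that the host is not overloaded, which matches what we see in our setup. The same load on a single core with low CPU utilization would instead point at swapping or other I/O.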

In our case we see that neither the load average nor the CPU usage sheds any light on our performance issue. If they had shown high CPU utilization or a high load average, we could assume that a shortage of CPU is a problem, but we could not be certain.

Memory Usage
Used memory is monitored because a lack of memory will lead to system instability. An important fact to note is that Unix and Linux operating systems will almost always show close to 100% memory utilization over time. They fill the memory up with buffers and caches, which get discarded, as opposed to swapped out, if that memory is needed otherwise. In order to get the "real" memory usage we need to subtract these. On Linux we can do so by using the free command.
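The subtraction described above can be sketched as follows. This is a minimal example that assumes input in the format of Linux's /proc/meminfo; the function name and the sample numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def real_memory_usage(meminfo_text):
    """Return the fraction of RAM in use once buffers and caches are subtracted."""
    values = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[name.strip()] = int(fields[0])  # values are reported in kB
    total = values["MemTotal"]
    # Free memory plus buffers and caches the kernel can discard on demand.
    reclaimable = values["MemFree"] + values["Buffers"] + values["Cached"]
    return (total - reclaimable) / total

# Hypothetical sample in /proc/meminfo format, not measured on the systems above.
SAMPLE_MEMINFO = """MemTotal:        8000000 kB
MemFree:          400000 kB
Buffers:          600000 kB
Cached:          5000000 kB
"""

usage = real_memory_usage(SAMPLE_MEMINFO)
```

In this made-up sample only 5% of the memory is free, yet the "real" usage is 25% because most of the memory holds buffers and caches. On a Linux host the same function could be fed the contents of /proc/meminfo.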

Memory Usage on the two systems

Memory Usage on the two systems, neither is suffering memory problems

If we do not have enough memory we can try to identify which application consumes the most by looking at the resident memory usage of a process. Once identified we will have to use other means to identify why the process uses up the memory and whether this is ok. When we look at memory with regard to Java/.NET performance we have to make sure that the application itself is never swapped out. This is especially important because Java accesses all its memory in a random-access fashion, and if a portion were to be swapped out it would incur severe performance penalties. We can monitor this via swapping measures on the process itself. So what we can learn here is whether the shortage of memory has a negative impact on application performance. As this is not the case, we are tempted to ignore memory as the issue.
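One way to check swapping on the process itself, sketched here under the assumption of a Linux host whose kernel reports a VmSwap field in /proc/<pid>/status, is to read that field for the application's process. The function name and the sample text are hypothetical.

```python
def swapped_out_kb(status_text):
    """Return the VmSwap value in kB from /proc/<pid>/status content.

    Returns None if the field is absent (for example on older kernels).
    """
    for line in status_text.splitlines():
        if line.startswith("VmSwap:"):
            return int(line.split()[1])
    return None

# Hypothetical excerpt of /proc/<pid>/status for a JVM process.
SAMPLE_STATUS = "Name:\tjava\nVmRSS:\t 2048000 kB\nVmSwap:\t       0 kB\n"

swapped = swapped_out_kb(SAMPLE_STATUS)
```

A value of zero means no part of the process is currently swapped out. Anything above zero for a Java or .NET process deserves a closer look.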

We could look at other measures like network or disk, but in all cases the same thing would be true: a shortage of a resource might have an impact, but we cannot say for sure. And if we don't find a shortage it does not necessarily mean that everything is fine.

Database
An especially good example of this problem is the database. Very often the database is considered the source of all performance problems, at least by the application people. From a DBA's and operations point of view, though, the database is often running fine. Their reasoning is simple enough: the database is not running out of any resources, there are no especially long-running or CPU-consuming statements or processes, and most statements execute quite fast. So the database cannot be the problem.

Let's look at this from an application point of view.

Looking at the Application
As users are reporting performance problems the first thing that we do is to look at the response time and its distribution within our system.

The overall distribution in our web application

The overall distribution in our system does not show any particular bottleneck

At first glance we don't see anything particularly interesting when looking at the whole system. As users are complaining about specific requests, let's go ahead and look at these in particular:

Response time distribution of the search

The response time distribution of the specific request shows a bottleneck in the backend and a lot of database calls for each and every search request

We see that the majority of the response time lies in the backend and the database layer. That the database contributes a major portion to the response time does not mean however that the DBA was wrong. We see that every single search executes 416 statements on average! That means that every statement is executing in under one millisecond and this is fast enough from the database point of view. The problem really lies within the application and its usage of the database. Let's look at the backend next.
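A quick back-of-the-envelope calculation shows why fast statements can still add up. The statement count of 416 comes from the measurement above; the 0.8 ms per statement and the reduced count of 10 statements are assumed values for illustration only.

```python
def database_time_ms(statements, avg_statement_ms):
    """Total time one request spends executing database statements."""
    return statements * avg_statement_ms

STATEMENTS_PER_SEARCH = 416      # average per search, from the measurement above
ASSUMED_STATEMENT_MS = 0.8       # hypothetical: "under one millisecond"
ASSUMED_REDUCED_STATEMENTS = 10  # hypothetical lower statement count

current = database_time_ms(STATEMENTS_PER_SEARCH, ASSUMED_STATEMENT_MS)
reduced = database_time_ms(ASSUMED_REDUCED_STATEMENTS, ASSUMED_STATEMENT_MS)
print("current: %.1f ms, reduced: %.1f ms" % (current, reduced))
```

Under these assumptions a single search spends roughly a third of a second in the database although no individual statement is slow. Making each statement faster would change little, while issuing fewer statements would change a lot.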

Heap usage and GC activity on the backend

The Heap Usage and GC activity chart shows a lot of GC runs, but does it have negative impact?

Looking at the JVM we immediately see that it does execute a lot of garbage collection (the red spikes), as you would probably see in every monitoring tool. Although this gives us a strong suspicion, we do not know how this is affecting our users. So let's look at that impact:

GC Runtime suspensions that have an impact on the search

These are the runtime suspensions that directly impact the search. It is considerable but still amounts to only 10% of the response time

A single transaction is hit by garbage collection several times, and if we do the math we find that garbage collection contributes 10% to the response time. While that is considerable, it would not make sense to spend a lot of time on tuning it just now. Even if we cut it in half, we would save only 5% of the response time. So while monitoring garbage collection is important, we should always analyze the impact before we jump to conclusions.
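The math behind this judgment is simple enough to write down. In the sketch below the 10% share and the halving come from the text above, while the individual suspension times and the 1000 ms response time are hypothetical numbers chosen to produce that share.

```python
def gc_share(suspension_ms, response_time_ms):
    """Fraction of one transaction's response time spent in GC suspensions."""
    return sum(suspension_ms) / response_time_ms

def possible_saving(share, reduction):
    """Fraction of response time saved if a contributor shrinks by `reduction`."""
    return share * reduction

# Hypothetical transaction: 1000 ms response time, hit by four suspensions.
share = gc_share([30, 20, 25, 25], 1000)
saving = possible_saving(share, 0.5)
print("GC share: %.0f%%, saving when halved: %.0f%%" % (share * 100, saving * 100))
```

The same two functions work for any contributor to the response time, which makes it easy to compare where tuning effort would pay off most.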

So let's take a deeper look at where that particular transaction is spending time on the backend. To do this we need to have application-centric monitoring in place, which we can then use to isolate the root cause.

Response time distribution of the search within the backend

The detailed response time distribution of the search within the backend shows two main problems: too many EJB calls and a very slow doPost method

With the right measuring points within our application we immediately see the root causes of the response time problem. First we see that the WebService call done by the search takes up a large portion of the response time. It is also the largest CPU hotspot within that call. So while the host is not suffering CPU problems, we are in fact consuming a lot of CPU in that particular transaction. Secondly we see that an awful lot of EJB calls are made, which in turn leads to the many database calls that we have already noticed.

That means we have identified a small memory-related issue, although no memory problems would be noticeable if we were to look only at system monitoring. We also found that we have a CPU hotspot, although the machine itself does not have a CPU problem. And finally we found that the biggest issue lies squarely within the application: too many database and EJB calls, which we cannot see on a system monitoring level at all.

Conclusion
System metrics do a very good job of describing the environment; after all, that is what they are meant for. If the environment itself has resource shortages we can almost assume that this has a negative impact on the applications, but we cannot be sure. If there is no obvious shortage this does not, however, imply that the application is running smoothly. A healthy and stable environment does not guarantee a healthy, stable and performing application.

Similar to the system, the application needs to be monitored in detail and with application-specific metrics in order to ensure its health and stability. There is no universal rule as to what these metrics are, but they should enable us to describe the health, stability and performance of the application itself.

More Stories By Michael Kopp

Michael Kopp has over 12 years of experience as an architect and developer in the Enterprise Java space. Before coming to CompuwareAPM dynaTrace he was the Chief Architect at GoldenSource, a major player in the EDM space. In 2009 he joined dynaTrace as a technology strategist in the center of excellence. He specializes in application performance management in large scale production environments with special focus on virtualized and cloud environments. His current focus is how to effectively leverage BigData Solutions and how these technologies impact and change the application landscape.
