Welcome!

XML Authors: Liz McMillan, Anne Buff, Carmen Gonzalez, Paige Leidig, Sanjeev Sharma

Related Topics: XML, SOA & WOA

XML: Blog Post

Eating Our Own Dog Food: dynaTrace Does Continuous APM

How dynaTrace does continuous APM internally in development

I sat together with Stefan Frandl, Test Automation Lead in dynaTrace’s R&D Lab in Linz, Austria to discuss how dynaTrace does Continuous APM in Development. Obviously dynaTrace takes performance very seriously as we preach to our clients that Continuous Application Performance Management is a critical component across the Application Lifecycle. The earlier in the Lifecycle you manage and get your performance under control the less you have to worry about actual problems later on when you ship your product.

In the discussion I had with Stefan he talked about how dynaTrace transitioned from traditional performance management to where we are now – which means: “eat our own dog food” and “live the dynaTrace Continuous APM message”.

In this article we learn that it is not simply done by plugging in an APM Solution and all your performance problems are automatically detected. It is about building a robust continuous integration environment with meaningful functional and performance tests. It is about having “buy in” from your engineers. It is about figuring out what needs to be measured – what can be measured – which measures you can actually trust and figuring out which measures indicate your performance health status.

What are the problems that performance management in development solves?

I’ve been a developer for many years – so I have no problems ranting a bit about an attitude that many of us have. Meaning: We always “think” that our code is fast enough. And that might be true on my local dev-machine with enough RAM and CPU power that can easily handle the single user load when testing the newly implemented feature or recent bug fix.
In addition to this problem Stefan listed the following areas:

  • Continuous changes on the codebase by different people over a longer period of time increase the probability of small problems sneaking in and accumulating over time into big problems
  • Multiple “New Features” or “Bug Fixes” across the code base from different developers impacting each other
  • Different hardware reveals different problems – especially in multi-threaded environments
  • Other software running on the target machine impacts your application performance

Executing performance tests only at the end of a sprint/iteration or as the very last step before a product release will uncover all the small accumulated problems or environmental related problems at once. Finding all these problems late means additional effort by the dev team to analyze these problems (going back in the change log, getting back into the code, …) and jeopardizes the project schedule.

Furthermore a developer cannot verify if his improvements are real improvements or only improving the product in his local test setup.

Therefore: focus early and continuously on the performance aspect of your code.

Why the traditional approach failed

Prior to “eating our own dog food” we approached performance management in two traditional ways:

Using Profilers
Developers used profilers on their local machine to identify hot spots on manually executed test cases, e.g.: clicking through the main use case of the new feature. This is of course a valid approach and identifies general performance problems like non optimized algorithms – non-performing usage of collections, “wasted” memory …

The usage of a profiler is limited to low load environments. Why is that? Because profilers – in order to capture all this information – have a significant impact on application performance and don’t work well under heavy load. That means that problems that happen outside the “one user” test scenario are harder to catch with a profiler. I don’t say it’s impossible as you can run profilers in different modes that lower the overhead – but then you often don’t get the detailed data you need. So – there is a big trade-off here. Concurrency problems often only occur in high load scenarios which can’t be covered by profilers. So these problems remain undiscovered.

Using manual timings
Adding custom timers in the code is another approach that was used. Developers added their own time measuring statements in what they believed are critical methods in their code. This approach works better in high load environments – but brings two problems with it:

  1. it requires code changes and it is limited to your own code
  2. you have to manually dig through the collected information and try to make sense of it -> timings alone often don’t help you either so you need to add additional logging to get things like method arguments, …
  3. it’s very hard to compare results collected from different machines (e.g. different hardware)

The manual effort on the one side and the inability to manage performance under heavy load on the other made the traditional approaches fail.

Challenge with accurate and stable measuring

Measuring execution time is a thing that is not too hard to do with all the options that the runtime, application server or operating system provides. But it is not easy to measure the right thing and to measure it accurately. Here are some of the problems Stefan ran into when measuring execution time:

Garbage Collection
In a managed environment like Java and .NET, the Garbage Collector plays a big role in application performance. GC Runs and their impact on performance are unpredictable across runs. Even though you run the same test in the same environment with the same parameters it doesn’t mean that the GC runs consistently. Measuring execution times of Methods in Java & .NET Applications is therefore done by extracting the time the Garbage Collector “suspended” a method from execution. What does that mean? When you take a timestamp at the beginning and at the end of the method the difference is not necessarily the pure execution time. If the GC kicked in while your method was executing it impacts your measured time. In order to get accurate execution time it is therefore necessary to subtract the GC Collection time. This Use Case is supported by dynaTrace in a way that it measures execution time including and excluding GC. But it is still very important to monitor GC times and GC activations. A fast implementation that produces a lot of garbage and therefore adds high load to the CPU may degrade the overall performance of the system by starving other threads.

dynaTrace captures the runtime suspension time per method and per transaction

dynaTrace captures the runtime suspension time per method and per transaction

Intel SpeedStep
Some of the testing machines that Stefan uses showed very volatile test results. The same test on the same code base could not return stable results. One of the biggest challenges is to really come up with a test environment that produces accurate and stable measures. In his particular case it turned out that Intel SpeedStep caused the unpredictable performance behaviour. This might not be true for all of you out there – but it’s a great data point that hopefully helps some of you when trying to find a stable test environment.

CPU Timings under Windows
Besides execution times – meaning taking a timestamp at the beginning and at the end of the method call to measure execution time – it is possible to get the actual time spent on the CPU for the executing thread. This is a very valuable measure. The difference between CPU and Execution Time is explained by time waiting for I/O, the database, a remote call or time waiting on synchronization.
Another hint from Stefan: CPU timings on Windows Operating Systems can be very inaccurate and are therefore ignored for short runs as they don’t give us enough value.

dynaTrace APM is used to manage application performance. With built-in features separating GC Time from Execution Time, the ability to capture CPU Timings (on OS’s where these values make sense), and the fact that dynaTrace traces individual transactions across tiers down to the method level with almost no overhead enables us to use this data for performance management.

If you want to get stable results you have to have stable tests

We learned that there are certain aspects to consider when collecting execution time measures, e.g.: Extract GC Times. In order to get stable results it is not only necessary to have a stable environment and a stable way to measure timing – the key thing is to have stable tests with realistic test data in a realistic environment.
dynaTrace uses two types of load tests: Unit tests to performance test certain features, and SilkPerformer tests to load test our event collection and sending by putting those applications under load that are actively traced and monitored.

Getting Stable and realistic Unit Tests
The Unit tests are executed in all different environments that the dynaTrace software runs on. The key to getting stable results is that every test method has a setup and teardown to ensure that everything is cleaned up before the next test case executes. The test method itself then runs through several “Warm-Up” runs followed by multiple Test Runs that are taken for performance measurement. The Warm-Up phase is necessary to rule out performance impacts of a JVM/CLR that has just started, heap spaces that have not yet reached its “normal” utilization level or impacts on e.g.: not fully initialized caches. How long does the Warm-Up phase last? An indication that is used here is an execution time volatility of < 5%. This means that the warm-up phase runs until the execution times are stabilized. Following that are multiple Test Runs. Average execution values across these test runs are taken to validate the performance of the tested feature.

The dynaTrace Test Automation Team has invested a lot into a home-grown testing framework that allows the execution and the measurement of the above-described approach.

Warm-Up phase runs as long as tests produce stable results

Warm-Up phase runs as long as tests start producing stable results

Getting Stable Load Testing Results
The concept of a Warm-Up and the actual measured Testing-Phase is not a new concept in general. The SilkPerformer Load-Tests that are used work in the same way where SilkPerformer has the built-in feature of a Warm-Up and Measurement Period which made it easy for us to apply this process to these kinds of load tests as well.

So Stefan – How often do you run your tests and what happens if things go wrong?

We use QuickBuild as our Build/Continuous Integration Server. Every time a build is triggered all functional unit tests are executed giving us immediate feedback about functional correctness of the build. A broken build or failed Unit tests trigger alerts to those developers that checked in code in the respective code base since the last successful build. This gives us the chance to immediately fix functional regressions.

Twice a day we also execute the performance Unit tests as described above. In case of a performance regression the same alerting mechanism is triggered – meaning that the developers who made code modifications in the code are automatically notified about the problem. Larger scale performance tests for critical features are executed every day as it is not feasible to execute them more than once.

Providing the information to the developer
In addition to the Unit test results – whether we are talking about functional or performance tests – we capture transactional tracing data (PurePath’s) with dynaTrace Continuous APM (this is where we “eat our own dog food”). dynaTrace runs on the Continuous Integration Environment and traces all tests starting from the test method through those components that are being tested.

Not only do we use dynaTrace to capture transaction-based information like executed methods, method arguments, SQL Statements, Exceptions, … – we also use dynaTrace Dashboards to make the data easily accessible to everybody. The developer is notified via automatically-triggered Email alerts if a threshold is violated. Afterwards he can have a look at our dashboards that show execution times of individual test cases over time. This is great to see performance regressions on the Entry Point level. From here we can drill deeper to a Triage Dashboard in order to identify the root cause of the identified regression, e.g.: unnecessary exceptions or method calls causing overhead. All this is available at the fingertips of the developer or architect who needs to look at the details.

A Real-Life Example on how to prevent problems from shipping with the product

Now let’s look at a real life example. For version 3.1 of dynaTrace we changed the way we read and write memory dumps. We expected huge performance improvements with that change. In order to verify this and in order to compare it against the existing implementation we created a set of load tests that tested the memory dump feature with different sizes of memory dumps.
The following graph shows the load testing results of different dump sizes.

Performance results over time for different test use cases

Performance results over time for different test use cases

We executed the tests back in May for the existing implementation to have a baseline. On May 19th we ran the first test with the new implementation. We can observe that most of the tests ran faster – but – we had some that were significantly slower. It turned out this was due to an incorrect internal cache strategy which was really fast for large amounts of data but slow for small dumps. Once this problem was fixed we overall got much better times with the new implementation.

Further down the road two bug fixes – that were not intended to “hurt” performance – actually impacted the performance of a single use case dramatically. With the tests in place and with the collection of detailed results, it was easy to identify the problem caused by the fix and solve it in no time.
Further performance improvements that were made early June can be seen and verified that they actually improved the new version.

What was necessary to catch this problem?
It was necessary to have the correct test cases in place. In this case it was essential to run the same test on different environments with different input data. Otherwise the problem would have remained undetected.
Running these tests twice a day allowed the developer who committed the fix to go back in the code that he was still very familiar with and fix the problem for those scenarios that were not tested on his local machine but were tested in the CI Environment.
dynaTrace Continuous APM is used to analyze test executions – it provides accurate timings and in-depth tracing information including method execution times, argument values, database statements, exceptions, … This information is essential for analyzing the root cause of the problem fast in order to bring performance back on track.

Conclusion

dynaTrace uses dynaTrace Continuous APM internally to live what we believe is the correct approach for Continuous Application Performance Management. We learned that an APM Solution is one piece of the puzzle in a Development Environment. In order to do performance management it is essential to have good and stable test cases that are executed continuously. Then the Continuous APM Solution enables you to identify regressions early on providing your developers the in-depth information they need to minimize bug fixing time.

Here is the dynaTrace Test Automation check list for successful performance changes:

  • Create performance test before improving feature performance
  • Carry them out at least 5 times to be sure that they are stable
  • Implement the “Performance” of the feature
  • Rerun the tests and check if your assumptions are correct

Related reading:

  1. Continuous Performance Management in Development Continuous Integration has become a well established practice in todays...
  2. Performance Management in Continuous Integration I recently gave  presentations on Performance Management as part of...
  3. 5 Steps to Automate Browser Performance Analysis with Watir and dynaTrace AJAX Edition I’ve recently been working with several clients to analyze their...
  4. Visual Studio Team System for Unit-, Web- and Load-Testing with dynaTrace Last week I was given the opportunity to meet the...
  5. Do more with Functional Testing – Take the Next Evolutionary Step Functional Testing has always been an activity done by Test...

More Stories By Andreas Grabner

Andreas Grabner has more than a decade of experience as an architect and developer in the Java and .NET space. In his current role, Andi works as a Technology Strategist for Compuware and leads the Compuware APM Center of Excellence team. In his role he influences the Compuware APM product strategy and works closely with customers in implementing performance management solutions across the entire application lifecycle. He is a frequent speaker at technology conferences on performance and architecture-related topics, and regularly authors articles offering business and technology advice for Compuware’s About:Performance blog.

Comments (0)

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


@ThingsExpo Stories
SYS-CON Events announced today that Harbinger Systems will exhibit at SYS-CON's 15th International Cloud Expo®, which will take place on November 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA. Harbinger Systems is a global company providing software technology services. Since 1990, Harbinger has developed a strong customer base worldwide. Its customers include software product companies ranging from hi-tech start-ups in Silicon Valley to leading product companies in the US and large in-house IT organizations.
The only place to be June 9-11 is Cloud Expo & @ThingsExpo 2015 East at the Javits Center in New York City. Join us there as delegates from all over the world come to listen to and engage with speakers & sponsors from the leading Cloud Computing, IoT & Big Data companies. Cloud Expo & @ThingsExpo are the leading events covering the booming market of Cloud Computing, IoT & Big Data for the enterprise. Speakers from all over the world will be hand-picked for their ability to explore the economic strategies that utility/cloud computing provides. Whether public, private, or in a hybrid form, clo...
SYS-CON Events announces a new pavilion on the Cloud Expo floor where WebRTC converges with the Internet of Things. Pavilion will showcase WebRTC and the Internet of Things. The Internet of Things (IoT) is the most profound change in personal and enterprise IT since the creation of the Worldwide Web more than 20 years ago. All major researchers estimate there will be tens of billions devices--computers, smartphones, tablets, and sensors – connected to the Internet by 2020. This number will continue to grow at a rapid pace for the next several decades.
SYS-CON Events announced today that Gridstore™, the leader in software-defined storage (SDS) purpose-built for Windows Servers and Hyper-V, will exhibit at SYS-CON's 15th International Cloud Expo®, which will take place on November 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA. Gridstore™ is the leader in software-defined storage purpose built for virtualization that is designed to accelerate applications in virtualized environments. Using its patented Server-Side Virtual Controller™ Technology (SVCT) to eliminate the I/O blender effect and accelerate applications Gridsto...
SYS-CON Events announced today that Red Hat, the world's leading provider of open source solutions, will exhibit at Internet of @ThingsExpo, which will take place on November 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA. Red Hat is the world's leading provider of open source software solutions, using a community-powered approach to reliable and high-performing cloud, Linux, middleware, storage and virtualization technologies. Red Hat also offers award-winning support, training, and consulting services. As the connective hub in a global network of enterprises, partners, a...
As the Internet of Things unfolds, mobile and wearable devices are blurring the line between physical and digital, integrating ever more closely with our interests, our routines, our daily lives. Contextual computing and smart, sensor-equipped spaces bring the potential to walk through a world that recognizes us and responds accordingly. We become continuous transmitters and receivers of data. In his session at Internet of @ThingsExpo, Andrew Bolwell, Director of Innovation for HP’s Printing and Personal Systems Group, will discuss how key attributes of mobile technology – touch input, senso...
The Internet of Things (IoT) is making everything it touches smarter – smart devices, smart cars and smart cities. And lucky us, we’re just beginning to reap the benefits as we work toward a networked society. However, this technology-driven innovation is impacting more than just individuals. The IoT has an environmental impact as well, which brings us to the theme of this month’s #IoTuesday Twitter chat. The ability to remove inefficiencies through connected objects is driving change throughout every sector, including waste management. BigBelly Solar, located just outside of Boston, is trans...
Connected devices and the Internet of Things are getting significant momentum in 2014. In his session at Internet of @ThingsExpo, Jim Hunter, Chief Scientist & Technology Evangelist at Greenwave Systems, will examine three key elements that together will drive mass adoption of the IoT before the end of 2015. The first element is the recent advent of robust open source protocols (like AllJoyn and WebRTC) that facilitate M2M communication. The second is broad availability of flexible, cost-effective storage designed to handle the massive surge in back-end data in a world where timely analytics...
Internet of @ThingsExpo Silicon Valley announced on Thursday its first 12 all-star speakers and sessions for its upcoming event, which will take place November 4-6, 2014, at the Santa Clara Convention Center in California. @ThingsExpo, the first and largest IoT event in the world, debuted at the Javits Center in New York City in June 10-12, 2014 with over 6,000 delegates attending the conference. Among the first 12 announced world class speakers, IBM will present two highly popular IoT sessions, which will take place November 4-6, 2014 at the Santa Clara Convention Center in Santa Clara, Calif...
The Internet of Things (IoT) promises to evolve the way the world does business; however, understanding how to apply it to your company can be a mystery. Most people struggle with understanding the potential business uses or tend to get caught up in the technology, resulting in solutions that fail to meet even minimum business goals. In his session at Internet of @ThingsExpo, Jesse Shiah, CEO / President / Co-Founder of AgilePoint Inc., will show what is needed to leverage the IoT to transform your business. He will discuss opportunities and challenges ahead for the IoT from a market and tec...
SYS-CON Events announced today that TeleStax, the main sponsor of Mobicents, will exhibit at Internet of @ThingsExpo, which will take place on November 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA. TeleStax provides Open Source Communications software and services that facilitate the shift from legacy SS7 based IN networks to IP based LTE and IMS networks hosted on private (on-premise), hybrid or public clouds. TeleStax products include Restcomm, JSLEE, SMSC Gateway, USSD Gateway, SS7 Resource Adaptors, SIP Servlets, Rich Multimedia Services, Presence Services/RCS, Diame...
From a software development perspective IoT is about programming "things," about connecting them with each other or integrating them with existing applications. In his session at @ThingsExpo, Yakov Fain, co-founder of Farata Systems and SuranceBay, will show you how small IoT-enabled devices from multiple manufacturers can be integrated into the workflow of an enterprise application. This is a practical demo of building a framework and components in HTML/Java/Mobile technologies to serve as a platform that can integrate new devices as they become available on the market.
SYS-CON Events announced today that O'Reilly Media has been named “Media Sponsor” of SYS-CON's 15th International Cloud Expo®, which will take place on November 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA. O'Reilly Media spreads the knowledge of innovators through its books, online services, magazines, and conferences. Since 1978, O'Reilly Media has been a chronicler and catalyst of cutting-edge development, homing in on the technology trends that really matter and spurring their adoption by amplifying "faint signals" from the alpha geeks who are creating the future. An...
The Transparent Cloud-computing Consortium (abbreviation: T-Cloud Consortium) will conduct research activities into changes in the computing model as a result of collaboration between "device" and "cloud" and the creation of new value and markets through organic data processing High speed and high quality networks, and dramatic improvements in computer processing capabilities, have greatly changed the nature of applications and made the storing and processing of data on the network commonplace.
SYS-CON Events announced today that Aria Systems, the recurring revenue expert, has been named "Bronze Sponsor" of SYS-CON's 15th International Cloud Expo®, which will take place on November 4-6, 2014, at the Santa Clara Convention Center in Santa Clara, CA. Aria Systems helps leading businesses connect their customers with the products and services they love. Industry leaders like Pitney Bowes, Experian, AAA NCNU, VMware, HootSuite and many others choose Aria to power their recurring revenue business and deliver exceptional experiences to their customers.
The Internet of Things (IoT) is going to require a new way of thinking and of developing software for speed, security and innovation. This requires IT leaders to balance business as usual while anticipating for the next market and technology trends. Cloud provides the right IT asset portfolio to help today’s IT leaders manage the old and prepare for the new. Today the cloud conversation is evolving from private and public to hybrid. This session will provide use cases and insights to reinforce the value of the network in helping organizations to maximize their company’s cloud experience.
As a disruptive technology, Web Real-Time Communication (WebRTC), which is an emerging standard of web communications, is redefining how brands and consumers communicate in real time. The on-going narrative around WebRTC has largely been around incorporating video, audio and chat functions to apps. In his session at Internet of @ThingsExpo, Alex Gouaillard, Founder and CTO of Temasys Communications, will look at a fourth element – data channels – and talk about its potential to move WebRTC beyond browsers and into the Internet of Things.
SYS-CON Events announced today that Gigaom Research has been named "Media Sponsor" of SYS-CON's 15th International Cloud Expo®, which will take place on November 4-6, 2014, at the Santa Clara Convention Center in Santa Clara, CA. Ashar Baig, Research Director, Cloud, at Gigaom Research, will also lead a Power Panel on the topic "Choosing the Right Cloud Option." Gigaom Research provides timely, in-depth analysis of emerging technologies for individual and corporate subscribers. Gigaom Research's network of 200+ independent analysts provides new content daily that bridges the gap between break...
We certainly live in interesting technological times. And no more interesting than the current competing IoT standards for connectivity. Various standards bodies, approaches, and ecosystems are vying for mindshare and positioning for a competitive edge. It is clear that when the dust settles, we will have new protocols, evolved protocols, that will change the way we interact with devices and infrastructure. We will also have evolved web protocols, like HTTP/2, that will be changing the very core of our infrastructures. At the same time, we have old approaches made new again like micro-services...
The Industrial Internet revolution is now underway, enabled by connected machines and billions of devices that communicate and collaborate. The massive amounts of Big Data requiring real-time analysis is flooding legacy IT systems and giving way to cloud environments that can handle the unpredictable workloads. Yet many barriers remain until we can fully realize the opportunities and benefits from the convergence of machines and devices with Big Data and the cloud, including interoperability, data security and privacy.