@CloudExpo: Article

The DevOps Way to Solve JVM Memory Issues

Foster collaboration and be proactive

The killer in any IT operation is unplanned work. Unplanned work goes by many names: firefighting, war rooms, Sev 1 incidents. The bottom line is that Operations must stop whatever planned work it was doing to manage the incident, which means little or no normal work gets done.

It is a scenario most of you will be familiar with: your application servers are humming along happily until suddenly, without an obvious reason, memory usage starts to increase, soon followed by longer garbage collection suspensions that finally force you to restart the application. The operations team is typically unaware of the actual impact on end users (other than a service being down), and it lacks the data and time to investigate the issue further.

As communication between the traditional silos of operations, testing and development is often less than ideal, a scheduled restart in a "low impact" timeframe is often the easiest solution and turns into something resembling a "Production Best Practice" over time. This adds to the workload of the operations team because unplanned work turns into unnecessary preventative work. The restart itself also becomes a suspect every time there is a problem with the application. Wouldn't it be better to actually fix the issues instead of just working around them? Shouldn't there be a general understanding across all teams responsible for an application to fix the problem as fast as possible and make sure that it's prevented in the future?

In this blog post we walk you through a case study where a memory leak in a third-party plugin impacted end-user performance. Instead of hiding the problem with preventive JVM restarts, we used DevOps best practices that fostered collaboration between Ops, Test and Dev.

The Rise of Third-Party Plugins
While applications in the early days of computing were monolithic behemoths, their modern successors, whether desktop or browser based, usually provide extension points that allow developers to extend their functionality with plugins. Such plugins can be used both on the client and the server side. Familiar examples from everyday use include browser plugins for IE and Firefox such as Skype, Flash or Java, and add-ons for Outlook or Excel. A popular server-side example is plugins for WordPress, the platform we use for this blog instance. We use plugins that automatically filter out spam comments and provide various ways for you to share our posts on your social network of choice.

From an application owner's perspective, the biggest benefit of a plugin-based architecture is the increase in flexibility - you can meet changing needs by adding new plugins instead of worrying about upgrading a much larger system. But by using plugins, you grant a (more or less well-known) third party access to your data and systems, which frequently raises privacy concerns as well as security issues. The gaping security holes in the Java browser plugin are one example. While best practices such as sandboxing help with these risks, these discussions typically focus on the client side; the performance impact of plugins on the server side of your application is often missed. We have covered the possible effects of client-side plugins before, and will focus on the server side in this blog post.

The Trigger for Operations
Let's get back to the memory issue that forced us - the R&D lab of Compuware APM - to get Dev and Ops together and work on an alternative to regularly scheduled restarts of our application servers. Across all Compuware APM product lines we use a Salesforce-based case management solution called Case360 to support our customers. Within our R&D organization we internally use Atlassian JIRA, a popular Java-based bug tracking solution that has grown into a platform for agile software development through its plugin ecosystem. New issues raised by customers, as well as changes to existing issues, have to be synchronized between Case360 and JIRA. Since this is not an out-of-the-box capability of JIRA, we looked for and found a plugin that met our requirement and worked well for us for the first several months.

Fast forward a couple of months. Seemingly out of the blue, we began to see performance issues with JIRA. Our production monitoring alerted Ops about decreased end-user performance with some users aborting actions due to very long response times. Nobody had called in yet - but the early warning system indicated that users would soon complain.

Ops and Dev Working Together
The infrastructure monitoring data showed Ops that the root cause of the slower performance was high garbage collection time on the JIRA server. The pattern of GC times as well as JVM heap consumption indicated a "classical" memory leak.
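The two signals mentioned here, GC time and heap consumption, can also be read from inside a JVM through the standard java.lang.management API. The following is only a minimal sketch, not what our monitoring solution does: the class name GcLeakCheck, the method names and the "steady growth after GC" heuristic are illustrative assumptions.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.List;

/** Illustrative sketch only; class and method names are hypothetical. */
final class GcLeakCheck {

    /** Accumulated time in milliseconds that all collectors have spent since JVM start. */
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report a time
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Heap bytes currently in use, as reported by the JVM. */
    static long usedHeapBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    /**
     * Crude heuristic (an assumption, not a proof of a leak): returns true if
     * there are at least three samples of "heap used after a collection" and
     * each one exceeds its predecessor by at least minGrowthBytes.
     */
    static boolean looksLikeLeak(List<Long> usedAfterGcBytes, long minGrowthBytes) {
        if (usedAfterGcBytes.size() < 3) {
            return false; // too few samples to call it a trend
        }
        for (int i = 1; i < usedAfterGcBytes.size(); i++) {
            long growth = usedAfterGcBytes.get(i) - usedAfterGcBytes.get(i - 1);
            if (growth < minGrowthBytes) {
                return false;
            }
        }
        return true;
    }
}
```

A scheduler could sample usedHeapBytes() after each collection and alert once looksLikeLeak() stays true over a longer window. A real monitoring tool would use more robust trend detection than this.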

Ops started to investigate and worked with our performance engineering team to establish causality between the start of the issues and other changes, but came up blank. No new plugins had been installed recently, and no updates to the underlying operating system, the Java runtime, or JIRA had been applied within the last weeks.

With memory usage still climbing, an out-of-memory error (OOM; in Java, a java.lang.OutOfMemoryError) was unavoidable. As it was still business hours, a "controlled" restart was not a good option either, and the OOM eventually happened. When it did, our monitoring solution automatically triggered a full memory dump that allowed us to view the heap's content at the time of the error. When analyzing the dump, we noticed a number of large object instances as shown in the following screenshot:

Automatically triggered Memory Dump shows objects that consumed most of the heap space
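If no monitoring solution is in place to capture such a dump, a HotSpot-based JVM can write one itself when started with -XX:+HeapDumpOnOutOfMemoryError (optionally with -XX:HeapDumpPath=<path>). A dump can also be requested programmatically. The sketch below assumes a HotSpot-based JVM that exposes com.sun.management.HotSpotDiagnosticMXBean; the class name HeapDumper and the file names are hypothetical, and this is not how our own dump was produced.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative sketch only; assumes a HotSpot-based JVM. */
final class HeapDumper {

    /**
     * Writes a heap dump in HPROF format to the given path. The target file
     * must not exist yet and should end in ".hprof". With liveObjectsOnly set,
     * only objects that are still reachable are included.
     */
    static void dumpHeap(Path target, boolean liveObjectsOnly) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            bean.dumpHeap(target.toString(), liveObjectsOnly);
        } catch (IOException e) {
            throw new UncheckedIOException("Could not write heap dump to " + target, e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical location; a production setup would use a disk with enough free space.
        Path dir = Files.createTempDirectory("heapdump-demo");
        Path dump = dir.resolve("demo.hprof");
        dumpHeap(dump, true);
        System.out.println("Wrote " + Files.size(dump) + " bytes to " + dump);
    }
}
```

The resulting file can be opened in any heap analyzer that reads HPROF dumps to find the classes that dominate the heap.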

Looking at the class names of these instances, we were able to identify the actual culprit: the Salesforce synchronization plugin. The plugin had been in use for over half a year without any problem. It comes with a cache used for the tickets that were synchronized. With the number of tickets growing over time, this cache grew as well. Unfortunately, this cache was not limited, and when we finally reached a critical number of tickets and attachments, this cache caused JIRA to run out of memory.

The very high number of HashMap and HashMap Entry objects filled up JIRA's heap.
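The underlying problem, a cache with no upper limit, has a well-known remedy: bound the cache and evict old entries. We do not know how the vendor actually fixed the plugin, so the following is only a sketch of one common approach, a least-recently-used cache built on java.util.LinkedHashMap. The class name BoundedCache and the ticket keys in the usage note are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Illustrative LRU cache with a fixed upper bound on its size.
 * Not the plugin's actual code. Not thread-safe on its own; wrap it with
 * Collections.synchronizedMap(...) if several threads share it.
 */
final class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    private final int maxEntries;

    BoundedCache(int maxEntries) {
        // accessOrder = true keeps entries ordered from least to most recently used
        super(16, 0.75f, true);
        if (maxEntries <= 0) {
            throw new IllegalArgumentException("maxEntries must be positive");
        }
        this.maxEntries = maxEntries;
    }

    /** Called by LinkedHashMap after each insertion; returning true evicts the eldest entry. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

With new BoundedCache<String, String>(2), putting "CASE-1" and "CASE-2", reading "CASE-1" and then putting "CASE-3" evicts "CASE-2", the least recently used entry. The heap used by the cache then stays bounded no matter how many tickets are synchronized over time.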

With this information, we were able to pinpoint the root cause and reach out to the third-party developers of the plugin. The detailed data we had available - both the memory dumps and the impact on end-user response time - spared us the collaboration and communication problems you typically have in such cases. There was no finger pointing and no going back and forth multiple times to provide more detailed log files. Within days (before another OOM could occur), we had a fixed version. It was first deployed in our staging environment and tested by our performance team, which gave Ops the confidence that it would solve the problem, and was then deployed in production.

Don't Do It the "Easy Way": Preventive JVM Reboots
What would've happened without proactively monitoring the application? End users calling in, rather than early warning alerts, would have triggered the analysis, and our teams would have lost their head start. JVM metrics and log messages alone would not have been enough to analyze this problem, as no log output indicated an endlessly growing cache. Without this insight there would have been no immediate connection between the plugin and the start of the memory issues.

Reaching out to Atlassian and supplying the team with log files would've been the next step, adding turnaround time - time that our users would have had to spend dealing with sporadic outages. Even if Atlassian had been able to point us in the right direction (to the plugin vendor), we would've lost more time there, as the usual process of trying to reproduce the problem on the vendor's systems would've begun.

The most common solution, the one we talked about in the introduction, is to schedule restarts of JVMs during low-traffic hours in order to prevent a major impact, and to continue doing so until the problem finally gets fixed, or simply forever. This is not "proactive." It is just the easy way to do damage control.

Do It the "DevOps Way": Foster Collaboration and Be Proactive
As we learned from our own example, preventive restarts of application servers are not the only measure Ops has to fight application problems that impact end users.

It requires a performance culture within the organization to put the right people, processes and priorities in place - supported by tooling that makes collaboration and root cause analysis easy. Having data readily available allows us to overcome the typical collaboration and communication problems between those who are impacted by the problem (Ops) and those who have to fix it (Dev). With that you can ensure higher availability of your systems, which not only results in happier users but also frees up Ops resources from troubleshooting.

More Stories By Wolfgang Gottesheim

Wolfgang Gottesheim has several years of experience as a software engineer and research assistant in the Java enterprise space. Currently he contributes to the strategic development of the dynaTrace enterprise solution as a Technology Strategist in the Compuware APM division’s Center of Excellence. He focuses on monitoring and optimizing applications in production.
