The DevOps Way to Solve JVM Memory Issues

Foster collaboration and be proactive

The killer in any IT operation is unplanned work. Unplanned work may go by many names: firefighting, war rooms, Sev 1 incidents. The bottom line is that Operations must stop whatever planned work it was doing to manage the incident, which means little or no normal work gets done.

It is a scenario most of you will be familiar with: your application servers are humming along happily until suddenly, for no obvious reason, memory usage starts to increase, soon followed by longer garbage collection pauses that finally force you to restart the application. The operations team is typically unaware of the actual impact on end users (other than a service being down), and it lacks the data and time to investigate the issue further. As communication between the traditional silos of operations, testing and development is often less than ideal, a scheduled restart in a "low impact" timeframe is often the easiest solution, and over time it turns into something resembling a "Production Best Practice."

This adds to the workload of the operations team, because unplanned work is replaced by unnecessary preventive work. The scheduled restart also becomes a suspect every time there is a problem with the application. Wouldn't it be better to actually fix the issues instead of just working around them? Shouldn't there be a general understanding across all teams responsible for an application to fix the problem as fast as possible and make sure that it's prevented in the future?

In this blog post we walk you through a case study in which a memory leak in a third-party plugin impacted end-user performance. Instead of hiding the problem with preventive JVM restarts, we applied DevOps best practices, which fostered collaboration between Ops, Test and Dev.

The Rise of Third-Party Plugins
While applications in the early days of computing were monolithic behemoths, their modern successors, whether desktop or browser based, usually provide extension points that allow developers to extend their functionality with plugins. Such plugins can be used on both the client and the server side. Familiar examples from everyday use include browser plugins for IE and Firefox such as Skype, Flash or Java, and add-ons for Outlook or Excel. A popular server-side example is WordPress, the platform we use for this blog, where plugins automatically filter out spam comments and provide various ways for you to share our posts on your social network of choice.

From an application owner's perspective, the biggest benefit of a plugin-based architecture is the increase in flexibility - you can meet changing needs by adding new plugins, instead of worrying about upgrading a much larger system. But by using plugins, you grant a (more or less well-known) third party access to your data and systems, which frequently raises privacy concerns as well as security issues. The gaping security holes in the Java browser plugin are one example. While best practices such as sandboxing help with these risks, these discussions typically focus on the client side; the performance impact of plugins on the server side of your application is often missed. We have covered the possible effects of client-side plugins before, and will focus on the server side in this blog post.

The Trigger for Operations
Let's get back to the memory issue that forced us - the R&D lab of Compuware APM - to get Dev and Ops together and find an alternative to regularly scheduled restarts of our application servers. Across all Compuware APM product lines we use a Salesforce-based case management solution called Case360 to support our customers. Within our R&D organization we use Atlassian JIRA, a popular Java-based bug tracking solution that has grown into a platform for agile software development through its plugin ecosystem. New issues raised by customers, as well as changes to existing issues, have to be synchronized between Case360 and JIRA. Since this is not an out-of-the-box capability of JIRA, we looked for and found a plugin that met our requirements and worked well for us for the first several months.

Fast forward a couple of months. Seemingly out of the blue, we began to see performance issues with JIRA. Our production monitoring alerted Ops about decreased end-user performance with some users aborting actions due to very long response times. Nobody had called in yet - but the early warning system indicated that users would soon complain.

Ops and Dev Working Together
The infrastructure monitoring data showed Ops that the root cause of the slower performance was high garbage collection time on the JIRA server. The pattern of GC times and JVM heap consumption indicated a "classical" memory leak.
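To make the two signals concrete, here is a minimal sketch of how heap usage and cumulative GC time can be read from inside a JVM through the standard java.lang.management beans. The class name GcHeapSampler and its methods are hypothetical and are not part of the monitoring solution described in this post. Heap usage that keeps climbing across samples while the share of time spent in GC grows is the kind of pattern described above. Note that for concurrent collectors the reported collection time is not pure pause time, so treat the overhead figure as an approximation.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Hypothetical helper: samples heap usage and cumulative GC time via the standard JMX beans. */
final class GcHeapSampler {

    /** One observation of the JVM's memory state. */
    static final class Sample {
        final long heapUsedBytes;
        final long gcTimeMillis;

        Sample(long heapUsedBytes, long gcTimeMillis) {
            this.heapUsedBytes = heapUsedBytes;
            this.gcTimeMillis = gcTimeMillis;
        }
    }

    /** Reads current heap usage and the GC time accumulated by all collectors since JVM start. */
    static Sample takeSample() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = gc.getCollectionTime(); // -1 when the collector does not report it
            if (time > 0) {
                gcMillis += time;
            }
        }
        return new Sample(heap.getUsed(), gcMillis);
    }

    /** Fraction of the elapsed wall-clock time that was spent in GC between two samples. */
    static double gcOverhead(Sample before, Sample after, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            return 0.0;
        }
        return (double) (after.gcTimeMillis - before.gcTimeMillis) / elapsedMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        long intervalMillis = 1000; // sampling interval chosen for illustration only
        Sample previous = takeSample();
        for (int i = 0; i < 3; i++) {
            Thread.sleep(intervalMillis);
            Sample current = takeSample();
            System.out.printf("heap used: %d MB, GC overhead: %.1f%%%n",
                    current.heapUsedBytes / (1024 * 1024),
                    100 * gcOverhead(previous, current, intervalMillis));
            previous = current;
        }
    }
}
```

In production you would usually export these values to your monitoring system rather than print them, and alert on the trend over hours or days, not on a single sample.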

Ops started to investigate and worked with our performance engineering team to correlate the start of the issues with other changes, but came up blank. No new plugins had been installed recently, and no updates to the underlying operating system, the Java runtime, or JIRA had been applied in the preceding weeks.

With memory usage continuing to climb, an OutOfMemoryError (OOM) was unavoidable. As it was still during business hours, a "controlled" restart was not a good option either. Unfortunately, the OOM did happen. When it did, our monitoring solution automatically triggered a full memory dump that allowed us to view the heap's content at the time of the error. When analyzing the dump, we noticed a number of large object instances as shown in the following screenshot:

Automatically triggered Memory Dump shows objects that consumed most of the heap space
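If you do not have a monitoring solution that captures dumps for you, HotSpot-based JVMs can write one on their own when started with the standard options -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath. A dump can also be requested programmatically. The following is a minimal sketch that assumes a HotSpot-based JVM; the HeapDumper class is hypothetical and is not how the dump in this case study was produced.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: writes a heap dump of the running JVM on demand (HotSpot-based JVMs only). */
final class HeapDumper {

    /**
     * Writes a heap dump in .hprof format.
     *
     * @param target   file to create; it must not exist yet, and newer JDKs require the ".hprof" suffix
     * @param liveOnly if true, only objects that are still reachable are dumped
     */
    static void dump(File target, boolean liveOnly) {
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            diagnostics.dumpHeap(target.getAbsolutePath(), liveOnly);
        } catch (IOException e) {
            throw new UncheckedIOException("Heap dump failed: " + target, e);
        }
    }

    public static void main(String[] args) {
        // Target location chosen for illustration; use a volume with enough free space in practice.
        File target = new File(System.getProperty("java.io.tmpdir"),
                "manual-dump-" + System.currentTimeMillis() + ".hprof");
        dump(target, true);
        System.out.println("Heap dump written to " + target + " (" + target.length() + " bytes)");
    }
}
```

Keep in mind that a dump is roughly as large as the used heap and that the JVM pauses while it is written, so on a production system this is something to trigger deliberately.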

Looking at the class names of these instances, we were able to identify the actual culprit: the Salesforce synchronization plugin. The plugin had been in use for over half a year without any problem. It comes with a cache for the tickets that were synchronized. As the number of tickets grew over time, this cache grew as well. Unfortunately, the cache had no size limit, and when we finally reached a critical number of tickets and attachments, it caused JIRA to run out of memory.

The very high number of HashMap and HashMap Entry objects filled up JIRA's heap.
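We do not know how the plugin's cache was implemented internally or how the vendor fixed it, so the following is only a generic sketch of one common way to keep a map-based cache from growing without limit: a least-recently-used cache built on java.util.LinkedHashMap. The class name BoundedTicketCache and the ticket keys are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical size-limited cache: evicts the least recently used entry once the limit is exceeded. */
final class BoundedTicketCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    private final int maxEntries;

    BoundedTicketCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true: iteration runs from least to most recently used
        if (maxEntries <= 0) {
            throw new IllegalArgumentException("maxEntries must be positive");
        }
        this.maxEntries = maxEntries;
    }

    /** Called by LinkedHashMap after each insertion; returning true removes the eldest entry. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        BoundedTicketCache<String, String> cache = new BoundedTicketCache<>(2);
        cache.put("CASE-1", "first ticket");
        cache.put("CASE-2", "second ticket");
        cache.get("CASE-1");                 // marks CASE-1 as recently used
        cache.put("CASE-3", "third ticket"); // exceeds the limit, so CASE-2 is evicted
        System.out.println(cache.keySet());  // prints [CASE-1, CASE-3]
    }
}
```

LinkedHashMap is not thread-safe, so a cache shared between request threads would need external synchronization (for example via Collections.synchronizedMap) or a purpose-built caching library.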

With this information, we were able to pinpoint the root cause and reach out to the plugin's developers at the third-party vendor. The detailed data we had available - both the memory dumps and the measured impact on end-user response time - spared us the collaboration and communication problems you typically run into. There was no finger pointing and no going back and forth to provide ever more detailed log files. Within days (before another OOM could occur), we had a fixed version. It was first deployed in our staging environment and tested by our performance team, giving Ops the confidence that it would solve the problem, and then rolled out to production.

Don't Do It the "Easy Way": Preventive JVM Reboots
What would have happened without proactive monitoring of the application? End users calling in, rather than early warning alerts, would have been our first notice, and our teams would have lost their head start on the analysis. JVM metrics and log messages alone would not have been enough to analyze this problem, as no log output pointed to an endlessly growing cache. Without insight into the heap's contents, there would have been no immediate connection between the plugin and the start of the memory issues.

Reaching out to Atlassian and supplying the team with log files would have been the next step, adding turnaround time - time our users would have spent dealing with sporadic outages. Even if Atlassian had been able to point us in the right direction (to the plugin vendor), we would have lost more time there, as the usual process of trying to reproduce the problem on the vendor's systems would have begun.

The most common solution, as discussed in the introduction, is to schedule JVM restarts during low-traffic hours to prevent a major impact, and to keep doing so until the problem finally gets fixed, or simply forever. This is not "proactive." It is just the easy way to do damage control.

Do It the "DevOps Way": Foster Collaboration and Be Proactive
As we learned from our own example, preventive restarts of application servers are not the only measure Ops has for fighting application problems that impact end users.

Fixing problems instead of working around them requires a performance culture within the organization that puts the right people, processes and priorities in place - supported by tooling that makes collaboration and root cause analysis easy. Having data readily available allows us to overcome the typical collaboration and communication problems between those who are impacted by the problem (Ops) and those who have to fix it (Dev). With that you can ensure higher availability of your systems, which not only results in happier users but also frees Ops resources from troubleshooting.

More Stories By Wolfgang Gottesheim

Wolfgang Gottesheim has several years of experience as a software engineer and research assistant in the Java enterprise space. Currently he contributes to the strategic development of the dynaTrace enterprise solution as a Technology Strategist in the Compuware APM division’s Center of Excellence. He focuses on monitoring and optimizing applications in production.

