Welcome!

Industrial IoT Authors: Elizabeth White, Liz McMillan, Mark Hauser, XebiaLabs Blog, Pat Romanski

Related Topics: Machine Learning , Java IoT, Industrial IoT, Microservices Expo, Microsoft Cloud, @CloudExpo

Machine Learning : Article

Diagnosing Stuck Transactions in Minutes

A Step-by-Step Guide

Does the following situation sound familiar? From one minute to the other, your production servers grind to a halt, terse emails are complemented by equally hectic phone calls, and the first order of business is to get back up and running. After the dust settles, you're usually left with a pile of log files and the assignment of figuring out what happened, why it happened, and what to do to keep it from happening again.

A common first step is trying to reproduce what has gone wrong. More often than not, this consumes a considerable amount of time that would be better spent on actually fixing the problem. In this first blog post of a series, I will present a Step-by-Step Guide to Diagnose Stuck Transactions within minutes and show how a modern APM Solution helps to pinpoint common production problems, without spending hours on reproducing it at first.

The Problem: Response Time Increases
One of our customers deployed a new version of their prime web application and, for the first couple of days, everything seemed fine. One evening, their operations team was alerted about significantly increasing response time, and upon further investigation they recognized that a Stuck Transaction was blocking their application. While restarting the server solved the problem from an operations perspective it is not a long term solution as currently processing transactions get canceled and users are thrown off the system.

Their APM solution automatically captures thread dumps in case of too long running or stuck transactions. These dumps assist developers with diagnosing the issue. Let's take this example and walk through one of the problems often seen in a production environment: Diagnosing Stuck Transactions and Identifying the Root Cause.

Step 1: Identify problematic JVM/CLR
To identify the correct thread dump to analyze, it's important to know which server was affected by the stuck transaction and at what time that happened.

The transaction flow indicates a problem on one of the application servers

When drilling to the actual transactions that flow through that application server and focusing on the timeframe just before the server got restarted, I noticed a timed-out PurePath that spent 100% in sync time. This means that all it did was wait for one or more monitors, which makes it a likely culprit for our stuck transaction.

The PurePath reveals the information about which Application Server (=Agent) was involved in that stuck transaction

Step 2: Identify blocked threads
Having identified the affected application server, we can browse the list of available thread dumps and open the corresponding one.

Thread dump overview on runnable, blocked, waiting and timed waiting threads

When looking at the thread dump, threads can be grouped by their state to get a quick overview on how many threads the application executed when the dump was performed and how many of them were actually blocked. In this case, we are looking at four threads as Figure 4 shows.

Out of the 600+ Threads in the application, 4 are blocked and subject for further investigation

Step 3: Identify root cause
Already the first one of these threads is the thread from our timed-out PurePath we looked at in Step 1 - identifiable by its ID. We also get the most important piece of information we need to identify the root cause: Which object is this thread waiting on, and who owns it?

We now know which method is blocking and that it is blocked because it waits for an object owned by another thread -> a potential deadlock?

Usually, the thread owning the object we are waiting for is highlighted in red. Given the number of threads in this dump, it's unlikely to spot it this way, but search for the ID of the owning thread giving us the following information:

The HistoryLayout thread owns a monitor object with two waiting threads, which causes the web request handler to block and run into a timeout for the end user - the deadlock is identified!

In this case, the method com.vaadin.ui.Table.unregisterPropertiesAndComponents is working on the same instance of ConfigurableReportsApplication that AbstractApplicationPortlet.handleRequest is waiting for. Having identified the thread, we have the full stack trace at hand in the lower pane, and can track down how this situation occurred.

Conclusion
Trying to reproduce such concurrency-related problems on a developer machine with typically requires a major effort. If your APM tool fully supports collaboration between teams, all analysis steps described above can be performed in your local environment, without direct access to production servers. Using an APM Solution like dynaTrace, it only took a couple of minutes to answer the crucial questions named above: we know exactly what happened and how it happened, and are thus also able to modify our application in a way to avoid this situation in the future.

More Stories By Wolfgang Gottesheim

Wolfgang Gottesheim has several years of experience as a software engineer and research assistant in the Java enterprise space. Currently he contributes to the strategic development of the dynaTrace enterprise solution as a Technology Strategy in the Compuware APM division’s Center of Excellence. He focuses on monitoring and optimizing applications in production.

Comments (0)

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


@ThingsExpo Stories
The age of Digital Disruption is evolving into the next era – Digital Cohesion, an age in which applications securely self-assemble and deliver predictive services that continuously adapt to user behavior. Information from devices, sensors and applications around us will drive services seamlessly across mobile and fixed devices/infrastructure. This evolution is happening now in software defined services and secure networking. Four key drivers – Performance, Economics, Interoperability and Trust ...
Most technology leaders, contemporary and from the hardware era, are reshaping their businesses to do software in the hope of capturing value in IoT. Although IoT is relatively new in the market, it has already gone through many promotional terms such as IoE, IoX, SDX, Edge/Fog, Mist Compute, etc. Ultimately, irrespective of the name, it is about deriving value from independent software assets participating in an ecosystem as one comprehensive solution.
DevOps is often described as a combination of technology and culture. Without both, DevOps isn't complete. However, applying the culture to outdated technology is a recipe for disaster; as response times grow and connections between teams are delayed by technology, the culture will die. A Nutanix Enterprise Cloud has many benefits that provide the needed base for a true DevOps paradigm.
SYS-CON Events announced today that CollabNet, a global leader in enterprise software development, release automation and DevOps solutions, will be a Bronze Sponsor of SYS-CON's 20th International Cloud Expo®, taking place from June 6-8, 2017, at the Javits Center in New York City, NY. CollabNet offers a broad range of solutions with the mission of helping modern organizations deliver quality software at speed. The company’s latest innovation, the DevOps Lifecycle Manager (DLM), supports Value S...
SYS-CON Events announced today that Juniper Networks (NYSE: JNPR), an industry leader in automated, scalable and secure networks, will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Juniper Networks challenges the status quo with products, solutions and services that transform the economics of networking. The company co-innovates with customers and partners to deliver automated, scalable and secure network...
Five years ago development was seen as a dead-end career, now it’s anything but – with an explosion in mobile and IoT initiatives increasing the demand for skilled engineers. But apart from having a ready supply of great coders, what constitutes true ‘DevOps Royalty’? It’ll be the ability to craft resilient architectures, supportability, security everywhere across the software lifecycle. In his keynote at @DevOpsSummit at 20th Cloud Expo, Jeffrey Scheaffer, GM and SVP, Continuous Delivery Busine...
Bert Loomis was a visionary. This general session will highlight how Bert Loomis and people like him inspire us to build great things with small inventions. In their general session at 19th Cloud Expo, Harold Hannon, Architect at IBM Bluemix, and Michael O'Neill, Strategic Business Development at Nvidia, discussed the accelerating pace of AI development and how IBM Cloud and NVIDIA are partnering to bring AI capabilities to "every day," on-demand. They also reviewed two "free infrastructure" pr...
SYS-CON Events announced today that Hitachi, the leading provider the Internet of Things and Digital Transformation, will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Hitachi Data Systems, a wholly owned subsidiary of Hitachi, Ltd., offers an integrated portfolio of services and solutions that enable digital transformation through enhanced data management, governance, mobility and analytics. We help globa...
SYS-CON Events announced today that T-Mobile will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. As America's Un-carrier, T-Mobile US, Inc., is redefining the way consumers and businesses buy wireless services through leading product and service innovation. The Company's advanced nationwide 4G LTE network delivers outstanding wireless experiences to 67.4 million customers who are unwilling to compromise on ...
SYS-CON Events announced today that Super Micro Computer, Inc., a global leader in compute, storage and networking technologies, will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Supermicro (NASDAQ: SMCI), the leading innovator in high-performance, high-efficiency server technology, is a premier provider of advanced server Building Block Solutions® for Data Center, Cloud Computing, Enterprise IT, Hadoop/...
With billions of sensors deployed worldwide, the amount of machine-generated data will soon exceed what our networks can handle. But consumers and businesses will expect seamless experiences and real-time responsiveness. What does this mean for IoT devices and the infrastructure that supports them? More of the data will need to be handled at - or closer to - the devices themselves.
SYS-CON Events announced today that DatacenterDynamics has been named “Media Sponsor” of SYS-CON's 18th International Cloud Expo, which will take place on June 7–9, 2016, at the Javits Center in New York City, NY. DatacenterDynamics is a brand of DCD Group, a global B2B media and publishing company that develops products to help senior professionals in the world's most ICT dependent organizations make risk-based infrastructure and capacity decisions.
NHK, Japan Broadcasting, will feature the upcoming @ThingsExpo Silicon Valley in a special 'Internet of Things' and smart technology documentary that will be filmed on the expo floor between November 3 to 5, 2015, in Santa Clara. NHK is the sole public TV network in Japan equivalent to the BBC in the UK and the largest in Asia with many award-winning science and technology programs. Japanese TV is producing a documentary about IoT and Smart technology and will be covering @ThingsExpo Silicon Val...
The explosion of new web/cloud/IoT-based applications and the data they generate are transforming our world right before our eyes. In this rush to adopt these new technologies, organizations are often ignoring fundamental questions concerning who owns the data and failing to ask for permission to conduct invasive surveillance of their customers. Organizations that are not transparent about how their systems gather data telemetry without offering shared data ownership risk product rejection, regu...
The 20th International Cloud Expo has announced that its Call for Papers is open. Cloud Expo, to be held June 6-8, 2017, at the Javits Center in New York City, brings together Cloud Computing, Big Data, Internet of Things, DevOps, Containers, Microservices and WebRTC to one location. With cloud computing driving a higher percentage of enterprise IT budgets every year, it becomes increasingly important to plant your flag in this fast-expanding business opportunity. Submit your speaking proposal ...
Grape Up is a software company, specialized in cloud native application development and professional services related to Cloud Foundry PaaS. With five expert teams that operate in various sectors of the market across the USA and Europe, we work with a variety of customers from emerging startups to Fortune 1000 companies.
Financial Technology has become a topic of intense interest throughout the cloud developer and enterprise IT communities. Accordingly, attendees at the upcoming 20th Cloud Expo at the Javits Center in New York, June 6-8, 2017, will find fresh new content in a new track called FinTech.
@GonzalezCarmen has been ranked the Number One Influencer and @ThingsExpo has been named the Number One Brand in the “M2M 2016: Top 100 Influencers and Brands” by Analytic. Onalytica analyzed tweets over the last 6 months mentioning the keywords M2M OR “Machine to Machine.” They then identified the top 100 most influential brands and individuals leading the discussion on Twitter.
Cognitive Computing is becoming the foundation for a new generation of solutions that have the potential to transform business. Unlike traditional approaches to building solutions, a cognitive computing approach allows the data to help determine the way applications are designed. This contrasts with conventional software development that begins with defining logic based on the current way a business operates. In her session at 18th Cloud Expo, Judith S. Hurwitz, President and CEO of Hurwitz & ...
In his keynote at @ThingsExpo, Chris Matthieu, Director of IoT Engineering at Citrix and co-founder and CTO of Octoblu, focused on building an IoT platform and company. He provided a behind-the-scenes look at Octoblu’s platform, business, and pivots along the way (including the Citrix acquisition of Octoblu).