One of the main objectives of Performance Testing is to make sure an application isn't sub-optimal in its resource consumption. By resources we generally mean CPU, memory, I/O and network, but there are also resources on the application's side of the fence: heap, GC, threads, sessions, request processors and so on.
So what do you do when you notice something unusual about your web application, especially when your customers find it performing sub-optimally or even inaccessible at times (which gets a lot worse when your application is meant to be visited from across the globe)?
Here's what we did when one of our clients' pre-staging environments got into a fender bender:
Our monitoring stack, built on vanilla Linux commands, was already in place, tracking CPU, memory, I/O and network metrics. Additionally, we used jstack to collect thread dumps from the application. A little digression: the application was written in Java using popular open-source frameworks (DWR, Hibernate, Spring).
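In the incident itself we simply ran `jstack <pid>` from the shell, but for completeness, here's a rough sketch of how the same stack-trace snapshot can be taken from inside the JVM via ThreadMXBean (class and variable names are ours, purely illustrative), which can be handy when shell access to the box is limited:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSnapshot {

    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // true/true also captures locked monitors and ownable synchronizers.
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            // ThreadInfo.toString() prints the thread's state plus a (truncated) stack trace.
            System.out.println(info);
        }
    }
}
```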
What ultimately saved the day for us were the thread dumps. Examining them, we pretty much kept ending up at threads that were all stuck doing the same thing.
Conclusively, one of the frameworks was using a HashMap that is not safe for concurrent access: concurrent modifications corrupted its internal structure into a circular chain, so the threads executing that code ended up looping forever over their access operations.
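The mechanics behind this are well documented: on the JDKs of that era (prior to Java 8), racing puts on an unsynchronized java.util.HashMap could, during a resize, link a bucket's entries into a cycle, after which lookups on that bucket spin forever at 100% CPU. Here's a minimal toy repro sketch of our own (not the framework's code, and not guaranteed to trigger the loop on modern JDKs, where the resize logic changed):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class HashMapSpinDemo {

    // Deliberately NOT thread-safe: mirrors the kind of shared map the framework used.
    private static final Map<Integer, Integer> MAP = new HashMap<Integer, Integer>();

    public static void main(String[] args) {
        for (int t = 0; t < 8; t++) {
            Thread worker = new Thread(() -> {
                Random rnd = new Random();
                while (true) {
                    int key = rnd.nextInt(100_000);
                    MAP.put(key, key); // concurrent puts race on the internal resize
                    MAP.get(key);      // on old JDKs, a corrupted bucket can make this loop forever
                }
            }, "hashmap-worker-" + t);
            worker.start();
        }
    }
}
```

When the corruption happens, the symptom is exactly what we saw: RUNNABLE threads pegging the CPU while apparently doing nothing but reading a map.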
In a nutshell, it was as if two people were trying to pass each other in a corridor: Alphonse moves to his left to let Gaston pass, while Gaston moves to his right to let Alphonse pass, and they keep blocking each other so that neither ever gets through. That is precisely the definition of a livelock (courtesy: Oracle).
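If you haven't run into a livelock before, here's a tiny toy illustration of the idea, entirely our own and nothing to do with the framework in question: two overly polite threads keep reacting to each other and never make progress, even though both stay runnable the whole time.

```java
public class PoliteDinersLivelock {

    static class Spoon {
        private volatile Diner owner;
        Spoon(Diner owner) { this.owner = owner; }
    }

    static class Diner {
        private final String name;
        private volatile boolean hungry = true; // neither diner ever stops being hungry

        Diner(String name) { this.name = name; }

        void eatWith(Spoon spoon, Diner spouse) {
            while (hungry) {
                if (spoon.owner != this) {
                    continue; // busy-wait until the spoon is handed to us
                }
                if (spouse.hungry) {
                    spoon.owner = spouse; // politely hand the spoon over and try again -- forever
                    continue;
                }
                hungry = false; // never reached: the spouse never stops being hungry
                System.out.println(name + " finally eats.");
            }
        }
    }

    public static void main(String[] args) {
        Diner alphonse = new Diner("Alphonse");
        Diner gaston = new Diner("Gaston");
        Spoon spoon = new Spoon(alphonse);
        new Thread(() -> alphonse.eatWith(spoon, gaston), "alphonse").start();
        new Thread(() -> gaston.eatWith(spoon, alphonse), "gaston").start();
        // Both threads stay RUNNABLE and burn CPU, yet neither ever eats:
        // activity without progress, which is what separates a livelock from a deadlock.
    }
}
```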
A touch of synchronization in the framework code that caused the livelock was all that was needed to avoid it.
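The actual patch lived inside the framework, so the following is only a sketch of the general shape of such a fix (names are ours, purely illustrative): make the shared map safe for concurrent access, either by serializing access to it or by swapping it for a concurrent implementation.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SharedMapFix {

    // Option 1: serialize every access to the existing HashMap behind a single lock.
    private final Map<String, Object> synchronizedCache =
            Collections.synchronizedMap(new HashMap<String, Object>());

    // Option 2: use a map built for concurrency -- usually the better choice
    // when many threads read and write it at the same time.
    private final Map<String, Object> concurrentCache = new ConcurrentHashMap<String, Object>();

    public void put(String key, Object value) {
        concurrentCache.put(key, value); // safe without any external locking
    }

    public Object get(String key) {
        return concurrentCache.get(key);
    }
}
```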
Here's a wonderful blog post on the HashMap race condition that walks through the livelock we encountered at every level. Be sure to check it out!