Performance monitoring of a cloud deployment

An application go-live took a wrong turn - a dozen of connected clients took a cloud system down, that should cope with thousands of them. As I was assigned to analyze the problems I discovered that this is a composite problem created by (amongst other things) careless usage of Hibernate (an ORM framework for easier, vendor-agnostic database access) and way to much complex logic built 'into' the database - killing both the database and the web applications host by using up its CPU. After using Hibernate statistics for analyzing the transactions that are being sent to the database, I started by rewriting some SQL queries that didn’t perform well. Some complex Java/Hibernate logic even had a kind of memory leak in it which could be resolved by writing a single database stored procedure. A connectivity checking method for the clients was moved from saving ping data in the database to an efficient in memory solution using an in-memory data grid called Hazelcast. Nevertheless the CPU load stayed above an acceptable threshold. Using an application performance management tool I could find another root cause in a flawed function that was being called periodically. Looking at the CPU load graph one could see that the CPU went about each three minutes from under 10 to almost 100 percent. The load was created by one single thread with about 80% of network IO and 20% code execution, where multiple database calls were made, responses sorted and analized. Not only was the sorting flawed, also the database calls were being performed for all rows in one table and not just for a small subset of the entries. After a rewrite the CPU usage dropped to a single-digit percentage which kept the system in a healthy state.
1 answer

Interesting solution. These problems are very hard to detect. I would suggest that you also use a profiler next time which is integrated in jave (VisualVM)