Size matters: why and how to measure your heap

I have had to deal with memory problems in Java applications a few times. A lot has been written about this already, but this time I ran into a slightly different issue that surprised some of my colleagues, so I decided to write about it here. Contrary to popular belief, a big JVM heap size is not always better when it comes to performance.

The problem

I came to the customer site to help them with the performance problems of a fairly large J2EE system (Web Service/Hibernate/MySQL). They had several customers running the system, but only the largest was experiencing problems: the application suddenly froze and stopped processing transactions. All sorts of hypotheses were discussed, but no one could say for sure what the problem was, and there was little data to work on.

Gathering intelligence

With a background as a researcher I know that little knowledge can be derived without data to work on. So the first step, often neglected, is to gather the data. This takes some time and effort. Depending on the problem at hand, different data is needed, but here are some general areas:

1. Application architecture
2. Application environment (servers, memory, CPUs, OS, etc.)
3. Application configuration
4. Response time logs
5. Memory usage logs
6. CPU usage logs
7. Garbage Collection (GC) logs (see the example start-up flags after this list)
8. Database response times (slow query logs)
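If GC logging is not already switched on, it can be enabled with a couple of JVM start-up flags. A minimal sketch, assuming a pre-Java 9 HotSpot JVM (the log path and jar name are just placeholders; Java 9 and later use the unified -Xlog syntax instead):

    # Pre-Java 9 HotSpot: write a timestamped, detailed GC log to a file
    java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
         -Xloggc:/var/log/myapp/gc.log -jar myapp.jar

    # Java 9 and later: unified logging replaces the Print* flags
    java -Xlog:gc*:file=/var/log/myapp/gc.log -jar myapp.jar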

I know there are some who think that knowledge can be gained by going into a cave and meditating, but I am not among them. Do your homework and the job will be done faster. Loads of time and money are wasted in endless discussions about what is causing a performance problem, and even more in testing hypotheses that could have been dismissed at an early stage had the data been available.
There is nothing wrong with hypotheses per se, but they should be verified or dismissed with the aid of data and testing, not through speculation alone. Don't throw them away, but construct a prioritized list with the most plausible ones on top. Then try to find ways of gathering data so that the hypotheses can be verified or rejected. Here is an example of such a list:

Hypothesis 1: Garbage Collection (GC) is the root of the problem.
  Argument: The problem seems correlated with GC activity according to the GC logs.
  Counterargument: Why hasn't the problem occurred before? Why not for the other customers?
  Probability: 1
  Action: Change the GC settings.

Hypothesis 2: A memory leak is causing the problem.
  Argument: The heap grows constantly until it reaches its maximum size.
  Counterargument: The memory leak should also occur for the other customers, since the codebase and usage patterns should be the same. In the end the application should also crash.
  Probability: 2
  Action: Analyze the heap in detail at the beginning and at the end of the server's life.

Hypothesis 3: The database is overloaded.
  Argument: A slow database can tie up resources in the server.
  Counterargument: A slow database should affect all server instances, not just one.
  Probability: 3
  Action: Gather performance data from the DB logs.

Performing the analysis

As soon as some data has been gathered, the interesting work of analysis can begin. Oh, what joy to let your mind work on juicy bits of data and from them deduce some useful information!
In this case, as soon as the GC and memory logs had been collected, one could see that the application was consuming a lot of heap space and that a lot of time was spent doing Garbage Collection. There was a high likelihood that memory usage was the root of the problem. Initially I thought it was a simple memory leak, but the problem was more complicated than that. The heap was constantly growing, but the application didn't crash, and memory was reclaimed by GC at regular intervals. By correlating the GC logs with the CPU logs one could also see that CPU usage peaked sharply when GC was performed with the heap at its maximum size. This caused the application to freeze completely and performance went down the drain. With an 8 GB heap a full GC takes a lot of time.
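As a side note: if you want a quick sanity check of how much time the JVM has actually spent in GC, without digging through log files, the garbage collector MXBeans can be queried from inside the application (or over JMX). A minimal sketch using the standard java.lang.management API:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcStats {
        public static void main(String[] args) {
            // Accumulated collection count and pause time per collector since JVM start
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: %d collections, %d ms total%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
            // Current heap usage versus the configured maximum
            Runtime rt = Runtime.getRuntime();
            System.out.printf("heap: %d MB used of %d MB max%n",
                    (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024),
                    rt.maxMemory() / (1024 * 1024));
        }
    }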
Why was such a big heap necessary? Wasn't the application supposed to be stateless? More data was needed to find out what was causing this. It was time to inspect the heap.
While there are free tools (such as VisualVM) available to analyse a Java heap, I've frequently found that a professional memory profiler can add quite a lot of value. VisualVM can quickly tell you that your heap, surprise, surprise, is mainly filled with Strings and Integers! A tool like YourKit adds a bit more intelligence by telling you which root objects are holding on to the most memory. With this information at hand, Hibernate was soon identified as the villain of the piece, together with the application's unfortunately mutating database queries (the HQL queries were very often not properly parametrized and therefore changed with the user data). This old version of Hibernate contained an unbounded QueryPlanCache which just grew and grew until the heap was full. Not a good idea.
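To make the query problem concrete: the plan cache is keyed on the HQL string itself, so concatenating user data into the query produces a new string, and a new cache entry, for every distinct value, while a bind parameter keeps the string constant. A sketch of the difference (the Order entity and its field are made up for the example):

    import java.util.List;
    import org.hibernate.Session;

    public class OrderDao {

        // BAD: user data is concatenated into the HQL string, so every distinct
        // customer name becomes a new query string and a new query-plan cache entry
        // (and an injection risk to boot).
        public List<?> findByCustomerUnparametrized(Session session, String customerName) {
            return session
                    .createQuery("from Order o where o.customerName = '" + customerName + "'")
                    .list();
        }

        // BETTER: a named bind parameter keeps the HQL string constant,
        // so a single cached query plan is reused for all values.
        public List<?> findByCustomer(Session session, String customerName) {
            return session
                    .createQuery("from Order o where o.customerName = :name")
                    .setParameter("name", customerName)
                    .list();
        }
    }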

Testing the hypothesis

Once a main hypothesis has been identified, it's time for verification before the fix goes into production. Having a good test suite, including performance tests, obviously helps here, as you can ensure that the tests that worked before continue to work after the change is made (in our case, an upgrade of Hibernate).
When it comes to performance testing, however, things can be trickier and the problem can be difficult to reproduce. This time only one customer was experiencing the problem, and then only under heavy load. In such a case A/B testing can be a possible solution: can the necessary change be deployed on only one server and the results compared with the others? This was in fact the route we took, as the lack of a comprehensive test suite made an upgrade of Hibernate too time-consuming in the short run. Instead we had to treat the symptom through changes in the memory and GC settings.

Solution

When Garbage Collection takes up too much time there are several ways of dealing with it. A very big heap can actually lead to more time being spent in GC: the more garbage there is, the longer it takes to collect it. A bigger heap is therefore not always a good thing. The 8 GB heap was completely unnecessary here, since there was no real need for large caches of any kind, including Hibernate's. Consequently we decreased the heap to half the size. To further reduce the problem of "stop the world" GC we also increased the share of time the JVM was allowed to spend doing GC, so that it could do more of the work continuously instead of postponing it until it was absolutely necessary because the heap was full (-XX:GCTimeRatio=19 instead of the default -XX:GCTimeRatio=99).
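In concrete terms the change amounted to start-up options along these lines (the exact values are of course specific to this installation, and the jar name is a placeholder):

    # Before: 8 GB heap, default GC time target (at most ~1% of the time in GC)
    java -Xmx8g -jar server.jar

    # After: half the heap, and allow up to ~5% of the time to be spent in GC
    # (GCTimeRatio=19 gives a target of 1/(1+19) = 5%)
    java -Xmx4g -XX:GCTimeRatio=19 -jar server.jar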
The long-term solution, however, is to upgrade Hibernate in the next release.
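As a further side note, newer Hibernate versions also let you put an upper bound on the query plan cache through configuration. If my memory serves, the relevant settings in Hibernate 5.x look roughly like this (check the documentation for your exact version, and adjust the values to your needs):

    # hibernate.properties (property names as in Hibernate 5.x)
    hibernate.query.plan_cache_max_size=2048
    hibernate.query.plan_parameter_metadata_max_size=128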

Conclusions

When diagnosing your performance problems the size of your heap can definitely matter, and bigger is not always better: too big and your application may grind to a halt while the whole heap is collected. Measuring is also the key to finding out what your problem really is. Are there any downsides to this scientific method? In my case there certainly was one: the problem was solved too fast and I'm now "in between assignments"!
