Garbage collection in Java is one of it’s features that makes it easy to write bug-free code. Sometimes it is difficult to tune garbage collection (GC) for your application. Many times you encounter issues with GC pauses. This blog post describes one of our experiences with GC tuning.
One of the services of Sumo’s log analytics platform needs to maintains a cache of millions of byte-arrays with an average size of 20KB. Maintaining a cache of more than 50GB in JVM Heap is a garbage collection hell in itself. Cache of that size can cause close to a minute of GC pause with Parallel Collector. Garbage First Garbage Collector (G1GC) is supposed to fix that. If cache size is more than 90% of JVM heap, it results in too frequent GC. Cache object being a simple byte-array, we are able to store them in JVM off-heap DirectByteBuffer and access it from our code without any deserialization. This reduced our JVM heap size requirement to 2.5GB for supporting 50GB off-heap cache. The response time requirement for this service is sub-second, so as to be useful in a query. So we allocated 4GB heap size to JVM and used G1GC with default max pause of 200 ms.
With this configuration we found that the service was not responding for about 2 seconds every 2 minutes. Though we were not seeing any Full GC, the reason was stop-the-world pause of G1GC
The rest of the blog post uses simple test code that simulates our cache maintenance code. Using the test runs we show the behavior we have seen and the effect of our solution. It keeps a cache of 10 million DirectByteBuffer object. In each iteration it updates 1024 of them creating some garbage objects that needs cleanup. To reduce machine RAM needed for these tests, DirectByteBuffer size is set to only 4 Bytes. The tests were done on i2.xlarge instance of Amazon Web Service (AWS) which has 4 CPU cores and 30.5GB RAM. Following is the Java and Ubuntu version used.
[jitu-1]:jitendra@~$ java -version
java version "1.7.0_91"
OpenJDK Runtime Environment (IcedTea 2.6.3) (7u91-2.6.3-0ubuntu0.14.04.1)
OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode)
JVM heap size while running these tests was set to 4GB as we have in our production.
java -cp src/ -Xmx4096m -XX:MaxDirectMemorySize=6g -XX:+UseCompressedOops -XX:+UseFastAccessorMethods -XX:+AggressiveOpts -server -Xloggc:gc.log -XX:GCLogFileSize=10M -XX:NumberOfGCLogFiles=10 -XX:+UseGCLogFileRotation -XX:+PrintTenuringDistribution -XX:+PrintGCDetails -verbose:gc -XX:+PrintReferenceGC -XX:+PrintGCDateStamps -XX:+UseG1GC Main
ref-proc was the issue we ran test using
-XX:+PrintReferenceGC flag. Following is what GC logs looks like using G1GC with default parameters.
[GC remark 2016-01-06T01:08:45.822-0800: 304.485:
[GC ref-proc2016-01-06T01:08:45.822-0800: 304.485:
[SoftReference, 0 refs, 0.0000190 secs]2016-01-06T01:08:45.822-0800: 304.485:
[WeakReference, 3 refs, 0.0000140 secs]2016-01-06T01:08:45.822-0800: 304.485: [FinalReference, 4 refs, 0.0000130 secs]2016-01-06T01:08:45.822-0800: 304.485:
[PhantomReference, 0 refs, 10482437 refs, 1.5939440 secs]2016-01-06T01:08:47.416-0800:306.079:
[JNI Weak Reference, 0.0000200 secs],
1.5940730 secs], 1.6026360 secs]
[Times: user=1.13 sys=0.50, real=1.60 secs]
In our test code this pause happens eight times in two minutes. You can see that remarking of PhantomReference took 1.6 seconds. Along with off-heap byte-buffer each DirectByteBuffer has a reference to Cleaner which extends PhantomReference. When a DirectByteBuffer is garbage collected, it’s Cleaner cleans up off-heap byte array.
-XX:+ParallelRefProcEnabled flag decreased the pause time of PhantomReference to 0.8 seconds. Still there were seven pauses in two minutes. These pause time is not acceptable to our requirements.
It was clear that G1GC is not the way to go with OpenJDK 1.7.0_91. So we decided to change GC algorithm to Parallel Collector (the default for server JVM). Following is the comparison of G1GC vs G1GC with parallel-ref-proc enabled vs Parallel Collector.
With Parallel GC, we never see Full-GC during entire cache maintenance duration. As young GC is quick, average GC pause time was only 40 ms.
We saw similar results in production. With Parallel GC, the JVM heap size of 4GB was enough to not trigger any Full-GC, and the average Young GC pause was close to 35 ms. We do have explicit Full-GC every hour because of Java RMI; It takes 8 seconds to finish on machines with 4 CPU cores, but is independent of Phantom References. This is huge improvement over G1GC we used earlier.
[Full GC2016-01-06T01:09:48.762-0800: 23.242:
[SoftReference, 0 refs, 0.0000510 secs]2016-01-06T01:09:48.762-0800: 23.242:
[WeakReference, 6 refs, 0.0000090 secs]2016-01-06T01:09:48.762-0800: 23.242:
[FinalReference, 13 refs, 0.0000070 secs]2016-01-06T01:09:48.762-0800: 23.242:
[PhantomReference, 0 refs, 10486164 refs, 0.1811840 secs]2016-01-06T01:09:48.944-0800: 23.423:
[JNI Weak Reference, 0.0000140 secs] [PSYoungGen: 465896K->136577K(932352K)] [ParOldGen: 1536297K->1830678K(2796032K)] 2002193K->1967256K(3728384K)
[Times: user=24.91 sys=0.41, real=7.17 secs]
This blog post shows that the remark phase of G1GC can introduce multi-second, stop-the-world pauses. It happens when JVM has millions of DirectByteBuffer objects. Each DirectByteBuffer has a PhantomReference, and G1GC takes too much time in ref-proc of PhantomReference. This remark phase is too frequent; it hits application performance. With Parallel GC, Full-GC would lead to higher duration of stop-the-world pauses. Right size JVM heap can avoid it. We were able to reduce total stop-the-world pauses from 60 seconds per hour to eight seconds.
For a quick benchmarking you can use a test code that simulates the production behavior using production configuration to do any final garbage collection tuning. For your application, choose the right garbage collection algorithms and JVM heap limit and leave the rest to Java GC ergonomics.
Any complex object that is stored off-heap in a DirectByteBuffer needs to be deserialized first and brought to JVM heap to be useful. This creates lot of garbage and GC.
Link to test code
Tool used GCViewer