Wednesday, May 22, 2013

Java GC tuning for High Frequency Trading apps

I have been interested in Java performance issues for some time. This time we look not so much at Java code performance as at the performance of the Garbage Collection process. I was inspired by a lecture at the Warsaw JUG about HotSpot in low-latency Java. The lecturer, Wojciech Kudla, prepared a small application that simulates an HFT system from the GC point of view. All the source code, together with the great GC Viewer (a plugin for JVisualVM), is available on his GitHub: heaptrasher.

In a nutshell, heaptrasher is an application that generates a lot of garbage for the GC to operate on. In addition, it contains code that builds a histogram of latency statistics. The statistics data can be collected in two different ways: the first one (array) allocates a lot of memory in the Old Generation space, while the second (direct) does not. As I found out later, this has a considerable impact on GC performance.
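
To make the setup more concrete, below is a minimal sketch of a heaptrasher-style workload (my own illustration, not the actual heaptrasher code): it produces short-lived garbage in a tight loop, times every iteration and stores the samples in a pre-allocated array, which, like the array mode described above, is exactly the kind of long-lived object that ends up in the Old Generation.

    import java.util.Arrays;

    // Minimal sketch of a heaptrasher-style workload (not the real heaptrasher code):
    // generate short-lived garbage in a tight loop and record per-iteration latency.
    public class GarbagePressure {

        public static void main(String[] args) {
            final int iterations = 10_000_000;
            // Pre-allocated sample store: a long[] this big lives for the whole run,
            // so it ends up in the Old Generation -- the effect of the "array" mode.
            long[] latencies = new long[iterations];
            long blackhole = 0;

            for (int i = 0; i < iterations; i++) {
                long start = System.nanoTime();
                // Simulated message processing: short-lived allocations that die
                // almost immediately and keep the Young Generation collector busy.
                byte[] garbage = new byte[256];
                garbage[0] = (byte) i;
                blackhole += garbage[0];
                latencies[i] = System.nanoTime() - start;
            }

            Arrays.sort(latencies);
            for (double p : new double[] {50.0, 90.0, 99.0, 99.9, 99.99}) {
                int index = (int) Math.min(iterations - 1L, Math.round(iterations * p / 100.0));
                System.out.printf("%.4f%% of message latency is less than: %d nanos%n",
                        p, latencies[index]);
            }
            System.out.println("checksum (ignore): " + blackhole);
        }
    }

The real heaptrasher does more (it simulates message processing and supports both collection modes), but even a toy version like this is enough to keep the Young Generation collector busy.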

All charts below come from JVisualVM with the aforementioned GC Viewer and VisualGC plugins.

Test platform

  • CPU: i7-3612QM
  • RAM: 8GB
  • OS: Windows 7 Ultimate
  • JDK: 1.7.0_u21

A common feature of all the tests is the use of the parallel collector for Young Generation collections. The collector chosen for the Old Generation doesn't matter, because in this scenario an Old Generation collection never happens.
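
To verify that no Old Generation collection happens during a run (my own check, not something from the talk), the standard JDK 7 GC logging switches can be added to any of the command lines below, e.g.:

java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xmx4g -Xms4g -XX:+UseParallelGC Main array

With the parallel collector every young collection is logged as a PSYoungGen entry, while an Old Generation collection would show up as a Full GC line.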

First test

Description

4GB heap, array as statistics data holder

Command line

java -XX:+UnlockDiagnosticVMOptions -Xmx4g -Xms4g -XX:+UseParallelGC Main array

Garbage Collector pause durations

Garbage collection times

Message latency

 50,0000% of message latency is less than: 1466 nanos
 75,0000% of message latency is less than: 2443 nanos
 90,0000% of message latency is less than: 3421 nanos
 99,0000% of message latency is less than: 4888 nanos
 99,9000% of message latency is less than: 11241 nanos
 99,9900% of message latency is less than: 19549 nanos
 99,9990% of message latency is less than: 75263 nanos
 99,9999% of message latency is less than: 3480174 nanos
 

Summary

6,5 ms pauses? That isn't low latency :(. The graph is also very irregular. 99,9% of the messages were processed in less than 11241 nanos.

Second test

Description

4GB heap, direct statistics data collection

Command line

java -XX:+UnlockDiagnosticVMOptions -Xmx4g -Xms4g -XX:+UseParallelGC Main direct

Garbage Collector pause durations

Garbage collection times

Message latency

 50,0000% of message latency is less than: 1466 nanos
 75,0000% of message latency is less than: 2443 nanos
 90,0000% of message latency is less than: 2933 nanos
 99,0000% of message latency is less than: 4887 nanos
 99,9000% of message latency is less than: 11240 nanos
 99,9900% of message latency is less than: 19060 nanos
 99,9990% of message latency is less than: 82104 nanos
 99,9999% of message latency is less than: 762892 nanos
 

Summary

Hmmm, much better, isn't it? It appears that the Old Generation is the major enemy of low-latency apps... The graph is as irregular as in the first case. The change did not affect the processing time of 99,9% of the messages.
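
A quick way to see the difference between the two modes for yourself (my suggestion, not part of the original experiment) is to watch Old Generation occupancy with the jstat tool that ships with the JDK while the test is running:

jstat -gcutil <pid> 1s

In array mode the O column (Old Generation utilization) should climb as the statistics array gets promoted; in direct mode it should stay close to zero.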

Third test

Description

4GB heap, direct statistics data collection, 3GB NewSize

Command line

java -XX:+UnlockDiagnosticVMOptions -Xmx4g -Xms4g -XX:MaxNewSize=3g -XX:NewSize=3g -XX:+UseParallelGC Main direct

Garbage Collector pause durations

Garbage collection times

Message latency

 50,0000% of message latency is less than: 1466 nanos
 75,0000% of message latency is less than: 2443 nanos
 90,0000% of message latency is less than: 3421 nanos
 99,0000% of message latency is less than: 4888 nanos
 99,9000% of message latency is less than: 11241 nanos
 99,9900% of message latency is less than: 20038 nanos
 99,9990% of message latency is less than: 69887 nanos
 99,9999% of message latency is less than: 652930 nanos
 

Summary

In general the change did not affect the duration of a single collection, but the total number of collections is more than 2 times lower. It is interesting that enlarging the Young Generation does not affect the duration of a single collection. That is because the app generates mostly short-lived objects, so the GC has a relatively small graph of live objects to traverse regardless of how big the Young Generation is. Once again, 99,9% of the messages are processed with the same latency.
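
The size of the drop is roughly what simple arithmetic predicts, assuming a roughly constant allocation rate and the default NewRatio of 2 (so the Young Generation starts at about a third of the 4 GB heap):

default young size ≈ 4 GB / 3 ≈ 1,33 GB
tuned young size   = 3 GB (NewSize = MaxNewSize = 3g)
size ratio         ≈ 3 / 1,33 ≈ 2,25
collections ratio  ≈ 126 / 57 ≈ 2,2 (126 collections in the first two tests vs. 57 in the later tests that keep the 3 GB NewSize; see the summary table at the end)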

Fourth test

Description

4GB heap, direct statistics data collection, 3GB NewSize, number_of_gc_threads=cpu_core_number-1

Command line

java -XX:+UnlockDiagnosticVMOptions -Xmx4g -Xms4g -XX:MaxNewSize=3g -XX:NewSize=3g -XX:ParallelGCThreads=7 -XX:+UseGCTaskAffinity -XX:+BindGCTaskThreadsToCPUs Main direct

Garbage Collector pause durations

Garbage collection times

Message latency

 50,0000% of message latency is less than: 1466 nanos
 75,0000% of message latency is less than: 2443 nanos
 90,0000% of message latency is less than: 3421 nanos
 99,0000% of message latency is less than: 5375 nanos
 99,9000% of message latency is less than: 11729 nanos
 99,9900% of message latency is less than: 21015 nanos
 99,9990% of message latency is less than: 70865 nanos
 99,9999% of message latency is less than: 612855 nanos
 

Summary

There isn't much change in GC pause times, though I must admit the chart is more regular. We also have, for the first time in a while, a change in the 99,9% message processing time.

Fifth test

Description

4GB heap, direct statistics data collection, 3GB NewSize, number_of_gc_threads=cpu_core_number-1 with ParGCCardsPerStrideChunk

Command line

java -XX:+UnlockDiagnosticVMOptions -Xmx4g -Xms4g -XX:MaxNewSize=3g -XX:NewSize=3g -XX:ParallelGCThreads=7 -XX:+UseGCTaskAffinity -XX:+BindGCTaskThreadsToCPUs -XX:ParGCCardsPerStrideChunk=32768 Main direct

Garbage Collector pause durations

Garbage collection times

Message latency

 50,0000% of message latency is less than: 1466 nanos
 75,0000% of message latency is less than: 2443 nanos
 90,0000% of message latency is less than: 3421 nanos
 99,0000% of message latency is less than: 5375 nanos
 99,9000% of message latency is less than: 10752 nanos
 99,9900% of message latency is less than: 21503 nanos
 99,9990% of message latency is less than: 76729 nanos
 99,9999% of message latency is less than: 629960 nanos
 

Summary

It looks almost the same as last time, but we get noticeably better 99,9% message processing time.
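
For context (my own explanation, not a measurement from this post): ParGCCardsPerStrideChunk controls how many card-table entries make up one unit of work when GC worker threads scan the Old Generation card table for references into the Young Generation; the switch is best known from CMS/ParNew tuning. Assuming the usual 512-byte card granularity and the default value of 256:

default stride: 256 cards x 512 B   = 128 KB of Old Generation per work unit
tuned stride:   32768 cards x 512 B = 16 MB of Old Generation per work unit

Larger strides mean fewer, bigger chunks of work to hand out, and therefore less synchronization between the GC worker threads.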

Summary

During the presentation the lecturer used a Linux-based platform. On Linux the individual switches seemed to have a bigger impact on collection times. A little summary here:

Test number | Collections number | Collections time | Avg. collection time | 99,90% latency less than | 99,99% latency less than
first       | 126                | 706,342 ms       | 5,60 ms              | 11241 nanos              | 19549 nanos
second      | 126                | 92,002 ms        | 0,73 ms              | 11240 nanos              | 19560 nanos
fourth      | 57                 | 53,738 ms        | 0,94 ms              | 11729 nanos              | 21015 nanos
fifth       | 57                 | 46,508 ms        | 0,81 ms              | 10752 nanos              | 21503 nanos

More info: Warsaw JUG meeting

3 comments :

  1. Does your app intend to run for any extended period of time? If so, ignoring old gen and calling it GC tuning is like getting a girl's phone number and saying you are engaged.

    If your app isn't going to run for a long time, just give it a heap so humongous that it never needs to do any GC (young or otherwise) and you will have no GC pauses.

  2. JVM Architecture Specification Basic The Heap Area Introduction, teach about the JVM Heap Area in details
    http://www.youtube.com/watch?v=c-A7ZzxjWUI

    JVM Architecture Specification Basic The Method Area explained, teach about the JVM method area
    http://www.youtube.com/watch?v=a5GzF2fSSCE

  3. This is excellent information. I was using Oracle JRockit in my trading app; it has very nice GC behavior with low pauses. Unfortunately, it crashes a lot. I am testing the JVM switches that you shared.
