Tricks of the Trade: Tuning JVM Memory for Large-scale Services

March 24, 2022

Key takeaways

Through our experience maintaining and improving query support for Uber’s data platform, we learned many critical lessons about what it takes to make JVM memory and GC tuning successful. We consolidated these learnings into the more granular takeaways below for improving JVM and GC performance at Uber’s scale:

  1. Discern if JVM memory tuning is needed. JVM memory tuning is an effective way to improve performance, throughput, and reliability for large-scale services like HDFS NameNode, Hive Server2, and the Presto coordinator. When GC pauses frequently exceed 100 milliseconds, performance suffers and GC tuning is usually needed. In our GC tuning work, we saw HDFS throughput increase by ~50 percent, HDFS latency decrease by ~20 percent, and Presto’s weekly error rate drop from 2.5 percent to 0.73 percent.
  2. Choose the right total heap size. The total JVM heap size determines how often and how long the JVM spends collecting garbage. The size actually needed is shown as the maximum memory footprint in the verbose GC log. An incorrect heap size can cause poor GC performance and even trigger an out-of-memory error.
  3. Choose the right Young Generation heap size. The Young Generation size should be determined after the total heap size is chosen; once the heap size is set, we recommend benchmarking the relevant performance metric against different Young Generation sizes to find the best setting. Typically, the Young Generation should be 20 to 50 percent of the total heap size, but if a service has a high object-creation rate, the Young Generation size should be increased.
  4. Determine the most impactful GC parameters. There are many GC parameters to tune, but usually changing one or two of them has a significant impact while the rest make a negligible difference. For example, we found that changing the Young Generation size and ParGCCardsPerStrideChunk improved performance significantly, but we did not see much difference when changing TLABSize and ConcGCThreads. In tuning Presto’s GC, we found that the String Deduplication setting dominated the performance impact. Flag-level sketches of these knobs follow this list.
  5. Test next-generation GC algorithms. In a large-scale data infrastructure, critical services usually have very large JVM heaps, ranging from several hundred gigabytes to terabytes. Traditional GC algorithms have trouble handling this scale and suffer long GC pause times. Next-generation GC algorithms, such as C4, ZGC, and Shenandoah, show promising results. In our case, we saw that C4 reduced latency (P90 ~17 ms) compared with CMS (P90 ~24 ms).
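
To make the sizing and logging takeaways concrete, here is a minimal sketch of a JDK 8 launch line for a CMS-collected service such as HDFS NameNode. The heap sizes, stride value, log path, and jar name are illustrative placeholders rather than our production settings; the right values can only be found by benchmarking your own workload.

  # Illustrative JDK 8 flags (all values are placeholders, to be benchmarked):
  #   -Xms/-Xmx                 total heap size, fixed so the heap does not resize under load (takeaway 2)
  #   -Xmn                      Young Generation size, here roughly a third of the heap (takeaway 3)
  #   ParGCCardsPerStrideChunk  diagnostic CMS/ParNew card-scanning option (takeaway 4)
  #   -verbose:gc ...           verbose GC log for reading pause times and the real memory footprint (takeaways 1-2)
  java \
    -Xms180g -Xmx180g \
    -Xmn60g \
    -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
    -XX:+UnlockDiagnosticVMOptions -XX:ParGCCardsPerStrideChunk=32768 \
    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/service-gc.log \
    -jar service.jar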
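
Switching collectors, per takeaway 5, is also a flag-level change on OpenJDK-based services. The lines below are a sketch assuming a recent JDK where ZGC and Shenandoah are production features, and a G1-collected service for the string deduplication knob; heap sizes and jar names are again placeholders. C4 is the collector inside Azul’s Zing/Prime JVM, so it is adopted by running on that JVM rather than through a HotSpot flag.

  # Assumed G1-collected service, enabling the String Deduplication setting from takeaway 4:
  java -Xms60g -Xmx60g -XX:+UseG1GC -XX:+UseStringDeduplication -jar coordinator.jar

  # Next-generation low-pause collectors from takeaway 5, on a recent JDK where available:
  java -Xms300g -Xmx300g -XX:+UseZGC -jar service.jar
  java -Xms300g -Xmx300g -XX:+UseShenandoahGC -jar service.jar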

Moving forward

While we have made great progress over the last two years in improving the performance, throughput, and reliability of a variety of large-scale services in our data infrastructure by tuning JVM garbage collection, there is always more work to be done.

For instance, we have begun integrating C4 GC into our HDFS NameNode service in production. Given the encouraging performance improvements in the staging environment described above, we believe C4 will help prevent NameNode bottlenecks and reduce request latency.

GC tuning in distributed applications, particularly in Apache Spark, is another area we want to examine in the future. For example, the ingestion pipelines used in our data platform are built on top of Spark, and our Hive service also relies on Spark. JVM Profiler, an open source tool developed at Uber, can help us analyze GC behavior in Spark applications so we can improve their performance.
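
As a sketch of what that could look like, JVM Profiler is attached to the Spark driver and executors as a Java agent. The command below shows one common way to distribute and attach such an agent in Spark; the jar location, main class, and application jar are placeholders for illustration only.

  # Hypothetical spark-submit invocation attaching JVM Profiler as a -javaagent:
  spark-submit \
    --deploy-mode cluster \
    --class com.example.IngestionJob \
    --conf spark.jars=hdfs:///libs/jvm-profiler-1.0.0.jar \
    --conf spark.driver.extraJavaOptions=-javaagent:jvm-profiler-1.0.0.jar \
    --conf spark.executor.extraJavaOptions=-javaagent:jvm-profiler-1.0.0.jar \
    ingestion-job.jar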
