Thursday, March 24, 2022

Tricks of the Trade: Tuning JVM Memory for Large-scale Services

 

Running queries on Uber’s data platform lets us make data-driven decisions at every level, from forecasting rider demand during high traffic events to identifying and addressing bottlenecks in the driver sign-up process. Our Apache Hadoop-based data platform ingests hundreds of petabytes of analytical data with minimal latency and stores it in a data lake built on top of the Hadoop Distributed File System (HDFS).

Our data platform leverages several open source projects (Apache Hive, Apache Presto, and Apache Spark) for both interactive and long-running queries, serving the myriad needs of different teams at Uber. All of these services are built in Java or Scala and run on the open source Java Virtual Machine (JVM).

Uber’s growth over the last few years exponentially increased both the volume of data and the associated access loads required to process it, resulting in significantly higher memory consumption by these services. This increased memory consumption exposed a variety of issues, including long garbage collection (GC) pauses, memory corruption, out-of-memory (OOM) errors, and memory leaks.

Refining this core area of our data platform ensures that decision-makers within Uber get actionable business intelligence in a timely manner, letting us deliver the best possible services for our users, whether it’s connecting riders with drivers, restaurants with delivery people, or freight shippers with carriers. 

Preserving the reliability and performance of our internal data services required tuning GC parameters and memory sizes, and reducing the rate at which the system generated Java objects. Along the way, this work helped us develop best practices for tuning the JVM at our scale, which we hope others in the community will find useful.

What is JVM garbage collection?

The JVM runs on a local machine and functions as an operating system for the Java programs written to execute on it: it translates the instructions (bytecode) of the programs it runs into instructions and commands for the local operating system.

The JVM garbage collection process looks at heap memory, identifies which objects are in use and which are not, and deletes the unused objects to reclaim memory that can be leveraged for other purposes. The JVM heap consists of smaller parts or generations: Young Generation, Old Generation, and Permanent Generation. 

The Young Generation is where all new objects are allocated and aged, meaning their time in existence is monitored. When the Young Generation fills up, using its entire allocated memory, a minor garbage collection occurs. All minor garbage collections are “Stop the World” events, meaning that the JVM stops all application threads until the operation completes. 
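
As a concrete, simplified illustration (the class name and allocation sizes below are arbitrary, not code from our services), the sketch allocates a burst of short-lived objects and then reads the collection counts and cumulative collection times the JVM exposes through the standard java.lang.management API. The collector names reported depend on which GC the JVM is running.

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class MinorGcDemo {
    public static void main(String[] args) {
        // Allocate many short-lived objects; once the Young Generation fills up,
        // the JVM pauses the application threads and runs a minor collection.
        for (int i = 0; i < 1_000_000; i++) {
            byte[] shortLived = new byte[1024]; // unreachable after each iteration
        }
        // Each active collector reports how often it ran and how long it spent collecting.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}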

The Old Generation is used to store long-surviving objects. Typically, an age threshold is set for Young Generation objects, and when an object reaches that age it is moved to the Old Generation. Eventually, the Old Generation needs a major garbage collection, which can be either a full or partial “Stop the World” event depending on the type of garbage collection configured in the JVM program arguments.
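
The number of minor collections an object must survive before promotion is controlled by the JVM’s tenuring threshold (tunable via -XX:MaxTenuringThreshold). The hypothetical sketch below contrasts the two allocation patterns: objects kept reachable long enough to be promoted versus objects that die young.

import java.util.ArrayList;
import java.util.List;

public class LifetimesDemo {
    // Objects reachable from this list survive every minor collection,
    // so after enough collections the JVM promotes them to the Old Generation.
    private static final List<byte[]> longLived = new ArrayList<>();

    public static void main(String[] args) {
        for (int i = 0; i < 100_000; i++) {
            longLived.add(new byte[128]);   // long-lived: eventually promoted
            byte[] scratch = new byte[128]; // short-lived: reclaimed by a minor GC
        }
    }
}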

The Permanent Generation stores class metadata and interned character strings. Despite its name, it is not a place where objects that survived the Old Generation stay permanently. When this area fills up, a collection is triggered, which still counts as a major GC.
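
(In Java 8 and later the Permanent Generation has been replaced by Metaspace, and interned strings have lived on the main heap since Java 7.) String interning, which populates that pool, can be seen in a small sketch like the following; the string value is arbitrary.

public class InternDemo {
    public static void main(String[] args) {
        String constructed = new String("hdfs://datalake"); // ordinary heap object
        String interned = constructed.intern();             // canonical copy from the string pool
        System.out.println(constructed == interned);        // false: distinct heap object
        System.out.println(interned == "hdfs://datalake");  // true: literals are interned
    }
}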

The JVM garbage collectors include traditional ones like Serial GC, Parallel GC, Concurrent Mark Sweep (CMS) GC, and Garbage First Garbage Collector (G1 GC), and several new ones like Zing/C4, Shenandoah, and ZGC.

The process of collecting garbage typically includes marking, sweeping, and compacting phases, though there are exceptions for different collectors. Serial GC is a rudimentary garbage collector that stops the application for the entire collection and collects garbage in a single thread. Parallel GC performs the same steps in a multi-threaded manner, increasing collection throughput. CMS GC attempts to minimize pauses by doing most of the garbage collection work concurrently with the application threads. The G1 GC collector is a parallel, concurrent, and incrementally compacting low-pause garbage collector.
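
The collector is chosen when the JVM starts, for example with flags such as -XX:+UseSerialGC, -XX:+UseParallelGC, -XX:+UseConcMarkSweepGC, or -XX:+UseG1GC. As a simple sketch (class name arbitrary), the snippet below prints which collectors the running JVM is using and which memory pools each one manages; the names differ by algorithm.

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class ActiveCollectors {
    public static void main(String[] args) {
        // The JVM exposes one MXBean per active collector, e.g.,
        // "PS Scavenge"/"PS MarkSweep" under Parallel GC or
        // "G1 Young Generation"/"G1 Old Generation" under G1.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + " -> "
                    + String.join(", ", gc.getMemoryPoolNames()));
        }
    }
}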

With these traditional garbage collectors, the GC pause time usually increases as the JVM heap size is increased. This problem is more severe in large-scale services because they usually need a large heap, e.g., several hundred gigabytes.

Newer garbage collectors like Zing/C4, Shenandoah, and ZGC try to solve this problem: they minimize pauses by running the collection phases concurrently with the application and more incrementally than traditional collectors. Their pause times also do not grow with heap size, which is desirable for large-scale services within a data infrastructure.

In addition to the choice of garbage collector, the object creation rate also impacts the frequency and duration of GC pauses. Here, the object creation rate (or allocation rate) measures how many bytes of objects are allocated per second. The higher the rate, the faster newly created objects fill up the heap, triggering GC more frequently and causing longer pauses.
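
As a minimal, hypothetical sketch (not code from our services), the two methods below do the same work with very different allocation rates: the first creates a new buffer on every iteration, while the second allocates one buffer and reuses it.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class StreamCopy {
    // Higher allocation rate: a new buffer per iteration becomes garbage immediately,
    // filling the Young Generation faster and triggering more frequent minor GCs.
    static void copyNaive(InputStream in, OutputStream out) throws IOException {
        while (true) {
            byte[] buffer = new byte[8192];
            int n = in.read(buffer);
            if (n == -1) break;
            out.write(buffer, 0, n);
        }
    }

    // Lower allocation rate: one buffer is allocated once and reused for the whole copy.
    static void copyReusingBuffer(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }
}

In allocation-heavy loops like this, the reuse pattern reduces Young Generation churn without changing the program’s behavior.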

In our experience with garbage collection on JVM, we’ve identified five key takeaways for other practitioners to leverage when working with such systems at Uber-scale.
