Wednesday, January 5, 2022

Google search: bigtable field promotion -> The right Cloud Bigtable index makes all the difference.


Introduction

Learning more about Google Bigtable is a great investment for the first few days of 2022. I came across the blog, and I like to read all of its technical articles.

Here is the article.

The right Cloud Bigtable index makes all the difference.


Finding the problem


Performance-wise, this simple design was creating a massive bottleneck for them.

While looking at some logging data, we quickly realized that our scans for data points near Portland, Oregon (USA) ended up also scanning and filtering through data points in Milan & Venice Italy. After some digging, researching, and asking around, we realized that there were three problems working in unison to cause this performance issue:

  1. These cities are very close to each other in terms of latitude.
  2. Bigtable stores entries in its shards lexicographically by their index.
  3. Bigtable doesn’t have explicit support for secondary indexing.

Let’s break this down a little more.

Bigtable stores entries in its shards lexicographically by their index, meaning that entries with lexicographically similar keys will be closer to each other in the tablets. Queries that access a single row, or a contiguous range of rows, execute quickly and efficiently. However, if you need something like a secondary index (like the “LONG” in a “LAT/LONG” pair), then your queries end up as a full table scan, which is far slower. A full table scan is exactly what it sounds like: every row of your table is examined in turn, so as your table size grows, your performance suffers more.
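To make the difference concrete, here is a minimal sketch using the Python client for Cloud Bigtable. The project, instance, and table names and the "lat#long#timestamp" key layout are my own illustrative assumptions, not details from the article:

```python
# Minimal sketch, assuming a table whose row keys look like "lat#long#...".
# All names here are illustrative.
from google.cloud import bigtable
from google.cloud.bigtable import row_filters

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("geo-points")

# Fast: a contiguous row-key range (all points around latitude 45.5).
# Bigtable only reads the tablets that hold this key range.
for row in table.read_rows(start_key=b"45.40#", end_key=b"45.60#"):
    print(row.row_key)

# Slow: matching on longitude, which is not a key prefix, gives Bigtable
# no range to seek to, so it must scan and filter every row in the table.
full_scan = table.read_rows(
    filter_=row_filters.RowKeyRegexFilter(rb"^[^#]+#-122\..*")
)
for row in full_scan:
    print(row.row_key)
```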

This is where point #1 above comes into play. Portland is very close to two other large cities in terms of latitude:



The image above shows Portland, Milan, and Venice overlaid on a map; you can see the full interactive map here.

As such, their rows tend to be stored very close to each other, despite the cities being geographically very far apart. This means a full table scan involves walking through all the table entries across the globe that share a latitude value, a much larger scan than we were hoping for.
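You can see this without touching Bigtable at all. Sorting naive keys lexicographically, the way Bigtable stores them, puts Portland right next to the two Italian cities (the coordinates are rounded, and the "lat#long" key layout is an assumption):

```python
# Standalone demo: approximate coordinates; the "lat#long" key layout
# is an illustrative assumption.
points = {
    "Portland": (45.52, -122.68),
    "Milan":    (45.46, 9.19),
    "Venice":   (45.44, 12.34),
}

# Naive row keys: zero-padded latitude first, then longitude.
keys = {f"{lat:06.2f}#{lon:07.2f}": city for city, (lat, lon) in points.items()}

# Bigtable stores rows sorted lexicographically by key, so a scan for
# points near Portland's latitude walks through Venice and Milan too.
for key in sorted(keys):
    print(key, "->", keys[key])
```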

Building the right index

This continues to highlight an important aspect of Bigtable schema design: choosing a row key that facilitates common queries is of paramount importance to the overall performance of the system.

For Urban Fitness Forge, that meant trying to create an index such that entities close to each other physically were sorted close to each other in table space, in order to reduce the scan size.

Thankfully, this has been solved before. A space-filling curve is a way to index a 2D space such that adjacent index values are also close to each other in physical space (I like this explanation if you are looking for more).

For example, if we stored points in a 2D set using a raster index (index = Y*WIDTH+X), we’d end up with something that looked like this:

[Image: a 2D grid of points numbered in raster order]

Meanwhile, a space-filling curve does this:

[Image: the same points numbered along a Hilbert space-filling curve]

You can see that in the first example (raster order), while 6 looks physically close to 2, it’s a whole row away from it in terms of index. In the space-filling curve, however, the index values closest to 6 are physically close as well.
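Here is a small, self-contained sketch of both orderings. The Hilbert conversion below is the standard iterative algorithm; the 4x4 grid is my assumption, chosen to match the "2 vs 6" example:

```python
# Compare a raster index with a Hilbert space-filling-curve index on a
# 4x4 grid. Self-contained; hilbert_index is the standard iterative
# Hilbert-curve conversion.

def raster_index(x: int, y: int, width: int) -> int:
    """Row-major index: vertical neighbors are a whole row apart."""
    return y * width + x

def hilbert_index(n: int, x: int, y: int) -> int:
    """Distance along a Hilbert curve covering an n x n grid (n a power of 2)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the curve stays contiguous.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Cells 2 and 6 in raster order are (2, 0) and (2, 1): physically
# adjacent, but four index positions apart.
a, b = (2, 0), (2, 1)
print(abs(raster_index(*a, width=4) - raster_index(*b, width=4)))  # 4
print(abs(hilbert_index(4, *a) - hilbert_index(4, *b)))            # 1
```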

The result is a modified index for entities that uses a space-filling curve integrated with a timestamp value to create a single, integer-based index.

[Image: entity row key combining the space-filling-curve index with a timestamp]
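The article doesn’t spell out the exact layout, but a minimal sketch of such a combined integer index might look like this; the bit widths and byte encoding are assumptions:

```python
import time

# Illustrative sketch only: pack the space-filling-curve position into
# the high bits and a millisecond timestamp into the low bits, so rows
# sort first by location, then by time. The 44-bit timestamp width and
# 16-byte key size are assumptions, not the article's actual layout.
TS_BITS = 44  # ~557 years of milliseconds

def make_row_key(curve_d: int, ts_millis: int) -> bytes:
    packed = (curve_d << TS_BITS) | ts_millis
    # Fixed-width big-endian bytes, so integer order matches the
    # lexicographic order Bigtable uses for row keys.
    return packed.to_bytes(16, "big")

print(make_row_key(curve_d=13, ts_millis=int(time.time() * 1000)).hex())
```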
For Urban Fitness Forge, this meant a massive performance improvement for queries that sampled row data via lat/long, as it allowed items that were geographically close to each other to also be adjacent to each other in the table.

Your mileage may vary

Now, obviously a Hilbert space-filling curve is not the solution to every Cloud Bigtable indexing problem, but it does highlight the important part: to get the best performance out of Cloud Bigtable, it’s essential to think carefully about how you compose your row key. By choosing the correct row key template, you can avoid a painful data migration process later.

To do this, start by asking how you’ll use the data that you plan to store. For example:

  • User information: Do you need quick access to information about connections between users (for example, whether user A follows user B)?
  • User-generated content: If you show users a sample of social-style content, such as status updates, how will you decide which status updates to display to a given user?
  • Time series data: Will you often need to retrieve the most recent N records, or records that fall within a certain time range? If you’re storing data for several kinds of events, will you need to filter based on the type of event?

By understanding your needs up front, you can ensure that your row key template, and your overall schema design, provide enough flexibility to query your data efficiently.
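As an illustration, each of those questions points toward a different row key template. The sketches below are common patterns; the field names and separators are my assumptions, not prescriptions from the article:

```python
import time

def follow_key(follower: str, followed: str) -> bytes:
    # "Does user A follow user B?" becomes a single-row lookup.
    return f"{follower}#follows#{followed}".encode()

def status_key(user: str, ts_millis: int) -> bytes:
    # Prefix by user and reverse the timestamp so a scan of that user's
    # range returns the newest status updates first.
    reverse_ts = (1 << 63) - ts_millis
    return f"{user}#{reverse_ts:020d}".encode()

def event_key(device: str, event_type: str, ts_millis: int) -> bytes:
    # Promote the event type into the key ("field promotion") so that
    # filtering by type is a contiguous range scan, not a full-table filter.
    return f"{device}#{event_type}#{ts_millis:020d}".encode()

now = int(time.time() * 1000)
print(follow_key("user_a", "user_b"))
print(status_key("user_a", now))
print(event_key("sensor-42", "heartbeat", now))
```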
