July 27, 2021
Getting Started with Bigtable on GCP
If you’ve never looked at it before, Bigtable can seem a little unapproachable. In this tutorial we’ll get you past that and guide you through your first steps with Bigtable so you can start using this fully managed NoSQL database in your own projects.
Bigtable is designed for low-latency data access where scalability and reliability really matter. It’s the same technology behind many of Google’s core products, including Gmail, Maps, and YouTube, each of which serves billions of users.
In this tutorial we’ll walk you through how to use Bigtable and what you really need to know to get started, including:
- Understanding the different types of NoSQL databases
- How to set up and interact with Bigtable
- How Bigtable structures and manages data
- Common ways to query and access data
- Best practices around schema design
The 4 Types of NoSQL Databases
In the NoSQL realm there are generally four types of databases: column-based, document-based, key-value, and graph-based.
Bigtable falls into the wide-column family, along with others like Cassandra and HBase. To understand wide-column it helps to look first at traditional column-based storage. When we say column-based NoSQL, what does that actually mean? Traditional relational databases are row-based, meaning they’re optimized for returning rows of data. Consider a Users database, for example: a relational database would store each user’s first name, last name, and address near each other.
If you wanted to access the state and zip of many users, the database would have to jump around to pull all those fields. A column-based database, on the other hand, is optimized for accessing data by column instead of by row. In our Users example it would store all the names together, all the states together, all the zip codes together, and so on. This makes column-oriented reads much more efficient: to scan all the states, the database can stay within the same area on disk.
A wide-column datastore looks similar; however, it also groups the columns into column families, sets of columns that are typically used together. These column families are further optimized on disk to ensure fast access.
This will be good to keep in mind for later when we start designing our schemas. Alright, let’s get on with it and get our hands dirty.
Interacting with Bigtable
For this walkthrough I’ll be using the Bigtable command line tool, cbt. All the concepts you see here can also be done programmatically in your language of choice by pulling in the right SDK; we’re just using the CLI for simplicity.
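For a sense of what the programmatic route looks like, here’s a minimal Go sketch using the official client library, cloud.google.com/go/bigtable. The project ID `my-project` is a placeholder for your own; the instance and table names match the ones used in this tutorial.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/bigtable"
)

func main() {
	ctx := context.Background()

	// Connect to the instance ("my-project" is a placeholder project ID).
	client, err := bigtable.NewClient(ctx, "my-project", "my-instance")
	if err != nil {
		log.Fatalf("NewClient: %v", err)
	}
	defer client.Close()

	// Open a handle to the table we'll create below and read one row.
	tbl := client.Open("catalog")
	row, err := tbl.ReadRow(ctx, "sku123")
	if err != nil {
		log.Fatalf("ReadRow: %v", err)
	}
	fmt.Println(row)
}
```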
Accessing your instance with CBT
With the Bigtable instance created and the CLI installed, this is a good point to access your instance from the CLI to ensure you’ve got everything set up.
Alright, to get started let’s just list our instances
cbt listinstances
You should see the instance you just created in the output
Instance Name    Info
-------------    ----
my-instance      my-instance
And there’s the instance we created. Now if we try to list the tables with `cbt ls`, we get an error.
cbt ls
The error will read
Missing -instance
It’s telling us we need to specify the instance with a flag. We could do that, but it would get annoying typing it in every time, so let’s write it to an rc file as a default. (cbt reads defaults from `~/.cbtrc`; if your project ID isn’t already configured, you can add a `project = my-project` line there too.)
echo instance = my-instance >> ~/.cbtrc
Now when we run it we get no errors, but no output either, since we haven’t created any tables yet.
cbt ls
Data structures & schema basics
Tables
For this example we’ll be creating a product catalog that might be used by a typical retailer. So in this step we’ll create a table called `catalog`
cbt createtable catalog
Calling `ls` one more time
cbt ls
and we see our table
catalog
Column Family
Earlier I mentioned that Bigtable stores data by column. To help organize the data and limit what you pull back, columns are grouped into what are called column families. A column family groups the fields that are typically accessed in the same request, making access more efficient.
In our catalog example we may have product description fields as well as pricing or inventory fields. A listing of products may use data from the descriptors but not need all the store-level inventory.
Let’s go ahead and create a column family for those product descriptors
cbt createfamily catalog descr
And now we’ve got our column family in the table. Running the ls command again
cbt ls catalog
displays
Family Name   GC Policy
-----------   ---------
descr         <never>
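If you’d rather create the table and family from code, the Go client exposes the same operations through its admin API. A sketch, with `my-project` again standing in for your project ID:

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/bigtable"
)

func main() {
	ctx := context.Background()

	// The admin client handles schema operations like creating tables
	// and column families.
	admin, err := bigtable.NewAdminClient(ctx, "my-project", "my-instance")
	if err != nil {
		log.Fatalf("NewAdminClient: %v", err)
	}
	defer admin.Close()

	// Equivalent of `cbt createtable catalog`.
	if err := admin.CreateTable(ctx, "catalog"); err != nil {
		log.Fatalf("CreateTable: %v", err)
	}

	// Equivalent of `cbt createfamily catalog descr`.
	if err := admin.CreateColumnFamily(ctx, "catalog", "descr"); err != nil {
		log.Fatalf("CreateColumnFamily: %v", err)
	}
}
```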
Rows, Columns & Cells
Just like with relational databases, we have a concept of rows, columns, and cells. Each row is identified by a unique key you provide. A cell sits at the intersection of a row and a column. To access a specific cell you need to identify its location: row key, column family, and column qualifier.
In our case the row key will be a unique product SKU, and we’ll add a title for it in the descriptors column family.
The format will be
cbt set <table> <rowID> <colFamily>:<colQualifier>=<value>
cbt set catalog sku123 descr:title="Vintage Clock"
Now if we read our catalog table
cbt read catalog
we’ll see the value
sku123
descr:title @ 2020/03/19-16:08:47.765000
"Vintage Clock"
Notice that we didn’t explicitly create columns. With Bigtable we have dynamic schemas that allow you to create columns on the fly.
Since Bigtable is what’s called a sparsely populated database, empty fields don’t incur storage overhead, unlike in relational databases.
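For reference, the same write in Go is a mutation applied to a row key. A sketch, assuming a `tbl` handle opened as in the earlier snippet:

```go
package main

import (
	"context"

	"cloud.google.com/go/bigtable"
)

// writeTitle is the equivalent of
// `cbt set catalog sku123 descr:title="Vintage Clock"`.
func writeTitle(ctx context.Context, tbl *bigtable.Table) error {
	// A mutation names the column family, qualifier, cell timestamp,
	// and value; the "title" column is created on the fly.
	mut := bigtable.NewMutation()
	mut.Set("descr", "title", bigtable.Now(), []byte("Vintage Clock"))

	// Apply attaches the mutation to the row key.
	return tbl.Apply(ctx, "sku123", mut)
}
```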
Cell Versions
Bigtable has a concept of cell versions, allowing you to store multiple revisions of data in the same cell, distinguished by timestamp.
We just set the contents of the cell descr:title on row sku123 to “Vintage Clock”. Now run the command again with a different title.
cbt set catalog sku123 descr:title="Antique Clock"
You might expect a single record to be returned, but run the command
cbt read catalog
and you’ll see instead that we have two entries
sku123
descr:title @ 2020/03/19-16:11:07.097000
"Antique Clock"
descr:title @ 2020/03/19-16:08:47.765000
"Vintage Clock"
You’ll see that the catalog contains two versions of the cell descr:title: our original “Vintage Clock” and the updated “Antique Clock”. At first, multiple versions might seem alarming, but this can be really handy in your system designs and audits.
Garbage Collection
Since you may not want to store every version ever created, Bigtable offers the ability to expire cell versions with a feature called garbage collection. Earlier we listed the column families on our table, and you may have noticed the GC Policy set to <never>. Leaving it as-is will keep every version of the cell ever created.
cbt ls catalog
Family Name   GC Policy
-----------   ---------
descr         <never>
You can set the garbage collection policy based on the age of the cells, the number of versions, or a combination of the two. For example, you could keep a month’s worth of changes, the last five versions, or up to five versions within the last month.
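In the Go client, these map to GCPolicy values passed to SetGCPolicy. Here’s a sketch of the three shapes just described, assuming an `admin` client set up as in the earlier admin snippet (note that each call replaces the family’s current policy):

```go
package main

import (
	"context"
	"time"

	"cloud.google.com/go/bigtable"
)

// setPolicies shows the three policy shapes described above.
func setPolicies(ctx context.Context, admin *bigtable.AdminClient) error {
	// Keep a month's worth of changes.
	if err := admin.SetGCPolicy(ctx, "catalog", "descr",
		bigtable.MaxAgePolicy(30*24*time.Hour)); err != nil {
		return err
	}

	// Keep only the last five versions.
	if err := admin.SetGCPolicy(ctx, "catalog", "descr",
		bigtable.MaxVersionsPolicy(5)); err != nil {
		return err
	}

	// Keep up to five versions, and only within the last month.
	return admin.SetGCPolicy(ctx, "catalog", "descr",
		bigtable.IntersectionPolicy(
			bigtable.MaxVersionsPolicy(5),
			bigtable.MaxAgePolicy(30*24*time.Hour)))
}
```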
For our example let’s only keep one version
cbt setgcpolicy catalog descr maxversions=1
Now review the column families
cbt ls catalog
Notice the new policy listed
Family Name   GC Policy
-----------   ---------
descr         versions() > 1
But when we read the table with no flags it still returns two cells. Why is that?
cbt read catalog
sku123
descr:title @ 2020/03/19-16:11:07.097000
"Antique Clock"
descr:title @ 2020/03/19-16:08:47.765000
"Vintage Clock"
Garbage collection is a storage management feature, not a way to limit query results. In fact, it can take up to a week before data that is eligible for garbage collection is actually removed.
In practice you won’t be pulling back every revision of a cell anyway. Instead you’ll do something like the following, which pulls only the latest cell entry
cbt read catalog cells-per-column=1
sku123
descr:title @ 2020/03/19-16:11:07.097000
"Antique Clock"
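If you’re using the Go client, the same “latest version only” read is a filter on the scan. A sketch, reusing a `tbl` handle from earlier:

```go
package main

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigtable"
)

// readLatest is the equivalent of `cbt read catalog cells-per-column=1`:
// scan every row, returning only the most recent version of each column.
func readLatest(ctx context.Context, tbl *bigtable.Table) error {
	return tbl.ReadRows(ctx, bigtable.InfiniteRange(""),
		func(row bigtable.Row) bool {
			fmt.Println(row.Key())
			return true // keep scanning
		},
		bigtable.RowFilter(bigtable.LatestNFilter(1)))
}
```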
Querying and accessing Data
Bigtable has some fantastic lookup capabilities. To demonstrate them, let’s first add some more data
cbt set catalog sku124 descr:title="Vintage Record Player"
cbt set catalog sku125 descr:title="Antique Chair"
cbt set catalog sku942 descr:title="New Wireless Headphones"
cbt set catalog svc024 descr:title="Antique Repair Service"
We’ve added three more SKUs, two sequential and one up in the 900s, plus a final entry that’s a service rather than a product. Let’s see what we have now.
cbt read catalog
Retrieve Single Entry
Previously we’ve been calling `cbt read`, which returns a set of rows; calling it now will return all the records in the system. If you know which specific row you’re interested in, you can access it directly with `lookup`
cbt lookup catalog sku123
Additionally, you can get even more specific by indicating the exact columns you want
cbt lookup catalog sku123 columns=descr:title
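The Go equivalents of these lookups are single-row reads, optionally with filters. A sketch:

```go
package main

import (
	"context"

	"cloud.google.com/go/bigtable"
)

func lookups(ctx context.Context, tbl *bigtable.Table) error {
	// Equivalent of `cbt lookup catalog sku123`.
	if _, err := tbl.ReadRow(ctx, "sku123"); err != nil {
		return err
	}

	// Equivalent of `cbt lookup catalog sku123 columns=descr:title`:
	// chain a family filter with a column-qualifier filter.
	_, err := tbl.ReadRow(ctx, "sku123",
		bigtable.RowFilter(bigtable.ChainFilters(
			bigtable.FamilyFilter("descr"),
			bigtable.ColumnFilter("title"))))
	return err
}
```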
Reading All Rows
Now let’s look at the `read` command to understand some of the ways we can query the data.
We covered this previously, but as a foundation: calling `cbt read` with no additional qualifiers will return all the values
cbt read catalog
Clearly something we wouldn’t want in a normal system. Thankfully Bigtable provides a few ways to get only the data we’re interested in.
Start & End
First, it’s important to understand that Bigtable stores all of its rows in ascending lexicographic order by row key. Many of the features and patterns in Bigtable revolve around this core concept. The simplest way to see it in practice is to use `start` and `end` on the read command. Here we’re saying we want to start reading at sku124 and return all the rest of the rows.
cbt read catalog start=sku124
Or, read all the rows up to but excluding sku942
cbt read catalog end=sku942
You can combine them of course to get more targeted
cbt read catalog start=sku124 end=sku942
The values don’t need to be exact either; you can provide portions of the IDs
cbt read catalog start=sku12 end=sku9
This works because Bigtable compares the lexicographic value of sku12 against the row keys in the table. Since sku12 sorts before sku123, the range includes sku123; since sku9 sorts before sku942, the exclusive end leaves sku942 out.
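In Go the same scan is expressed with a row range; a sketch (NewRange, like the CLI, includes the start key and excludes the end key):

```go
package main

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigtable"
)

// readRange is the equivalent of `cbt read catalog start=sku124 end=sku942`.
func readRange(ctx context.Context, tbl *bigtable.Table) error {
	return tbl.ReadRows(ctx, bigtable.NewRange("sku124", "sku942"),
		func(row bigtable.Row) bool {
			fmt.Println(row.Key())
			return true // keep iterating
		})
}
```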
That’s pretty cool, but there’s more.
Prefix
You can use the prefix flag to pull only a subset of rows. In our dataset we have entries starting with sku and svc; let’s pull them separately. First the product `sku` records
cbt read catalog prefix=sku
Now the service `svc` records
cbt read catalog prefix=svc
Regex
Of course, if you want to get fancy you can use standard regex. Pull any row starting with `s`, then three of any character, followed by `24`
cbt read catalog regex=s.{3}24
Count
Finally we have count. It’s pretty self-explanatory: count returns only the first N rows you indicate. This comes in handy when dealing with time series data and other scenarios.
cbt read catalog count=3
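All three of these have direct Go counterparts. A sketch showing prefix, row-key regex, and row-count limits together:

```go
package main

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigtable"
)

func printRow(row bigtable.Row) bool {
	fmt.Println(row.Key())
	return true // keep iterating
}

func scans(ctx context.Context, tbl *bigtable.Table) error {
	// Equivalent of `cbt read catalog prefix=sku`.
	if err := tbl.ReadRows(ctx, bigtable.PrefixRange("sku"), printRow); err != nil {
		return err
	}

	// Equivalent of `cbt read catalog regex=s.{3}24`.
	if err := tbl.ReadRows(ctx, bigtable.InfiniteRange(""), printRow,
		bigtable.RowFilter(bigtable.RowKeyFilter("s.{3}24"))); err != nil {
		return err
	}

	// Equivalent of `cbt read catalog count=3`.
	return tbl.ReadRows(ctx, bigtable.InfiniteRange(""), printRow,
		bigtable.LimitRows(3))
}
```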
Schema Design
Tall Narrow Tables
Now that you’ve worked with Bigtable, it’s a good time to discuss schema design. With Bigtable data sets you’ll typically want to favor tall, narrow tables over short, wide ones.
Continuing with our retail theme, let’s assume we’re tracking shipments to our customers.
If you were interested in tracking the location of the shipment over time you might be interested in some elements such as:
- OrderID
- Shipping Company
- Vehicle ID
- Region
- GPS Location
- Timestamp
A short, wide table might have a row for each shipping company, then columns for each vehicle ID and vehicle location. This results in fewer rows but more columns.
Instead, it’s better to store this data in a tall, narrow table: for example, a row for each time a vehicle reports data. This results in many rows and fewer columns.
Avoid Hot Spots
A common challenge when dealing with time series data is hotspotting. When a burst of writes lands on row keys right next to each other (as happens with time series data), you can create hot spots in your cluster that slow things down. If a time series row key starts with a timestamp, all of your writes will target a single node, fill it, and then move on to the next node. Ideally the writes would spread across all the nodes evenly.
To combat this you’ll want to create row keys that are non-contiguous.
For our shipping data, if we simply started the row key with a timestamp, all the records would be contiguous. Instead we use a tactic called field promotion to move fields from columns into the row key itself. A better row key might be `vehicle_id#timestamp`. Since many vehicles will be reporting in a short time span, prefixing with the unique vehicle ID helps spread the writes out over the cluster.
Row Keys optimized for queries
The common way to sort and filter data in Bigtable is through the row key, so it’s important to consider your queries when designing row keys. With the shipping data you might care most about querying by shipping company, and would therefore include `shipping_co` in your row key: `shipping_co#vehicle_id#timestamp`.
You could then query shipping company `UPS` with a prefix query such as `cbt read shipments prefix=UPS` (assuming a `shipments` table holding this data).
Depending on the various queries you need, you might find many fields promoted to the row key. With our shipping data you might end up with a row key containing most of the fields, such as `region#shipping_co#timestamp#vehicle_id`.
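To make this concrete, here’s a hedged Go sketch of writing one of these tracking points. The `shipments` table, the `loc` column family, and the helper itself are illustrative assumptions, not something we created above:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"cloud.google.com/go/bigtable"
)

// recordLocation writes one GPS report using a promoted row key such as
// "UPS#truck-0042#1584634267". Keys sort by shipping company first, so a
// prefix scan on "UPS#" returns one company's fleet, and writes from many
// vehicles spread across the key space instead of hotspotting on time.
// The "shipments" table and "loc" family are hypothetical.
func recordLocation(ctx context.Context, tbl *bigtable.Table,
	shippingCo, vehicleID, gpsPoint string) error {
	key := fmt.Sprintf("%s#%s#%d", shippingCo, vehicleID, time.Now().Unix())

	mut := bigtable.NewMutation()
	mut.Set("loc", "gps", bigtable.Now(), []byte(gpsPoint))
	return tbl.Apply(ctx, key, mut)
}
```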
Cleanup
OK that’s it for this session. Let’s delete our instance and clean things up.
Delete the table, the instance, and the .cbtrc file
cbt deletetable catalog
cbt deleteinstance my-instance
rm ~/.cbtrc
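And for completeness, a sketch of the programmatic version of the cleanup using the Go admin clients (`my-project` is still a placeholder):

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/bigtable"
)

func main() {
	ctx := context.Background()

	// Equivalent of `cbt deletetable catalog`.
	admin, err := bigtable.NewAdminClient(ctx, "my-project", "my-instance")
	if err != nil {
		log.Fatalf("NewAdminClient: %v", err)
	}
	defer admin.Close()
	if err := admin.DeleteTable(ctx, "catalog"); err != nil {
		log.Fatalf("DeleteTable: %v", err)
	}

	// Equivalent of `cbt deleteinstance my-instance`.
	instAdmin, err := bigtable.NewInstanceAdminClient(ctx, "my-project")
	if err != nil {
		log.Fatalf("NewInstanceAdminClient: %v", err)
	}
	defer instAdmin.Close()
	if err := instAdmin.DeleteInstance(ctx, "my-instance"); err != nil {
		log.Fatalf("DeleteInstance: %v", err)
	}
}
```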
So there you have it, a whirlwind tour of Bigtable. I hope this gave you a little insight into how Bigtable works and how you might use it in your next project. You can find more at cloud.google.com/bigtable