If you’ve never looked at it before, Bigtable can seem a little unapproachable. In this tutorial we’ll get you past that and guide you through your first steps with Bigtable so you can start using this fully managed NoSQL database in your own projects.

Bigtable is designed for low-latency data access where scalability and reliability really matter. It’s actually the same technology behind the majority of Google products, including Gmail, Maps, and YouTube, each of which serves billions of users.

We’ll cover how to use Bigtable and what you really need to know to get started, including:

  • Understanding the different types of NoSQL databases
  • How to set up and interact with Bigtable
  • How Bigtable structures and manages data
  • Common ways to query and access data
  • Best practices around schema design

The 4 Types of NoSQL Databases

In the NoSQL realm there are generally 4 types of databases: column-based, document-based, key-value, and graph databases.

Bigtable falls into the Wide-Column family along with others like Cassandra and HBase. To understand Wide-Column it helps to look first at traditional column-based stores. When we say column-based NoSQL, what does that actually mean? Traditional relational databases are row based, meaning they’re optimized for returning rows of data. Consider a User database: a relational database would store each user’s first name, last name, and address near each other.

If you wanted to access only the state and zip of many users, a row-based database would have to jump from record to record to pull those fields. A column-based database, on the other hand, is optimized for accessing data by column instead of by row. So in our Users example it would store all the names together, all the states together, all the zip codes together, and so on. This makes reads of a single column across many records much more efficient: to scan all the states, the database can stay within the same area on disk.

A Wide-Column datastore looks similar, but it usually groups columns into Column Families, sets of columns that are typically used together. These Column Families are further optimized on disk to ensure fast access.
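To make the layout difference concrete, here is a minimal Python sketch. It is not Bigtable code: plain lists and dicts stand in for on-disk layouts, and the `users` records, field names, and family names are made up for illustration.

```python
# Illustrative only: in-memory structures standing in for storage layouts.
users = [
    {"first": "Ada", "last": "Lovelace", "state": "NY", "zip": "10001"},
    {"first": "Alan", "last": "Turing", "state": "CA", "zip": "94016"},
    {"first": "Grace", "last": "Hopper", "state": "VA", "zip": "22201"},
]

# Row-oriented: each record's fields sit together, one record after another.
row_layout = [tuple(u.values()) for u in users]

# Column-oriented: all values of one field sit together.
column_layout = {field: [u[field] for u in users] for field in users[0]}

# Wide-column: columns that are usually read together are grouped into families.
families = {"name": ["first", "last"], "address": ["state", "zip"]}
family_layout = {
    fam: {col: column_layout[col] for col in cols} for fam, cols in families.items()
}

# Scanning every state touches one contiguous list instead of every record.
print(column_layout["state"])
print(family_layout["address"])
```

Reading `column_layout["state"]` never touches names or zip codes, which is the property the column-based layout is after.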

This will be good to keep in mind later when we start designing our schemas. Alright, let’s get on with it and get our hands dirty.

Interacting with Bigtable

For this walkthrough I’ll be using the Bigtable command line tool, called cbt. Everything you see here can also be done programmatically in your language of choice by pulling in the right SDK. We’re just using the CLI for simplicity here.

Accessing your instance with CBT

With the Bigtable instance created and the CLI installed, this is a good point to access your instance from the CLI to make sure you’ve got everything set up.

Alright, to get started let’s just list our instances

cbt listinstances

You should see the instance you just created in the output

Instance Name    Info
-------------    ----
my-instance      my-instance

And there’s the instance we created. Now if we try to list the tables with `cbt ls` we get an error

cbt ls

The error will read

Missing -instance

It’s telling us we need to specify the instance with a flag. We could do that, but it’s going to get annoying adding it every time, so let’s write it to an rc file as a default

echo instance = my-instance >> ~/.cbtrc

Now when we run it there are no errors, but no output either, since we haven’t created any tables yet.

cbt ls

Data structures & schema basics

Tables

For this example we’ll be creating a product catalog that might be used by a typical retailer. So in this step we’ll create a table called `catalog`

cbt createtable catalog

Calling `ls` one more time

cbt ls

and we see our table

catalog

Column Family

Earlier I mentioned that Bigtable organizes its data by column. To help organize the data and limit what you’re pulling back, columns are grouped into what are called column families. A column family groups the fields that are typically accessed in the same request, which makes that access more efficient.

In our catalog example we may have product description fields and pricing or inventory fields. A listing of products may use data from the descriptors but not need all the store-level inventory.

Let’s go ahead and create a column family for those product descriptors

cbt createfamily catalog descr

And now we’ve got our column family in the table. Running the `ls` command again, this time with the table name

cbt ls catalog

displays

Family Name     GC Policy
-----------     ---------
descr           <never>

Rows, Columns & Cells

Just like with relational databases, we have a concept of rows, columns, and cells. Each row is identified by a unique key you provide. Cells sit at the intersection of a row and a column. To access a specific cell you need to identify its location: Row Key, Column Family, and Column Qualifier.

In our case the row key will be a unique product SKU, and we’ll add a title for it in the descriptors column family

The format will be

cbt set <table> <rowID> <colFamily>:<colQualifier>=<value>


cbt set catalog sku123 descr:title="Vintage Clock"

Now if we read our catalog table

cbt read catalog

we’ll see the value

sku123
descr:title @ 2020/03/19-16:08:47.765000
"Vintage Clock"

Notice that we didn’t explicitly create columns. Bigtable schemas are dynamic: within a column family, you can create columns on the fly just by writing to them.

Since Bigtable is what’s called a sparsely populated database, empty fields don’t incur any storage overhead, unlike in relational databases.
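As a rough mental model of cell addressing and sparseness, here is a small Python sketch. The nested-dict `table`, the `set_cell` helper, and the `descr:color` column with its value are all hypothetical and are not part of cbt or any Bigtable SDK.

```python
from collections import defaultdict

# Hypothetical in-memory model:
# row key -> "family:qualifier" -> list of (timestamp, value) versions.
table = defaultdict(lambda: defaultdict(list))


def set_cell(row_key, family, qualifier, value, timestamp):
    """Store a value at (row key, column family, column qualifier)."""
    table[row_key][f"{family}:{qualifier}"].append((timestamp, value))


set_cell("sku123", "descr", "title", "Vintage Clock", timestamp=1)
# A new qualifier appears only on the row that writes it; no schema change needed.
set_cell("sku124", "descr", "color", "Walnut", timestamp=2)

# Sparse: sku123 has no entry at all for descr:color, not an empty placeholder.
print(sorted(table["sku123"]))
print(sorted(table["sku124"]))
```

Each row carries only the columns that were actually written to it, which is why an unused column costs nothing for that row.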

Cell Versions

Bigtable has a concept of cell versions, allowing you to store multiple revisions of data in the same cell, each identified by a timestamp.

We just set the contents of the cell descr:title on row sku123 to "Vintage Clock". Now run the command again with a different title.

cbt set catalog sku123 descr:title="Antique Clock"

You may have expected a single record to come back, but run the command

cbt read catalog

and you’ll see that we have two records instead

sku123
descr:title @ 2020/03/19-16:11:07.097000
"Antique Clock"
descr:title @ 2020/03/19-16:08:47.765000
"Vintage Clock"

You’ll see that the catalog contains 2 versions of the cell descr:title: our original one with "Vintage Clock" and the update with "Antique Clock". At first multiple versions might seem alarming, but this can be really handy in your system designs and audits.

Garbage Collection

Given you may not want to store every version ever created, Bigtable can discard old cell versions with a feature called Garbage Collection. Earlier we listed the column families on our table and you may have noticed the GC Policy set to never. Leaving this as is will retain every version of the cell ever created.

cbt ls catalog

Family Name     GC Policy
-----------     ---------
descr           <never>

You can set the garbage collection policy based on the age of the cell, the number of versions, or a combination of the two. For example, you could keep a month’s worth of changes, the last 5 versions, or up to 5 versions that also fall within the last month.
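The policy options above can be sketched in plain Python. This is only a model of the idea, not how Bigtable implements garbage collection; `surviving_versions` and its parameters are hypothetical names, and passing both limits models the combined case where a version must satisfy both to be kept.

```python
from datetime import datetime, timedelta


def surviving_versions(versions, now, max_versions=None, max_age=None):
    """Return the (timestamp, value) versions left once collection has run.

    A version is dropped if it falls outside the newest max_versions
    or if it is older than max_age. Either limit may be omitted.
    """
    newest_first = sorted(versions, key=lambda v: v[0], reverse=True)
    kept = []
    for rank, (ts, value) in enumerate(newest_first):
        too_many = max_versions is not None and rank >= max_versions
        too_old = max_age is not None and now - ts > max_age
        if not (too_many or too_old):
            kept.append((ts, value))
    return kept


now = datetime(2020, 3, 19, 17, 0, 0)
versions = [
    (datetime(2020, 3, 19, 16, 8, 47), "Vintage Clock"),
    (datetime(2020, 3, 19, 16, 11, 7), "Antique Clock"),
]

# Keep only the newest version.
print(surviving_versions(versions, now, max_versions=1))
# Keep a month's worth of changes: both versions are recent enough.
print(surviving_versions(versions, now, max_age=timedelta(days=30)))
```

With `max_versions=1` only the "Antique Clock" version survives, while a 30-day age limit keeps both.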

For our example let’s keep only one version

cbt setgcpolicy catalog descr maxversions=1

Now review the column families

cbt ls catalog

Notice the new policy listed

Family Name     GC Policy
-----------     ---------
descr           versions() > 1

But when we read the table with no flags it still returns 2 cells. Why is that?

cbt read catalog

sku123
descr:title @ 2020/03/19-16:11:07.097000
"Antique Clock"
descr:title @ 2020/03/19-16:08:47.765000
"Vintage Clock"

Garbage collection is a storage management feature, not a way to limit query results. In fact, it can take up to a week before data that is eligible for garbage collection is actually removed.

In practice you won’t be pulling back all revisions of a cell anyway. Instead you’ll do something like the following, which pulls only the latest cell entry

cbt read catalog cells-per-column=1

sku123
descr:title @ 2020/03/19-16:11:07.097000
"Antique Clock"

Querying and accessing Data

Bigtable has some fantastic lookup capabilities. To demonstrate them, let’s first add some more data

cbt set catalog sku124 descr:title="Vintage Record Player"
cbt set catalog sku125 descr:title="Antique Chair"
cbt set catalog sku942 descr:title="New Wireless Headphones"
cbt set catalog svc024 descr:title="Antique Repair Service"

We’ve added 3 more SKUs, two sequential (sku124 and sku125) and one in the 900s. We’ve also added the last entry as a service rather than a product.

Let’s see what we have now.

cbt read catalog

Retrieve Single Entry

Previously we’ve been calling `cbt read`, which returns a set of rows. Calling it now will return all the records we have in the table. If you know exactly which row you’re interested in, you can access it directly with `lookup`

cbt lookup catalog sku123

You can get even more specific by indicating the exact columns you want

cbt lookup catalog sku123 columns=descr:title

Reading All Rows

Now let’s look at the `read` command to understand some of the ways we can query the data.

We covered this previously, but as a foundation: calling `cbt read` with no additional qualifiers returns every row in the table

cbt read catalog

Clearly that’s not something we’d want in a normal system. Thankfully Bigtable provides a few ways to get only the data we’re interested in.

Start & End

First it’s important to understand that Bigtable stores all its rows in ascending order by row key. Many of the features and patterns in Bigtable revolve around this core concept. To see it in practice, the simplest way is to use `start` and `end` on the read command. Here we’re saying we want to start reading at sku124 and return all the rest of the rows.

cbt read catalog start=sku124

Or, read all the rows up to but excluding sku942

cbt read catalog end=sku942

You can of course combine them to get more targeted

cbt read catalog start=sku124 end=sku942

The values don’t need to be exact either; you can provide just the leading portion of a row key

cbt read catalog start=sku12 end=sku9

This works because it’s comparing the lexical value of sku12 against the row keys in the table. Since sku12 comes before sku123, the result will include sku123. Since sku9 comes before sku942, the result will exclude sku942.
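The comparison can be reproduced with ordinary string ordering. The sketch below is a plain Python model, not Bigtable code; `read_range` is a hypothetical helper that treats `start` as inclusive and `end` as exclusive, matching the behaviour described above.

```python
from bisect import bisect_left

# Row keys from this walkthrough, kept in ascending lexical order.
row_keys = sorted(["sku942", "svc024", "sku123", "sku125", "sku124"])


def read_range(keys, start=None, end=None):
    """Return the keys in [start, end): start inclusive, end exclusive."""
    lo = bisect_left(keys, start) if start is not None else 0
    hi = bisect_left(keys, end) if end is not None else len(keys)
    return keys[lo:hi]


print(read_range(row_keys, start="sku124"))
print(read_range(row_keys, end="sku942"))
print(read_range(row_keys, start="sku12", end="sku9"))
```

The last call returns sku123, sku124, and sku125: "sku12" sorts before all three, and "sku9" sorts after them but before "sku942".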

That’s pretty cool, but there’s more

Prefix

You can use the prefix flag to pull only a subset of rows. In our dataset we have entries starting with sku and svc. Let’s pull them separately. First the product `sku` records

cbt read catalog prefix=sku

Now the service `svc` records

cbt read catalog prefix=svc
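One way to think about a prefix read is as a start/end range over the sorted row keys. The Python sketch below illustrates that idea only; `prefix_range` is a hypothetical helper, and nothing here claims to be how cbt implements the flag.

```python
def prefix_range(prefix):
    """Return (start, end) so that start <= key < end matches keys with this prefix.

    Assumes a non-empty ASCII prefix whose last character can be incremented.
    """
    end = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, end


row_keys = sorted(["sku123", "sku124", "sku125", "sku942", "svc024"])

start, end = prefix_range("sku")
matches = [k for k in row_keys if start <= k < end]
print(start, end)
print(matches)
```

For the prefix "sku" the range is "sku" up to but excluding "skv", which covers the four product rows and leaves out svc024.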

Regex

Of course, if you want to get fancy you can use a regex. Pull any row whose key starts with `s`, followed by any 3 characters, followed by `24`

cbt read catalog regex=s.{3}24
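To check which of our row keys that pattern selects, here is a small Python sketch. It uses Python’s `re` module as a stand-in and assumes the pattern has to match the entire row key, which is why it calls `fullmatch`; regex dialects differ, so treat it as an approximation.

```python
import re

row_keys = ["sku123", "sku124", "sku125", "sku942", "svc024"]

# Assumption: the pattern must match the whole row key, so use fullmatch, not search.
pattern = re.compile(r"s.{3}24")
matches = [k for k in row_keys if pattern.fullmatch(k)]
print(matches)
```

Only sku124 and svc024 fit: an `s`, three arbitrary characters, then `24`.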

Count

Finally we have count. It’s pretty self-explanatory: count limits the read to the number of rows you specify. This comes in handy when dealing with time series data and other scenarios.

cbt read catalog count=3

Schema Design

Tall Narrow Tables

Now that you’ve worked with Bigtable, it’s a good time to discuss schema design. Typically with Bigtable data sets you’ll want to favor tall narrow tables over short wide tables.

Continuing with our retail theme, let’s assume we’re tracking shipments to our customers.

If you wanted to track the location of a shipment over time, you might record elements such as:

  • OrderID
  • Shipping Company
  • Vehicle ID
  • Region
  • GPS Location
  • Timestamp

A short wide table might have a row for each shipping company, then columns for each vehicle ID and vehicle location. This would result in fewer rows but more columns.

Instead it’s better to store this data in tall narrow tables. For example, you would have a row for each time a vehicle reports data. This would result in many rows and fewer columns.

Avoid Hot Spots

A common challenge when dealing with time series data is hotspotting. When there are a bunch of writes to row keys right next to each other (as with time series data), you can create hot spots in your cluster that slow things down. When a row key for a time series starts with a timestamp, all of your writes will target a single node, fill that node, and then move on to the next node. Ideally the writes would spread evenly across all the nodes.

To combat this you’ll want to create row keys that are non-contiguous.

For our shipping data, if we simply stored each report under a row key starting with `timestamp`, all the records would be contiguous. Instead we use a tactic called field promotion, which moves fields from columns into the row key itself. A better row key might be `vehicle_id_#timestamp`. Since many vehicles will be reporting within a short time span, prefixing the key with the unique vehicle ID helps spread the data out over the cluster.
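The effect of the key order can be seen in a small Python sketch. It only models sorted key positions, not real nodes or tablets; the vehicle IDs, timestamps, and the `positions_of_latest_writes` helper are made up for illustration, and the keys use a plain `#` separator.

```python
vehicles = ["truck-a", "truck-b", "truck-c"]  # hypothetical vehicle IDs
timestamps = ["20200319T1600", "20200319T1601", "20200319T1602", "20200319T1603"]


def positions_of_latest_writes(make_key):
    """Sort all row keys, then report where the newest timestamp's writes land."""
    keys = sorted(make_key(v, t) for v in vehicles for t in timestamps)
    latest = timestamps[-1]
    return [keys.index(make_key(v, latest)) for v in vehicles]


# Timestamp first: the newest writes all sit side by side at the end of the key space.
timestamp_first = positions_of_latest_writes(lambda v, t: f"{t}#{v}")
# Vehicle first: the newest writes are spread across the key space.
vehicle_first = positions_of_latest_writes(lambda v, t: f"{v}#{t}")

print(timestamp_first)
print(vehicle_first)
```

With timestamp-first keys the three newest writes occupy adjacent positions 9, 10, and 11; with vehicle-first keys they land at 3, 7, and 11, far apart in the sorted key space.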

Row Keys optimized for queries

The common way to sort and filter data in Bigtable is through the row key, so it’s important to consider your queries when designing row keys. With the shipping data you might be more concerned with querying by shipping company, and would therefore need to include `shipping_co` in your row key: `shipping_co#vehicle_id_#timestamp`

You could then query shipping company `UPS` with the prefix query `cbt read catalog prefix=UPS`

Depending on the various queries you need, you might find many fields promoted to the row key. With our shipping data you might see a row key containing most of the fields, such as `region#shipping_co#timestamp#vehicle_id`
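Here is a minimal Python sketch of building such composite keys and reading them back by prefix. `make_row_key` and `read_prefix` are hypothetical helpers, and the company names, vehicle IDs, and timestamps are invented sample data.

```python
def make_row_key(shipping_co, vehicle_id, timestamp):
    """Join promoted fields with '#', most significant query field first."""
    return "#".join([shipping_co, vehicle_id, timestamp])


# Hypothetical reports, stored in ascending row key order.
row_keys = sorted([
    make_row_key("UPS", "truck-a", "20200319T1600"),
    make_row_key("ACME", "van-7", "20200319T1600"),
    make_row_key("UPS", "truck-b", "20200319T1601"),
    make_row_key("UPS", "truck-a", "20200319T1605"),
])


def read_prefix(keys, prefix):
    """Return the keys that start with the given prefix."""
    return [k for k in keys if k.startswith(prefix)]


print(read_prefix(row_keys, "UPS#"))
print(read_prefix(row_keys, "UPS#truck-a#"))
```

Including the trailing `#` in the prefix keeps a company whose name merely begins with "UPS" out of the result, and each extra promoted field narrows the scan further.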

Cleanup

OK, that’s it for this session. Let’s delete our instance and clean things up.

Delete the table, the instance & the .cbtrc file

cbt deletetable catalog
cbt deleteinstance my-instance
rm ~/.cbtrc

So there you have it, a whirlwind tour of Bigtable. I hope this gave you a little insight into how Bigtable works and how you might use it in your next project. You can find more about it on cloud.google/bigtable

Translated from: https://medium/google-cloud/getting-started-with-bigtable-on-gcp-adfb896e0b26
