Spanner is a distributed database Google initiated a while ago to build a highly available and highly consistent database for its own workloads. Spanner was initially built to be a key/value and was in a completely different shape than it is today and it had different goals. Since the beginning, it had transactional capabilities, external consistency and was able to failover transparently. Over time, Spanner adopted a strongly typed schema and some other relational database features. In the last years, it added SQL support*. Today we are improving both the SQL dialect and the relational database features simultaneously. Sometimes there is confusion whether Spanner supports SQL or not. The short answer is yes. The long answer is this article.

Early years

Spanner was initially built for internal workloads and no one was able to see Google is going to initiate a cloud business and externalize Spanner. If you don’t know much about Google, our internal stack is often considered as a different universe. At Google, all systems including storage and database services provide their own proprietary APIs and clients. Are you planning to use your favorite ORM library when you join to Google? Unfortunately, it’s not possible. Our infrastructure services provide their own Stubby/gRPC APIs and client libraries. This is a minor disadvantage if you care about API familiarity but it’s a strong differentiator because we can provide more expressive APIs that represent the differentiated features of our infrastructure.

Differentiated features are often poorly represented in common APIs. One size doesn’t fit all. Common APIs can only target the common features. Distributed databases are already vastly different than traditional relational databases. I’ll give two examples in this article to show how explicit APIs make a difference. Distributed systems fail differently and more often. In order to deal with this problem, distributed systems implement retrying mechanisms. Spanner clients transparently retries transactions when we fail to commit them. This allows us not to surface every temporary failure to the user. We transparently retry with the right backoff strategy.

In the following snippet, you see some Go code starting a new read-write transaction. It takes a function where you can query and manipulate data. When there is an abort or conflict, it retries the function automatically. ReadWriteTransaction documents this behavior and documents that the function should be safe to retry (e.g. telling the developers don’t hold application state). This allows us to communicate the unique reality of distributed databases to the user. We can also provide capabilities like auto-retries which are harder to implement in traditional ORMs.

import "cloud.google.com/go/spanner"

_, err := client.ReadWriteTransaction(ctx, func(ctx context.Context, txn *spanner.ReadWriteTransaction) error {
    // User code here.
})

Another example is the isolation levels. Spanner implements an isolation level better than the strictest isolation level (serializable) described in the SQL standard. Spanner doesn’t allow you to pick anything less strict for read/write transactions. But for multi-regional setups and read-only transactions, providing the strongest isolation is not always feasible. Our limits are tested by the speed of the light. For users who are ok with slightly stale data, Spanner has capabilities to provide stale reads. Stale reads allow users to read the version available in the regional replica. They can set how much staleness they can tolerate. For example, the transaction below can tolerate up to 10 seconds.

import "cloud.google.com/go/spanner"
client.ReadOnlyTransaction().
    WithTimestampBound(spanner.MaxStaleness(10*time.Second)).
    Query(ctx, query)

Staleness API allows us to explicitly express how snapshot isolation works and how Spanner can go and fetch the latest data if replica is very out of date. It also allows us to highlight how multi-regional replication is a hard problem and even with a database like Spanner, you may consider compromising from consistency for better latency characteristics in a multi-regional setup.

F1

F1 was the original experiment for the first steps towards having SQL support in Spanner. F1 is a distributed database at Google that is built on top of Spanner. Unlike Spanner, it supported:

  • Distributed SQL queries
  • Transactionally consistent secondary indexes
  • Change history and stream

It implemented these features in a coordination layer on top of Spanner and handed off everything else to Spanner.

F1 was built to support our Ads products. Given the nature of the ads business and the complexity of our Ads products, being able to write and run complex queries was critical. F1 made Spanner more accessible to business-logic heavy systems.

Spanner in Cloud

Fast forward, Google Cloud launched Spanner for our external customers. When it was first released, it only had SQL support to query data. It lacked INSERT, UPDATE and DELETE statements.

Given it didn’t fully a SQL database back then, it also lacked driver support for JDBC, database/sql and similar. Driver support became a possibility when Cloud Spanner implemented a Data Manipulation Language (DML) support for inserts, updates and deletes.

Today, Cloud Spanner supports both DDLs (for schema) and DMLs (for data). Cloud Spanner uses a SQL dialect used by Google. ZetaSQL, a native parser and analyzer of this dialect has open sourced a while ago. As of today, Cloud Spanner also provides a query analysis tool.

An example query analysis from Spanner’s query analyzer.

Next?

There are current challenges originated from the differences present in our SQL dialect. This is an area we are actively trying to improve. Not just we don’t want our users to deal with a new SQL flavor, the current situation is also slowing down our work on ORM integrations. Some of the ORM frameworks hardcode SQL when generating queries and giving drivers little flexibility to override behavior. In order to avoid any inconsistencies, we are trying to close the differences with popular dialects.

Dialect differences are not the only problem affecting our SQL support. One other significant gap is the lack of some of the common traditional database features. Spanner never supported features like default values or autogenerated IDs. As we are improving the dialect differences, it’s always in our radar to simultaneously address these significant gaps.


(*) The initial work on Spanner’s querying work is published as Spanner: Becoming a SQL System.