SQL Support

Spanner is a distributed database that Google started a while ago to build a highly available and highly consistent database for its own workloads. Spanner was initially built to be a key/value store; it was in a completely different shape than it is today and had different goals. Since the beginning, it had transactional capabilities, external consistency and transparent failover. Over time, Spanner adopted a strongly typed schema and other relational database features. In the last years, it added SQL support*. Today we are improving both the SQL dialect and the relational database features simultaneously. Sometimes there is confusion about whether Spanner supports SQL or not. The short answer is yes. The long answer is this article.

Early years

Spanner was initially built for internal workloads, and no one could foresee that Google would start a cloud business and externalize Spanner. If you don’t know much about Google, our internal stack is often considered a different universe. At Google, all systems including storage and database services provide their own proprietary APIs and clients. Are you planning to use your favorite ORM library when you join Google? Unfortunately, it’s not possible. Our infrastructure services provide their own Stubby/gRPC APIs and client libraries. This is a minor disadvantage if you care about API familiarity, but it’s a strong differentiator because we can provide more expressive APIs that represent the differentiated features of our infrastructure.

Differentiated features are often poorly represented in common APIs. One size doesn’t fit all. Common APIs can only target the common features. Distributed databases are already vastly different from traditional relational databases. I’ll give two examples in this article to show how explicit APIs make a difference. Distributed systems fail differently and more often. In order to deal with this problem, distributed systems implement retrying mechanisms. Spanner clients transparently retry transactions when we fail to commit them. This allows us not to surface every temporary failure to the user. We transparently retry with the right backoff strategy.

In the following snippet, you see some Go code starting a new read-write transaction. It takes a function where you can query and manipulate data. When there is an abort or conflict, it retries the function automatically. ReadWriteTransaction documents this behavior and notes that the function should be safe to retry (e.g. telling developers not to hold application state in it). This allows us to communicate the unique reality of distributed databases to the user. We can also provide capabilities like auto-retries, which are harder to implement in traditional ORMs.

import "cloud.google.com/go/spanner"

_, err := client.ReadWriteTransaction(ctx, func(ctx context.Context, txn *spanner.ReadWriteTransaction) error {
    // User code here.
})

Another example is isolation levels. Spanner implements an isolation level stronger than the strictest isolation level (serializable) described in the SQL standard. Spanner doesn’t allow you to pick anything less strict for read/write transactions. But for multi-regional setups and read-only transactions, providing the strongest isolation is not always feasible; our limits are tested by the speed of light. For users who are OK with slightly stale data, Spanner provides stale reads. Stale reads allow users to read the version available in the regional replica, and they can set how much staleness they can tolerate. For example, the transaction below can tolerate up to 10 seconds of staleness.

import "cloud.google.com/go/spanner"
client.ReadOnlyTransaction().
    WithTimestampBound(spanner.MaxStaleness(10*time.Second)).
    Query(ctx, query)

The staleness API allows us to explicitly express how snapshot isolation works and how Spanner can go and fetch the latest data if a replica is very out of date. It also allows us to highlight that multi-regional replication is a hard problem, and that even with a database like Spanner you may consider trading some consistency for better latency characteristics in a multi-regional setup.

F1

F1 was the original experiment in taking the first steps towards SQL support in Spanner. F1 is a distributed database at Google that is built on top of Spanner. Unlike Spanner, it supported:

  • Distributed SQL queries
  • Transactionally consistent secondary indexes
  • Change history and stream

It implemented these features in a coordination layer on top of Spanner and handed off everything else to Spanner.

F1 was built to support our Ads products. Given the nature of the ads business and the complexity of our Ads products, being able to write and run complex queries was critical. F1 made Spanner more accessible to business-logic heavy systems.

Spanner in Cloud

Fast forward, Google Cloud launched Spanner for our external customers. When it was first released, it only had SQL support to query data. It lacked INSERT, UPDATE and DELETE statements.

Given it wasn’t fully a SQL database back then, it also lacked driver support for JDBC, database/sql and similar interfaces. Driver support became a possibility when Cloud Spanner implemented Data Manipulation Language (DML) support for inserts, updates and deletes.
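To make that concrete, here is a minimal sketch of a DML update through the Go client. The tweets table and its columns are hypothetical; the point is that inserts, updates and deletes can now be expressed as SQL statements inside a read-write transaction.

import (
    "context"

    "cloud.google.com/go/spanner"
)

func incrementLikes(ctx context.Context, client *spanner.Client, id int64) error {
    _, err := client.ReadWriteTransaction(ctx, func(ctx context.Context, txn *spanner.ReadWriteTransaction) error {
        stmt := spanner.Statement{
            SQL:    `UPDATE tweets SET likes = likes + 1 WHERE id = @id`,
            Params: map[string]interface{}{"id": id},
        }
        // Update executes the DML statement and returns the number of affected rows.
        _, err := txn.Update(ctx, stmt)
        return err
    })
    return err
}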

Today, Cloud Spanner supports both DDL (for schema) and DML (for data). Cloud Spanner uses the SQL dialect used across Google. ZetaSQL, a parser and analyzer for this dialect, was open sourced a while ago. As of today, Cloud Spanner also provides a query analysis tool.

An example query analysis from Spanner’s query analyzer.
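As a small illustration of the dialect, here is a parameterized query through the Go client; the tweets table is hypothetical. Cloud Spanner SQL uses @named query parameters rather than positional placeholders.

import (
    "context"
    "fmt"

    "cloud.google.com/go/spanner"
    "google.golang.org/api/iterator"
)

func topTweets(ctx context.Context, client *spanner.Client, minLikes int64) error {
    stmt := spanner.Statement{
        SQL:    `SELECT id, text FROM tweets WHERE likes > @minLikes ORDER BY likes DESC LIMIT 10`,
        Params: map[string]interface{}{"minLikes": minLikes},
    }
    iter := client.Single().Query(ctx, stmt)
    defer iter.Stop()
    for {
        row, err := iter.Next()
        if err == iterator.Done {
            return nil
        }
        if err != nil {
            return err
        }
        var id int64
        var text string
        if err := row.Columns(&id, &text); err != nil {
            return err
        }
        fmt.Println(id, text)
    }
}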

Next?

There are challenges originating from the differences in our SQL dialect. This is an area we are actively trying to improve. Not only do we not want our users to deal with a new SQL flavor; the current situation is also slowing down our work on ORM integrations. Some ORM frameworks hardcode SQL when generating queries and give drivers little flexibility to override behavior. In order to avoid inconsistencies, we are trying to close the differences with popular dialects.

Dialect differences are not the only problem affecting our SQL support. One other significant gap is the lack of some common traditional database features. Spanner never supported features like default values or autogenerated IDs. As we address the dialect differences, it’s always on our radar to simultaneously address these significant gaps.


(*) The initial work on Spanner’s query support is published as Spanner: Becoming a SQL System.

High Availability Writes

Spanner is a relational database with 99.999% availability, which corresponds to roughly 5 minutes of downtime a year. Spanner is a distributed system and can span multiple machines, multiple datacenters (and even geographical regions when configured). It splits the records automatically among its replicas and provides automatic failover. Unlike traditional failover models, Spanner doesn’t fail over to a secondary cluster but elects an available read-write replica as the new leader.

In relational databases, providing both high availability and high consistency in writes is a very hard problem. Spanner’s synchronous replication, its use of dedicated networking, and Paxos voting provide high availability without compromising consistency.

High availability of reads vs writes

In traditional relational databases (e.g. MySQL or PostgreSQL), scaling reads and providing higher availability for them is easier than for writes. Read-only replicas provide a copy of the data that read-only transactions can retrieve from. Data is replicated to the read-only replicas from a read-write master either synchronously or asynchronously.

In synchronous models, the master synchronously writes to the read replicas at each write. Even though this model ensures that read-only replicas always have the latest data, it makes writes quite expensive (and causes availability issues for writes) because the master has to write to all available replicas before it returns.

In asynchronous models, read-only replicas get the data from a stream or a replication log. Asynchronous models make writes faster but introduce a lag between the master and the read-only replicas. Users have to tolerate the lag and should monitor it to identify replication outages. Asynchronous writes make the system inconsistent because not all the replicas will have the latest version of the data until the asynchronous synchronization is complete. Synchronous writes keep the data consistent by ensuring all replicas receive the change before a write succeeds.


Horizontally scaling reads by adding more read replicas is only part of the problem. Scaling writes is a harder problem. Having more than one master introduces additional problems: if a master is having an outage, the other(s) can keep serving writes without users experiencing downtime, but this model requires replication of writes among masters. Similar to read replication, multi-master replication can be implemented asynchronously or synchronously. If implemented synchronously, it often means less availability for writes because a write should replicate to all masters, and they should all be available, before it can succeed. As a tradeoff, multi-master replication is often implemented with asynchronous replication, but it negatively impacts the overall system by introducing:

  • Looser consistency characteristics that violate ACID promises.
  • Increased risk of timeouts and communication latency.
  • Necessity for conflict resolution between two or more masters if conflicting updates happened but were not communicated.

Due to the complexity and the failure modes multi-master replication introduces, it’s not a commonly preferred way of providing high availability in practice.

As an alternative, high-availability clusters are a more popular choice. In this model, you’d have an entire cluster that can take over when the primary master goes down. Today, cloud providers implement this model to provide high availability features for their managed traditional relational database products.

HA cluster

Topology

Spanner doesn’t use high availability clusters but approaches the problem from a different angle. A Spanner cluster* contains multiple read-write replicas, and may contain read-only and witness replicas.

  • Read-write replicas serve reads and writes.
  • Read-only replicas serve reads.
  • Witnesses don’t serve data but participate in leader election.

Read-only and witness replicas are only used in multi-regional Spanner clusters that span multiple geographical regions. Single-region clusters only use read-write replicas. Each replica lives in a different zone in the region to avoid a single point of failure due to zonal outages.

Split leader in writes

Splits

Spanner’s replication and sharding capabilities come from its splits. Spanner splits data to replicate and distribute it among the replicas. Splitting happens automatically when Spanner detects high read or high write load among the records. Each split is replicated and has a leader replica.

When a write arrives, Spanner finds the split the row belongs to. Then we look up the leader of that split and route the write to the leader. This is true even in multi-region setups where the user is geographically closer to another, non-leader read-write replica. In the case of an outage of the leader, an available read-write replica is elected as the new leader and the user’s write is served from there.

Split leader and replicas
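For illustration, a single-row write goes through the same client call no matter where the client runs; finding the owning split and routing the mutation to that split’s leader happens under the hood. The Accounts table here is hypothetical.

import "cloud.google.com/go/spanner"

// The client library locates the split that owns this key and routes the
// mutation to that split's leader replica.
m := spanner.InsertOrUpdate("Accounts",
    []string{"id", "balance"},
    []interface{}{"alice", int64(100)})
_, err := client.Apply(ctx, []*spanner.Mutation{m})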

In order for a write to succeed, a leader needs to synchronously replicate the change to the other replicas. But doesn’t this impact the availability of writes negatively? If writes needed to wait for all replicas to succeed, any replica could be a single point of failure, because writes wouldn’t succeed until all replicas replicated the change.

This is where Spanner does something better. Spanner only requires a majority of the Paxos voters to successfully write. This allows writes to succeed even when a read-write replica goes down: only a majority of the voters is required, not all of the read-write replicas.
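To make the arithmetic concrete, here is a tiny illustrative sketch (not Spanner code) of the majority rule:

// A write commits once a majority of the voting replicas acknowledge it.
func quorum(voters int) int { return voters/2 + 1 }

// quorum(3) == 2: with three voting replicas, one can be unavailable
// and writes still succeed.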

Synchronous replication

As mentioned above, synchronous replication is hard and impacts the availability of writes negatively. On the other hand, when replication happens asynchronously, it causes inconsistencies, conflicts and sometimes data loss. For example, when a master becomes unavailable due to a networking issue, it may have committed changes but might not have delivered them to the secondary master. If the secondary master updates the same records after a failover, data loss can happen or conflict resolution may be required. PostgreSQL provides a variety of replication models with different tradeoffs. The summary below can give you a very high level idea of how many different concerns there are to worry about when designing replication models.

A summary of various PostgreSQL replication models and their tradeoffs.

Spanner’s replication is synchronous. Leaders have to synchronously communicate the change to the other read/write replicas and get it confirmed (by a majority of voters) in order for a write to succeed.

Two-phase commit (2PC)

While writes affecting only a single split use a simpler and faster protocol, two-phase commit (2PC) is executed if two or more splits are required for a write transaction. 2PC is infamously known as “the anti-availability protocol” because it requires participation from all the replicas and any replica can be a single point of failure. Spanner still serves writes even if some of the replicas are unavailable, because only a majority of voting replicas is required in order to commit a write.

Network

Spanner is a distributed system and is inherently affected by the problems that impact distributed systems in general. Networking itself is a factor in outages of distributed systems. On the other hand, Google cites that only 7.6% of Spanner failures were networking related. Spanner’s 99.999% availability is not highly affected by networking outages. This is mostly because it runs on Google’s private network. Years of operational maturity, reserved resources, and having control over upgrades and hardware make networking an insignificant source of outages. Eric Brewer’s earlier article explains the role of networking in this case in more detail.

Colossus

Spanner’s durability guarantees come from Google’s distributed file system, Colossus. Spanner also mitigates some more risk by depending on Colossus. The use of Colossus allows us to have the file storage decoupled from the database service. Spanner has a “shared nothing” architecture, and because any server in a cluster can read from Colossus, replicas can recover quickly from whole-machine failures.

Colossus also provides replication and encryption. If a Colossus instance goes down, Spanner can still work on the data via the available Colossus instances. Colossus encrypts data and this is why Spanner provides encryption at rest by default out of the box.

Colossus replication

Spanner read-write replicas hand off the data to Colossus, where it is replicated three times. Given there are three read-write replicas in a Spanner cluster, this means the data is replicated nine times in total.

Automatic Retries

As repeatedly mentioned above, Spanner is a distributed system and is not magic. It experiences more internal aborts and timeouts than traditional databases when writing. A common strategy in distributed systems to deal with partial and temporary failures is to retry. Spanner client libraries provide automatic retries for read/write transactions. In the following Go snippet, you see the API to create a read-write transaction. The client automatically retries the body if it fails due to aborts or conflicts:

import "cloud.google.com/go/spanner"
_, err := client.ReadWriteTransaction(ctx, func(ctx context.Context, txn *spanner.ReadWriteTransaction) error {
    // User code here.
})

One of the challenges of developing ORM framework support for Google Cloud Spanner was the fact that most ORMs don’t have automatic retries, and therefore their APIs don’t give developers a sense that they shouldn’t maintain application state in the scope of a transaction. In contrast, the Spanner libraries care a lot about retries and make an effort to deliver them automatically without creating an extra burden for the user.
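As a sketch of what that contract means in practice: anything the transaction body computes should be derived inside the closure, because the body may run more than once. The tweets table below is hypothetical.

_, err := client.ReadWriteTransaction(ctx, func(ctx context.Context, txn *spanner.ReadWriteTransaction) error {
    // Everything here must be safe to recompute: the closure is re-run
    // if the transaction is aborted and retried.
    row, err := txn.ReadRow(ctx, "tweets", spanner.Key{int64(1)}, []string{"likes"})
    if err != nil {
        return err
    }
    var likes int64
    if err := row.Column(0, &likes); err != nil {
        return err
    }
    // Buffer the write based on what was read in this attempt, not on
    // state captured outside the closure.
    m := spanner.Update("tweets", []string{"id", "likes"}, []interface{}{int64(1), likes + 1})
    return txn.BufferWrite([]*spanner.Mutation{m})
})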

Spanner approaches sharding and replication differently than traditional relational databases. It utilizes Google’s infrastructure and fine-tunes several traditionally hard problems to provide high availability without compromising consistency.


  • (*) Google Cloud Spanner’s terminology for a cluster is an instance. I avoided using “instance” because it is an overloaded term and might mean “replica” to the majority of the readers of this article.
  • (**) The write is routed to the split leader. Read the Splits section for more.

Transactions

You need a transaction when you want to do a series of reads and writes that have to be performed as a single unit. Problems like transferring money from one account to another, or reading a value to multiply it by a number and write back the result, require a series of reads and writes; therefore they require transactions.
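As an illustration, here is a minimal sketch of such a transfer as a Spanner read/write transaction, assuming a hypothetical Accounts table with id and balance columns:

import (
    "context"

    "cloud.google.com/go/spanner"
)

func transfer(ctx context.Context, client *spanner.Client, from, to string, amount int64) error {
    _, err := client.ReadWriteTransaction(ctx, func(ctx context.Context, txn *spanner.ReadWriteTransaction) error {
        // Read both balances within the same transaction.
        readBalance := func(id string) (int64, error) {
            row, err := txn.ReadRow(ctx, "Accounts", spanner.Key{id}, []string{"balance"})
            if err != nil {
                return 0, err
            }
            var balance int64
            err = row.Column(0, &balance)
            return balance, err
        }
        fromBalance, err := readBalance(from)
        if err != nil {
            return err
        }
        toBalance, err := readBalance(to)
        if err != nil {
            return err
        }
        // Buffer both writes; they are applied atomically when the
        // transaction commits, or not at all.
        return txn.BufferWrite([]*spanner.Mutation{
            spanner.Update("Accounts", []string{"id", "balance"}, []interface{}{from, fromBalance - amount}),
            spanner.Update("Accounts", []string{"id", "balance"}, []interface{}{to, toBalance + amount}),
        })
    })
    return err
}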

Spanner is a transactional database with quite a few interesting capabilities:

  • It can replicate data in a globally distributed way automatically for availability and locality.
  • It can automatically shard data among multiple machines and even data centers.
  • It versions data and you can go back in time to read a value at a particular timestamp.
  • It provides external consistency of read/write transactions. This allows two transactions with read/writes to the same value to run in parallel and succeed if operations are serializable.
  • It provides globally consistent reads at a particular timestamp, so you can read in the past.

Spanner provides some of these capabilities by the way it satisfies the ACID properties and by its clock source, TrueTime.

Transaction Types

Spanner provides two transaction types:

  • Read-only transactions: Allow users to read in a transaction from a particular timestamp. These transactions are lock-free and you should use them if you only need to read.
  • Read/write transactions: Allow users to read and buffer writes in a transaction. Spanner doesn’t allow you to see the result of the writes until you commit the transaction.

Note how the Spanner clients allow you to buffer mutations:

import "cloud.google.com/spanner"

client.ReadWriteTransaction(ctx, func(ctx context.Context, t *spanner.ReadWriteTransaction) error {
    var likes int64
    // Read likes column...
    m := spanner.Update(
	    "tweets",
	    []string{"id", "likes"},
	    []interface{}{id, likes})
    return t.BufferWrite([]*spanner.Mutation{m})
})

ACID

ACID stands for atomicity, consistency, isolation and durability. ACID provides a minimal promise to application developers about what their database satisfies, so they can focus on their problems instead of solving database problems. Without ACID or similar contracts, application developers wouldn’t have guidance on what their responsibility is versus what their database provides. Spanner satisfies ACID, and actually provides a stronger contract.

Atomicity means all the operations in a transaction should either fail or succeed together. Spanner provides this by failing a transaction if any of the writes in that transaction fails. Spanner doesn’t allow you to read the results of a write within the transaction; instead it commits the writes when committing the transaction. If a single write fails, the transaction fails.

Consistency means data should be consistent before and after a transaction. Spanner satisfies this. With strong reads, Spanner allows users to see the same data globally.

Isolation is provided by certain properties and behavior of the read-only and read/write transactions. Spanner’s isolation levels are slightly different from traditional databases. See the Isolation Levels section for details.

  • In read/write transactions, all reads return data as of a particular timestamp. If anything that was read in the transaction is mutated, the transaction fails. All writes need to be committed for the transaction to commit, and all writes in the transaction are only visible once the transaction commits.
  • In read-only transactions, all reads return data as of a particular timestamp. These transactions are lock-free and should be used for performance gains if no writes are needed.

Transactions can read from the current state or sometime in the past. An example of a read-only transaction that reads from a minute ago:

client.ReadOnlyTransaction().
	WithTimestampBound(spanner.ExactStaleness(1 * time.Minute)).
	Query(...)

Durability means committed transactions shouldn’t be lost and Spanner satisfies this.

Isolation Levels

The SQL standard defines four isolation levels; from the lowest level of isolation to the highest, they are read uncommitted, read committed, repeatable read and serializable. Spanner provides an isolation level stricter than serializable by default, and users are given an option to read stale data with a timestamp bound for performance.

Spanner is a first in external consistency. Spanner can run multiple transactions in parallel (in different data centers) and still check that they are serializable. If you start transaction A that reads and writes v1, transaction B can read and write v1 as long as it reads v1 after A is committed. Spanner checks that everything is in chronological order and both transactions can succeed. Spanner can do this even if the transactions are running in different data centers. This is a truly unique feature that allows Spanner to identify as a globally distributed database.

Timestamp-bound reads can also be considered an isolation level. Spanner provides this feature for geo-distributed cases for latency benefits: if you can tolerate slightly stale data when reading, you can read stale data.

In the following example, user is reading with a maximum staleness of 30 seconds:

client.ReadOnlyTransaction().
	WithTimestampBound(spanner.MaxStaleness(30 * time.Second)).
	Query(...)

Strong & Stale Reads

Read-only transactions and single calls provide two types of reads:

  • Strong reads read and query at a timestamp where all previously committed transactions are visible. Two consecutive strong read-only transactions might return different results.
client.Single().Query(...)

// Or explicitly set the strong-read bound
// in a read-only transaction:
client.ReadOnlyTransaction().       
    WithTimestampBound(spanner.StrongRead()).
    Query(...)
  • Stale reads can be done for latency gains in geo-distributed cases (if you have a multi-region Spanner). You can read from a particular timestamp or a maximum staleness limit.

The following example reads with a maximum staleness of 30 seconds:

client.ReadOnlyTransaction().
	WithTimestampBound(spanner.MaxStaleness(30 * time.Second)).
	Query(...)

Garbage Collection

Spanner garbage collects old versions of the data in the background. Google Cloud Spanner has a policy to garbage collect stale data once it is older than an hour. This means you can go back in time up to an hour and read. The following query is going to fail because the data has already been garbage collected.

client.ReadOnlyTransaction().
	WithTimestampBound(spanner.ExactStaleness(2 * time.Hour)).
	Query(...)

TrueTime

As mentioned above, part of the reason why external consistency of read/write transactions is possible is the clock source Spanner uses. TrueTime is explained in detail in Google’s whitepaper on Spanner, but in a nutshell, it does these two things:

  • TrueTime uses two different sources: GPS and atomic clocks. These clocks have different failure modes, hence using both of them increases reliability.
  • TrueTime returns the time as an interval. The time could in fact be anywhere between the lower bound and the upper bound. Spanner then adds latency by waiting until it is certain the current time is beyond a particular time. This method adds some latency to the system, especially when the uncertainty advertised by the GPS masters is high, but provides correctness even in a globally distributed setup (see the conceptual sketch after this list).
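Below is a conceptual sketch of that waiting step (commit wait). This is not Spanner’s actual code or API; the uncertainty bound is a made-up constant, purely to illustrate the idea of waiting out the interval.

import "time"

// ttInterval mimics a TrueTime reading: the true current time is guaranteed
// to be somewhere between earliest and latest.
type ttInterval struct {
    earliest, latest time.Time
}

// ttNow simulates TrueTime with a hypothetical 2ms uncertainty bound.
func ttNow() ttInterval {
    const eps = 2 * time.Millisecond
    t := time.Now()
    return ttInterval{earliest: t.Add(-eps), latest: t.Add(eps)}
}

// commitWait blocks until the commit timestamp is guaranteed to be in the
// past everywhere, i.e. until the earliest possible current time has passed it.
func commitWait(commitTS time.Time) {
    for ttNow().earliest.Before(commitTS) {
        time.Sleep(100 * time.Microsecond)
    }
}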

Spanner provides a stricter contract than ACID and a stronger isolation level than serializable databases while being a globally distributed database. It relies on TrueTime, versioning and garbage collection, as well as Google’s networking infrastructure to make it happen.

Hello

tl;dr: I’m joining Google’s Cloud Spanner team and I’ll use this medium to document my day-to-day work on the team and share insights about Spanner.

Over the last few months, I was trying to establish a work style where we can be more instrumental on big-picture blockers and design problems for our customers. There is a strong affinity between data problems and overall architectural tradeoffs. This led me to think I should focus my work on stateful aspects and collaborate more often with the storage systems folks at Google because of the potential overlap. Even though I haven’t produced any results yet, reaching out to other teams led me to a completely different opportunity on the Spanner team.

As an existing user of Spanner, I’m extremely excited about this opportunity. Spanner is a unique piece of our infrastructure, a first in distributing data at global scale while supporting consistent distributed transactions. Our cloud product has adoption from a wide range of customers; users are coming to Spanner regardless of their cloud provider. At Google, Spanner is our irreplaceable building block.

Spanner is an established project with some of the most talented people. My role will be uniquely focused on building things around Spanner to make it more accessible. I’ll initially focus on:

  • Contributing ORM integrations and framework support.
  • Working on perf, benchmarking and debugging tools.
  • Building migration tools.
  • Working on probes, prototyping and evaluation projects.
  • Helping users with design tradeoffs, discovering strengths and limitations.
  • Sharing insights and implementation details.

Spanner is lucky to have an existing ecosystem and users who are already building open source integrations and tools. I’m happy that our team will provide support and visibility for these projects as well.

I started this medium to document the most interesting aspects of our day-to-day work, our progress, and things that haven’t been captured before. I’m planning to establish strong relationships with some of our users for collaboration. Stay tuned for more!

One more thing…

My snarky fashion act became a reality. A number of people have been asking for this t-shirt. Now, I have the perfect opportunity to ask the team to make it our official swag.