High Availability Writes

Spanner is a relational database with 99.999% availability, which amounts to roughly 5 minutes of downtime a year. Spanner is a distributed system and can span multiple machines, multiple datacenters (and even geographical regions when configured so). It splits the records automatically among its replicas and provides automatic failover. Unlike traditional failover models, Spanner doesn’t fail over to a secondary cluster but can elect an available read-write replica as the new leader.

In relational databases, providing both high availability and high consistency for writes is a very hard problem. Spanner’s synchronous replication, its use of dedicated networking, and Paxos voting provide high availability without compromising consistency.

High availability of reads vs writes

In traditional relational databases (e.g. MySQL or PostgreSQL), scaling reads and making them highly available is easier than doing the same for writes. Read-only replicas provide a copy of the data that read-only transactions can retrieve from. Data is replicated to the read-only replicas from a read-write master either synchronously or asynchronously.

In synchronous models, the master synchronously writes to the read replicas at each write. Even though this model ensures that read-only replicas always have the latest data, it makes writes quite expensive (and causes availability issues for writes) because the master has to write to all available replicas before it returns.

In asynchronous models, read-only replicas get the data from a stream or a replication log. Asynchronous models make writes faster but introduce a lag between the master and the read-only replicas. Users have to tolerate the lag and should monitor it to identify replication outages. Asynchronous writes make the system inconsistent because not all the replicas will have the latest version of the data until the asynchronous synchronization is complete. Synchronous writes keep the data consistent by ensuring all replicas receive the change before a write succeeds.


Horizontally scaling reads by adding more read replicas is only part of the problem. Scaling writes is a harder problem, and having more than one master introduces additional problems. If a master has an outage, the other(s) can keep serving writes without users experiencing downtime, but this model requires replicating writes among the masters. Similar to read replication, multi-master replication can be implemented asynchronously or synchronously. If implemented synchronously, it often means less availability for writes, because a write must be replicated to all masters and they must all be available before it can succeed. As a tradeoff, multi-master replication is often implemented with asynchronous replication, but that negatively impacts the overall system by introducing:

  • Looser consistency characteristics that violate ACID promises.
  • Increased risk of timeouts and communication latency.
  • The necessity for conflict resolution between two or more masters when conflicting updates happened but were not communicated.

Due to the complexity and the failure modes multi-master replication introduces, it’s not a commonly preferred way of providing high availability in practice.

As an alternative, high-availability clusters are a more popular choice. In this model, you’d have an entire cluster that can take over when the primary master goes down. Today, cloud providers implement this model to provide high availability features for their managed traditional relational database products.

HA cluster

Topology

Spanner doesn’t use high availability clusters but approaches the problem from a different angle. A Spanner cluster* contains multiple read-write replicas and may contain some read-only and some witness replicas.

  • Read-write replicas serve reads and writes.
  • Read-only replicas serve reads.
  • Witnesses don’t serve data but participate in leader election.

Read-only and witness replicas are only used for multi-regional Spanner clusters that can span multiple geographical regions. Single-region clusters only use read-write replicas. Each replica lives in a different zone in the region to avoid a single point of failure due to zonal outages.

Split leader in writes

Splits

Spanner’s replication and sharding capabilities come from its splits. Spanner splits the data in order to replicate and distribute it among the replicas. Splitting happens automatically when Spanner detects high read or high write load among the records. Each split is replicated and has a leader replica.

When a write arrives, Spanner finds the split the row is in. Then it looks up the leader of that split and routes the write to the leader. This is true even in multi-region setups where the user is geographically closer to another, non-leader read-write replica. In the case of an outage of the leader, an available read-write replica is elected as the new leader and the user’s write is served from there.

Split leader and replicas

In order for a write to succeed, the leader needs to synchronously replicate the change to the other replicas. But doesn’t this impact the availability of writes negatively? If writes had to wait for all replicas, any replica could become a single point of failure, because writes wouldn’t succeed until every replica had replicated the change.

This is where Spanner does something better. Spanner only requires a majority of the Paxos voters to acknowledge a write for it to succeed. This allows writes to succeed even when a read-write replica goes down: only a majority of the voters is required, not all of the read-write replicas.
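
As a minimal illustration of the majority rule (this is not Spanner’s internal implementation, just a sketch of the idea), a write can be acknowledged as soon as more than half of the voting replicas confirm it:

// Illustrative only: a write is durable once a majority of the Paxos
// voters acknowledge it, so a single slow or unavailable replica
// doesn't block the commit.
func hasQuorum(acks, voters int) bool {
    return acks > voters/2
}

// With 3 voting replicas: hasQuorum(2, 3) == true, hasQuorum(1, 3) == false.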

Synchronous replication

As mentioned above, synchronous replication is hard and impacts the availability of writes negatively. On the other hand, when replication happens asynchronously, it can cause inconsistencies, conflicts, and sometimes data loss. For example, when a master becomes unavailable due to a networking issue, it may have committed changes that it has not yet delivered to the secondary master. If the secondary master updates the same records after a failover, data loss can happen or conflict resolution may be required. PostgreSQL provides a variety of replication models with different tradeoffs. The summary below gives a very high-level idea of how many different concerns there are to worry about when designing a replication model:

A summary of various PostgreSQL replication models and their tradeoffs.

Spanner’s replication is synchronous. Leaders have to synchronously communicate with other read/write replicas about the change and confirm it in order for a write to succeed.

Two-phase commit (2PC)

While writes affecting only a single split use a simpler and faster protocol, a two-phase commit (2PC) is executed when a write transaction involves two or more splits. 2PC is infamously known as “the anti-availability protocol” because it requires participation from all the replicas and any replica can be a single point of failure. Spanner still serves writes even if some of the replicas are unavailable, because only a majority of the voting replicas is required in order to commit a write.
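
From the client’s point of view nothing changes; the same read-write transaction API is used whether one or many splits are involved. In the sketch below (the table and column names are made up for illustration), the buffered mutations may land on different splits, in which case Spanner coordinates the commit across the involved split leaders transparently:

_, err := client.ReadWriteTransaction(ctx, func(ctx context.Context, txn *spanner.ReadWriteTransaction) error {
    // These two rows may live in different splits; if so, Spanner runs
    // two-phase commit across the split leaders when the transaction commits.
    return txn.BufferWrite([]*spanner.Mutation{
        spanner.Update("orders", []string{"id", "status"}, []interface{}{"order-1", "shipped"}),
        spanner.Update("inventory", []string{"sku", "count"}, []interface{}{"sku-42", 9}),
    })
})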

Network

Spanner is a distributed system and is inherently affected by the problems that impact distributed systems in general. Networking itself is a common source of outages in distributed systems. On the other hand, Google cites that only 7.6% of Spanner failures were networking related. Spanner’s 99.999% availability is not highly affected by networking outages, mostly because it runs on Google’s private network. Years of operational maturity, reserved resources, and control over upgrades and hardware make networking an insignificant source of outages. Eric Brewer’s earlier article explains the role of networking in this case in more detail.

Colossus

Spanner’s durability guarantees come from Google’s distributed file system, Colossus. Spanner also mitigates further risk by depending on Colossus: the use of Colossus allows file storage to be decoupled from the database service. Spanner is a “shared nothing” architecture, and because any server in a cluster can read from Colossus, replicas can recover quickly from whole-machine failures.

Colossus also provides replication and encryption. If a Colossus instance goes down, Spanner can still work on the data via the available Colossus instances. Colossus encrypts data and this is why Spanner provides encryption at rest by default out of the box.

Colossus replication

Spanner read-write replicas hand off the data to Colossus, where it is replicated three times. Given there are three read-write replicas in a Spanner cluster, this means the data is replicated nine times in total.

Automatic Retries

As repeatedly mentioned above, Spanner is a distributed system and is not magic. It experiences more internal aborts and timeouts than traditional databases when writing. A common strategy in distributed systems for dealing with partial and temporary failures is to retry. Spanner client libraries provide automatic retries for read/write transactions. In the following Go snippet, you can see the API for creating a read-write transaction. The client automatically retries the body if it fails due to aborts or conflicts:

import "cloud.google.com/go/spanner"
_, err := client.ReadWriteTransaction(ctx, func(ctx context.Context, txn *spanner.ReadWriteTransaction) error {
    // User code here.
})

One of the challenges of developing ORM framework support for Google Cloud Spanner was the fact that most ORMs don’t have automatic retries; therefore their APIs never gave developers a sense that they shouldn’t maintain application state in the scope of a transaction. In contrast, the Spanner libraries care a lot about retries and make an effort to deliver them automatically without creating an extra burden for the user.
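
For example, keeping mutable application state inside the transaction closure is a common pitfall, because the closure may run more than once. The hypothetical counter below illustrates it:

attempts := 0
_, err := client.ReadWriteTransaction(ctx, func(ctx context.Context, txn *spanner.ReadWriteTransaction) error {
    // This closure may be executed multiple times if the transaction is
    // aborted and retried, so attempts can end up greater than 1. Keep
    // application state out of the transaction scope, or make the work
    // inside the closure idempotent.
    attempts++
    // Reads and buffered writes go here.
    return nil
})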

Spanner approaches sharding and replication differently than traditional relational databases. It utilizes Google’s infrastructure and fine-tunes several traditionally hard problems to provide high availability without compromising consistency.


  • (*) Google Cloud Spanner’s terminology for a cluster is an instance. I avoided using “instance” because it is an overloaded term and might mean “replica” to the majority of the readers of this article.
  • (**) The write is routed to the split leader. Read the Splits section for more.

Transactions

You need a transaction when you want to do a series of reads and writes that have to be performed as a single unit. Problems like transferring money from one account to another, or reading a value, multiplying it by a number, and writing the result, require a series of reads and writes; therefore they require transactions.
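
As a sketch of what this looks like with the Go client (the Accounts table, its columns and the account IDs are made up for illustration), a transfer reads both balances and buffers both updates inside one read-write transaction, so either both rows change or neither does:

_, err := client.ReadWriteTransaction(ctx, func(ctx context.Context, txn *spanner.ReadWriteTransaction) error {
    // Read both balances inside the transaction.
    var from, to int64
    row, err := txn.ReadRow(ctx, "Accounts", spanner.Key{"alice"}, []string{"Balance"})
    if err != nil {
        return err
    }
    if err := row.Column(0, &from); err != nil {
        return err
    }
    row, err = txn.ReadRow(ctx, "Accounts", spanner.Key{"bob"}, []string{"Balance"})
    if err != nil {
        return err
    }
    if err := row.Column(0, &to); err != nil {
        return err
    }
    // Buffer both updates; they become visible only if the whole
    // transaction commits, so the transfer is all-or-nothing.
    return txn.BufferWrite([]*spanner.Mutation{
        spanner.Update("Accounts", []string{"ID", "Balance"}, []interface{}{"alice", from - 10}),
        spanner.Update("Accounts", []string{"ID", "Balance"}, []interface{}{"bob", to + 10}),
    })
})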

Spanner is a transactional database with quite a few interesting capabilities:

  • It can replicate data in a globally distributed way automatically for availability and locality.
  • It can automatically shard data among multiple machines and even data centers.
  • It versions data and you can go back in time to read a value at a particular timestamp.
  • It provides external consistency for read/write transactions. This allows two transactions with reads and writes to the same value to run in parallel and succeed if the operations are serializable.
  • It provides globally consistent reads at a particular timestamp, so you can read in the past.

Spanner provides some of these capabilities by the way it satisfies the ACID properties and by its clock source, TrueTime.

Transaction Types

Spanner provides two transaction types:

  • Read-only transactions: Allow users to read in a transaction from a particular timestamp. These transactions are lock-free and you should use them if you only need to read.
  • Read/write transactions: Allow users to read and buffer writes in a transaction. Spanner doesn’t allow you to see the result of the writes until you commit the transaction.

Note how the Spanner clients allow you to buffer mutations:

import "cloud.google.com/spanner"

client.ReadWriteTransaction(ctx, func(ctx context.Context, t *spanner.ReadWriteTransaction) error {
    var likes int64
    // Read likes column...
    m := spanner.Update(
	    "tweets",
	    []string{"id", "likes"},
	    []interface{}{id, likes})
    return t.BufferWrite([]*spanner.Mutation{m})
})

ACID

ACID stands for atomicity, consistency, isolation and durability. ACID provides a minimal promise to application developers about what their database satisfies, so they can focus on their own problems instead of solving database problems. Without ACID or similar contracts, application developers wouldn’t have guidance on what is their responsibility versus what their database provides. Spanner satisfies ACID, and actually provides an even stronger contract.

Atomicity means all the operations in a transaction should either fail or succeed together. Spanner provides this by failing a transaction if any of the writes in that transaction fails. Spanner doesn’t allow you to read the results of a write within a transaction; instead, it commits the writes when committing the transaction. If a single write fails, the transaction fails.

Consistency means data should be consistent before and after a transaction. Spanner satisfies this. With strong reads, Spanner allows users to see the same data globally.

Isolation is provided by certain properties and behaviors of the read-only and read/write transactions. Spanner’s isolation levels are slightly different from traditional databases. See the Isolation Levels section for details.

  • In read/write transactions, all reads return data from a particular timestamp. If anything that was read in the transaction is mutated, the transaction fails. All writes need to be committed for the transaction to commit, and all writes in the transaction are only visible once the transaction commits.
  • In read-only transactions, all reads return data from a particular timestamp. These transactions are lock-free and should be used for a performance gain when no writes are needed.

Transactions can read from the current state or from some time in the past. An example of a read-only transaction that reads from a minute ago:

client.ReadOnlyTransaction().
	WithTimestampBound(spanner.ExactStaleness(1 * time.Minute)).
	Query(...)

Durability means committed transactions shouldn’t be lost and Spanner satisfies this.

Isolation Levels

The SQL standard defines four isolation levels; from the lowest level of isolation to the highest, they are read uncommitted, read committed, repeatable read and serializable. Spanner provides an isolation level stricter than serializable by default, and users are given the option to read stale data with a timestamp bound for performance.

Spanner is a first in external consistency. Spanner can run multiple transactions in parallel (in different data centers) and still check whether they are serializable. If transaction A reads and writes v1, transaction B can read and write v1 as long as it reads v1 after A is committed. Spanner checks that everything is in chronological order, and both transactions can succeed. Spanner can do this even if the transactions are running in different data centers. This is a truly unique feature that allows Spanner to identify as a globally distributed database.

Timestamp-bound reads can also be considered an isolation level. Spanner provides this feature for geo-distributed cases for latency benefits: if you can tolerate slightly stale data when reading, you can read stale data.

In the following example, the user is reading with a maximum staleness of 30 seconds:

client.ReadOnlyTransaction().
	WithTimestampBound(spanner.MaxStaleness(30 * time.Second)).
	Query(...)

Strong & Stale Reads

Read-only transactions and single calls provide two types of reads:

  • Strong reads read and query at a timestamp where all previously committed transactions are visible. Two consecutive strong read-only transactions might return different results.
client.Single().Query(...)

// Or explicitly set the strong-read bound
// in a read-only transaction:
client.ReadOnlyTransaction().       
    WithTimestampBound(spanner.StrongRead()).
    Query(...)
  • Stale reads can be done for latency gains in geo-distributed cases (if you have a multi-region Spanner cluster). You can read from a particular timestamp or with a maximum staleness limit.

The following example reads with a maximum staleness of 30 seconds:

client.ReadOnlyTransaction().
	WithTimestampBound(spanner.MaxStaleness(30 * time.Second)).
	Query(...)

Garbage Collection

Spanner garbage collects old versions of the data in the background. Google Cloud Spanner has a policy to garbage collect stale data once it is older than an hour. This means you can go back in time up to an hour and read. The following query is going to fail because the data has already been garbage collected:

client.ReadOnlyTransaction().
	WithTimestampBound(spanner.ExactStaleness(2 * time.Hour)).
	Query(...)

TrueTime

As mentioned above, part of the reason why external consistency of read/write transactions is possible is the clock source Spanner uses. TrueTime is explained in detail in Google’s whitepaper on Spanner, but in a nutshell, it does these two things:

  • TrueTime uses two different sources: GPS and atomic clocks. These clocks have different failure modes, hence using both of them increases reliability.
  • TrueTime returns the time as an interval. The time could in fact be anywhere between the lower bound and the upper bound. Spanner then adds latency by waiting until it is certain that the current time is beyond a particular time. This method adds some latency to the system, especially when the uncertainty advertised by the GPS masters is high, but provides correctness even in a globally distributed situation. A rough sketch of this commit-wait rule follows below.
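
A rough sketch of that commit-wait rule from the whitepaper, using a hypothetical TrueTime interface (this is not a Spanner or client library API):

import "time"

// Interval is a hypothetical TrueTime reading: the true current time is
// guaranteed to lie between Earliest and Latest.
type Interval struct {
    Earliest, Latest time.Time
}

type TrueTime interface {
    Now() Interval
}

// commitWait picks the commit timestamp at the upper bound of the
// uncertainty interval, then waits until the lower bound has passed it.
// After the wait, every node observes a current time later than the
// commit timestamp, which is what makes external consistency possible.
func commitWait(tt TrueTime) time.Time {
    s := tt.Now().Latest
    for !tt.Now().Earliest.After(s) {
        time.Sleep(time.Millisecond)
    }
    return s
}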

Spanner provides a stricter contract than ACID and a stronger isolation level than serializable databases while being a globally distributed database. It relies on TrueTime, versioning and garbage collection, as well as Google’s networking infrastructure to make it happen.

Hello

tl;dr: I’m joining Google’s Cloud Spanner team and I’ll use this medium to document my day-to-day work on the team and to share insights about Spanner.

Over the last few months, I was trying to establish a work style where we can be more instrumental on big-picture blockers and design problems for our customers. There is a strong affinity between data problems and overall architectural tradeoffs. This led me to think I should focus my work on stateful aspects and collaborate more often with the storage systems folks at Google because of the potential overlap. Even though I hadn’t produced any results yet, reaching out to other teams led me to a completely different opportunity on the Spanner team.

As an existing user of Spanner, I’m extremely excited about this opportunity. Spanner is a unique piece of our infrastructure, a first in distributing data at global scale while supporting consistent distributed transactions. Our cloud product has adoption from a wide range of customers; users are coming to Spanner regardless of their cloud provider. At Google, Spanner is our irreplaceable building block.

Spanner is an established project with some of the most talented people. My role will be uniquely on building things around Spanner to make it more accessible. I’ll initially focus on:

  • Contributing ORM integrations and framework support.
  • Working on perf, benchmarking and debugging tools.
  • Building migration tools.
  • Working on probes, prototyping and evaluation projects.
  • Helping users with design tradeoffs, discovering strengths and limitations.
  • Sharing insights and implementation details.

Spanner is lucky to have an existing ecosystem and users who are already building open source integrations and tools. I’m happy that our team will provide support and visibility for these projects as well.

I started this medium to document the most interesting aspects of our day-to-day work, our progress, and things that haven’t been captured before. I’m planning to establish strong relationships with some of our users for collaboration. Stay tuned for more!

One more thing…

My snarky fashion act became a reality. A number of people have been asking for this t-shirt. Now I have the perfect opportunity to ask the team to make it our official swag.