Different Views on Availability
When discussing "database availability," there are two different issues: data availability and service availability. In other words, whether the data is available is a separate concern from the availability of the service that provides access to that data. Many people equate a database with the disk where data is stored, or assume a database is transient with data available only when the service is running. As new architectures evolve with distribution and redundancy at their core, it's interesting to step back and think about these questions from a different perspective.
In a traditional relational database management system (RDBMS), data availability relies on having access to the disk where the database is made durable. As long as the disk (or disk array, or NAS) is available, then the data is available to the software and can therefore be made available to applications. If the disk fails, the data is no longer available, and the user may have no choice but to recover it from a backup or previous snapshot.
Following that view, it makes sense to maintain a replica of the data. Essentially, every time an update is made to the primary state on disk, the same change is also made to a copy at another location. Now if the disk is lost, there will be a window of time where the data can’t be accessed. Eventually, however, the replica can become the new primary and the user can get back to an available state. The catch, of course, is whether the replica has all of the data in the correct state. Synchronous replication guarantees a lock-step copy, but at the cost of additional latency for each transaction. That may be acceptable in tightly coupled networks, but when replicating to a remote data center, the added WAN latency is probably unacceptable.
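The trade-off between lock-step and deferred replication can be illustrated with a small Python sketch. This is a toy model, not any particular database's replication protocol: the Primary and Replica classes, the ship_pending method and the changes_lost_on_failover helper are all hypothetical names introduced for illustration, and network latency is not simulated.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A copy of the data at another location (hypothetical model)."""
    log: list = field(default_factory=list)


@dataclass
class Primary:
    replica: Replica
    synchronous: bool
    log: list = field(default_factory=list)
    pending: list = field(default_factory=list)  # used only when asynchronous

    def commit(self, change: str) -> None:
        self.log.append(change)
        if self.synchronous:
            # Lock-step: the commit does not return until the replica has the
            # change. In a real system this is where WAN latency is paid.
            self.replica.log.append(change)
        else:
            # Asynchronous: acknowledge now, ship the change later.
            self.pending.append(change)

    def ship_pending(self) -> None:
        """Background transfer of queued changes to the replica."""
        self.replica.log.extend(self.pending)
        self.pending.clear()


def changes_lost_on_failover(primary: Primary) -> list:
    """Changes the primary acknowledged that the replica never received."""
    return [c for c in primary.log if c not in primary.replica.log]


sync_primary = Primary(Replica(), synchronous=True)
async_primary = Primary(Replica(), synchronous=False)
for p in (sync_primary, async_primary):
    p.commit("a")
    p.commit("b")
async_primary.ship_pending()
for p in (sync_primary, async_primary):
    p.commit("c")  # assume the primary's disk fails right after this commit

print(changes_lost_on_failover(sync_primary))   # []
print(changes_lost_on_failover(async_primary))  # ['c']
```

In this model the synchronous replica can be promoted without losing acknowledged changes, while the asynchronous one is missing whatever had not yet been shipped when the primary failed.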
"Having all data available in the database is great"
In a distributed deployment, by contrast, there are likely to be many users updating disjoint data at the same time. For instance, in social applications there are natural clusters of activity, and if the users are geographically distributed then the access patterns will also be distributed. Given these scenarios, it's natural to split storage from a single location across multiple, disjoint locations. Now the failure of a single disk makes only a subset of the overall data unavailable. In some systems that may be unacceptable, but others could tolerate this if they don't need access to that information at a given point in time.
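A minimal Python sketch of that partial availability follows, assuming data is placed on one disk per region. The RegionalStore class, the DiskUnavailable exception, the region names and the sample keys are all hypothetical and chosen only for illustration.

```python
class DiskUnavailable(Exception):
    """Raised when the disk holding the requested region has failed."""


class RegionalStore:
    """Toy store that keeps each region's data on its own disk."""

    def __init__(self, regions):
        self.disks = {region: {} for region in regions}
        self.failed = set()

    def _check(self, region):
        if region in self.failed:
            raise DiskUnavailable(region)

    def put(self, region, key, value):
        self._check(region)
        self.disks[region][key] = value

    def get(self, region, key):
        self._check(region)
        return self.disks[region][key]

    def fail_disk(self, region):
        """Simulate losing the disk for one region."""
        self.failed.add(region)

    def available_fraction(self):
        """Fraction of stored records that are still reachable."""
        total = sum(len(d) for d in self.disks.values())
        reachable = sum(
            len(d) for r, d in self.disks.items() if r not in self.failed
        )
        return reachable / total if total else 1.0


store = RegionalStore(["eu", "us", "apac"])
store.put("eu", "alice", "post-1")
store.put("eu", "bob", "post-2")
store.put("us", "carol", "post-3")
store.put("apac", "dave", "post-4")
store.fail_disk("apac")

print(store.get("us", "carol"))    # post-3
print(store.available_fraction())  # 0.75
try:
    store.get("apac", "dave")
except DiskUnavailable as err:
    print("unavailable:", err)     # unavailable: apac
```

Users whose activity is clustered in the surviving regions see no interruption; only requests that touch the failed region's data are affected.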
The important thing to remember here is that replication of data isn't the same as data availability, especially when local operations can keep running in memory even if the durable service in the data center fails.
For the data management architectures we're building today, we need to look at the rules a little differently. Increasingly we're focusing on in-memory capability, asynchronous communications and global scale. Durability is not the same as availability. Replicating durable data may give someone a heightened feeling of safety in the case of catastrophic failures, and it's definitely an important element in any system, but data can still be available even when the user can't get to its durable form at rest. It is critical to understand the data availability model first and then use that to define the availability of the data service as a whole.
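One way to picture data that stays available while its durable form is unreachable is the Python sketch below. It is a deliberately simplified model, assuming a single in-memory node that queues writes for a durable store; DurableStore, InMemoryNode and DurableStoreDown are hypothetical names, and the sketch ignores what happens if the node itself is lost before it can flush.

```python
class DurableStoreDown(Exception):
    """Raised when the durable store cannot be reached."""


class DurableStore:
    """Stand-in for the at-rest copy of the data (hypothetical)."""

    def __init__(self):
        self.records = {}
        self.up = True

    def write(self, key, value):
        if not self.up:
            raise DurableStoreDown()
        self.records[key] = value


class InMemoryNode:
    """Serves reads and writes from memory; persists when it can."""

    def __init__(self, store):
        self.store = store
        self.memory = {}
        self.unflushed = []  # keys written locally but not yet durable

    def put(self, key, value):
        self.memory[key] = value
        self.unflushed.append(key)
        self.flush()

    def get(self, key):
        return self.memory[key]

    def flush(self):
        """Try to persist queued writes; return True if none remain."""
        while self.unflushed:
            key = self.unflushed[0]
            try:
                self.store.write(key, self.memory[key])
            except DurableStoreDown:
                return False
            self.unflushed.pop(0)
        return True


store = DurableStore()
node = InMemoryNode(store)
node.put("x", 1)
store.up = False             # the durable service fails
node.put("y", 2)             # still accepted locally
print(node.get("y"))         # 2
print("y" in store.records)  # False
store.up = True              # the durable service recovers
node.flush()
print(store.records)         # {'x': 1, 'y': 2}
```

While the store is down the data is available but not yet durable, which is exactly the distinction between the two properties.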
Having all data available in the database is great. If, however, that database isn't completely available to the applications that rely on it, then that's not so great. In other words, if the service isn't available, then the data isn't available either.
Again, starting with a single host running a database server, availability is a pretty simple thing to think about. The service is either there or it's not. As a service scales out, it may be addressing one or more key capabilities. A replicated service, for instance, may only be available as a hot standby or a source for read operations. A service that scales by sharding the underlying data, or by limiting which set of activities it will accept at any given endpoint, may at any given time be providing only partial availability of the data as a whole while appearing fully available as a service.
There are various ways people think about service availability. It goes without saying that topics like CAP (consistency, availability and partition tolerance) factor heavily into it. In practice, most people want a system that survives failures by providing complete availability, even if it’s at reduced capacity. Complete availability means no loss of functionality and total availability of the underlying data.
The traditional coupling of disk and service in the database community makes it harder to think about availability this way. Making it harder still are all the models for consistency, rules around visibility and requirements of durability.
Architectures and software are changing at a rapid pace, and global cloud capacity is helping to drive that change. Familiar notions of what makes data or a service available are nothing like what they were even five years ago. This article should not be taken as a treatise on the subject but as a challenge to go off and think a little about the problems we're facing today. How would we like our systems to evolve, and what are the trade-offs that we need to face? Whether we're building something brand new or modernizing legacy systems, these are the questions we should be able to answer, or at least have some theories about.