Book Review: Database Reliability Engineering – Designing and Operating Resilient Database Systems

DBRE cover

Hello and welcome to yet another book review. Databases have been called the “killer application of IT”, and it is true that in almost any computing environment today, one or more databases are in play. Having said that, in-depth knowledge of these database systems used to reside with the DBAs of an organization. Today, with roles being in flux, if you are an SRE chances are you have to deal with databases, quite often without the luxury of a dedicated DBA. Databases themselves have proliferated as well, with the NoSQL paradigm entering the market and different CAP theorem trade-offs in effect, depending on the use case. So, it was about time that a dedicated volume appeared that deals with how to apply SRE principles within a database context. Let’s start with the table of contents:

  1. Introducing Database Reliability Engineering
  2. Service-Level Management
  3. Risk Management
  4. Operational Visibility
  5. Infrastructure Engineering
  6. Infrastructure Management
  7. Backup and Recovery
  8. Release Management
  9. Security
  10. Data Storage, Indexing, and Replication
  11. Datastore Field Guide
  12. A Data Architecture Sampler
  13. Making the Case for DBRE

Substituting Reliability Engineer for Administrator gives this book a distinct flavor. REs (be they SREs or DBREs) come from the software domain and strive to apply software engineering principles to the operational domain, eliminating toil as they go. In addition, a cornerstone of RE is interfacing with other domains (software engineering, network engineering and, yes, DBAs come to mind). Thus, from the get-go, the book stresses that, while its technical content might already be familiar to a good DBA, there are organizational and cultural aspects to be considered as well (as in, “tear down these silos”).

The book kicks off with an introduction to the concepts that will be discussed, including a Maslow-like hierarchy of DBRE needs (the authors point out that it is totally fine to move between levels at will or as needed), and moves on to introduce traditional SRE concepts such as SLOs and how they apply to the real world (and, more importantly, how they evolve). Risk and how to manage it gets a treatment, as does operational visibility and how to define it. No treatise on the subject would be complete without a discussion of the underlying infrastructure concepts. Recent developments such as containerization get accurate and fair coverage, as do more traditional approaches. Backup and recovery gets an extensive chapter, as this is perhaps the most important topic when dealing with databases (for certain companies, large-scale data loss could mean end of business, period). Release management, including CI/CD for databases, is discussed, signifying the application of principles carried over from the software domain to the database world. Once all these topics are covered, there is a chapter on security, including well-known attacks such as SQL injection (which we should have gotten rid of by now!) and mitigations, including judicious use of cryptography. These chapters, in my opinion, form the first dimension of the book, which tends to be quite operations-heavy (and rightly so). The book then makes a foray into more traditional territory, discussing topics such as replication topologies, a datastore field guide and architectural patterns for distributed databases, and finally closes off quite nicely with a chapter on DBRE culture.

Now that we have an overview of the structure of the book (and it is a really well-structured book), the big question is “does it deliver?”. In my opinion, yes. The authors keep a nice conversational style in what could have been quite dry writing. They are well-known figures in the SRE (or is it DBRE?) world and they splice the text with quite a few anecdotes and external examples. The need for proper visibility and traceability is also brought front and center (in fact, the notion of establishing SLOs is centered around measurable data); I really liked that touch. The human factor, which more often than not tends to be overlooked, is discussed in a few places in the book. Even skimming through the book (or speed-reading it) can yield results, given that there are a lot of visual aids. Another nice touch is that the security chapter discusses DREAD and STRIDE, which are rarely mentioned outside of infosec-specific literature. The first chapters, as said before, are ops-heavy and contain a wealth of information even for seasoned reliability engineers (at the very least as a refresher), while later chapters deal more with data, helping the reader navigate the ever-increasing sprawl of database solutions.

Overall, I would recommend this book to anyone, regardless of skill level, who has to deal with databases in everyday work. This short review might not really do justice to the book: in every chapter (even the introductory one) there are broad discussion topics that one can have really detailed conversations about. In closing, the authors’ approach of applying Reliability Engineering practices to the database world is a valid one. If the advice and methodology contained in the book are followed, a lot of headaches will be preemptively removed and everybody (engineers, owners and customers) will be happy. The book lends itself to repeated readings, be it cover to cover or specific chapters, and I cannot recommend it enough.