Book Review: Database Reliability Engineering – Designing and Operating Resilient Database Systems


Hello and welcome to yet another book review. Databases have been called the “killer application of IT”, and it is true that in almost any computing environment today one or more databases are in play. That said, in-depth knowledge of these database systems used to reside with an organization’s DBAs. Today, with roles in flux, if you are an SRE chances are you have to deal with databases, quite often without the luxury of a dedicated DBA. Databases themselves have proliferated as well, with the NoSQL paradigm entering the market and different CAP theorem trade-offs in effect, depending on the use case. So it was about time that a dedicated volume appeared on how to apply SRE principles within a database context. Let’s start with the table of contents:

  1. Introducing Database Reliability Engineering
  2. Service-Level Management
  3. Risk Management
  4. Operational Visibility
  5. Infrastructure Engineering
  6. Infrastructure Management
  7. Backup and Recovery
  8. Release Management
  9. Security
  10. Data Storage, Index and Replication
  11. Datastore Field Guide
  12. A Data Architecture Sampler
  13. Making the case for DBRE

Putting Reliability Engineer in the title, as opposed to Administrator, gives this book a distinct flavor. REs (be they SREs or DBREs) come from the software domain and strive to apply software engineering principles to the operational domain, eliminating toil as they go. In addition, a cornerstone of RE is interfacing with other domains (software engineering, network engineering and, yes, database administration come to mind). Thus, from the get-go, the book stresses that while its technical content might already be known to a good DBA, there are organizational and cultural aspects to be considered as well (as in, “tear down these silos”).

The book kicks off with an introduction to the concepts that will be discussed, including a Maslow-like hierarchy of DBRE needs (the authors point out that it is totally fine to move between levels as needed), then moves on to traditional SRE concepts such as SLOs, how they apply to the real world and, more importantly, how they evolve. Risk and how to manage it gets a treatment, as does operational visibility, which has a chapter of its own. No treatise on the subject would be complete without a discussion of the underlying infrastructure concepts. Recent developments such as containerization get accurate and fair coverage, as do more traditional approaches. Backup and recovery gets an extensive chapter, as this is perhaps the most important topic when dealing with databases (for certain companies, large-scale data loss could mean the end of the business, period). Release management, including CI/CD for databases, is discussed, showing how principles carry over from the software domain to the database world. After these topics comes a chapter on security, covering well-known attacks such as SQL injection (which we should have gotten rid of by now!) and mitigations, including judicious use of cryptography. These chapters, in my opinion, form the first dimension of the book, which tends to be quite operations-heavy (and rightly so). The book then makes a foray into more traditional territory, discussing topics such as replication topologies, a datastore field guide and architectural patterns for distributed databases, and closes off quite nicely with a chapter on DBRE culture.

Now that we have an overview of the structure of the book (and it is a really well-structured book), the big question is “does it deliver?” In my opinion, yes. The authors keep a nice conversational style in what could have been quite dry writing. They are well-known figures in the SRE (or is it DBRE?) world, and they splice the text with quite a few anecdotes and external examples. The need for proper visibility and traceability is also brought front and center (in fact, the notion of establishing SLOs is centered on measurable data); I really liked that touch. The human factor, which more often than not tends to be overlooked, is discussed in a few places in the book. Even skimming through the book (or speed-reading it) can yield results, given that there are a lot of visual aids. Another nice touch is that the security discussion covers DREAD and STRIDE; it is good to see these mentioned outside of infosec-specific literature. The first chapters, as said before, are ops-heavy and contain a wealth of information even for seasoned reliability engineers (at the very least as a refresher), while later chapters deal more with data, helping the reader navigate the ever-increasing sprawl of database solutions.
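
To make the point about SLOs resting on measurable data a bit more concrete, here is a minimal sketch of the usual error budget arithmetic. It is not taken from the book; the class name, the 99.9% target and the request counts are all assumptions picked for illustration.

```python
from dataclasses import dataclass


@dataclass
class AvailabilitySLO:
    """Hypothetical availability objective, e.g. 0.999 for 99.9%."""
    target: float

    def error_budget(self, total_requests: int) -> float:
        # Requests that may fail in the window without breaching the SLO.
        return (1.0 - self.target) * total_requests

    def budget_remaining(self, total_requests: int, failed_requests: int) -> float:
        # A negative value means the objective has been missed.
        return self.error_budget(total_requests) - failed_requests


# Assumed numbers: one million requests in the window, 400 of them failed.
slo = AvailabilitySLO(target=0.999)
remaining = slo.budget_remaining(total_requests=1_000_000, failed_requests=400)
print(f"budget: {slo.error_budget(1_000_000):.0f}, remaining: {remaining:.0f}")
```

The useful part is less the arithmetic than the fact that both inputs have to be measured, which is exactly why visibility comes before everything else.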

Overall, I recommend this book to anyone, regardless of skill level, who has to deal with databases in everyday work. This short review might not really do justice to the book: every chapter (even the introductory one) raises broad topics that one can have really detailed conversations about. In closing, the authors’ approach of applying Reliability Engineering practices to the database world is a valid one. If the advice and methodology contained in the book are followed, a lot of headaches will be preemptively removed and everybody (engineers, owners and customers) will be happy. The book lends itself to repeated readings, be it cover to cover or specific chapters, and I cannot recommend it enough.

Charity Majors: The Engineer’s Bill of Rights (and Responsibilities)

The following post was originally written by Charity Majors and is being translated into Greek with her permission. For the original post in English, click here. Additionally, I am NOT a professional translator so, while I did my best, feel free to drop me a line or a comment if something is mistranslated. With these out of the way, let’s go!

Power has its own way of flowing toward people managers over time, no matter how many times you repeat “management is not a promotion, it is a career change”.

It is natural, like water flowing downhill. Managers have access to performance reviews and other personal information that they need to do their jobs, and they tend to be more practiced communicators. Managers facilitate a lot of decision-making and the routing of people, data and things, and it is very easy for them to slip into making all the decisions instead of empowering people to make them. Sometimes you just want to hand out assignments and order everyone to do as they are told (eh? just me??)

But if you let all the power slide toward the engineering managers, pretty soon it is not so great to be an engineer. Now you have people becoming managers for all the wrong reasons, or everyone saying they want to become a manager, or engineers simply tuning out and just handing in their work (or quitting). We all want autonomy and impact; we all crave a seat at the table. You need to work harder to keep those seats for non-managers.

So, in the spirit of the Constitution’s rights and responsibilities, here are some of the commitments we make to our engineers at Honeycomb, and some of the expectations we have for managerial and engineering roles. Some mirror each other, and others are very different.

(Incidentally, I find it very helpful to picture the org chart upside down, placing managers below their teams as a support structure instead of being perched on top.)


  • You should be free to put your head down and focus, and to trust that your manager will gently tap you when you are needed (or would want to be included).
  • Technical decisions should come from engineers, not from managers.
  • You deserve to know how well you are performing, and to hear early and often if you are not meeting expectations.
  • On-call should not significantly impact your life, sleep or health (beyond carrying your devices with you). If it does, we will fix it.
  • Code reviews should be turned around in less than 24 hours, under normal circumstances.
  • You should have a career path that challenges you and contributes to your personal life goals, with the support and coaching you need to get there.
  • You should choose your own work, in consultation with your manager and based on our business goals. This is not a democracy, but you will have a voice in our planning process.
  • You should be able to do your work in or out of the office. When you work remotely, your team will keep you in the loop and support you.


  • Make progress on your projects every week. Be transparent.
  • Make progress on your career every quarter. Push your limits.
  • Build a relationship of trust and mutual vulnerability with your manager and your engineering team, and invest in that relationship.
  • Know where you stand: how well are you performing, how quickly are you growing?
  • Develop your technical judgment and leadership skills. Take ownership of, and be accountable for, engineering outcomes. Ask for help whenever you need it and give help whenever you are asked.
  • Give feedback early and often, and receive feedback gracefully. Practice saying “no” and hearing “no”. Let people take things back and try again if something does not come out right.
  • Take ownership of your time and actively manage your calendar. Spend your attention credits carefully.


  • Recruit, hire and train your team. Foster a sense of solidarity and team spirit, as well as real emotional safety.
  • Care for every engineer on your team. Support them in their career trajectory, personal goals, work/life balance, and dynamics within and across teams.
  • Give feedback early and often. Receive feedback gracefully. Always tell the hard truth, but with love.
  • Move us relentlessly forward, watching out for overengineering and work that does not contribute to our goals. Ensure redundant coverage of critical areas.
  • Take ownership of your team’s quarterly planning process and be accountable for the goals you set. Allocate resources by communicating priorities and recruiting engineering leads.
  • Add focus or a sense of urgency where needed.
  • Take ownership of your time and attention. Be available. Actively manage your calendar. Try not to make your emotions other people’s problem (but do lean on your own manager and your peers for support).
  • Prioritize your own personal growth and self-care. Model the values and traits we want our engineers to follow.
  • Stay vulnerable.



I would love to hear from anyone else who has a list like this.

Article Review: Containers will not fix your broken culture (and other hard truths)

First things first: if you do not know what ACM Queue is (or, even worse, do not know what ACM is), click on the links provided. ACM has relatively recently reformed and now presents articles by industry experts, especially in the Queue magazine (you get an article from Queue with every Communications of the ACM magazine, but there is more, much more). Disclaimer: while I am a paying ACM member, I make no profit from and have no further affiliation with the organization (i.e. I am not an official Ambassador).
With that out of the way, let’s focus on the article in question. The author is Bridget Kromhout, currently working for Microsoft. The main idea of the article is that difficult, seemingly technical problems are often best resolved by examining how people interact with each other. The main points discussed therein are the following:

  • Tech is not a panacea
  • Good team interactions: Build, because you can’t buy
  • Tech, like Soylent Green, is made of people
  • Good fences make good neighbors
  • Avoiding sadness-as-a-service

The article is extremely well written. One thing I liked the most is that it includes links to definitions of terms you might or might not have heard of. The key takeaway of the article is that we tend to think in terms of technology and enforce technology rules in an increasingly complex world of distributed systems, whereas the key is communication between individuals and teams, peers or otherwise. It also coins a phrase that unfortunately will ring true for a lot of people in the audience of this blog, “on-call PTSD”, and even manages to kill one of my favorite interview questions, and these are only the first two pages. The article also states “we succeed when we share responsibility and have agency”. Amen to that; personally, I have seen more than a few dysfunctional environments where responsibilities were routinely shrugged off. So to sum it up (and keep this review proportional to the length of the article), Bridget states the value of communication, brings in a ton of references to support her case (making the article well researched without falling into the trap of being esoteric) and, at the same time, emphasizes the need for technology. Highly recommended reading!

Book Review: The Practice Of Cloud System Administration Volume 2 – Designing And Operating Large Distributed Systems

Hello everyone, and welcome to another book review. This time, I will be reviewing a book that I consider a classic. As always, let’s start with the list of contents:
Part I Design: Building it

  • Designing in a distributed world
  • Designing for Operations
  • Selecting a Service Platform
  • Application Architectures
  • Design Patterns for Scaling
  • Design Patterns for Resiliency

Part II Operations: Running it

  • Operations in a Distributed World
  • DevOps Culture
  • Service Delivery: The Build Phase
  • Service Delivery: The Deployment Phase
  • Upgrading Live Services
  • Automation
  • Design Documents
  • Oncall
  • Disaster Preparedness
  • Monitoring Fundamentals
  • Monitoring Architecture and Practice
  • Capacity Planning
  • Creating KPIs
  • Operational Excellence

Part III Appendices

  • Assessments
  • The Origins and Future of Distributed Computing and Clouds
  • Scaling Terminology and Concepts
  • Templates and Examples
  • Recommended Reading

Overall, that is a bit over 500 beautifully printed pages (as you would come to expect from Addison-Wesley).
As you can see from the ToC, the breadth of information contained in this book is tremendous. Every chapter could easily expand into a book of its own (and indeed, there are volumes that expand on a lot of the topics), yet this book manages to give the astute reader a ton of information; heck, it is almost as if the information is condensed, just add water. The authors do not fall into the pit of sticking with a particular technology. They maintain a level of abstraction that in my opinion is about right: not too abstract (which would limit the potential of the book to be applied in real-world situations) and yet not tied to a particular technology, which would instantly and severely date the book (for instance, this book came out before container orchestration frameworks became as popular as they are today, but you will not notice). The format is similar for all chapters: first an attention-grabbing introduction, then a nice discussion of the topic at hand and finally exercises, most of them open-ended, so the reader can follow up on what has been discussed. After all, large-scale distributed systems have a common set of characteristics, no matter what their purpose or implementation details are.
The potential audience of this book is both SREs and their managers. In particular, Part II of the book contains a ton of information relevant to both sides of the equation. If you manage SREs, you’d better be at least acquainted with the material, and this book is more than a fine introduction. If you need a book on how to use AWS/Azure/GCP or their specifics, this volume will NOT meet your expectations; as discussed, this book is more like a framework.
In case this is not obvious by now, I consider this book a must-read for anyone dealing with modern distributed systems, be it SRE, SWE or Engineering Manager. I cannot praise this book enough. It is extremely well written, in certain cases it goes against the trends, and how can you go wrong with a book that considers a zombie outbreak a valid reason for a datacenter outage?
Further resources:
Companion Website
Thomas Limoncelli’s Twitter
PS. A book that everybody is recommending (and asking me about, in a variety of contexts) is Google’s SRE book. If you have not read it by now, you can start by going there to enjoy the book in its entirety. While the Google SRE book is an extremely useful resource, and without wanting to create a false dichotomy, it kind of overshadows this volume, which in my humble opinion is a better choice in certain regards. Specifically, while both books have a strong Google influence (one comes from Google, and an author of the other was a Google SRE), I find that the “Practice of …” is a more focused volume, something perhaps to be expected given that it is written by “only” three authors. So do yourself a favour and read both books; there is a wealth of information contained therein.

Book Review: Systems Performance: Enterprise and the cloud

Welcome back for another book review. This time, I am going to review a book that I bought when it came out, in late 2013. I have always wanted to do a review of this one, but it seemed I had two options:

  1. Write a short review that probably does not do the book justice.
  2. Postpone the review for a more suitable time, when $IRL and $DAYJOB allow …

I opted for the second option, as I consider this book to be indispensable (yes, this is going to be a positive review). So, here is the table of contents:

  1. Introduction
  2. Methodology
  3. Operating Systems
  4. Observability Tools
  5. Applications
  6. CPUs
  7. Memory
  8. File Systems
  9. Disks
  10. Network
  11. Cloud Computing
  12. Benchmarking
  13. Case Study
  14. Appendices (which you SHOULD read)

Wow, a lot of content, huh? (Something to be expected, given that the book is more than 700 pages long.) Do not let the size daunt you, however. Chapters are self-contained, as the author understands that the book might be read under pressure, and they contain useful exercises at the end.
What really makes this book stand out is not the top-notch technical writing or the abundance of useful one-liners, but the fact that the author moves forward and suggests a methodology for troubleshooting and performance analysis, as opposed to the ad-hoc methods of the past (or, best case scenario, a checklist, and $DEITY forbid the use of the “blame someone else” methodology). In particular, the author suggests the USE methodology, USE standing for Utilization – Saturation – Errors, to methodically and accurately analyze and diagnose problems. This methodology (which can be adapted/expanded at will; last time I checked, the book was not written in stone) is worth the price of the book alone.
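
To give a rough feel for what a USE pass looks like, here is a minimal sketch of my own, not code from the book. For every resource it asks the same three questions (utilization, saturation, errors) and flags whatever deserves a closer look. The class, the 80% threshold and the sample readings are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """One observation of a resource; all field names are illustrative."""
    name: str
    utilization: float  # fraction of capacity in use, 0.0 to 1.0
    saturation: int     # amount of queued or waiting work
    errors: int         # error events seen in the sampling window


# Assumed threshold, chosen for illustration only.
UTILIZATION_LIMIT = 0.80


def use_check(samples):
    """Return (resource, dimension, value) tuples that deserve a closer look."""
    findings = []
    for sample in samples:
        if sample.utilization >= UTILIZATION_LIMIT:
            findings.append((sample.name, "utilization", sample.utilization))
        if sample.saturation > 0:
            findings.append((sample.name, "saturation", sample.saturation))
        if sample.errors > 0:
            findings.append((sample.name, "errors", sample.errors))
    return findings


# Hypothetical readings; in practice these would come from the system's own tools.
samples = [
    ResourceSample("cpu", utilization=0.95, saturation=4, errors=0),
    ResourceSample("memory", utilization=0.60, saturation=0, errors=0),
    ResourceSample("disk", utilization=0.30, saturation=0, errors=2),
]

for name, dimension, value in use_check(samples):
    print(f"{name}: check {dimension} ({value})")
```

The value of the method is the systematic walk over every resource, so nothing gets skipped just because it was not the first suspect.
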
The author correctly maintains that you must have an X-ray (so to speak) of the system at all times. By utilizing tools such as DTrace (available for Solaris and BSD) or the Linux equivalent SystemTap, much insight can be gained into the internals of a system.
Chapters 5-10 are self-explanatory: the author presents what the chapter is about, common errors and common one-liners used to diagnose possible problems. As said before, chapters aim to be self-contained and can be read while actually troubleshooting a live system, so no lengthy explanations there. At the end of each chapter, the bibliography section provides useful pointers towards resources for further study, something that is greatly appreciated. Finally, the exercises can easily be turned into interview questions, which is another bonus.
Cloud computing and the special considerations it presents get their own chapter, and the author tries to keep it platform agnostic (even if employed by a “Cloud Computing” company), which is a nice touch. This is followed by a chapter of useful advice on how to actually benchmark systems, and the book ends with a, sadly too short, case study.
The appendices that follow should be read, as they contain a lot of useful one-liners (as if the ones in the book were not enough), concrete examples of the USE method, a guide to porting DTrace to SystemTap and a who-is-who in the world of systems performance.
So how to sum up the book? “Incredible value” is one thought that comes to mind, “timeless classic” is another. If you are a systems {operator|engineer|administrator|architect}, this book is a must-have and should be kept within reach at all times. Even if your $DAYJOB does not have systems in the title, the book is going to be useful if you have to interact with Unix-like systems on a frequent basis.
PS. Some reviews of this book complain about the binding. In the three physical copies that I have seen with my own eyes, the binding was of the highest quality, so I do not know if this complaint is still valid.

Conference review: Distributed Matters Berlin 2015

“Kept you waiting, huh?” – to start the post with a pop culture reference.
Yesterday, I was privileged enough to attend Distributed Matters Berlin 2015. The focus of the conference is, you guessed it, distributed systems, often within a NoSQL context. It was hosted at the awesome KulturBrauerei, a refurbished brewery. The format of the conference was 45-minute presentations, including Q&A, thankfully followed by a 15-minute break between talks, in two tracks. The overall level of the presentations was above average and, given that you could only attend one at a time, it made for some hard choices.
Owing to the greatness of Berlin taxi drivers (you know what I am talking about if you used a taxi in Berlin recently), I managed to attend only half of the keynote by @aphyr, so I am not going to comment on this one. My main takeaway is “always, always read the documentation carefully”.
The next presentation I attended was NoSQL meets Microservices, by Michael Hackstein. This one was labelled as beginner. It presented the main paradigms of the NoSQL landscape (KV/Graph/Document), certain topologies and then the new-ish ArangoDB, a NoSQL database based on V8 JavaScript that claims to support all three paradigms at once, eliminating the need for multiple network hops. Overall, it was well presented, if a tad on the product side, and it served nicely to kick off my conference experience.
After the coffee break, where I was lucky enough to meet some old colleagues from $DAYJOB-1, I attended A tale of queues, from ActiveMQ over Hazelcast to Disque. @xeraa presented his journey with various queueing solutions. He kicked off by stating that the hard problems in distributed systems are exactly-once delivery and guaranteed delivery. He then presented the landscape of existing message queues, giving the rationale behind deciding what to use and, more importantly, what not to use. The talk was quite technical and gave me a lot of pointers for future research. Overall a solid talk, well done!
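
Since exactly-once delivery came up: one common way to live without it is to accept at-least-once delivery and make the consumer idempotent, so that duplicates are harmless. The sketch below is my own illustration, not something shown in the talk. The class and message IDs are hypothetical, and the in-memory set is an assumption that would not survive a restart; a real system would need durable storage for the seen IDs.

```python
class IdempotentConsumer:
    """Processes each message ID at most once, even if it is delivered repeatedly."""

    def __init__(self, handler):
        self._handler = handler
        self._seen_ids = set()  # in-memory only; lost on restart (assumption)

    def receive(self, message_id, payload):
        """Return True if the message was processed, False if it was a duplicate."""
        if message_id in self._seen_ids:
            return False
        self._handler(payload)
        # Record the ID only after the handler succeeds, so a failed
        # attempt can be redelivered and retried.
        self._seen_ids.add(message_id)
        return True


processed = []
consumer = IdempotentConsumer(processed.append)

# The queue redelivers "m1", as an at-least-once system is allowed to do.
deliveries = [("m1", "create order"), ("m2", "charge card"), ("m1", "create order")]
results = [consumer.receive(message_id, body) for message_id, body in deliveries]
print(processed)
```

Note that this only narrows the problem: if the process dies between the handler finishing and the ID being recorded, the message will still be handled twice.
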
That talk was followed by @pcalcado and No Free Lunch, Indeed: Three Years of Microservices at Soundcloud. Phil has amazing presentation skills and described the journey of Soundcloud from a monolithic Ruby on Rails app towards a microservices-oriented architecture. What I liked most about this presentation was not just the great technical content but also the honesty. Evolving your architecture is no trivial task and the road is full of potential pitfalls. Phil was kind enough to share some of his hard-won experience with us, which was greatly appreciated.
The lunch break was BAD, ’nuff said. The queue was too long and, by the time I got there, the good stuff was gone.
After lunch, I attended Scalable and Cost Efficient Server Architecture by Matti Palosuo. One of the more solid talks, this no-frills presentation did what it said on the tin: it presented the service infrastructure behind EA’s Sim City Build It mobile title. Dealing with casual mobile games presents a unique challenge service-wise, and Matti covered all angles in his presentation, diving deep into the specifics of their implementation.
The next presentation was Containers! Containers! Containers! And now? by Michael Hausenblas. I am not going to comment a lot on this one, since it had no slides and it was more like a tech demo. Mesos is an AMAZING product and I would have preferred some technical discussion, as opposed to a hands-on demo, but hey! this is just me.
Microservices with Netflix OSS and Spring Cloud by Arnaud Cogoluegnes was the next presentation that I attended. It focused on FOSS software by Netflix and how it can be utilized in the form of Java decorators within an application context. Useful and well presented; the only thing I personally did not like was certain slides full of code, but this does not take away from the value of the presentation. A bonus point is that, for a Java engineer, this presentation was immediately actionable, with some nice coding takeaways.
Before proceeding with the next presentation, the astute reader of this blog should have noticed a pattern forming by now: microservices. The next talk was no exception: Microservices – stress-free and without increased heart-attack risk by Uwe Friedrichsen. I really loved this talk. Uwe has a strong opinion regarding microservices (and the experience to back it up). In a nutshell, while microservices can be viable, one should keep a clear head and not fall into the trap of hype-driven architecture. This was my favorite talk of the conference and, without further ado, here are the slides. I cannot speak highly enough of this presentation, so please have a look at the slides. It was extremely nice to see the microservices hype deconstructed and a realistic case presented.
It was time for the last talk. The choice was between Antirez’s Disque implementation talk and Just Queue it! by Marcos Placona. Given that almost everyone went to Antirez’s presentation (which I am sure was excellent), I decided to give the underdog a chance and went to Marcos’ presentation instead. I was not disappointed: Marcos described his experience with using MQ while migrating a project and gave another overview of the MQ landscape.
After that, I had some food and some orange juice and decided to call it a day. Overall, it was quite a nice conference, good talks, not a lot of marketing and I will definitely visit the next one, if I am able. Met some interesting people as well and grabbed a lot of pointers for future research. Kudos to the organizers.
See you at DevOps Days Berlin 2015.

Book Review: DevOps Troubleshooting

Hello everyone and welcome back for another book review at woktime. Today’s edition is a short review of a short book called “DevOps Troubleshooting: Linux Server Best Practices”. Without further ado, below is the table of contents:

  1. Troubleshooting best practices
  2. Why is the server so slow? Running out of CPU, RAM and Disk I/O
  3. Why won’t the system boot? Solving boot problems
  4. Why can’t I write to the disk? Solving full or corrupt disk issues
  5. Is the server down? Tracking down the source of network problems
  6. Why won’t the hostnames resolve? Solving DNS server issues
  7. Why didn’t my email go through? Tracing email problems
  8. Is the website down? Tracking down web server problems
  9. Why is the database slow? Tracking down database problems
  10. It’s the hardware’s fault? Diagnosing common hardware problems

So let’s start with the title. “DevOps” can be an overloaded term: it means different things to different people and unfortunately an “according-to-Hoyle” definition does not exist. I belong to the school of thought that DevOps is more of a cultural movement within an organization than, say, a specific job title, so the title of the book, “DevOps Troubleshooting”, is meaningless (I would have strongly preferred the term “Linux Systems Troubleshooting”, as it would have been more accurate for reasons that I am going to explain below).
The author is clearly experienced within the realm of Linux administration and he attempts to cover a broad range of topics. The book is approximately 205 pages long, which means that it never gets too deep into a subject, opting instead to cover as many topics as possible. The writing style is quite readable and the author goes out of his way to explain things in relative detail. On the really plus side, there are no glaring errors: the proofreaders and the author really did go the extra mile to ensure that the content was accurate in the vast number of examples this book provides.
However, my gripe with the book is that the material covered is really basic. Granted, the intended audience is not a veteran system administrator or engineer; this book, by its own admission, is aimed at developers or QA personnel who, owing to some definition of DevOps, are thrown into operational duties. The author makes an effort NOT to use random troubleshooting, but a complete methodology is never introduced.
Overall, this is a well-written book that provides value to a non-operations member of a team doing operations or to a novice system administrator. Its small size makes it portable enough to be carried around as a level-1 reference; however, for system-level debugging there are better options out there (keep watching this space for the definitive follow-up to this sentence).

Book Review: PostgreSQL Replication

So for my series of System Engineering books, I will proceed with a short review of PostgreSQL Replication by Packt. The reason this book came to be a part of my collection is that while there is a lot of information regarding PostgreSQL replication out there, a lot of it is out of date, given the overhaul of the replication system in PostgreSQL 9.X. Without further ado, here is the list of contents of the book.

  • Understanding Replication Concepts
  • Understanding the PostgreSQL Transaction Log
  • Understanding Point-In-Time Recovery
  • Setting up asynchronous replication
  • Setting up synchronous replication
  • Monitoring your setup
  • Understanding Linux High-Availability
  • Working with pgbouncer
  • Working with PgPool
  • Configuring Slony
  • Using Skytools
  • Working with Postgres-XC
  • Scaling with PL/Proxy

The book gets straight down to business with an introduction to replication concepts and why this is a hard problem that cannot have a one-size-fits-all solution. Topics such as master-master replication and sharding are addressed as well. After this short introduction, specifics of PostgreSQL are examined, with a heavy focus on XLOG and related internals. The book goes into a nicely balanced amount of detail, enough to surpass the trivial level but not overwhelming (and thank $DEITY, we are spared source code excerpts, although a few references would be nice for those who are willing to dig further into implementation details), providing a healthy amount of background information. With that out of the way, a whole chapter is devoted to the topic of Point-In-Time Recovery (PITR from now on). PITR is an invaluable weapon in the arsenal of any DBA and gets a fair and actionable treatment, actionable meaning that you will walk away from this chapter with techniques you can start implementing right away. With the theory and basic vocabulary defined, the book then dives into replication. Concepts are explained, as well as the drawbacks of each technique, along with specific technical instructions on how to get there, including a Q&A on common issues that you may encounter in the field.
PostgreSQL has a complex ecosystem and, once the actual built-in replication mechanisms are explained, common tools are presented (with the glaring omission of Bucardo, unfortunately). This is where the book falters a bit, given the excellent quality of the replication-related chapters. The presentation of the tools is neither even nor deep in all cases; my gripe is that the Linux-HA chapter stops just when it starts to get interesting. Having pointed this out, these chapters are still better written and more concise than the information scattered around the web. I paid particular attention to the PgPool chapter, which does not cover PgPool-HA (hint: there is more than one way to do it). These chapters assume no previous exposure to the ecosystem, so they serve as a gentle (and again, actionable) introduction to the specific tools, but I would have preferred them to be 10-15 pages longer each, providing some additional information, especially on the topic of high availability. Even as-is, these chapters will save you a lot of time searching and compiling information, filling in a few blanks along the way, so, make no mistake, they are still useful. Bonus points for covering PostgreSQL-XC, which is somewhat of an underdog.
A small detail is that examples in the book tend to focus on Debian-based systems, so if you are administering a Red Hat derivative you should adapt the examples slightly, taking into consideration the differences in the packaging of PostgreSQL. Overall, the book goes for a broad as opposed to deep approach and can serve as a more than solid introductory volume. Inevitably, there is an overlap with the official PostgreSQL manuals, which is to be expected given that they are great. The quality of the book is on par with other Packt Publishing titles, making this an easy-to-read book that will save you a lot of time for certain use cases.