Notes about Building Microservices
These are notes for my future self about Sam Newman's book, Building Microservices 2nd edition.
Microservices are an approach to distributed applications that uses fine-grained services which can be changed, deployed and released independently. Microservices have become the default go-to architecture when designing systems, which Newman finds hard to justify #surprised, and he wants to share his view on why.
Chapter 1 - What are Microservices?
Microservices are independently releasable services modeled around a business domain. They are a kind of SOA, but one that is opinionated about how service boundaries should be drawn and in which independent deployability is key.
Microservices should implement information hiding and expose their API in some way, e.g. via REST/JSON API or by emitting events. They should have their own database schema. To highlight that the services are as independent as possible, Newman draws them as hexagons, a homage to Alistair Cockburn's Hexagonal architecture. Don't worry about the size.
The obvious alternative is the opposite - the Monolith. The basic monolith is a single-process application where all code is packaged together. In a modular monolith, the code is divided into separate modules within the single process. It can even have a modular database, where some of the modules have their own schema. A common anti-pattern is the distributed monolith - a system made of multiple services that still have to be deployed together, so it has the disadvantages of both a distributed system and a monolith without the advantages of microservices.
The main problem of monoliths is deployment contention #surprised. Releases have to be coordinated between teams. However, there are some obvious advantages of monoliths, like simplicity, so they are a valid architectural choice.
Microservices require a lot of new technologies. E.g. log aggregation tools can be considered a prerequisite for implementing microservices. Newman is a fan of Humio, Jaeger, Lightstep and Honeycomb; tracing tools of this kind analyze requests based on a correlation ID and display where we might have performance issues. We should not rush to Kubernetes before the management of deployments has become an issue for us. Running our own Kubernetes cluster is a significant amount of work. If we need to stream data, Kafka has become the technology of choice. Interesting offerings are also found in the serverless world, e.g. message brokers, databases and storage solutions.
Probably the biggest reason to adopt microservices is to allow more developers to work on the same system without getting in each other's way.
Chapter 2 - How to model Microservices
Microservices are just another form of modular decomposition. The principle of information hiding is about hiding as many details as possible behind a module. We gain improved development time, comprehensibility and flexibility. All the rules about modular decomposition apply here. We want high cohesion - i.e. what changes together stays together. And low coupling - change in one service should not require a change in another. By having high cohesion and low coupling, the resulting structure is stable. The stability is obviously of great importance on the boundaries of microservices.
There are four kinds of coupling. Domain coupling is a situation where one service needs to talk to another because the business domain requires it, e.g. the Order service needs to talk to the Warehouse. This type of interaction is unavoidable, but we still want to keep it to a minimum. If a microservice needs to talk to lots of other services, we might have centralized too much logic in it. Moreover, if the call is synchronous, temporal coupling is in play as well - the downstream service(s) have to be online for the call to succeed. Pass-through coupling is a situation where data is passed to a service just because some service further downstream needs it. One way to deal with it is to have the upstream service talk directly to the downstream one. Another way is to make the required data part of the middle service's contract. A final way is to hide the data schema from the middle service completely, so that it treats the data as an opaque blob. Common coupling occurs when two or more microservices make use of the same shared data. This can be solved by having one service, which might be a new third one, own the state of the shared data. Content coupling is when an upstream service reaches in and changes the internal state of a downstream service. This is something to just avoid.
There are 3 key concepts of DDD useful for modelling microservices. Ubiquitous language, i.e. the code should use the same language the users use. The Aggregate, i.e. something that has state, identity and a life cycle. Bounded context, i.e. a container for one or more aggregates. When starting out with microservices, we should build them around bounded contexts. When we are more comfortable with them, we can split them apart and model them around aggregates.
A useful modelling practice might be Event Storming. We get everyone in a room together. Participants identify domain events, e.g. "Order Placed" or "Payment Received". Next, participants identify the commands that cause these events, i.e. the decisions made by a human to do something. With events and commands in place, aggregates come next. Then they are grouped into bounded contexts. Bounded contexts most commonly follow the organizational structure of the company.
There are some alternatives to DDD when splitting the services. Volatility-based decomposition splits the services based on frequency of change. Data-based decomposition is about data ownership. The need to use a specific technology can be yet another driver. According to Conway's law, any decomposition which has a service owned by multiple teams will not yield the desired outcomes, so one of the alternative drivers can be organizational. That said, the organizational silos have to be broken up and spread across new teams. Layering has its place inside microservices, i.e. inside individual teams. Newman recommends starting with DDD and organizational service boundaries #notable.
Chapter 3 - Splitting the Monolith
Microservices are not the goal. We don't "win" by having them. The migration has to be incremental #important. With a big-bang rewrite, the only thing guaranteed is a big bang. We shouldn't focus on getting rid of the monolith, but instead on the benefits we are expecting to gain. Premature decomposition can also lead us to the wrong boundaries. We can try tools like CodeScene for volatility-based decomposition.
We can decompose code first, leaving data where it was. Or we can decompose data first into a separate schema. We can use some patterns for splitting the monolith. E.g. the Strangler Fig Pattern tells us to have an upstream interception layer, which delegates calls for already migrated functionality to the new microservice, while delegating calls for functionality not yet migrated to the monolith. We can also use Feature Toggles to have a backup plan. We will have to maintain data integrity at the service boundary level, and join operations will be replaced with service calls. We will lose the ACID safety of transactions. There is one legitimate use-case for a shared database and that is a reporting database, into which we push data for the reporting system. Most teams use Flyway or Liquibase for upgrade scripts and database management.
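The interception layer of the Strangler Fig Pattern can be sketched as a routing table. This is only a minimal illustration under assumptions, not Newman's implementation: the path prefixes and backend names are hypothetical, and a real setup would more likely configure this in an HTTP proxy.

```python
# Hypothetical routing table: path prefixes already migrated to microservices.
MIGRATED_PREFIXES = {
    "/invoices": "invoice-service",
    "/notifications": "notification-service",
}

MONOLITH = "monolith"


def route(path: str) -> str:
    """Return the backend that should handle the request path.

    Migrated functionality goes to its new microservice; everything
    else still falls through to the monolith.
    """
    for prefix, backend in MIGRATED_PREFIXES.items():
        if path == prefix or path.startswith(prefix + "/"):
            return backend
    return MONOLITH
```

Migrating another piece of functionality then means adding one entry to the table, and rolling back means removing it again.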
Chapter 4 - Microservices Communication Styles
Getting communication right is problematic. People gravitate to certain technologies without considering why. We should think about performance. What is OK within a process, e.g. 1000 calls of a method, might not be OK in inter-process communication. We should also watch out for big data transfers over the network. When making a backwards-incompatible change, we might have to do a lockstep deployment of the client and the service. The nature of errors can be different - servers will crash, responses won't return, some things will happen too early or too late, responses will be wrong, or we will hit situations where we cannot even agree on what the problem was.
There are many choices for inter-process communication. We should consider them all and pick the one that fits the problem. E.g. single-page app technology such as Angular or React is a poor fit for building a plain website. There are the following styles of microservices communication:
- Synchronous blocking, when a microservice calls another service and waits for a response. This is a request-response kind of communication, typically implemented by REST over HTTP or RPC. The advantage is that it is simple and familiar. The disadvantage is the temporal coupling - the downstream service has to be online. This pattern starts to be problematic in longer chains of downstream calls. It also causes resource contention - the threads from the thread pool are usually blocked while waiting. We can try making some of the calls in the chain asynchronous, therefore removing them from the critical path.
- Asynchronous nonblocking means that the service continues its operation regardless of whether the downstream call was received or not. Even the request-response communication style can be implemented using this pattern, but then the response may arrive at a different instance of the microservice than the one that sent the request. The advantage is that the services are decoupled temporally. The disadvantage is the level of complexity and the range of choice. The obvious use-cases are long chains of calls or long-running processes.
- Request-response style is when the upstream service is expecting some response from the downstream service. Asynchronous nonblocking request-response communication is typically implemented by queue-based brokers. These sorts of communication always require some sort of timeout handling. They work well in situations where some additional processing is needed based on the response or when some mitigating action, like a retry, is required.
- Event-driven style works by emitting events from services, which subscribed services receive and act upon. This is usually implemented by topic-based brokers or REST over HTTP (Atom). This is a kind of inversion of responsibility for handling the event. RabbitMQ is a popular implementation choice for this communication pattern. We should keep our pipes dumb and endpoints smart. Events should contain all the data needed to perform the downstream action. This communication style is useful whenever information needs to be broadcast and we are happy with inverting the intent. Newman gravitates to this style almost as a default. Just beware of the increased complexity this style adds.
- Common data is when services communicate via some shared data, e.g. a database or filesystem. There is usually some polling involved. Two common examples are the data lake and the data warehouse. The advantage is that this is quite simple to implement, the data size is not so much of a concern and the technologies of the services can be very different. The disadvantage is the need for a polling mechanism, which means this is not useful in low-latency situations. Also the common data introduces more coupling.
Newman recommends deciding first whether request-response or event-driven communication is the right one for the situation. A service will typically mix communication styles and this is usually the norm.
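The inversion of responsibility in the event-driven style can be sketched with a toy in-memory topic broker. This is an assumption-laden illustration: the class, the topic name and the handlers are hypothetical, and delivery here is synchronous and in-process, unlike a real broker such as RabbitMQ or Kafka.

```python
from collections import defaultdict


class InMemoryTopicBroker:
    """Toy stand-in for a topic-based broker (hypothetical, in-process only)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # The emitter does not know who is listening, or whether anyone is.
        for handler in self._subscribers[topic]:
            handler(event)


broker = InMemoryTopicBroker()
emails_sent = []
loyalty_points = {}


def send_welcome_email(event):
    # The event carries all the data the subscriber needs to act.
    emails_sent.append(event["email"])


def open_loyalty_account(event):
    loyalty_points[event["customer_id"]] = 100


broker.subscribe("customer-registered", send_welcome_email)
broker.subscribe("customer-registered", open_loyalty_account)

# The emitting service only announces what happened.
broker.publish(
    "customer-registered",
    {"customer_id": "c-1", "email": "a@example.com"},
)
```

Adding a third reaction to the event requires a new subscriber only; the emitting service stays unchanged, which is the point of inverting the intent.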
Part II - Implementation
Chapter 5 - Implementing Microservice Communication
We should choose technology which makes backwards compatibility of our APIs easy. Our APIs should use an explicit schema and be technology agnostic. Our services should be simple for consumers. They should hide their implementation details. These are a couple of popular choices:
- RPC, like SOAP and gRPC. Their main selling point is the ease of generating client code. The disadvantage is technology coupling, as they typically rely on a specific platform. Just beware that remote calls are not like local calls. Some of the RPC protocols can be quite brittle. Newman quite likes gRPC in situations where he is in control of both client and server.
- REST over HTTP is the obvious choice for synchronous request-response communication. The main challenges here are the lack of good documentation tools, although Swagger tries to fix this issue, and the performance hit that big payloads can cause.
- GraphQL for querying a backend which then does not have to be released so often. The clients are typically GUIs, with mobile as the obvious case. However, this can have a big performance impact on the backend if the queries are not made with caution. Another issue is that while GraphQL handles reads quite well, it is not designed for writes. Also, using GraphQL should not allow us to slip into thinking that microservices are just thin layers above our database.
- Message brokers for asynchronous communication via queues and topics. These are typically middleware boxes which sit between the microservices, providing queues, topics, or both. Topics are good for event-based collaboration while queues are good for request-response communication. One of the selling points is guaranteed delivery. Yet another might be a guarantee of message order. Some brokers provide transactions on write. Another, more controversial, marketed advantage is exactly-once delivery, but this is quite hard to achieve, so we should always ask how it is done. Popular choices are RabbitMQ, ActiveMQ and Kafka.
- Kafka helps move large amounts of data by its streaming capabilities. It has message permanence and built-in support for stream processing.
Handling Change Between Microservices
We should generally avoid breaking changes with:
- Expansion changes, i.e. we only add new things to APIs and don't remove old things.
- Tolerant reader is when consumers are quite flexible in what they take.
- Right technology means we should choose technology which can handle backwards compatibility better.
- Explicit interface means we should have a schema.
- Catch accidental breaking changes early. There are tools like Protolock, json-schema-diff-validator, Confluent Schema Registry and similar.
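How an expansion change and a tolerant reader work together can be sketched as follows. This is a minimal example under assumptions: the payloads and the field names (`email`, `loyalty_tier`) are made up for illustration.

```python
import json


def read_customer_email(payload: str) -> str:
    """Tolerant reader: extract only the field this consumer needs."""
    document = json.loads(payload)
    # Unknown fields are ignored rather than rejected, so the producer
    # is free to add new ones.
    return document["email"]


# Hypothetical payload before the change.
old_payload = '{"id": 1, "email": "sam@example.com"}'
# Expansion change: the producer added a field and kept the old ones.
new_payload = '{"id": 1, "email": "sam@example.com", "loyalty_tier": "gold"}'
```

The same consumer code reads both payloads. It would break only if the producer removed or renamed `email`, which is exactly the kind of change the tools above try to catch early.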
If we absolutely have to make breaking changes, then we can do:
- Lockstep deployment of the changed service and all the client services together. This obviously breaks the independent deployment advantage of microservices, so it should be used sparingly. If we use it too often, we might have a distributed monolith problem.
- Coexist multiple microservice versions, which means running both versions of the whole service in parallel. This means that every bugfix would have to be done in two places.
- Emulate the old interface, which means having the old version of the endpoint in the same service. This is the generally preferred approach. The old interface can be gradually deprecated and eventually removed. Some companies take quite extreme measures, e.g. turning off the old interface after a year or adding sleeps in the old interface to persuade clients to upgrade to the new one.
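Emulating the old interface inside the same service might look like the sketch below. The handler names, the version suffixes and the hard-coded record are hypothetical; the idea shown is only that the old endpoint delegates to the new implementation and reshapes the result.

```python
def get_customer_v2(customer_id: str) -> dict:
    """New interface: the name is split into two fields."""
    # Hard-coded record standing in for a real lookup (assumption).
    return {"id": customer_id, "first_name": "Sam", "last_name": "Newman"}


def get_customer_v1(customer_id: str) -> dict:
    """Old interface, kept alive in the same service for existing clients.

    It delegates to the new implementation and maps the result back to
    the old shape, so there is only one code path to fix bugs in.
    """
    customer = get_customer_v2(customer_id)
    return {
        "id": customer["id"],
        "name": f'{customer["first_name"]} {customer["last_name"]}',
    }
```

Once all clients have moved to the new version, the old handler can be deleted without touching the new one.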
Code Reuse in Microservices
When we share code between the services using libraries, we have to accept that multiple versions of the libraries will be in use. Some teams develop client libraries for their services, which have a potential risk of logic leaking into the client code. The client library should only contain code to handle the underlying transport protocol, service discovery and failure, not any business logic. Common patterns for service discovery are:
- Domain Name System (DNS). The more we cache these entries, the more stale they can become. We would usually use a load balancer to avoid stale DNS entries.
- Dynamic Service Registries, e.g. ZooKeeper, which is used for managing configuration, synchronizing data between services, leader election, message queues and as a naming service. ZooKeeper is often used as a general configuration store. In reality there are so many tools better suited for dynamic service registration that Newman suggests not using ZooKeeper for this purpose. Consul provides configuration management and service discovery. Consul uses a REST HTTP API for everything. Other options are etcd and Kubernetes.
Service Meshes and API Gateways
Generally speaking, API Gateways handle north-south traffic, i.e. they handle traffic from the outside world trying to communicate with our microservices. They largely function as reverse proxies. Using API Gateways designed for third-party use is usually huge overkill. When using Kubernetes, we can use many tools in this field, e.g. Ambassador.
On the other side, Service Meshes handle east-west communication between our microservices. With Service Meshes the common functionality associated with inter-service communication is pushed into the mesh. This reduces the functionality the service has to implement internally while providing a lot of consistency in how certain things are done. Service Meshes usually consist of Mesh Proxies on the perimeters of the microservices and a Service Mesh control plane, which coordinates the proxies. The main problem with Service Meshes is that we are putting them on our critical path. The field has matured though. For organizations which have more microservices, they might be well worth a look.
We should generally pick technology based on the problem we are solving. We can also consider humane registries for documentation of our APIs.
Chapter 6 - Workflow
A microservice is of course free to use classic ACID transactions against its own database. We should just say No to distributed transactions as they add too much complexity. Sagas are preferred instead. Instead of an ACID rollback, we have to think about explicit rollbacks and even reorder the saga's steps to make rollbacks easier. We can architect the saga toward orchestration or choreography. Orchestration is more classic, but most of the logic ends up in a single place. Choreography is when each service hands off the work to the next service and is more suitable in event-driven architectures.
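An orchestrated saga with explicit rollbacks can be sketched as a list of (action, compensation) pairs. This is a simplified, in-process illustration under assumptions: the step names are hypothetical, and a real saga would call other services and would have to persist its progress.

```python
from typing import Callable, List, Tuple

# Each step pairs an action with the compensation that undoes it.
Step = Tuple[Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> bool:
    """Run actions in order; on failure, compensate completed steps in reverse."""
    completed: List[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


log = []


def fail_shipping():
    raise RuntimeError("no courier available")


steps = [
    (lambda: log.append("reserve stock"), lambda: log.append("release stock")),
    (lambda: log.append("take payment"), lambda: log.append("refund payment")),
    (fail_shipping, lambda: log.append("cancel shipment")),
]
succeeded = run_saga(steps)
```

Here the third step fails, so the payment is refunded and the stock is released, in that order. Placing the step most likely to fail earlier in the list would reduce the number of compensations needed, which is the reordering mentioned above.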
Chapter 7 - Build
There are two main concepts of microservice build. Continuous Integration (CI) and Continuous Delivery (CD).
With Continuous Integration, the goal is to keep everyone in sync with each other. We should check in to mainline at least once per day. We should have a suite of automated tests to validate our changes. When the build is broken, the #1 priority should be fixing the build. We should use an appropriate branching model. Long-lived branches are problematic, because we keep them away from the mainline for too long. We should start using feature flags and merge with mainline more frequently instead. The "GitFlow" pattern is more suitable to open-source development, where you don't trust other committers, than to normal companies' flow #interesting.
Continuous Delivery is all about build pipelines. The concept is that every commit is treated as a release candidate. The checks on every commit are executed automatically without human interaction. The artifact should be built once and once only. The same artifact should then be deployed to each environment.
With Multirepo, each microservice has its own source code repository. The common code can be shared using library repositories. Cross-service changes require multiple commits. Making cross-service changes too often is a design smell and maybe we should think about merging our microservices together.
With Monorepo, the ability to reuse code and to make cross-service changes easily are the main reasons to adopt this pattern. With a small number of developers, say 20, this pattern of collective code ownership could work fine. The challenge comes when we start to have more developers and the need for stronger code ownership. Some tools don't offer enough control over code ownership in a Monorepo. E.g. GitHub has the CODEOWNERS file, but many tools don't have anything. Google is often cited as an example company for using Monorepo, but it has 100+ people in a platform team creating developer tools just to handle the Monorepo.
There is a pattern variation where each team has its own monorepo.
In Newman's experience the advantages of Monorepos don't outweigh the challenges that come at scale.
Chapter 8 - Deployment
The logical view of our microservices can hide a wealth of what is going on under the hood. For example, using a Load Balancer. Or distributing instances across multiple different Data Centers.
We should generally avoid sharing databases between services. We can scale database access by using read replicas. The same physical database can hold multiple schemas with separate access rights for each service. Teams that use on-premises databases tend to reuse one physical database for multiple schemas, while teams working in the cloud tend not to care and start a new database for each service.
We will generally have more environments to deploy our services to. As we go from the local to the production environment, the feedback gets slower, but the environment gets more production-like (e.g. number of nodes in clusters, etc.).
There are five principles of microservice deployment:
- Isolated execution means that we can start and stop each microservice individually without affecting the others. There is a tradeoff between stronger isolation and lower cost + faster provisioning. The classic options are Physical machine, Virtual machine or a Container.
- Focus on automation starts to become important at scale, when we start to have more microservices.
- Infrastructure as code means that the infrastructure is defined in code, which should be kept and shared using source version control tools.
- Zero-downtime deployment means services should not stop during deployment.
- Desired state management is about defining the target environment and letting the tools get to the desired state. Things like the number of instances and memory requirements can be defined as the desired state. Fully automated deployment for each service is a prerequisite for desired state management. With GitOps the desired state is stored as code.
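The idea behind desired state management can be sketched as a reconciliation function that compares what we want with what is running. This is a toy model under assumptions: the service names and the action tuples are hypothetical and do not correspond to any real tool's API.

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Return the actions needed to move `actual` toward `desired`.

    Both arguments map a service name to its number of running instances.
    """
    actions = []
    for service, want in desired.items():
        have = actual.get(service, 0)
        if have < want:
            actions.append(("start", service, want - have))
        elif have > want:
            actions.append(("stop", service, have - want))
    # Anything running that is not part of the desired state is stopped.
    for service, have in actual.items():
        if service not in desired and have > 0:
            actions.append(("stop", service, have))
    return actions


# Hypothetical example: one service is under-provisioned, one is obsolete.
plan = reconcile(
    desired={"orders": 3, "payments": 2},
    actual={"orders": 1, "payments": 2, "legacy": 1},
)
```

A platform would run such a comparison continuously, so that a crashed instance shows up as a difference and is started again without a human issuing commands.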
The service can be deployed into:
- Physical machine with no virtualization. This can be very inefficient and is rarely seen today.
- Virtual machine. The physical machine is split into multiple virtual ones. E.g. Netflix uses AWS EC2 for its services instead of Containers.
- Container, which might be managed by container orchestration tool like Kubernetes. The containers have become the de facto choice for running microservices these days. Normally the containers on the same machine use the same kernel. You don't have the same isolation as with virtual machines. With Docker we started to have the concept of "image" for containers along with the nice set of tools.
- Application container which manages other application instances in the same runtime. Examples are Weblogic or Tomcat. The lack of isolation is the main reason this pattern is not used too much with microservices.
- Platform as a Service (PaaS), like Heroku, Google App Engine or AWS Elastic Beanstalk. This is a higher-level abstraction where you deploy the application to a specific environment which offers a set of system services, like database access, caching and more. Heroku remains the gold standard in this field. The smarter the PaaS solutions try to be, the more they go wrong.
- Function as a Service (FaaS) like AWS Lambda or Azure Functions. This is the only real contender with Kubernetes when it comes to popularity with microservices. Serverless here means that developers should stop worrying about the servers at all. You can only control the memory allocated to the functions. The function invocations have to be stateless. The challenge is the spin-up time, which can be significant for languages such as Java. We can map functions to services or e.g. to aggregates.
The rule of thumb?
- If it ain't broke, don't fix it. Use whatever technology works for you.
- Give up as much control over infrastructure as you feel happy with and then give up some more. Maybe you will like the PaaS more than you think. Both Heroku and Zeit are excellent tools for this.
- Otherwise expect Kubernetes in your future.
Kubernetes is a tool for container orchestration. It means the operator says which thing should run and Kubernetes takes care of allocating resources and running the thing. A Kubernetes cluster consists of two things: the nodes and the control plane. The nodes can run on physical or virtual machines under the hood. A node runs one or more Pods, each of which contains one or more Containers that are deployed together. Commonly a Pod contains just one Container. A Service is a stable routing endpoint, a way to map multiple Pods to a useful network address. The challenge with Kubernetes is how multi-tenancy is handled. Various teams might want different degrees of control over the resources. One option is to use OpenShift. Another is to consider a federated model. Moving away from one Kubernetes cluster to another might well also require rewriting the platform for the new destination. Kubernetes is not terribly developer friendly. Running your own Kubernetes cluster is not for the faint of heart. We should consider using a fully managed cluster instead, as offered by all the main cloud vendors (or even consider a higher-level abstraction like PaaS or FaaS).
Blue-green deployment is when we deploy the new version of the service (green) alongside the old, live one (blue). We test that the green version is working as expected and then we redirect users to it. Other patterns for Progressive Delivery include Canary Release and Feature Toggles. Canary Release is when you pilot some feature with just a small group of users and then progressively increase the number of users for it. Feature Toggles are the feature flags mentioned earlier.
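One possible way to pick the users for a Canary Release is to hash the user ID into a stable bucket. This is a sketch under assumptions: the function name and the percentage parameter are hypothetical, and real systems often make this decision in a load balancer or a feature-flag service instead.

```python
import hashlib


def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically place roughly `percent`% of users in the canary group."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket 0..99
    return bucket < percent
```

Because the bucket depends only on the user ID, a user who sees the new version at 5% keeps seeing it when the rollout grows to 20%, and setting the percentage back to 0 acts as the rollback.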
Chapter 9 - Testing
The challenge of testing microservices is of course their distributed nature. In the past, testing was predominantly done in pre-production. This chapter ignores manual Exploratory Testing, i.e. the "how can I break the system?" kind of testing.
According to the Test Pyramid, there should be:
- Many unit tests with a fast feedback loop and low confidence that we didn't break anything. They have a smaller-than-the-whole-microservice scope.
- Fewer service tests with a slightly slower feedback loop, but higher confidence that we didn't break a thing. The service tests test the whole microservice while mocking the rest of the world.
- Far fewer UI (end-to-end) tests with long feedback loops and high confidence of not breaking anything.
The standard way of choosing which versions of services to use in end-to-end tests is to test the next versions of the services together. This does not mean that releasing them together is OK. That is never acceptable. There will probably be some flaky tests in this category. The reason is that there are too many moving parts. It is essential to either fix them immediately or remove them immediately and fix them later. The team programming the functionality should also be responsible for writing the end-to-end tests. The more distant the team writing the tests is from the team developing a service, the more bad things happen. The idea of shared UI tests undermines independent deployability. It is quite difficult to reason about the usefulness of an end-to-end test versus the burden it entails. When UI tests slow down our ability to release small changes they can end up doing more harm than good.
A common anti-pattern is a Test Snow Cone, or inverted pyramid. When the build breaks it stays that way for a long time if the build takes 1 hour or more.
Newman suggests using Contract Tests and Consumer-Driven Contracts (CDCs) as an alternative to UI tests. The team consuming a target service writes tests describing how it expects the service to behave #interesting. Therefore they belong to the service tests group more than to the end-to-end tests group. There are many tools for this option, e.g. Pact and Spring Cloud Contract. We can view end-to-end tests as training wheels until we learn a better approach. Over time, most teams experienced in microservices replace them with Contract Tests or some kind of progressive delivery.
Developers should run locally only the services they are working on. Everything else should be stubbed.
Generally speaking, tests are useful for getting fast feedback about our software quality. But they are not the only option. We can also apply testing to our production environment. The simplest example is a ping test, i.e. a basic health check of whether our service is running at all. Smoke tests are typically done right after the deployment to verify some basic scenarios. Canary releases can be viewed as a form of testing by limiting the scope of the changes. Another option is injecting fake user behavior, e.g. creating a fake user or a fake order. We have to be extra careful with these tests.
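The idea of a consumer-driven contract can be sketched without any particular tool. This is not how Pact or Spring Cloud Contract work internally; the contract format, the handler and the field names are all hypothetical.

```python
# Contract written by the consuming team: the fields it reads and their types.
CONSUMER_CONTRACT = {"id": str, "email": str}


def satisfies_contract(response: dict, contract: dict) -> bool:
    """True if every field the consumer relies on is present with the expected type."""
    return all(
        field in response and isinstance(response[field], expected_type)
        for field, expected_type in contract.items()
    )


def get_customer(customer_id: str) -> dict:
    """Hypothetical provider handler, checked by the provider's own build."""
    return {"id": customer_id, "email": "sam@example.com", "loyalty_tier": "gold"}
```

The provider runs the consumers' contracts as part of its own build, so a change that removes `email` fails before deployment, without starting the consumer at all. Extra fields such as `loyalty_tier` do not violate the contract.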
Operations generally optimize for mean time between failures (MTBF) or mean time to repair (MTTR). Optimizing for MTTR can be as simple as good monitoring and fast recovery in case of failure. We should get good at MTTR as well.
There are also Cross-Functional Tests, e.g. Performance Tests. Performance Tests usually cannot be run on every check-in. We should monitor their results and have some idea of target values to check against. Robustness Tests check what happens e.g. when one of the downstream services is unavailable. They are quite tricky to implement but can be worth the while. They are especially useful when developing functionality shared between a lot of microservices.
Chapter 10 - From Monitoring to Observability
We should give more focus to Observability - the ability to ask our system about various things to analyze what is going wrong. We need to have an aggregated monitoring and tools to slice and dice this data to get a bigger picture. We should monitor CPU, memory, logs, we should have a health check endpoint. We should aggregate this data. There are 3 pillars of observability - metrics, logging and distributed tracing. There are actually more parts of observability to consider:
- Log aggregation into a single place to look at. This is actually a prerequisite for building microservices. We should pick a common and sensible log format to make things easier. We should continue by adding a correlation ID to our log lines to make them more useful. Beware that we cannot get precise timing of events using logs only. To get that we need something like distributed tracing. There are many tools helping with this, e.g. open-source Kibana or commercial Humio or Datalog.
- Metrics aggregation to capture the big picture and maybe even scale our applications. There are tools in this area as well, e.g. open-source Prometheus. For high-cardinality data Neman recommends Honeycomb or Lightstep, which are often seen as distributed tracing systems, but can handle metrics aggregation as well.
- Distributed tracing to detect latency problems. E.g. Honeycomb shows us a diagram flow of each of the calls so we can detect where the most of time was spent. Another tool is open-source Jeager. Commercial tools are Honeycomb and Lightstep.
- Are you doing OK? Checking various SLAs SLOs and others. SLOs are service-level objectives. We can even implement error budgets for teams to give them clear view how well they are doing regarding their SLOs.
- Alerting - what should we alert on and what does a good alert look like? Often too many alerts can cause significant issues. A problem in one area can cause problem in another. Alerts should be relevant, unique, timely, prioritized, understandable, diagnostic (clear what is wrong), advisory (help to understand what actions to take) and focusing (draw attention to most important issues). We should maybe try a more holistic approach when to wake up people at 3am.
- Semantic monitoring - think differently about what should wake us up at 3am. Think higher-level. E.g. we should monitor that we are selling roughly the same amount of orders than weeks before, new customers ca register to join, etc. We are not talking about low-level stuff as disk monitoring. This is kind of monitoring where the product owner should come in and help defining it.
- Testing in production. With synthetic transactions we insert fake user behavior to our system. This behavior has known inputs and known expected outputs. This has proven to be better indicator of issues than lower-level monitoring. We can use tools for automated UI tests to implement them. With A/B testing we deploy two or more versions of the functionality and we present two or more groups of users either one or the second. Then we measure their performance. Canary release is a pilot of a new functionality to limited number of users. Parallel run means there are two implementations of the same functionality and their result is compared. Smoke tests are performed right after deployment. Chaos engineering is popularized by Netflix and their Chaos monkey, which kills some of the services right in production.
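Adding a correlation ID to every log line, as suggested in the log aggregation point above, can be sketched with the Python standard library. The logger name, the log format and the `handle_request` function are assumptions made for illustration; in a real service the ID would arrive in a request header and be passed on to downstream calls.

```python
import contextvars
import io
import logging
import uuid

# Holds the correlation ID of the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


stream = io.StringIO()  # stands in for the log aggregator's input
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s %(correlation_id)s %(message)s")
)
handler.addFilter(CorrelationIdFilter())

logger = logging.getLogger("order-service")  # hypothetical service name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


def handle_request(incoming_id=None):
    """Reuse the caller's correlation ID, or mint one at the edge of the system."""
    correlation_id.set(incoming_id or str(uuid.uuid4()))
    logger.info("order received")
    logger.info("payment requested")
    return correlation_id.get()
```

With every service logging in the same format, searching the aggregated logs for one ID returns the whole path of a request across services.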
Observability is an area where standardization can be very important. It is a fast-emerging space with many young tools. The tools should provide temporal (how does this compare to a minute, hour, day, or week ago?), relative (how does this compare to other parts of the system?), relational (what does this depend on, and what depends on it?) and proportional (how many users are impacted?) context. While automated tools are emerging, the expert in the system is currently still a human.
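To make the error budget idea from the SLO bullet concrete, here is a minimal sketch. The function name, the 99.9% target and the request counts are illustrative assumptions, not figures from the book.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Summarize error budget usage for one measurement period.

    slo_target is the fraction of requests that must succeed, e.g. 0.999.
    All names and numbers here are illustrative assumptions.
    """
    # The budget is the number of failures the SLO still tolerates.
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures > 0:
        consumed = failed_requests / allowed_failures
    else:
        # A 100% target leaves no budget at all.
        consumed = float("inf")
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_consumed": consumed,
    }


report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"{report['budget_consumed']:.0%} of the error budget used")
```

A team that has used most of its budget might slow down risky releases until the period resets; one with plenty left has room to ship faster.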
Chapter 11 - Security
Newman aims to be consciously incompetent, i.e. he wants to know his limits. Principle of Least Privilege - when granting access, we should give the minimum rights needed, for the minimum time needed. There are five functions of cybersecurity:
- Identify potential attackers, what their targets are and where we are most vulnerable. The targets might be user credentials, secrets, etc. For secrets, we should manage the whole lifecycle: creation, distribution, storage, monitoring and rotation. We can limit the scope of each secret, thereby limiting the impact if it is compromised.
- Protect the key assets from the hackers. We should also patch our systems. When using backups, we should test that they really work.
- Detect whether an attack has happened.
- Respond when we find out that something bad has happened.
- Recover in the wake of an incident.
We can implicitly trust services inside our network perimeter, or we can apply zero trust and treat everything as potentially compromised. Services can use mutual TLS to authenticate each other, which can then serve as a basis for authorization. We can implement a single sign-on (SSO) solution using a gateway and an identity provider.
Chapter 12 - Resiliency
The resilience of microservices is often cited as a major reason for adopting them. Resiliency is about robustness - the ability to absorb expected perturbation, rebound - the ability to recover from trauma, graceful extensibility - the ability to deal with the unexpected, and sustained adaptability - the ability to adapt to changing requirements. The ability to deliver resiliency is not about software alone, but also about the people maintaining it. The scope of the book is limited to improving robustness.
We should define our SLAs: response time/latency (e.g. a 90th-percentile response time of 2s when handling 200 concurrent requests per second), availability, and durability of data. We should also think about how critical each of our capabilities is.
Systems that just act slow are much harder to deal with than systems that fail fast. In a distributed system, latency kills. There are three common fixes:
- Time-outs, which we should get right. We should continually tweak the time-out settings to achieve better robustness.
- Bulkheads, i.e. using separate connection pools for separate downstream services. This way we limit the impact of one service going down #interesting.
- Circuit breakers, which are recommended for all downstream calls.
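As a reminder of how a circuit breaker behaves, here is a minimal single-threaded sketch. The class name, the thresholds and the injectable `clock` parameter are my own assumptions for illustration; a production system would more likely use an existing library or a service mesh feature.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the behavior can be tested without sleeping
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of waiting on an unhealthy downstream service.
                raise CircuitOpenError("downstream considered unhealthy")
            # Reset timeout elapsed: half-open, let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            # Time-outs raised by func count as failures like any other error.
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # A successful call closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The wrapped function is expected to apply its own time-out, so that a slow downstream call turns into a fast failure the breaker can count.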
We can also isolate our microservices more, however this comes with the tradeoffs of higher complexity and cost. Another way to improve robustness is redundancy, i.e. having more than one of something that does the same job. Middleware or message brokers can help with guaranteed delivery. We should implement idempotent endpoints, i.e. endpoints where repeating the same call does not change the result. GET and PUT should be idempotent.
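A minimal sketch of an idempotent operation, assuming the caller sends a unique request ID with each logical request. The `PointsService` class and its method names are hypothetical and only illustrate the idea.

```python
class PointsService:
    """In-memory illustration of an idempotent 'add points' operation."""

    def __init__(self):
        self.balances = {}
        self.processed = set()  # request IDs that were already applied

    def add_points(self, customer_id, points, request_id):
        # Replaying a request with the same ID has no further effect.
        if request_id in self.processed:
            return self.balances[customer_id]
        self.balances[customer_id] = self.balances.get(customer_id, 0) + points
        self.processed.add(request_id)
        return self.balances[customer_id]


service = PointsService()
service.add_points("customer-1", 100, request_id="order-4567")
service.add_points("customer-1", 100, request_id="order-4567")  # a retry of the same call
print(service.balances["customer-1"])  # 100, not 200
```

This makes retries after a time-out safe: the client can resend the call without knowing whether the first attempt was applied.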
The CAP theorem says that a distributed system cannot fully guarantee consistency, availability and partition tolerance all at once; when a network partition happens, we have to trade consistency against availability. Chaos engineering is the discipline of experimenting on a system to build trust that it will behave correctly in turbulent times. We should also test our people and processes with realistic but fictional situations. Netflix has Chaos Monkey (takes down a random server), Chaos Gorilla (takes down a random data center) and Latency Monkey (slows connectivity in a network).
We should never start blaming people for failures. This starts innocently but can grow into a culture of fear. We should read "Blameless PostMortems and a Just Culture" by John Allspaw.
Chapter 13 - Scaling
There are 4 axes of scaling - vertical scaling, horizontal duplication, data partitioning and functional decomposition.
Vertical Scaling is about getting a bigger machine. This is the easiest kind of scaling to implement. It won't help when our code cannot make use of multiple cores, it will not improve our robustness, and sometimes it is too pricey compared with multiple slower machines.
Horizontal Duplication is about duplicating a service (or a monolith) to multiple nodes in a cluster. There is usually an upstream load balancer involved, which distributes the workload. Alternatively we can have a queue of work and competing workers receiving work from it. Another example is database read replicas. The advantage is that horizontal duplication is still relatively straightforward. One limitation is the cost of replicating a big monolith when only a small part of it needs scaling. Another limitation is that an application sometimes needs sticky sessions to work correctly. This should be avoided. We should strive for autoscaling infrastructure, which spins up as many instances of our services as needed.
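The queue-with-competing-workers variant can be sketched in a single process with threads. In a real system the queue would be a message broker and each worker a separate service instance; `run_workers` and its parameters are names I made up for this illustration.

```python
import queue
import threading


def run_workers(jobs, handler, worker_count=3):
    """Let several workers compete for jobs from one shared queue."""
    work = queue.Queue()
    for job in jobs:
        work.put(job)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job = work.get_nowait()
            except queue.Empty:
                return  # nothing left to do, this worker stops
            outcome = handler(job)
            with lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results  # completion order, not submission order


squares = run_workers(range(10), lambda n: n * n)
print(sorted(squares))
```

Adding capacity here only means raising `worker_count`; each job is still handled by exactly one worker.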
Data Partitioning is basically about applying a function to each piece of data we work with and computing a partition from the result. Then we route the data to the matching database. The simplest example could be partitioning customer data by name into A-M and N-Z. The key benefit is spreading the workload, especially the write workload. Partitioning the database might also make certain maintenance tasks easier. Its limitation is that it does not improve robustness. Also, getting the partition key right might be difficult.
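Two partition functions as a minimal sketch: the A-M / N-Z split from the note above and a hash-based alternative. The database names and the choice of SHA-256 are assumptions for illustration only.

```python
import hashlib


def partition_by_name(family_name):
    """Route a customer to one of two databases by the first letter of the name."""
    first = family_name.strip()[:1].upper()
    # Anything outside A-M (including empty or non-letter names) goes to N-Z.
    return "customers_a_m" if "A" <= first <= "M" else "customers_n_z"


def partition_by_hash(key, partition_count):
    """Spread keys across partitions using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


print(partition_by_name("Cockburn"))  # customers_a_m
print(partition_by_name("Newman"))    # customers_n_z
print(partition_by_hash("customer-42", 4))
```

The name-based split is easy to reason about but can be uneven if names cluster; the hash-based one spreads load more evenly at the price of range queries no longer mapping to a single partition.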
Functional Decomposition is about extracting a new service from an existing service. In general we should look at the simpler scaling options first before trying this one, which is the hardest.
Caching is a helpful mechanism for avoiding unnecessary work. We not only save network hops, but sometimes even request creation and parsing. But we have to be careful about the cache eviction mechanism. We might use a dedicated cache system like Redis to mitigate the cache eviction or synchronization problem. We can also use the HTTP caching mechanism to save whole network hops. Cache invalidation can be as simple as specifying a TTL - time to live. We can also use conditional GET requests, e.g. with the If-None-Match HTTP header. Or we can emit events to notify client code that the cache should be invalidated. In general we should be careful about caching in too many places.
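TTL-based invalidation can be sketched as a small client-side cache. The `TTLCache` class, the `loader` callback and the injectable `clock` are hypothetical names for this illustration; a shared cache such as Redis offers key expiry as a built-in feature.

```python
import time


class TTLCache:
    """Cache values for a fixed time to live, then reload them on demand."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}   # key -> (value, expires_at)

    def get(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]  # still fresh, no call to the origin needed
        value = loader(key)  # expired or missing: go back to the origin
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        # Hook for event-based invalidation, e.g. on a "record changed" event.
        self._entries.pop(key, None)
```

The trade-off is visible in the TTL value: a longer TTL saves more calls but lets clients see stale data for longer.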
Chapter 14 - User Interfaces
Although they are not recommended, Newman still sees frontend silos in many organizations that use microservices. The major reasons are scarcity of specialists, a drive for consistency and technical challenges. We should try to remove silos while helping our colleagues skill up at the same time. Also, consistency is not a universal right. Sometimes we need to sacrifice some of it for greater team autonomy.
Monolithic Frontend is a frontend handling UI for all backend microservices. This is the most common approach, often with a dedicated frontend team. It works best when we want our frontend in one deployable unit, which is rarely the case.
Micro Frontends is a pattern where parts of the frontend can be deployed independently. There is widget-based decomposition and page-based decomposition. One challenge is library versions: widget A could use React v16 while widget B uses React v15. This will probably also bloat the bundle size.
Central Aggregating Gateway can help us minimize the number of backend calls the frontend must make to get the information it needs. The problem with this pattern is the ownership of the gateway. We might effectively create another silo, which is never a good thing.
Backend for Frontend (BFF) pattern solves some of the issues of the central aggregating gateway. Different frontends have different endpoint needs, especially mobile vs web. When we give each frontend its own dedicated backend, we might solve many of these issues. When there is too much duplication between two BFFs, we can extract it into a new service. A small amount of duplication doesn't hurt that much, as long as the BFFs remain thin.
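A minimal sketch of what a thin BFF does: it aggregates several downstream calls into one response shaped for a single frontend. The wishlist scenario, the function name and the three injected service callables are assumptions for illustration; real code would make HTTP calls, probably in parallel.

```python
def build_mobile_wishlist(customer_id, get_wishlist, get_product, get_stock_level):
    """Aggregate three downstream services into one compact mobile response."""
    summary = []
    for item_id in get_wishlist(customer_id):
        product = get_product(item_id)
        summary.append({
            "id": item_id,
            "title": product["title"],  # the mobile screen only needs the title
            "in_stock": get_stock_level(item_id) > 0,
        })
    return summary


# Stubs standing in for the downstream microservices.
wishlists = {"customer-1": ["p1", "p2"]}
catalog = {
    "p1": {"title": "Blue Juice", "description": "long text"},
    "p2": {"title": "Red Tea", "description": "long text"},
}
stock = {"p1": 3, "p2": 0}

view = build_mobile_wishlist("customer-1", wishlists.get, catalog.get, stock.get)
print(view)
```

The mobile client makes one call instead of one per item, and the aggregation logic stays with the team that owns that frontend.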
GraphQL is another alternative, where we give the frontend the ability to query the backend for exactly the data it needs. So there is no need to redeploy the backend for every change in aggregation.
Chapter 15 - Organizational Structures
According to Conway's Law, if we want to get the most out of microservices, we have to take organizational structures into account as well. We need loosely coupled teams and less need for coordination between them. Autonomous teams perform best when they can:
- Make large-scale changes to the design of their system without permission from anybody outside the team.
- Make large-scale changes without depending on other teams making changes in their systems.
- Complete their work without communicating and coordinating with people outside the team.
- Deploy and release their service on demand, regardless of other services.
- Do most of their testing on demand, without needing an integrated environment.
- Perform deployments during business hours without downtime.
When Microsoft analyzed the causes of bugs in various parts of Windows Vista, it concluded that the organizational structure, e.g. the number of developers in a team, had the most impact. Amazon recognized early on that the optimal team size is 8-10 customer-facing people and developed AWS to make this work. Netflix says that people who talk to each other should sit close together. The geographical distribution of people should be a major concern when deciding on software boundaries. The trick seems to be to create large organizations from many small teams. The biggest cost at scale seems to be the need for coordination. When it comes to Amazon's two-pizza teams, most people focus on the size and miss the point. The point is to give a small team all the autonomy it needs to deliver whatever it has to deliver.
We should prefer a strong code ownership model. In a microservice world there is a place for enabling teams, i.e. teams that handle cross-cutting concerns. They can be used for knowledge sharing and for implementing the platform for all the teams. The platform team should operate like a consultancy inside the organization and serve all the teams. It can also pave the road for other teams, but the paved road has to be optional: if there are too many restrictions, teams will bypass them. Microservices should not be shared between teams, because that requires too much coordination. Teams should own their microservices and have a way of vetting or approving pull requests from other teams. Only when a service is mature and rarely changed might it be wise to open it up for general contribution. A service with lots of inbound pull requests might be a sign that it has effectively become a shared microservice. We also want peer review, and almost never external code review, because external reviews don't add much value. Code reviews should be done promptly. The authors of Accelerate found no correlation between reviewing only high-risk changes and team performance, but they did find a correlation between reviewing all code changes and team performance. We should not rush our developers into out-of-hours support. No matter how it looks, it is always a people problem.
Chapter 16 - An Evolutionary Architect
The image of an architect who draws blueprints and hands them over to someone else to implement has done the most harm to the name. Architecture happens, either by design or by accident. It is the shared understanding of the system among senior staff. It is a social construct. Another view is that architecture is the things people consider hard to change.
Architects should instead enable change. They should be more like town planners than classic architects. They should develop a framework that lets others work more efficiently. For example, software boundaries are like the zones of a town. The habitability of a code base is how understandable it is for newcomers. Architects should clearly communicate the technical vision, adapt it, and understand challenges as they emerge.