Notes about Building Microservices
These are notes for my future self about Sam Newman's book, Building Microservices 2nd edition.
Microservices are an approach to distributed applications that uses finely grained services which can be changed, deployed, and released independently. Microservices have become the default go-to architecture when designing systems, which Newman finds hard to justify #surprised, and he wants to share his view on why.
Chapter 1 - What are Microservices?
Chapter 2 - How to model Microservices
Chapter 3 - Splitting the Monolith
Chapter 4 - Microservices Communication Styles
- Synchronous blocking, when a microservice calls another service and waits for a response. This is the request-response kind of communication, typically implemented by REST over HTTP or RPC. The advantage here is that it is simple and familiar. The disadvantage is temporal coupling - the downstream service has to be online. This pattern becomes problematic in longer chains of downstream calls. It also causes resource contention - threads from the thread pool are usually blocked while waiting. We can try making some of the calls in the chain asynchronous, thereby removing them from the critical path.
- Asynchronous nonblocking means that the service continues its operation regardless of whether the downstream call was received or not. Even the request-response communication style can be implemented using this pattern, but then the response could return to a different node of the microservice. The advantage is that the services are decoupled temporally. The disadvantage is the level of complexity and the range of choices. The obvious use cases are long chains of calls or long-running processes.
- Request-response style is when the upstream service expects some response from the downstream service. Asynchronous nonblocking request-response communication is typically implemented with queue-based brokers. These sorts of communication always require some sort of timeout handling. They work well in situations where additional processing is needed based on the response, or when some mitigation action, like a retry, is required.
- Event-driven style works by services emitting events which subscribed services receive and act upon. This is usually implemented by topic-based brokers or REST over HTTP (Atom). It is a kind of inversion of responsibility for handling the event. RabbitMQ is a popular implementation choice for this communication pattern. We should keep our pipes dumb and our endpoints smart. Events should contain all the data needed to perform the downstream action. This communication style is useful whenever information should be broadcast and we are happy with inverting the intent. Newman gravitates to this style almost as a default. Just beware of the increased complexity this style adds.
- Common data is when services communicate via some shared data, e.g. a database or filesystem. There is usually some polling involved. Two common examples are the data lake and the data warehouse. The advantages are that this is quite simple to implement, data size is not so much of a concern, and the technologies of the services can be very different. The disadvantage is the need for a polling mechanism, which means this is not useful in low-latency situations. The common data also introduces more coupling.
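The event-driven style from the list above can be sketched with a toy in-memory broker (all names here are illustrative, not from the book): the emitter publishes to a topic and never knows which handlers react, which is exactly the inversion of responsibility Newman describes.

```python
from collections import defaultdict

class Broker:
    """Toy topic-based broker: routes each event to all subscribers."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # The publisher does not know (or care) who handles the event.
        for handler in self._subscribers[topic]:
            handler(event)

broker = Broker()
notifications = []

# The event carries all the data the downstream actions need.
broker.subscribe("order-placed", lambda e: notifications.append(f"email to {e['customer']}"))
broker.subscribe("order-placed", lambda e: notifications.append(f"reserve {e['item']}"))

broker.publish("order-placed", {"customer": "alice@example.com", "item": "book"})
```

A real broker such as RabbitMQ adds durability, delivery guarantees, and network transport, but the routing idea is the same.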
Part II - Implementation
Chapter 5 - Implementing Microservice Communication
- RPC, like SOAP and gRPC. Their main selling point is the ease of generating client code. The disadvantage is technology coupling, as they typically rely on a specific platform. Just beware that remote calls are not like local calls. Some RPC protocols can be quite brittle. Newman quite likes gRPC in situations where he is in control of both client and server.
- REST over HTTP is the obvious choice for synchronous request-response communication. The main challenges here are the lack of good documentation tools - although Swagger tries to fix this issue - and the performance hit of big payloads.
- GraphQL for querying the backend, which then does not have to be released so often, since clients can change their queries without server-side changes. The clients are typically GUIs, with mobile as the obvious case. However, this can have a big performance impact on the backend if the queries are not made with caution. Another issue is that while GraphQL handles reads quite well, it is not designed for writes. Also, using GraphQL should not let us slip into thinking that microservices are just thin layers above our database.
- Message brokers for asynchronous communication via queues and topics. These are typically middleware boxes which sit between the microservices, providing queues, topics, or both. Topics are good for event-based collaboration, while queues are good for request-response communication. One of the selling points is the guarantee of delivery; another might be the guarantee of message ordering. Some brokers provide transactions on write. One controversially marketed advantage is exactly-once delivery, but this is quite hard to achieve, so we should always ask how it is done. Popular choices are RabbitMQ, ActiveMQ, and Kafka.
- Kafka helps move large amounts of data by its streaming capabilities. It has message permanence and built-in support for stream processing.
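The broker-based asynchronous request-response pattern mentioned above can be sketched with plain in-process queues standing in for something like RabbitMQ (the queue names and message fields are my own, purely illustrative): the upstream side tags each request with a correlation ID so it can match replies to requests.

```python
import queue

requests = queue.Queue()   # stand-in for a request queue on the broker
responses = queue.Queue()  # stand-in for a reply queue

def upstream_send(payload, correlation_id):
    requests.put({"correlation_id": correlation_id, "payload": payload})

def downstream_worker():
    # Consume one request and publish the reply, echoing the correlation id
    # so the upstream side can match the response to its request.
    msg = requests.get()
    result = msg["payload"].upper()  # pretend "processing"
    responses.put({"correlation_id": msg["correlation_id"], "result": result})

upstream_send("hello", correlation_id="req-1")
downstream_worker()
reply = responses.get()
```

In a real system the worker runs in another process, and the upstream side would also apply the timeout handling the notes call out.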
Handling Change Between Microservices
- Expansion changes are when we add new things to APIs and don't remove old things.
- Tolerant reader is when consumers are quite flexible in what they accept.
- Right technology means we should choose technology which can handle backwards compatibility better.
- Explicit interface means we should have a schema.
- Catch accidental breaking changes early. There are tools like Protolock, json-schema-diff-validator, Confluent Schema Registry and similar.
- Lockstep deployment of the changed service and all its client services together. This obviously breaks the independent-deployment advantage of microservices, so it should be used sparingly. If we use it too often, we might have a distributed monolith problem.
- Coexist multiple microservice versions, which means running two versions of the service side by side. This means that every bugfix has to be done in two places.
- Emulate the old interface, which means having the old version of the endpoint in the same service. This is the generally preferred approach. The old interface can be gradually deprecated and eventually removed. Some companies take quite extreme measures, e.g. turning off the old interface after a year or adding sleeps in the old interface to persuade clients to upgrade to the new one.
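The tolerant-reader idea from the list above can be sketched like this (the payload and field names are made up for illustration): the consumer picks out only the fields it needs and ignores everything else, so an expansion change on the producer side does not break it.

```python
import json

def read_customer(raw):
    """Tolerant reader: take only what we need, ignore unknown fields."""
    doc = json.loads(raw)
    return {
        "id": doc["id"],                          # the one field we require
        "email": doc.get("email", "<unknown>"),   # optional, with a fallback
    }

# v2 of the producer added "loyalty_tier"; this consumer does not notice.
v2_payload = '{"id": 42, "email": "a@b.com", "loyalty_tier": "gold"}'
customer = read_customer(v2_payload)
```

The same reader keeps working against both the old and the expanded payload, which is what lets producer and consumer release independently.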
Service Discovery
- Domain Name System (DNS). The more we cache DNS entries, the staler they can become. We would usually put a load balancer behind the DNS entry to avoid the stale-entry problem.
- Dynamic service registries, e.g. ZooKeeper, which manages configuration, synchronizes data between services, and provides leader election, message queues, and a naming service. ZooKeeper is often used as a general configuration store. In reality there are so many tools better suited for dynamic service registration that Newman suggests not using ZooKeeper for this purpose. Consul provides configuration management and service discovery, and uses a RESTful HTTP API for everything. Other options are etcd and Kubernetes.
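A toy sketch of what a dynamic service registry does underneath (real tools like Consul add health checks, consensus, and an HTTP API; every name here is illustrative): instances register with a TTL and must heartbeat to stay listed, so crashed instances drop out of lookups on their own.

```python
import time

class Registry:
    """Instances register with a TTL and heartbeat to stay listed."""
    def __init__(self, ttl_seconds=30):
        self.ttl = ttl_seconds
        self._instances = {}  # (service, address) -> last heartbeat time

    def register(self, service, address, now=None):
        self._instances[(service, address)] = now if now is not None else time.time()

    heartbeat = register  # heartbeating just refreshes the timestamp

    def lookup(self, service, now=None):
        now = now if now is not None else time.time()
        return [addr for (svc, addr), seen in self._instances.items()
                if svc == service and now - seen <= self.ttl]

reg = Registry(ttl_seconds=30)
reg.register("orders", "10.0.0.1:8080", now=0)
reg.register("orders", "10.0.0.2:8080", now=0)
reg.heartbeat("orders", "10.0.0.1:8080", now=40)  # only this one stays fresh
```

A lookup at `now=50` then returns only the instance that heartbeated recently; the silent one has aged past its TTL.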
Service Meshes and API Gateways
Chapter 6 - Workflow
Chapter 7 - Build
Chapter 8 - Deployment
- Isolated execution means that we can start and stop each microservice individually without affecting the others. There is a tradeoff between stronger isolation and lower cost + faster provisioning. The classic options are Physical machine, Virtual machine or a Container.
- Focus on automation starts to become important at scale, when we start to have more microservices.
- Infrastructure as code means that infrastructure definitions should be kept in source version control, like any other code.
- Zero-downtime deployment means services should not stop during deployment.
- Desired state management is about defining the target environment and letting the tools get to the desired state. Things like the number of instances and memory requirements can be defined as the desired state. Fully automated deployment of each service is a prerequisite for desired state management. With GitOps the desired state is stored as code.
- Physical machine with no virtualization. This can be very inefficient and is rarely seen today.
- Virtual machine. The physical machine is split into multiple virtual ones. E.g. Netflix uses AWS EC2 for its services instead of containers.
- Container, which might be managed by a container orchestration tool like Kubernetes. Containers have become the de facto choice for running microservices these days. Normally the containers on the same machine share the same kernel, so you don't get the same isolation as with virtual machines. With Docker we got the concept of an "image" for containers, along with a nice set of tools.
- Application container which manages other application instances in the same runtime. Examples are Weblogic or Tomcat. The lack of isolation is the main reason this pattern is not used too much with microservices.
- Platform as a Service (PaaS), like Heroku, Google App Engine, or AWS Elastic Beanstalk. This is a higher-level abstraction where you deploy the application into a specific environment which offers a set of system services, like database access, caching, and more. Heroku remains the gold standard in this field. The smarter the PaaS solutions try to be, the more they tend to go wrong.
- Function as a Service (FaaS), like AWS Lambda or Azure Functions. This is the only real contender to Kubernetes when it comes to popularity with microservices. The "serverless" here means that developers should stop worrying about servers at all. You can only control the memory allocated to the functions. Function invocations have to be stateless. A challenge is the spin-up time, which can be significant for languages such as Java. We can map functions to services or, e.g., to aggregates.
- If it ain't broke, don't fix it. Use whatever technology works for you.
- Give up as much control over infrastructure as you feel happy with and then give up some more. Maybe you will like the PaaS more than you think. Both Heroku and Zeit are excellent tools for this.
- Otherwise expect Kubernetes in your future.
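The desired state management bullet above boils down to a reconciliation loop: compare the declared target with what is actually running and act on the difference. A toy sketch of one pass of such a loop (this is the general controller idea behind tools like Kubernetes; the instance naming is made up):

```python
def reconcile(desired_count, running):
    """One pass of a desired-state loop: start or stop instances
    until the running list matches the declared target count."""
    actions = []
    while len(running) < desired_count:
        instance = f"instance-{len(running)}"
        running.append(instance)
        actions.append(("start", instance))
    while len(running) > desired_count:
        instance = running.pop()
        actions.append(("stop", instance))
    return actions

running = ["instance-0"]
actions = reconcile(desired_count=3, running=running)  # scale up to 3
```

Running the same function again with the same target is a no-op, which is why the loop can run continuously: it converges on the desired state and then stays there.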
Chapter 9 - Testing
Chapter 10 - From Monitoring to Observability
- Log aggregation into a single place to look at. This is actually a prerequisite for building microservices. We should pick a common and sensible log format to make things easier, and continue by adding a correlation ID to our log lines to make them more useful. Beware that we cannot get precise timing of events from logs alone; for that we need something like distributed tracing. There are many tools helping with this, e.g. the open-source Kibana, or the commercial Humio or Datadog.
- Metrics aggregation to capture the big picture and maybe even scale our applications. There are tools in this area as well, e.g. the open-source Prometheus. For high-cardinality data Newman recommends Honeycomb or Lightstep, which are often seen as distributed tracing systems but can handle metrics aggregation as well.
- Distributed tracing to detect latency problems. E.g. Honeycomb shows us a flow diagram of each of the calls so we can detect where most of the time was spent. Another tool is the open-source Jaeger; commercial tools are Honeycomb and Lightstep.
- Are you doing OK? Checking various SLAs, SLOs, and the like. SLOs are service-level objectives. We can even implement error budgets for teams to give them a clear view of how well they are doing against their SLOs.
- Alerting - what should we alert on, and what does a good alert look like? Too many alerts can cause significant issues, and a problem in one area can cause problems in another. Alerts should be relevant, unique, timely, prioritized, understandable, diagnostic (clear about what is wrong), advisory (helping us understand what actions to take), and focusing (drawing attention to the most important issues). We should maybe try a more holistic approach to deciding when to wake people up at 3am.
- Semantic monitoring - think differently about what should wake us up at 3am. Think higher-level. E.g. we should monitor that we are selling roughly the same number of orders as in the weeks before, that new customers can register, etc. We are not talking about low-level stuff like disk monitoring. This is the kind of monitoring where the product owner should come in and help define it.
- Testing in production. With synthetic transactions we insert fake user behavior into our system. This behavior has known inputs and known expected outputs. It has proven to be a better indicator of issues than lower-level monitoring, and we can use tools for automated UI tests to implement it. With A/B testing we deploy two or more versions of a functionality, present each group of users with one of them, and then measure their performance. A canary release is a pilot of new functionality to a limited number of users. A parallel run means there are two implementations of the same functionality and their results are compared. Smoke tests are performed right after deployment. Chaos engineering was popularized by Netflix and their Chaos Monkey, which kills some of the services right in production.
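The correlation-ID idea from the log aggregation bullet above can be sketched with the stdlib logging module: a filter stamps every line with the ID of the request that triggered it, so lines from many services can later be stitched back together in the aggregator. (The field name `correlation_id` and the logger setup are my own illustration, not a prescribed format.)

```python
import io
import logging

correlation_id = {"value": "-"}  # would normally come from the incoming request

class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""
    def filter(self, record):
        record.correlation_id = correlation_id["value"]
        return True

buffer = io.StringIO()  # stand-in for stdout / a log shipper
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(correlation_id)s %(message)s"))
handler.addFilter(CorrelationFilter())

log = logging.getLogger("orders")
log.setLevel(logging.INFO)
log.addHandler(handler)

correlation_id["value"] = "req-4711"   # set once, when the request arrives
log.info("reserving stock")
log.info("charging card")
lines = buffer.getvalue().splitlines()
```

Grepping the aggregated logs for `req-4711` now yields every line this request produced, across all services that propagate the ID.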
Chapter 11 - Security
- Identify potential attackers, what their targets are, and where we are most vulnerable. The targets might be user credentials, secrets, etc. We should manage the secrets' creation, distribution, storage, monitoring, and rotation. We can limit the scope of the secrets, thereby limiting the impact of their compromise.
- Protect the key assets from attackers. We should also keep our systems patched. When using backups, we should test that they really work.
- Detect if the attack happened.
- Respond when we found out something bad happened.
- Recover in the wake of an incident.
Chapter 12 - Resiliency
- Time-outs we should get right. We should continually tweak our timeout settings to achieve better robustness.
- Bulkheads, i.e. using separate connection pools for separate services. This way we limit the impact of one service going down #interesting.
- Circuit breakers are recommended for all downstream calls.
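A minimal circuit breaker sketch (the threshold and names are illustrative; real implementations add a half-open state and reset timers): after enough consecutive failures the breaker opens and fails fast, instead of tying up threads hammering an unhealthy downstream service.

```python
class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failures = 0

    def call(self, func, *args):
        if self.failures >= self.failure_threshold:
            # Fail fast without touching the downstream service.
            raise CircuitOpenError("downstream looks unhealthy")
        try:
            result = func(*args)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # a success closes the breaker again
        return result

breaker = CircuitBreaker(failure_threshold=3)

def flaky():
    raise TimeoutError("downstream timed out")

outcomes = []
for _ in range(5):
    try:
        breaker.call(flaky)
        outcomes.append("ok")
    except CircuitOpenError:
        outcomes.append("open")
    except TimeoutError:
        outcomes.append("timeout")
```

The first three calls hit the real timeout; from the fourth call on, the open breaker rejects immediately, which also ties in nicely with the bulkhead bullet: the failing dependency stops consuming our resources.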
Chapter 13 - Scaling
Chapter 14 - User Interfaces
Chapter 15 - Organizational Structures
- Teams can make large-scale changes to the design of their system without permission from anybody outside the team.
- Make large-scale changes without depending on someone else making changes elsewhere.
- Complete their work without communication and coordination with people outside the team.
- Deploy and release their service on demand, regardless of other services.
- Do most of their testing on demand, without the need for an integrated environment.
- Perform deployments during business hours without downtime.