In the previous article we uncovered ways to pass the DSA interview. The next part of the Ultimate interview preparation framework demystifies the system design interviews.

# What to expect

So far in my career I've met three types of interviews they call "System design":

A candidate is only given a task that states to design a system, evaluate constraints, requirements and negotiate a possible solution. This is a classic "FAANG style" system design, such as "Design Twitter", "Design Google Drive", "Design Payment System" and so on. This kind of interview can be learned and done by the book.
A candidate is given a diagram of an existing system and asked to evaluate it, spot potential issues, bottlenecks, areas of improvement and opportunities for redesign.
There could be other options, which are not strictly system design, but are related to it and sometimes referred as "system design":
- Improve a given code snippet to the best degree possible. Typically that best degree is a fully-fledged microservice.
- Improve a given database structure. So they must fix obvious things like absence of primary and foreign keys, de-normalisation-without-apparent-reason, and so on.

Don't come unarmed, scout the battleground. It is always a good idea to call your recruiter and ask to give your pointers on what to expect. Recruiters are sincerely interested in your success, so they cooperate willingly.

# How to behave

An interviewer typically says the interview is open-ended, so it potentially can go on and on. However, the way they say this is a bit deceiving, because the amount of time given is limited, and typically there is always a certain checklist of things they expect to hear and cover. Failing to do so will most likely lead to rejection. That's why:

Always know how much time you have. Normally it is one hour, sometimes 1.5 hours. Don't loose track of time, plan your interview.
Don't spend too much time on the introduction round. In most of the cases this is just a certain kind of a ritual that typically doesn't let you earn any score.
Always know what the expectations are. Is it to build a distributed system? Or just fix the code? It is better to discuss this with the recruiter beforehand.
Don't spend time on chit-chat and idle talks, trying to show good personality of yours. Good personality won't get you the offer. Be polite and positive, but focus on the task.
Always listen to the interviewer, as they may steer the discussion to the desired direction, in case if you are a bit off the road.
Drive the interview process, as this is always expected.

# Option 1: design a system from scratch

Once again, it's expected that the candidate drives the interview process. Imagine this being a real job task, when you are asked to design an application, and you make a presentation for your colleagues.

For better results, there is a certain script that should be followed. I call it "The FAANG script", because this is how the interviews go in Facebook, for instance.

# 1. Read the task and ask questions

Imagine you receive a task from your project manager, and it says: "Build me a system X". The very first and reasonable reaction from your side would be "Where are the requirements?" or "Should we outline the requirements together?". This is exactly the same with this kind of interviews. The first thing to do is to start asking clarifying questions to define some acceptance criteria.

If nothing comes up, start asking the standard questions, such as:

What are the most important system requirements (features)? - a broadly scoped question that can help you get many useful insights.
If the system to be designed is well-known (such as Google Drive or Twitter) and enormous, it makes sense to ask instead What features are important? to narrow down the scope a little.
How many DAU (daily active users) should the system handle? - a good question for the back of the envelope estimation (see below). It is also quite generic and is asked more out of politeness, because they will most likely answer "ten thousands" or so.
For how long the data must be stored?
What is the average size of one data item? - another good question for the back of the envelope estimation.
...

Starting off with drawing blocks and endpoints having the questions skipped can lead to a failure. Just like you can't jump over the discovery phase working on a real feature, you can't proceed with the design if you poorly understand what must be designed. If you skip this step, it's a clear sign that at your current position you are mostly a doer, not a researcher.

# 2. Make a list of must-have features and requirements

A list of basic features must be written down, unless given. Without a very well scoped and fostered list of mandatory features it's not possible to proceed, because then it would be hard to outline the limits and the scope of the solution itself.

There are two types of requirements:

Functional - list of features that are required to be implemented.
- The system must have feature X and allow Y
- The system must have feature Z and allow W
- ...
Non-functional - properties of the system such as:
- Scalability - how well the system can handle increased load
- Performance - how fast the system responds to requests
- Availability - how well the system can handle failures
- Latency - how fast the system responds to requests
- Privacy - how the system handles sensitive data
- Consistency - does the system tolerate eventual consistency, or should it be strong?
- Cost - how much it costs to run the system at scale
- Accuracy - how accurate the system is, what is the acceptable error rate?
- Concurrency - how the system handles concurrent requests

# 3. Propose a high-level architecture, get the buy-in

Here we can talk about tiers (web tier, VPC tier), very high level parts of the app, such as the components of the backend and the frontend. Also talk about the way of communicating between the parties, discuss contracts, endpoints, etc.

No need to care about the performance concerns at that stage, no need to introduce any caching or multi-thread processing. The main components of the system and the way the parts interact with each other must be defined. At this moment outline the system structure for lower scale, and in the next step you can optimize the system for higher scale.

✅ You can map the journey of a request through the system using colored dots, numbers and arrows.

✅ While mapping the request flow, it's a good idea to start proposing the microservice architecture. However, be mindful abut the following:

Microservices introduce communication overhead, so be ready to talk about eventual consistency, retries, circuit breakers, rate limiting, etc.
Your soultion should avoid heavy data fanouts (one service requests a lot of data from another).

At this point it's better to refrain from talking about specific technologies or frameworks, it's too early for that!

# 4. Deep dive into every component of the system

Outline the contract between the parts. Dive into every component layer by layer, one component at a time. Ask if some deeper explanation is needed.

At this point it makes sense to talk about scalability, availability, consistency, etc. For example, tell about the ways you would scale the microservices you proposed in the previous step, load balancers, sharding, etc.

If the time allows, you can talk about specific technologies and frameworks that could be used to implement the system.

# 5. Talk about resources needed

In conclusion, you should talk about how much CPU/memory/storage/network bandwidth/operational costs is needed to run the system.

Talking about security, observability, etc would be a good bonus.

# 6. Talk about edge cases

There may be some edge cases that may require special handling. For example, for the "Design Twitter" task it makes sense to mention so-called celebrities, who are users that are followed by a lot of other users.

Here is an example of system design, similar to which allowed me to land an offer once.

🚨 Important: this interview usually happens on-site, but can also be done remotely. For the on-site interview, you will be given a whiteboard and a pen. If the second is the case, familiarize yourself with the tool you will be working with. In my opinion, the best choice is Excalidraw, because it's free, simple and offers better UX, comparing to the alternatives.

Memoize the hotkeys. You can also prepare a starter diagram to save some time drawing, and then just fill it in. Here is an example:

# Option 2: evaluate an existing system

This is a less common interview type, but it works better for remote interviewing, because in this case only talking happens. You could be asked a series of questions about the system reliability, scalability and general functionality. Then, a couple of hypothetical situations can be offered for additional discussion.

Here are some examples of questions you could be asked:

The service on the diagram can handle low traffic well. Now the traffic has increased (e.g. during Black Friday), identify the bottlenecks and choking points and propose an improvement. Typical improvements you can pull out of your sleeve:
- Message broker instead of database polling or direct service-to-service communication
- Horizontal scaling of the service components (+ load balancer) and the database
- Caching (with caution)
- Async operations (job queues) instead of sync ones
- Retries, circuit breakers, rate limiting
We need to add functionality X to the system. What are the possible ways to do it?
A message broker suddenly goes down. What are the possible ways to handle it?
When the message broker goes up again, how would you cut out outdated messages, deal with the message pileups due to the consumers not being able to process them fast enough?

General advice here: take nothing you see for granted. There could be any kind of bullshit on the original diagram, feel free to propose alternatives and reasoning behind them.

# AI assisted interview

At this stage limited AI assistance may be allowed, but an explicit permission should be granted. The company is interested in assessing your take and your approach in solving a real world problem, so they will be also observing which tools you are willing to utilize in order to complete the task.

# Key concepts and building blocks

From the technical point of view, there is a set of theoretical concepts about the system design as such, that are typically assumed and not discussed in depth, but it is expected for the candidate to have good understanding of.

This section is heavily inspired by an amazing book System Design Interview – An insider's guide, Vol. I that I've recently read. I highly recommend it.

# CAP theorem

In every distributed system there are three parameters:

Consistency - each part of the system must have the most recent data.
Availability - the system must still function even if some part of it is down.
Network Partition tolerance - the system must tolerate temporary or permanent disruptions of connectivity between its parts.

The CAP theorem basically states, that any system always sacrifices either one of three properties, and keeps the rest two, so it's always a trade-off. Since the Network Partition tolerance is a must for every system, in real life it boils down to the following:

CP systems - consistency-first oriented. Such systems have a lot of blocking and sync operations in order to be consistent, I don't think a candidate will be asked to build one of those. A good example of such a system is a banking/financial application.
AP systems - can be inconsistent, but have high availability. This is 99% of all web applications.

# Eventual consistency

When building an AP system, one rule must be set: the system is allowed to be inconsistent, but only for a certain period of time (the shorter, the better). Eventually, the system must become consistent again, until the next change happens. This is called eventual consistency. A good example of an eventually consistent system is two microservices A and B talking to each other using Kafka. The microservice A received an update, and the update was communicated to the microservice B, but the message is not yet consumed and remains in a queue. It is said, that the system is currently inconsistent. But give it time, it will become consistent, eventually.

# Vertical Scaling vs Horizontal Scaling

When load on one of the system's element grows, and it starts choking up, there are two ways to deal with it.

Vertical scaling - giving a instance more CPU power and memory amount to cope with increased load.
- 👍 Rather cheap to do, especially in the cloud.
- 👎 There is always a limit to what the amount of resources can be increased.
Horizontal scaling - add more copies of the element, and spread the load between them.
- 👍 In theory, thanks to consistent hashing, horizontal scaling could grow the system to any amount of copies / partitions.
- 👎 Additional infrastructure is needed to manage traffic between the copies / partitions.

When it is said, that the system should be able to process high load and hundreds of thousands of requests per second, it usually means the system should effectively scale horizontally.

In the amazing Node.js Design Patterns book I saw a good diagram that unfolds the horizontal scaling concept. Basically, in the very center of a 3-dimensional coordinate system (0,0,0) stands a monolith running as a single instance, connected to a single database instance. And then you can scale horizontally in 3 orthogonal directions:

X-axis: add more instances of the application itself, to balance the load between them.
Y-axis: separate the responsibilities and domains by splitting the application onto parts (microservices and CQRS)
Z-axis: add sharded database, so each instance serves a subset of the data.

# Trading space for time and the other way around

It ain't possible to cheat laws of physics. The system you build always makes a tradeoff between time of execution and memory consumption. The algorithm in use can be either fast, but hungry on memory, or the other way around, or something balanced.

This is why every algorithm typically has two parameters of effectiveness: time complexity and space complexity. And that's exactly the reason why you shouldn't claim that the Bubble Sort algo sucks and Quicksort doesn't. It simply depends on what circumstances every specific algorithm is used under: the Bubble Sort has space complexity of O(1), so it is perfect for micro-controllers that only have as little as 2kb of RAM and low performance is generally expected.

# Sync and async actions

There are two types of processes that usually take place inside any system.

Synchronous - an action is performed right away, and the user or another element of the system, which triggered the action, must wait.
- a user gets a list of books via REST to see it in the UI,
- one service calls another via gRPC to get a list of products.
Asynchronous - an action is triggered and then enqueued, thus postponed. The user or element moves on with their business and gets notified when the action is completed. Examples:
- generating image previews after upload,
- generate CSV files containing exported data and upload to a bucket.

Before committing to either one, the following question must be asked and answered: "Is the result needed now (in real time) or later?".

In order for the system to successfully scale horizontally to handle high amount of load, all heavy processes in the system must be switched over to the async mode.

# In-memory vs disk operations

Every system that stores the data must decide where to store it. There are two options:

in-memory storage (e.g. Redis Key-Value, Redis Pub/Sub)
on-disk storage (file system, relational databases)

As usual, there is no right or wrong approach, it all depends on the system requirements and the type of data the system deals with. Large files can't be stored in memory (at least entirely), frequently read data shouldn't be stored on disk due to high access latency.

Rule of a thumb: in-memory is faster, disk operations are slower.

# Persistency

The in memory vs disk goes hand in hand with the persistency. Persistency is the ability to store the data on disk, so it can be retrieved later, and not all components of the system need it.

# Stateful vs stateless API

The term stateless means that at any point of time any instance of the application can serve any request from any user identically successfully. Simply put, a user session isn't stored on the server.

However, being stateful isn't necessary a bad thing. Websockets are stateful, because a permanent connection between a client and a server is preserved and thus every user is "bound" to a specific instance of the system.

Streaming can also be both stateless and stateful, depending on the concrete task.

If you want your system to effectively scale horizontally, you generally want your API stateless.

# Write-optimised vs read-optimised systems

It's always a good idea to understand whether the data in your system will be more frequently read or more frequently written. This is important to know, because then there is room for some optimisations. Most of the web applications are read-optimised, but there are exceptions such as metric and log aggregators, where there is constant influx of new data.

# Consistent hashing

Consistent hashing is a technique that keeps that whole idea of horizontal scaling afloat. Imagine we have N instances that process user requests. Basically, thanks to the consistent hashing we can take any incoming ID and then map it to one of the instances. When the same ID comes next time, it is mapped to the same instance once again. Furthermore, new additional or replacement instances can be added to the pool, old instances - removed, and the system can also keep track of the unhealthy instances and promptly redirect the requests.

A good application of a consistent hashing is database partitioning. Since every record has an ID, it can be unequivocally mapped to a certain instance now and later.

More information is here

In order for a database, message broker or in-memory store to scale horizontally, it should support partitioning.

# Application tiers

When designing the application, we must clearly understand the request flow, and what data we can trust, and which can't be trusted. Also, most of the time data is sensitive and must be protected by authorization and authentication. When the data is a subject to a change, we must also record who introduced those changes.

Typically, there are these tiers:

Public tier (or web tier) - untrusted tier, must be handled with caution and ideally should be protected with authentication,
VPC tier - trusted, because in this tier all communications between microservices happen,
VPN tier - half-trusted I would say. In case of a VPN, we know that the tier is not entirely public, but we still need to be careful with the data.
Database tier - here the data itself is trusted, but the authentication should still happen to protect the data from unauthorized access.

# Observability

We shouldn't neglect observability. Handling an un-observed service is like flying a plane without any cockpit devices: all you know is the engine makes sound and the earth is below, not above, which means the plain is still in the air and hasn't crushed (yet). But, obviously, this is not enough.

Observability (aka O11y) implements cockpit instrumentation for your application.

Here is a list of standard metrics that are typically of interest:

Request per second (RPS), per endpoint and total
Request duration, per endpoint
P99, per endpoint - shows the slowest endpoints
CPU & Memory consumption
Daily active users (DAU)

In order to observe how effectively your system scales, you need to have observability instrumentation in place.

# Contracts

One of the most important thing is to describe how the microservice communicates with the user and other parts of the cloud native application.

There are several transports that are good to know:

REST (OpenAPI standard)
GraphQL
RPC (gRPC, tRPC, ...)
Streaming (Websockets, Server-Sent Events, WebRTC, ...)
Event-based (Kafka, Google Pub/Sub, RabbitMQ, ...)

# Microservice architecture

It is compulsory to know the concept of the microservice architectures and the related design patterns. Be ready to talk about it. Also be ready to explain when microservices are not needed (small projects, PoCs, eventual consistency intolerant systems, etc.)

Here is a good website dedicated to the microservice architecture.

The microservice architecture is a design pattern where an application is broken down into independent, smaller services.

Concepts:

Single responsibility
Autonomy
Data isolation
Failure isolation

Advantages:

Scalability (each service can be scaled independently) and distribution
Faster deployment cycle
Technology agnostic

Disadvantages:

Higher maintenance cost (observability, orchestration, tracing, contracts, security)
Data consistency (eventual consistency, distributed transactions)
Network latency and communication overhead

# Data propagation patterns

There are two main patterns for data propagation, that are used to keep the system consistent:

2 phase commit - a coordinator is used to coordinate the actions between the parties. The coordinator sends a message to all parties to open a transaction simultaneously, and waits for a response. If all parties respond with "yes", the coordinator sends a final message to all parties to commit the transaction. If any party responds with "no", the coordinator sends a message to all parties to rollback the transaction. This allows strong consistency, but is blocking. Good for financial systems.
Saga - a coordinator or event-driven approach is used to coordinate the actions between the parties. Each party makes a change. If some party fails to make a change, all parties get to know about this and execute the undo action on their end. This allows eventual consistency, but is non-blocking. Good for order processing.

# Architecture patterns

Some of the most common architecture patterns:

API Gateway - a single entry point for all clients to access backend services.
Backend for Frontend - provides a custom API layer tailored for a specific frontend.
Service Mesh - a network of services that are used to communicate with each other. It is used to route requests to the appropriate service.
Peer-to-peer communication - a pattern that is used to communicate between services using a direct connection.
Event-driven communication - an architecture that is used to communicate between services using events.
Load balancing - a pattern that is used to distribute requests between multiple instances of the application.
Caching - a pattern that is used to cache data in a distributed system.
Circuit breaker/Retry - a pattern that is used to handle failures in a distributed system.
Private database - a pattern that is used to store data that is only accessible to the service itself.
Sidecar - a pattern that is used to add additional functionality to an already existing service without modifying it.
Command Query Responsibility Segregation - a pattern that is used to separate the read and write operations of a system, to let them scale independently.

Here is a list of all patterns.

Be ready to talk about the anti-patterns as well. Here is a good set of slides about the whole topic.

# Load balancing

A load balancer is a special kind of software that distributes requests between multiple instances of the application. In basic situations, the round-robin algorithm is used, but there may be variations. Typically the balancer is a built-in feature of the Cloud Platform or K8s, but on a C4 diagram it must be clearly highlighted.

# Databases and storage

When it comes to structured keeping of data, databases come into play. There are several main cohorts of databases/storage:

SQL
- row-based (MySQL/MariaDB/Percona, Postgres, AWS Aurora/RDS, GCP BigQuery, etc.)
- column-based (ClickHouse, Grafana Mimir, etc.)
NoSQL
- document-based (MongoDB, AWS DynamoDB, etc.)
- key-value-based (Redis, AWS DynamoDB, etc.)
- graph-based (AWS Neptune, etc.)
- time-series-based (InfluxDB, Grafana Mimir, etc.)
- search-based (Elasticsearch, etc.)
Somewhat SQL
- time-series-based (ClickHouse, etc.)
Storage
- object-based (AWS S3, etc.)
- remote filesystem (AWS EFS, etc.)

# Replication

When a database is read-heavy and high availability is needed, it makes sense to introduce multiple read replicas, and even put them into different availability zones. There will always be a master node, that will take the write operations and then propagate the changes to the replicas. Because of this, the system becomes eventually consistent, since the replicas will need time to catch up with the master. Therefore, there always be a trade-off about the replica lag.

Even the 1 master + 1 replica setup makes sense, when the data is read/write-heavy. Long writes affect reading, and heavy reads affect writing, so it's a good idea to delegate reading to the replica, even if it's just one.

# Sharding

When a database is write-heavy, and the flux of data is huge and diverse, table partitioning (sharding) is also possible, but it comes with a price.

NoSQL allows better partitioning, due to the flat nature of data, but JOINs are obviously not available natively.

When talking about partitioning, always mention the partition key. Without it, sharding doesn't make sense.

# ACID vs BASE databases

ACID databases are databases that guarantee Atomicity, Consistency, Isolation, and Durability. Example: MySQL, PostgreSQL, Oracle. Good for:

Financial systems
Banking systems
Healthcare systems
Reservation systems

ACID databases have blocking nature, transactional, scale mostly vertically, provide strong consistency.

BASE databases are databases that guarantee Basic Availability, Soft state, and Eventual consistency. Example: MongoDB, DynamoDB, Cassandra.

Analytics systems
Log aggregation systems
Monitoring systems
Document-based systems / CMS

BASE databases are non-blocking, scale horizontally (thus deal with heavy throughput better), provide eventual consistency.

# Database decision matrix

As everything mentioned above, every type of a database comes with own tradeoffs. To make a choice, follow the decision matrix below.

Goal ↓ / DB →	PostgreSQL	MongoDB	Redis	Cassandra
Strong Consistency	✅✅	✅ (readConcern)	❌ / Partial	✅ (QUORUM)
High Availability	✅ (replication)	✅✅	✅✅	✅✅✅
Low Latency	✅	✅✅	✅✅✅	✅✅ (local)
Sharding Support	✅ (Citus/manual)	✅✅ (native)	✅ (cluster)	✅✅✅ (native)
Global Distribution	⚠️ (complex)	✅	⚠️ (limited)	✅✅✅

Examples of questions that can be answered with the help of the matrix:

Are flexible reports needed? If yes, proceed with SQL.
Is your data mostly flat or can be considered as a unit? If yes, proceed with document-based NoSQL.
Is the data rather simple and low read-latency crucial? If yes, proceed with in-memory NoSQL.
Is the data graph-like? If yes, it is actually tricky, because both SQL and NoSQL can handle tree-structured data.
Will the amount of data grow indefinitely? If yes, then either go with NoSQL and partitioning or with SQL and periodical dumping to a cold storage.
Want to store meter readings/logs/stock prices? If yes, proceed with a NoSQL/SQL time-series-based database.

Keep in mind, that SQL databases due to the nature of JOINs typically don't scale well horizontally. There could be some optimisations made though, such as master node and read replicas. As mentioned above, when using replication, the database an becomes eventually consistent AP system.

# Other databases

There are also hassle-free cloud relational databases, such as GCP BigQuery, but they ain't cheap.

There sometimes may be a combination of databases implemented. For example, a service manages data using MongoDB, but once in every N hours the data is dumped to BigQuery to populate intricate analytics down the road.

# Caching

Cache is needed, when the data managed by a service is more frequently read than written. In that case, it's a good practice to put a cache in front of the long-term storage, such as a database.

Two important things to keep in mind:

Cache invalidation strategy
- Keep the data for too long, and you'll end up with stale data.
- Keep the data for too short, and you'll face frequent cache miss.
Maximum cache size and an eviction algorithm

There are some good caching techniques, such as

TTL - Time To Live, allows setting the cache expiration time
LRU, LFU - Least Recently Used and Least Frequently Used
Tagged caching - invalidates the cache by a label
Periodic/manual re-validation

# Cron jobs

CronJobs are periodical tasks used to keep the system up to date. The frequency of execution depends on the concrete task.

# Content Delivery Network (CDN)

CDN is a network of services typically used to deliver static content (such as images, videos, etc.) to the users, based on the user's location. The system reads assets from the source and caches them in the CDN nodes, so the next time the user requests the same asset, it is served from the CDN node, which is typically closer to the user.

The cache invalidation strategy is based on the TTL, URL, ETag, periodic re-validation, etc.

# Concurrency safety

When it comes to concurrency safety, there are two places where it must be considered:

On the application level, using locks, mutexes, semaphores, etc.
On the database level, using
- Pessimistic locking (row-level locking, table-level locking)
  - Pros: easy to implement
  - Cons: lead to deadlocks, performance impact
- Optimistic locking (versioning)
  - Pros: no deadlocks
  - Cons: additional logic needed, performance impact on high data contention
- Database constraints
  - Pros: easy to implement
  - Cons: not all databases support this feature, not always applicable

# Security

Security is crucial to have between untrusted parties, such as VPC and Web tier. Be ready to talk about:

IAM
- API keys
- Login & Password
- OAuth2 - authorization framework
  - grant types:
    - authorization_code - via UI user allows an application to access some resource (owned by the user)
    - client_credentials - used for machine to machine communication
    - refresh_token - used to refresh the access token
- OpenID - authentication and authorization framework
- Role-based access control
- Attribute-based access control
Digital signatures
- JWT - JSON Web Token
- JWKS - extension of JWS, allows storing public keys elsewhere by exposing an endpoint
- JWE - JSON Web Encryption, allows encrypting the token
TLS
- Handshake
- Asymmetric encryption (RSA, Diffie-Hellman)
- Symmetric encryption (AES)
- Signing algorithms (HS256, RS256, ES256)
mTLS and zero trust architecture and when it may be needed (banking, military, government, healthcare, etc.)
VPN
Vault for secrets management
Cookies
- SameSite - cookie is only sent in the same-site context
- HttpOnly - cookie is not accessible from JavaScript
- Secure - cookie is only sent over HTTPS
- Expires/Max-Age - allows setting cookie TTL (without it, the cookie is deleted when the browser is closed)
- Path - allows setting cookie scope (without it, the cookie is sent for all paths)

A bonus would be to mention ways to mitigate DDoS attack and prevention of resource bottlenecks by introducing:

API rate limiter - when a client sends too many requests, after reaching a certain threshold the requests are denied. Then there is cool down period. There are two main algorithms:
- leaky bucket - an incoming request adds a drop into a bucket. The request is denied when the bucket is full, but it constantly leaks at the bottom, creating more space.
- token bucket - a new token is added to the bucked at a certain rate. The request is denied when the bucket is empty.
Consumer throttling for the event based communication - when messages start coming in big numbers, you typically want to start making short pauses between acknowledgments, otherwise the CPU and DB CPU will be overwhelmed.

# Cost estimation

During back of the envelope estimation, you need to consider the cost of the resources.

You can make a rough estimation of the cost of the resources. For example:

Average Kubernetes cluster with two medium size nodes and a control plane on GCP could cost as much as 25*2 + 100 = 150 $ a month.
Redis cluster with 3 nodes 2 CPU 13GB RAM will be 250 * 3 = 750 $ a month.
Postgres with 2 nodes (primary + standby or primary + read replica) and storage is about 200 - 300 $ a month.

# Algorithms & data structures

We should be aware of the most commonly used data structures and algorithms on those. It's not typically asked to implement any of these, as it's not an DSA interview, but there must be understanding of the basic principles of each and when to use what.

For the system design it's ultimately important to understand the job queue algorithm, and what advantages it has over the traditional synchronous approach of data processing.

# Back of the envelope estimation

Basically, the Back of the envelope estimation is a technique that allows very-very-very rough estimation of average amount resources that the system will probably consume.

This must be well understood, as it could be asked during the interview.

Typically, what attracts interest is the following:

Query Per Second and peak Query Per Second
Storage size for N years
Bandwidth per second

In order to get these values, we need to know some indications:

Average active users (AAU) per a day (or daily active users (DAU))
Percentage of requests that save something
Average data size

Then we can easily make the calculations. Consider the example:

Let's say, that

1. AAU per day = 500 // this amount comes and does something on the platform
2. Data write requests = 50% // half of the users posts a message
3. Average message size = 300 kB

Then

1. Requests per an hour = 500 / 24 =~ 21
2. Users per a minute = 21 / 60 = 0.35
3. QPS = 0.35 / 60 = 0.006
4. Peak QPS = 2 * 0.006 = 0.012
5. Amount of requests that posts a message = 500 * 0.5 = 250
6. Message volume per a day = 250 * 300 kB = 73 mB
7. Bandwidth of new messages per second = ((75000 kB / 24) / 60) / 60 = 0.9 kB

📃 Copy

The code is licensed under the MIT license

# Links

Some useful links to read more about the topics covered in this article:

Microservices.io
Microservices anti-patterns in Melbourne - true classic never gets old
A pattern language for microservices

***

Well, that was a long post. As before, this article is a work in progress, I will enrich and expand it when I have new experience to share.

Sergei Gannochenko

Business-focused product engineer, in ❤️ with tech and making customers happy.

AI, Golang/Node, React, TypeScript, Docker/K8s, AWS/GCP, NextJS

20+ years in dev

The ultimate interview preparation framework. Part 4: System design

Articles in this series

Table of contents