Scalability
Scalability is the term we use to describe a system's ability to cope with increased load. It is not a one-dimensional label ("X is scalable"). Rather, we must discuss: "If the system grows in a particular way, what are our options for coping with the growth?"
Describing Load
Load is described with a few numbers, called load parameters. The best choice of parameters depends on the architecture of the system:
- Requests per second to a web server.
- Ratio of reads to writes in a database.
- Number of active users in a chat room.
- Hit rate on a cache.
Load vs. Latency
As utilization approaches 100%, queueing causes latency to spike sharply: requests arrive faster than the server can drain them, so waiting time grows without bound. This is why "scalability" in practice often means keeping utilization low enough (e.g., below 70%) to absorb traffic spikes.
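The latency spike near full utilization can be sketched with the classic M/M/1 queueing model (Poisson arrivals, a single server), where the mean time in the system is 1 / (service rate − arrival rate). The capacity figure below is an illustrative assumption, not a real benchmark:

```python
# M/M/1 mean response time: W = 1 / (mu - lam),
# where mu is the service rate and lam is the arrival rate (both req/sec).
# Utilization rho = lam / mu.

def mean_response_time_ms(arrival_rps: float, service_rps: float) -> float:
    """Mean time in system (queueing + service) for an M/M/1 queue, in ms."""
    if arrival_rps >= service_rps:
        raise ValueError("queue is unstable: arrivals exceed capacity")
    return 1000.0 / (service_rps - arrival_rps)

capacity = 1000  # assume the server handles 1000 req/sec
for rps in (500, 700, 900, 990):
    rho = rps / capacity
    print(f"utilization {rho:.0%}: {mean_response_time_ms(rps, capacity):.0f} ms")
```

Note how doubling load from 500 to 990 req/sec multiplies latency fiftyfold: 2 ms at 50% utilization versus 100 ms at 99%.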
Example: Twitter
Two main operations:
- Post tweet: User publishes a new message (4.6k req/sec avg, 12k peak).
- Home timeline: User views tweets from people they follow (300k req/sec).
Approach 1 (Pull): Posting inserts the tweet into a global collection. Reading a home timeline queries that collection, joining against the follow table to find tweets from everyone the user follows.
- Problem: Reads are expensive; at 300k timeline requests/sec, the join struggles to keep up.
Approach 2 (Push): Maintain a cache for each user's timeline. When a user posts, look up all followers and insert the tweet into their timeline caches.
- Problem: Write load is huge for celebrities (fan-out).
Hybrid: Use Push for most users, but Pull for celebrities: their tweets are fetched at read time and merged into the reader's precomputed timeline.
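The hybrid approach can be sketched as follows. This is a toy illustration, not Twitter's actual implementation; the data structures, function names, and the tiny fan-out threshold are all assumptions for the demo (a real threshold would be in the millions of followers):

```python
from collections import defaultdict

FANOUT_LIMIT = 2  # tiny threshold for the demo; real systems use ~millions

followers = defaultdict(set)   # user -> set of followers
tweets = defaultdict(list)     # user -> tweets they have posted
timelines = defaultdict(list)  # user -> precomputed (pushed) timeline cache

def post(user, text):
    """Store the tweet; push it to followers only for non-celebrities."""
    tweets[user].append(text)
    if len(followers[user]) <= FANOUT_LIMIT:      # normal user: push (fan-out)
        for f in followers[user]:
            timelines[f].append((user, text))

def home_timeline(user, following):
    """Pushed entries, plus celebrity tweets pulled at read time."""
    merged = list(timelines[user])
    for u in following:
        if len(followers[u]) > FANOUT_LIMIT:      # celebrity: pull on read
            merged.extend((u, t) for t in tweets[u])
    return merged
```

The design trade-off: fan-out work is paid once per tweet at write time for ordinary users, while a celebrity's huge follower list never triggers fan-out; their readers pay a small extra cost on every read instead.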
Describing Performance
- Throughput: Records processed per second (batch).
- Response Time: Time between sending a request and receiving a response (online).
Percentiles
The mean (average) is a poor metric because it doesn't tell you how many users actually experienced a given delay. Percentiles, taken from the distribution of response times, are better:
- p50 (Median): Half of requests are faster, half are slower.
- p95, p99, p999 (Tail Latencies): Show how slow the worst requests are. Amazon targets p999 because the customers with the slowest requests often have the most data and are therefore the most valuable.
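Percentiles are computed directly from a sorted list of measured response times. A minimal sketch using the nearest-rank method (the sample latencies are made up for illustration):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample >= p% of all samples."""
    s = sorted(samples)
    k = math.ceil(p / 100 * len(s))  # 1-indexed rank
    return s[max(k, 1) - 1]

latencies_ms = [12, 15, 14, 251, 16, 13, 18, 17, 980, 15]
print("p50:", percentile(latencies_ms, 50))  # median request
print("p99:", percentile(latencies_ms, 99))  # near-worst request
```

Notice how the p50 (15 ms) hides the two outliers entirely, while the p99 (980 ms) exposes them; this is exactly why tail percentiles, not means, are used in service-level objectives.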
Coping with Load
- Scaling Up (Vertical): More powerful machine.
- Scaling Out (Horizontal): Distributing load across multiple smaller machines (shared-nothing).
- Elastic: Automatically adding resources when load increases.
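The shared-nothing idea behind scaling out can be sketched as routing each key to one of N independent nodes by hashing, so each machine holds only its share of the data. This is a simplification: real systems typically use consistent hashing or range partitioning so that changing N doesn't force nearly all keys to move.

```python
import hashlib

def node_for(key: str, num_nodes: int) -> int:
    """Deterministically map a key to one of num_nodes partitions."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_nodes  # naive modulo partitioning

# Every client computes the same mapping, so no coordinator is needed
# to find which node owns a key.
for key in ("user:1", "user:2", "user:3"):
    print(key, "->", "node", node_for(key, 4))
```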