Problems and Solutions Noted From Facebook Scaling Memcached

Abderahmane Toumi
6 min read · Nov 29, 2024


Solutions for scaling Memcached in large-scale systems like Facebook's.

Wide fanout:

This happens when one request from a user or client causes the system to send multiple requests to other components, such as databases and caching servers.

In Facebook's case, this was addressed by balancing the load with consistent hashing and spreading it across multiple caching servers.
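To make the consistent-hashing part concrete, here is a minimal sketch in Python. The ring layout, hash function, and virtual-node count are illustrative assumptions, not the exact scheme Facebook or mcrouter uses.

```python
import hashlib
from bisect import bisect

# A minimal consistent-hashing ring: a key maps to the first server point at
# or after its own hash on the ring, wrapping around at the end.
class ConsistentHashRing:
    def __init__(self, servers, virtual_nodes=100):
        entries = []
        for server in servers:
            for i in range(virtual_nodes):        # virtual nodes smooth the balance
                entries.append((self._hash(f"{server}#{i}"), server))
        entries.sort()
        self.points = [point for point, _ in entries]
        self.servers = [server for _, server in entries]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def get_server(self, key):
        idx = bisect(self.points, self._hash(key)) % len(self.points)
        return self.servers[idx]

ring = ConsistentHashRing(["mc1:11211", "mc2:11211", "mc3:11211"])
print(ring.get_server("user:42:profile"))   # the same key always routes to the same server
```

Adding or removing a server only remaps the keys near its points on the ring, which is what keeps the load balanced as the fleet of caching servers grows.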

Usage of Mcrouter:

A proxy between the web servers and the Memcached servers; its role is to distribute traffic across the cache servers and batch requests to increase network throughput and reduce latency.
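As a rough illustration of what batching means here, the sketch below groups the keys a page needs by destination server and issues one multi-get per server instead of one round trip per key. `route` and `mc_multi_get` are hypothetical helpers, not mcrouter's real API.

```python
from collections import defaultdict

def batched_get(keys, route, mc_multi_get):
    """Coalesce many single-key gets into one multi-get per cache server."""
    per_server = defaultdict(list)
    for key in keys:
        per_server[route(key)].append(key)        # group keys by destination server

    results = {}
    for server, server_keys in per_server.items():
        results.update(mc_multi_get(server, server_keys))   # one request per server
    return results
```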

Incast congestion:

When a web server sends requests to multiple caching servers simultaneously, and these caching servers respond all at once, it can overwhelm the web server and the network hardware. The high number of responses creates a workload that exceeds the server and network’s handling capacity, leading to potential congestion and degraded performance.

Facebook addressed this by batching requests through mcrouter (coalescing).

A dependency graph tracks which data depends on other data so that prerequisites are fetched first, which reduces network usage.

Inspired by how TCP controls in-flight data, a sliding window was introduced: it grows when responses come back successfully and shrinks when they fail. The catch is finding the balance between the extra latency caused by a window that is too small and the incast congestion caused by one that is too big.
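A minimal sketch of such a sliding window, assuming simple additive-increase/multiplicative-decrease behavior; the constants here are made up, not Facebook's tuned values.

```python
class SlidingWindow:
    """Caps how many cache requests a client keeps outstanding at once."""

    def __init__(self, min_size=1, max_size=64):
        self.size = min_size          # current window size
        self.min_size = min_size
        self.max_size = max_size
        self.in_flight = 0            # requests sent but not yet answered

    def can_send(self):
        return self.in_flight < self.size

    def on_send(self):
        self.in_flight += 1

    def on_success(self):
        self.in_flight -= 1
        self.size = min(self.max_size, self.size + 1)    # grow slowly on success

    def on_failure(self):
        self.in_flight -= 1
        self.size = max(self.min_size, self.size // 2)   # shrink quickly on failure
```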

Logic in client, state in cache:

This keeps the clients stateless so they can be scaled easily.

Reducing latency:

Get requests from the web servers to Memcached are sent over UDP, since UDP does not need a connection-establishment handshake; the trade-off is that it provides no built-in way to handle failures and no sequencing to track whether a request succeeded or was lost.

TCP is used for set and delete requests because it provides built-in handling of failures and retries. These two operations are too important to lose: dropping them would cost the system a lot, increasing cache misses and overloading the true sources of data (e.g., the databases).
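Here is a rough sketch of that split using Python sockets: gets go over UDP with no handshake and a lost reply is simply treated as a cache miss, while sets go over TCP. The 8-byte UDP frame header follows memcached's documented layout, but treat the details as an assumption to verify rather than a drop-in client.

```python
import socket
import struct

def udp_get(host, key, request_id=1, timeout=0.05):
    """Fire a get over UDP; a dropped or late reply is treated as a cache miss."""
    frame = struct.pack("!HHHH", request_id, 0, 1, 0)   # id, seq no, total datagrams, reserved
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(frame + f"get {key}\r\n".encode(), (host, 11211))
        data, _ = sock.recvfrom(65536)
        return data[8:]                                  # strip the frame header
    except socket.timeout:
        return None                                      # no reply: count it as a miss
    finally:
        sock.close()

def tcp_set(host, key, value, exptime=0):
    """Sets (and deletes) use TCP so reliability and retransmission are built in."""
    payload = value.encode()
    with socket.create_connection((host, 11211)) as sock:
        sock.sendall(f"set {key} 0 {exptime} {len(payload)}\r\n".encode() + payload + b"\r\n")
        return sock.recv(1024).startswith(b"STORED")
```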

Thundering herd:

This happens when a single key on the Memcached servers sees many reads and writes at once. Repeated writes keep invalidating the cached value, so the reads that arrive in between are unable to find valid data and fall through to the original data source, which adds latency.

Stale data and thundering herds can be addressed with leases: a token that is unique per key protects it from being set by an outdated write, and the lease also puts a time limit on the write.
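A toy sketch of the lease idea, under assumed semantics: on a miss the cache hands out a token bound to the key, and only a set that presents a still-valid token is accepted, so outdated writes (and a herd of concurrent refills) get rejected. Real memcached leases also rate-limit how often a token is issued per key.

```python
import time
import uuid

class LeaseCache:
    def __init__(self, lease_ttl=10.0):
        self.data = {}
        self.leases = {}                  # key -> (token, expiry time)
        self.lease_ttl = lease_ttl

    def get(self, key):
        if key in self.data:
            return self.data[key], None
        token = uuid.uuid4().hex          # miss: hand the caller a lease token
        self.leases[key] = (token, time.time() + self.lease_ttl)
        return None, token

    def set_with_lease(self, key, value, token):
        lease = self.leases.get(key)
        if lease is None or lease[0] != token or lease[1] < time.time():
            return False                  # outdated or expired lease: reject the write
        self.data[key] = value
        del self.leases[key]
        return True

    def delete(self, key):
        self.data.pop(key, None)
        self.leases.pop(key, None)        # a delete also invalidates outstanding leases
```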

Different needs of each app:

Create pools of caching servers; for example, some pools might be built for low latency with little memory, while others have a lot of memory but higher latency.

Also, if an app routinely sees a very high volume of similar requests over a lot of data, Facebook chooses replication rather than sharding. With sharding, the data is split across many servers but each request still has to touch all of them, so every shard ends up handling the same request load; with replication, each server holds the full data set, so the servers can split the requests between them (see the rough arithmetic below).
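A back-of-the-envelope illustration of that trade-off, with made-up numbers: if each request needs keys from the whole (small) data set, sharding does not lower the per-server request rate, while replication divides it.

```python
requests_per_sec = 1_000_000   # hypothetical request rate for a popular app
servers = 4                    # cache servers available for this data

# Sharding: the data set is split across the servers, but each request still
# needs keys from every shard, so every server sees the full request rate.
per_server_sharded = requests_per_sec

# Replication: each server holds the whole data set, so a request can be
# served by a single replica and the rate is divided across the servers.
per_server_replicated = requests_per_sec // servers

print(per_server_sharded, per_server_replicated)   # 1000000 vs 250000 requests/s per server
```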

Recovering from failure:

For when a server goes down and needs to recover, Facebook introduced Gutter servers, which represent about 1 percent of the servers in a cluster. Gutter's role is to stand in for the recovering server until it is back up. Gutter sees far fewer deletes on its keys and compensates with a low expiration time instead, which may cause clients to receive slightly stale data.
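A sketch of the Gutter fallback as I read it, with illustrative client objects and helper names: when the primary cache server does not respond, the client retries the small Gutter pool and backfills it with a short expiration so the database stays shielded while the primary recovers.

```python
def get_with_gutter(key, primary, gutter, fetch_from_db, gutter_ttl=10):
    try:
        value = primary.get(key)
        return value if value is not None else fetch_from_db(key)   # ordinary hit or miss
    except (ConnectionError, TimeoutError):
        pass                                  # primary unreachable: fall back to Gutter

    value = gutter.get(key)
    if value is None:
        value = fetch_from_db(key)
        gutter.set(key, value, expire=gutter_ttl)   # short TTL instead of explicit deletes,
    return value                                    # so clients may see slightly stale data
```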

System is getting larger:

Simply buying more resources is naive, so splitting the system into regions was the better option: it enables dynamic configuration of the network and other hardware, some regions offer cheaper hardware and lower cost, and the setup also improves fault tolerance and availability.

Read-after-write semantics:

After each write or delete operation to the original data source:

The system notifies Mcrouter to invalidate the relevant cache entries.

McSqueal, a daemon process, monitors the database commit logs and sends cache invalidation messages to Mcrouter.

Mcrouter batches these invalidation requests and forwards them to the appropriate Memcached nodes.

This ensures consistency between the database and the cache, preventing stale data from being served.
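A toy version of that pipeline, with every name hypothetical: a daemon tails the database commit log, pulls out the cache keys recorded with each committed statement, batches them, and hands the batch to the routing layer to fan out as deletes.

```python
def invalidation_daemon(tail_commit_log, send_to_mcrouter, batch_size=100):
    batch = []
    for entry in tail_commit_log():              # yields committed statements, forever
        batch.extend(entry.cache_keys)           # cache keys recorded alongside each write
        if len(batch) >= batch_size:
            send_to_mcrouter([("delete", key) for key in batch])   # one batched call
            batch = []
```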

Regional pools:

Each region has a set of clusters, and the many apps and systems running on them have different needs, so Facebook introduced regional pools. For example, one pool might trade some access speed (latency) to save memory, while other pools hold large data that is not accessed frequently.

These pools are created based on statistics and data:

  • how large the data that needs to be stored is
  • how frequently it is accessed
  • how many users access this data

When a cluster in a region comes back from downtime with an empty cache (a cold cluster), it starts serving misses from a warm cluster, and the clients also copy what they fetch into the cold cluster's cache so it gets back up to speed faster.

In a multi-region setup there is a huge amount of data exchange, so a single communication channel is used between regions and clusters to reduce the metadata that has to be exchanged to establish connections.

If one component is not responsive, mcrouter and McSqueal keep queuing the data until the component is back up and then replay it.

Use fine-grained locks instead of one big lock to protect a shared resource.
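For example, one lock per hash bucket ("striped" locks) instead of one lock over the whole structure, as in this sketch; the stripe count is arbitrary.

```python
import threading

class StripedLockCache:
    """Per-bucket locks so writes to different keys do not serialize behind one lock."""

    def __init__(self, stripes=64):
        self.locks = [threading.Lock() for _ in range(stripes)]
        self.data = {}

    def _lock_for(self, key):
        return self.locks[hash(key) % len(self.locks)]   # each bucket has its own lock

    def set(self, key, value):
        with self._lock_for(key):        # only contends with keys in the same bucket
            self.data[key] = value

    def get(self, key):
        with self._lock_for(key):
            return self.data.get(key)
```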

Nonzero hold-off time:

A key cannot be added again for a specific amount of time after a delete, giving other in-flight operations time to complete. In Facebook's case they used 2 seconds, to ensure consistency when a cold cluster updates the data but does not update the cache in the warm cluster (web servers are fetching from the warm cluster because the cold cluster is under maintenance).

Facebook's solution prioritizes performance over always serving perfectly up-to-date, non-stale data.
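A minimal sketch of the hold-off rule, assuming a simple per-key timestamp; the 2-second constant just mirrors the figure mentioned above.

```python
import time

HOLD_OFF_SECONDS = 2.0   # the hold-off Facebook reportedly used

class HoldOffCache:
    def __init__(self):
        self.data = {}
        self.deleted_at = {}                  # key -> time of the last delete

    def delete(self, key):
        self.data.pop(key, None)
        self.deleted_at[key] = time.time()

    def add(self, key, value):
        last = self.deleted_at.get(key)
        if last is not None and time.time() - last < HOLD_OFF_SECONDS:
            return False                      # too soon after a delete: reject the add
        self.data[key] = value
        return True
```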

Remote markers:

In Facebook’s regional architecture, the system consists of a master region and several read-only replica regions. When a web server in a replica region needs to update data, it first invalidates the corresponding key in the replica cache and assigns a remote marker (RK) to the key to track the invalidation process. The update is then sent to the master region, where the database is updated, and the same RK is assigned to the key in the database’s SQL statement to ensure consistency. Once the key is invalidated, it is deleted from the local replica cache.

When the web server needs to access the data again, it checks the replica cache. If the remote marker is still present, the system understands that the data is outdated and fetches the updated value directly from the master region. This approach helps maintain consistency across regions while minimizing stale data issues in a distributed setup.
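A sketch of that flow with hypothetical handles (replica_cache, master_db, master_region, local_db) and an assumed naming convention for the marker key:

```python
def rk(key):
    return f"rk:{key}"            # assumed naming convention for a key's remote marker

def update_in_replica_region(key, new_value, replica_cache, master_db):
    replica_cache.set(rk(key), 1)                       # 1. set the remote marker locally
    master_db.write(key, new_value, marker=rk(key))     # 2. write to the master region,
                                                        #    carrying the marker to clear later
    replica_cache.delete(key)                           # 3. invalidate the local cached copy

def read_in_replica_region(key, replica_cache, master_region, local_db):
    value = replica_cache.get(key)
    if value is not None:
        return value
    if replica_cache.get(rk(key)) is not None:          # marker still set: local DB may lag,
        return master_region.read(key)                  # so fetch from the master region
    return local_db.read(key)                           # otherwise the local replica is fine
```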

Slab allocators:

Facebook introduced slab allocators to Memcached to manage memory more efficiently. Slabs are divided into classes, each representing a specific size category. For example, a slab belonging to the 64-byte class consists of chunks, each sized at 64 bytes. To monitor and manage memory usage effectively, a global Least Recently Used (LRU) mechanism is employed. This mechanism tracks the usage patterns of items within chunks and slabs, allowing the system to allocate or evict chunks based on their utilization.

For keys that see heavy use for a short time and then are never used again, a circular buffer of linked lists tracks these items and evicts them: entries are indexed by the second they expire, and every second the buffer advances and evicts that second's set of keys.
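A sketch of that second-indexed structure, assuming a fixed horizon and set-based buckets rather than the real linked lists:

```python
class TransientItemCache:
    """Buckets short-lived keys by the second they expire and evicts them in bulk."""

    def __init__(self, horizon_seconds=60):
        self.buckets = [set() for _ in range(horizon_seconds)]
        self.head = 0                          # bucket for the current second

    def add(self, key, ttl_seconds):
        slot = (self.head + ttl_seconds) % len(self.buckets)
        self.buckets[slot].add(key)            # index the key by its expiry second

    def tick(self, store):
        """Called once per second: evict everything expiring now, then advance."""
        for key in self.buckets[self.head]:
            store.pop(key, None)
        self.buckets[self.head].clear()
        self.head = (self.head + 1) % len(self.buckets)
```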

System V shared memory:

When a Memcached server needs a software update, System V shared memory was used to hold the cached data and metadata across the restart, so the process can get them back once it starts again.
