Learning About Replication, MySQL as an example

Abderahmane Toumi
5 min readFeb 14, 2025

--

Relational databases like MySQL have been widely used by many companies, including big names like Pinterest, Tumblr, and Facebook. While some of these companies might have evolved their infrastructure over time, MySQL remains a popular choice for its reliability and scalability, especially in the early stages of building applications.

One of the principles to ensure availability and scalability in MySQL is replication; it allow us to sync between a master and slave db server, where we have a copy of our data in both

Master Slave Setup

In a master-slave setup, we are only allowed to read data from the slave and not allowed to do any write operations, which must be done in the master (write, update, delete)

The master logs these actions on a log file called binlog with a timestamp and assigns a sequence number to each one and then sends this logs once done writing to sync between them

When a client connects to a db server, he send a write operation to the master DB, which then executes it and writes it into the binlog, The slave db server, on the other hand, can connect any time, send the last sequence number he got last time, and then continue receiving a stream of the binlog while updating the data and writing it in his own version of binlog called relay log It then updates the sequence number

Communication between master and slave is async; the master does not care about his slaves; he only logs, and slaves will read, slaves are decoupled from that master and can disconnect at any time

This allow us to have two or more slaves ,

  • with this, you can increase your read capacity by distributing the load between multiple MySQL servers and adding more clones
  • You can also distribute queries by type, those heavy, slow, long-running query into a server and light, fast queries into another, so you can stop the app from having a bottleneck in the I/O
  • You can also backup slaves by shutting down the MySQL process , copy data to archives and then run it again and it will just connect to master and continue fetching from Binlog
  • If a slave stops working, you can simply remove it from the rotation and stop sending requests to it , Those MySQL servers does not keep track of there availability, so we need to build that logic in our app or add a proxy that can detect if those slaves fail

The usage of replication is often used to increase availability, We have two cases of failure: a failure of slave, which is easy to deal with and failure of a master, which is a little bit tricky

To start a slave or bootstrap it, MySQL does not allow you to start it directly; you need a full backup of the data with the log of the last operation executed with it sequence number , The larger the database, the more time it takes a slave to start

The most tricky situation is when a master db server fails, as MySQL does not support automatic failover (switching to a backup server when main server fails) or promoting another slave to a master automatically , This is all manual, and it require finding the most up-to-date slave who have the highest sequence number in his relay log, then adjusting the slave's relay logs or rebuilding them completely from the new master backup, and finally configuring them to replicate from the master

Master To Master Setup

A solution to this is master-to-master replication, where we have two masters, A and B. The binlog of each server contains the server identifier where the write has been done, so we do not have to fall into a circular replication. A will replicate from B what he does not have, and B will do the same.

So if a write happens in B for example, A will replicate by writing on his relay log and then execute these operations After this, A can write into his binlog

So in case of failure, all writes will be redirected to Master B. We can also give each master a group of slave DBs, so when A fails, requests start getting sent to B directly.

With this, if we want to repair or update something on Master A, we switch the standby master to B and stop writes to A. This will cause a short downtime compared to if we have a single master. Next we wait for a time until B replicate everything from A , and then we enable writes on B

In theory we can send writes to both masters but this can cause some data inconsistency; also, it might lead to the same sequence number in both binlogs of both masters and we have to find a mechanism to avoid this, One case is we might run into a situation where we update the same row in both masters and while replicating from each other, one change might be overwritten.

Master — Master setup is good for availability but for scalability, it is hard to setup

  • Having two masters will increase i/o by making each master a slave who replicate from the other master and writes on his relay log, then executes these queries , and also receive request from the servers
  • More memory to store data and indexes to scale vertically in Both

Ring Setup

Ring replication by using multiple db servers to form a ring, this will make things worse, as each server will need to run the write operations each time a one is sent, which consume more resources and takes more time

Ring replication can increase replication lag. Replication lag is basically the time it takes for a replica (slave) to get the data written to it from the master’s binlog after the transaction is logged and committed on the master. In a good infrastructure, this lag should ideally be less than 1 second.

But why does ring replication increase lag? Well, in a simple master-slave setup, if the replication lag is, say, 500 ms, in a ring with four masters, the lag can go up to 2 seconds. This happens because the data has to travel through each server in the ring before it becomes visible to all of them. Each hop adds more delay, so the more servers in the ring, the higher the lag.

Having a replicated Master can cause an issue when it comes to consistency , In one master setup, we are sure that we are having the most recent data on each query, In a master-to-master or ring setup , querying a master where the most recent update did not take effect yet might cause an issue in our data consistency

--

--

No responses yet