Disk Divergence across Master and Slaves

We have deployed a MySQL cluster running version 5.7.22-22 on Debian GNU/Linux 9, with one master and two replicas (slaves). From the hardware, software, and MySQL configuration point of view, all the instances are identical. Both slaves are read-only, with no direct writes happening on them.

We have observed that for some tables with the same number of rows on the master and the slaves, the .ibd file size differs between master and slave. For example, if the table employee is 100GB on the master, it will be 120GB on the slaves. The difference varies from 10-20%, and sometimes even more. Generally, we have observed this on partitioned tables or tables with varchar/text/longtext columns.
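
As a rough way to compare, a query along these lines can be run on both master and slave (the database and table names below are placeholders, and table_rows is only an estimate); it contrasts the logical size InnoDB reports with the unused space inside the tablespace:

-- Hypothetical size check; run on each node and compare the output.
SELECT table_name,
       table_rows,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 2) AS logical_gb,
       ROUND(data_free / 1024 / 1024 / 1024, 2) AS free_gb
FROM information_schema.tables
WHERE table_schema = 'mydb' AND table_name = 'employee';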

To fix this, we have been running table optimization, or sometimes rebuilding the slaves from the master when the data size is large.
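
For a single affected table, a rebuild on the replica can be done with something like the following sketch (the table name is a placeholder); on InnoDB, OPTIMIZE TABLE is internally mapped to a recreate plus analyze, which is what reclaims the extra pages:

-- Rebuild the table in place on the replica (placeholder names).
OPTIMIZE TABLE mydb.employee;
-- Equivalent explicit rebuild:
ALTER TABLE mydb.employee ENGINE=InnoDB;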

Sample Schema:
CREATE TABLE messages (
  id int(11) NOT NULL AUTO_INCREMENT,
  message_id varchar(100) NOT NULL,
  relayed tinyint(1) DEFAULT NULL,
  relayed_at datetime DEFAULT NULL,
  exchange_name varchar(100) NOT NULL,
  exchange_type varchar(10) NOT NULL DEFAULT 'queue',
  message mediumtext,
  inbound_message_id varchar(100) DEFAULT NULL,
  app_id varchar(50) DEFAULT NULL,
  correlation_id varchar(100) DEFAULT NULL,
  group_id varchar(100) NOT NULL,
  http_method varchar(10) DEFAULT NULL,
  http_uri varchar(4096) DEFAULT NULL,
  reply_to varchar(100) DEFAULT NULL,
  reply_to_http_method varchar(10) DEFAULT NULL,
  reply_to_http_uri varchar(4096) DEFAULT NULL,
  txn_id varchar(100) DEFAULT NULL,
  routing_key varchar(100) DEFAULT NULL,
  context text,
  destination_response_status int(11) DEFAULT NULL,
  relay_error varchar(255) DEFAULT NULL,
  retries int(11) DEFAULT '0',
  custom_headers text,
  created_at datetime NOT NULL,
  updated_at datetime NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY message_id_idx (message_id),
  KEY exchange_name_idx (exchange_name),
  KEY group_id_idx (group_id),
  KEY relayed_idx (relayed)
) ENGINE=InnoDB AUTO_INCREMENT=671810 DEFAULT CHARSET=latin1;

Is anyone else observing the same phenomenon in your setups? We want to understand what is causing this issue.


Hi Sachin,

First of all, I would suggest you run pt-table-checksum from the Percona Toolkit (pt-table-checksum — Percona Toolkit Documentation).

With pt-table-checksum you can check whether there are any data inconsistencies.
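
A minimal invocation might look like this sketch (host, credentials, and database name are placeholders; the tool is run against the master and verifies the replicas through replication):

# Writes checksums on the master and compares them on each replica,
# reporting per-chunk row-count and CRC differences.
pt-table-checksum --host=master.example.com --user=percona --ask-pass \
  --databases=mydb --no-check-binlog-format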

Then, file-size differences might happen, for example, because of fragmentation, a table rebuild on different MySQL versions (rebuilding a table on 5.6 vs. 5.7 uses a different algorithm and the result may differ), a difference in innodb_fill_factor, data differences, extra indexes on one of the servers, etc.


Do you still see size differences after rebuilding the table with the same version, configuration, and data set size?


@CTutte

As mentioned earlier, all the nodes in the cluster are identical.
There are no differences in schema, indexes, or configuration between master and slaves.

After rebuilding the slaves from the master, sizes are the same for a day or two; after that, the size difference starts showing up for the table, and it increases as the data grows.
Here is a screenshot showing the difference in .ibd file size between the master and a slave of the same cluster.


Screenshot of schema on master and slave


@sachin_japate,
Run pt-table-checksum to ensure that the actual row data is indeed the same between source and replica.
After that, I would re-create the table without COMPRESSION=zlib and then observe the file size changes. The compression algorithm could be a factor here for reasons we don't yet understand, so we need to rule it out.
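
If the table uses transparent page compression, a sketch of how it could be re-created without it (the table name is a placeholder; this is the documented 5.7 way to turn page compression off and rewrite the existing pages):

-- Disable transparent page compression, then rebuild so pages are rewritten.
ALTER TABLE mydb.messages COMPRESSION='None';
OPTIMIZE TABLE mydb.messages;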


@matthewb Here is another example we have tried on the test cluster.

Screenshot of schema on master and slave


IBD file size and record count on master and slave


pt-table-checksum report

Sorry for spamming with multiple replies for the same case; I was not allowed to upload multiple screenshots in a single post.


Hello @sachin_japate,
Thanks for providing all of that info. I can't think of any reasons off-hand why the files would be different. If you truly want to understand, you'll need to use the InnoDB Ruby tools and inspect the .ibd files at the page level. Specifically, I would compare 'space-page-type-summary' and 'space-extents-illustrate' between the two servers. Good luck!
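
For reference, a sketch of how those two reports might be generated (the .ibd path is a placeholder; innodb_space reads the raw file, so run it against a quiesced copy of the tablespace):

# Page-type distribution (how many INDEX pages, freshly allocated pages, etc.)
innodb_space -f /var/lib/mysql/mydb/messages.ibd space-page-type-summary
# Per-extent illustration of page usage, handy for spotting sparse extents
innodb_space -f /var/lib/mysql/mydb/messages.ibd space-extents-illustrate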


@matthewb Attached a screenshot; I ran the innodb_space tool on both master and slave.


Instead of using screenshots, please consider using cut & paste to include that information (formatted as "Preformatted Text") in your replies. That way the content of your screen output is searchable too, which might help others find your post when they are facing a similar issue, and it makes it easier to reply to or comment on anything noteworthy in the output.


Why do you assume the files on disk should be the same size? There can be many things causing the sizes to differ. The secondary could simply have long-running read transactions blocking the purge, resulting in more page splits. As long as pt-table-checksum reports no diff, I wouldn't worry at all. In addition, compression=zlib is implemented using hole punching, and not all filesystems support that.
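
One way to check for such transactions on a replica is a query along these lines (a sketch; the 60-second threshold is arbitrary):

-- Open transactions older than 60 seconds; a long-lived consistent read view
-- like this keeps the purge thread from removing delete-marked rows.
SELECT trx_id,
       trx_started,
       TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS age_seconds,
       trx_mysql_thread_id,
       LEFT(trx_query, 80) AS query_snippet
FROM information_schema.innodb_trx
WHERE trx_started < NOW() - INTERVAL 60 SECOND
ORDER BY trx_started;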


@yves.trudeau The file sizes for the table should be the same, since the schema and data on master and slave are the same.
In our case we don't use the secondary (hot standby) for reads; that node is only used for managing master failover. We provision dedicated read replicas for that purpose.

Disk usage on the slave nodes in the cluster is much higher than on the master node. In most of the clusters where this issue is observed, slaves go down because their disks are 100% full while the master is at ~80%. We then end up rebuilding all the slaves in the cluster, and while the slaves are being rebuilt we are running with only the master, so cluster availability is at risk.

compression=zlib is enabled on only 1-2 clusters, but the issue has been seen without compression in most of the clusters we manage.


So, InnoDB stores rows in B-trees. Pages in a B-tree can be anywhere between half full (50%) and full (100%). Let's assume there is a long-running read query on the replica forcing a consistent read view to stay open. A row is deleted from an InnoDB page and another one is later inserted into the same page. The page in question is almost full.

Since there is no long-running query on the master, the old row is removed by the purge thread almost immediately, and then the new row is added. The new row fits in the page because the old one has been removed.

On the replica, there is a long-running transaction (a SELECT) which prevents the removal of the deleted row; the row is just flagged as deleted. Then the new row is added. Since there is not enough space in the page, the page has to be split in two before the new row is finally added. Eventually the long SELECT finishes and the purge thread removes the deleted row. The data is still the same, but you now have two pages on the replica and one on the master.

As I said, you cannot assume the sizes are the same on the master and the replica; they can differ by up to a factor of two. I hope my explanation is clear; it is a subtle subject involving dynamic systems.
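
A rough way to see whether purge is lagging on a given node is to watch the history list length, for example with a query like this sketch (the metric is collected by default on 5.7):

-- Number of undo log units not yet purged; a value that stays much higher on
-- the replica than on the master suggests purge is being held back there.
SELECT name, subsystem, `count`
FROM information_schema.innodb_metrics
WHERE name = 'trx_rseg_history_len';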


Thanks, @yves.trudeau for the information.

Are there any other processes that can delay the removal of deleted rows, other than long-running SELECT queries? In our cluster deployment we have dedicated slaves for serving read queries, plus a dedicated slave for managing master failover (this slave will be promoted if the master goes down). That failover slave does not accept any reads, yet we have seen disk divergence even on it.

Hi @sachin_japate,
Yes, there are quite a few. Nobody really looks into such issues because people care about data consistency and less about file sizes. B-trees are very flexible data structures; they have an evolution of their own in terms of page splits and merges. I think you'll have to reduce your expectations of size similarity between the source and the replica.

Hi Yves. This is interesting, as I have what seems to be the exact same issue. I have tables that are partitioned, compressed, and of different sizes between master and slaves. What's odd to me is that the slaves have the exact same sizes and yet are used very differently (one is not used at all, as it is a failover node, while the other is a READ replica running many different queries and stored procedures, so there will be many long-running queries).

I’m wondering if you have any potential insights here since this discussion took place some time ago?

The master server
[master bin]# ls -lhS /somewhere/tableinquestion*
-rw-r----- 1 mysql mysql 599G Jun 2 11:46 /somewhere/tableinquestion#P#p5.ibd
-rw-r----- 1 mysql mysql 480G Apr 6 12:36 /somewhere/tableinquestion#P#p3.ibd
-rw-r----- 1 mysql mysql 465G May 3 15:18 /somewhere/tableinquestion#P#p4.ibd
-rw-r----- 1 mysql mysql 82G Jun 5 10:43 /somewhere/tableinquestion#P#p6.ibd

The following 2 are slaves

A failover node. No queries of any kind are issued here. It is fully isolated and left alone
[replication-failover bin]# ls -lhS /somewhere/tableinquestion*
-rw-r--r-- 1 mysql mysql 707G Jun 2 11:44 /somewhere/tableinquestion#P#p5.ibd
-rw-r--r-- 1 mysql mysql 564G Apr 6 12:33 /somewhere/tableinquestion#P#p3.ibd
-rw-r--r-- 1 mysql mysql 548G May 3 15:14 /somewhere/tableinquestion#P#p4.ibd
-rw-r--r-- 1 mysql mysql 96G Jun 5 10:43 /somewhere/tableinquestion#P#p6.ibd

A READ node. Many queries/stored procedures are running here
[replication-read bin]# ls -lhS /somewhere/tableinquestion*
-rw-r--r-- 1 mysql mysql 707G Jun 2 12:12 /somewhere/tableinquestion#P#p5.ibd
-rw-r--r-- 1 mysql mysql 564G Apr 6 13:01 /somewhere/tableinquestion#P#p3.ibd
-rw-r--r-- 1 mysql mysql 548G May 3 16:41 /somewhere/tableinquestion#P#p4.ibd
-rw-r--r-- 1 mysql mysql 96G Jun 5 10:45 /somewhere/tableinquestion#P#p6.ibd

We see that both slaves have the exact same file sizes yet their usage is very different.

Hello. Just wondering if you have had any insights into this?

I have what seems to be the same issue. We've checked row counts and the data does seem to be intact, but it is still perplexing.

@ChrisDK,
You can use the InnoDB Ruby tools to dump the per-page contents of one of your .ibd files and compare them page by page. You will probably see what Yves mentioned before: 10 rows can exist on one 16K page on the source but might be split across N x 16K pages on the replicas. Even though the data is the same, the space on disk will be greater with more pages in use; each page uses 16K no matter how much, or how little, data is stored.
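
A sketch of one way to see that page-level spread with the same tools (the .ibd path is a placeholder):

# Lists each index page with its record count and free space; comparing this
# output between source and replica shows the same rows spread over more,
# emptier pages on the replica.
innodb_space -f /var/lib/mysql/mydb/messages.ibd space-index-pages-summary | head -25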