MySQL 8.0 information_schema Corruption - Cannot Expand PXC Cluster

Environment

  • PXC Version: 8.0.42-33.1 Percona XtraDB Cluster (GPL), Release rel33, Revision 6673f8e

  • WSREP Version: 26.1.4.3

  • OS: Ubuntu 24.04

  • Cluster: 3-node setup

  • Data Size: ~161GB production data

Problem Summary

After a 3-node PXC cluster failure, we successfully bootstrapped pxc3 which is now serving production traffic normally. However, we discovered severe information_schema corruption that blocks all administrative operations and prevents cluster expansion via SST.

Production queries work perfectly - the issue is purely with metadata access.

Symptoms

mysql> SHOW DATABASES;
ERROR 1356 (HY000): View 'information_schema.SCHEMATA' references invalid table(s) 
or column(s) or function(s) or definer/invoker of view lack rights to use them

mysql> SELECT COUNT(*) FROM information_schema.tables WHERE engine = 'InnoDB';
ERROR 1356 (HY000): View 'information_schema.TABLES' references invalid table(s)...

What’s Broken:

  • All information_schema views (TABLES, SCHEMATA, COLUMNS, etc.)

  • mysqldump (requires information_schema access)

  • User management (SHOW GRANTS fails)

  • SST donor operations (xtrabackup needs information_schema)

  • Cluster expansion (cannot add pxc1/pxc2 back)

What Works:

  • Direct table queries: SELECT * FROM database.table

  • Production application queries

  • SHOW STATUS commands

  • Data integrity seems intact
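
Because SHOW DATABASES and the information_schema views are unusable, one way to inventory what is on disk is to enumerate the datadir layout directly: each schema is a directory and each InnoDB table has a .ibd file. A minimal sketch, assuming file-per-table tablespaces (the 8.0 default):

```shell
# list_ibd_tables DATADIR
# Prints schema.table for every InnoDB .ibd file found under DATADIR,
# bypassing information_schema entirely.
list_ibd_tables() {
    datadir="$1"
    for schema in "$datadir"/*/; do
        [ -d "$schema" ] || continue
        name=$(basename "$schema")
        # Skip system schemas; we only want user data
        case "$name" in
            mysql|sys|performance_schema) continue ;;
        esac
        for ibd in "$schema"*.ibd; do
            [ -e "$ibd" ] || continue
            printf '%s.%s\n' "$name" "$(basename "$ibd" .ibd)"
        done
    done
}

# Example: list_ibd_tables /var/lib/mysql
```

Partitioned tables will show up with `#p#` suffixes in the file names, but this at least yields a checklist of what needs to be exported.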

What We’ve Tried

1. mysql_upgrade --force

# Modified systemd to force upgrade
ExecStart=/usr/sbin/mysqld --upgrade=FORCE $EXTRA_ARGS $_WSREP_START_POSITION

# Result in error log:
2026-01-30T00:30:25.554808Z 6 [System] [MY-013381] [Server] 
  Server upgrade from '80042' to '80042' started.
2026-01-30T00:30:26.664702Z 6 [System] [MY-013381] [Server] 
  Server upgrade from '80042' to '80042' completed.

# But corruption persists - upgrade ran but didn't fix it

2. mysqldump with Various Options

# All variations fail with same error
mysqldump --skip-opt --quick --single-transaction --databases [list]
# Error: Couldn't execute 'SELECT COUNT(*) FROM INFORMATION_SCHEMA.TABLES...'

mysqldump --no-tablespaces --skip-lock-tables [dbname]
# Error: Same information_schema error

3. mysqlcheck Repair

mysqlcheck -u root -p --auto-repair --all-databases
# Error: Couldn't get table list for database mysql: 
# View 'information_schema.TABLES' references invalid table(s)

4. Direct System Table Repair

SET SESSION sql_log_bin=0;
REPAIR TABLE mysql.user;
REPAIR TABLE mysql.db;
FLUSH PRIVILEGES;
# Statements return without error, but information_schema is still broken.
# As the output below shows, REPAIR is actually a no-op for InnoDB tables:
mysql> REPAIR TABLE mysql.user;
+------------+--------+----------+---------------------------------------------------------+
| Table      | Op     | Msg_type | Msg_text                                                |
+------------+--------+----------+---------------------------------------------------------+
| mysql.user | repair | note     | The storage engine for the table doesn't support repair |
+------------+--------+----------+---------------------------------------------------------+
1 row in set (0.00 sec)

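
As the note above indicates, the mysql.* system tables are InnoDB in 8.0, so REPAIR TABLE is effectively a no-op for them. The consistency check InnoDB does support is CHECK TABLE, for example:

```sql
-- InnoDB ignores REPAIR TABLE, but CHECK TABLE does verify the
-- tables' structure and indexes
CHECK TABLE mysql.user, mysql.db EXTENDED;
```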

5. Physical File Copy for Slave Setup

  • Copied 162GB data dir to new server

  • Created fresh mysql system database

  • User databases not discovered (data dictionary mismatch)

  • Tried restoring old mysql directory - corruption copied over
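
If the .ibd files are truly intact, another avenue for building a clean node is InnoDB's transportable tablespace mechanism: recreate each table on a fresh instance, discard its empty tablespace, copy the .ibd file over from the damaged node, and import it. A sketch for a single hypothetical table app.orders (the exact original CREATE TABLE definition is required, e.g. from application schema files or an old dump, since it cannot be read from the broken instance):

```sql
-- On the fresh instance: recreate the table with its original definition
CREATE DATABASE IF NOT EXISTS app;
CREATE TABLE app.orders ( /* exact original definition required */ id BIGINT PRIMARY KEY );

-- Detach the empty tablespace so the copied .ibd can take its place
ALTER TABLE app.orders DISCARD TABLESPACE;

-- (outside MySQL) copy app/orders.ibd from the damaged node into the
-- fresh instance's datadir, with correct ownership and permissions

-- Attach the copied tablespace
ALTER TABLE app.orders IMPORT TABLESPACE;
```

Without the matching .cfg files the import may warn about missing metadata, but for tables whose definitions have not changed it generally proceeds.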

Current Situation

pxc3 Status (Donor):

mysql> SHOW STATUS WHERE Variable_name IN 
  ('wsrep_cluster_status', 'wsrep_cluster_size', 
   'wsrep_local_state_comment', 'wsrep_ready');
+---------------------------+---------+
| Variable_name             | Value   |
+---------------------------+---------+
| wsrep_cluster_size        | 1       |
| wsrep_cluster_status      | Primary |
| wsrep_local_state_comment | Synced  |
| wsrep_ready               | ON      |
+---------------------------+---------+

SST Failure Pattern (when trying to add pxc1):

2026-01-29T01:25:25.766624Z 0 [Note] [WSREP-SST] 
  ERROR 1142 (42000) at line 1: SELECT command denied to user 
  'mysql.pxc.sst.user'@'localhost' for table 'keyring_component_status'

2026-01-29T01:25:25.791119Z 0 [Warning] [Galera] 1.0 (pxc3): 
  State transfer to 0.0 (pxc1) failed: Not in primary component

We created the SST user with all required privileges, but the role grants do not activate properly because of the information_schema issues.
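
For the SST error specifically, keyring_component_status lives in performance_schema in 8.0, so the donor-side SST user may need an explicit grant. A sketch (whether the grant actually takes effect here is uncertain, given the broken metadata layer):

```sql
-- Do not replicate this local privilege fix
SET SESSION sql_log_bin = 0;
GRANT SELECT ON performance_schema.keyring_component_status
    TO 'mysql.pxc.sst.user'@'localhost';
FLUSH PRIVILEGES;
```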

Relevant Configuration

my.cnf (pxc3):

[mysqld]
server-id=158
wsrep_cluster_name=yellowdot_pxc_cluster
wsrep_cluster_address=gcomm://pxc1,pxc2,pxc3 # I am using actual IPs here
wsrep_node_address=pxc3 # I am using actual IP here
wsrep_sst_method=xtrabackup-v2
wsrep_sst_auth='mysql.pxc.sst.user:'
pxc-encrypt-cluster-traffic=OFF
wsrep_provider_options="pc.ignore_sb=true"

Questions

  1. Is there a way to repair the data dictionary (mysql.ibd) directly without relying on information_schema?

  2. Can we force xtrabackup SST to skip information_schema queries and just copy physical files? We have --no-version-check but xtrabackup still checks information_schema internally.

  3. Can we manually export/import the data dictionary from the physical files to create a clean slave, then use that slave as SST donor?

  4. Is there a recovery tool that can rebuild information_schema views from the underlying InnoDB data dictionary without needing information_schema access itself?

  5. Could we use rsync/physical file copy as an SST method, given that xtrabackup requires information_schema? Our files are crash-consistent since we’re running single-node.
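
On questions 3 and 4: the ibd2sdi utility that ships with MySQL 8.0 reads the Serialized Dictionary Information (SDI) embedded in each .ibd file without needing a running server, which may help recover table definitions from the intact tablespaces. The path below is a hypothetical example:

```shell
# Dump the SDI (table definition as JSON) embedded in an .ibd file;
# no running server or information_schema access required.
ibd2sdi /var/lib/mysql/app/orders.ibd > orders-sdi.json
```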

Our Current Plan (if no solution)

  • Keep pxc3 running as-is (production stable)

  • Create slave node and make it replicate data from pxc3

  • Export data via application-level tools

  • Fail over traffic to the new slave once replication is in sync, allowing us to rebuild the 3-node prod cluster from scratch

However, we’d prefer to recover the existing cluster if there’s a way to work around this information_schema corruption.

Additional Context

  • This happened after two nodes left the cluster and went down for reasons that are still unknown.

  • pxc2 was the last to go down and had the most current data

  • We restarted pxc2 in bootstrap mode, and pxc3 successfully joined the cluster.

  • pxc1's attempt to join the cluster using pxc2 as donor failed.

  • When we forced pxc1 to join the cluster using pxc3 as donor, the SST process started but got stuck in a loop: data in /var/lib/mysql/ would grow to about 42GB and then be deleted whenever a network error occurred on the donor during the sync. We lost the data on pxc1 completely.

  • We took pxc2 out of bootstrap mode and restarted it normally, but it failed to join the cluster. It has been down since.

  • We have physical file backups but they contain the same corruption

  • InnoDB tablespaces (.ibd files) appear intact and readable

  • Only the metadata layer is corrupted

Any guidance would be greatly appreciated! We’re particularly interested in:

  • Alternative SST methods that bypass information_schema

  • Data dictionary repair procedures

  • Manual cluster recovery techniques

  • Similar experiences with this type of corruption

Thank you!


From the information you provided, information_schema has been seriously damaged, and it cannot even be guaranteed that the data pages themselves are intact, so back up the data as soon as possible and rebuild the cluster.

Update: Issue Resolved

Thank you for your assistance. We’ve successfully resolved the issue using the following approach:

Resolution Steps:

  1. Exported all individual tables to CSV files using SELECT ... INTO OUTFILE statements

  2. Dropped the corrupted database

  3. Recreated the database from scratch

  4. Re-imported all tables from the CSV exports

  5. Verified data integrity and confirmed information_schema.TABLES is now functioning correctly

The database is now fully operational with the information_schema views working as expected. The export/reimport process allowed us to recover all data while bypassing the corrupted system metadata.
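
For anyone hitting the same problem, the export/reimport round-trip looks roughly like this. A sketch with a hypothetical table app.orders; secure_file_priv must permit the target directory, and the table definitions have to be recreated separately, since mysqldump was unavailable:

```sql
-- Export the rows to CSV (works because direct table access was
-- unaffected; only the metadata layer was broken)
SELECT * INTO OUTFILE '/var/lib/mysql-files/orders.csv'
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
FROM app.orders;

-- After dropping and recreating the database and table definitions:
LOAD DATA INFILE '/var/lib/mysql-files/orders.csv'
    INTO TABLE app.orders
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n';
```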

Thanks again for the guidance and support.


This is kinda scary... Do you have any thoughts on why the schema was so corrupted? I would have thought that with a three-node cluster, two of the nodes must have been in sync and one should have been in desync mode. I would have brought down the problem child and had it SST from one of the good nodes.

I would be curious what may have caused this, if you know. Thanks