I am open to suggestions on how to handle the following condition, which happens occasionally. I have a cluster setup where all traffic is sent to the primary node.
During some batch processing, a duplicate entry error initiates a vote in the quorum. The active primary ends up getting voted out of the quorum (works as designed, and the cluster fails over to the next node). The problem is that anyone with persistent pooled connections continues to get wsrep errors until the connections time out or I shut down the service. I had thought about writing an event that checks the status and, if wsrep_cluster_status is Disconnected, calls a procedure that kills all non-system processes. But I am not sure whether a KILL will work while the node is in a disconnected state, or whether it will itself result in a wsrep error.
I would like to do this inside the DB rather than running an external script that checks the mode and issues a shutdown command.
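For reference, a hedged sketch of the event I have in mind (untested; assumes MySQL/PXC 5.7+ for performance_schema.global_status, an enabled event scheduler, and the event name is made up). Whether KILL actually succeeds once the node is disconnected is exactly the open question:

```sql
DELIMITER $$
CREATE EVENT IF NOT EXISTS kill_sessions_when_not_primary
ON SCHEDULE EVERY 1 MINUTE
DO
BEGIN
  DECLARE done INT DEFAULT 0;
  DECLARE pid BIGINT;
  -- All non-system sessions except our own
  DECLARE cur CURSOR FOR
    SELECT id
      FROM information_schema.processlist
     WHERE user NOT IN ('system user', 'event_scheduler')
       AND id <> CONNECTION_ID();
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
  -- Keep looping even if an individual KILL fails
  DECLARE CONTINUE HANDLER FOR SQLEXCEPTION BEGIN END;

  IF (SELECT variable_value
        FROM performance_schema.global_status
       WHERE variable_name = 'wsrep_cluster_status') <> 'Primary' THEN
    OPEN cur;
    kill_loop: LOOP
      FETCH cur INTO pid;
      IF done = 1 THEN LEAVE kill_loop; END IF;
      -- KILL does not take a variable directly, so prepare it
      SET @stmt = CONCAT('KILL CONNECTION ', pid);
      PREPARE k FROM @stmt;
      EXECUTE k;
      DEALLOCATE PREPARE k;
    END LOOP;
    CLOSE cur;
  END IF;
END$$
DELIMITER ;
```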
Hi @meyerder , thanks for posting to the Percona forums!
If 100% of your activity is against one member of a PXC cluster, any DUPLICATE KEY violations should be resolved within InnoDB and should not result in any member evictions / failures in certification.
We can help you further if you were to share the error logs from the evicted member and from another member for the same time period.
You may also want to look into using the wsrep_sync_wait variable as that will ensure database queries are executed only once a member is fully in sync with the other cluster members.
See Attached… From “master” or primary node… Others will be attached soon…
error.txt (3.3 KB)
secondary.txt (4.9 KB)
mysql> show variables like 'wsrep_sync_wait';
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| wsrep_sync_wait | 0     |
+-----------------+-------+
This is an incorrectly handled error condition in your application. If your application code receives an error, then the app should close the persistent connection and open a new one to the new primary.
@matthewb Easier said than done to get the app teams to do that.
wsrep_sync_wait is designed to be a session variable that you set for queries you want to ensure are executed on an in-sync node. It would look like this:
mysql -e "set wsrep_sync_wait=7;select ..."
Regarding your logs: what format is that? They don't look like complete traditional MySQL error logs; usually wsrep is much more verbose.
@Michael_Coburn I have them in json… I was cutting and pasting… is there a secure location I can send them to and would upload the entire thing…
Well, this entire situation you came here for help on is all the application’s fault yet you want the DB to resolve it. Here’s what I see: Some “batch” job (ie: app) is loading bad data, causing errors in the DB, forcing the DB to take recovery action. The DB tells the app of this error state, yet the app ignores the error and continues anyways.
The database is doing exactly what it’s designed to do. The app isn’t listening to the errors thus the app is at fault.
Find out where these duplicate-key errors are originating. Does this table have an auto-increment column? If not, then the app is incorrectly generating ID numbers, or two separate parts of the app are trying to write to that table at the same time and colliding on IDs. If the table does have an auto-increment column, then the app appears to be specifying IDs rather than letting the DB manage them.
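One quick way to check (a sketch; 'mydb' and 'mytable' are placeholder names for your schema and the affected table) is to ask information_schema whether the table defines an auto-increment column:

```sql
SELECT column_name
  FROM information_schema.columns
 WHERE table_schema = 'mydb'      -- placeholder: your schema
   AND table_name   = 'mytable'   -- placeholder: the table hitting dup-key errors
   AND extra LIKE '%auto_increment%';
```

An empty result means the IDs are being generated by the application.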
I think I might see the reason, but it only happens once a day even though the job runs multiple times (why at almost exactly the same time daily, I do not quite get).
As the table has an FK and an auto-increment column, my thought is that this should be a DELETE rather than a TRUNCATE. This ?may? also be the reason for some of the "stalls" I get at times, if my reading is correct…
-- Truncate & insert into table
TRUNCATE table;  -- should leverage a DELETE FROM instead, is how I understand it
INSERT INTO table
  ( values )
SELECT ..... FROM various temp tables
TRUNCATE TABLE is equivalent to 'DROP TABLE; CREATE TABLE;'.
I would need to test, but I’m pretty sure that since TRUNCATE is a DDL operation, it is implicitly committing your transaction before the INSERTs begin.
Since a FK is involved, this table will get DROP/CREATE but any referenced rows in other tables will receive DELETE. Yes, I would expect this to be the source of stalls in your setup.
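If the intent is simply to replace the table's contents atomically, a hedged alternative (keeping the elided column and SELECT lists from the snippet above as placeholders) is to use DELETE, which is DML: it replicates as row events, respects FKs, and stays inside the transaction instead of implicitly committing it the way the DDL TRUNCATE does:

```sql
START TRANSACTION;
DELETE FROM table;  -- DML: transactional and rollback-able, unlike TRUNCATE
INSERT INTO table
  ( values )
SELECT ..... FROM various temp tables;
COMMIT;
```

Note that on a large table a DELETE is slower than TRUNCATE and generates a large writeset, which is its own trade-off.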
I think I may have narrowed it down some…
Master gets the following.
Event 3 Write_rows apply failed: 121, seqno 2757773857 in the wsrep_apply_events
According to the binlog this transaction has a successful commit with no issues
#220131 5:00:13 server id 444 end_log_pos 276507974 CRC32 0xa468ebf8 Xid = 2757773857
On reviewing the binlog files on the source and replicas I noticed one item: inside this transaction is the TRUNCATE TABLE. While I see the TRUNCATE TABLE in the source's binlog, I do not see it in the replicas' binlogs. Thus the following happens:
- The procedure does the insert and attempts to commit. The replicas still have the old data because they were not truncated, which causes the initial error.
- Somehow this triggers a quorum vote.
- As the transaction was successful on the master and not on the slaves, the cluster determines that the master is not correct and evicts it.
I think I will see if we can move the TRUNCATE TABLE outside the transaction to see if that accounts for it.
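A sketch of that split, assuming the batch job can tolerate the table being briefly empty between the two statements (placeholders carried over from the original snippet; the TRUNCATE, being DDL, then runs and replicates on its own rather than buried inside the transaction):

```sql
TRUNCATE table;  -- DDL: runs and replicates outside the transaction
START TRANSACTION;
INSERT INTO table
  ( values )
SELECT ..... FROM various temp tables;
COMMIT;
```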
I am now reviewing the following, which is similar (but not the same):
(Yearly Reminder: DDLs That Fail to Propagate May Cause Percona XtraDB Cluster Inconsistencies - Percona Database Performance Blog)