1047 Unknown command errors on PXC node at the same time each day

We have a 3-node PXC cluster running in production with HAProxy sitting in front handling 99% of the MySQL connections from our web applications.

However, we have a PHP (yes, I know, not the best tool for the job) daemon that connects directly to the first node in the cluster and runs a simple select query every 2 seconds: SELECT * FROM myTable WHERE status IN ('New') ORDER BY id ASC;

It’s a small table with few rows. It runs perfectly all day, every day, and then at exactly 10:05PM every day it throws a 1047 (Unknown command) error.

The only thing I can think of is logrotate possibly putting the cluster in a state where it throws such an error.

I’ve read Frederic Descamps’ post on split brain ( [url]http://www.mysqlperformanceblog.com/2012/07/25/percona-xtradb-cluster-failure-scenarios-with-only-2-nodes/[/url] ), but I’m not quite sure how this cluster node would get separated from the cluster to cause this error.

Any help would be greatly appreciated.

You want to know the state of each node in the cluster when this happens (SHOW GLOBAL STATUS LIKE 'wsrep%'). Does it recover by itself? What happens in the mysql error logs at this time?
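If it helps, here is a minimal sketch of how to interpret the wsrep variables that matter for this symptom. It assumes you fetch the status rows with whatever client you prefer; `node_can_serve_queries` is a hypothetical helper, not part of any library, and the variable names are the standard Galera/wsrep status names:

```python
# Sketch: interpret the output of SHOW GLOBAL STATUS LIKE 'wsrep%'.
# The dicts below stand in for rows fetched with your MySQL client of
# choice; the keys are the standard Galera/wsrep status variable names.

def node_can_serve_queries(status):
    """Return True if this node should accept queries without 1047 errors."""
    return (
        status.get("wsrep_ready") == "ON"                         # node accepts queries
        and status.get("wsrep_cluster_status") == "Primary"       # in the primary component
        and status.get("wsrep_local_state_comment") == "Synced"   # fully synced with the group
    )

# Example: a healthy node vs. one partitioned from the cluster.
healthy = {
    "wsrep_ready": "ON",
    "wsrep_cluster_status": "Primary",
    "wsrep_local_state_comment": "Synced",
}
partitioned = {
    "wsrep_ready": "OFF",
    "wsrep_cluster_status": "non-Primary",
    "wsrep_local_state_comment": "Initialized",
}
```

A node in the `partitioned` state is exactly the kind that answers every statement with error 1047.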

Thanks for your reply, Jay. I really appreciate it.

It does recover by itself but takes a few minutes.

I’ll have to try to trap the wsrep variable states when it happens.

I’ve pasted the mysql logs below. Seems pretty consistent at this time each day.

Thanks in advance for any insight you can offer.

130708 22:04:31 [Note] WSREP: (0dca7efa-ceb9-11e2-0800-426d225ba0b6, 'tcp://0.0.0.0:4567') reconnecting to 063fe2dd-ceba-11e2-0800-37fa3005f9fe (tcp://172.18.4.129:4567), attempt 0
130708 22:04:32 [Note] WSREP: evs::proto(0dca7efa-ceb9-11e2-0800-426d225ba0b6, GATHER, view_id(REG,063fe2dd-ceba-11e2-0800-37fa3005f9fe,152)) suspecting node: 063fe2dd-ceba-11e2-0800-37fa3005f9fe
130708 22:04:33 [Note] WSREP: declaring 7554f7d0-ceb7-11e2-0800-c1b580896c4b stable
130708 22:04:33 [Note] WSREP: Node 0dca7efa-ceb9-11e2-0800-426d225ba0b6 state prim
130708 22:04:33 [Note] WSREP: view(view_id(PRIM,0dca7efa-ceb9-11e2-0800-426d225ba0b6,154) memb {
0dca7efa-ceb9-11e2-0800-426d225ba0b6,
7554f7d0-ceb7-11e2-0800-c1b580896c4b,
} joined {
} left {
} partitioned {
063fe2dd-ceba-11e2-0800-37fa3005f9fe,
})
130708 22:04:33 [Note] WSREP: forgetting 063fe2dd-ceba-11e2-0800-37fa3005f9fe (tcp://172.18.4.129:4567)
130708 22:04:33 [Note] WSREP: deleting entry tcp://172.18.4.129:4567
130708 22:04:33 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
130708 22:04:33 [Note] WSREP: (0dca7efa-ceb9-11e2-0800-426d225ba0b6, 'tcp://0.0.0.0:4567') turning message relay requesting off
130708 22:04:33 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 8d4155ec-e7c6-11e2-0800-577937398872
130708 22:04:33 [Note] WSREP: STATE EXCHANGE: sent state msg: 8d4155ec-e7c6-11e2-0800-577937398872
130708 22:04:33 [Note] WSREP: STATE EXCHANGE: got state msg: 8d4155ec-e7c6-11e2-0800-577937398872 from 0 (pxc3)
130708 22:04:33 [Note] WSREP: STATE EXCHANGE: got state msg: 8d4155ec-e7c6-11e2-0800-577937398872 from 1 (pxc2)
130708 22:04:33 [Note] WSREP: Quorum results:
version = 2,
component = PRIMARY,
conf_id = 145,
members = 2/2 (joined/total),
act_id = 533576,
last_appl. = 533498,
protocols = 0/4/2 (gcs/repl/appl),
group UUID = a9e219f1-4291-11e2-0800-f07a1fee082b
130708 22:04:33 [Note] WSREP: Flow-control interval: [23, 23]
130708 22:04:33 [Note] WSREP: New cluster view: global state: a9e219f1-4291-11e2-0800-f07a1fee082b:533576, view# 146: Primary, number of nodes: 2, my index: 0, protocol version 2
130708 22:04:33 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130708 22:04:33 [Note] WSREP: Assign initial position for certification: 533576, protocol version: 2
130708 22:04:38 [Note] WSREP: cleaning up 063fe2dd-ceba-11e2-0800-37fa3005f9fe (tcp://172.18.4.129:4567)
130708 22:05:09 [Note] WSREP: (0dca7efa-ceb9-11e2-0800-426d225ba0b6, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://172.18.4.129:4567
130708 22:05:09 [Note] WSREP: (0dca7efa-ceb9-11e2-0800-426d225ba0b6, 'tcp://0.0.0.0:4567') turning message relay requesting off
130708 22:05:10 [Note] WSREP: declaring 063fe2dd-ceba-11e2-0800-37fa3005f9fe stable
130708 22:05:10 [Note] WSREP: declaring 7554f7d0-ceb7-11e2-0800-c1b580896c4b stable
130708 22:05:10 [Note] WSREP: Node 0dca7efa-ceb9-11e2-0800-426d225ba0b6 state prim
130708 22:05:10 [Note] WSREP: view(view_id(PRIM,063fe2dd-ceba-11e2-0800-37fa3005f9fe,155) memb {
063fe2dd-ceba-11e2-0800-37fa3005f9fe,
0dca7efa-ceb9-11e2-0800-426d225ba0b6,
7554f7d0-ceb7-11e2-0800-c1b580896c4b,
} joined {
} left {
} partitioned {
})
130708 22:05:10 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 3
130708 22:05:10 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
130708 22:05:10 [Note] WSREP: STATE EXCHANGE: sent state msg: a392b179-e7c6-11e2-0800-cb8c686fe4f5
130708 22:05:10 [Note] WSREP: STATE EXCHANGE: got state msg: a392b179-e7c6-11e2-0800-cb8c686fe4f5 from 0 (pxc1)
130708 22:05:10 [Note] WSREP: STATE EXCHANGE: got state msg: a392b179-e7c6-11e2-0800-cb8c686fe4f5 from 1 (pxc3)
130708 22:05:10 [Note] WSREP: STATE EXCHANGE: got state msg: a392b179-e7c6-11e2-0800-cb8c686fe4f5 from 2 (pxc2)
130708 22:05:10 [Note] WSREP: Quorum results:
version = 2,
component = PRIMARY,
conf_id = 146,
members = 2/3 (joined/total),
act_id = 533582,
last_appl. = 533498,
protocols = 0/4/2 (gcs/repl/appl),
group UUID = a9e219f1-4291-11e2-0800-f07a1fee082b
130708 22:05:10 [Note] WSREP: Flow-control interval: [28, 28]
130708 22:05:10 [Note] WSREP: New cluster view: global state: a9e219f1-4291-11e2-0800-f07a1fee082b:533582, view# 147: Primary, number of nodes: 3, my index: 1, protocol version 2
130708 22:05:10 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130708 22:05:10 [Note] WSREP: Assign initial position for certification: 533582, protocol version: 2
130708 22:05:42 [Note] WSREP: Node 0 (pxc1) requested state transfer from 'any'. Selected 1 (pxc3)(SYNCED) as donor.
130708 22:05:42 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 533582)
130708 22:05:42 [Note] WSREP: IST request: a9e219f1-4291-11e2-0800-f07a1fee082b:533576-533582|tcp://172.18.4.129:4568
130708 22:05:42 [Note] WSREP: async IST sender starting to serve tcp://172.18.4.129:4568 sending 533577-533582
130708 22:05:42 [Note] WSREP: async IST sender served
130708 22:05:42 [Note] WSREP: 1 (pxc3): State transfer to 0 (pxc1) complete.
130708 22:05:42 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 533582)
130708 22:05:42 [Note] WSREP: Member 1 (pxc3) synced with group.
130708 22:05:42 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 533582)
130708 22:05:42 [Note] WSREP: Synchronized with group, ready for connections
130708 22:05:42 [Note] WSREP: 0 (pxc1): State transfer from 1 (pxc3) complete.
130708 22:05:42 [Note] WSREP: Member 0 (pxc1) synced with group.
130708 22:06:51 [Note] WSREP: (0dca7efa-ceb9-11e2-0800-426d225ba0b6, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://172.18.4.130:4567
130708 22:06:52 [Note] WSREP: (0dca7efa-ceb9-11e2-0800-426d225ba0b6, 'tcp://0.0.0.0:4567') reconnecting to 7554f7d0-ceb7-11e2-0800-c1b580896c4b (tcp://172.18.4.130:4567), attempt 0
130708 22:06:53 [Note] WSREP: evs::proto(0dca7efa-ceb9-11e2-0800-426d225ba0b6, GATHER, view_id(REG,063fe2dd-ceba-11e2-0800-37fa3005f9fe,155)) suspecting node: 7554f7d0-ceb7-11e2-0800-c1b580896c4b
130708 22:06:54 [Note] WSREP: evs::proto(0dca7efa-ceb9-11e2-0800-426d225ba0b6, GATHER, view_id(REG,063fe2dd-ceba-11e2-0800-37fa3005f9fe,155)) suspecting node: 7554f7d0-ceb7-11e2-0800-c1b580896c4b
130708 22:06:55 [Note] WSREP: declaring 063fe2dd-ceba-11e2-0800-37fa3005f9fe stable
130708 22:06:55 [Note] WSREP: Node 063fe2dd-ceba-11e2-0800-37fa3005f9fe state prim
130708 22:06:55 [Note] WSREP: view(view_id(PRIM,063fe2dd-ceba-11e2-0800-37fa3005f9fe,156) memb {
063fe2dd-ceba-11e2-0800-37fa3005f9fe,
0dca7efa-ceb9-11e2-0800-426d225ba0b6,
} joined {
} left {
} partitioned {
7554f7d0-ceb7-11e2-0800-c1b580896c4b,
})
130708 22:06:55 [Note] WSREP: forgetting 7554f7d0-ceb7-11e2-0800-c1b580896c4b (tcp://172.18.4.130:4567)
130708 22:06:55 [Note] WSREP: deleting entry tcp://172.18.4.130:4567
130708 22:06:55 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
130708 22:06:55 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
130708 22:06:55 [Note] WSREP: (0dca7efa-ceb9-11e2-0800-426d225ba0b6, 'tcp://0.0.0.0:4567') turning message relay requesting off
130708 22:06:55 [Note] WSREP: STATE EXCHANGE: sent state msg: e1e0b984-e7c6-11e2-0800-e0315f7301f1
130708 22:06:55 [Note] WSREP: STATE EXCHANGE: got state msg: e1e0b984-e7c6-11e2-0800-e0315f7301f1 from 0 (pxc1)
130708 22:06:55 [Note] WSREP: STATE EXCHANGE: got state msg: e1e0b984-e7c6-11e2-0800-e0315f7301f1 from 1 (pxc3)
130708 22:06:55 [Note] WSREP: Quorum results:
version = 2,
component = PRIMARY,
conf_id = 147,
members = 2/2 (joined/total),
act_id = 533583,
last_appl. = 0,
protocols = 0/4/2 (gcs/repl/appl),
group UUID = a9e219f1-4291-11e2-0800-f07a1fee082b
130708 22:06:55 [Note] WSREP: Flow-control interval: [23, 23]
130708 22:06:55 [Note] WSREP: New cluster view: global state: a9e219f1-4291-11e2-0800-f07a1fee082b:533583, view# 148: Primary, number of nodes: 2, my index: 1, protocol version 2

130708 22:06:55 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130708 22:06:55 [Note] WSREP: Assign initial position for certification: 533583, protocol version: 2
130708 22:07:00 [Note] WSREP: cleaning up 7554f7d0-ceb7-11e2-0800-c1b580896c4b (tcp://172.18.4.130:4567)
130708 22:07:30 [Note] WSREP: (0dca7efa-ceb9-11e2-0800-426d225ba0b6, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://172.18.4.130:4567
130708 22:07:30 [Note] WSREP: (0dca7efa-ceb9-11e2-0800-426d225ba0b6, 'tcp://0.0.0.0:4567') turning message relay requesting off
130708 22:07:31 [Note] WSREP: declaring 063fe2dd-ceba-11e2-0800-37fa3005f9fe stable
130708 22:07:31 [Note] WSREP: declaring 7554f7d0-ceb7-11e2-0800-c1b580896c4b stable
130708 22:07:31 [Note] WSREP: Node 063fe2dd-ceba-11e2-0800-37fa3005f9fe state prim
130708 22:07:31 [Note] WSREP: view(view_id(PRIM,063fe2dd-ceba-11e2-0800-37fa3005f9fe,157) memb {
063fe2dd-ceba-11e2-0800-37fa3005f9fe,
0dca7efa-ceb9-11e2-0800-426d225ba0b6,
7554f7d0-ceb7-11e2-0800-c1b580896c4b,
} joined {
} left {
} partitioned {
})
130708 22:07:31 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 3
130708 22:07:31 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
130708 22:07:31 [Note] WSREP: STATE EXCHANGE: sent state msg: f769c6d3-e7c6-11e2-0800-9031d1b1aac0
130708 22:07:31 [Note] WSREP: STATE EXCHANGE: got state msg: f769c6d3-e7c6-11e2-0800-9031d1b1aac0 from 0 (pxc1)
130708 22:07:31 [Note] WSREP: STATE EXCHANGE: got state msg: f769c6d3-e7c6-11e2-0800-9031d1b1aac0 from 1 (pxc3)
130708 22:07:31 [Note] WSREP: STATE EXCHANGE: got state msg: f769c6d3-e7c6-11e2-0800-9031d1b1aac0 from 2 (pxc2)
130708 22:07:31 [Note] WSREP: Quorum results:
version = 2,
component = PRIMARY,
conf_id = 148,
members = 2/3 (joined/total),
act_id = 533584,
last_appl. = 0,
protocols = 0/4/2 (gcs/repl/appl),
group UUID = a9e219f1-4291-11e2-0800-f07a1fee082b
130708 22:07:31 [Note] WSREP: Flow-control interval: [28, 28]
130708 22:07:31 [Note] WSREP: New cluster view: global state: a9e219f1-4291-11e2-0800-f07a1fee082b:533584, view# 149: Primary, number of nodes: 3, my index: 1, protocol version 2
130708 22:07:31 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130708 22:07:31 [Note] WSREP: Assign initial position for certification: 533584, protocol version: 2
130708 22:08:03 [Note] WSREP: Node 2 (pxc2) requested state transfer from 'any'. Selected 0 (pxc1)(SYNCED) as donor.
130708 22:08:03 [Note] WSREP: 0 (pxc1): State transfer to 2 (pxc2) complete.
130708 22:08:03 [Note] WSREP: Member 0 (pxc1) synced with group.
130708 22:08:03 [Note] WSREP: 2 (pxc2): State transfer from 0 (pxc1) complete.
130708 22:08:03 [Note] WSREP: Member 2 (pxc2) synced with group.

I can see what is happening, but I don’t know why:

pxc1 fails out of the cluster with a suspect timeout:

130708 22:04:32 [Note] WSREP: evs::proto(0dca7efa-ceb9-11e2-0800-426d225ba0b6, GATHER, view_id(REG,063fe2dd-ceba-11e2-0800-37fa3005f9fe,152)) suspecting node: 063fe2dd-ceba-11e2-0800-37fa3005f9fe
130708 22:04:33 [Note] WSREP: declaring 7554f7d0-ceb7-11e2-0800-c1b580896c4b stable
130708 22:04:33 [Note] WSREP: Node 0dca7efa-ceb9-11e2-0800-426d225ba0b6 state prim
130708 22:04:33 [Note] WSREP: view(view_id(PRIM,0dca7efa-ceb9-11e2-0800-426d225ba0b6,154) memb {
0dca7efa-ceb9-11e2-0800-426d225ba0b6,
7554f7d0-ceb7-11e2-0800-c1b580896c4b,
} joined {
} left {
} partitioned {
063fe2dd-ceba-11e2-0800-37fa3005f9fe,
})

So, at this point the node identified by '063fe2dd-ceba-11e2-0800-37fa3005f9fe' has been partitioned from the cluster. Which node is it? We can tell when it comes back about 37 seconds later:

130708 22:05:10 [Note] WSREP: view(view_id(PRIM,063fe2dd-ceba-11e2-0800-37fa3005f9fe,155) memb {
063fe2dd-ceba-11e2-0800-37fa3005f9fe,
0dca7efa-ceb9-11e2-0800-426d225ba0b6,
7554f7d0-ceb7-11e2-0800-c1b580896c4b,
} joined {
} left {
} partitioned {
})

and does an IST:

130708 22:05:42 [Note] WSREP: Node 0 (pxc1) requested state transfer from 'any'. Selected 1 (pxc3)(SYNCED) as donor.

So it was pxc1.

Then, I see pxc2 do exactly the same thing:

130708 22:06:54 [Note] WSREP: evs::proto(0dca7efa-ceb9-11e2-0800-426d225ba0b6, GATHER, view_id(REG,063fe2dd-ceba-11e2-0800-37fa3005f9fe,155)) suspecting node: 7554f7d0-ceb7-11e2-0800-c1b580896c4b
130708 22:06:55 [Note] WSREP: view(view_id(PRIM,063fe2dd-ceba-11e2-0800-37fa3005f9fe,156) memb {
063fe2dd-ceba-11e2-0800-37fa3005f9fe,
0dca7efa-ceb9-11e2-0800-426d225ba0b6,
} joined {
} left {
} partitioned {
7554f7d0-ceb7-11e2-0800-c1b580896c4b,
})

Then it comes back about 36 seconds later:

130708 22:07:31 [Note] WSREP: view(view_id(PRIM,063fe2dd-ceba-11e2-0800-37fa3005f9fe,157) memb {
063fe2dd-ceba-11e2-0800-37fa3005f9fe,
0dca7efa-ceb9-11e2-0800-426d225ba0b6,
7554f7d0-ceb7-11e2-0800-c1b580896c4b,
} joined {
} left {
} partitioned {
})

130708 22:08:03 [Note] WSREP: Node 2 (pxc2) requested state transfer from 'any'. Selected 0 (pxc1)(SYNCED) as donor.

From the cluster’s perspective, a suspect_timeout means a complete absence of communication from the node in question; the default timeout here is 5 seconds. If the node were simply shut down cleanly, this would not occur. It’s as if network communication stops on that node for a while. I can’t tell what is causing the disconnection, only the cluster’s view of it.
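For reference, that timeout lives in the Galera provider options. A sketch of how it would be tuned (the values here are illustrative, not recommendations; raising the timeouts only rides out short network blips, it does not fix the underlying cause, and it also delays detection of genuinely dead nodes):

```ini
# my.cnf fragment -- illustrative values, tune for your own network.
# evs.suspect_timeout: how long a silent peer is tolerated before being
#   suspected (default PT5S, i.e. 5 seconds).
# evs.inactive_timeout: hard limit before a silent peer is pronounced dead.
[mysqld]
wsrep_provider_options="evs.suspect_timeout=PT15S;evs.inactive_timeout=PT30S"
```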

During this timeout all commits on the cluster will hang.

After the node is partitioned from the cluster, any clients connected there would receive the 'Unknown command' errors you are talking about.

Thanks Jay,

I really appreciate the time you took to look into this. I’ve been right through all our cron jobs and daily tasks on the server, and I can’t see anything on our side that would cause this so periodically, apart from maybe logrotate.

The environment is hosted on a VMware ESX platform and uses SureBackup to check and verify backups. I’ve sent an email off to our hosting provider to see if it could be something from their end (firewall, switch, ESX or SureBackup). I’ll see what they come back to me with.

If nothing comes from that I might get LeFred to jump on the server and see if he can find anything that could be causing the issue.

I’ll put what we find out back on the forums in an effort to increase the general knowledge base for PXC.

Thanks again for your help, it’s given us some good insight into what’s occurring each night.

Kind Regards

William Jamieson