Xtradb + Magento + EC2 + EBS No Joy

We have a 3-node EC2 XtraDB cluster. All the nodes are m1.xlarge, EBS-optimized EC2 instances, with 1000 provisioned IOPS on EBS Provisioned IOPS volumes. We have a 1.3 GB database, which is not very large. I have included our my.cnf file.

Magento has a lot of tables created like this one:

CREATE TABLE report_viewed_product_index (
  index_id bigint(20) unsigned NOT NULL AUTO_INCREMENT COMMENT 'Index Id',
  visitor_id int(10) unsigned DEFAULT NULL COMMENT 'Visitor Id',
  customer_id int(10) unsigned DEFAULT NULL COMMENT 'Customer Id',
  product_id int(10) unsigned NOT NULL COMMENT 'Product Id',
  store_id smallint(5) unsigned DEFAULT NULL COMMENT 'Store Id',
  added_at timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'Added At',
  PRIMARY KEY (index_id),
  UNIQUE KEY UNQ_REPORT_VIEWED_PRODUCT_INDEX_CUSTOMER_ID_PRODUCT_ID (customer_id,product_id),
  UNIQUE KEY UNQ_REPORT_VIEWED_PRODUCT_INDEX_VISITOR_ID_PRODUCT_ID (visitor_id,product_id),
  KEY IDX_REPORT_VIEWED_PRODUCT_INDEX_STORE_ID (store_id),
  KEY IDX_REPORT_VIEWED_PRODUCT_INDEX_ADDED_AT (added_at),
  KEY IDX_REPORT_VIEWED_PRODUCT_INDEX_PRODUCT_ID (product_id),
  CONSTRAINT FK_REPORT_VIEWED_PRD_IDX_CSTR_ID_CSTR_ENTT_ENTT_ID FOREIGN KEY (customer_id) REFERENCES customer_entity (entity_id) ON DELETE CASCADE ON UPDATE CASCADE,
  CONSTRAINT FK_REPORT_VIEWED_PRD_IDX_PRD_ID_CAT_PRD_ENTT_ENTT_ID FOREIGN KEY (product_id) REFERENCES catalog_product_entity (entity_id) ON DELETE CASCADE ON UPDATE CASCADE,
  CONSTRAINT FK_REPORT_VIEWED_PRODUCT_INDEX_STORE_ID_CORE_STORE_STORE_ID FOREIGN KEY (store_id) REFERENCES core_store (store_id) ON DELETE SET NULL ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=1398 DEFAULT CHARSET=utf8 COMMENT='Reports Viewed Product Index Table';

Magento has a data access ORM it uses to access the database. The default behavior of the Magento ORM is to take all of the columns in a table and construct an INSERT ... ON DUPLICATE KEY UPDATE (IDKU) on that table. The application does most of its data access using this approach. For example, from the query log:

 1181 Query	START TRANSACTION
 1181 Query	INSERT INTO `report_viewed_product_index` (`visitor_id`,`customer_id`,`product_id`,`store_id`,`added_at`) VALUES ('542', NULL, '8033', '1', '2013-02-08 09:08:27') ON DUPLICATE KEY UPDATE visitor_id = VALUES(`visitor_id`), customer_id = VALUES(`customer_id`), product_id = VALUES(`product_id`), store_id = VALUES(`store_id`), added_at = VALUES(`added_at`)
 1181 Query	COMMIT
 1181 Query	START TRANSACTION
 1181 Query	INSERT INTO `report_event` (`logged_at`, `event_type_id`, `object_id`, `subject_id`, `subtype`, `store_id`) VALUES ('2013-02-08 09:08:27', '1', '8033', '542', '1', '1')
 1181 Query	COMMIT

This table has a couple of fun things to note. First, the data access uses an IDKU against a table with an auto-increment primary key.
There are two unique secondary constraints. Every row in this table has a NULL in either visitor_id or customer_id: guest users have a visitor_id and a NULL customer_id, while logged-in users have a NULL visitor_id and a customer_id.
Look closely at this; it was a surprise to me that it even works. In this table you will have tuples like {customer_id=NULL, product_id=5665}, {customer_id=NULL, product_id=5665}. They are relying on the fact that NULL != NULL, so the unique constraint on (customer_id, product_id) never fires for those rows. Anyway, I find it quite disconcerting that this even works, let alone that it is probably our number one update path on this database.
The access to report_viewed_product_index basically results in an IDKU. index_id is not passed in. The key-exists test relies on NULL != NULL to take one of the unique constraints out of play, and tests the other constraint to see whether the row exists. If it exists, it updates that row; otherwise it takes a new auto-increment value and inserts a new row.
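To convince myself this is real, here is a minimal sketch on a hypothetical scratch table (uniq_null_demo is not a Magento table, just an illustration of InnoDB behavior): NULLs are never considered equal inside a unique index, so two guest rows for the same product coexist.

CREATE TABLE uniq_null_demo (
  id bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  customer_id int(10) unsigned DEFAULT NULL,
  product_id int(10) unsigned NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY UNQ_DEMO_CUSTOMER_ID_PRODUCT_ID (customer_id, product_id)
) ENGINE=InnoDB;

-- Both succeed: NULL != NULL in a unique index, so the two guest rows
-- for product 5665 do not collide.
INSERT INTO uniq_null_demo (customer_id, product_id) VALUES (NULL, 5665);
INSERT INTO uniq_null_demo (customer_id, product_id) VALUES (NULL, 5665);

-- With a real customer_id the second statement fails with ERROR 1062
-- (duplicate key), or becomes an update when written as an IDKU.
INSERT INTO uniq_null_demo (customer_id, product_id) VALUES (28, 5665);
INSERT INTO uniq_null_demo (customer_id, product_id) VALUES (28, 5665);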
report_viewed_product_index gets an insert or update every time someone clicks on a product in a Magento store. That is a lot going on for a simple update, especially since it is the single busiest update path we have.
It seems that at some point in the development of Magento, the devs rallied together and killed off all the DBAs.
OK, now for the actual errors. We often see this on a replication partner.

Slave SQL: Could not execute Write_rows event on table magento.report_viewed_product_index; Duplicate entry '28-12467' for key 'UNQ_REPORT_VIEWED_PRODUCT_INDEX_CUSTOMER_ID_PRODUCT_ID', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 130, Error_code: 1062
130207 21:00:44 [Warning] WSREP: RBR event 2 Write_rows apply warning: 121, 1090257
130207 21:00:44 [ERROR] WSREP: Failed to apply trx: source: d598ddc3-7142-11e2-0800-14fbd278c361 version: 2 local: 0 state: APPLYING flags: 1 conn_id: 43487 trx_id: 33729512 seqnos (l: 35922, g: 1090257, s: 1090256, d: 1090159, ts: 1360270844390362667)
130207 21:00:44 [ERROR] WSREP: Failed to apply app buffer: üQ, seqno: 1090257, status: WSREP_FATAL
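For anyone reading along: the duplicate entry '28-12467' is just the values of the (customer_id, product_id) unique key joined with a dash, so when this happens the conflicting row can be compared across nodes with something like:

SELECT index_id, visitor_id, customer_id, product_id, store_id, added_at
FROM report_viewed_product_index
WHERE customer_id = 28 AND product_id = 12467;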

When we go look at the main database in our cluster at that point, we see that the report_viewed_product_index table is corrupt. By then the two slave nodes have taken themselves offline. When we bring the first slave back up it resyncs from the master, so we have a bit of outage time.
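For reference, the node state can be checked on each instance with the standard Galera/wsrep status variables; a minimal sketch:

SHOW GLOBAL STATUS LIKE 'wsrep_ready';               -- ON when the node is accepting queries
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';        -- how many nodes this node still sees
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'; -- Synced / Donor / Joining, etc.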
Sometimes we see other tables in the mix. They have the same data access pattern, so I have not shared their details.

130208 7:59:56 [ERROR] Slave SQL: Could not execute Write_rows event on table magento.sales_bestsellers_aggregated_monthly; Duplicate entry '2012-09-01-0-15003' for key 'UNQ_SALES_BESTSELLERS_AGGRED_MONTHLY_PERIOD_STORE_ID_PRD_ID', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 26681, Error_code: 1062
130208 7:59:56 [Warning] WSREP: RBR event 29 Write_rows apply warning: 121, 1164947
130208 7:59:56 [Note] WSREP: declaring 31740356-717f-11e2-0800-26ba8c5eb9c8 stable
130208 7:59:56 [Note] WSREP: (629b59d1-7180-11e2-0800-2391cac868f2, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.147.177.227:4567
130208 7:59:56 [ERROR] WSREP: Failed to apply trx: source: 31740356-717f-11e2-0800-26ba8c5eb9c8 version: 2 local: 0 state: APPLYING flags: 1 conn_id: 70554 trx_id: 44884447 seqnos (l: 56964, g: 1164947, s: 1164946, d: 1164944, ts: 1360310399157610267)
130208 7:59:56 [ERROR] WSREP: Failed to apply app buffer: °Q, seqno: 1164947, status: WSREP_FATAL

Usually around the time we see these in the MySQL error logs, we also see iowait sustained around the 45% mark.
With a standalone MySQL instance and quite a bit of tuning, we had been able to get this database to be pretty reliable. But it will still occasionally bail, so we would like a clustered database to improve our chances of staying up.
So far we have not been able to get our XtraDB instances to stay up. Help/pointers on how to debug this and figure out what is going on would be greatly appreciated. I suspect at this point we are looking at rewriting the data access code, probably starting with our hot tables, but I am grasping a bit, trying to find a database-side way to keep it up and running.
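For what it's worth, here is a rough sketch of the kind of rewrite I have in mind for the hot path, reusing the values from the query log above; the row-count check and transaction handling would actually live in the application code rather than raw SQL:

-- Try to update the existing row for this visitor/product first,
-- so the auto-increment key is never part of conflict resolution.
START TRANSACTION;

UPDATE report_viewed_product_index
SET store_id = 1, added_at = '2013-02-08 09:08:27'
WHERE visitor_id = 542 AND product_id = 8033;

-- Only if the UPDATE matched zero rows (ROW_COUNT() = 0, or the driver's
-- affected-rows count in the application) do the insert:
INSERT INTO report_viewed_product_index
  (visitor_id, customer_id, product_id, store_id, added_at)
VALUES (542, NULL, 8033, 1, '2013-02-08 09:08:27');

COMMIT;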