Tips for Warehouse Dimension Cleanup

We have a star-schema data warehouse in MySQL that has been online for a long time and has many unreferenced dimension records. We periodically archive data from our fact tables by dropping old partitions (easy). However, we have never cleaned-up (deleted unreferenced rows) the dimensions since many are referenced by multiple fact tables. In addition, we have new data coming in hourly.

It would be nice if the cleanup process worked like pt-online-schema-change, where it sets some triggers and then makes a copy of a dimension and swaps in the cleaned-up version. Looks like pt-archiver and pt-table-sync might be useful also. One thought, is to do an INSERT IGNORE of the dimension keys from the fact tables and then do a JOIN to produce the cleaned-up dimension. But, how do we keep the incoming data in sync (triggers?) since the cleanup could take days?

Not sure if there was some standard way to approach our issue. Thanks and any ideas appreciated.

1 Like