Logical replication on large scale - 133 Databases - 140 Large tables - 2 replication slots per database

Hi there, and thank you in advance for your help.

I need to build a pipeline for Data Lakes/Data Warehouses using PostgresSQL 14.8 as a source and S3 as a target, with AWS Managed Service DMS capturing the data from logical replication with test_decoding.

First question: Is 266 replication slots too much and I am being optimistic to believe that PostgreSQL 14.8 can handles it? Any foreseen challenges and general guidance?

For instance, so far the same solution is working fine on servers with low number of databases but ingests much more data daily. For context we replicate 140 tables per database divided in 2 replication slots per database with roughly 70 tables per replication slot.

Yesterday particularly, for the mentioned server with 133 databases, we noticed that when we ran backups (parallel pg_dump on all 133 databases flag “–jobs 8”), the CPU used by the logical replication user skyrocketed, as could be seen in attached screenshot; database wait events: “IO:ReorderBufferWrite”, “IO:WALRead”.

Any guidance as to why pg_dump would cause this issue, and specifically on “IO:ReorderBufferWrite”? We noticed that even after the pg_dump completed at 6:00am, the replication user was still behind on “IO:ReorderBufferWrite”.

Thank you so much for your time and consideration. Appreciate it!

Link for the screenshot: Logical-Replication-High-Usage hosted at ImgBB — ImgBB

Hi,

Is 266 replication slots too much and I am being optimistic to believe that PostgreSQL 14.8 can handles it?

I’ll be honest, I’ve never tested the limits of maximum replication slots possible. Offhand there’s two limitations you need to consider:

  • hardcoded limitations: in that case get onto the hackers maillist and ask there
  • network IO: your network admin can help you determine bottlenecks

The simple way of course is to build out a demo environment and try it out. If you choose that option I’d suggest looking into leveraging pgbench using a custom script.

… we ran backups (parallel pg_dump on all 133 databases flag “–jobs 8”), the CPU used by the logical replication user skyrocketed,…

Sounds like the Long Running Transaction bugaboo. You can also have a look at this blog too for further guidance/

Hope this helps.

1 Like