Seeking best practices and real-world experiences around managing, tuning, and operating PostgreSQL and MongoDB databases in production environments at scale

Hi everyone,

I’m looking to start a discussion around real-world database administration challenges for PostgreSQL and MongoDB in production environments.

Specifically, I’m interested in hearing from DBAs and platform engineers on:
  • Common performance bottlenecks you see in production
  • Best practices for backup, restore, and disaster recovery
  • High availability and replication strategies that have worked well
  • Monitoring and alerting tools you rely on day to day
  • Lessons learned from incidents or migrations at scale

This would be especially helpful for teams running databases on cloud platforms (AWS/GCP/Azure) as well as hybrid or on-prem setups.

Looking forward to learning from the community and sharing experiences.

Thanks!

Regarding

  • Common performance bottlenecks you see in production
    That depends on a large number of factors: schema design, application design, concurrency, workload, active data set, available CPU, memory, IO bandwidth, IO latency, network bandwidth, network latency, and so on. Hundreds of factors can affect database performance, scalability, stability, and availability, so there is generally nothing common to all cases. But if you insist, a poorly tuned host machine and database would be one common problem.
  • Best practices for backup, restore, and disaster recovery
    Start with the organization's backup and backup-retention policy, then implement that policy with a reliable backup solution such as pgBackRest.
  • High availability and replication strategies that have worked well
    One primary and two standbys is the common approach for very critical environments. Use a framework like Patroni to manage the cluster.
  • Monitoring and alerting tools you rely on day to day
    At Percona, we use and promote PMM.
  • Lessons learned from incidents or migrations at scale
    That would be too big for a forum update. But, put simply: invest in the design of the system. The quality of the design decides how the system is going to work in the future. The KISS philosophy (Keep It Simple, Stupid) generally works better. The more complexity, the more trouble to expect.
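
On the backup point above, it can help to write the retention policy down as something checkable before configuring any tool. The following is only a minimal sketch of a count-based policy ("keep the newest N full backups"). The function name `split_by_retention`, the `keep_full` parameter, and the backup dates are all hypothetical, and this is not how pgBackRest implements expiry; tools like pgBackRest apply retention through their own settings.

```python
from datetime import date


def split_by_retention(full_backups, keep_full):
    """Split backup dates into (kept, expired), keeping the newest `keep_full`.

    Hypothetical helper for illustration only; a real backup tool
    applies retention itself based on its own configuration.
    """
    if keep_full < 1:
        raise ValueError("keep_full must be at least 1")
    ordered = sorted(full_backups, reverse=True)  # newest first
    return ordered[:keep_full], ordered[keep_full:]


# Hypothetical weekly full backups (assumed data, not from a real system).
backups = [date(2024, 1, 7), date(2024, 1, 14), date(2024, 1, 21), date(2024, 1, 28)]
kept, expired = split_by_retention(backups, keep_full=2)

# The oldest kept full backup bounds how far back a restore can start from.
oldest_restorable = min(kept)
print("kept:", kept)
print("expired:", expired)
print("oldest restorable full backup:", oldest_restorable)
```

Comparing `oldest_restorable` against the retention period the organization requires is a quick way to confirm that the chosen count actually satisfies the policy.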