High CPU usage after PBM restore run finished

Hi there,

I faced the issue that after a restore operation triggered via PMM finished, the pbm-agent is staying at a very high CPU usage level. Unfortunately, I can’t see anything suspicious in the pbm logs and the pbm status looks ok as well. Anyone having an idea what’s wrong here?

$ pbm logs
2024-01-22T02:40:56Z I [replica01/172.16.4.100:27017] [restore/2024-01-21T22:15:11.509129099Z] restoring indexes for thing-it.productoffers: byProduct
2024-01-22T02:40:56Z I [replica01/172.16.4.100:27017] [restore/2024-01-21T22:15:11.509129099Z] restoring indexes for thing-it.bruteforce.log: expiresAt_1
2024-01-22T02:40:56Z I [replica01/172.16.4.100:27017] [restore/2024-01-21T22:15:11.509129099Z] restoring indexes for thing-it.sessionStores: storeId_1, expireAt_1
2024-01-22T02:40:56Z I [replica01/172.16.4.100:27017] [restore/2024-01-21T22:15:11.509129099Z] restoring indexes for thing-it.sensorDataAggregation: byTag, byTimeRange, byTimeRangeAndMesh, byFloorAndKeyword
2024-01-22T02:41:22Z I [replica01/172.16.4.100:27017] [restore/2024-01-21T22:15:11.509129099Z] restoring indexes for thing-it.sessionTokens: byTokenKey, autoExpiration, byUserIdAndExpiresAt
2024-01-22T02:41:27Z I [replica01/172.16.4.100:27017] [restore/2024-01-21T22:15:11.509129099Z] restoring indexes for thing-it.meshes: byUuid
2024-01-22T02:41:27Z I [replica01/172.16.4.100:27017] [restore/2024-01-21T22:15:11.509129099Z] restoring indexes for thing-it.nodecomputers: byUuid
2024-01-22T02:41:27Z I [replica01/172.16.4.100:27017] [restore/2024-01-21T22:15:11.509129099Z] restoring indexes for thing-it.calendarentries: assetClass_1_assetId_1_recurrence_1_end_1_start_1, byRelatedGroupwareEvent, timelineByAsset, recurringByAsset, nttpGroupwareEventId_1, lockerBookingsByCalendar, assetClass_1_start_1_end_1, byCalendar, bookingUser_1_assetClass_1_recurrence_1_end_1_start_1, end_1_event_1, bookedForUser_1_event_1_class_1, timelineByCustomer, timelineByCalendar, entry-timeline-recurring, entry-timeline-recurring2, event_1_class_1, groupwareEventId_1, timeline, by-start-end-end, customer_1_recurrence_1_start_1_end_1, byBookedAsset, recurringByCustomerAndAsset, timelineByCustomer2, recurringByUserAndAssetClass, recurringEventsByParticipant, eventTimelineByParticipant, byModificationDate, byRelatedBookings, assetBookingsByUserAndAssetClass, lockerBookingsByUser, nonMeshTimelineByCustomerAndAssetClass, entry-timeline, groupwareSeriesId_1, mesh_1_assetClass_1_recurrence_1_end_1_start_1, byBookedForUser, timelineByAssetBooking, entry-timeline1, recurringByCalendar, timelineByUserAndAssetClass, timelineByCustomer1, agendaByUserAndAssetClass, byCheckoutIdConnector, start_1_recurrence_1, recurringByAssetClass, assetBookingsByUser, byCreationDate, customer_1_groupwareEventId_1_recurrence_1, byTtpGroupwareEventId
2024-01-22T05:42:54Z I [replica01/172.16.4.100:27017] [restore/2024-01-21T22:15:11.509129099Z] restoring indexes for thing-it.devicecontexts: byDevice, byUser, byUserSession
2024-01-22T05:46:08Z I [replica01/172.16.4.100:27017] [restore/2024-01-21T22:15:11.509129099Z] restoring indexes for thing-it.users: byCellPhone, byAlias, account_1, byEMail, searchableByLastNameEN, searchableByAliasEN, byCustomerAndValidity, searchableByFirstNameEN, byUuid, byCustomer, searchableByAccountEN, bySearchablePropertyEN, byAccount
2024-01-22T05:46:49Z I [replica01/172.16.4.100:27017] [restore/2024-01-21T22:15:11.509129099Z] restoring indexes for thing-it.externalsystems: byMailDomainAndType
2024-01-22T05:46:49Z I [replica01/172.16.4.100:27017] [restore/2024-01-21T22:15:11.509129099Z] restoring indexes for thing-it.billingpositions: ativityByCustomerAndDate, type_1, meshCustomer_1_date_1, mesh_1_type_1_date_1, byTypeAndDate, activityByDateAndUser
2024-01-22T05:52:08Z I [replica01/172.16.4.100:27017] [restore/2024-01-21T22:15:11.509129099Z] restoring indexes for admin.pbmOpLog: opid_1_replset_1
2024-01-22T05:52:08Z I [replica01/172.16.4.100:27017] [restore/2024-01-21T22:15:11.509129099Z] restoring indexes for admin.pbmLock: replset_1
2024-01-22T05:52:08Z I [replica01/172.16.4.100:27017] [restore/2024-01-21T22:15:11.509129099Z] restoring indexes for admin.pbmLockOp: replset_1_type_1
2024-01-22T05:52:08Z I [replica01/172.16.4.100:27017] [restore/2024-01-21T22:15:11.509129099Z] restoring indexes for admin.system.users: user_1_db_1
2024-01-22T05:52:08Z I [replica01/172.16.4.100:27017] [restore/2024-01-21T22:15:11.509129099Z] restoring indexes for admin.pbmBackups: name_1, start_ts_1_status_1
2024-01-22T05:52:08Z I [replica01/172.16.4.100:27017] [restore/2024-01-21T22:15:11.509129099Z] restoring indexes for admin.pbmPITRChunks: rs_1_start_ts_1_end_ts_1, start_ts_1_end_ts_1
2024-01-22T05:52:08Z I [replica01/172.16.4.100:27017] [restore/2024-01-21T22:15:11.509129099Z] restoring indexes for admin.system.roles: role_1_db_1
2024-01-22T05:52:10Z I [replica01/172.16.4.100:27017] [restore/2024-01-21T22:15:11.509129099Z] recovery successfully finished

Thanks and best regards,
Alex

@thing-it-agd

Did you captured any OS level stats [Vmstat, IOSTAT, MPSTAT] around the issue duration or any similar details or graphs where we can check the patterns ?

Can you please share the Kernel/OS logs (/var/log/messages) or (dmesg -T > dmesg_t) etc ?

It appears this node is monitored via PMM also. Did you seen any abnormal spikes there for OS or database ?

Just to have a quick look over, can you please share the PBM configuration details as well ?

shell> pbm config
shell> pbm version

@anil.joshi , unfortunately I neither have any other OS level stats nor logs, I will try to capture them when the problem appears again.

This is what I can see in PMM:


Around 8am on 2024/01/22 was when I restarted the pbm process.

PBM config:

$ pbm config
pitr:
  enabled: false
  oplogSpanMin: 0
  compression: s2
storage:
  type: s3
  s3:
    provider: aws
    region: eu-central-1
    endpointUrl: s3.eu-central-1.amazonaws.com
    forcePathStyle: true
    bucket: my-bucket
    prefix: my-prefix
    credentials:
      access-key-id: '***'
      secret-access-key: '***'
    maxUploadParts: 10000
    storageClass: STANDARD
    insecureSkipTLSVerify: false
backup:
  compression: s2

PBM version:

$ pbm version
Version:   2.3.0
Platform:  linux/amd64
GitCommit: 3b1c2e263901cf041c6b83547f6f28ac2879911f
GitBranch: release-2.3.0
BuildTime: 2023-09-20_14:42_UTC
GoVersion: go1.19

@thing-it-agd

Thanks for confirming the details.

Indeed the CPU usage were high on 21st Jan(08:00) as well and it looks like the saturation again started on 22Jan (00:00) and there continues until you restarted the PBM.

Did you observed such spike on other time duration as well when no restoration activity is performed ? How often you see such CPU spike with respect to PBM process ?

If you face the issue next time , please gather the TOP and other OS related statistics for review.

@anil.joshi , no, such spikes do not occur when no restoration is running. In general, this particular DB system is normally not exposed to heavy load as it’s only a mirror of the production system used for testing.

I will try to gather more data next time it occurs, but until now the spike didn’t re-appear. For me, it would be ok to close this thread, if the problem is not reproducible within the next 2 weeks.