Hi,
I'm currently testing replication of a MySQL database. MySQL is version 5.5.49, the server runs SLES 11 SP4, and I have installed percona-toolkit 2.2.16-1.
As part of that I'm testing pt-heartbeat. I have a script which starts pt-heartbeat and then checks the replication delay:
sunhb65278:~ # cat /root/skripte/heartbeat.sh
#!/bin/bash
pidof perl /usr/bin/pt-heartbeat > /dev/null
rueck=$?
if [ $rueck -ne 0 ]; then
pt-heartbeat -h 127.0.0.1 --user checksum --password checksum --update --database percona --daemonize
sleep 5
fi
temp=$(pt-heartbeat -h sunhb58820-2 --user checksum --password checksum --check --database percona --master-server-id 10352397)
# digits after the decimal point
diff1=${temp#*.}
# digits before the decimal point
diff2=${temp%.*}
if [ "$diff1" -gt 0 ]; then
mail -s "pt-heartbeat on $HOSTNAME failed" bernd.lentes@helmholtz-muenchen.de << EOT
Warning! Slave is $temp seconds behind the master!
EOT
exit
fi
if [ "$diff2" -gt 0 ]; then
mail -s "pt-heartbeat on $HOSTNAME failed" bernd.lentes@helmholtz-muenchen.de << EOT
Warning! Slave is $temp seconds behind the master!
EOT
exit
fi
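The two nearly identical tests at the end of the script could probably be folded into one. The following is only a minimal sketch: lag_nonzero is a hypothetical helper name, and it assumes that the check prints a plain decimal such as "0.00", which is what the splitting at the dot in the script suggests.

```shell
#!/bin/bash
# Hypothetical helper: succeeds (exit 0) if the delay string is greater than zero.
# Assumes input like "0.00" or "1.25", as the splitting in the script suggests.
lag_nonzero() {
    local delay=$1
    # Reject empty or non-numeric input, e.g. when the check itself failed.
    [[ $delay =~ ^[0-9]+(\.[0-9]+)?$ ]] || return 2
    local whole=${delay%.*}   # digits before the decimal point
    local frac=${delay#*.}    # digits after the decimal point
    # 10# forces base 10 so that a value such as "08" is not parsed as octal.
    (( 10#$whole > 0 || 10#$frac > 0 ))
}

# Usage sketch with a sample value instead of a live pt-heartbeat call:
temp="0.08"
if lag_nonzero "$temp"; then
    echo "Slave is $temp seconds behind the master"
fi
```

The return code 2 for unparsable input keeps an empty result (for example when the slave is unreachable) apart from a real delay.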
The script is called by cron every minute.
As you can see, the script first checks whether pt-heartbeat is already running and starts it if it is not.
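One caveat about that check: as far as I understand pidof, it treats "perl" and "/usr/bin/pt-heartbeat" as two separate program names, so the test may also succeed when some unrelated perl process is running. A possible alternative is to match the full command line with pgrep -f. This is only a sketch; daemon_running is a hypothetical helper name and the pattern in the comment is an assumption about how the daemon shows up in the process list.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if any process has a command line matching
# the given extended regular expression (pgrep -f matches the full command line).
daemon_running() {
    pgrep -f -- "$1" > /dev/null
}

# Intended use in the cron script (left as a comment so the sketch has no side effects):
#   daemon_running 'pt-heartbeat .*--update' || \
#       pt-heartbeat -h 127.0.0.1 --user checksum --password checksum \
#           --update --database percona --daemonize
```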
When I watch the processes, this happens:
TIME:15:51:01
root 31532 0.0 0.0 4552 548 pts/1 S+ 15:51 0:00 grep heartbeat
TIME:15:51:02
root 31535 0.0 0.0 11320 1400 ? Ss 15:51 0:00 /bin/bash /root/skripte/heartbeat.sh
root 31539 0.0 0.0 76732 15008 ? Ss 15:51 0:00 perl /usr/bin/pt-heartbeat -h 127.0.0.1 --user checksum --password checksum --update --database percona --daemonize
root 31546 0.0 0.0 4552 544 pts/1 S+ 15:51 0:00 grep heartbeat
TIME:15:51:03
root 31535 0.0 0.0 11320 1400 ? Ss 15:51 0:00 /bin/bash /root/skripte/heartbeat.sh
root 31539 0.0 0.0 76732 15312 ? Ss 15:51 0:00 perl /usr/bin/pt-heartbeat -h 127.0.0.1 --user checksum --password checksum --update --database percona --daemonize
root 31553 0.0 0.0 4552 544 pts/1 S+ 15:51 0:00 grep heartbeat
TIME:15:51:04
root 31535 0.0 0.0 11320 1400 ? Ss 15:51 0:00 /bin/bash /root/skripte/heartbeat.sh
root 31539 0.0 0.0 76732 15312 ? Ss 15:51 0:00 perl /usr/bin/pt-heartbeat -h 127.0.0.1 --user checksum --password checksum --update --database percona --daemonize
root 31560 0.0 0.0 4552 548 pts/1 S+ 15:51 0:00 grep heartbeat
TIME:15:51:05
root 31535 0.0 0.0 11320 1400 ? Ss 15:51 0:00 /bin/bash /root/skripte/heartbeat.sh
root 31539 0.0 0.0 76732 15312 ? Ss 15:51 0:00 perl /usr/bin/pt-heartbeat -h 127.0.0.1 --user checksum --password checksum --update --database percona --daemonize
root 31567 0.0 0.0 4552 548 pts/1 S+ 15:51 0:00 grep heartbeat
TIME:15:51:06
root 31535 0.0 0.0 11320 1400 ? Ss 15:51 0:00 /bin/bash /root/skripte/heartbeat.sh
root 31539 0.0 0.0 76732 15312 ? Ss 15:51 0:00 perl /usr/bin/pt-heartbeat -h 127.0.0.1 --user checksum --password checksum --update --database percona --daemonize
root 31574 0.0 0.0 4552 548 pts/1 S+ 15:51 0:00 grep heartbeat
TIME:15:51:07
root 31535 0.0 0.0 11320 1400 ? Ss 15:51 0:00 /bin/bash /root/skripte/heartbeat.sh
root 31539 0.0 0.0 76732 15312 ? Ss 15:51 0:00 perl /usr/bin/pt-heartbeat -h 127.0.0.1 --user checksum --password checksum --update --database percona --daemonize
root 31576 33.3 0.0 83044 17924 ? S 15:51 0:00 perl /usr/bin/pt-heartbeat -h sunhb58820-2 --user checksum --password checksum --check --database percona --master-server-id 10352397
root 31582 0.0 0.0 4552 548 pts/1 S+ 15:51 0:00 grep heartbeat
TIME:15:51:08
root 31535 0.0 0.0 0 0 ? Zs 15:51 0:00 [heartbeat.sh] <defunct>
root 31539 0.0 0.0 76732 15312 ? Ss 15:51 0:00 perl /usr/bin/pt-heartbeat -h 127.0.0.1 --user checksum --password checksum --update --database percona --daemonize
root 31589 0.0 0.0 4552 548 pts/1 S+ 15:51 0:00 grep heartbeat
TIME:15:51:09
root 31535 0.0 0.0 0 0 ? Zs 15:51 0:00 [heartbeat.sh] <defunct>
root 31539 0.0 0.0 76732 15312 ? Ss 15:51 0:00 perl /usr/bin/pt-heartbeat -h 127.0.0.1 --user checksum --password checksum --update --database percona --daemonize
root 31596 0.0 0.0 4552 544 pts/1 S+ 15:51 0:00 grep heartbeat
At first no process is running. Then cron starts the script. Why does the script /root/skripte/heartbeat.sh (PID 31535) become a zombie at 15:51:08?
Do you have any idea?
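To narrow this down it may help to see which process is the parent of the defunct script, since a zombie stays in the process table until its parent has collected its exit status. A minimal sketch, assuming a Linux /proc filesystem; proc_info is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: print the state letter and parent PID of a process,
# read from /proc/<pid>/status (Linux only). Returns 1 if the PID does not exist.
proc_info() {
    local pid=$1 key value state="" ppid=""
    [ -r "/proc/$pid/status" ] || return 1
    # Lines look like "State:<TAB>Z (zombie)" and "PPid:<TAB>4711".
    while IFS=$':\t' read -r key value; do
        case $key in
            State) state=${value%% *} ;;   # keep only the letter, e.g. Z
            PPid)  ppid=$value ;;
        esac
    done < "/proc/$pid/status"
    echo "pid=$pid state=$state ppid=$ppid"
}

# Example: inspect the current shell; for the case above one would pass 31535.
proc_info "$$"
```

If the reported ppid belongs to cron, the entry should disappear as soon as cron waits for the job.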
Thanks.
Bernd