[Bucardo-general] Zombies Created when Testing One Node Offline
Mike Tonks
michael.tonks at headforwards.com
Wed May 8 15:41:55 UTC 2013
Hi,
I'm new to this list so first I will say hello! I'm a new bucardo user
and so far I'm enjoying using bucardo a lot. Thanks!
I have two issues and the debugging output is rather long, so I'll keep
them in separate mails. Any help is much appreciated.
I'm testing with Bucardo 4.99.7 on Ubuntu 13.04 with Postgres 9.1.9.
I have 4 source databases and one target, with a very simple test
schema, and am testing the scenario when one db goes offline.
For now, I test this using:
- sudo /etc/init.d/postgres stop # (kill the pg server)
Let's start with bucardo running:
PID of Bucardo MCP: 10652
Name        State    Last good    Time    Last I/D    Last bad    Time
==========+========+============+=======+===========+===========+=======
testsync  | Good   | 15:58:21   | 2s    | 0/4       | 15:49:22  | 9m 1s
And here's my process tree using ps auxf:
root 10652 0.0 0.1 146020 20856 ? S 15:49 0:00 Bucardo Master Control Program v4.99.7. Active syncs: testsync
root 10662 0.0 0.1 147540 21936 ? S 15:49 0:00 \_ Bucardo VAC.
root 10668 0.0 0.1 147564 22048 ? S 15:49 0:00 \_ Bucardo Controller. Sync "testsync" for relgroup "testherd" to dbs "testgroup"
root 10683 0.0 0.1 149376 23688 ? S 15:49 0:00 \_ Bucardo Kid. Sync "testsync"
Now I stop one node, and insert a row on another node to intentionally
break the replication:
Name        State    Last good    Time    Last I/D    Last bad    Time
==========+========+============+=======+===========+===========+=======
testsync  | Bad    | 15:58:21   | 3m 6s | 0/4       | 16:01:25  | 2s
Now my process tree looks a bit funny:
root 10662 0.0 0.1 147540 21936 ? S 15:49 0:00 Bucardo VAC.
root 10668 0.0 0.1 147564 22072 ? S 15:49 0:00 Bucardo Controller. Sync "testsync" for relgroup "testherd" to dbs "testgroup"
root 10965 0.0 0.1 145260 20352 ? S 16:02 0:00 Bucardo Master Control Program v4.99.7.
In the log file I see it respawns the child process every 15 seconds,
with a sensible error message:
(11061) [Wed May 8 16:03:48 2013] KID Kid 11061 exiting at cleanup_kid.
Sync "testsync" Reason: DBI
connect('dbname=testa;host=192.168.97.93','bucardo',...) failed: could
not connect to server: Connection refused Is the server running on
host "192.168.97.93" and accepting TCP/IP connections on port 5432?
at /usr/local/share/perl/5.14.2/Bucardo.pm line 4941. Line: 2718
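To count how often the kid dies and why, an entry like the one above can be pulled apart with a short script. This is only a rough sketch: the regular expression is inferred from this single log entry, not from any documented Bucardo log format, and `parse_kid_exit` is a name made up for illustration.

```python
import re

# Pattern inferred from the single kid-exit entry quoted above.
# ASSUMPTION: other Bucardo log entries may be laid out differently.
KID_EXIT = re.compile(
    r"\((?P<pid>\d+)\)\s+\[(?P<when>[^\]]+)\]\s+KID\s+Kid\s+\d+\s+exiting"
    r".*?Reason:\s*(?P<reason>.*)",
    re.DOTALL,
)


def parse_kid_exit(entry):
    """Return (pid, timestamp, reason) for a kid-exit entry, or None."""
    match = KID_EXIT.search(entry)
    if match is None:
        return None
    # Collapse the line wrapping inside the reason into single spaces.
    reason = " ".join(match.group("reason").split())
    return int(match.group("pid")), match.group("when"), reason


# The entry from the log above, cut off after "Connection refused".
entry = """(11061) [Wed May 8 16:03:48 2013] KID Kid 11061 exiting at cleanup_kid.
Sync "testsync" Reason: DBI
connect('dbname=testa;host=192.168.97.93','bucardo',...) failed: could
not connect to server: Connection refused"""

print(parse_kid_exit(entry))
```

Running it over the whole log file and grouping by reason would show whether every respawn fails on the same connection error.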
So far so good, so let's restart the offline node. Replication catches
up nice and quickly.
But now I have a zombie VAC process (PID 10662, left over from the first MCP)?
Name        State    Last good    Time    Last I/D    Last bad    Time
==========+========+============+=======+===========+===========+========
testsync  | Good   | 16:04:54   | 13s   | 0/4       | 16:01:25  | 3m 42s
root 10662 0.0 0.1 147540 21936 ? S 15:49 0:00 Bucardo VAC.
root 11148 0.1 0.1 146020 20852 ? S 16:05 0:00 Bucardo Master Control Program v4.99.7. Active syncs: testsync
root 11159 0.0 0.1 147540 21952 ? S 16:05 0:00 \_ Bucardo VAC.
root 11165 0.0 0.1 147564 22052 ? S 16:05 0:00 \_ Bucardo Controller. Sync "testsync" for relgroup "testherd" to dbs "testgroup"
root 11180 0.0 0.1 149228 23276 ? S 16:05 0:00 \_ Bucardo Kid. Sync "testsync"
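Strictly speaking the leftover VAC is in state S (sleeping), so it looks like an orphaned process that is still running, not a defunct one (state Z). A rough sketch for spotting such leftovers automatically follows. It assumes the output of `ps -eo pid,ppid,args`, since `ps auxf` does not print parent PIDs. It also assumes that the MCP itself is a child of init (PID 1) and that orphaned helpers get reparented to PID 1. The PPID values in the sample are assumed for illustration, not taken from my system, and `find_orphaned_bucardo` is a made-up name.

```python
def find_orphaned_bucardo(ps_text):
    """Return PIDs of Bucardo helper processes whose parent is PID 1.

    Expects the output of `ps -eo pid,ppid,args`; a header line is skipped.
    """
    orphans = []
    for line in ps_text.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue  # header or malformed line
        pid, ppid, args = int(fields[0]), int(fields[1]), fields[2]
        if not args.startswith("Bucardo"):
            continue
        # ASSUMPTION: only the MCP should have init as its parent.
        is_mcp = args.startswith("Bucardo Master Control Program")
        if ppid == 1 and not is_mcp:
            orphans.append(pid)
    return orphans


# PIDs and command names from the listing above; the PPID column is ASSUMED.
SAMPLE = """\
  PID  PPID COMMAND
10662     1 Bucardo VAC.
11148     1 Bucardo Master Control Program v4.99.7. Active syncs: testsync
11159 11148 Bucardo VAC.
11165 11148 Bucardo Controller. Sync "testsync" for relgroup "testherd" to dbs "testgroup"
11180 11165 Bucardo Kid. Sync "testsync"
"""

print(find_orphaned_bucardo(SAMPLE))  # [10662]
```

On a system where orphans are adopted by a process other than PID 1, the `ppid == 1` check would need adjusting.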
I'll now repeat the test, downing the node again:
Name        State    Last good    Time    Last I/D    Last bad    Time
==========+========+============+=======+===========+===========+=======
testsync  | Bad    | 16:07:19   | 26s   | 0/4       | 16:07:41  | 4s
root 10662 0.0 0.1 147540 21936 ? S 15:49 0:00 Bucardo VAC.
root 11148 0.0 0.1 146020 21008 ? S 16:05 0:00 Bucardo Master Control Program v4.99.7. Active syncs: testsync
root 11159 0.0 0.1 147540 21952 ? S 16:05 0:00 \_ Bucardo VAC.
root 11165 0.0 0.1 147564 22076 ? S 16:05 0:00 \_ Bucardo Controller. Sync "testsync" for relgroup "testherd" to dbs "testgroup"
And after restarting it, a second VAC (PID 11159) is left behind:
Name        State    Last good    Time    Last I/D    Last bad    Time
==========+========+============+=======+===========+===========+========
testsync  | Good   | 16:08:58   | 8s    | 0/8       | 16:07:41  | 1m 25s
root 10662 0.0 0.1 147540 21936 ? S 15:49 0:00 Bucardo VAC.
root 11159 0.0 0.1 147540 21952 ? S 16:05 0:00 Bucardo VAC.
root 11303 0.4 0.1 146020 20852 ? S 16:09 0:00 Bucardo Master Control Program v4.99.7. Active syncs: testsync
root 11318 0.1 0.1 147540 21940 ? S 16:09 0:00 \_ Bucardo VAC.
root 11324 0.3 0.1 147564 22048 ? S 16:09 0:00 \_ Bucardo Controller. Sync "testsync" for relgroup "testherd" to dbs "testgroup"
root 11342 0.6 0.1 149228 23276 ? S 16:09 0:00 \_ Bucardo Kid. Sync "testsync"