[Bucardo-general] Zombies Created when Testing One Node Offline

Mike Tonks michael.tonks at headforwards.com
Wed May 8 15:41:55 UTC 2013


Hi,

I'm new to this list so first I will say hello!  I'm a new bucardo user 
and so far I'm enjoying using bucardo a lot.  Thanks!

I have two issues and the debugging is rather long, so I'll keep them in 
separate mails.  Any help much appreciated.


I'm testing with Bucardo 4.99.7 on Ubuntu 13.04 with Postgres 9.1.9.

I have 4 source databases and one target, with a very simple test 
schema, and am testing the scenario when one db goes offline.

For now, I test this using:

  - sudo /etc/init.d/postgres stop # (kill the pg server)

Let's start with bucardo running:

PID of Bucardo MCP: 10652
  Name       State    Last good    Time    Last I/D    Last bad    Time
==========+========+============+=======+===========+===========+=======
  testsync | Good   | 15:58:21   | 2s    | 0/4       | 15:49:22  | 9m 1s


And here's my process tree using ps auxf:

root     10652  0.0  0.1 146020 20856 ?        S    15:49   0:00 Bucardo Master Control Program v4.99.7. Active syncs: testsync
root     10662  0.0  0.1 147540 21936 ?        S    15:49   0:00  \_ Bucardo VAC.
root     10668  0.0  0.1 147564 22048 ?        S    15:49   0:00  \_ Bucardo Controller. Sync "testsync" for relgroup "testherd" to dbs "testgroup"
root     10683  0.0  0.1 149376 23688 ?        S    15:49   0:00      \_ Bucardo Kid. Sync "testsync"


Now I stop one node, and insert a row on another node to intentionally 
kill the replication:


  Name       State    Last good    Time    Last I/D    Last bad    Time
==========+========+============+=======+===========+===========+=======
  testsync | Bad    | 15:58:21   | 3m 6s | 0/4       | 16:01:25  | 2s


Now my process tree looks a bit funny:

root     10662  0.0  0.1 147540 21936 ?        S    15:49   0:00 Bucardo VAC.
root     10668  0.0  0.1 147564 22072 ?        S    15:49   0:00 Bucardo Controller. Sync "testsync" for relgroup "testherd" to dbs "testgroup"
root     10965  0.0  0.1 145260 20352 ?        S    16:02   0:00 Bucardo Master Control Program v4.99.7.

In the log file I see it respawns the child process every 15 seconds, 
with a sensible error message:

(11061) [Wed May  8 16:03:48 2013] KID Kid 11061 exiting at cleanup_kid. 
Sync "testsync" Reason: DBI 
connect('dbname=testa;host=192.168.97.93','bucardo',...) failed: could 
not connect to server: Connection refused     Is the server running on 
host "192.168.97.93" and accepting     TCP/IP connections on port 5432? 
at /usr/local/share/perl/5.14.2/Bucardo.pm line 4941. Line: 2718
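
To check the respawn interval without eyeballing the log, something like the following rough sketch could be used. The pattern is based only on the single log line quoted above, so treat the format as an assumption; the names kid_exit_times and respawn_intervals are made up for illustration and are not part of Bucardo.

```python
import re
from datetime import datetime

# Pattern derived only from the one log line quoted above (assumption):
# "(PID) [timestamp] KID Kid PID exiting at cleanup_kid."
KID_EXIT = re.compile(
    r"^\((\d+)\) \[([^\]]+)\] KID Kid \d+ exiting at cleanup_kid\.",
    re.MULTILINE,
)


def kid_exit_times(log_text):
    """Return (pid, timestamp) for every KID exit line found in log_text."""
    exits = []
    for match in KID_EXIT.finditer(log_text):
        # Collapse the padded day of month ("May  8") before parsing.
        stamp = " ".join(match.group(2).split())
        when = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
        exits.append((int(match.group(1)), when))
    return exits


def respawn_intervals(exits):
    """Seconds between consecutive KID exits, in log order."""
    times = [when for _pid, when in exits]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]
```

Feeding the contents of the Bucardo log file to kid_exit_times and then to respawn_intervals should give a list of gaps in seconds, which can be compared against the roughly 15 seconds seen here.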

So far so good, so let's restart the offline node.  Replication catches 
up quickly.

But now I have a zombie VAC process (PID 10662, left over from the 
original MCP)?

  Name       State    Last good    Time    Last I/D    Last bad    Time
==========+========+============+=======+===========+===========+========
  testsync | Good   | 16:04:54   | 13s   | 0/4       | 16:01:25  | 3m 42s

root     10662  0.0  0.1 147540 21936 ?        S    15:49   0:00 Bucardo VAC.
root     11148  0.1  0.1 146020 20852 ?        S    16:05   0:00 Bucardo Master Control Program v4.99.7. Active syncs: testsync
root     11159  0.0  0.1 147540 21952 ?        S    16:05   0:00  \_ Bucardo VAC.
root     11165  0.0  0.1 147564 22052 ?        S    16:05   0:00  \_ Bucardo Controller. Sync "testsync" for relgroup "testherd" to dbs "testgroup"
root     11180  0.0  0.1 149228 23276 ?        S    16:05   0:00      \_ Bucardo Kid. Sync "testsync"


I'll now repeat the test, downing the node again:

  Name       State    Last good    Time    Last I/D    Last bad    Time
==========+========+============+=======+===========+===========+=======
  testsync | Bad    | 16:07:19   | 26s   | 0/4       | 16:07:41  | 4s

root     10662  0.0  0.1 147540 21936 ?        S    15:49   0:00 Bucardo VAC.
root     11148  0.0  0.1 146020 21008 ?        S    16:05   0:00 Bucardo Master Control Program v4.99.7. Active syncs: testsync
root     11159  0.0  0.1 147540 21952 ?        S    16:05   0:00  \_ Bucardo VAC.
root     11165  0.0  0.1 147564 22076 ?        S    16:05   0:00  \_ Bucardo Controller. Sync "testsync" for relgroup "testherd" to dbs "testgroup"


After restarting it, there are now two leftover VAC processes (10662 and 
11159):

  Name       State    Last good    Time    Last I/D    Last bad    Time
==========+========+============+=======+===========+===========+========
  testsync | Good   | 16:08:58   | 8s    | 0/8       | 16:07:41  | 1m 25s

root     10662  0.0  0.1 147540 21936 ?        S    15:49   0:00 Bucardo VAC.
root     11159  0.0  0.1 147540 21952 ?        S    16:05   0:00 Bucardo VAC.
root     11303  0.4  0.1 146020 20852 ?        S    16:09   0:00 Bucardo Master Control Program v4.99.7. Active syncs: testsync
root     11318  0.1  0.1 147540 21940 ?        S    16:09   0:00  \_ Bucardo VAC.
root     11324  0.3  0.1 147564 22048 ?        S    16:09   0:00  \_ Bucardo Controller. Sync "testsync" for relgroup "testherd" to dbs "testgroup"
root     11342  0.6  0.1 149228 23276 ?        S    16:09   0:00      \_ Bucardo Kid. Sync "testsync"
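
For anyone trying to reproduce this, the leftover processes can be picked out of the process table with a small script. This is only a sketch under a few assumptions: process titles start with "Bucardo " as in the ps output above, a leftover process is one whose parent is no longer a Bucardo process, and the helper names (read_process_table, parse_ps, orphaned_bucardo) are invented for illustration.

```python
import subprocess


def read_process_table():
    """Return 'pid ppid args' lines for all processes, without a header."""
    result = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "ppid=", "-o", "args="],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_ps(text):
    """Parse 'pid ppid args' lines into (pid, ppid, args) tuples."""
    procs = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue  # skip blank or malformed lines
        procs.append((int(parts[0]), int(parts[1]), parts[2]))
    return procs


def orphaned_bucardo(procs):
    """Return (pid, args) for Bucardo helpers whose parent is not Bucardo.

    The MCP itself is skipped, since it is expected to sit at the top
    of the tree.
    """
    bucardo_pids = {pid for pid, _ppid, args in procs
                    if args.startswith("Bucardo ")}
    orphans = []
    for pid, ppid, args in procs:
        if pid not in bucardo_pids:
            continue
        if args.startswith("Bucardo Master Control Program"):
            continue
        if ppid not in bucardo_pids:
            orphans.append((pid, args))
    return orphans
```

Calling orphaned_bucardo(parse_ps(read_process_table())) after each stop/restart cycle should list any VAC or Controller process that has lost its MCP parent.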