[Bucardo-general] support for loss of connectivity to slave server/slave db going down

Omar Mehmood omarmehmood at yahoo.com
Wed Jan 20 18:50:06 UTC 2010


> > I have 2 master servers and 1 slave server.  The master servers have
> ...
> > that have not been replicated (300k+ rows).  Now, I restart the
> > postgresql-8.4 service on the slave and restart the Bucardo processes
> > on the master servers.  I expected all the rows that haven't yet been
> > replicated to the slave to be replicated over.  However, after a
> > minute or so, the "catch-up" is done and there are rows that haven't
> > been replicated: 708586 across the 2 master servers but only 706191
> > on the slave server. What happened to the other 2395 rows?
> 
> Good question! Have you set up makedelta, so that changes from one master
> to the other are recorded in the bucardo_delta table and thus ready to
> go to the slave? If not, that's the problem.

The master servers are not set up to replicate data to each other (on purpose). Each has approximately 350k rows in the single table and runs its own Bucardo instance (on the master server itself) that replicates to the single slave.  In the test scenario, when I restart the slave server and the Bucardo processes on each master, approximately 300k un-replicated rows (roughly 150k from each master) are correctly replicated over, but the counts don't match up exactly: a small number of rows (2395 in this example) never make it to the slave.
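
As a rough diagnostic on each master, I have been comparing bucardo_delta against bucardo_track to see whether the missing rows were ever recorded as delivered to the slave. This is only a sketch and assumes the stock Bucardo support tables in the bucardo schema of the master database; "masterdb" and "slavedb" are placeholders for my real database and target names:

    psql -d masterdb -c "
        SELECT d.tablename::regclass AS tab, count(*) AS pending_deltas
        FROM   bucardo.bucardo_delta d
        WHERE  NOT EXISTS (
                 SELECT 1
                 FROM   bucardo.bucardo_track t
                 WHERE  t.txntime  = d.txntime
                 AND    t.targetdb = 'slavedb')
        GROUP  BY 1;"

If that comes back as zero pending deltas on both masters while the slave is still short 2395 rows, then those changes were apparently marked as tracked without ever arriving, which would at least narrow down where to look.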

> > As a side question, other than having some sort of cron job on each
> > master server that periodically checks for the Bucardo process and
> > restarts it if it is dead, is there a more graceful way of supporting
> > loss of connectivity/issues with the Bucardo process?
> 
> Bucardo is set up to restart itself by default, so if it's not, that's a
> bug we need to address.

In the test scenario, if I shut down the slave server's DBMS, the Bucardo processes die (the Bucardo instances run on the master servers).  Here is the log.bucardo output from one of those instances at the moment the slave DBMS is shut down; a sketch of the cron workaround I mentioned above follows the log:

[Wed Jan 20 18:28:21 2010]  MCP Warning: Killed (line 890): Ping failed for remote database server
[Wed Jan 20 18:28:21 2010]  MCP Database problem, will respawn after a short sleep: 15
[Wed Jan 20 18:28:21 2010]  MCP Final database backend PID is 15790
[Wed Jan 20 18:28:21 2010]  MCP Attempting to kill PID 12388
[Wed Jan 20 18:28:21 2010]  MCP Sending signal 15 to pid 12388
[Wed Jan 20 18:28:21 2010]  MCP Successfully signalled pid 12388
[Wed Jan 20 18:28:21 2010]  MCP Attempting to kill PID 12390
[Wed Jan 20 18:28:21 2010]  MCP Sending signal 15 to pid 12390
[Wed Jan 20 18:28:21 2010]  MCP Successfully signalled pid 12390
[Wed Jan 20 18:28:21 2010]  MCP End of cleanup_mcp. Sys time: Wed Jan 20 18:28:21 2010. Database time: 2010-01-20 18:28:21.499826+00
[Wed Jan 20 18:28:21 2010]  MCP Removed pid file "/var/run/bucardo/bucardo.mcp.pid"
[Wed Jan 20 18:28:21 2010]  KID Final database backend PID is 15793
[Wed Jan 20 18:28:21 2010]  KID Kid exiting at cleanup_kid. Reason: Caught a SIGTERM at /usr/local/share/perl/5.10.0/Bucardo.pm line 4236
[Wed Jan 20 18:28:21 2010]  KID Removed pid file "/var/run/bucardo/bucardo.kid.sync.mobile1_server.server.pid"
[Wed Jan 20 18:28:21 2010]  CTL Controller exiting at cleanup_controller. Reason: Caught a SIGTERM at /usr/local/share/perl/5.10.0/Bucardo.pm line 3065
[Wed Jan 20 18:28:21 2010]  CTL Removed pid file "/var/run/bucardo/bucardo.ctl.sync.mobile1_server.pid"
[Wed Jan 20 18:28:36 2010]  MCP Respawn attempt: /usr/local/bin/bucardo_ctl start "Attempting automatic respawn after MCP death"
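
For the cron question above, the stopgap I have in mind is nothing fancier than checking the MCP pid file that appears in the log and calling bucardo_ctl start if that process is gone. A rough sketch, with the script path and the 5-minute interval as placeholders:

    # root crontab entry
    */5 * * * * /usr/local/sbin/check_bucardo.sh

    # /usr/local/sbin/check_bucardo.sh (hypothetical helper script)
    #!/bin/sh
    PIDFILE=/var/run/bucardo/bucardo.mcp.pid
    if [ ! -f "$PIDFILE" ] || ! kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        /usr/local/bin/bucardo_ctl start "Restarted by cron watchdog"
    fi

Obviously that is no substitute for the built-in respawn, so I would still like to understand why the processes end up dead despite the respawn attempt shown at the end of the log.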

Omar


      


More information about the Bucardo-general mailing list