[Bucardo-general] support for loss of connectivity to slave server/slave db going down

Wed Jan 20 02:02:15 UTC 2010

Hi,

I have 2 master servers and 1 slave server.  Each master server has 1 table whose primary key column is filled from a sequence, and the two sequences use distinct ranges.  There is a separate Bucardo setup on each of the master servers, connected 1:1 with the same slave server.  I have a script which inserts rows into the single table on each server.  After a certain period of time (e.g. 5 minutes), I stop the scripts and check the row counts on the master and slave servers: all the data is replicated correctly (all 100k+ rows).  Note that the master and slave servers maintained connectivity throughout the entire test.

Now, I set up the same test, but this time I stop the postgresql-8.4 service on the slave after 5 minutes and let the scripts continue to execute on the master servers for another 5 minutes.  The Bucardo bucardo.warning.log file indicates that the Kids on the respective master servers have died (expected behavior when the slave db goes down).  I stop the scripts and perform a row count across the master and slave servers.  As expected, a large number of records have not been replicated (300k+ rows).  Now, I restart the postgresql-8.4 service on the slave and restart the Bucardo processes on the master servers.  I expected all the rows that hadn't yet been replicated to the slave to be replicated over.  However, after a minute or so, the "catch-up" is done and some rows are still missing: there are 708586 rows across the 2 master servers but only 706191 on the slave server.  What happened to the other 2395 rows?  I would expect that any uncommitted transactions by the Bucardo process (its Kids?) would be rolled back when the connection to the slave db dies, and that those rows would be replicated over when the slave db and Bucardo process are restarted (since the data should still be available and not deleted from the delta table).
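
To narrow down which rows went missing, rather than only how many, the primary keys on each side can be compared.  The following is a minimal Python sketch, not anything Bucardo provides: the function names are made up for illustration, and it assumes the keys have already been fetched from each database (e.g. with a plain SELECT of the primary key column) into ordinary lists of integers.  Grouping the missing keys into consecutive runs may show whether they cluster around the moment the slave went down.

```python
from typing import Iterable, List, Tuple


def missing_on_slave(master_keys: Iterable[int], slave_keys: Iterable[int]) -> List[int]:
    """Return the sorted primary keys present on a master but absent on the slave."""
    slave_set = set(slave_keys)
    return sorted(k for k in set(master_keys) if k not in slave_set)


def group_into_ranges(sorted_keys: List[int]) -> List[Tuple[int, int]]:
    """Collapse sorted keys into (first, last) runs of consecutive values."""
    ranges: List[Tuple[int, int]] = []
    for key in sorted_keys:
        if ranges and key == ranges[-1][1] + 1:
            # Key continues the current run: extend its upper bound.
            ranges[-1] = (ranges[-1][0], key)
        else:
            # Gap found: start a new run.
            ranges.append((key, key))
    return ranges


# Hypothetical key lists standing in for the real query results.
master = [1, 2, 3, 4, 5, 10]
slave = [1, 2, 5]
gaps = group_into_ranges(missing_on_slave(master, slave))
```

With the counts reported above, the total number of missing keys across both masters should come to 708586 - 706191 = 2395.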

As a side question, other than having some sort of cron job on each master server that periodically checks for the Bucardo process and restarts it if it is dead, is there a more graceful way of handling loss of connectivity or other problems with the Bucardo process?
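
For reference, the cron-job approach mentioned above could look roughly like the Python sketch below.  It is only an illustration under several assumptions: it targets POSIX systems, it assumes the process writes a PID file containing a bare process ID, and both the PID file path and the restart action are supplied by the caller, since neither is specified here.  All function names are hypothetical.

```python
import os
from typing import Callable, Optional


def read_pid(pid_file: str) -> Optional[int]:
    """Read a process ID from pid_file; return None if missing or malformed."""
    try:
        with open(pid_file) as fh:
            return int(fh.read().strip())
    except (OSError, ValueError):
        return None


def is_running(pid: Optional[int]) -> bool:
    """Check whether a process with this PID exists (POSIX only)."""
    if pid is None or pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def ensure_running(pid_file: str, restart: Callable[[], None]) -> bool:
    """Call restart() if the process is not alive; return True if it was called."""
    if is_running(read_pid(pid_file)):
        return False
    restart()
    return True
```

From cron, `restart` would wrap whatever command is normally used to start the replication daemon, for example via `subprocess.run`.  Note that this only detects a dead process; it does not notice a process that is alive but has lost its connection to the slave.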

Omar