[Bucardo-general] Concurrency errors when replicating a master to two slaves and installation experiences

Thu Oct 28 11:53:59 UTC 2010

Hi List,

I rolled out a bucardo installation today to replicate a master 
database to two slaves and think I've discovered some bugs/strange 
behaviour in the process.

I'm using Centos Linux 5.5 64 bit with:
        bucardo-4.4.0-2.el5 (from epel)
        perl-5.8.8-32.el5_5.2
        perl-DBIx-Safe-1.2.5-6.el5
        perl-DBD-Pg-2.17.1-1.el5.rf
        perl-DBI-1.52-2.el5
        postgresql-server-8.4.4-2PGDG.el5 (from pgdg)
        postgresql-plperl-8.4.4-2PGDG.el5 (from pgdg)

Setup:
~~~~~~
First off, I had some problems setting things up - these might help 
others who are stuck:
1) Reading password from ~/.pgpass will fail if SELINUX is in 
enforcing mode with it's "permission denied" message.  I had to 
disable SELINUX to allow bucardo to read the file and connect to 
databases
2) There seems to be an undocumented ~20 character limit to the 
length of a sync name.  If I create a sync with a name 25 characters 
long, it will be accepted by bucardo_ctl and even perform an initial
sync but future updates will fail with key constraint errors like 
this:
        [Thu Oct 21 17:53:33 2010]  CTL Controller exiting at 
        cleanup_controller. Reason: DBD::Pg::st execute failed: ERROR
        :  insert or update on table "q" violates foreign key 
        constraint "q_tdb_fk"
        DETAIL:  Key (targetdb)=(jefferson_a) is not present in table
        "db". at /usr/lib/perl5/vendor_perl/5.8.8/Bucardo.pm line 
        3235.
The database did exist in the db table but bucardo had truncated the 
name it was searching for for some reason.  To fix the error, I just
had to alter sync to have a name of 20 or less characters. (bug?)

Permissions:
~~~~~~~~~~~~
Once I got past these problems, I managed to get things replicating 
but users other than the superuser were not able to truncate 
replicated tables because of insufficient permissions.  They got 
errors like this:
        ERROR:  permission denied for schema bucardo
        LINE 1: INSERT INTO
        bucardo.bucardo_truncate_trigger(tablename,sname...
Looking at all other triggers, the 
bucardo.bucardo_triggerkick_SYNCNAME one was missing the "SECURITY 
DEFINER" keywords to take on the role of the bucardo user.  Once I 
added this, validated syncs and restarted bucardo I was able to 
truncate tables as normal. 

Concurrency:
~~~~~~~~~~~~
I wanted to replicate a master database to two slaves. I did this by:
1) adding all three databases to bucardo
2) adding all tables and syncs from the master database
3) created 2x syncs, one from master -> slave1 and another from 
master -> slave2

I think this is the way to do it?

Soon after getting this working, I started to get errors in the 
bucardo warning log when the master database was updated:

Truncate error:
On truncating a table on master, I would occasionally get this error:
        [Thu Oct 28 20:36:42 2010]  KID Kid exiting at cleanup_kid. 
        Reason: DBD::Pg::st execute failed: ERROR:  could not 
        serialize access due to concurrent update at 
        /usr/lib/perl5/vendor_perl/5.8.8/Bucardo.pm line 4619.
It looks like there's a race condition between the two sync triggers 
as they each try to execute:
        DELETE FROM bucardo.bucardo_delta WHERE txntime <= now() AND
        tablename = $g->{oid}
Restarting bucardo unblocked things and allowed replication to 
continue.

Sequence error:
Not sure what caused this one.  The database is updated at various
times during they day by processing scripts and people manually
running SQL commands:
        [Thu Oct 28 18:29:59 2010]  KID Kid exiting at cleanup_kid. 
        Reason: DBD::Pg::st execute failed: ERROR:  could not 
        serialize access due to concurrent update at 
        /usr/lib/perl5/vendor_perl/5.8.8/Bucardo.pm line 4420.
The previously executed SQL was:
        $SQL = 'UPDATE bucardo.bucardo_sequences SET value=?, 
        iscalled=? WHERE tablename=?';
I couldn't see a failed SQL statement that would cause this in the 
postgres logs, although there were some errors about failing to 
insert some records with duplicate primary keys.  I tried running 
the same SQL again and I didn't get the error from bucardo.

Restarting bucardo seemed to get things replicating again but I'm 
worried that things will block again next time whatever update 
script caused this runs again.

Status dates in future:
~~~~~~~~~~~~~~~~~~~~~~~
When I do a bucardo_ctl status on any of my syncs, I always get a 
date in the future, eg a moment ago I got:
        ...
        Last good:            ? (time to run: 1s)
        Last good time:       Oct 29, 2010 13:57:20  Target: 
          maplayers_dbfarm0
        Ins/Upd/Del:          0 / 0 / 0
        Last bad:             ? (time to run: 14s)
        Last bad time:        Oct 29, 2010 12:36:45  Target: 
          maplayers_dbfarm0
        ...
System date (unix date command):
        Thu Oct 28 22:29:26 EST 2010
Postgres date ("SELECT CURRENT_TIMESTAMP;" in psql)
        2010-10-28 22:31:51.844271+11
Our servers are in Tasmania, Australia which is currently GMT +10.  I
suspect this has something to do with it but I'm not sure what.

Anyway, apologies for the huge post and thanks for making this handy
piece of software.

Does anyone have an idea of how to fix the concurrency problems I'm
experiencing?

Cheers,
Geoff