[Bucardo-general] Bucardo weird errors on busy system

Fri Feb 20 16:27:32 UTC 2015

Hello all,

I am using Bucardo 5.1.2 to replicate multiple groups of tables in 
multimaster mode. There are currently 2 machines (32 cores, 16 GB), but 
there may be more in the future to make the application more scalable 
(a lot of pg clients).

A group of tables is related to a specific event and consists of 10 
tables and sequences. Two of those tables are updated frequently: 
approximately 20 new rows are inserted and 20 or fewer rows are deleted 
each second.

The main application creates a fixed number of events, not more than 100 
at the early stage of the application. When a new event is created, an 
external program creates the corresponding schema on all machines and 
calls bucardo to create a sync for those new tables and sequences.

Bucardo syncs are created like this:
bucardo add sync sync_xxx db_group=dbgroup_xxx relgroup=relgroup_xxx 
conflict_strategy=bucardo_latest autokick=1
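
For illustration, the per-event call above could be assembled as 
follows. This is only a minimal sketch, assuming the external program is 
written in Python; the function names and the event_id parameter are 
hypothetical, and the arguments are exactly the ones shown in the 
command above, nothing more.

```python
import subprocess


def build_add_sync_command(event_id):
    """Return the argv list for the 'bucardo add sync' call of one event.

    event_id is a hypothetical identifier standing in for the 'xxx'
    placeholder used in the command above.
    """
    return [
        "bucardo", "add", "sync", f"sync_{event_id}",
        f"db_group=dbgroup_{event_id}",
        f"relgroup=relgroup_{event_id}",
        "conflict_strategy=bucardo_latest",
        "autokick=1",
    ]


def add_sync(event_id):
    """Run the command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_add_sync_command(event_id), check=True)
```

Passing the arguments as a list avoids shell quoting issues if an event 
name ever contains unusual characters.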

I first tried with only a few events, 1 or 2 (so there were 1 or 2 
bucardo syncs), and it was working correctly. If I increase the number 
of syncs (to 10), I notice that all sync statuses oscillate between Good 
and Bad and that there are a lot of errors in the logs. When the syncs 
are in the Bad state, it takes them one or two minutes to go back to the 
Good state, meaning the tables are not updated during this time.

Here is the type of error that appears very often in the logs:
<<<
(2754) [Fri Feb 20 15:56:35 2015] KID (the_sync_XXX_6) Kid has died, 
error is: DBD::Pg::db pg_cancel failed: No asynchronous query is running 
at /usr/share/perl5/Bucardo.pm line 5403. Line: 5425 Main DB state: ? 
Error: none DB channel_db_bucardo_0 state: ? Error: none DB 
channel_db_bucardo_1 state: 40001 Error: 7
DBI::db=HASH(0x1cdce80)->disconnect invalidates 20 active statement 
handles (either destroy statement handles or call finish on them before 
disconnecting) at /usr/share/perl5/Bucardo.pm line 2692.
(2754) [Fri Feb 20 15:56:35 2015] KID (the_sync_XXX_6) Kid 2754 exiting 
at cleanup_kid. Sync "the_sync_XXX_6" channel_XXX_0.streams Reason: 
DBD::Pg::db pg_cancel failed: No asynchronous query is running at 
/usr/share/perl5/Bucardo.pm line 5403. Line: 5425 Main DB state: ? 
Error: none DB channel_db_bucardo_0 state: ? Error: none DB 
channel_db_bucardo_1 state: 40001 Error: 7
(2681) [Fri Feb 20 15:56:35 2015] KID (the_sync_XXX_8) Kid has died, 
error is: DBD::Pg::db pg_cancel failed: No asynchronous query is running 
at /usr/share/perl5/Bucardo.pm line 5403. Line: 5425 Main DB state: ? 
Error: none DB channel_db_bucardo_0 state: ? Error: none DB 
channel_db_bucardo_1 state: 40001 Error: 7
DBI::db=HASH(0x1ce50c8)->disconnect invalidates 26 active statement 
handles (either destroy statement handles or call finish on them before 
disconnecting) at /usr/share/perl5/Bucardo.pm line 2692.
(2681) [Fri Feb 20 15:56:35 2015] KID (the_sync_XXX_8) Kid 2681 exiting 
at cleanup_kid. Sync "the_sync_XXX_8" channel_XXX_0.streams Reason: 
DBD::Pg::db pg_cancel failed: No asynchronous query is running at 
/usr/share/perl5/Bucardo.pm line 5403. Line: 5425 Main DB state: ? 
Error: none DB channel_db_bucardo_0 state: ? Error: none DB 
channel_db_bucardo_1 state: 40001 Error: 7
 >>>
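
To see how often each sync dies and which database reports a state, the 
log can be summarized with a small script. The following is a minimal 
sketch in Python, assuming the log format is the one shown in the 
excerpt above; the function name and regular expressions are my own and 
not part of Bucardo. Note that 40001 is the PostgreSQL SQLSTATE for 
serialization_failure.

```python
import re
from collections import Counter

# Start of a "Kid has died" entry; captures the sync name in parentheses.
DIED_RE = re.compile(r"KID \((?P<sync>[^)]+)\)\s+Kid has died")
# "DB <name> state: <value>" fragments; "?" means no state was reported.
STATE_RE = re.compile(r"DB (?P<db>\S+) state: (?P<state>\S+)")
# Each log entry starts with "(pid) [timestamp]".
ENTRY_START_RE = re.compile(r"(?=\(\d+\) \[)")


def summarize_kid_deaths(log_text):
    """Count 'Kid has died' entries per sync and reported states per DB."""
    # Long log lines may be wrapped, so collapse all whitespace first.
    flat = re.sub(r"\s+", " ", log_text)
    deaths = Counter()
    states = Counter()
    for entry in ENTRY_START_RE.split(flat):
        died = DIED_RE.search(entry)
        if not died:
            continue  # e.g. the "exiting at cleanup_kid" entries
        deaths[died.group("sync")] += 1
        for match in STATE_RE.finditer(entry):
            if match.group("state") != "?":
                states[(match.group("db"), match.group("state"))] += 1
    return deaths, states
```

Calling summarize_kid_deaths() on the contents of the log file returns 
two counters: deaths per sync name, and occurrences of each 
(database, SQLSTATE) pair.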

If I restart bucardo, it does not solve the problem.

Questions:
1) How scalable is Bucardo? In other words, is there a limit on the 
number of syncs that would keep a solution with a lot of syncs from 
scaling? For example, are syncs independent of each other, or do they 
depend on a process that controls all of them, i.e., if one process is 
blocked, does it affect the others?

2) Is there a way to get rid of those errors? I guess they are related 
to the fact that the syncs are not being refreshed.

3) Is there a way to make the syncs more responsive? For example, the 
kid can be created with the options "checktime", "lifetime", "maxkicks", 
"overdue" or "expired", but I am not sure I understand the benefits of 
those options.

4) I notice there are global options to control the bucardo child 
processes, for example 'ctl_checkonkids_time'. Could it help to restart 
failed processes more quickly?

Last question, which may or may not be related: I notice that some syncs 
sometimes become inactive. After that, I find no way to make them work 
again: using bucardo activate XXX does not solve it, stopping and 
restarting the daemon does not help, and nothing special is present in 
the logs to explain what is wrong. So why can a sync become inactive, 
and what should I do in this case?

Thanks and regards,
Sylvain