[Bucardo-general] Bucardo weird errors on busy system
sym39
marechal.sylvain2 at gmail.com
Fri Feb 20 16:27:32 UTC 2015
Hello all,
I am using Bucardo 5.1.2 to replicate multiple groups of tables in
multi-master mode, currently between 2 machines (32 cores, 16 GB), possibly
more in the future to make the application more scalable (lots of
PostgreSQL clients).
A group of tables is tied to a specific event and consists of 10
tables and sequences. Two of those tables are updated frequently:
approximately 20 rows inserted and 20 or fewer rows deleted each second.
The main application creates a fixed number of events, no more than 100
in the early stage of the application. When a new event is created, an
external program creates the corresponding schema on all machines and
calls bucardo to create a sync for the new tables and sequences.
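To put these figures in perspective, the rough estimate below multiplies them out. It assumes the quoted rates apply to each of the two busy tables separately (if they are totals for both tables, halve the results) and treats "20 or fewer" deletes as exactly 20, so the numbers are an upper bound, not a measurement.

```python
# Rough upper bound on row changes per second to replicate, using the
# figures quoted above. ASSUMPTION: the rates apply per busy table.
INSERTS_PER_SEC = 20
DELETES_PER_SEC = 20  # "20 or fewer", so this is a ceiling
BUSY_TABLES_PER_EVENT = 2


def changes_per_second(num_events: int) -> int:
    """Upper bound on replicated row changes per second for num_events syncs."""
    per_table = INSERTS_PER_SEC + DELETES_PER_SEC
    return per_table * BUSY_TABLES_PER_EVENT * num_events


for events in (1, 2, 10, 100):
    print(f"{events:>3} events -> up to {changes_per_second(events)} row changes/s")
```

Under these assumptions, the 10 syncs at which the problems appear amount to up to 800 row changes per second, and 100 events would mean up to 8000.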
The Bucardo syncs are created like this:
    bucardo add sync sync_xxx db_group=dbgroup_xxx relgroup=relgroup_xxx conflict_strategy=bucardo_latest autokick=1
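Since the external program issues this command once per event, a minimal sketch of how it might build the command follows. The event name "event42" and the substitution of the event name into the sync_/dbgroup_/relgroup_ pattern are hypothetical; only the options are taken from the command above, and the dbgroup and relgroup are assumed to exist already.

```python
import shlex


def sync_command(event: str) -> list[str]:
    """Build the 'bucardo add sync' argv for one event.

    HYPOTHETICAL naming: the event name replaces "xxx" in the
    sync_xxx / dbgroup_xxx / relgroup_xxx pattern shown above.
    """
    return [
        "bucardo", "add", "sync", f"sync_{event}",
        f"db_group=dbgroup_{event}",
        f"relgroup=relgroup_{event}",
        "conflict_strategy=bucardo_latest",
        "autokick=1",
    ]


if __name__ == "__main__":
    # Print the command instead of running it; passing the list to
    # subprocess.run(cmd, check=True) would actually execute it.
    print(shlex.join(sync_command("event42")))
```

Building an argument list rather than a single string avoids shell-quoting problems if event names ever contain unusual characters.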
I first tried with only a few events, 1 or 2 (so 1 or 2 Bucardo syncs),
and it worked correctly. When I increase the number of syncs (to 10), I
notice that the status of all syncs oscillates between Good and Bad and
that there are a lot of errors in the logs. When a sync is in the Bad
state, it takes one or two minutes to go back to Good, meaning the
tables are not updated during this time.
Here is the type of error that appears very often in the logs:
<<<
(2754) [Fri Feb 20 15:56:35 2015] KID (the_sync_XXX_6) Kid has died,
error is: DBD::Pg::db pg_cancel failed: No asynchronous query is running
at /usr/share/perl5/Bucardo.pm line 5403. Line: 5425 Main DB state: ?
Error: none DB channel_db_bucardo_0 state: ? Error: none DB
channel_db_bucardo_1 state: 40001 Error: 7
DBI::db=HASH(0x1cdce80)->disconnect invalidates 20 active statement
handles (either destroy statement handles or call finish on them before
disconnecting) at /usr/share/perl5/Bucardo.pm line 2692.
(2754) [Fri Feb 20 15:56:35 2015] KID (the_sync_XXX_6) Kid 2754 exiting
at cleanup_kid. Sync "the_sync_XXX_6" channel_XXX_0.streams Reason:
DBD::Pg::db pg_cancel failed: No asynchronous query is running at
/usr/share/perl5/Bucardo.pm line 5403. Line: 5425 Main DB state: ?
Error: none DB channel_db_bucardo_0 state: ? Error: none DB
channel_db_bucardo_1 state: 40001 Error: 7
(2681) [Fri Feb 20 15:56:35 2015] KID (the_sync_XXX_8) Kid has died,
error is: DBD::Pg::db pg_cancel failed: No asynchronous query is running
at /usr/share/perl5/Bucardo.pm line 5403. Line: 5425 Main DB state: ?
Error: none DB channel_db_bucardo_0 state: ? Error: none DB
channel_db_bucardo_1 state: 40001 Error: 7
DBI::db=HASH(0x1ce50c8)->disconnect invalidates 26 active statement
handles (either destroy statement handles or call finish on them before
disconnecting) at /usr/share/perl5/Bucardo.pm line 2692.
(2681) [Fri Feb 20 15:56:35 2015] KID (the_sync_XXX_8) Kid 2681 exiting
at cleanup_kid. Sync "the_sync_XXX_8" channel_XXX_0.streams Reason:
DBD::Pg::db pg_cancel failed: No asynchronous query is running at
/usr/share/perl5/Bucardo.pm line 5403. Line: 5425 Main DB state: ?
Error: none DB channel_db_bucardo_0 state: ? Error: none DB
channel_db_bucardo_1 state: 40001 Error: 7
>>>
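To see how often each sync's kid dies and which database states are reported, a log like the one above can be summarized with a short script. This sketch assumes each log message sits on a single line in the actual log file (the lines above appear to have been wrapped by the mailing list); the regular expressions are written against this excerpt, not against any documented Bucardo log format. The only SQLSTATE named is 40001, which PostgreSQL defines as serialization_failure.

```python
import re
import sys
from collections import Counter

# Matches lines such as:
# (2754) [Fri Feb 20 15:56:35 2015] KID (the_sync_XXX_6) Kid has died, ...
DIED = re.compile(r"KID \((?P<sync>[^)]+)\) Kid has died")
# Matches fragments such as "DB channel_db_bucardo_1 state: 40001";
# "state: ?" entries are skipped because they carry no SQLSTATE code.
STATE = re.compile(r"DB (?P<db>\S+) state: (?P<state>[0-9A-Z]{5})")

# Only the code seen in the excerpt; 40001 is PostgreSQL's serialization_failure.
SQLSTATE_NAMES = {"40001": "serialization_failure"}


def summarize(lines):
    """Count kid deaths per sync and SQLSTATE codes per (database, code)."""
    deaths = Counter()
    states = Counter()
    for line in lines:
        died = DIED.search(line)
        if not died:
            continue  # only "Kid has died" messages, so "exiting" lines are not double counted
        deaths[died.group("sync")] += 1
        for m in STATE.finditer(line):
            states[(m.group("db"), m.group("state"))] += 1
    return deaths, states


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python summarize_kids.py <bucardo log file> (path is installation-specific)
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        deaths, states = summarize(f)
    for sync, n in deaths.most_common():
        print(f"{sync}: {n} kid deaths")
    for (db, code), n in states.most_common():
        print(f"{db} SQLSTATE {code} ({SQLSTATE_NAMES.get(code, 'unknown')}): {n}")
```

Counting per sync shows whether failures are spread across all syncs or concentrated on a few, which bears directly on the question of whether syncs affect one another.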
Restarting Bucardo does not solve the problem.
Questions:
1) How scalable is Bucardo? In other words, is there a limit on the
number of syncs that would make a solution with many syncs unscalable?
For example, are syncs independent of each other, or do they all depend
on a single process that controls them, i.e., if one process is blocked,
does that affect the others?
2) Is there a way to get rid of these errors? I guess they are related
to the fact that the syncs are not refreshed.
3) Is there a way to make the syncs more responsive? For example, the
kid can be created with the options "checktime", "lifetime", "maxkicks",
"overdue" or "expired", but I am not sure I understand the benefits of
those options.
4) I notice there are global options to control the Bucardo child
processes, for example 'ctl_checkonkids_time'. Can they help restart
failing processes more quickly?
One last question, which may or may not be related: I notice that some
syncs sometimes become inactive. After that, I find no way to make them
work again: running bucardo activate XXX does not help, stopping and
restarting the daemon does not help, and nothing special appears in the
logs to explain what is wrong. Why can a sync become inactive, and what
should I do in that case?
Thanks and regards,
Sylvain