[Bucardo-general] Sync Status when One Node is Offline
Mike Tonks
michael.tonks at headforwards.com
Wed May 8 15:42:30 UTC 2013
Hi,
As per my previous mail, I'm testing with 4.99.7 on ubuntu 13.04,
Postgres 9.1.9
I have 4 source databases and one target, with a very simple test
schema, and am testing the scenario when one db goes offline.
For now, I test this using:
- sudo /etc/init.d/postgres stop # (kill the pg server)
I repeat the test a few times, and sometimes after taking the node
offline and inserting into an active node to break replication, status
incorrectly reports good. I expect to see 'Bad' here. This is
unpredictable, sometimes it reports correctly and sometimes it does not.
Name State Last good Time Last I/D Last bad Time
==========+========+============+=======+===========+===========+=======
testsync | Good | 16:33:28 | 25s | 0/4 | 16:32:55 | 58s
In the log I can see it has trapped the connection error:
(12265) [Wed May 8 16:34:17 2013] MCP DB "testa" Postgres version: 90109
(12265) [Wed May 8 16:34:17 2013] MCP DB "testa" Database port: 5432
(12265) [Wed May 8 16:34:17 2013] MCP Connecting to database
"testa_remote" (source)
(12265) [Wed May 8 16:34:17 2013] MCP Warning: Killed (line 100): DBI
connect('dbname=testa;host=192.168.97.93','bucardo',...) failed: could
not connect to server: Connection refused
Is the server running on host "192.168.97.93" and accepting
TCP/IP connections on port 5432? at
/usr/local/share/perl/5.14.2/Bucardo.pm line 4941.
(12265) [Wed May 8 16:34:17 2013] MCP Database problem, will respawn
after a short sleep: 15
(12265) [Wed May 8 16:34:18 2013] MCP End of cleanup_mcp. Sys time: Wed
May 8 16:34:18 2013. Database time: 2013-05-08 16:34:18.900294+01
(12265) [Wed May 8 16:34:18 2013] MCP Sleep time: 15
(12270) [Wed May 8 16:34:19 2013] KID New kid, sync "testsync" alive=1
Parent=11974 PID=12270 kicked=1
(12270) [Wed May 8 16:34:19 2013] KID Missing testa_remote database handle
(12270) [Wed May 8 16:34:19 2013] KID Missing testb database handle
(12270) [Wed May 8 16:34:19 2013] KID Missing testc database handle
(12270) [Wed May 8 16:34:19 2013] KID Missing testd database handle
(12270) [Wed May 8 16:34:19 2013] KID Kid 12270 exiting at cleanup_kid.
Sync "testsync" Reason: DBI
connect('dbname=testa;host=192.168.97.93','bucardo',...) failed: could
not connect to server: Connection refused Is the server running on
host "192.168.97.93" and accepting TCP/IP connections on port 5432?
at /usr/local/share/perl/5.14.2/Bucardo.pm line 4941. Line: 2718
(11974) [Wed May 8 16:34:19 2013] CTL Created new kid 12270 for sync
"testsync"
(12275) [Wed May 8 16:34:30 2013] KID New kid, sync "testsync" alive=1
Parent=11974 PID=12275 kicked=1
(12275) [Wed May 8 16:34:30 2013] KID Missing testa_remote database handle
(12275) [Wed May 8 16:34:30 2013] KID Missing testb database handle
(12275) [Wed May 8 16:34:30 2013] KID Missing testc database handle
(12275) [Wed May 8 16:34:30 2013] KID Missing testd database handle
(12275) [Wed May 8 16:34:30 2013] KID Kid 12275 exiting at cleanup_kid.
Sync "testsync" Reason: DBI
connect('dbname=testa;host=192.168.97.93','bucardo',...) failed: could
not connect to server: Connection refused Is the server running on
host "192.168.97.93" and accepting TCP/IP connections on port 5432?
at /usr/local/share/perl/5.14.2/Bucardo.pm line 4941. Line: 2718
(11974) [Wed May 8 16:34:30 2013] CTL Created new kid 12275 for sync
"testsync"
But status is still reporting good:
Name State Last good Time Last I/D Last bad Time
==========+========+============+========+===========+===========+=======
testsync | Good | 16:33:28 | 1m 33s | 0/4 | 16:32:55 | 2m 6s
I try inserting a few more rows into the online node for luck, but
status is not affected.
Now I bring the offline node back up. Replication catches up correctly,
but the 'Last Bad' was not registered'
Name State Last good Time Last I/D Last bad Time
==========+========+============+=======+===========+===========+========
testsync | Good | 16:36:20 | 4s | 0/12 | 16:32:55 | 3m 29s
and the log file ...
(12381) [Wed May 8 16:36:01 2013] MCP Activating sync "testsync"
(12381) [Wed May 8 16:36:01 2013] MCP Running validate_sync on "testsync"
(12381) [Wed May 8 16:36:01 2013] MCP Connecting to database "testa"
(source)
(12381) [Wed May 8 16:36:01 2013] MCP DB "testa" Local epoch:
1368027361.55505 DB epoch: 1368027361.55507
(12381) [Wed May 8 16:36:01 2013] MCP DB "testa" Local time: Wed May 8
16:36:01 2013 DB time: 2013-05-08 16:36:01.555066+01
(12381) [Wed May 8 16:36:01 2013] MCP DB "testa" Local timezone: BST
(+0100) DB timezone: GB
(12381) [Wed May 8 16:36:01 2013] MCP DB "testa" Postgres version: 90109
(12381) [Wed May 8 16:36:01 2013] MCP DB "testa" Database port: 5432
(12381) [Wed May 8 16:36:01 2013] MCP Connecting to database
"testa_remote" (source)
(12381) [Wed May 8 16:36:01 2013] MCP Warning: Killed (line 100): DBI
connect('dbname=testa;host=192.168.97.93','bucardo',...) failed: could
not connect to server: Connection refused
Is the server running on host "192.168.97.93" and accepting
TCP/IP connections on port 5432? at
/usr/local/share/perl/5.14.2/Bucardo.pm line 4941.
(12381) [Wed May 8 16:36:01 2013] MCP Database problem, will respawn
after a short sleep: 15
(12381) [Wed May 8 16:36:03 2013] MCP End of cleanup_mcp. Sys time: Wed
May 8 16:36:03 2013. Database time: 2013-05-08 16:36:03.076622+01
(12381) [Wed May 8 16:36:03 2013] MCP Sleep time: 15
(12387) [Wed May 8 16:36:09 2013] KID New kid, sync "testsync" alive=1
Parent=11974 PID=12387 kicked=1
(12387) [Wed May 8 16:36:09 2013] KID Missing testa_remote database handle
(12387) [Wed May 8 16:36:09 2013] KID Missing testb database handle
(12387) [Wed May 8 16:36:09 2013] KID Missing testc database handle
(12387) [Wed May 8 16:36:09 2013] KID Missing testd database handle
(12387) [Wed May 8 16:36:09 2013] KID Kid 12387 exiting at cleanup_kid.
Sync "testsync" Reason: DBI
connect('dbname=testa;host=192.168.97.93','bucardo',...) failed: could
not connect to server: Connection refused Is the server running on
host "192.168.97.93" and accepting TCP/IP connections on port 5432?
at /usr/local/share/perl/5.14.2/Bucardo.pm line 4941. Line: 2718
(11974) [Wed May 8 16:36:09 2013] CTL Created new kid 12387 for sync
"testsync"
(12381) [Wed May 8 16:36:18 2013] MCP Respawn attempt:
/usr/local/bin/bucardo start 'Attempting automatic respawn after MCP death'
(12394) [Wed May 8 16:36:18 2013] MCP Starting Bucardo version 4.99.7
(12394) [Wed May 8 16:36:18 2013] MCP Log level: normal
(12395) [Wed May 8 16:36:18 2013] MCP Master DB Local epoch:
1368027378.27916 DB epoch: 1368027378.27921
(12395) [Wed May 8 16:36:18 2013] MCP Master DB Local time: Wed May 8
16:36:18 2013 DB time: 2013-05-08 16:36:18.279208+01
(12395) [Wed May 8 16:36:18 2013] MCP Master DB Local timezone: BST
(+0100) DB timezone: GB
(12395) [Wed May 8 16:36:18 2013] MCP Master DB Postgres version: 90109
(12395) [Wed May 8 16:36:18 2013] MCP Master DB Database port: 5432
(12395) [Wed May 8 16:36:18 2013] MCP PID: 12395
(12395) [Wed May 8 16:36:18 2013] MCP Postgres backend PID: 12396
(12395) [Wed May 8 16:36:18 2013] MCP Postgres library version: 90106
(12395) [Wed May 8 16:36:18 2013] MCP bucardo: /usr/local/bin/bucardo
(12395) [Wed May 8 16:36:18 2013] MCP Bucardo.pm:
/usr/local/share/perl/5.14.2/Bucardo.pm
(12395) [Wed May 8 16:36:18 2013] MCP OS: linux Perl: /usr/bin/perl 5.14.2
(12395) [Wed May 8 16:36:18 2013] MCP DBI version: 1.622 DBD::Pg
version: 2.19.2 (21902) DBIx::Safe version: 1.2.5
(12395) [Wed May 8 16:36:18 2013] MCP Bucardo object:
batch => '0'
bcverbose => '1'
created => 'Wed May 8 16:36:18 2013'
dbdpgversion => '21902'
dbhlist => 'HASH(0x1adaf48)'
dbhost => '127.0.0.1'
dbname => 'bucardo'
dbpass => '<not shown>'
dbport => '5432'
dbuser => 'bucardo'
dryrun => '0'
exit_on_nosync => '0'
extraname => ''
listening => 'HASH(0x1eedf98)'
logclean => '0'
logcodes => 'ARRAY(0x19eb378)'
logdest => 'ARRAY(0x1630218)'
logextension => ''
logpid => '12395'
logprefix => 'MCP'
logseparate => '0'
masterdbh => 'DBI::db=HASH(0x1c633d0)'
mcp_backend => '12396'
mcp_clock_timestamp => 'clock_timestamp()'
mcppid => '12381'
pidfile => '/var/run/bucardo/bucardo.mcp.pid'
pidmap => 'HASH(0x17b4378)'
sendmail => '0'
sendmail_file => ''
sqlprefix => '/* Bucardo 4.99.7 */'
stopfile => '/var/run/bucardo/fullstopbucardo'
verbose => '1'
version => '4.99.7'
warning_file => ''
(12395) [Wed May 8 16:36:18 2013] MCP Bucardo config:
autosync_ddl => 'newcol'
bucardo_current_version => '4.99.7'
bucardo_vac => '1'
bucardo_version => '4.99.6'
ctl_checkonkids_time => '10'
ctl_createkid_time => '0.5'
ctl_sleep => '0.2'
default_conflict_strategy => 'bucardo_latest'
default_email_from => 'root'
default_email_host => 'localhost'
default_email_to => 'root'
email_debug_file => ''
endsync_sleep => '1.0'
flatfile_dir => '.'
host_safety_check => ''
kid_deadlock_sleep => '0.5'
kid_nodeltarows_sleep => '0.5'
kid_pingtime => '60'
kid_restart_sleep => '1'
kid_serial_sleep => '0.5'
kid_sleep => '0.5'
log_conflict_file => '/var/log/bucardo/conflict.log'
log_level => 'normal'
log_level_number => '2'
log_microsecond => '0'
log_showlevel => '0'
log_showline => '0'
log_showpid => '1'
log_showtime => '3'
mcp_dbproblem_sleep => '15'
mcp_loop_sleep => '0.2'
mcp_pingtime => '60'
mcp_vactime => '60'
piddir => '/var/run/bucardo'
reason_file => '/var/log/bucardo/restart.reason'
semaphore_table => 'bucardo_status'
statement_chunk_size => '10000'
stats_script_url => 'http://www.bucardo.org/'
stopfile => 'fullstopbucardo'
syslog_facility => 'log_local1'
tcp_keepalives_count => '0'
tcp_keepalives_idle => '0'
tcp_keepalives_interval => '0'
vac_run => '30'
vac_sleep => '120'
warning_file => ''
(12395) [Wed May 8 16:36:18 2013] MCP Activating sync "testsync"
(12395) [Wed May 8 16:36:18 2013] MCP Running validate_sync on "testsync"
(12395) [Wed May 8 16:36:18 2013] MCP Connecting to database "testa"
(source)
(12395) [Wed May 8 16:36:18 2013] MCP DB "testa" Local epoch:
1368027378.91922 DB epoch: 1368027378.91924
(12395) [Wed May 8 16:36:18 2013] MCP DB "testa" Local time: Wed May 8
16:36:18 2013 DB time: 2013-05-08 16:36:18.919237+01
(12395) [Wed May 8 16:36:18 2013] MCP DB "testa" Local timezone: BST
(+0100) DB timezone: GB
(12395) [Wed May 8 16:36:18 2013] MCP DB "testa" Postgres version: 90109
(12395) [Wed May 8 16:36:18 2013] MCP DB "testa" Database port: 5432
(12395) [Wed May 8 16:36:18 2013] MCP Connecting to database
"testa_remote" (source)
(12395) [Wed May 8 16:36:18 2013] MCP DB "testa_remote" Local epoch:
1368027378.94052 DB epoch: 1368027378.62312
(12395) [Wed May 8 16:36:18 2013] MCP DB "testa_remote" Local time: Wed
May 8 16:36:18 2013 DB time: 2013-05-08 16:36:18.623115+01
(12395) [Wed May 8 16:36:18 2013] MCP DB "testa_remote" Local timezone:
BST (+0100) DB timezone: GB
(12395) [Wed May 8 16:36:18 2013] MCP DB "testa_remote" Postgres
version: 90109
(12395) [Wed May 8 16:36:18 2013] MCP DB "testa_remote" Database port: 5432
(12395) [Wed May 8 16:36:18 2013] MCP Connecting to database "testb"
(source)
(12395) [Wed May 8 16:36:18 2013] MCP DB "testb" Local epoch:
1368027378.94788 DB epoch: 1368027378.94789
(12395) [Wed May 8 16:36:18 2013] MCP DB "testb" Local time: Wed May 8
16:36:18 2013 DB time: 2013-05-08 16:36:18.94789+01
(12395) [Wed May 8 16:36:18 2013] MCP DB "testb" Local timezone: BST
(+0100) DB timezone: GB
(12395) [Wed May 8 16:36:18 2013] MCP DB "testb" Postgres version: 90109
(12395) [Wed May 8 16:36:18 2013] MCP DB "testb" Database port: 5432
(12395) [Wed May 8 16:36:18 2013] MCP Connecting to database "testc"
(source)
(12395) [Wed May 8 16:36:18 2013] MCP DB "testc" Local epoch:
1368027378.95366 DB epoch: 1368027378.95371
(12395) [Wed May 8 16:36:18 2013] MCP DB "testc" Local time: Wed May 8
16:36:18 2013 DB time: 2013-05-08 16:36:18.953713+01
(12395) [Wed May 8 16:36:18 2013] MCP DB "testc" Local timezone: BST
(+0100) DB timezone: GB
(12395) [Wed May 8 16:36:18 2013] MCP DB "testc" Postgres version: 90109
(12395) [Wed May 8 16:36:18 2013] MCP DB "testc" Database port: 5432
(12395) [Wed May 8 16:36:18 2013] MCP Connecting to database "testd"
(source)
(12395) [Wed May 8 16:36:18 2013] MCP DB "testd" Local epoch:
1368027378.96011 DB epoch: 1368027378.96012
(12395) [Wed May 8 16:36:18 2013] MCP DB "testd" Local time: Wed May 8
16:36:18 2013 DB time: 2013-05-08 16:36:18.960125+01
(12395) [Wed May 8 16:36:18 2013] MCP DB "testd" Local timezone: BST
(+0100) DB timezone: GB
(12395) [Wed May 8 16:36:18 2013] MCP DB "testd" Postgres version: 90109
(12395) [Wed May 8 16:36:18 2013] MCP DB "testd" Database port: 5432
(12395) [Wed May 8 16:36:19 2013] MCP Inspecting source table
"public.test" on database "testa"
(12395) [Wed May 8 16:36:19 2013] MCP Active syncs: 1
(12395) [Wed May 8 16:36:19 2013] MCP Entering main loop
(12405) [Wed May 8 16:36:19 2013] VAC New VAC daemon. PID=12405
(12405) [Wed May 8 16:36:19 2013] VAC Connected to database "testd"
with backend PID of 12407
(12405) [Wed May 8 16:36:19 2013] VAC Connected to database "testb"
with backend PID of 12408
(12405) [Wed May 8 16:36:19 2013] VAC Connected to database "testa"
with backend PID of 12409
(12405) [Wed May 8 16:36:19 2013] VAC Connected to database "testc"
with backend PID of 12410
(12405) [Wed May 8 16:36:19 2013] VAC Connected to database
"testa_remote" with backend PID of 6085
NOTICE: Rows deleted from delta_public_test: 3 Rows deleted from
track_public_test: 3
(12395) [Wed May 8 16:36:19 2013] MCP Created VAC 12405
NOTICE: Rows deleted from delta_public_test: 0 Rows deleted from
track_public_test: 0
NOTICE: Rows deleted from delta_public_test: 0 Rows deleted from
track_public_test: 0
NOTICE: Rows deleted from delta_public_test: 0 Rows deleted from
track_public_test: 0
NOTICE: Rows deleted from delta_public_test: 0 Rows deleted from
track_public_test: 0
(12411) [Wed May 8 16:36:19 2013] CTL New controller for sync
"testsync". Relgroup is "testherd", dbs is "testgroup". PID=12411
(12411) [Wed May 8 16:36:19 2013] CTL stayalive: 1 checksecs: 0 kicked: 1
(12411) [Wed May 8 16:36:19 2013] CTL kidsalive: 1 onetimecopy: 0
lifetimesecs: 0 (NULL) maxkicks: 0
(12411) [Wed May 8 16:36:19 2013] CTL Database "testa" backend PID: 12413
(12395) [Wed May 8 16:36:19 2013] MCP Created controller 12411 for sync
"testsync". Kick is 1
(12411) [Wed May 8 16:36:19 2013] CTL Database "testa_remote" backend
PID: 6086
(12411) [Wed May 8 16:36:19 2013] CTL Database "testb" backend PID: 12414
(12411) [Wed May 8 16:36:19 2013] CTL Database "testc" backend PID: 12415
(12411) [Wed May 8 16:36:19 2013] CTL Database "testd" backend PID: 12416
(12417) [Wed May 8 16:36:19 2013] KID New kid, sync "testsync" alive=1
Parent=12411 PID=12417 kicked=1
(11974) [Wed May 8 16:36:19 2013] CTL Got a stop sync request, so exiting
(11974) [Wed May 8 16:36:19 2013] CTL Warning! Controller for
"testsync" was killed at line 1779: Stop sync request at
/usr/local/share/perl/5.14.2/Bucardo.pm line 1779.
(12417) [Wed May 8 16:36:19 2013] KID Got a stop sync request, so exiting
(12417) [Wed May 8 16:36:19 2013] KID Kid 12417 exiting at cleanup_kid.
Sync "testsync" public.test Reason: Stop sync request at
/usr/local/share/perl/5.14.2/Bucardo.pm line 2754. Line: 4762
(12411) [Wed May 8 16:36:19 2013] CTL Created new kid 12417 for sync
"testsync"
(11974) [Wed May 8 16:36:19 2013] CTL Controller 11974 exiting at
cleanup_controller. Reason: Stop sync request at
/usr/local/share/perl/5.14.2/Bucardo.pm line 1779.
(12425) [Wed May 8 16:36:20 2013] KID New kid, sync "testsync" alive=1
Parent=12411 PID=12425 kicked=1
(12411) [Wed May 8 16:36:20 2013] CTL Created new kid 12425 for sync
"testsync"
(12425) [Wed May 8 16:36:20 2013] KID Delta count for
testa.public.test : 3
(12425) [Wed May 8 16:36:20 2013] KID Totals: deletes=0 inserts=12
conflicts=0
Thanks for any help with this, I hope it is useful for your beta testing.
Regards,
Mike Tonks
More information about the Bucardo-general
mailing list