[Bucardo-general] Sync Status when One Node is Offline

Mike Tonks michael.tonks at headforwards.com
Wed May 8 15:42:30 UTC 2013


Hi,

As per my previous mail, I'm testing with 4.99.7 on ubuntu 13.04, 
Postgres 9.1.9

I have 4 source databases and one target, with a very simple test 
schema, and am testing the scenario when one db goes offline.

For now, I test this using:

  - sudo /etc/init.d/postgres stop # (kill the pg server)


I repeat the test a few times, and sometimes after taking the node 
offline and inserting into an active node to break replication, status 
incorrectly reports good.  I expect to see 'Bad' here.  This is 
unpredictable, sometimes it reports correctly and sometimes it does not.

  Name       State    Last good    Time    Last I/D    Last bad Time
==========+========+============+=======+===========+===========+=======
  testsync | Good   | 16:33:28   | 25s   | 0/4       | 16:32:55  | 58s


In the log I can see it has trapped the connection error:

(12265) [Wed May  8 16:34:17 2013] MCP DB "testa" Postgres version: 90109
(12265) [Wed May  8 16:34:17 2013] MCP DB "testa" Database port: 5432
(12265) [Wed May  8 16:34:17 2013] MCP Connecting to database 
"testa_remote" (source)
(12265) [Wed May  8 16:34:17 2013] MCP Warning: Killed (line 100): DBI 
connect('dbname=testa;host=192.168.97.93','bucardo',...) failed: could 
not connect to server: Connection refused
     Is the server running on host "192.168.97.93" and accepting
     TCP/IP connections on port 5432? at 
/usr/local/share/perl/5.14.2/Bucardo.pm line 4941.
(12265) [Wed May  8 16:34:17 2013] MCP Database problem, will respawn 
after a short sleep: 15
(12265) [Wed May  8 16:34:18 2013] MCP End of cleanup_mcp. Sys time: Wed 
May  8 16:34:18 2013. Database time: 2013-05-08 16:34:18.900294+01
(12265) [Wed May  8 16:34:18 2013] MCP Sleep time: 15
(12270) [Wed May  8 16:34:19 2013] KID New kid, sync "testsync" alive=1 
Parent=11974 PID=12270 kicked=1
(12270) [Wed May  8 16:34:19 2013] KID Missing testa_remote database handle
(12270) [Wed May  8 16:34:19 2013] KID Missing testb database handle
(12270) [Wed May  8 16:34:19 2013] KID Missing testc database handle
(12270) [Wed May  8 16:34:19 2013] KID Missing testd database handle
(12270) [Wed May  8 16:34:19 2013] KID Kid 12270 exiting at cleanup_kid. 
Sync "testsync" Reason: DBI 
connect('dbname=testa;host=192.168.97.93','bucardo',...) failed: could 
not connect to server: Connection refused     Is the server running on 
host "192.168.97.93" and accepting     TCP/IP connections on port 5432? 
at /usr/local/share/perl/5.14.2/Bucardo.pm line 4941. Line: 2718
(11974) [Wed May  8 16:34:19 2013] CTL Created new kid 12270 for sync 
"testsync"
(12275) [Wed May  8 16:34:30 2013] KID New kid, sync "testsync" alive=1 
Parent=11974 PID=12275 kicked=1
(12275) [Wed May  8 16:34:30 2013] KID Missing testa_remote database handle
(12275) [Wed May  8 16:34:30 2013] KID Missing testb database handle
(12275) [Wed May  8 16:34:30 2013] KID Missing testc database handle
(12275) [Wed May  8 16:34:30 2013] KID Missing testd database handle
(12275) [Wed May  8 16:34:30 2013] KID Kid 12275 exiting at cleanup_kid. 
Sync "testsync" Reason: DBI 
connect('dbname=testa;host=192.168.97.93','bucardo',...) failed: could 
not connect to server: Connection refused     Is the server running on 
host "192.168.97.93" and accepting     TCP/IP connections on port 5432? 
at /usr/local/share/perl/5.14.2/Bucardo.pm line 4941. Line: 2718
(11974) [Wed May  8 16:34:30 2013] CTL Created new kid 12275 for sync 
"testsync"


But status is still reporting good:

  Name       State    Last good    Time     Last I/D    Last bad Time
==========+========+============+========+===========+===========+=======
  testsync | Good   | 16:33:28   | 1m 33s | 0/4       | 16:32:55  | 2m 6s



I try inserting a few more rows into the online node for luck, but 
status is not affected.

Now I bring the offline node back up.  Replication catches up correctly, 
but the 'Last Bad' was not registered'

  Name       State    Last good    Time    Last I/D    Last bad Time
==========+========+============+=======+===========+===========+========
  testsync | Good   | 16:36:20   | 4s    | 0/12      | 16:32:55  | 3m 29s


and the log file ...

(12381) [Wed May  8 16:36:01 2013] MCP Activating sync "testsync"
(12381) [Wed May  8 16:36:01 2013] MCP Running validate_sync on "testsync"
(12381) [Wed May  8 16:36:01 2013] MCP Connecting to database "testa" 
(source)
(12381) [Wed May  8 16:36:01 2013] MCP DB "testa" Local epoch: 
1368027361.55505  DB epoch: 1368027361.55507
(12381) [Wed May  8 16:36:01 2013] MCP DB "testa" Local time: Wed May  8 
16:36:01 2013  DB time: 2013-05-08 16:36:01.555066+01
(12381) [Wed May  8 16:36:01 2013] MCP DB "testa" Local timezone: BST 
(+0100)  DB timezone: GB
(12381) [Wed May  8 16:36:01 2013] MCP DB "testa" Postgres version: 90109
(12381) [Wed May  8 16:36:01 2013] MCP DB "testa" Database port: 5432
(12381) [Wed May  8 16:36:01 2013] MCP Connecting to database 
"testa_remote" (source)
(12381) [Wed May  8 16:36:01 2013] MCP Warning: Killed (line 100): DBI 
connect('dbname=testa;host=192.168.97.93','bucardo',...) failed: could 
not connect to server: Connection refused
     Is the server running on host "192.168.97.93" and accepting
     TCP/IP connections on port 5432? at 
/usr/local/share/perl/5.14.2/Bucardo.pm line 4941.
(12381) [Wed May  8 16:36:01 2013] MCP Database problem, will respawn 
after a short sleep: 15
(12381) [Wed May  8 16:36:03 2013] MCP End of cleanup_mcp. Sys time: Wed 
May  8 16:36:03 2013. Database time: 2013-05-08 16:36:03.076622+01
(12381) [Wed May  8 16:36:03 2013] MCP Sleep time: 15
(12387) [Wed May  8 16:36:09 2013] KID New kid, sync "testsync" alive=1 
Parent=11974 PID=12387 kicked=1
(12387) [Wed May  8 16:36:09 2013] KID Missing testa_remote database handle
(12387) [Wed May  8 16:36:09 2013] KID Missing testb database handle
(12387) [Wed May  8 16:36:09 2013] KID Missing testc database handle
(12387) [Wed May  8 16:36:09 2013] KID Missing testd database handle
(12387) [Wed May  8 16:36:09 2013] KID Kid 12387 exiting at cleanup_kid. 
Sync "testsync" Reason: DBI 
connect('dbname=testa;host=192.168.97.93','bucardo',...) failed: could 
not connect to server: Connection refused     Is the server running on 
host "192.168.97.93" and accepting     TCP/IP connections on port 5432? 
at /usr/local/share/perl/5.14.2/Bucardo.pm line 4941. Line: 2718
(11974) [Wed May  8 16:36:09 2013] CTL Created new kid 12387 for sync 
"testsync"
(12381) [Wed May  8 16:36:18 2013] MCP Respawn attempt: 
/usr/local/bin/bucardo  start 'Attempting automatic respawn after MCP death'
(12394) [Wed May  8 16:36:18 2013] MCP Starting Bucardo version 4.99.7
(12394) [Wed May  8 16:36:18 2013] MCP Log level: normal
(12395) [Wed May  8 16:36:18 2013] MCP Master DB Local epoch: 
1368027378.27916  DB epoch: 1368027378.27921
(12395) [Wed May  8 16:36:18 2013] MCP Master DB Local time: Wed May  8 
16:36:18 2013  DB time: 2013-05-08 16:36:18.279208+01
(12395) [Wed May  8 16:36:18 2013] MCP Master DB Local timezone: BST 
(+0100)  DB timezone: GB
(12395) [Wed May  8 16:36:18 2013] MCP Master DB Postgres version: 90109
(12395) [Wed May  8 16:36:18 2013] MCP Master DB Database port: 5432
(12395) [Wed May  8 16:36:18 2013] MCP PID: 12395
(12395) [Wed May  8 16:36:18 2013] MCP Postgres backend PID: 12396
(12395) [Wed May  8 16:36:18 2013] MCP Postgres library version: 90106
(12395) [Wed May  8 16:36:18 2013] MCP bucardo: /usr/local/bin/bucardo
(12395) [Wed May  8 16:36:18 2013] MCP Bucardo.pm: 
/usr/local/share/perl/5.14.2/Bucardo.pm
(12395) [Wed May  8 16:36:18 2013] MCP OS: linux  Perl: /usr/bin/perl 5.14.2
(12395) [Wed May  8 16:36:18 2013] MCP DBI version: 1.622  DBD::Pg 
version: 2.19.2 (21902) DBIx::Safe version: 1.2.5
(12395) [Wed May  8 16:36:18 2013] MCP Bucardo object:
  batch               => '0'
  bcverbose           => '1'
  created             => 'Wed May  8 16:36:18 2013'
  dbdpgversion        => '21902'
  dbhlist             => 'HASH(0x1adaf48)'
  dbhost              => '127.0.0.1'
  dbname              => 'bucardo'
  dbpass              => '<not shown>'
  dbport              => '5432'
  dbuser              => 'bucardo'
  dryrun              => '0'
  exit_on_nosync      => '0'
  extraname           => ''
  listening           => 'HASH(0x1eedf98)'
  logclean            => '0'
  logcodes            => 'ARRAY(0x19eb378)'
  logdest             => 'ARRAY(0x1630218)'
  logextension        => ''
  logpid              => '12395'
  logprefix           => 'MCP'
  logseparate         => '0'
  masterdbh           => 'DBI::db=HASH(0x1c633d0)'
  mcp_backend         => '12396'
  mcp_clock_timestamp => 'clock_timestamp()'
  mcppid              => '12381'
  pidfile             => '/var/run/bucardo/bucardo.mcp.pid'
  pidmap              => 'HASH(0x17b4378)'
  sendmail            => '0'
  sendmail_file       => ''
  sqlprefix           => '/* Bucardo 4.99.7 */'
  stopfile            => '/var/run/bucardo/fullstopbucardo'
  verbose             => '1'
  version             => '4.99.7'
  warning_file        => ''
(12395) [Wed May  8 16:36:18 2013] MCP Bucardo config:
  autosync_ddl              => 'newcol'
  bucardo_current_version   => '4.99.7'
  bucardo_vac               => '1'
  bucardo_version           => '4.99.6'
  ctl_checkonkids_time      => '10'
  ctl_createkid_time        => '0.5'
  ctl_sleep                 => '0.2'
  default_conflict_strategy => 'bucardo_latest'
  default_email_from        => 'root'
  default_email_host        => 'localhost'
  default_email_to          => 'root'
  email_debug_file          => ''
  endsync_sleep             => '1.0'
  flatfile_dir              => '.'
  host_safety_check         => ''
  kid_deadlock_sleep        => '0.5'
  kid_nodeltarows_sleep     => '0.5'
  kid_pingtime              => '60'
  kid_restart_sleep         => '1'
  kid_serial_sleep          => '0.5'
  kid_sleep                 => '0.5'
  log_conflict_file         => '/var/log/bucardo/conflict.log'
  log_level                 => 'normal'
  log_level_number          => '2'
  log_microsecond           => '0'
  log_showlevel             => '0'
  log_showline              => '0'
  log_showpid               => '1'
  log_showtime              => '3'
  mcp_dbproblem_sleep       => '15'
  mcp_loop_sleep            => '0.2'
  mcp_pingtime              => '60'
  mcp_vactime               => '60'
  piddir                    => '/var/run/bucardo'
  reason_file               => '/var/log/bucardo/restart.reason'
  semaphore_table           => 'bucardo_status'
  statement_chunk_size      => '10000'
  stats_script_url          => 'http://www.bucardo.org/'
  stopfile                  => 'fullstopbucardo'
  syslog_facility           => 'log_local1'
  tcp_keepalives_count      => '0'
  tcp_keepalives_idle       => '0'
  tcp_keepalives_interval   => '0'
  vac_run                   => '30'
  vac_sleep                 => '120'
  warning_file              => ''
(12395) [Wed May  8 16:36:18 2013] MCP Activating sync "testsync"
(12395) [Wed May  8 16:36:18 2013] MCP Running validate_sync on "testsync"
(12395) [Wed May  8 16:36:18 2013] MCP Connecting to database "testa" 
(source)
(12395) [Wed May  8 16:36:18 2013] MCP DB "testa" Local epoch: 
1368027378.91922  DB epoch: 1368027378.91924
(12395) [Wed May  8 16:36:18 2013] MCP DB "testa" Local time: Wed May  8 
16:36:18 2013  DB time: 2013-05-08 16:36:18.919237+01
(12395) [Wed May  8 16:36:18 2013] MCP DB "testa" Local timezone: BST 
(+0100)  DB timezone: GB
(12395) [Wed May  8 16:36:18 2013] MCP DB "testa" Postgres version: 90109
(12395) [Wed May  8 16:36:18 2013] MCP DB "testa" Database port: 5432
(12395) [Wed May  8 16:36:18 2013] MCP Connecting to database 
"testa_remote" (source)
(12395) [Wed May  8 16:36:18 2013] MCP DB "testa_remote" Local epoch: 
1368027378.94052  DB epoch: 1368027378.62312
(12395) [Wed May  8 16:36:18 2013] MCP DB "testa_remote" Local time: Wed 
May  8 16:36:18 2013  DB time: 2013-05-08 16:36:18.623115+01
(12395) [Wed May  8 16:36:18 2013] MCP DB "testa_remote" Local timezone: 
BST (+0100)  DB timezone: GB
(12395) [Wed May  8 16:36:18 2013] MCP DB "testa_remote" Postgres 
version: 90109
(12395) [Wed May  8 16:36:18 2013] MCP DB "testa_remote" Database port: 5432
(12395) [Wed May  8 16:36:18 2013] MCP Connecting to database "testb" 
(source)
(12395) [Wed May  8 16:36:18 2013] MCP DB "testb" Local epoch: 
1368027378.94788  DB epoch: 1368027378.94789
(12395) [Wed May  8 16:36:18 2013] MCP DB "testb" Local time: Wed May  8 
16:36:18 2013  DB time: 2013-05-08 16:36:18.94789+01
(12395) [Wed May  8 16:36:18 2013] MCP DB "testb" Local timezone: BST 
(+0100)  DB timezone: GB
(12395) [Wed May  8 16:36:18 2013] MCP DB "testb" Postgres version: 90109
(12395) [Wed May  8 16:36:18 2013] MCP DB "testb" Database port: 5432
(12395) [Wed May  8 16:36:18 2013] MCP Connecting to database "testc" 
(source)
(12395) [Wed May  8 16:36:18 2013] MCP DB "testc" Local epoch: 
1368027378.95366  DB epoch: 1368027378.95371
(12395) [Wed May  8 16:36:18 2013] MCP DB "testc" Local time: Wed May  8 
16:36:18 2013  DB time: 2013-05-08 16:36:18.953713+01
(12395) [Wed May  8 16:36:18 2013] MCP DB "testc" Local timezone: BST 
(+0100)  DB timezone: GB
(12395) [Wed May  8 16:36:18 2013] MCP DB "testc" Postgres version: 90109
(12395) [Wed May  8 16:36:18 2013] MCP DB "testc" Database port: 5432
(12395) [Wed May  8 16:36:18 2013] MCP Connecting to database "testd" 
(source)
(12395) [Wed May  8 16:36:18 2013] MCP DB "testd" Local epoch: 
1368027378.96011  DB epoch: 1368027378.96012
(12395) [Wed May  8 16:36:18 2013] MCP DB "testd" Local time: Wed May  8 
16:36:18 2013  DB time: 2013-05-08 16:36:18.960125+01
(12395) [Wed May  8 16:36:18 2013] MCP DB "testd" Local timezone: BST 
(+0100)  DB timezone: GB
(12395) [Wed May  8 16:36:18 2013] MCP DB "testd" Postgres version: 90109
(12395) [Wed May  8 16:36:18 2013] MCP DB "testd" Database port: 5432
(12395) [Wed May  8 16:36:19 2013] MCP   Inspecting source table 
"public.test" on database "testa"
(12395) [Wed May  8 16:36:19 2013] MCP Active syncs: 1
(12395) [Wed May  8 16:36:19 2013] MCP Entering main loop
(12405) [Wed May  8 16:36:19 2013] VAC New VAC daemon. PID=12405
(12405) [Wed May  8 16:36:19 2013] VAC Connected to database "testd" 
with backend PID of 12407
(12405) [Wed May  8 16:36:19 2013] VAC Connected to database "testb" 
with backend PID of 12408
(12405) [Wed May  8 16:36:19 2013] VAC Connected to database "testa" 
with backend PID of 12409
(12405) [Wed May  8 16:36:19 2013] VAC Connected to database "testc" 
with backend PID of 12410
(12405) [Wed May  8 16:36:19 2013] VAC Connected to database 
"testa_remote" with backend PID of 6085
NOTICE:  Rows deleted from delta_public_test: 3 Rows deleted from 
track_public_test: 3
(12395) [Wed May  8 16:36:19 2013] MCP Created VAC 12405
NOTICE:  Rows deleted from delta_public_test: 0 Rows deleted from 
track_public_test: 0
NOTICE:  Rows deleted from delta_public_test: 0 Rows deleted from 
track_public_test: 0
NOTICE:  Rows deleted from delta_public_test: 0 Rows deleted from 
track_public_test: 0
NOTICE:  Rows deleted from delta_public_test: 0 Rows deleted from 
track_public_test: 0
(12411) [Wed May  8 16:36:19 2013] CTL New controller for sync 
"testsync". Relgroup is "testherd", dbs is "testgroup". PID=12411
(12411) [Wed May  8 16:36:19 2013] CTL   stayalive: 1 checksecs: 0 kicked: 1
(12411) [Wed May  8 16:36:19 2013] CTL   kidsalive: 1 onetimecopy: 0 
lifetimesecs: 0 (NULL) maxkicks: 0
(12411) [Wed May  8 16:36:19 2013] CTL Database "testa" backend PID: 12413
(12395) [Wed May  8 16:36:19 2013] MCP Created controller 12411 for sync 
"testsync". Kick is 1
(12411) [Wed May  8 16:36:19 2013] CTL Database "testa_remote" backend 
PID: 6086
(12411) [Wed May  8 16:36:19 2013] CTL Database "testb" backend PID: 12414
(12411) [Wed May  8 16:36:19 2013] CTL Database "testc" backend PID: 12415
(12411) [Wed May  8 16:36:19 2013] CTL Database "testd" backend PID: 12416
(12417) [Wed May  8 16:36:19 2013] KID New kid, sync "testsync" alive=1 
Parent=12411 PID=12417 kicked=1
(11974) [Wed May  8 16:36:19 2013] CTL Got a stop sync request, so exiting
(11974) [Wed May  8 16:36:19 2013] CTL Warning! Controller for 
"testsync" was killed at line 1779: Stop sync request at 
/usr/local/share/perl/5.14.2/Bucardo.pm line 1779.
(12417) [Wed May  8 16:36:19 2013] KID Got a stop sync request, so exiting
(12417) [Wed May  8 16:36:19 2013] KID Kid 12417 exiting at cleanup_kid. 
Sync "testsync" public.test Reason: Stop sync request at 
/usr/local/share/perl/5.14.2/Bucardo.pm line 2754. Line: 4762
(12411) [Wed May  8 16:36:19 2013] CTL Created new kid 12417 for sync 
"testsync"
(11974) [Wed May  8 16:36:19 2013] CTL Controller 11974 exiting at 
cleanup_controller. Reason: Stop sync request at 
/usr/local/share/perl/5.14.2/Bucardo.pm line 1779.
(12425) [Wed May  8 16:36:20 2013] KID New kid, sync "testsync" alive=1 
Parent=12411 PID=12425 kicked=1
(12411) [Wed May  8 16:36:20 2013] CTL Created new kid 12425 for sync 
"testsync"
(12425) [Wed May  8 16:36:20 2013] KID Delta count for 
testa.public.test        : 3
(12425) [Wed May  8 16:36:20 2013] KID Totals: deletes=0 inserts=12 
conflicts=0


Thanks for any help with this, I hope it is useful for your beta testing.

Regards,

Mike Tonks



More information about the Bucardo-general mailing list