[Bucardo-general] KID dies with 'could not serialize access due to concurrent update' (prevents replication)
bill at bfccomputing.com
Wed Dec 7 07:06:11 UTC 2011
I'm working to track down the cause of (and get restarted!), a sync that looks to have just stopped working yesterday in the middle of the day. The last good log entry is a tiny update and then nothing has been replicated on one of the syncs since then (another sync is operating normally). There are two postgresql-8.1.23 machines running swap (multimaster) currently running bucardo 4.4.8.
I did the usual debugging of stopping/starting bucardo, validating the syncs, even restarting postgresql to be certain. System logs, disk, etc. all look good.
When bucardo starts up, it appears to be doing quite a bit of work and the logs fill with tons of activity. I do see some SQL errors logged and some constraint errors being bounced back but I see those kinds of messages in the logs while the thing has been working (quite well for over a year).
In the warnings log I'll see:
[Wed Dec 7 01:22:49 2011] CTL Warning! Kid 8295 seems to have died. Sync "ecommerce_left"
When I grep the full log, e.g. for 8295, I'll see a metric ton of work going on, then the kid will die, usually due to a 'could not serialize access due to concurrent update' error:
Dec 7 01:22:47 database3 Bucardo: KID [11412/22170] public.saved_cart UPDATE source to target pk 141183544|36648877
Dec 7 01:22:47 database3 Bucardo: KID Warning! Aborting due to exception for public.saved_cart.cart_id|item_line_id: 141183544|36648877 Error was DBD::Pg::st execute failed: ERROR: could not serialize access due to concurrent update at /usr/lib/perl5/site_perl/5.8.8/Bucardo.pm line 5747.
Dec 7 01:22:47 database3 Bucardo: KID Final database backend PID is 8877
Dec 7 01:22:47 database3 Bucardo: KID Kid exiting at cleanup_kid. Reason: Died at /usr/lib/perl5/site_perl/5.8.8/Bucardo.pm line 5818.
Dec 7 01:22:47 database3 Bucardo: KID Removed pid file "/var/run/bucardo/bucardo.kid.sync.ecommerce_left.ecommerce_right.pid"
Dec 7 01:22:49 database3 Bucardo: CTL Rows updated child 8295 to aborted in q: 1
Dec 7 01:22:49 database3 Bucardo: CTL Warning! Kid 8295 seems to have died. Sync "ecommerce_left"
(If I strace the target process on the other machine, I'll eventually see 'ROLLBACK' come across at the end.)
In the postgresql log on the other machine, I'll see, e.g.:
[2011-12-07 01:22:47.317 EST] 10.1.250.40:ecommerce ERROR: could not serialize access due to concurrent update
[2011-12-07 01:22:47.317 EST] 10.1.250.40:ecommerce STATEMENT: UPDATE public.saved_cart SET item_id=$1,qty=$2,save_date=$3 WHERE cart_id = $4 AND item_line_id = $5
It's not always the same statement. It's not always prepared statements (sometimes just a large DELETE) but the 'concurrent update' is consistent.
None of our code is setting transactions to SERIALIZABLE, and I've seen such warnings before in the bucardo log (which seem to have been transient - I see the code just sleeps and tries again). But in this case, it's halting the works, and my "KID Total delta count" is only growing. Is this bucardo stepping on itself? If so, is there a way to get it to be less parallel so that it can finish one transaction?
All suggestions for figuring out what's going on here much appreciated! Even if there's a way to just get past the immediate problem data, that would be a fine workaround for now. I'm reluctant to jump to 4.99 during the busy season here, but that's probably better than not getting it working.
[root at database3 bucardo]# bucardo_ctl --verbose status ecommerce_left
Days back: 3 User: bucardo Database: bucardo
Sync name: ecommerce_left
Current state: WAIT:38s (PID = 30454)
Source herd/database: ecommerce_herd / ecommerce_left
Target database: ecommerce_right
Tables in sync: 39
Last good: 12h 45m 31s (time to run: 1s)
Last good time: Dec 06, 2011 12:20:54 Target: ecommerce_right
Ins/Upd/Del: 18 / 8 / 4
Last bad: 26m 46s (time to run: 2m 26s)
Last bad time: Dec 07, 2011 00:39:39 Target: ecommerce_right
Latest bad reason: ?
PID file: /var/run/bucardo/bucardo.ctl.sync.ecommerce_left.pid
PID file created: Wed Dec 7 00:05:49 2011
Overdue time: 00:00:00
Expired time: 00:00:00
Stayalive: yes Kidsalive: yes
Rebuild index: 0 Do_listen: no
Ping: yes Makedelta: no
Bill McGonigle, Owner
BFC Computing, LLC
Email, IM, VOIP: bill at bfccomputing.com
Social networks: bill_mcgonigle/bill.mcgonigle
More information about the Bucardo-general