[Bucardo-general] CTL Creating Too Many syncrun Rows

Thu Jul 25 12:47:49 UTC 2013

Fellow Bucardoans,

We rolled out a bunch of new databases on Bucardo this week, and in the process I noticed something unusual. Have a look at this status:

$ bucardo status
PID of Bucardo MCP: 11016
 Name        State                  Last good               Time           Last I/D/C    Last bad                Time        
===========+======================+=======================+==============+=============+=======================+=============
 foo_sync  | Good                 | Jul 22, 2013 23:33:13 | 60h 51m 28s  | 0/1         | none                  |             
 bar_sync  | Finished (KID 11063) | 06:20:09              | 6h 4m 32s    | 1/4         | Jul 24, 2013 16:09:06 | 20h 15m 35s 
 yo_sync   | Bad                  | Jul 24, 2013 19:37:57 | 16h 46m 44s  | 2/2         | Jul 24, 2013 19:39:14 | 16h 45m 27s 

The question is, what’s with that “Finished” state for bar_sync? If a sync is finished, should it not be “Good”? The status command gets this data from the syncrun table, so let’s have a look there:

bucardo=# select started, ended, lastgood, lastbad, lastempty, deletes, inserts, details from syncrun where sync = 'bar_sync' and (lastgood or lastbad or lastempty or ended is null) order by started;
            started            |             ended             | lastgood | lastbad | lastempty | deletes | inserts | details 
-------------------------------+-------------------------------+----------+---------+-----------+---------+---------+---------
 2013-07-24 15:59:29.089763+00 | 2013-07-24 16:09:06.351826+00 | f        | t       | f         |       1 |       2 | 
 2013-07-24 19:17:38.356111+00 | 2013-07-24 19:17:38.643608+00 | f        | f       | t         |       0 |       0 | 
 2013-07-24 23:10:18.340036+00 |                               | f        | f       | f         |      24 |      37 | 
 2013-07-24 23:10:25.330737+00 |                               | f        | f       | f         |      22 |      34 | 
 2013-07-24 23:26:20.990398+00 |                               | f        | f       | f         |      15 |      24 | 
 2013-07-24 23:31:12.116793+00 |                               | f        | f       | f         |      13 |      21 | 
 2013-07-25 00:08:19.921732+00 |                               | f        | f       | f         |       9 |      15 | 
 2013-07-25 00:14:13.30168+00  |                               | f        | f       | f         |       2 |       5 | 
 2013-07-25 06:20:09.493359+00 | 2013-07-25 06:20:09.741927+00 | t        | f       | f         |       1 |       4 | 

It is those rows with no "ended" value that are confusing the `status` command. I *think* what’s happening is this:

* A bunch of tables are updated in a single transaction in the
  bar database, some by triggers
* Each has autokick turned on, so each sends a NOTIFY
* At transaction commit, Postgres sends multiple NOTIFYs
* MCP Gets them all and duly NOTIFY's the bar_sync CTL.
* The bar_sync CTL gets them all, and, for each, duly creates
  a syncrun for each and sends a NOTIFY to the bar_sync KID
* The KID gets the first NOTIFY and starts work
* When the KID finishes its work, the CTL has by now created
  multiple syncrun records
* So the KID gets confused, and leaves errors in the log such
  as this:

        (11068) [Wed Jul 24 17:08:33 2013] KID Expected one row from end_syncrun, but got 8

The problem is that CTL creates the syncrun, then tells KID to run. But it does not tell it *which* syncrun record to update. The KID is understandably confused (in end_syncrun()) when there seem to be multiple syncs going at once.

There are a few ways we might address this problem:

1. Add a SERIAL primary key to syncrun, and have the CTL pass it to the KID. Then the KID knows what syncrun to update. Cons: Need new PK column on syncrun, and the upgrade script will need to backfill it with values; might lead to many unnecessary syncs (if a txn sends 4 kicks, only one needs to run)

2. Change CTL to wait for an existing sync to finish, then call again only once for any more received in the interrim. Con: the CTL could get hung by a long-running sync.

3. Add the transaction ID to the kick payload and add pid and txnid columns to syncrun. Then change CTL so that it does not add a syncrun row if there is already one for that server PID and transaction ID. Cons: Need new syncrun columns; more data to parse from the payload; still might need to push the PID and txnid to KID, since other syncruns can be added at the same time by other processes.

I think #3 is probably the best approach, though maybe there is already some code somewhere to make sure that only one sync runs at any one time, and it is not currently working properly. If so, someone please point me at it!

Comments?

Thanks,

David