[Bucardo-general] My locks overfloweth

Mitchell Perilstein Mitchell.Perilstein at trueposition.com
Fri Apr 12 20:15:49 UTC 2013


Has anyone seen this before?

We're running PostgreSQL 9.1.5 on Solaris 10, Perl 5.8.4, and Bucardo
4.99.7. We have two boxes in a master-master swap sync with
conflict_strategy=latest, covering around 39 tables across 3 databases,
roughly 50k rows in all. For convenience we've divided this into 3
dbgroups and 4 syncs. With some data in both boxes' databases, mostly
in sync, and our app layer turned off (no other db clients), we can
start Bucardo and end up with locks looking something like this:

    # psql -Upostgres -c "select pid, count(pid) from pg_locks
      group by pid order by count(pid)"
      pid  | count
    ------+-------
     7404 |     2
     6947 |    16
     6942 |    32
     6930 |    74
     6855 |    84
     6940 |  1678
    (6 rows)
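
In case it helps anyone reproduce, a breakdown of that heavy pid by
lock type and mode can be pulled with something like this (the pid is
from our box, of course):

    # psql -Upostgres -c "select locktype, mode, count(*) from pg_locks
      where pid = 6940 group by locktype, mode order by count(*) desc"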

That's over 1,600 locks just for Bucardo. Notably this happens on only
one box; the other Bucardo is using around 400 locks to perform the
same sync. Looking for the culprit queries, we find that pid 6940 is
mostly sitting idle, holding SIReadLocks on many different tables with
no current query:

    # psql -Upostgres -c "select database, relation, pid, mode, current_query
      from pg_locks join pg_stat_activity on (pid = procpid)"
     database | relation | pid  |    mode    | current_query
    ----------+----------+------+------------+---------------
        27512 |    28671 | 6940 | SIReadLock | <IDLE>
        27512 |    28524 | 6940 | SIReadLock | <IDLE>
        27512 |    28562 | 6940 | SIReadLock | <IDLE>
        27512 |    28671 | 6940 | SIReadLock | <IDLE>
        27512 |    28726 | 6940 | SIReadLock | <IDLE>
        27512 |    28634 | 6940 | SIReadLock | <IDLE>
    .... etc ....
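
As far as we understand it, SIReadLock entries are the predicate locks
taken under SERIALIZABLE isolation (the SSI machinery new in 9.1), and
they can legitimately outlive the transaction that took them; they are
only released once all overlapping serializable transactions finish.
That would at least explain seeing them held by an idle backend.
Checking the server-wide default is a one-liner:

    # psql -Upostgres -c "show default_transaction_isolation"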

Now, if I start our app, which begins doing writes on both boxes, we
see transient activity from our own queries and from Bucardo doing its
COPY ... FROM STDIN work, but those queries finish and go away. The
lock count then rises over several minutes to 20,000 or 30,000 on one
box, in the same pattern as above. Around that point we start to hit
this:

    (28706) [Fri Apr 12 18:28:54 2013] KID New kid, sync "o1_sync"
        alive=1 Parent=27035 PID=28706 kicked=1

    (28706) [Fri Apr 12 18:28:54 2013] KID DBD::Pg::db pg_result failed:
        ERROR:  out of shared memory
        HINT:  You might need to increase max_pred_locks_per_transaction.
        at /tpapp/tpdb/lib/perl5/Bucardo.pm line 3140. Line: 4801
        Main DB state: ? Error: none
        DB source_ossgw state: 53200 Error: 7
        DB target_ossgw state: ? Error: none

    (28706) [Fri Apr 12 18:28:54 2013] KID Kid 28706 exiting at
        cleanup_kid. Sync "o1_sync" public.commonobjects
        Reason: DBD::Pg::db pg_result failed: ERROR:  out of shared memory
        HINT:  You might need to increase max_pred_locks_per_transaction.
        at /tpapp/tpdb/lib/perl5/Bucardo.pm line 3140. Line: 4801
        Main DB state: ? Error: none
        DB source_ossgw state: 53200 Error: 7
        DB target_ossgw state: ? Error: none

and things start failing on random tables and queries everywhere, for
all clients. We've tried bumping the PostgreSQL lock settings, to no
avail. We'd like to understand the lock usage, which we assume is the
root cause here.
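
For reference, the bump we tried was along these lines in
postgresql.conf (the exact values here are illustrative, not a
recommendation; both settings only take effect after a restart):

    # postgresql.conf
    max_pred_locks_per_transaction = 256   # default is 64
    max_locks_per_transaction = 128        # default is 64

As we read the docs, the shared predicate-lock table holds roughly
max_pred_locks_per_transaction * (max_connections +
max_prepared_transactions) entries, so when single backends are
holding tens of thousands of SIReadLocks, raising the setting only
postpones the "out of shared memory" error.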

Any ideas appreciated.  Thanks!
