[Bucardo-general] Soemthing weird....

Michelle Sullivan michelle at sorbs.net
Thu Jan 16 13:17:49 UTC 2014


This is weird (to me)...

So I have a stuck replication at the moment and debugging it I noticed this:

sorbs=# select * from process_list where current_query !=
'<IDLE>';                                                                                                                                                                                             
procpid |                                            current_query
---------+-----------------------------------------------------------------------------------------------------
   15626 | <IDLE> in transaction
   15628 | <IDLE> in transaction
   15629 | <IDLE> in transaction
   15633 |
         :                     SELECT  DISTINCT "evidid","hostsid"
         :                     FROM    bucardo.delta_public_hosts2evid d
         :                     WHERE   NOT EXISTS (
         :                                SELECT 1
         :                                FROM  
bucardo.track_public_hosts2evid t
         :                                WHERE  d.txntime = t.txntime
         :                                AND    (t.target = 'dbgroup
sorbsmasters'::text OR t.target ~ '^T:')
         :                             )
         :
   15645 | select * from process_list where current_query != '<IDLE>';
(5 rows)


So I should no problem - lets see how many rows..

sorbs=# explain analyse SELECT  DISTINCT "evidid","hostsid" FROM   
bucardo.delta_public_hosts2evid d WHERE   NOT EXISTS ( SELECT 1 FROM  
bucardo.track_public_hosts2evid t WHERE  d.txntime = t.txntime AND   
(t.target = 'dbgroup sorbsmasters'::text OR t.target ~ '^T:') );
                                                                      
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=1116651.22..1116651.23 rows=1 width=16) (actual
time=93606.995..120148.882 rows=28464412 loops=1)
   ->  Hash Anti Join  (cost=57.55..1116651.21 rows=1 width=16) (actual
time=0.076..59250.810 rows=28464412 loops=1)
         Hash Cond: (d.txntime = t.txntime)
         ->  Seq Scan on delta_public_hosts2evid d 
(cost=0.00..1009852.12 rows=28464412 width=24) (actual
time=0.017..20291.299 rows=28464412 loops=1)
         ->  Hash  (cost=57.40..57.40 rows=12 width=8) (actual
time=0.002..0.002 rows=0 loops=1)
               ->  Seq Scan on track_public_hosts2evid t 
(cost=0.00..57.40 rows=12 width=8) (actual time=0.001..0.001 rows=0 loops=1)
                     Filter: ((target = 'dbgroup sorbsmasters'::text) OR
(target ~ '^T:'::text))
 Total runtime: 138771.432 ms
(8 rows)

Then I wondered if it was because of the sequence scan (I have had
issues before where sequence scans were taking way too long compared to
index scans) so I disabled sequence scans and re-ran the query:

sorbs=# set enable_seqscan = 'false';
SET
sorbs=# explain analyse SELECT  DISTINCT "evidid","hostsid" FROM   
bucardo.delta_public_hosts2evid d WHERE   NOT EXISTS ( SELECT 1 FROM  
bucardo.track_public_hosts2evid t WHERE  d.txntime = t.txntime AND   
(t.target = 'dbgroup sorbsmasters'::text OR t.target ~ '^T:') );
                                                                                     
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=0.00..127077133.70 rows=1 width=16) (actual
time=0.127..151490.022 rows=28464412 loops=1)
   ->  Nested Loop Anti Join  (cost=0.00..127077133.69 rows=1 width=16)
(actual time=0.124..109351.834 rows=28464412 loops=1)
         ->  Index Scan using dex2_public_hosts2evid on
delta_public_hosts2evid d  (cost=0.00..1476733.17 rows=28464412
width=24) (actual time=0.102..25704.371 rows=28464412 loops=1)
         ->  Index Scan using dex3_public_hosts2evid on
track_public_hosts2evid t  (cost=0.00..4.40 rows=1 width=8) (actual
time=0.001..0.001 rows=0 loops=28464412)
               Index Cond: (d.txntime = t.txntime)
               Filter: ((t.target = 'dbgroup sorbsmasters'::text) OR
(t.target ~ '^T:'::text))
 Total runtime: 169875.139 ms
(7 rows)


... slightly longer, but no worries, means finally I have hardware that
is good for databases (I think)...

However at this point I noticed that the bucardo connection is *still*
running the same query...  Now I happen to note that most of the rows
are deletes and this is gigabit network...  any reason why the query
would be taking so long, or is it being done multiple times? (It could
of course be network congestion or the perl interpreter taking its time
to allocate memory for the results.)

FYI this is the summary of changes:

(1785) [Thu Jan 16 10:23:18 2014] KID (sorbsmm) Delta count for
sorbs_corkscrew.public.account2email   :      232
(1785) [Thu Jan 16 10:23:18 2014] KID (sorbsmm) Delta count for
sorbs_corkscrew.public.audit           :  1673269
(1785) [Thu Jan 16 10:23:18 2014] KID (sorbsmm) Delta count for
sorbs_galaxi.public.audit              : 28464411
(1785) [Thu Jan 16 10:23:18 2014] KID (sorbsmm) Delta count for
sorbs_corkscrew.public.authorisation   :      280
(1785) [Thu Jan 16 10:23:18 2014] KID (sorbsmm) Delta count for
sorbs_corkscrew.public.comments        :      400
(1785) [Thu Jan 16 10:23:18 2014] KID (sorbsmm) Delta count for
sorbs_corkscrew.public.emails          :      232
(1785) [Thu Jan 16 10:23:18 2014] KID (sorbsmm) Delta count for
sorbs_corkscrew.public.evid2comm       :     1073
(1785) [Thu Jan 16 10:23:18 2014] KID (sorbsmm) Delta count for
sorbs_corkscrew.public.evid2raw        :    81662
(1785) [Thu Jan 16 10:23:18 2014] KID (sorbsmm) Delta count for
sorbs_corkscrew.public.evid2rt         :     3833
(1785) [Thu Jan 16 10:23:18 2014] KID (sorbsmm) Delta count for
sorbs_corkscrew.public.evidence        :    81662
(1785) [Thu Jan 16 10:23:18 2014] KID (sorbsmm) Delta count for
sorbs_corkscrew.public.hosts           :     4261
(1785) [Thu Jan 16 10:23:18 2014] KID (sorbsmm) Delta count for
sorbs_galaxi.public.hosts              :        1
(1785) [Thu Jan 16 10:23:18 2014] KID (sorbsmm) Delta count for
sorbs_corkscrew.public.hosts2evid      :  1663496
(1785) [Thu Jan 16 10:23:18 2014] KID (sorbsmm) Delta count for
sorbs_galaxi.public.hosts2evid         : 28464412
(1785) [Thu Jan 16 10:23:18 2014] KID (sorbsmm) Delta count for
sorbs_corkscrew.public.hosts2uris      :   321396
(1785) [Thu Jan 16 10:23:18 2014] KID (sorbsmm) Delta count for
sorbs_galaxi.public.hosts2uris         : 11457200
(1785) [Thu Jan 16 10:23:18 2014] KID (sorbsmm) Delta count for
sorbs_corkscrew.public.htimestamps     :     7974
(1785) [Thu Jan 16 10:23:18 2014] KID (sorbsmm) Delta count for
sorbs_galaxi.public.htimestamps        :        6
(1785) [Thu Jan 16 10:23:18 2014] KID (sorbsmm) Delta count for
sorbs_corkscrew.public.latest_spam     :    17895
(1785) [Thu Jan 16 10:23:18 2014] KID (sorbsmm) Delta count for
sorbs_corkscrew.public.nets2evid       :    82594
(1785) [Thu Jan 16 10:23:18 2014] KID (sorbsmm) Delta count for
sorbs_corkscrew.public.networks        :     5769
(1785) [Thu Jan 16 10:23:18 2014] KID (sorbsmm) Delta count for
sorbs_corkscrew.public.rawevidence     :    81662
(1785) [Thu Jan 16 10:23:18 2014] KID (sorbsmm) Delta count for
sorbs_corkscrew.public.reportq         :    81678
(1785) [Thu Jan 16 10:23:18 2014] KID (sorbsmm) Delta count for
sorbs_corkscrew.public.timestamps      :    19693
(1785) [Thu Jan 16 10:23:18 2014] KID (sorbsmm) Delta count for
sorbs_corkscrew.public.useraccounts    :      343
(1785) [Thu Jan 16 10:23:18 2014] KID (sorbsmm) Delta count for
sorbs_corkscrew.public.userprefs       :        5

Most of the changes are due to row deletes.

The last thing I see which might be the biggest thing...

After waiting for one of the hosts to get the changes (thinking it was
the host lagging) I see another host getting the same changes - this
sequential delete *IS* causing massive delays, as the deletes take
around 2-3 hours on each server... surely they should be all at the same
time ... or is this a duplicate transaction?

-- 
Michelle Sullivan
http://www.mhix.org/



More information about the Bucardo-general mailing list