[Bucardo-general] Test Failures: Serialized Isolation

David E. Wheeler david at justatheory.com
Wed Oct 24 00:22:28 UTC 2012


On Oct 22, 2012, at 10:47 AM, David E. Wheeler <david at justatheory.com> wrote:

> Maybe it's related to the support for more than two masters in 5.0? In such a case, when you kick, can you predict which masters will sync to which others and in what order?

So I've been comparing the log output for test successes and test failures. They start failing irregularly once the pgtest5 sync is kicked. When that kick takes a *long* time to run (5s or so), I know it's going to fail. A successful kick look like this:

> (62939) [Tue Oct 23 16:53:54.578 2012] #5370 KID Got NOTICE run_pgtest5 from 62964 (line 2665)
> (62939) [Tue Oct 23 16:53:54.579 2012] #2667 KID $BUCARDO1 = {
>   'run_pgtest5' => {
>     'count' => 1,
>     'firstpid' => 62964,
>     'pid' => {
>       '62964' => 1
>     }
>   }
> };
> (62967) [Tue Oct 23 16:53:54.667 2012] #5370 KID Got NOTICE run_pgtest5 from 62964 (Bucardo DB) (line 2665)
> (62967) [Tue Oct 23 16:53:54.668 2012] #2667 KID $BUCARDO1 = {
>   'run_pgtest5' => {
>     'count' => 1,
>     'firstpid' => 62964,
>     'pid' => {
>       '62964' => 1
>     }
>   }
> };
> (62967) [Tue Oct 23 16:53:54.673 2012] #2822 KID Set database "A" to serializable read write
> (62967) [Tue Oct 23 16:53:54.674 2012] #2822 KID Set database "B" to serializable read write
> (62967) [Tue Oct 23 16:53:54.674 2012] #2932 KID Checking truncate_trigger table on database "A"
> (62967) [Tue Oct 23 16:53:54.676 2012] #2932 KID Checking truncate_trigger table on database "B"
> (62967) [Tue Oct 23 16:53:54.680 2012] #3005 KID Sequence public.bucardo_test_seq3 from db A is the highest
> (62967) [Tue Oct 23 16:53:54.682 2012] #3005 KID Sequence public.bucardo_test_seq1 from db A is the highest
> (62967) [Tue Oct 23 16:53:54.683 2012] #3005 KID Sequence public.bucardo_test_seq2 from db A is the highest
> (62967) [Tue Oct 23 16:53:54.706 2012] #3094 KID Delta count for A.public."bucardo space test" : 2
> (62967) [Tue Oct 23 16:53:54.706 2012] #3094 KID Delta count for B.public."bucardo space test" : 1
> (62967) [Tue Oct 23 16:53:54.707 2012] #3094 KID Delta count for A.public.bucardo_test1        : 2
> (62967) [Tue Oct 23 16:53:54.707 2012] #3094 KID Delta count for B.public.bucardo_test1        : 1
> (62967) [Tue Oct 23 16:53:54.707 2012] #3094 KID Delta count for A.public.bucardo_test10       : 2
> (62967) [Tue Oct 23 16:53:54.708 2012] #3094 KID Delta count for B.public.bucardo_test10       : 1

And lost more of that delta stuff. A failed kick looks like this:

> (65063) [Tue Oct 23 17:06:36.365 2012] #5371 KID Got NOTICE run_pgtest5 from 65084 (line 2665)
> (65063) [Tue Oct 23 17:06:36.365 2012] #5371 KID Got NOTICE run_samedb from 65055 (Bucardo DB) (line 2665)
> (65063) [Tue Oct 23 17:06:36.365 2012] #2667 KID $BUCARDO1 = {
>   'run_pgtest5' => {
>     'count' => 1,
>     'firstpid' => 65084,
>     'pid' => {
>       '65084' => 1
>     }
>   },
>   'run_samedb' => {
>     'count' => 1,
>     'firstpid' => 65055,
>     'pid' => {
>       '65055' => 1
>     }
>   }
> };
> (65063) [Tue Oct 23 17:06:36.370 2012] #2822 KID Set database "A" to serializable read write
> (65063) [Tue Oct 23 17:06:36.370 2012] #2822 KID Set database "A1" to serializable read write
> (65063) [Tue Oct 23 17:06:36.371 2012] #2932 KID Checking truncate_trigger table on database "A"
> (65063) [Tue Oct 23 17:06:36.372 2012] #3094 KID Delta count for A.public.bucardo_test1  : 1
> (65063) [Tue Oct 23 17:06:36.373 2012] #3105 KID Total delta count: 1
> (65063) [Tue Oct 23 17:06:36.374 2012] #3706 KID Rows to push from A.public.bucardo_test1: 1
> (65063) [Tue Oct 23 17:06:36.375 2012] #8323 KID Rows deleted from A1.bucardo_test1_copy: 0
> (65063) [Tue Oct 23 17:06:36.375 2012] #8523 KID Copying from A.public.bucardo_test1
> (65063) [Tue Oct 23 17:06:36.376 2012] #8657 KID Rows copied to A1.bucardo_test1_copy: 1
> (65063) [Tue Oct 23 17:06:36.376 2012] #3896 KID Totals: deletes=0 inserts=1 conflicts=0
> (65063) [Tue Oct 23 17:06:36.377 2012] #3971 KID Rows inserted to bucardo_track for A.public.bucardo_test1 : 1
> (65063) [Tue Oct 23 17:06:36.377 2012] #4372 KID Issuing final commit for all databases
> (65063) [Tue Oct 23 17:06:36.378 2012] #4389 KID All databases committed
> (65063) [Tue Oct 23 17:06:36.381 2012] #4439 KID Total time for sync "samedb" (1 rows): 0.01 seconds
> (65063) [Tue Oct 23 17:06:36.382 2012] #5266 KID Sending NOTIFY "ctl_syncdone_samedb" (line 4531)
> (65063) [Tue Oct 23 17:06:36.383 2012] #2667 KID $BUCARDO1 = {};
> (65053) [Tue Oct 23 17:06:36.502 2012] #5316 CTL 'bucardo_ctl', 65064, 'syncdone_samedb'
> (65053) [Tue Oct 23 17:06:36.503 2012] #5371 CTL Got NOTICE syncdone_samedb from 65064 (line 1698)
> (65053) [Tue Oct 23 17:06:36.503 2012] #1731 CTL Kid 65064 has reported that sync samedb is done
> (65053) [Tue Oct 23 17:06:36.504 2012] #5266 CTL Sending NOTIFY "syncdone_samedb" (line 1772)
> (65083) [Tue Oct 23 17:06:36.580 2012] #5316 CTL 'bucardo_ctl', 65064, 'syncdone_samedb'
> (65083) [Tue Oct 23 17:06:36.581 2012] #5371 CTL Got NOTICE syncdone_samedb from 65064 (line 1698)
> (65037) [Tue Oct 23 17:06:36.583 2012] #5316 MCP 'bucardo', 65055, 'syncdone_samedb'
> (65037) [Tue Oct 23 17:06:36.584 2012] #5371 MCP Got NOTICE syncdone_samedb from 65055 (line 831)
> (65037) [Tue Oct 23 17:06:36.584 2012] #0889 MCP Sync samedb has finished
> (65037) [Tue Oct 23 17:06:36.585 2012] #5266 MCP Sending NOTIFY "syncdone_samedb" (line 892)
> (65063) [Tue Oct 23 17:06:36.885 2012] #2667 KID $BUCARDO1 = {};
> (65063) [Tue Oct 23 17:06:37.387 2012] #2667 KID $BUCARDO1 = {};
> (65063) [Tue Oct 23 17:06:37.889 2012] #2667 KID $BUCARDO1 = {};
> (65063) [Tue Oct 23 17:06:38.391 2012] #2667 KID $BUCARDO1 = {};
> (65063) [Tue Oct 23 17:06:38.893 2012] #2667 KID $BUCARDO1 = {};
> (65063) [Tue Oct 23 17:06:39.393 2012] #2667 KID $BUCARDO1 = {};
> (65063) [Tue Oct 23 17:06:39.895 2012] #2667 KID $BUCARDO1 = {};
> (65063) [Tue Oct 23 17:06:40.397 2012] #2667 KID $BUCARDO1 = {};
> 

And lots more of those empty $BUCARDO1 lines (I put a Dumper call in for that bit). 

I think the key point here, though, is that in the first example, only the pgtest5 sync is sent. In the second, both the pgtest5 and the samedb syncs are sent to the same kid! However, the name of the sync is sent to the kid when it starts, so this is the line that triggers the run:

                elsif ($name eq "run_$syncname") {
                    $dorun = 1;
                }

When both syncs are set, what is the value of $syncname? It looks as though $syncname is set at KID start time. If so, we should either:

1. Fix the bug that sends the samedb sync to the wrong KID; or
2. Fix the bug where the KID thinks it manages only one sync

Does that sound right?

Thanks,

David



More information about the Bucardo-general mailing list