[Bucardo-general] Locking issues migrating from PG9.6 to PG11 in AWS Aurora

Sat Feb 8 02:27:46 UTC 2020

> On Feb 7, 2020, at 3:34 PM, Alfonso Perez <fon at revenuecat.com> wrote:
> 
> Thanks for the reply David,
> 
> Yes, all the triggers were added correctly.
> 
> After adding the sync, we ran bucardo start and everything was looking
> good; some smaller tables got copied over.
> 
> But after about an hour, the database became unreachable.
> 
>> I’d look at locks initially.
> 
> The thing is, we hadn't had any problems with locks until we added
> all the bucardo triggers.
> It’s worth saying that our service has very high traffic, and the dry
> run (without traffic) took ~20 hours for the initial copy of all the
> tables.

If you have a ton of writes, the triggers will amplify them, since each trigger has to note every row that has changed. With Bucardo this is unfortunately unavoidable, but it shouldn’t contribute to locking issues per se. Do you have other triggers on these tables besides Bucardo’s?
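
One way to check both points (who is blocking whom, and which triggers exist besides Bucardo's) is to query the system catalogs on the source while the problem is occurring. The following is only a sketch under stated assumptions: SOURCE_DSN is a hypothetical connection string for the 9.6 cluster, and pg_blocking_pids() (part of PostgreSQL since 9.6) is assumed to be available on the Aurora build in use.

```shell
#!/usr/bin/env bash
# Sketch: show blocked sessions with their blockers, then list user-defined triggers.
# SOURCE_DSN is a hypothetical connection string; nothing runs unless it is set.
set -euo pipefail

# Sessions waiting on a lock, paired with the sessions that block them.
# pg_blocking_pids() exists in PostgreSQL 9.6 and later.
BLOCKED_SQL="
SELECT w.pid                 AS waiting_pid,
       now() - w.query_start AS waiting_for,
       left(w.query, 60)     AS waiting_query,
       b.pid                 AS blocking_pid,
       left(b.query, 60)     AS blocking_query
FROM pg_stat_activity AS w
JOIN pg_stat_activity AS b
  ON b.pid = ANY (pg_blocking_pids(w.pid))
ORDER BY waiting_for DESC;"

# Non-internal triggers per table, to spot anything besides Bucardo's own.
TRIGGER_SQL="
SELECT tgrelid::regclass AS table_name,
       tgname            AS trigger_name
FROM pg_trigger
WHERE NOT tgisinternal
ORDER BY 1, 2;"

if [ -n "${SOURCE_DSN:-}" ]; then
  psql -X "$SOURCE_DSN" -c "$BLOCKED_SQL"
  psql -X "$SOURCE_DSN" -c "$TRIGGER_SQL"
else
  echo "SOURCE_DSN is not set; no queries were run." >&2
fi
```

If the first query keeps returning the same blocking_pid, that session's query is the place to start looking.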

>> you could add the sync without the onetimecopy option and do the initial data transfer out-of-band with pg_dump.
> 
> When this issue happened, we stopped bucardo and the COPY got killed,
> but the issue didn't resolve until we dropped the bucardo schema.
> Could the triggers themselves, rather than the copy, be causing this
> in some way?
> 
> Do you think having autokick=ON could have been the reason?

Depending on your replication tolerances, you could turn off autokick and move to a timed sync, say every minute or so. But without seeing the cluster it’s hard to say for sure.
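
As a rough sketch of that change, the sync's autokick and checktime settings can be updated from the command line. The sync name comes from the original post; the one-minute interval is only an illustrative value, and the exact parameter behaviour should be checked against the documentation for the installed Bucardo version.

```shell
#!/usr/bin/env bash
# Sketch: switch a sync from trigger-driven kicks to a timed check.
# SYNC_NAME is taken from the original post; CHECK_INTERVAL is an illustrative value.
SYNC_NAME="pg11migration"
CHECK_INTERVAL="1 minute"

if command -v bucardo >/dev/null 2>&1; then
  # autokick=0 stops row changes from kicking the sync;
  # checktime sets the longest the sync waits before it is kicked anyway.
  bucardo update sync "$SYNC_NAME" autokick=0 checktime="$CHECK_INTERVAL"
else
  echo "bucardo not found in PATH; nothing was changed." >&2
fi
```

Bucardo may need to be restarted before the new settings take effect, and `bucardo kick pg11migration` can still be used to run the sync on demand.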

> Thank you for your help!
> 
> Best
> 
> 
> 
>> On Fri, Feb 7, 2020 at 12:24 PM David Christensen <david at endpoint.com> wrote:
>> 
>> 
>>>> On Feb 7, 2020, at 12:39 PM, Alfonso Perez <fon at revenuecat.com> wrote:
>>> 
>>> Hi list,
>>> 
>>> We have attempted to use Bucardo to migrate from PG9.6 to PG11 in AWS
>>> Aurora[1], but unfortunately, after about an hour of running sync,
>>> locking queries start piling up like crazy until the DB becomes
>>> unreachable, a failover is triggered, and we are able to drop the
>>> bucardo schema, resolving the locking, which gets the DB healthy
>>> again.
>> 
>> Hi Alfonso,
>> 
>> So to confirm, the sync got added successfully, it added all of the triggers, etc., and this is happening in the initial copy phase?
>> 
>>> Any tips on where to start trying to figure out the cause?
>>> Unfortunately, I do not have available backups of the database when
>>> the schema was created, and bucardo logs don't show anything
>>> meaningful.
>> 
>> I’d look at locks initially. Broadly speaking, as a workaround you could add the sync without the onetimecopy option and do the initial data transfer out-of-band with pg_dump.
>> 
>>> [1] bucardo add sync pg11migration relgroup=alpha
>>> dbs=aurora9:source,aurora11:target onetimecopy=2
>>> conflict_strategy=bucardo_source
>> 
>> BTW, you will not need to specify a conflict strategy if there is only a single source.
>> 
>> Best,
>> 
>> David
>> --
>> David Christensen
>> Senior Software and Database Engineer
>> End Point Corporation
>> david at endpoint.com
>> 785-727-1171
>>