[Bucardo-general] If a db is unreachable then nothing works!

Sun Apr 29 18:45:52 UTC 2012

On Wed, Apr 18, 2012 at 09:00:44PM +0200, Rainer Brestan wrote:
...
> Consider the following setup.
> Sync set S1 has master DB A and slave DB B.
> Sync set S2 has master DB A and slave DB C.
> Whenever DB C fails, every sync set stops working, even if
> all DBs for that sync set are online. There is no automatic retry.
> The data to sync is not lost (it goes to the delta table on DB A),
> so an automatic retry would transfer the remaining data as soon as
> DB C comes back online.

Okay.

> Therefore, I have modified Bucardo.pm to support a retry
> mechanism for each individual sync set.
>
> Basically, it does the following things.
> - The function connect_database does not terminate when a database is unavailable; it reports the database as "offline".
> - The MCP regularly checks only the main database, not the source and target databases.

We've had problems doing it like that in the past. We have the MCP check 
for two reasons:

1) We know sooner rather than later that something is dead
2) The MCP won't keep trying to resurrect failed CTLs

> - CTL removes each "offline" database from its list (but only in memory, not in the configuration).
> - If CTL has no source database, it terminates.
> - If CTL has no target database left, it terminates.
> - If KID has either no source or target, it terminates.

So that partially solves #2 above, but still has the issue of the MCP 
constantly restarting the CTL.
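
As a rough illustration of the in-memory pruning described in the quoted list, here is a minimal Python sketch. It is not Bucardo code: SyncPlan and prune_offline are hypothetical names, and returning None merely stands in for "the CTL terminates".

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class SyncPlan:
    """Hypothetical in-memory view of one sync set held by a controller."""
    name: str
    sources: List[str]
    targets: List[str]


def prune_offline(plan: SyncPlan, offline: Set[str]) -> Optional[SyncPlan]:
    """Drop offline databases from a copy of the plan; the stored plan is untouched.

    Returns None when no source or no target is left, which stands in
    for "the controller terminates" in the description above.
    """
    sources = [db for db in plan.sources if db not in offline]
    targets = [db for db in plan.targets if db not in offline]
    if not sources or not targets:
        return None
    return SyncPlan(plan.name, sources, targets)


# The setup from the start of the thread: S1 is A -> B, S2 is A -> C, and C is down.
s1 = SyncPlan("S1", ["A"], ["B"])
s2 = SyncPlan("S2", ["A"], ["C"])
offline = {"C"}

s1_live = prune_offline(s1, offline)  # S1 keeps running
s2_live = prune_offline(s2, offline)  # S2 has no target left
```

In this sketch only S2 is affected by the loss of C. What decides when S2 is tried again is left out, which is the open question about the MCP restarting the CTL.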

> The MCP detects a dead CTL and tries to restart it; this CTL restart
> already existed in B4. The CTL detects a dead KID and restarts it;
> this also already existed in B4.
>
> This patch raised some other issues, which have been solved.
> - The KID restart happens as fast as possible, so if the KID keeps
> dying, restarting it consumes 100% of CPU. This was solved by using
> ctl_checkonkids_time as a delay for restarting KIDs.

Sounds good.
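
The restart delay can be pictured as a simple throttle. This is a Python sketch under assumptions, not the patch itself: RestartThrottle and min_interval are hypothetical names, with min_interval playing the role that ctl_checkonkids_time plays in the description above.

```python
import time


class RestartThrottle:
    """Allow a restart only if min_interval seconds have passed since the last one."""

    def __init__(self, min_interval, clock=time.monotonic):
        self.min_interval = min_interval  # stand-in for ctl_checkonkids_time
        self.clock = clock                # injectable so the logic can be tested
        self.last_restart = None

    def may_restart(self):
        now = self.clock()
        too_soon = (
            self.last_restart is not None
            and now - self.last_restart < self.min_interval
        )
        if too_soon:
            return False  # caller should wait instead of respawning in a tight loop
        self.last_restart = now
        return True


# Usage with a fake clock: the second attempt, 3 seconds later, is refused.
fake_now = [0.0]
throttle = RestartThrottle(10, clock=lambda: fake_now[0])
first = throttle.may_restart()
fake_now[0] = 3.0
second = throttle.may_restart()
fake_now[0] = 10.0
third = throttle.may_restart()
```

A refused attempt does not reset the timer, so a KID that keeps dying is restarted at most once per interval.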

> - CTL termination is not correct: it does not call
> cleanup_controller; it calls die instead.

Not sure why you would need this. die() invokes the $SIG{__DIE__} 
handler, which calls cleanup_controller.

> - The dead column in the attnum ordering was done on colinfo 
> instead of targetcolinfo.

Okay. Do you have a patch for the above? I think this is a great start 
toward the desired behavior. I will summarize that in a separate email.

> I have also added a new copy type (field was already present in 
> sync set table) named "insert". This copy type does not use the COPY 
> statement for data transfer, it uses INSERT. The reason is that some 
> middleware products for PostgreSQL have problems with COPY FROM and COPY TO.

Ouch, that's some crappy middleware. :)
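
For readers unfamiliar with the difference, a row-by-row INSERT transfer looks roughly like the Python sketch below. It is an illustration only: it uses the standard-library sqlite3 module as a stand-in for PostgreSQL, copy_rows_with_insert is a hypothetical helper, and none of it reflects how the patch implements the "insert" copy type.

```python
import sqlite3


def copy_rows_with_insert(src, dst, table, columns):
    """Copy all rows of `table` from src to dst using plain INSERT statements.

    Table and column names are interpolated directly, so they must come
    from trusted configuration; only the row values are bound as parameters.
    sqlite3 uses "?" placeholders; PostgreSQL drivers use a different style.
    """
    col_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    rows = src.execute(f"SELECT {col_list} FROM {table}").fetchall()
    dst.executemany(
        f"INSERT INTO {table} ({col_list}) VALUES ({placeholders})", rows
    )
    dst.commit()
    return len(rows)


# Two in-memory databases stand in for the source and target.
src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
for db in (src, dst):
    db.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
src.executemany("INSERT INTO t (id, name) VALUES (?, ?)", [(1, "a"), (2, "b")])

copied = copy_rows_with_insert(src, dst, "t", ["id", "name"])
result = dst.execute("SELECT id, name FROM t ORDER BY id").fetchall()
```

The trade-off is speed: bulk COPY is generally much faster than individual INSERTs, so this path only makes sense when something between Bucardo and the database cannot handle COPY.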

-- 
Greg Sabino Mullane greg at endpoint.com
End Point Corporation
PGP Key: 0x14964AC8