[Bucardo-general] If a db is unreachable then nothing works!
Greg Sabino Mullane
greg at endpoint.com
Sun Apr 29 18:45:52 UTC 2012
On Wed, Apr 18, 2012 at 09:00:44PM +0200, Rainer Brestan wrote:
...
> Consider following setup.
> Sync set S1 has master DB A and slave DB B.
> Sync set S2 has master DB A and slave DB C.
> Whenever DB C fails, every sync set stops working, even if
> all DBs for that sync set are online. There is no automatic retry.
> The data to sync is not lost (it goes to the delta table on DB A),
> so an automatic retry will transfer the remaining data as soon as DB C
> comes back online.
Okay.
> Therefore, I have modified Bucardo.pm to support a retry
> mechanism for each individual sync set.
>
> Basically it does the following things.
> - The function connect_database no longer terminates when a database is unavailable; it reports the database as "offline".
> - MCP regularly checks only the main database, but not source and target.
We've had problems doing it like that in the past. We have the MCP check
for two reasons:
1) We know sooner rather than later that something is dead
2) The MCP won't keep trying to resurrect failed CTLs
> - CTL removes each "offline" database from its list (but only in memory, not in the configuration).
> - If CTL has no source database, it terminates.
> - If CTL has no target database left, it terminates.
> - If KID has either no source or target, it terminates.
So that partially solves #2 above, but still has the issue of the MCP
constantly restarting the CTL.
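
To make the quoted rules concrete, here is a minimal sketch of the pruning and termination logic in Python rather than Bucardo's Perl. It is not the patch itself: the `Db` class, the `status` field, and both function names are hypothetical, invented only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Db:
    name: str
    role: str    # "source" or "target" (hypothetical field)
    status: str  # "online" or "offline", as a connect step might report it


def prune_offline(dbs):
    """Return only the reachable databases.

    This builds a new in-memory list; the input list (standing in for the
    stored configuration) is left untouched.
    """
    return [db for db in dbs if db.status != "offline"]


def controller_should_run(dbs):
    """A controller keeps running only if it still has a source AND a target."""
    live = prune_offline(dbs)
    has_source = any(db.role == "source" for db in live)
    has_target = any(db.role == "target" for db in live)
    return has_source and has_target
```

Applied to the setup at the top of the thread, the controller for S1 (A and B online) would keep running, while the one for S2 (C offline) would terminate and remain a candidate for restart.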
> The MCP detects a dead CTL and tries to restart it; this CTL restart
> already existed in B4. Likewise, the CTL detects a dead KID
> and restarts it, which also already existed in B4.
>
> This patch brought up some other issues, which have been solved.
> - The KID restart happens as fast as possible, so when the KID keeps
> dying, restarting it consumes 100% of CPU. This was solved by using
> ctl_checkonkids_time as a delay for restarting KIDs.
Sounds good.
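
A rough sketch of the throttling idea, again in Python and with hypothetical names (`supervise`, `start_kid`, `kid_is_alive`): the controller sleeps for a fixed interval, playing the role of ctl_checkonkids_time, before each liveness check, so a KID that dies immediately cannot be restarted in a tight loop.

```python
import time


def supervise(start_kid, kid_is_alive, check_interval, rounds, sleep=time.sleep):
    """Check on the KID once per interval and restart it if it has died.

    check_interval stands in for ctl_checkonkids_time (seconds).
    rounds bounds the loop so the sketch terminates; a real controller
    would loop until told to stop.
    Returns the number of restarts performed.
    """
    restarts = 0
    start_kid()
    for _ in range(rounds):
        sleep(check_interval)  # the delay that prevents a busy restart loop
        if not kid_is_alive():
            start_kid()
            restarts += 1
    return restarts
```

With this shape, a permanently failing KID costs at most one restart per interval instead of saturating a CPU.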
> - CTL termination is not correct: it skips the cleanup_controller call
> and calls die instead.
Not sure why you would need this. die() invokes the $SIG{__DIE__} handler,
which calls cleanup_controller.
> - The dead column in the attnum ordering was done on colinfo
> instead of targetcolinfo.
Okay. Do you have a patch for the above? I think this is a great start
towards a desired behavior. I will summarize that in a separate email.
> I have also added a new copy type (field was already present in
> sync set table) named "insert". This copy type does not use the COPY
> statement for data transfer, it uses INSERT. The reason is that some
> middleware products for PostgreSQL have problems with COPY FROM and COPY TO.
Ouch, that's some crappy middleware. :)
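
For readers unfamiliar with the distinction: COPY streams rows in bulk, while an "insert" copy type would send ordinary parameterized INSERT statements, one per row. A minimal Python sketch of building such a statement follows. The function name and the `%s` placeholder style are assumptions made for illustration, not taken from the patch, and identifiers are assumed to be trusted (no quoting is done).

```python
def build_insert(table, columns, placeholder="%s"):
    """Build a parameterized INSERT statement for one row.

    Values are never interpolated here; they would be passed separately
    to the database driver along with this statement.
    """
    cols = ", ".join(columns)
    marks = ", ".join([placeholder] * len(columns))
    return f"INSERT INTO {table} ({cols}) VALUES ({marks})"
```

Row-by-row INSERT is generally slower than COPY for large batches, which is the trade-off for compatibility with middleware that mishandles COPY.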
--
Greg Sabino Mullane greg at endpoint.com
End Point Corporation
PGP Key: 0x14964AC8