[Bucardo-general] Sync stops when one servers fails
johan at ndt.lu
Tue Oct 20 10:07:55 UTC 2015
I just noticed that when a server is down and I wait a little longer (about 5 minutes), Bucardo restarts and a new KID is started.
This KID then dies because the failed database server is still down.
If I bring the failed database server back to life and a new KID is started automatically, that KID stays alive and the sync continues as usual.
Can you confirm that this is correct, and explain how the KID is restarted?
I'm currently modifying the bucardo.pm file so that it restarts the sync without the missing database/server.
The failed server should then be brought back into the sync manually, so that whoever does this is responsible for making sure the database on that server matches the others.
So what I'm planning to do is:
- add an error-count column to the dbmap table in the main bucardo database.
- when an error occurs (a database is down), increment the error count for that database in dbmap.
- because the KIDs keep restarting and failing again, the error count keeps increasing.
- when the sync succeeds again, reset the error counts to zero.
- at a certain threshold (for example, an error count of 3), the Bucardo script deletes the failed database from dbmap.
- on the next KID restart, only the databases still in dbmap are synced, and the sync continues.
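As a rough illustration of the counter logic in the list above, here is a minimal Python sketch. It models dbmap as an in-memory dictionary and does not touch the Bucardo database or bucardo.pm. The class and method names (SyncMembers, record_failure, record_success) and the ERROR_THRESHOLD constant are made up for this example and are not part of Bucardo.

```python
# Hypothetical sketch only: none of these names exist in Bucardo.
from dataclasses import dataclass, field

ERROR_THRESHOLD = 3  # example threshold taken from the plan above


@dataclass
class SyncMembers:
    """In-memory stand-in for dbmap rows plus the proposed error-count column."""
    error_count: dict = field(default_factory=dict)  # db name -> failures since last good sync
    removed: list = field(default_factory=list)      # dbs dropped after hitting the threshold

    def add(self, db):
        """Add a database to the sync with a zeroed error count."""
        self.error_count[db] = 0

    def record_failure(self, db):
        """Count one failed KID run for `db`; return True if `db` was dropped."""
        if db not in self.error_count:
            return False
        self.error_count[db] += 1
        if self.error_count[db] >= ERROR_THRESHOLD:
            del self.error_count[db]
            self.removed.append(db)
            return True
        return False

    def record_success(self):
        """A sync run completed, so reset every remaining counter to zero."""
        for db in self.error_count:
            self.error_count[db] = 0

    def active(self):
        """Databases the next KID would still sync."""
        return sorted(self.error_count)


members = SyncMembers()
for name in ("db1", "db2", "db3"):
    members.add(name)

# db3 is down: three KID restarts fail in a row.
for _ in range(ERROR_THRESHOLD):
    members.record_failure("db3")

print(members.active())   # ['db1', 'db2']
print(members.removed)    # ['db3']
```

In a real change the counter would live in the database, so it survives Bucardo restarts; the dictionary here only shows the increment, reset and drop rules.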
Then I can fix the failed server (reinstall it, and so on).
After that, I stop Bucardo completely and restore a freshly generated database copy to the fixed server.
I add the missing server back to the dbmap table and run 'bucardo validate all'.
Then everything is OK again and we start Bucardo :)
I'm also planning to create a web-based control panel where I can easily modify parameters and watch syncs and their status.
From: Greg Sabino Mullane [mailto:greg at endpoint.com]
Sent: Monday, October 19, 2015 18:10
To: Johan Peeters <johan at ndt.lu>
Cc: bucardo-general at bucardo.org
Subject: Re: [Bucardo-general] Sync stops when one servers fails
On Fri, Oct 16, 2015 at 03:08:01PM +0000, Johan Peeters wrote:
> We have a setup of 3 DBs in a master-master environment.
> I think the scenario when a server fails should be like this:
> - the synchronization between the other running servers continues
> -- the failed server should be deactivated in the sync (and have to be
> activated manually later) OR
> -- Bucardo keeps trying to connect to the failed one (and when it's
> online, syncs the missing records)
> Is there a workaround?
No direct workaround, no. It is a known problem, with no easy solution, as keeping track of which rows have been replicated to which fraction of an existing sync is a tricky (but solvable) problem.
A future version will have a sync-level option that tells Bucardo to drive on, continuing to replicate to what it can, and then automatically catching the flaky servers back up once they come alive.
Greg Sabino Mullane greg at endpoint.com
End Point Corporation
PGP Key: 0x14964AC8