PITR
Enabling/disabling of the PITR slicing is done via the config value `pitr.enabled`:
$ pbm config --set=pitr.enabled=true
$ pbm config --set=pitr.enabled=false
Enabling PITR means that some agent on each replica set is going to save oplog chunks to the storage defined in the config. For details on the storage layout, see PITR: storage layout. Each chunk, with some exceptions described below, captures about a 10-minute span of oplog events. For the chunks to be usable for a restore, the sequence of chunks has to follow these restrictions (sketched after the list):
- The chain of chunks should form a continuous timeline, since any gap would violate the consistency of data changes.
- The chain of chunks should begin after some backup snapshot, since the oplog is, roughly speaking, a log of events applied to some data, and during the restore we need to recover this data first, before replaying the events and "moving" to the specified point in time.
- The chunks' timeline shouldn't overlap with any restore events, since a restore intrudes on the events timeline and makes it invalid: the chunks can't continue their lineage to the snapshot anymore, as the "snapshot data" was rewritten.
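A minimal sketch of how these restrictions could be validated, assuming simplified, hypothetical `Chunk`, `Snapshot`, and `Restore` types with plain integer timestamps (the real PBM metadata types and MongoDB timestamps differ):

```go
package main

import "fmt"

// Hypothetical, simplified metadata types; the real PBM structures differ.
type Chunk struct{ StartTS, EndTS int64 }
type Snapshot struct{ LastWriteTS int64 }
type Restore struct{ TS int64 }

// validChain reports whether a chunk chain (assumed sorted by StartTS)
// can be used for a restore: it must begin at the base snapshot, form a
// continuous timeline, and not be intersected by any restore event.
func validChain(snap Snapshot, chunks []Chunk, restores []Restore) error {
	if len(chunks) == 0 || chunks[0].StartTS > snap.LastWriteTS {
		return fmt.Errorf("chain doesn't begin at the base snapshot")
	}
	last := chunks[0].EndTS
	for _, c := range chunks[1:] {
		if c.StartTS > last { // a gap breaks the timeline
			return fmt.Errorf("gap between %d and %d", last, c.StartTS)
		}
		last = c.EndTS
	}
	for _, r := range restores {
		if r.TS > snap.LastWriteTS && r.TS <= last {
			return fmt.Errorf("restore at %d invalidates the chain", r.TS)
		}
	}
	return nil
}

func main() {
	snap := Snapshot{LastWriteTS: 100}
	chunks := []Chunk{{90, 160}, {160, 220}}
	fmt.Println(validChain(snap, chunks, nil)) // <nil>
}
```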
On startup, each agent spawns a background process that periodically (currently every 15 seconds) checks the `pitr.enabled` state. If it is "on", the agent checks whether anyone else in the replica set is already doing the job (i.e., there is a non-stale lock for the PITR operation). If no one is, the agent tries to acquire the lock for the "PITR" operation and starts slicing. If `pitr.enabled` is "off", the agent sends a cancellation signal to the slicing routine, if it runs one.
Starting the slicing, the agent first of all does a "catch-up": it defines the start point. The start point is set to the last backup's or the last PITR chunk's TS, whichever is more recent. The agent also checks that no restore has intercepted the timeline (hence, that there are no restores after the most recent backup).
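The catch-up could be expressed roughly as follows (hypothetical metadata lookups, timestamps simplified to `int64`):

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical metadata lookups; the real PBM queries its control collections.
func lastBackupTS() int64  { return 100 }
func lastChunkTS() int64   { return 220 }
func lastRestoreTS() int64 { return 0 } // 0 means "no restores"

// catchUp picks the slicing start point: the most recent of the last
// backup and the last PITR chunk, provided no restore happened after
// the most recent backup.
func catchUp() (int64, error) {
	if lastRestoreTS() > lastBackupTS() {
		return 0, errors.New("timeline intercepted by a restore; a new backup is needed")
	}
	start := lastBackupTS()
	if ts := lastChunkTS(); ts > start {
		start = ts
	}
	return start, nil
}

func main() {
	ts, err := catchUp()
	fmt.Println(ts, err) // 220 <nil>
}
```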
Next, the slicing routine is started. It runs an infinite loop in which each step blocks for 10 minutes, then gets the oplog slice from the last point until now, saves it to the storage, adds the chunk metadata to the `pbmPITRChunks` collection, and updates the "last point". There are two ways to wake up the slicing process before the 10-minute timeout. The first is to send a wake-up signal; the behaviour is the same as with a scheduled wake-up, it's just a way not to wait up to 10 minutes. The second is to send a cancellation signal (for example, when PITR was switched off in the config); in that case the routine does the same, captures and saves the oplog from the last point to now, but exits after finishing this cycle.
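A sketch of that loop, with the hypothetical `sliceAndSave` standing in for the capture-save-update-metadata step:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Hypothetical stand-in for capturing the oplog span (lastTS, now],
// saving it to the storage, and adding the chunk metadata.
func sliceAndSave(lastTS time.Time) time.Time {
	now := time.Now()
	fmt.Println("slice", lastTS.Unix(), "->", now.Unix())
	return now
}

// slicer sketches the slicing routine: wake up every 10 minutes (or
// earlier on a wake-up signal), save a slice, update the last point;
// on cancellation, save the final slice and exit.
func slicer(ctx context.Context, wake <-chan struct{}, lastTS time.Time) {
	tk := time.NewTicker(10 * time.Minute)
	defer tk.Stop()
	for {
		select {
		case <-tk.C: // scheduled wake-up
		case <-wake: // early wake-up: same behaviour, just no waiting
		case <-ctx.Done(): // cancellation: final slice, then exit
			sliceAndSave(lastTS)
			return
		}
		// the real routine would re-check here that it still owns the
		// PITR lock and that the node is still fit for backups
		lastTS = sliceAndSave(lastTS)
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	slicer(ctx, make(chan struct{}), time.Now())
}
```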
After the slicing routine wakes up, it checks whether the agent still owns the lock. If the agent, along with its Mongo node, was separated from the cluster for long enough, its lock becomes stale and some other node acquires a new lock and continues slicing. But if the separated node returns to the cluster, we don't want to have more than one node doing the same job.
On each cycle, before taking any action, the node also checks whether it is still fit for doing backups.
When starting a snapshot backup, we have to coordinate it with the PITR slicer so that the latter makes a slice up to the snapshot's start time, then "pauses" slicing and resumes it after the snapshot backup is done.
On each agent, the snapshot process does the following (sketched after this list):
- Sets a backup intent so the PITR routine won't try to proceed, i.e., won't try to acquire a lock.
- Waits for `pitrCheckPeriod * 1.1` to be sure the PITR routine has observed the state.
- Removes the PITR lock and wakes up the PITR process so it can finish its PITR work (nothing is done if the current agent runs no PITR process).
- Tries to acquire a backup lock and moves further with the backup.
- Regardless of any further plot development, before exit it clears the backup intent to unblock the PITR routine.
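A sketch of those steps; all the intent/lock helpers are hypothetical stand-ins for PBM's internals:

```go
package main

import (
	"fmt"
	"time"
)

const pitrCheckPeriod = 15 * time.Second

// Hypothetical intent/lock helpers standing in for PBM's internals.
func setBackupIntent()        { fmt.Println("intent set") }
func clearBackupIntent()      { fmt.Println("intent cleared") }
func removePITRLock()         {}
func wakeUpPITR()             {}
func acquireBackupLock() bool { return true }
func runBackup()              { fmt.Println("backing up...") }

// startSnapshot sketches the backup side of the coordination.
func startSnapshot() {
	setBackupIntent()
	defer clearBackupIntent() // unblock PITR no matter how the backup goes

	// give the PITR routine time to observe the intent
	time.Sleep(time.Duration(float64(pitrCheckPeriod) * 1.1))

	removePITRLock()
	wakeUpPITR() // let a local PITR process finish its work, if any

	if !acquireBackupLock() {
		return // someone else got the backup lock first
	}
	runBackup()
}

func main() { startSnapshot() }
```

The `defer` mirrors the last point of the list: the intent is cleared on any exit path.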
The PITR slicer routine, on receiving a wake-up or any other interrupt signal, in turn checks whether it still holds the lock. If not (remember, a snapshot process removes the PITR lock and acquires its own), then (see the sketch after this list):
- if there is another lock and it is a backup operation: wait for the backup to start, make the last slice up until the backup's StartTS, and return;
- if there is no other lock: we have to wait for the snapshot backup, see above (the snapshot cmd can delete the PITR lock but might not have acquired its own yet);
- if there is another lock and it is PITR: return; probably a split happened and a new worker was elected;
- any other case (including no lock) is undefined behaviour: return instantly.
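The decision could be sketched as follows, with a simplified, hypothetical lock descriptor (the waiting itself is omitted):

```go
package main

import "fmt"

// Hypothetical, simplified lock descriptor; real PBM locks carry more fields.
type lock struct {
	op string // "PITR", "backup", ...
}

// onLockLoss sketches the slicer's decision once it discovers it no longer
// holds the PITR lock; other is whatever lock it finds instead (nil if none).
// It returns whether to make a final slice up to the backup's StartTS.
func onLockLoss(other *lock) (finalSlice bool) {
	switch {
	case other == nil:
		// the snapshot cmd may have deleted our lock but not acquired
		// its own yet: wait for the backup to show up (waiting omitted)
		return true
	case other.op == "backup":
		// wait for the backup to start, then slice up to its StartTS
		return true
	case other.op == "PITR":
		// probably a split happened and a new worker was elected
		return false
	default:
		// undefined behaviour: return instantly
		return false
	}
}

func main() {
	fmt.Println(onLockLoss(&lock{op: "backup"})) // true
	fmt.Println(onLockLoss(nil))                 // true
}
```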
After the snapshot is finished, all backup intents are cleared along with the backup locks, so an arbitrary PITR process will be able to acquire the PITR lock and continue slicing from the snapshot's last captured event.
Basically, all agents in every replica set constantly check whether PITR is on and, if so, whether any agent is doing the job and its lock isn't stale. If no agent is currently in charge of the operation, or the lock is stale, some agent will (re)start slicing.
Since any restore drastically changes the timeline, there is no way to continue PITR slicing without having a snapshot taken after the restore.
In order to make any restore (point-in-time or snapshot), PITR should be turned off (`pbm config --set=pitr.enabled=false`). And to resume PITR after the restore, a snapshot should be made first (`pbm backup`). So we have the following sequence:
$ pbm config --set=pitr.enabled=false
$ pbm restore ...
$ pbm backup
$ pbm config --set=pitr.enabled=true
There is no "automatisation" on that matter so given operation should be done manually.
Deletion of any snapshot invalidates all PITR chunks that carry their lineage from it. So all such chunks will be deleted along with the snapshot, regardless of whether PITR is on or not.
`pbm delete-backup --older-than ...` does the following (sketched after the list):
- Delete all snapshots older than the given time.
- Find the oldest remaining snapshot.
- Delete all PITR chunks that were made before that snapshot.
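A simplified sketch of that selection, with `snapshot` and `chunk` reduced to bare timestamps:

```go
package main

import (
	"fmt"
	"sort"
)

type snapshot struct{ ts int64 } // simplified: just a timestamp
type chunk struct{ startTS int64 }

// deleteOlderThan sketches the --older-than logic: drop all snapshots
// older than t, then drop every PITR chunk made before the oldest
// remaining snapshot (such chunks have no base snapshot to restore from).
func deleteOlderThan(t int64, snaps []snapshot, chunks []chunk) ([]snapshot, []chunk) {
	var keptSnaps []snapshot
	for _, s := range snaps {
		if s.ts >= t {
			keptSnaps = append(keptSnaps, s)
		}
	}
	if len(keptSnaps) == 0 {
		return nil, nil // no snapshots left; every chunk lost its lineage
	}
	sort.Slice(keptSnaps, func(i, j int) bool { return keptSnaps[i].ts < keptSnaps[j].ts })
	oldest := keptSnaps[0].ts
	var keptChunks []chunk
	for _, c := range chunks {
		if c.startTS >= oldest {
			keptChunks = append(keptChunks, c)
		}
	}
	return keptSnaps, keptChunks
}

func main() {
	snaps, chunks := deleteOlderThan(100,
		[]snapshot{{50}, {120}, {200}},
		[]chunk{{60}, {130}, {210}})
	fmt.Println(snaps, chunks) // [{120} {200}] [{130} {210}]
}
```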
`pbm delete-backup <backup-name>` does the following (sketched after the list):
- Delete the given snapshot.
- Delete all PITR chunks that were made after that snapshot, up until the next snapshot.
If PITR is ON, the last (most recent) snapshot won't be deleted.
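And a sketch of the single-snapshot deletion; the "keep the most recent snapshot while PITR is on" guard is left out for brevity:

```go
package main

import (
	"fmt"
	"math"
)

type snapshot struct{ ts int64 } // simplified: just a timestamp
type chunk struct{ startTS int64 }

// deleteSnapshot sketches `pbm delete-backup <backup-name>`: remove
// snaps[i] and every PITR chunk made after it, up until the next
// snapshot (snaps are assumed sorted by ts).
func deleteSnapshot(i int, snaps []snapshot, chunks []chunk) ([]snapshot, []chunk) {
	from := snaps[i].ts
	until := int64(math.MaxInt64) // no next snapshot: delete to the end
	if i+1 < len(snaps) {
		until = snaps[i+1].ts
	}
	var keptChunks []chunk
	for _, c := range chunks {
		if c.startTS < from || c.startTS >= until {
			keptChunks = append(keptChunks, c)
		}
	}
	keptSnaps := append([]snapshot{}, snaps[:i]...)
	keptSnaps = append(keptSnaps, snaps[i+1:]...)
	return keptSnaps, keptChunks
}

func main() {
	snaps, chunks := deleteSnapshot(0,
		[]snapshot{{100}, {200}},
		[]chunk{{110}, {150}, {210}})
	fmt.Println(snaps, chunks) // [{200}] [{210}]
}
```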