Hi All,
I am looking for some feedback to ensure my replication, backup and recovery procedure makes sense.
My virtual machines are important but can be quickly rebuilt.
My data served by Samba is critical and cannot be replaced (e.g. family photos).
I have adequate snapshots for my virtual machines (7 days of hourly).
I have adequate snapshots for Samba: 365 daily locally and 1095 daily remote.
I have 4 servers in total.
Primary: hosts virtual machines in one pool and Samba data in a second pool
Secondary: acts as a hot spare; receives (push) replication of my virtual machines in one pool and Samba data in a second pool
remote1 on the west coast: a remote host that receives (pull) snapshots of my Samba data from Primary into a single pool
remote2 on the east coast: a remote host that receives (pull) snapshots of my Samba data from Secondary into a single pool
My replication data flows look as follows:
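Going by the host list above, the flows can be written down as plain data. This is only a sketch: the dataset labels `vms` and `samba` are placeholders rather than real pool names, and the helper names `hosts_holding` and `direct_targets` are made up for illustration.

```python
# Replication flows as (source, destination, mode, datasets) tuples.
# Host names come from the list above; "vms" and "samba" are placeholder
# labels, not the real pool or dataset names.
FLOWS = [
    ("primary", "secondary", "push", ("vms", "samba")),
    ("primary", "remote1", "pull", ("samba",)),
    ("secondary", "remote2", "pull", ("samba",)),
]


def hosts_holding(dataset, flows=FLOWS):
    """Return every host that holds a copy of the dataset, sorted by name."""
    hosts = set()
    for source, destination, _mode, datasets in flows:
        if dataset in datasets:
            hosts.add(source)
            hosts.add(destination)
    return sorted(hosts)


def direct_targets(source_host, flows=FLOWS):
    """Return the hosts that replicate directly from source_host."""
    return [dst for src, dst, _mode, _datasets in flows if src == source_host]


if __name__ == "__main__":
    print(hosts_holding("samba"))     # four copies of the Samba data
    print(hosts_holding("vms"))       # two copies of the VMs
    print(direct_targets("primary"))  # hosts affected if Primary goes down
```

Written this way it is easy to see that the Samba data lives on four hosts while the VMs live on two, and that only Secondary and remote1 depend directly on Primary.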
My thinking here is:
If Primary has a hardware failure, I can bring up my VMs on the Secondary and start the Samba service to serve my data. The VMs will keep their existing IPs, so there is no interruption there. A few services, such as scanning, will need a manual IP change because the Samba share is now on a new IP (that of the Secondary host). I then reconfigure sanoid on the Secondary to use the snapshot schedule that was on Primary.
- remote1 will be unable to pull snapshots since Primary is down, so its replication state will begin to age
- remote2 will continue to replicate from Secondary and will have an up-to-date replication status
- Secondary will also be up to date since it's now acting as the primary
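
To keep an eye on how far a remote has aged during an outage, something like the following could work. It is a minimal sketch that assumes snapshot names follow sanoid's default `autosnap_YYYY-MM-DD_HH:MM:SS_period` pattern and that all hosts use the same timezone. The dataset name `tank/samba`, the 36-hour threshold and the function names are illustrative only. The list of names would come from something like `zfs list -H -t snapshot -o name` on each host.

```python
from datetime import datetime, timedelta


def snapshot_time(name):
    """Parse the timestamp from a sanoid-style snapshot name, or return None."""
    _dataset, _sep, label = name.partition("@")
    parts = label.split("_")
    if len(parts) != 4 or parts[0] != "autosnap":
        return None  # not a sanoid autosnap name (e.g. a manual snapshot)
    try:
        return datetime.strptime(parts[1] + " " + parts[2], "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None


def newest_snapshot_age(names, now):
    """Return the age of the newest parsable snapshot, or None if there is none."""
    times = [t for t in map(snapshot_time, names) if t is not None]
    if not times:
        return None
    return now - max(times)


def is_stale(names, now, max_age=timedelta(hours=36)):
    """True if the host has no parsable snapshots or its newest one is too old."""
    age = newest_snapshot_age(names, now)
    return age is None or age > max_age


if __name__ == "__main__":
    now = datetime(2024, 1, 15, 12, 0, 0)
    remote1 = ["tank/samba@autosnap_2024-01-10_00:00:01_daily"]  # stopped when Primary died
    remote2 = ["tank/samba@autosnap_2024-01-15_00:00:01_daily"]  # still pulling from Secondary
    print("remote1 stale:", is_stale(remote1, now))
    print("remote2 stale:", is_stale(remote2, now))
```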
When the Primary comes back online, I need to make sure that snapshots and replication are disabled on it.
I then disable snapshots on the Secondary and replicate its state back to the Primary.
I then reinstate the snapshots and replication on Primary to ensure the Secondary is kept up to date.
remote1 will then be able to pull snapshots from Primary to catch up to the current state.
remote2 will continue pulling from Secondary and maintain its current state.
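
One thing the failback steps above depend on is that Primary and Secondary still share a snapshot, since an incremental send back to Primary needs a common starting point. A rough pre-flight check could look like this. It is a sketch that compares snapshot names only (ZFS itself matches snapshots by GUID, so treat it as a sanity check, not a guarantee), and it assumes both lists are ordered oldest to newest, for example from `zfs list -H -t snapshot -o name -s creation`. The dataset and function names are hypothetical.

```python
def newest_common_snapshot(source_snaps, target_snaps):
    """Return the newest snapshot label present on both sides, or None.

    Both lists hold full names like 'pool/data@label', ordered oldest to
    newest. Only the part after '@' is compared, so the dataset paths may
    differ between hosts.
    """
    target_labels = {name.partition("@")[2] for name in target_snaps}
    for name in reversed(source_snaps):  # walk from newest to oldest
        label = name.partition("@")[2]
        if label in target_labels:
            return label
    return None


if __name__ == "__main__":
    # Secondary kept taking snapshots while Primary was down.
    secondary = [
        "tank/samba@autosnap_2024-01-09_00:00:01_daily",
        "tank/samba@autosnap_2024-01-10_00:00:01_daily",
        "tank/samba@autosnap_2024-01-11_00:00:01_daily",
    ]
    # Primary still has what it held when it failed.
    primary = [
        "pool/samba@autosnap_2024-01-09_00:00:01_daily",
        "pool/samba@autosnap_2024-01-10_00:00:01_daily",
    ]
    base = newest_common_snapshot(secondary, primary)
    if base is None:
        print("No common snapshot: a full resend to Primary would be needed")
    else:
        print("Incremental failback can start from:", base)
```

If this returns nothing, for example because retention on one side has already pruned the shared snapshots, the replication back to Primary would have to be a full send rather than an incremental one.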
I am hoping this strategy balances the risk of data loss (noting that I am operating in a degraded state while the Primary is down) against the risk of misconfiguration due to the stress of recovering from a failure.
Appreciate any thoughts.
Thanks,
Adam
