Should I use separate datasets for each person in this scenario?

I’m going to have two servers (Lenovo M700 or M720q), each with one or two internal NVMe drives for Proxmox and the VMs/LXCs, and a single 16TB USB HDD for data. I might use two HDDs in a mirror on my main server for a bit of redundancy and automatic error correction, and just one on the server at my Dad’s house, since the plan is to keep the data synced between them. I’ll also have a separate backup of everything on another 16TB drive, plus cloud backups of anything critical.

I decided to use ZFS on top of LUKS encryption, mainly because I read about some performance issues with ZFS native encryption, but also because I can use tools like dropbear and Mandos to remotely/automatically unlock the LUKS drives at boot. I’ll be using Tailscale to connect the two servers and do the syncing, so the connection will be secure even though I’m not using ZFS encryption.
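Roughly the layering I have in mind, as a sketch (the disk path, mapper name and pool name are all placeholders):

```bash
# One-time setup: encrypt the whole data disk with LUKS
cryptsetup luksFormat /dev/sdX

# Open the LUKS container; "data-crypt" is an arbitrary mapper name
cryptsetup open /dev/sdX data-crypt

# Create the ZFS pool on the decrypted mapper device, not the raw disk
zpool create tank /dev/mapper/data-crypt
```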

I’m trying to decide how many datasets to use. I’m thinking of just having four: Media (movies, series, music), Software, Games, and Personal (mainly backups of the data from each family member’s Windows PC, created with something like Veeam Agent and probably password-encrypted so they can be safely synced to the cloud, but maybe also plain copies of photos, documents, etc.).
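In ZFS terms that would just be something like this (assuming a pool called tank):

```bash
# One dataset per top-level category
zfs create tank/media
zfs create tank/software
zfs create tank/games
zfs create tank/personal
```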

The Media, Software and Games datasets won’t change that often, so I can just sync them to the other server once a month, whereas the Personal one will be updated daily and will need two-way syncing, as my PCs will be backed up to my server first, and my Dad’s and Mum’s PCs will be backed up to his server first.

Would there be any benefit in having a dataset for each family member instead of putting everything under Personal, so one for Dad, one for Mum, one for me, etc.? That way, my Dad’s server could be set to sync the datasets for Dad and Mum to my server, and my server could sync the datasets for me and my siblings to his. That might limit the damage if his server gets infected: with a single Personal dataset, corrupted data would be synced to my server and overwrite the uncorrupted copies, whereas per-user datasets would confine the corruption to the users backed up on his side. I’m not sure if that’s a realistic risk, though, or if I’m overthinking this.
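To make the idea concrete, each per-user dataset would then only ever replicate in one direction, something like this (names are placeholders):

```bash
# On my server: datasets that originate here and get pushed to Dad's
zfs create tank/personal/me
zfs create tank/personal/sibling

# On Dad's server: datasets that originate there and get pushed to mine
zfs create tank/personal/dad
zfs create tank/personal/mum
```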

I have set up each /home/$USER as a separate dataset. This way you can set quotas on the individual users (although it’s my account that uses all the space).
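For example (the dataset name and the quota figure are just illustrative):

```bash
# One dataset per user, mounted at /home/$USER, with a per-user space cap
zfs create -o mountpoint=/home/alice -o quota=200G tank/home/alice
```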

As for corruption in a cybersecurity event: if the attacker manages to break into the computer, I would expect most or all of the data to be corrupted/encrypted. This is a case where snapshots come in handy, because the attacker cannot modify those.
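Snapshots are read-only once taken, so recovery is just a rollback (the snapshot name here is illustrative):

```bash
# Take a point-in-time, read-only snapshot of the dataset
zfs snapshot tank/personal@2024-01-01

# If ransomware later mangles the files, roll the dataset back to it
zfs rollback tank/personal@2024-01-01
```

Note that zfs rollback only goes back to the most recent snapshot unless you pass -r to destroy the newer ones first.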


Using quotas might be useful, yeah, just to make sure someone doesn’t use too much space. For my scenario there’s probably no point in mounting each person’s dataset at /home/$USER, as they’ll just contain backups/copies of data from their Windows PCs, rather than files that they’re going to access on Linux PCs.

Snapshots will be a useful extra layer of protection against corruption. Maybe not strictly necessary if I also have an offline backup, but I guess it will be quicker to roll back to a snapshot than to restore from the backup. What sort of system do you have for making snapshots? Do you just create a new one once a week or something?

Since the data on my two servers will differ between syncs (for example, if the Windows PCs are backed up daily but the servers are only synced once a week), would I need to create separate snapshots on each server, rather than creating them on mine and syncing them to my Dad’s?

A separate dataset per user sounds like a good idea, particularly if you want to do rollbacks, or migrate them to separate machines in the future.

Also, with two always-connected machines like these, I don’t see any reason not to have syncoid replicating everything every 10 minutes or so. It’s really very efficient, and there is little overhead if there are no changes, so you might as well minimise your potential data loss.
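A cron entry along these lines is all it takes (the hostname and dataset names are placeholders, and I’m assuming SSH key access over your Tailscale link):

```bash
# /etc/cron.d/syncoid-personal
# Push the locally-originating datasets to the other server every 10 minutes
*/10 * * * * root syncoid tank/personal/me root@dads-server:tank/personal/me
*/10 * * * * root syncoid tank/personal/sibling root@dads-server:tank/personal/sibling
```

Syncoid also pairs nicely with sanoid for the snapshot schedule itself, since it replicates the snapshots that sanoid creates.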
