ZFS for CCTV footage

Hey! I am setting up a small-ish 7-camera CCTV system and have 3x SATA Innodisk 1TB 3MV2-P high-write-endurance, PLP industrial drives. I wonder if a raid0 stripe using ZFS is suitable for this use case, or if any drawbacks of a COW filesystem would have a negative impact on the longevity of the drives. I plan on having constant recording of all streams along with object detection. What are your thoughts on this? Should I go with ZFS raid0 or LVM/md-raid0? I do not need redundancy for this use case and I’m fine with raid0.

I’d be inclined toward LVM if there isn’t some ZFS feature you were planning to use. Is there? Compression? You lose most of the benefit of ZFS automatically checksumming everything if there’s no redundancy to self-heal with.

ZFS is perfectly suitable for this use case, and offers you the potential for rapid asynchronous replication. It should also be more resistant to corruption due to hard crashes (eg power loss) than ext4.

Don’t get me wrong, ext4 is a metadata-journaling filesystem and is quite resilient to crashes. But ZFS is by nature even more so. There’s a reason ext4 still has (and sometimes needs) an fsck utility, and ZFS does not.
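
If you do go the ZFS route, here’s a minimal sketch of what the pool layout could look like (device paths, pool name, and dataset name are all placeholders, and the property values are just reasonable starting points for large sequential video writes):

    # three single-disk vdevs = the ZFS equivalent of a raid0 stripe
    zpool create cctv /dev/disk/by-id/ssdA /dev/disk/by-id/ssdB /dev/disk/by-id/ssdC

    # big records suit large sequential video writes; lz4 costs next to nothing
    # even though already-compressed video won't shrink much further
    zfs create -o recordsize=1M -o compression=lz4 cctv/footage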

Thanks for the answers! I’m inclined to choose ZFS since the other disks in the system also use ZFS and I’m comfortable using it. Would there be write amplification that degrades the disks faster than a conventional file system would?

With rapid asynchronous replication, do you mean replicating the CCTV footage to a remote server? Do you have any tips for how this could be achieved in practice? I’m planning this exact feature but I’m unsure how to set it up: snapshot replication, restreaming of the footage over SSH, or WireGuard tunneling to the remote server?

If you use sanoid, it’s extremely easy.

  1. create dataset on CCTV server
  2. ssh-copy-id from backup server to CCTV server so you’ve got passwordless login
  3. backup:~# syncoid -r root@cctv:poolname/footage localpool/footage

That’s it, although you can dress things up by eg using ZFS delegation so you don’t have to allow root logins. It’ll go INCREDIBLY faster than rsync, and with much, MUCH less load placed on the boxes on either end.
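
If you want to go the delegation route, a rough sketch (the “backup” user, pool, and dataset names are placeholders, and the exact permission list may need tweaking for your setup):

    # on the CCTV (source) box: let an unprivileged user snapshot and send
    zfs allow backup send,snapshot,hold,destroy,mount poolname/footage

    # on the backup (target) box: let that same user receive into the local pool
    zfs allow backup receive,create,mount,destroy,rollback localpool/footage

Then you point syncoid at backup@cctv instead of root@cctv.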

Ideally, you’d probably also want to set up sanoid to automatically take (and destroy) snapshots according to policy. But in a pinch, syncoid’s default “sync snapshots” will be enough to get the job done, if you only have a single backup target.

That’s awesome! I will go for this solution as it sounds really simple. I haven’t yet had the opportunity to try those tools, but I will definitely set them up according to your instructions. Thanks.

I have some additional questions:
How often does the default syncoid configuration take snapshots?
Can I use syncoid to replicate into an encrypted dataset on the remote (from an unencrypted dataset)?

Syncoid itself does not take snapshots except when you actually ask it to replicate, in which case it immediately takes a sync snapshot, replicates it to the target host, then destroys older sync snapshots–not the one it just created and replicated!–on both source and target hosts, leaving only the most recent sync snapshot (to base future incremental replication on).

Sanoid takes and destroys snapshots according to policy. If you use sanoid, you need to use it on both source and target hosts. On the source host, you use the production template, which takes hourly, daily, and monthly snapshots. If you don’t tell it differently, the production template maintains 30 hourlies, 30 dailies, and 3 monthlies–but that is something you can easily change to suit your preferences; you can create your own template based on the provided production template, or you can individually override defaults on the production template. It’s easy, when you see the config file.
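
For reference, a minimal /etc/sanoid/sanoid.conf on the source could look something like this (the dataset name is a placeholder, and the retention numbers are just examples you can override):

    [poolname/footage]
            use_template = production
            recursive = yes

    [template_production]
            hourly = 30
            daily = 30
            monthly = 3
            autosnap = yes
            autoprune = yes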

On the target (backup) host, you use the backup template (if you’re replicating in once daily) or the hotspare template (if you’re replicating in hourly). These templates do not take snapshots locally–there’s no point and some possible harm in locally taking snapshots on a target; at best, they get wiped out by the next incoming replication. At worst, an identically named local snapshot may confuse syncoid and cause replication to fail until you remove the offending snapshot. But running sanoid on the target is necessary if you run it on the source, otherwise the target fills up, because destroying a snapshot on the source does not directly affect any snapshots on the target.
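
The target side is the mirror image; again, names and retention numbers are only placeholders:

    [localpool/footage]
            use_template = hotspare
            recursive = yes

    # prune incoming snapshots, but never take local snapshots on a replication target
    [template_hotspare]
            hourly = 96
            daily = 60
            monthly = 6
            autosnap = no
            autoprune = yes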

This also allows you, btw, to have a backup target with lots of big, cheap, slow drives that keeps more snapshots than you can afford to keep on a source with smaller, faster, more expensive drives.

You’ll also want to take a look at sanoid --monitor-snapshots as a very easy way to make sure your replication is happening properly.

I see. So if I want quick replication of all footage to a remote host, in case of CCTV server theft, does a cron job or something similar need to be set up?

I’m sorry if I misunderstand here: is this a one-shot command, or does it run as a background service that synchronizes to a remote filesystem at set intervals?

My goal is to have a nearly synchronized remote disk, so that in the unfortunate case of server theft or manipulation there is a backup with (almost) all footage intact.

You need to copy the SSH key only once, but after that, syncoid works essentially like rsync would. So, yes, you run it from cron (or from a systemd timer).
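
For example, a cron.d entry on the backup box along these lines would pull every 15 minutes (the interval, the syncoid path, and the dataset names are placeholders; a systemd timer works just as well):

    # /etc/cron.d/syncoid-cctv -- adjust the syncoid path for your distro
    */15 * * * * root /usr/sbin/syncoid -r root@cctv:poolname/footage localpool/footage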

Typically I replicate hourly to a local hotspare system and daily to an off-site remote DR system, but there are quite a few people out there doing much more frequent replication; 15m isn’t that uncommon, and I’ve seen 5m intervals.

All depends on how much data you accumulate versus how much network throughput is available, really.

Thanks Jim, really helpful.

one other thing to keep in mind: the more frequent your replication is, the more frequently you can screw BOTH systems up.

for the most part, this isn’t too much of a concern as long as you’re keeping some snapshot depth, because if somebody deletes all your data on the source, the deletions WILL replicate to the target, but you can just roll the target (and usually the source) back to prior to the malicious deletions.

but there are a FEW classes of screw-up that can be truly catastrophic and will replicate very disastrously. this could be a zero-day catastrophic filesystem bug, or, it could be my favorite “advanced incompetence” story. buckle up for this one:

so a client of mine hires an in-house IT guy, and asks me to train this guy up on the systems. I don’t mind doing that. I didn’t get a very good impression of the dude, he did not seem that sharp and did not pay that much attention, but hey, I build these practically immortal systems with local and remote replication hourly and daily, automated snapshots, the works, what’s to worry?

well, one day–I have NO idea why–the guy attempts to replicate one of the gold images (fully installed OS, without any post-installation configuration done, useful for spinning new VMs up rapidly) ON TOP OF one of their most important production VMs. and he succeeds! the gold image still had a snapshot in common with the VM, which was based on that gold image, so… it rolled the mission-critical production VM all the way back to gold, a completely unconfigured fresh installation of Windows Server.

Obviously, the VM immediately went down hard. But did the guy call me? No. Did he tell his boss what he’d done? No. He just sits on his hands.

At the top of the hour, the on-site hotspare server faithfully replicates in from production. This, once again, effectively rolls the HOTSPARE’s copy of the mission-critical VM all the way back to gold.

Dude still just sits on his hands. Another like eight hours go by. Finally, I get a call from the owner of the business, asking me to look at why his production VM is down, because his new hire can’t seem to figure out how to fix it. So I shell in, expecting to just roll it back to the last good snapshot… only there are no last good snapshots, because the damn thing rolled all the way back to gold, destroying all snapshots along the way.

So I check the hotspare. Same story.

And with LESS THAN AN HOUR TO GO before the offsite daily replication would have fired off and destroyed the LAST copy of the production VM, I hurriedly shell into the DR box, kill the replication task before it can fire off, and then I had to spin up the VM itself directly ON the DR box–because the business could not operate for the week or more it would take to replicate it back in full over the rather underwhelming internet connection they had–and then temporarily reconfigure the VM to operate on the LAN subnet at the DR facility, then update the client’s Active Directory DNS to point to the new IP address for the VM at the DR facility.

With that done, the client was able to function reasonably well for the week plus it took to get a full replication done BACK from the DR facility to the production facility, after which I could then shut down the temporary copy in DR, replicate one last time (which didn’t take long) to production to catch any last-minute changes, THEN finally spin the damn thing back up in the production facility, reconfigure it to operate on the production subnet, then put everything back to normal in the client’s AD DNS again.

That was “fun.” And the moral of the story is, it is not a bad thing to have at least one backup that doesn’t update super duper rapidly, so that you’ve got a fighting chance at catching any really, really ugly problems you never even realized COULD happen before they wipe out your DR.

Interesting story, and much wisdom to be learned there! I’ll take note. Thanks again Jim.
