Config and Workflow for Distributed Video Editors

A friend asked me for some thoughts on a goal and I’m curious if any of you have input too.

He and three other video editors all work for one org. Right now, when they do a shoot, one person ‘owns’ the project: they take all the data home, do the edit, and publish the video. It’s educational stuff that gets posted to their platform. They don’t back up their data, there is no central archive, and collaboration is pretty much impossible. Needing a previous clip, for instance, means figuring out who edited a given video, then hoping they still have the asset and can actually find it.

Not ideal.

This all started when he said to me last night ‘I need two Petabytes of storage’.

As we began talking, the conversation turned into a discussion about workflows and goals. The storage is the easy part.. making it work for them is a much bigger challenge.

I think the dumb-simple option is they keep doing what they’re doing now, but copy their assets and completed projects to a backup server. That would be an improvement, but it still leaves a fair bit of risk and inflexibility.
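
Even the dumb-simple version could be mostly automated, for what it’s worth. Something like this after each project wraps (host and paths invented, obviously):

```
# one-way copy of a finished project to the backup box
# "backup-host" and both paths are placeholders
rsync -avh --partial /editing/2026-01-widget-shoot/ backup-host:/archive/2026-01-widget-shoot/
```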

He’s a ‘Do it once’ sort of guy, and while they do NOT need 2PB today, he doesn’t want to have to think about this again for 10 years. Their current data is more like 500TB, and they create about 100TB/year. I think a chassis with room to expand is probably a stronger starting place; we can add new pools each year, or whatever, to keep up with growth. No one is expected to be editing off of whatever the ‘central host’ is. And I feel like buying that much storage to sit idle is a waste at this stage. I’d rather have a plan for meeting their needs without locking them into dozens of disks that will go EOL before they hold much data.
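
For the grow-as-you-go approach: whether that ends up as separate pools or (simpler, for one big archive) new vdevs added to a single pool, each yearly bump is basically a one-liner. A sketch of the single-pool version, with pool and device names made up:

```
# add another six-wide RAIDz2 vdev to the existing pool
# "tank" and the da* device names are placeholders
zpool add tank raidz2 da10 da11 da12 da13 da14 da15
```

The usual caveat: a raidz vdev can’t be removed from the pool once added, so the vdev width wants a little thought up front.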

I started to wonder if it might be wise to look at one central unit with all the data, and each editor having a smaller unit with a subset of working data. They edit in a mix of NLEs, but perhaps they could create a dataset per project?

The goal would be that an editor makes a dataset, loads the data in and gets to work. The ZFS box at home, perhaps small TrueNAS Minis or something, would then replicate that up to the main system. The next piece though would be allowing an editor to ‘check out’ a project from the main system and have it replicate down to their unit …
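
The upward leg, at least, is a solved problem. A sketch of what I’m picturing, assuming sanoid/syncoid on the editor’s box (dataset and host names are all invented):

```
# editor creates a dataset for a new project and starts working
zfs create -p editor-pool/projects/widget-tutorial

# a cron job or systemd timer then pushes it to the central host
syncoid editor-pool/projects/widget-tutorial root@central:tank/projects/widget-tutorial
```

It’s the downward, ‘check out’ direction where things get murky.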

I know of no tooling to do that smoothly. Replication in one direction? Sure. Easy. But, it’s this idea of taking a project local from the main host that gives me pause. These are video editors.. smart people to be sure .. but not nerds.
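
The least-bad version of ‘check out’ I can come up with is a dumb wrapper the editors run by name, so nobody ever touches replication tooling directly. A rough sketch, assuming syncoid on the editor’s machine with SSH access to the central host (everything here is hypothetical):

```
#!/bin/sh
# checkout.sh -- pull one project from the central host to the local pool
# usage: ./checkout.sh widget-tutorial
PROJECT="$1"
syncoid root@central:tank/projects/"$PROJECT" editor-pool/projects/"$PROJECT"
```

Even then it doesn’t solve the real problem: two editors checking out the same project and both pushing changes back. Some out-of-band ‘who has it’ convention would still be needed.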

I don’t expect I’ll have any success getting them to craft custom and ever changing sanoid.conf files..
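
Though, now that I write that out: sanoid.conf may not actually need to change per project. A recursive stanza over the parent dataset should cover every project dataset they ever create, so the file could be written once and forgotten (retention numbers below are placeholders):

```
# /etc/sanoid/sanoid.conf -- written once, never edited per-project
[editor-pool/projects]
        use_template = production
        recursive = yes

[template_production]
        hourly = 36
        daily = 30
        monthly = 3
        autosnap = yes
        autoprune = yes
```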

Other ideas? Am I missing the obvious? (I hope so…)

You want to plan for about five years before expansion. Any more than that and you’ve not only spent more by buying the hardware before costs come down, and given up the gains you could have made by using the money more productively during those five years; you’ve also taken a giant whack off the lifespan of those extra drives before you even begin to actually need them.

With that said, at 500TB now and projected accumulation of another 100TB/year, you’d be at 1PB by that five year mark. Which means 1PB isn’t enough.

You generally want to plan for near-future expansion by the time you hit 80% full. So with 1PB of projected data at the five-year mark, you need about a 1.25PB array (usable, after parity; 1PB ÷ 0.8) just to be ready to expand without immediate desperation kicking in.

If you do a 2PB pool, at the ten-year mark you’re projecting 1.5PB of data: pretty much the exact target for “we want to be about 80% full by the time we plan the next upgrade.” So your boss was dead on with that one.
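
Spelled out, since the numbers land so neatly:

```
Year 5:  500TB + (5 × 100TB)  = 1.0PB → 1.0PB / 1.25PB = 80% full
Year 10: 500TB + (10 × 100TB) = 1.5PB → 1.5PB / 2.0PB  = 75% full
```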

I’d say you probably want that 1.25PB array now, with plans to add roughly another 1PB to it in five years. Somewhere in the vicinity of the ten-year mark, you’re going to want to proactively replace the whole thing with whatever technology makes sense in 2036 or 2037. You can’t really plan for that yet; it’s too far off. Guesses at your own data accumulation rate aren’t even going to be accurate that far out, let alone guesses at hardware availability and price.

Ha! As ever, I’m sure you’re right.

Getting asked, casually, for 2PB was just such a surprise that my knee-jerk reaction was .. slow down!

But, yeah - the math checks out. He’s a friend, not a boss, so I have no actual insight into the data at all, and no reason to doubt he’s fully aware of what he needs (and wants!).


I mean, I do still think you’re correct that buying ten years’ worth of raw storage is not the right answer, especially during a price spike, which we are very much experiencing right now.

The 0.75PB usable-storage discrepancy between the five-year and ten-year plans, if we assume six-wide Z2 (two parity drives per six, so raw is roughly 1.5× usable), becomes about a 1.2PB discrepancy in raw storage. At current prices, that’s about a $30K delta (rust, 20TB IronWolf), assuming you’ve got sixty empty bays lying around already.
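
Showing my work on that, with the per-drive price backed out of the totals:

```
0.75PB usable × 1.5 (six-wide Z2: 4 data + 2 parity) ≈ 1.1PB raw, call it 1.2PB
1.2PB raw ÷ 20TB/drive = 60 drives
60 drives × ~$500/drive (20TB IronWolf at spike pricing) ≈ $30K
```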

Almost certainly better to save that extra $30K for now, and run on that 1.25PB (after parity) for the next five years, with confidence, rather than spending the extra $30K (not counting additional JBOD shelves or controllers) to go to 2PB usable right now.

Side note: thanks to the AI price spike, that $30K delta between the five-year and ten-year plans would be a HALF MILLION DOLLAR discrepancy in even inexpensive flash…!
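
That one checks out on a napkin too, assuming cheap flash runs somewhere around $400/TB right now (my estimate, not a quote):

```
1.2PB raw × ~$400/TB ≈ $480K, i.e. roughly half a million
```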