Zpool mirror seems to have lost all data? (Data is back after long scrub and reboot)

EDIT: The data is back after a long scrub and a reboot, see my reply below; no idea what happened. Any ideas? How can I prevent this?

Hi all,

This morning I found my server unresponsive (a NixOS install with Gnome, because I sometimes use it as a desktop). It’s likely that leaving Firefox with many tabs open somehow filled the memory and it ground to a halt.

I had to hard-reset the server. When it came back online, I found my zpool mirror empty, whereas before it contained about 800 GB of data.

It does say that it is scrubbing:

[freek@trantor:/zpool0/data1]$ df -h .
Filesystem      Size  Used Avail Use% Mounted on
zpool0/data1    2,8T  128K  2,8T   1% /zpool0/data1

[freek@trantor:/zpool0/data1]$ zpool status -v
  pool: zpool0
 state: ONLINE
  scan: scrub in progress since Tue Oct  1 01:09:34 2024
	771G / 771G scanned, 207G / 771G issued at 137M/s
	0B repaired, 26.79% done, 01:10:18 to go
config:

	NAME                        STATE     READ WRITE CKSUM
	zpool0                      ONLINE       0     0     0
	  mirror-0                  ONLINE       0     0     0
	    wwn-0x50014ee26604ad95  ONLINE       0     0     0
	    wwn-0x50014ee2605e147a  ONLINE       0     0     0

errors: No known data errors

The whole process is moving much more slowly than any previous scrub I ran manually (I did set up weekly scrubs in my Nix configuration file).
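(For reference: a running scrub can apparently be paused and resumed later instead of being left to crawl along; this is just a sketch, assuming OpenZFS 0.7 or newer and my pool name:)

# Pause the running scrub; progress is kept
zpool scrub -p zpool0

# Resume it later from where it left off
zpool scrub zpool0

# Check progress and the current scan rate
zpool status zpool0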

I haven’t made any snapshots yet; I was going to try that next (this data was just for testing ZFS out, so nothing has really been lost, btw).
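For reference, the snapshot commands I had in mind look roughly like this (the snapshot label is just something I made up):

# Create a read-only snapshot of the dataset
zfs snapshot zpool0/data1@before-testing

# List the snapshots that exist for this dataset
zfs list -t snapshot zpool0/data1

# Roll the dataset back to that snapshot if needed (discards later changes)
zfs rollback zpool0/data1@before-testing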

Should I just wait for the scrub to finish? Is data normally unavailable during scrubbing? I searched for this but didn’t find anything. But I’m a total noob; this is my first attempt at using ZFS.

Edit: Maybe I deleted the data myself while my screen was off? (Highly unlikely.) Or maybe my disks got renamed (since I used sda/sdb as device names)? Is there some kind of log I can check to see the last actions on the disks?

Gnome Disks reports both disks as unmounted btw; is that normal?


1 Like

Ok, the scrub finished successfully:

[freek@trantor:/zpool0/data1]$ zpool status -v
  pool: zpool0
 state: ONLINE
  scan: scrub repaired 0B in 09:35:57 with 0 errors on Tue Oct  1 10:45:31 2024
config:

	NAME                        STATE     READ WRITE CKSUM
	zpool0                      ONLINE       0     0     0
	  mirror-0                  ONLINE       0     0     0
	    wwn-0x50014ee26604ad95  ONLINE       0     0     0
	    wwn-0x50014ee2605e147a  ONLINE       0     0     0

errors: No known data errors

Still no data:

[freek@trantor:/zpool0/data1]$ df -h /zpool0/data1/
Filesystem      Size  Used Avail Use% Mounted on
zpool0/data1    2,8T  128K  2,8T   1% /zpool0/data1

Ok, and one reboot later, there is my data again:

[freek@trantor:~]$ zpool status -v
  pool: zpool0
 state: ONLINE
  scan: scrub repaired 0B in 09:35:57 with 0 errors on Tue Oct  1 10:45:31 2024
config:

	NAME                        STATE     READ WRITE CKSUM
	zpool0                      ONLINE       0     0     0
	  mirror-0                  ONLINE       0     0     0
	    wwn-0x50014ee26604ad95  ONLINE       0     0     0
	    wwn-0x50014ee2605e147a  ONLINE       0     0     0

errors: No known data errors

[freek@trantor:~]$ df -h /zpool0/data1/
Filesystem      Size  Used Avail Use% Mounted on
zpool0/data1    3,6T  772G  2,8T  22% /zpool0/data1

I mean, it’s nice but I still really wonder what happened!

1 Like

Glad it’s back up and running. I imagine a few changes of underwear were required today.

You could check the pool history with:

zpool history
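If the plain listing isn’t detailed enough, the same command takes a couple of standard flags (sketched here against your pool name):

# Long format: adds the user, hostname and zone for each command
zpool history -l zpool0

# Also show internally logged events (scrubs, dataset operations, etc.)
zpool history -i zpool0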

2 Likes

Gnome Disks doesn’t understand ZFS well enough to display anything more than the fact that there is a ZFS partition type.

In addition to zpool history, I would look at the SMART stats for the drives involved. Also check the dmesg output (before rebooting). There might be helpful messages in the system logs.
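A rough sketch of those checks, assuming smartmontools is available and using one of the wwn device names from your zpool status output:

# SMART health, attributes and self-test log for one mirror member
smartctl -a /dev/disk/by-id/wwn-0x50014ee26604ad95

# Kernel messages from the current boot, filtered for likely suspects
dmesg | grep -iE 'zfs|ata|i/o error'

# Kernel messages from the previous boot (systemd journal)
journalctl -k -b -1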

Edit: I’m a little curious about “I found my zpool mirror empty.” How did you determine that? Incidentally, ZFS is designed to survive a hard reset, so something else might have gone wrong, requiring a resilver and scrub.

3 Likes

Either the design has errors or the implementation does. I still can’t imagine how all four ZFS labels on a disk could be corrupted at the same time (that’s unrealistic for any bitrot), whether by ZFS itself after a power failure or, as some report, after a regular reboot. Maybe a future pool release will improve the design.

Nothing special in the history if you ask me:

2024-09-24.23:34:01 zfs set nixos:shutdown-time=di 24 sep 2024 23:34:00 CEST zpool0
2024-09-24.23:36:11 zpool import -d /dev/disk/by-id -N zpool0
2024-09-26.17:09:20 zfs set nixos:shutdown-time=do 26 sep 2024 17:09:18 CEST zpool0
2024-09-26.17:11:31 zpool import -d /dev/disk/by-id -N zpool0
2024-10-01.08:47:09 zpool import -d /dev/disk/by-id -N zpool0
2024-10-01.12:01:52 zfs set nixos:shutdown-time=di  1 okt 2024 12:01:51 CEST zpool0
2024-10-01.12:03:57 zpool import -d /dev/disk/by-id -N zpool0
2024-10-01.12:24:55 zfs set nixos:shutdown-time=di  1 okt 2024 12:24:53 CEST zpool0
2024-10-01.12:27:03 zpool import -d /dev/disk/by-id -N zpool0

Gnome Disks does give nice SMART data and self-tests; everything is reported as OK.

Someone suggested it might be something fstab-related: the OS continued booting without properly finishing mounting the pool?
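Next time I’ll check whether the dataset is actually mounted before panicking; roughly something like this, assuming zpool0/data1 uses a ZFS-managed (non-legacy) mountpoint:

# Is the dataset mounted, may it be mounted, and where should it go?
zfs get mounted,canmount,mountpoint zpool0/data1

# Mount every dataset in the pool that isn't mounted yet
zfs mount -a

# Or mount just this one
zfs mount zpool0/data1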

I didn’t need clean underwear btw, because this is my first ZFS project and I only use it for extra stuff for now :slight_smile: I do have bigger plans though…

1 Like

Based on what you’ve posted, my hunch is that due to the unclean shutdown, a forced scrub happened (maybe as part of the import). The scrub process clearly shows that there IS data (~771 GB), which is/was good, but I think things weren’t mounted properly post-scrub, so you couldn’t “see” it in df.
You then rebooted, and this time it was a proper ordered shutdown, so on boot the pool imported properly and mounted as usual, and voilà… things look good and the zpool was saved.

You should look at the old logs to see if the OOM condition killed anything (and if so, what) and how that may have affected ZFS on the machine. It’ll at least give you a clue about what went on.
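A minimal sketch of that log check on a systemd-based system like NixOS; the grep pattern is just a guess at the usual kernel wording:

# Kernel messages from the previous boot, filtered for OOM-killer activity
journalctl -k -b -1 | grep -iE 'out of memory|oom'

# Errors from the whole previous boot, in case a service was killed or failed
journalctl -b -1 --priority=err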

Glad you’re back to normal now.

3 Likes

It does indeed mention 771 GB of data during the scrub, you’re right!