At work we’re using SmartOS, an OpenSolaris descendant (built on illumos) featuring all kinds of cool stuff. One of the best things is the underlying file system: ZFS.
With ZFS it is easy to create, mirror, and extend storage pools. It’s also very easy to snapshot pools and back them up using zfs send and zfs receive.
During a manual backup of one of the pools today, I wanted to check the status of the whole system using zpool status -v.
This is what it showed:
$ zpool status -v
  pool: zones
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
config:

        NAME        STATE     READ WRITE CKSUM
        zones       DEGRADED    16     0     0
          mirror-0  DEGRADED    32     0     0
            c0t4d0  DEGRADED    32     0     0  too many errors
            c0t6d0  DEGRADED    32     0     0  too many errors
        logs
          c0t9d0    ONLINE       0     0     0
        cache
          c0t8d0    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        zones/dump:<0x1>
At first this looks a little bit weird. What is this zones/dump even for? Why is it broken?
The answer: when the system crashes, Solaris dumps its memory to disk, and zones/dump is the volume that receives this crash dump.
I googled the error to find out why the dump would get corrupted, and whether the disks were really broken or it was just a software error.
Turns out this bug is known. We recently upgraded our SmartOS, which triggered the issue. The disks and the pool are not really broken; the data is simply misinterpreted. To correct it you must replace the dump volume and then scrub the whole pool again. I executed the following commands to do this (found them in a forum post):
zfs create -o refreservation=0 -V 4G zones/dump2
dumpadm -d /dev/zvol/dsk/zones/dump2
zfs destroy zones/dump
zfs create -o refreservation=0 -V 4G zones/dump
dumpadm -d /dev/zvol/dsk/zones/dump
zfs destroy zones/dump2
This first creates a new volume, swaps it in as the dump device, and deletes the old one. It then creates a fresh volume under the old name, puts that back in place, and removes the temporary volume.
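Since these commands destroy and recreate a volume, it can help to see exactly what will run before running it. The following is only a rough sketch of the same six steps wrapped in a function with a dry-run switch. The names run, recreate_dump and DRY_RUN are made up for this example and are not part of zfs or dumpadm; the pool name and size are taken from the commands above.

```shell
#!/bin/sh
# Sketch: the dump volume swap from above, with a dry-run mode.
# run, recreate_dump and DRY_RUN are illustrative names only.

run() {
    # Dry run is the default: print the command instead of executing it.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

recreate_dump() {
    pool=$1   # e.g. zones
    size=$2   # e.g. 4G
    # Each step only runs if the previous one succeeded.
    run zfs create -o refreservation=0 -V "$size" "$pool/dump2" &&
    run dumpadm -d "/dev/zvol/dsk/$pool/dump2" &&
    run zfs destroy "$pool/dump" &&
    run zfs create -o refreservation=0 -V "$size" "$pool/dump" &&
    run dumpadm -d "/dev/zvol/dsk/$pool/dump" &&
    run zfs destroy "$pool/dump2"
}
```

Calling recreate_dump zones 4G only prints the six commands. Setting DRY_RUN=0 first makes it execute them, stopping at the first step that fails.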
In case the dumpadm -d part fails, complaining about the volume being too small, just resize it:
zfs set volsize=20G zones/dump2
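With the dump volume replaced, the remaining step is the scrub. Here is a small sketch for starting it and waiting until it is done. It assumes that zpool status prints the phrase "scrub in progress" while a scrub is running, which is what our system shows. The function names and the polling interval are invented for this example.

```shell
#!/bin/sh
# Sketch: wait for a running scrub to finish, then print the pool status.
# scrub_in_progress and wait_for_scrub are illustrative names only.

scrub_in_progress() {
    # Reads `zpool status` output on stdin; succeeds while a scrub is running.
    grep -q 'scrub in progress'
}

wait_for_scrub() {
    pool=$1
    interval=${2:-600}   # seconds between checks, 10 minutes by default
    while zpool status "$pool" | scrub_in_progress; do
        sleep "$interval"
    done
    # Show the final result, including any files with permanent errors.
    zpool status -v "$pool"
}
```

Usage would be something like zpool scrub zones followed by wait_for_scrub zones.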
The scrub took 21 hours on our large data set, but because it runs at low priority it was not noticeable in the machines running on this host. The final status:
  pool: zones
 state: DEGRADED
…
errors: Permanent errors have been detected in the following files:

        <0x17f>:<0x1>
Well, now the zones/dump:<0x1> entry is gone. But the pool still reports an error for the same volume, only without a name this time. We’re scheduling a maintenance window soon to reboot the machine. Let’s hope this will clear the error. Otherwise we will replace the HDD.
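To compare the error list before and after the reboot, something like the following could pull just the affected entries out of the status output. This is a sketch that assumes the output layout shown above, with an "errors: Permanent errors ..." line followed by the indented entries. The function name permanent_errors is made up.

```shell
#!/bin/sh
# Sketch: list the entries reported under "errors: Permanent errors ...".
# Reads `zpool status -v` output on stdin; prints one entry per line.
# permanent_errors is an illustrative name only.

permanent_errors() {
    awk '
        /^errors: Permanent errors/ { found = 1; next }
        found && NF { sub(/^[ \t]+/, ""); print }
    '
}
```

For example, zpool status -v zones | permanent_errors prints nothing if the pool reports no permanent errors.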