So you have a failed disk in a ZFS pool and you want to fix it? Routine disk failures are really a non-event with ZFS because the volume management makes replacing them so dang easy. In many cases, unlike hardware RAID or older volume management solutions, the replacement disk doesn’t even need to be exactly the same as the original. So let’s get started replacing our failed disk. These instructions will be for a Solaris 10 system, so a few of the particulars related to unconfiguring the disk and device paths will vary with different flavors of UNIX.
First, take a look at the zpools to see if there are any errors. The -x flag will only display status for pools that are exhibiting errors or are otherwise unavailable.
Note: If the disk is actively failing (a process that sometimes takes a while as the OS offlines it), any commands that use storage related system calls will hang and take a long time to return. These include “zpool” and “format”, so just be patient; they will eventually return.
# zpool status -x
  pool: data
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        data        DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            c1t4d0  ONLINE       0     0     0
            c1t5d0  FAULTED      1    81     0  too many errors
          mirror-1  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0

errors: No known data errors
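If you want to script this check, here is a minimal sketch. It assumes the device table in the "zpool status" output keeps the layout shown above, with the device name in the first column and the state in the second. The function name faulted_devices is hypothetical, and the list of states it looks for is an assumption you may want to adjust. It is written for a POSIX shell and awk; on Solaris 10 that probably means nawk or /usr/xpg4/bin/awk rather than the old /usr/bin/awk.

```shell
#!/bin/sh
# faulted_devices (hypothetical helper): read `zpool status` output on stdin
# and print the name of every device in the config table whose state is
# FAULTED, UNAVAIL, REMOVED or OFFLINE.
faulted_devices() {
    awk '
        # The config table starts at the "NAME STATE ..." header line.
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }
        # The "errors:" line marks the end of the table.
        $1 == "errors:"               { in_config = 0 }
        # Inside the table, column 1 is the device and column 2 its state.
        in_config && NF >= 2 && $2 ~ /^(FAULTED|UNAVAIL|REMOVED|OFFLINE)$/ { print $1 }
    '
}

# Usage on the affected host:
#   zpool status -x | faulted_devices
```

Pools and mirrors in a DEGRADED state are deliberately skipped, so only the leaf devices that need attention are printed.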
So we can easily see that c1t5d0 has failed. Take a look at the “format” output to get the particulars about the disk:
# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c1t0d0
          /pci@0/pci@0/pci@2/scsi@0/sd@0,0
       1. c1t1d0
          /pci@0/pci@0/pci@2/scsi@0/sd@1,0
       2. c1t2d0
          /pci@0/pci@0/pci@2/scsi@0/sd@2,0
       3. c1t3d0
          /pci@0/pci@0/pci@2/scsi@0/sd@3,0
       4. c1t4d0
          /pci@0/pci@0/pci@2/scsi@0/sd@4,0
       5. c1t5d0
          /pci@0/pci@0/pci@2/scsi@0/sd@5,0
Specify disk (enter its number):
Get your hands on a replacement disk that is as similar as possible to the failed one, which here is a SEAGATE-ST914602SSUN146G-0603-136.73GB (the new disk must be at least as large as the old one). I was only able to dig up a HITACHI-H103014SCSUN146G-A2A8-136.73GB, so I’ll be using that instead of a direct replacement.
Next, use “cfgadm” to look at the disks you have and their configuration status:
# cfgadm -al
Ap_Id                Type         Receptacle   Occupant     Condition
c1                   scsi-sata    connected    configured   unknown
c1::dsk/c1t0d0       disk         connected    configured   unknown
c1::dsk/c1t1d0       disk         connected    configured   unknown
c1::dsk/c1t2d0       disk         connected    configured   unknown
c1::dsk/c1t3d0       disk         connected    configured   unknown
c1::dsk/c1t4d0       disk         connected    configured   unknown
c1::dsk/c1t5d0       disk         connected    configured   unknown
We want to replace c1t5d0, so we prepare it for removal by unconfiguring it:
# cfgadm -c unconfigure c1::dsk/c1t5d0
The “safe to remove” LED should turn on, and you can then pull the disk, remembering to allow it several seconds to spin down. Insert the new disk and take a look at the “cfgadm -al” output again to confirm that it has been configured automatically. If it has not, you can configure it manually:
# cfgadm -c configure c1::dsk/c1t5d0
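To check the state of a single disk without scanning the whole listing by eye, something like the following sketch could help. It assumes the “cfgadm -al” column order shown earlier (Ap_Id, Type, Receptacle, Occupant, Condition), and the function name occupant_state is hypothetical. As with any awk script that uses -v, on Solaris 10 you would likely need nawk or /usr/xpg4/bin/awk.

```shell
#!/bin/sh
# occupant_state DISK (hypothetical helper): read `cfgadm -al` output on
# stdin and print the Occupant column (configured / unconfigured) for the
# attachment point whose Ap_Id ends in "/DISK", e.g. c1::dsk/c1t5d0.
occupant_state() {
    awk -v disk="$1" '
        {
            # Split the Ap_Id on "/" and compare its last component.
            n = split($1, parts, "/")
            if (n > 1 && parts[n] == disk) print $4
        }
    '
}

# Usage on the affected host:
#   cfgadm -al | occupant_state c1t5d0
```

If this prints “unconfigured” after the new disk is in the slot, run the “cfgadm -c configure” command shown above.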
Now, it’s a simple matter of a quick “zpool replace” to get things rebuilding:
# zpool replace data c1t5d0
You can use the output of zpool status to watch the resilver process…
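If you would rather not keep rerunning the command by hand, here is a small polling sketch. It assumes the status output contains the phrase “resilver in progress” while the rebuild is running, which you should confirm against your own system’s output. The function names resilver_in_progress and wait_for_resilver and the 60-second default interval are hypothetical choices, not part of ZFS.

```shell
#!/bin/sh
# resilver_in_progress (hypothetical helper): read `zpool status` output on
# stdin; exit 0 if it reports a running resilver, non-zero otherwise.
# Assumption: the status text contains the phrase "resilver in progress".
resilver_in_progress() {
    grep "resilver in progress" >/dev/null
}

# wait_for_resilver POOL [INTERVAL] (hypothetical helper): poll the pool
# every INTERVAL seconds (default 60) until no resilver is reported, then
# print the final status once.
wait_for_resilver() {
    pool=$1
    interval=${2:-60}
    while zpool status "$pool" | resilver_in_progress; do
        sleep "$interval"
    done
    zpool status "$pool"
}

# Usage on the affected host:
#   wait_for_resilver data 60
```

Keep in mind the note near the top: “zpool” commands can be slow to return while a disk is failing, so a generous polling interval is the kinder choice.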