david.fdez.camb at gmail
Apr 9, 2012, 5:03 PM
Recovering a RAID 4 aggregate on FAS2020
In short, this is our present (sad) toaster situation: we had a FAS2020 system with two controllers, a base shelf of 12 SAS 300 GB disks and an external shelf of 14 SATA 1 TB disks. We had, among others, an aggregate made of 6 SATA disks configured as RAID4 (5 data disks + 1 parity disk), with 1 spare disk available.
One week ago, one of the disks failed. The system replaced it with the spare disk and began to reconstruct the RAID. However, 20 hours later, before the reconstruction had finished, another disk failed. The reconstruction should have completed within that period, since the system was not loaded at all (from what I've read it should have taken 10-12 hours), but unfortunately it took longer.
When the second disk failed, the first controller performed a takeover and set the aggregate offline. After we contacted NetApp support, they forced a giveback, which led to the following situation:
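As a rough sanity check on those durations, the sketch below computes the average rebuild rate each one implies. It assumes the whole per-disk "Used" size of 847555 MB (taken from the status listing further down) has to be rebuilt at a constant rate, which is a simplification; `required_rate_mb_s` is a hypothetical helper name, not anything from Data ONTAP.

```python
USED_MB = 847_555  # per-disk "Used" size in MB, from the status listing


def required_rate_mb_s(size_mb, hours):
    """Average rate (MB/s) needed to rebuild size_mb in the given hours."""
    return size_mb / (hours * 3600)


for hours in (10, 12, 20):
    rate = required_rate_mb_s(USED_MB, hours)
    print(f"{hours:>2} h -> {rate:.1f} MB/s")
# 10 h -> 23.5 MB/s
# 12 h -> 19.6 MB/s
# 20 h -> 11.8 MB/s
```

Under that assumption, a rebuild still unfinished after 20 hours was averaging below roughly 11.8 MB/s, about half the rate a 10-12 hour rebuild would need.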
Aggregate aggr1 (failed, raid4, partial) (block checksums)
  Plex /aggr1/plex0 (offline, failed, inactive)
    RAID group /aggr1/plex0/rg0 (partial)

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      parity    0a.22   0a    1   6   FC:A   -  ATA   7200 847555/1735794176 847827/1736350304
      data      0a.16   0a    1   0   FC:A   -  ATA   7200 847555/1735794176 847827/1736350304
      data      FAILED          N/A                        847555/1735794176
      data      0a.26   0a    1   10  FC:A   -  ATA   7200 847555/1735794176 847827/1736350304
      data      0a.24   0a    1   8   FC:A   -  ATA   7200 847555/1735794176 847827/1736350304
      data      0a.28   0a    1   12  FC:A   -  ATA   7200 847555/1735794176 847827/1736350304 (reconstruction 99% completed)

      Raid group is missing 1 disk.
That is, of a 6-disk RAID4 aggregate we have 4 disks OK, one 99% reconstructed (as the message suggests), and one broken.
My question is: is there any hope of recovering the data, or should we give up right away and restore everything from backups?
To make things worse, the two disks that failed are apparently working (at least they are detected by the system; it could have been a software problem), but an improper sequence of commands has partially 'zeroed' them.
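To illustrate why this state is so delicate, here is a toy sketch of single-parity recovery. It assumes plain XOR parity over equal-sized blocks, with 5 data blocks and 1 parity block per stripe to mirror rg0; it is a simplified model, not Data ONTAP's actual on-disk layout, and the function names are hypothetical.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


def rebuild_missing(stripe):
    """Rebuild the one missing block (marked None) of a stripe.

    `stripe` holds the data blocks plus the parity block. Raises ValueError
    unless exactly one block is missing, because a single XOR parity block
    cannot separate two unknowns.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        raise ValueError(f"cannot rebuild: {len(missing)} blocks missing")
    survivors = [block for block in stripe if block is not None]
    return missing[0], xor_blocks(survivors)


# Made-up contents for 5 data blocks, plus the parity computed from them.
data = [bytes([i] * 4) for i in range(1, 6)]
parity = xor_blocks(data)

# One block lost: it is recoverable from the other five.
stripe = data + [parity]
stripe[2] = None
print(rebuild_missing(stripe) == (2, data[2]))

# A second block lost in the same stripe: recovery is no longer possible.
stripe[4] = None
try:
    rebuild_missing(stripe)
except ValueError as err:
    print(err)
```

In this simplified model, stripes already rebuilt onto 0a.28 are missing only one block, while stripes in the unfinished 1% are missing two and cannot be recovered from parity alone.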
Any idea, comment or opinion is welcome. Thanks in advance,
Toasters mailing list
Toasters [at] teaparty