Mon, 08 Aug 2005
Storage array failure
We had a pretty bizarre disk array failure on Friday afternoon. We had ordered three new disks for this array in order to build a new volume for backups. When we put the disks into the array, it started to act funny - one of our RAID-1 volumes stopped responding to OS requests. It turned out that one of the new disks was not clean: it still contained metadata from the same type of storage array. So our array happily read the metadata from the new disk and concluded it had a new RAID-5 volume, with all but one of its disks missing, and with the same LUN as one of the pre-existing volumes. Moreover, removing the disk in question did not help, because the array controller had already cached the metadata in its database.
To make a long story short, 3.5 hours of downtime later, after numerous rounds of removing and re-inserting drives and clicking through the storage array manager, we had our storage back and working. The last few configuration steps had to be done with the storage manager from another vendor, as the original one stopped talking to the array, saying "Cannot read configuration". (We have two almost identical disk arrays from two well-known vendors, but both are in fact manufactured by LSILogic and branded by Big UNIX Vendors, so each can be more-or-less correctly configured with the other vendor's application.)
Why we were shipped an already-used disk, and why the array blindly reads the metadata from a newly-inserted disk even when doing so disrupts existing volumes, remain beyond my comprehension. Anyway, another not-so-well-spent Friday afternoon.
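The obvious defensive step for next time is to wipe every "new" disk before it goes anywhere near an array. A minimal sketch, assuming a Linux host where the disk shows up as an ordinary block device (the function name, the 1 MiB wipe size, and `/dev/sdX` are all my assumptions, not anything our array vendor documents):

```shell
#!/bin/sh
# Hypothetical sketch: zero the first and last MiB of a drive so that
# no stale array metadata survives before the drive is inserted.
wipe_metadata() {
    disk="$1"
    # Size in bytes: blockdev works for real block devices; fall back to
    # stat so the function can also be exercised on an ordinary file.
    size=$(blockdev --getsize64 "$disk" 2>/dev/null || stat -c %s "$disk")
    # First MiB: partition tables and most on-disk labels live here.
    dd if=/dev/zero of="$disk" bs=1048576 count=1 \
       conv=notrunc,fsync 2>/dev/null
    # Last MiB: some controllers keep their metadata at the end of the disk.
    dd if=/dev/zero of="$disk" bs=1048576 count=1 \
       seek=$(( size / 1048576 - 1 )) conv=notrunc,fsync 2>/dev/null
}

# Example (destructive! triple-check the device name first):
#   wipe_metadata /dev/sdX
```

Whether wiping both ends is sufficient depends on where a given controller stashes its metadata, so for a specific array the safer, slower option is still zeroing the whole drive.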