Mon, 31 Oct 2005

Power failure, storage failures, IBM support failures

We had a power outage last Wednesday, apparently caused by a faulty breaker or something like that. Even the UPSes and a generator were not able to bridge the outage, so the whole server room went down.

Aside from the usual problems like "this server booted up before that one, so this service was not working", two hard drives failed after the power outage (one on Thursday, and the second one on Friday). I suspect they were already faulty immediately after the power outage, and the storage array only discovered the failures while running some kind of internal tests on Thursday and Friday, respectively.

So, we have had to deal with IBM support again. We bought the storage from an IBM reseller, who claimed they could handle the support for us. However, we run into the same problems every time we try to handle a disk failure:

WTF? Why can't the reseller communicate with IBM themselves, as they promised? And what is worse, it seems that the reseller or the IBM hotline demands different parameters of our storage array every time - sometimes it is the serial number of the array itself, or the serial number of the drive, or the entire storage array profile, and the last one was some IBM part number which is not even visible remotely from the storage manager, and is only printed on the array itself (so we had to walk to the array, which is located in a remote server room). SGI's support is definitely better, although we had a similar problem last time (they demanded something called the "revision number" of the faulty drive, which is only printed on the drive itself and cannot be read remotely by the storage manager).

