Tue, 06 Sep 2005

Guess what has failed!

4:35 am - SMS arrived: the whole cluster is down. Getting up and going to the computer: strange, our router is perfectly OK, just the interface to that subnet seems to be down. Restarting the transceiver using mii-tool -R. No change. Stopping the interface via ifconfig and starting it up again - no change. mii-tool still reports that the link is down. Bad cable? Extremely improbable. A NIC or driver failure? Should I reboot the router?

A message from Nagios: the switch for that subnet is down. Indeed it does not respond to ping. The neighbouring switch reports that the port connected to that switch is down. Hmm, do we have a spare 24-port gigabit switch? Fortunately we do. I am packing up my laptop and leaving home.

In the server room. The switch indeed seems to be dead - no lights. Removing and re-inserting the power cord: no change. Hmm, it looks like the entire rack is dead: none of the computers seem to be powered up. Even the control light on the power outlet strip is off. Maybe the circuit breaker tripped? Trying to plug the power cord into another wall socket: no change. Bad power outlet strip? It seems so. Replacing it with the spare, and carefully powering up one server after another. Good. The master switch shows that the entire rack draws between 6 and 7 amps, while the outlet strip fuse is rated at 10 amps. Maybe the fuse was bad or something.

Fixing up a few minor problems, such as one of the servers not being configured to power itself up after a power loss. 6:12 am: all servers seem to be up. Unfortunately, the main database server cannot start the database. The error message in response to the oracle_start script is strange: "the database instance is already down". Talk about informative messages. Fortunately, our DBA has already woken up, so I am leaving this to him. Now if only I could find some time to sleep.

