Fri, 07 Jul 2006
We have an off-site backup server for the most important data. Several months ago it started to crash - and it crashed during the backup almost every Thursday morning.
At first we suspected the hardware. However, I was able to run
parallel kernel compiles for a week or so, with some disk-copying processes
in the background. The next suspect was the backups themselves:
we tried to isolate which of the backups flowing to this host was the
cause, but found nothing interesting. We checked our
cron(8) jobs, but nothing special was scheduled for
Thursday mornings only (the
cron.daily scripts run, well, daily,
and the cron.weekly scripts run on Sunday morning).
When upgrading the disks this Tuesday, I began to suspect the power system - my theory was that on Thursdays, some other server in the same room runs something power-demanding, which causes power instability, and our backup server crashes.
Yesterday the backup server crashed even without the backup actually running. I decided to re-check our cron jobs, and I found the cause of the problem: we run S.M.A.R.T. self-tests of our disk drives daily, and the script was written to run a "short" self-test every day except Thursdays - on Thursdays, it ran "long" self-tests. I wrote it this way so that in case of a faulty drive we would have two days (Thursday and a less-busy Friday) for fixing the problem. So I ran a "long" self-test on all six drives by hand, and the server crashed within an hour.
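The cron script in question might have looked roughly like this - a minimal sketch, assuming GNU date and smartctl from smartmontools; the device names are placeholders, not the actual configuration:

```shell
#!/bin/sh
# Hypothetical reconstruction of the daily S.M.A.R.T. self-test cron job.
# date +%u prints the ISO weekday: 1 = Monday ... 4 = Thursday ... 7 = Sunday.
if [ "$(date +%u)" -eq 4 ]; then
    kind=long       # Thursdays: thorough surface scan of each drive
else
    kind=short      # every other day: quick electrical/mechanical check
fi

# Kick off the self-test on all six drives (device names are assumptions)
for disk in /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf; do
    smartctl -t "$kind" "$disk"
done
```

Note that smartctl -t only starts the test; the drive runs it internally, so all six long tests end up running concurrently - which is exactly what triggered the crash.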
It seems the backup server has a weak power supply or something, and running the "long" self-test on all the drives at once was too much for it. So I have added a two-hour sleep between the self-test runs on individual drives, and we will see if that solves the problem. Otherwise I will have to replace the power supply. Another hardware mystery solved.
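The staggered version of the Thursday loop could be sketched like this - again an assumption about the actual script, with placeholder device names:

```shell
#!/bin/sh
# Hypothetical staggered variant: start the "long" self-test on one drive,
# then wait two hours before starting the next, so the six drives never
# draw their peak self-test load simultaneously.
for disk in /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf; do
    smartctl -t long "$disk"
    sleep 7200      # 2 hours = 7200 seconds between drives
done
```

A long self-test on a drive of that era typically finished well within two hours, so each test should complete before the next one begins.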