Yenya's World

Fri, 07 Jul 2006

Weekly Crashes

We have an off-site backup server for the most important data. Several months ago it started crashing, almost every Thursday morning, while the backup was running.

At first we suspected the hardware. However, I was able to run parallel kernel compiles for a week or so, with some disk-copying processes in the background, without the server crashing. The next suspects were the backups themselves: we tried to isolate which of the backups flowing to this host was the cause, but found nothing interesting. We also checked our cron(8) jobs, but nothing special was scheduled for Thursday mornings only (the cron.daily scripts run, well, daily, and the cron.weekly scripts run on Sunday morning).

While upgrading the disks this Tuesday, I began to suspect the power system. My theory was that on Thursdays some other server in the same room runs something power-hungry, which causes power instability, and our backup server crashes.

Yesterday the backup server crashed even though no backup was running. I decided to re-check our cron jobs, and found the cause of the problem: we run S.M.A.R.T. self-tests of our disk drives daily, and the script was written to run a "short" self-test every day except Thursday; on Thursdays, it ran "long" self-tests. I wrote it this way so that in case of a faulty drive we would have two days (Thursday and a less busy Friday) to fix the problem. So I ran a "long" self-test on all six drives by hand, and the server crashed within an hour.
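The post does not show the actual script, so here is a hypothetical Python sketch of the day-of-week logic it describes. It assumes the self-tests are started with `smartctl -t short|long <device>` from smartmontools (the post does not name the tool), and the device names `/dev/sda` to `/dev/sdf` are made up for illustration. Note that it starts the chosen test on every drive in one go, which is exactly the arrangement that turned out to be a problem on Thursdays.

```python
import datetime

THURSDAY = 3  # date.weekday(): Monday == 0 ... Sunday == 6


def selftest_type(day: datetime.date) -> str:
    """Pick the SMART self-test type: "long" on Thursdays, "short" otherwise."""
    return "long" if day.weekday() == THURSDAY else "short"


def selftest_commands(devices, day):
    """Build one smartctl command line per drive for the given day."""
    kind = selftest_type(day)
    return [["smartctl", "-t", kind, dev] for dev in devices]


if __name__ == "__main__":
    # Hypothetical device names for six drives; print the commands (dry run)
    # instead of executing them.
    drives = [f"/dev/sd{c}" for c in "abcdef"]
    for argv in selftest_commands(drives, datetime.date.today()):
        print(" ".join(argv))
```

Running it on a Thursday prints six `smartctl -t long ...` lines; on any other day it prints `-t short` instead.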

It seems the backup server has a weak power supply (or something similar), and running the "long" self-test on all the drives at the same time was too much for it. So I added a two-hour sleep between the self-test runs on individual drives; we will see if that solves the problem. Otherwise I will have to replace the power supply. Another hardware mystery solved.
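A minimal sketch of the staggered variant, again assuming `smartctl -t` is used and with a hypothetical function name. The key point is that `smartctl -t long` returns as soon as the drive has accepted the test, while the test itself keeps running inside the drive; starting the drives back to back therefore runs all the long tests concurrently, and the sleep between invocations is what spreads them out. The runner and sleep functions are injectable so the sequence can be checked without touching real disks.

```python
import subprocess
import time
from typing import Callable, List, Sequence

TWO_HOURS = 2 * 60 * 60  # seconds to wait before starting the next drive's test


def _run_smartctl(argv: List[str]) -> None:
    # Raise if smartctl reports an error starting the test.
    subprocess.run(argv, check=True)


def run_staggered_selftests(
    devices: Sequence[str],
    test_type: str = "long",
    gap_seconds: int = TWO_HOURS,
    run: Callable[[List[str]], object] = _run_smartctl,
    sleep: Callable[[float], object] = time.sleep,
) -> List[List[str]]:
    """Start a SMART self-test on each drive in turn, pausing between drives.

    Returns the command lines that were issued, in order.
    """
    issued = []
    for i, dev in enumerate(devices):
        if i > 0:
            sleep(gap_seconds)  # no pause before the first drive
        argv = ["smartctl", "-t", test_type, dev]
        run(argv)
        issued.append(argv)
    return issued
```

With six drives this takes about ten hours to start all tests. Whether a two-hour gap keeps the tests from overlapping depends on each drive; `smartctl -c` reports the drive's recommended polling time for the extended self-test, which is a reasonable lower bound for the gap.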



Yenya's World: Linux and beyond - Yenya's blog.



Jan "Yenya" Kasprzak
