Wed, 21 Dec 2005

Rotting bits

Yesterday, shortly after midnight, the X server on my workstation crashed in the middle of my work. Strange, I thought, but it was a nice opportunity to install a new kernel. So I downloaded 2.6.15-rc6, recompiled a new udev, and rebooted. The kernel crashed on me soon after the reboot, and one of the filesystems even ended up in an inconsistent state.

At first I suspected the new kernel option, CONFIG_CC_OPTIMIZE_FOR_SIZE. However, I was not able to recompile the kernel - gcc kept crashing with an internal error. I decided to boot into the previous kernel version. I booted into the single-user runlevel and ran fsck manually. It found some errors on my filesystems. I tried to recompile the kernel again without optimizing for size, but gcc crashed even under the older kernel. I started to suspect the hardware.

It turned out that the problem was one faulty DIMM - the computer did not survive even a few minutes of memtest86 (there were at least 100-200 bad bits in that memory module). I wonder how a DIMM can go bad in the middle of work, after a few weeks of uptime, inside the computer, with no physical movement of the computer, and without powering it down. There was no significant amount of dust either; the box was vacuum-cleaned a month or so ago.

The positive outcome is that on 2.6.15-rc6 I finally got S.M.A.R.T. working even on SATA disks (via smartctl -d ata /dev/sda).

