Tue, 11 Jul 2006
3ware Disk Latency
Odysseus with the new hardware seems to be pretty stable. However, there is still a problem: it seems that with the new 3ware 9550SX disk controller, the drives have much higher latency than they had with the older controller (7508).
The system apparently has a bigger overall throughput, but the latency sucks. It is most visible on Qmail: with the old setup, Qmail was able to send about 2-4k individual mails per 5 minutes. With the new setup, this number is in the low hundreds of messages per 5 minutes. At this rate, Odysseus is not even able to keep up with the incoming queue. After the new HW was installed, the mail queue delay grew to several days(!).
I have found this two-year-old message to LKML, where they try to solve the same
problem with disk latency. It seems that the 3ware driver allows up to
254 requests in flight to a single SCSI target, while the kernel's block layer
queue (nr_requests) is only 128 requests deep. This means
that the controller sucks all the outstanding requests to itself, and the
kernel's block request scheduler does not have an opportunity to do anything.
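
The arithmetic behind this can be sketched in a few lines of Python. This is a deliberately simplified model, not how the kernel actually accounts for requests: it assumes every request the controller can accept is dispatched immediately, and the function name is made up for illustration.

```python
def requests_left_for_scheduler(nr_requests, queue_depth):
    """Simplified model: requests the block layer can still hold and
    reorder once the controller has taken as many as it accepts."""
    return max(0, nr_requests - queue_depth)


# Numbers from the text: block layer queue of 128, controller accepts 254.
print(requests_left_for_scheduler(128, 254))  # 0
# With the per-target depth lowered to 4:
print(requests_left_for_scheduler(128, 4))    # 124
```

With a depth of 254 the scheduler is left with nothing to sort; with a depth of 4 almost the whole queue stays under its control.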
So I have lowered the per-target number of requests to 4, and disabled
NCQ on the most latency-sensitive drives (i.e. those which carry the
/var volume), and the performance looks much better now.
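
For reference, here is a minimal sketch of how the per-target depth could be inspected and lowered from Python. It assumes the driver exposes the depth as the sysfs attribute /sys/block/<dev>/device/queue_depth and that the block layer queue size is in /sys/block/<dev>/queue/nr_requests; whether the depth is writable depends on the driver and kernel version. The function names and the sysfs_root parameter are only illustrative, and the sketch does not cover NCQ, which is configured on the controller side.

```python
from pathlib import Path


def read_queue_settings(device, sysfs_root="/sys"):
    """Return the per-target queue depth and the block layer queue size."""
    base = Path(sysfs_root) / "block" / device
    return {
        "queue_depth": int((base / "device" / "queue_depth").read_text()),
        "nr_requests": int((base / "queue" / "nr_requests").read_text()),
    }


def set_queue_depth(device, depth, sysfs_root="/sys"):
    """Write a new per-target queue depth (needs root on a real system)."""
    path = Path(sysfs_root) / "block" / device / "device" / "queue_depth"
    path.write_text(f"{depth}\n")
```

Run as root, something like set_queue_depth("sda", 4) would apply the value used above, where "sda" is just a placeholder device name. The setting does not survive a reboot, so it would have to be reapplied from a boot script.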
I think the main difference between the old HW and the new one is that
the new controller has a much bigger cache, so it can allow more requests
in flight. So the kernel scheduler cannot prioritize the requests it considers
important, causing the overall latency to go up.
I hope I have solved the latency problem for now, but during summer holidays the FTP server load is usually lower, so the problem may return once the load goes up again.