Yenya's World

Mon, 31 Jul 2006

XFS Corruption

Odysseus has been hit by the infamous XFS endianness bug in the 2.6.17 kernel (for those who do not know: in some rare cases it can create filesystem corruption which xfs_repair still cannot fix).

I have tried to fix the filesystem according to the guidelines in the FAQ, but there must be yet another problem in my XFS volume or in the kernel, as it keeps crashing when I try to actually use that volume. Since Friday I have been trying to put Odysseus back in shape, so far without success. Now I am copying the local (non-mirrored) data to a spare disk, and I will recreate the entire volume from scratch.

I feel sorry for the XFS developers, as they have always been helpful when I had a problem with XFS. But I am afraid I will have to use something other than XFS. I am considering JFS (which was the fastest filesystem when I did my own testing), but I will probably go with ext3: in my experience it is a very stable filesystem, with one important feature: a rock-solid e2fsck, which can even fix a filesystem corrupted by a hardware bug (unlike reiserfsck, and unfortunately xfs_repair as well).

I hope the FTP service of ftp.linux.cz will be restored tomorrow.

Section: /computers (RSS feed) | Permanent link | 0 writebacks

Fri, 28 Jul 2006

Bloover

A coworker of mine showed me an interesting tool: Bloover. It is a security auditing tool for Bluetooth-enabled phones. It seems my Nokia has a huge security hole: Bloover running on his phone (it is a Java ME application) can download the whole contact list, the list of recent calls, and a few other things from my Nokia, even though the devices are not paired with each other and my phone is not set to be visible via Bluetooth.

I ended up disabling Bluetooth on my phone, and enabling it only when I need it. Now I have to find out whether this particular hole has been patched by Nokia, and whether they will provide new firmware for free. I am afraid they won't.

This is the problem with all closed-source devices: they cannot be fixed without the vendor's help. And some vendors are extremely unhelpful about fixing their devices (I have to name Cisco as well as Nokia here). HP, for example, does this better with their switches: while the firmware is not open source, they provide all firmware upgrades as free downloads from their web site.

This problem will become more and more common, as more and more devices have some sort of CPU and firmware inside. So I wonder what my next mobile phone should be, so that I do not fall into the same firmware upgrade trap. Maybe some Linux-based Motorola.

Section: /computers (RSS feed) | Permanent link | 0 writebacks

Thu, 27 Jul 2006

Hotmail and UTF-8

A new user of IS MU has redirected her e-mail to Hotmail, and complained that mail from IS MU is not displayed correctly (the diacritical characters were wrong). I wondered whether this was true, and even tried to create a Hotmail account for testing purposes.

After accepting at least 50 cookies from their authentication service, they finally accepted my registration. However, I could not log in: the web site displayed some internal error message (when I tried again, it just returned me to the login screen). I then tried MSIE from our remote desktop server, and I was able to log in. So it seems MSIE is the only browser allowed to access Hotmail.

I then sent a test mail to this new mailbox. It contained text in Czech, and two words in Japanese (Katakana and Kanji). It ended up in the spambox of my Hotmail account, and even there it was not displayed correctly. So it seems that Hotmail cannot correctly handle mail in Czech or Japanese. I have not found a "display all headers" option in Hotmail, so I cannot even verify whether the headers arrived at Hotmail intact.

Moreover, the mail from that IS MU user (sent from Hotmail) was itself broken: it contained characters outside of US-ASCII, yet the headers made no mention of the transfer encoding, MIME version, or character set of the mail body. Talk about respecting widely-accepted, well-established standards.
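For the record, a correctly labelled non-ASCII mail declares all three of those headers. A minimal sketch using Python's email library (the addresses, subject, and sample text are made up; IS MU itself is a Perl system):

```python
# Sketch of a correctly labelled non-ASCII mail: MIME-Version,
# Content-Type with a charset, and Content-Transfer-Encoding are all
# declared in the headers (addresses and text are made-up examples).
from email.mime.text import MIMEText

body = "Příliš žluťoučký kůň\nカタカナ 漢字\n"
msg = MIMEText(body, "plain", "utf-8")   # sets charset + transfer encoding
msg["From"] = "test@example.org"
msg["To"] = "someone@example.com"
msg["Subject"] = "Test mail"

print(msg["MIME-Version"])               # 1.0
print(msg["Content-Type"])               # text/plain; charset="utf-8"
print(msg["Content-Transfer-Encoding"])  # base64
```

A mail missing these headers is, per RFC 2045, plain US-ASCII by definition, so a receiving client has no standard way to guess the right charset.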

I have recommended that the user switch to a different mail service, preferably one that can be standards-compliant.

Section: /computers (RSS feed) | Permanent link | 0 writebacks

Mon, 24 Jul 2006

Netbox Voice

After some time of using Ekiga as my softphone, I have decided to acquire a public phone number, reachable even from the PSTN. There are definitely cheaper providers, but I have chosen Netbox, my ISP. They can put the VoIP traffic into a different band of their bandwidth limiter, so I can use my network connection at full speed while using the softphone without loss of quality.

I use the software phone only - I do not want another black box (read: a Linksys VoIP gateway) at home. Ekiga is pretty easy to use, and my PC is always on anyway.

There were some problems with setting it up, though.

So after a year or so we again have a "land line", this time without a monthly fee, with much lower call rates than Český Telecom (now Telefónica) offers, and with an immediately available call history on Netbox's customer web pages.

Section: /computers (RSS feed) | Permanent link | 0 writebacks

Wed, 19 Jul 2006

A Slightly Better Wheel

In the world of open source software, one can often spot a phenomenon which I hereby name "A Slightly Better Wheel Syndrome(tm)". I often see it in the bachelor's projects of our students, but sometimes also in the work of my colleagues or other computer professionals. The Slightly Better Wheel Syndrome is - strictly speaking - reinventing the wheel. It is less harmful than plain old wheel reinvention, because the result is - well - slightly better than the original. Except that often, in the big picture, it is not. Today I have seen an outstanding example of this syndrome.

I have read an article about a driving simulation game named VDrift. I wanted to try it, and (because it is not in Fedora Extras) I wanted to package it. So I downloaded the source and tried to compile it. There was no Makefile, no configure, nothing familiar. So I read the docs, and found that SCons is used instead of make to build VDrift.

I have tried to find out WTF SCons is, and why I should use it instead of make. They have a section titled "What makes SCons better?" on their home page: almost all the features listed there fall into the category "make+autotools can do this as well (sometimes in a less optimal way)". Nothing exceptional that would justify writing yet another make replacement. What they do not tell you are the drawbacks. And those are pretty serious:

The first one is that virtually everybody is familiar with make - every programmer, many system admins, etc. When something fails, it is easy to find the right part of the Makefile which needs to be fixed (this is true even for generated Makefiles, such as automake output). Everybody can do at least a band-aid fix.

The second problem is that the SConstruct files (an equivalent of a Makefile) are in fact Python scripts, interpreted by Python. So the errors you get from SCons are not ordinary errors; they are cryptic Python backtraces. I got the following one when trying to build VDrift:

TypeError: __call__() takes at most 4 arguments (5 given):
  File "SConstruct", line 292:
    SConscript('data/SConscript')
  File "/usr/lib/scons/SCons/Script/SConscript.py", line 581:
    return apply(method, args, kw)
  File "/usr/lib/scons/SCons/Script/SConscript.py", line 508:
    return apply(_SConscript, [self.fs,] + files, {'exports' : exports})
  File "/usr/lib/scons/SCons/Script/SConscript.py", line 239:
    exec _file_ in stack[-1].globals
  File "data/SConscript", line 21:
    SConscript('tracks/SConscript')
  File "/usr/lib/scons/SCons/Script/SConscript.py", line 581:
    return apply(method, args, kw)
  File "/usr/lib/scons/SCons/Script/SConscript.py", line 508:
    return apply(_SConscript, [self.fs,] + files, {'exports' : exports})
  File "/usr/lib/scons/SCons/Script/SConscript.py", line 239:
    exec _file_ in stack[-1].globals
  File "data/tracks/SConscript", line 10:
    env.Distribute (bin_dir, 'track_list.txt.full', 'track_list.txt.minimal')
  File "/usr/lib/scons/SCons/Environment.py", line 149:
    return apply(self.builder, (self.env,) + args, kw)

So what's going on here? With make, there would be a simple syntax error with a line number. With SCons, there is a cryptic Python backtrace, written in an order reverse to what everybody else (gdb, the Linux kernel, Perl, etc.) uses. Line 149 in Environment.py is this:

148:     def __call__(self, *args, **kw):
149:         return apply(self.builder, (self.env,) + args, kw)

So what is the error message about? __call__() is defined with three parameters, yet the message complains that it takes at most four and has been given five (Python counts the implicit self in those numbers). Moreover, it is apparently called in some magic way (there is no explicit call to the __call__() function) from data/tracks/SConscript line 10, which is a call with three arguments - and a call to something different than that __call__() function. There is no way to fix the problem without deep knowledge of Python and SCons.
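The off-by-one argument counts can be reproduced outside of SCons. A small standalone demonstration (the class here is made up and has nothing to do with SCons; the exact wording of the TypeError differs between Python versions):

```python
# The implicit "self" inflates the argument counts in the TypeError:
# a __call__ defined with three explicit parameters is reported with
# counts one higher, because self is included. (Made-up example class,
# not SCons code; message wording varies across Python versions.)
class Builder:
    def __call__(self, env, source, target=None):
        return (env, source, target)

b = Builder()
print(b("env", "a.txt"))            # normal call works fine

try:
    b("env", "a.txt", "b.txt", "x")  # one positional argument too many
except TypeError as e:
    print(e)                         # counts include the hidden self
```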

I have googled the error message and found this thread, which said that it is possible to build VDrift after commenting out line 10 in data/tracks/SConscript. But I still have no idea what was wrong, and whether some edit of line 10 would have been better than commenting it out.

So SCons is definitely another example of the Slightly Better Wheel Syndrome. In a hypothetical SCons authors' Ideal World(tm), where everybody uses SCons and everybody knows Python, SCons might have been better than the make+autotools combo - but in the Real World(tm), no way.

It is definitely harder to make an existing solution fit your needs than to rewrite it from scratch, because it is harder to read other people's code than to write your own. With an existing solution, however, one often gains more flexibility, maintainability, and features which "I just don't need" (read: don't need now, but which might be helpful in the future).

So the moral is: please, please! Try to use (and maybe even improve) an existing widespread solution, even when at first sight it does not exactly fit your needs. Do not reinvent a Slightly Better (in fact, sometimes much worse) Wheel. The world does not need yet another slightly better, yet in fact broken, PHP-based discussion board, PHP-based photo gallery, or make replacement.

Section: /computers (RSS feed) | Permanent link | 0 writebacks

Tue, 18 Jul 2006

Comma-Separated Values?

While migrating IS MU to UTF-8, I rewrote the code for exporting tabular data to a CSV file for MS Excel, factoring it out into a separate module. While I was at it, I also added the Content-Disposition header, so that the exported file is saved under a sane filename instead of the default some_application.pl. So now the Excel exports are saved as files ending with the .csv suffix. Which is, interestingly enough, the source of problems and incompatibilities with MS Excel.
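The export itself boils down to two things: a TAB-delimited body and a Content-Disposition header carrying the filename. A rough CGI-style sketch in Python (the real IS MU module is in Perl, and the field names and data here are invented):

```python
# CGI-style sketch of the export: TAB-delimited rows plus a
# Content-Disposition header that fixes the downloaded filename.
# (The real IS MU module is in Perl; rows and filename are examples.)
rows = [
    ["name", "uco", "points"],
    ["Jan Novák", "12345", "42"],
]

def excel_export(rows, filename="export.csv"):
    headers = (
        "Content-Type: text/csv; charset=windows-1250\r\n"
        'Content-Disposition: attachment; filename="%s"\r\n' % filename
    )
    body = "".join("\t".join(row) + "\r\n" for row in rows)
    return headers + "\r\n" + body

print(excel_export(rows))
```

Without the Content-Disposition header, the browser falls back to deriving the filename from the URL - hence the some_application.pl default.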

As I have verified, when I save the CSV file as file.pl, Excel reads it correctly: it asks whether the TAB character is the field separator (indeed it is), whether Windows-1250 is the file encoding (it is), and happily imports the file. When the same file is named file.csv, Excel opens it without any questions, but somehow does not recognize the TAB character as the field separator. So all the fields are merged into the first column, and the TAB characters are displayed as those ugly rectangles.

When I try to separate the fields with semicolons, Excel happily opens the file (when named *.csv), but under any other file name it is necessary to explicitly choose the semicolon as the separator. Just another example of MS stupidity - why can't the separator be the same regardless of the file name? And by the way, what does CSV stand for? Comma-separated values? Colon-separated values? It works for neither commas nor colons; only semicolons are detected correctly. Maybe it is some kind of newspeak invented by Microsoft.

I guess I will keep the exports TAB-delimited, and just change the file name in the Content-Disposition header to use the .txt extension instead (although something like .its_csv_you_stupid_excel would probably be more appropriate).

Section: /computers (RSS feed) | Permanent link | 5 writebacks

Mon, 17 Jul 2006

IS in UTF-8

Since Friday, our Information System has been running with UTF-8 support even at the application layer. Finally, the work which took most of my time is almost finished. Now we are fixing the parts of the system which do not run directly under Apache (cron jobs, etc.), and minor glitches which survived our prior testing.

We do not allow arbitrary characters everywhere, because we must keep some attributes in a form suitable for printing through TeX or exporting to external systems, which are mostly ISO 8859-2 or Windows-1250 based. We do, however, allow almost all Latin-1 and Latin-2 characters in most applications.
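The restriction can be checked mechanically: a character is acceptable when it still maps into one of the legacy charsets used downstream. A hedged Python sketch of such a filter (the real IS MU code is Perl, and its exact rules surely differ from this one-liner idea):

```python
# Sketch: accept a character only if it can be represented in one of
# the legacy encodings still used downstream (TeX print-outs, exports
# to external systems). The real IS MU rules are more involved.
LEGACY_ENCODINGS = ("iso-8859-1", "iso-8859-2", "windows-1250")

def _encodes(ch, enc):
    try:
        ch.encode(enc)
        return True
    except UnicodeEncodeError:
        return False

def exportable(ch):
    return any(_encodes(ch, enc) for enc in LEGACY_ENCODINGS)

print(exportable("č"))   # True  - a Latin-2 letter
print(exportable("カ"))  # False - Katakana has no legacy mapping
```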

While it has been hard to convert the whole system to UTF-8, I must say that the UTF-8 support in Perl is well architected (and from what I have read, definitely better than in other scripting languages).

Section: /computers (RSS feed) | Permanent link | 2 writebacks

Tue, 11 Jul 2006

3ware Disk Latency

Odysseus with the new hardware seems to be pretty stable. However, there is still a problem: it seems that with the new 3ware 9550SX disk controller, the drives have much higher latency than they had with the older controller (7508).

The system apparently has a higher overall throughput, but the latency sucks. It is most visible with Qmail: with the old setup, Qmail was able to send about 2-4k individual mails per 5 minutes. With the new setup, this number is in the low hundreds of messages per 5 minutes. At this speed, Odysseus is not even able to keep up with the incoming queue. After the new HW was installed, the delay in the mail queue was several days(!).

I have found a two-year-old message to LKML where they try to solve the same problem with disk latency. It seems that the 3ware driver allows up to 254 requests in flight to a single SCSI target, while the kernel's block layer queue (nr_requests) is only 128 requests deep. This means that the controller sucks all the outstanding requests into itself, and the kernel's block request scheduler does not get an opportunity to do anything.

So I have lowered the per-target number of requests to 4, and disabled NCQ on the most latency-sensitive drives (i.e. those which carry the /var volume), and the performance looks much better now. I think the main difference between the old HW and the new one is that the new controller has a much bigger cache, so it can allow more requests in flight. The kernel scheduler then cannot prioritize the requests it considers important, causing the overall latency to go up.
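For reference, both knobs are exposed through sysfs on 2.6-era kernels; a hedged sketch of the tuning (the device name is an example, and the exact paths may differ between kernel versions and drivers):

```shell
# Hedged sketch - sda is an example device; paths are as exposed by
# 2.6-era kernels and may differ on other setups.

# Lower the per-target number of in-flight requests
# (a queue depth of 1 would effectively disable NCQ entirely):
echo 4 > /sys/block/sda/device/queue_depth

# The kernel's block-layer queue depth, for comparison:
cat /sys/block/sda/queue/nr_requests    # typically 128
```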

I hope I have solved the latency problem for now, but during the summer holidays the FTP server load is usually lower, so the problem may yet return.

Section: /computers (RSS feed) | Permanent link | 0 writebacks

Fri, 07 Jul 2006

Weekly Crashes

We have an off-site backup server for the most important data. Several months ago it started to crash - and it crashed during the backup almost every Thursday morning.

At first we suspected the hardware. However, I was able to run parallel kernel compiles for a week or so, with some disk-copying processes in the background. The next suspect were the backups themselves: we tried to isolate which of the backups flowing to this host was the cause, but there was nothing interesting. We checked our cron(8) jobs, but there was nothing special scheduled for Thursday mornings only (the cron.daily scripts run, well, daily, and the cron.weekly scripts run on Sunday morning).

When upgrading the disks this Tuesday, I began to think that there was a problem with the power system - my theory was that on Thursdays some other server in the same room runs something power-demanding, which causes power instability and crashes our backup server.

Yesterday the backup server crashed even without a backup actually running. I decided to re-check our cron jobs, and I found the cause of the problem: we run daily S.M.A.R.T. self-tests of our disk drives, and the script was written to run a "short" self-test every day except Thursday - on Thursdays it ran "long" self-tests. I wrote it this way so that in case of a faulty drive we would have two days (Thursday and a less busy Friday) to fix the problem. So I tried to run a "long" self-test on all six drives by hand, and the server crashed within an hour.

It seems the backup server has a weak power supply or something, and running the "long" self-test on all the drives at once was too much for it. So I have added a two-hour sleep between the self-test runs on the individual drives, and we will see whether that solves the problem. Otherwise I will have to replace the power supply. Another hardware mystery solved.
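The fix amounts to a small change in the cron script: instead of firing all six long tests at once, space them out. A hedged shell sketch (the drive list and the two-hour gap are assumptions; smartctl starts the drive's self-test in the background and returns immediately):

```shell
#!/bin/sh
# Sketch of the staggered "long" self-tests: two hours between drives,
# so the power supply never sees all six tests running at once.
# (Drive list is an example - adjust to the real devices.)
for drive in sda sdb sdc sdd sde sdf; do
    smartctl -t long "/dev/$drive"
    sleep 7200    # two hours before starting the next drive
done
```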

Section: /computers (RSS feed) | Permanent link | 0 writebacks

Mon, 03 Jul 2006

New hardware in Odysseus

After several months of having the new disks and controller sitting on my desk, I have finally managed to install them in Odysseus. Now Odysseus (ftp.linux.cz, ftp.cstug.cz, etc.) has a shiny new SATA-2 PCI-X controller (3ware 9550SX-8LP) with eight new 500 GB drives, almost doubling its previous storage capacity.

According to iostat(8), the new controller (with NCQ and bigger cache memory) is able to keep all the drives busy at a sustained 100% when the demand is high enough. The previous one - a 3ware 7508 - was apparently not able to distribute the load equally: when the load was high, the drives holding the busiest volume (/var) were at 100%, while the others peaked at about 75-80%.

I even had to upgrade my MRTG configuration, raising the maximum theoretical bandwidth of each drive. It seems the new drives have noticeably higher throughput. However, it feels like the latency of the drives has increased during load peaks. I am not sure what the cause is (maybe it is simply a throughput-for-latency tradeoff because of the bigger caches everywhere). From the graphs it seems that the imbalanced HDD utilization problem is gone (although it might have been a problem of a particular HDD firmware).

Another thing to note is that SATA cables require more space behind the drive, because the SATA connector is almost a centimeter deeper than the PATA one. I think I cracked the connector on one drive while trying to fit the cable between the drive and the fan behind it (but the drive fortunately works).

When upgrading the system, I screwed up while moving the data to the new host: I ran the following command on the upgraded system, which was still running on a temporary IP address:

rsync -aHSvx -e ssh --delete odysseus:/export/ /export/ && echo "OK"|mail -s rsync kas

This used the address 127.0.0.1 for odysseus (which was in /etc/hosts, because Anaconda sets it up this way). So the above command on the temporary host actually did not do anything, as it tried to synchronize the /export volume with itself. You can laugh at me now. Oh well.

Section: /computers (RSS feed) | Permanent link | 0 writebacks

About:

Yenya's World: Linux and beyond - Yenya's blog.

Jan "Yenya" Kasprzak
