Yenya's World

Thu, 14 Feb 2008

AMD versus Intel

For a long time, we have been using AMD Opterons and Athlon 64s for our web and application servers. Everybody says that Intel has made big progress in the last year or so, so I wondered whether an Intel architecture would be better than an AMD one for our upcoming distributed computing project. Benchmarks usually report things like raw memory throughput or video encoding/decoding (which can be done using SIMD instructions). But how do the architectures perform on heavily branched code?

A part of our project is sorting large quantities of data. We have chunks consisting of pairs of a 4-byte key and a 4-byte value, which have to be sorted according to the key. Since the data is being generated relatively slowly, I have decided to pre-sort it using a bucket sort into a set of 256 (for now) bucket files, then sort each bucket file separately, and finally concatenate the results. I have tried to measure how long the "sort all buckets" step will take on a single core:
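To make the scheme concrete, here is a minimal in-memory sketch of it in C. It is not the actual project code: the names (struct pair, bucket_of, bucket_sort_pairs) are made up for illustration, the buckets live in one output array instead of 256 files, and choosing the bucket by the top 8 bits of the key is my assumption, since the bucketing rule is not stated above. The library qsort() also stands in for the hand-written quicksort.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One record: a 4-byte key and a 4-byte value. */
struct pair {
    uint32_t key;
    uint32_t value;
};

#define NBUCKETS 256

/* Assumption: the bucket is chosen by the top 8 bits of the key. */
static unsigned bucket_of(uint32_t key)
{
    return key >> 24;
}

/* Compare two records by key only. */
static int cmp_pair(const void *a, const void *b)
{
    uint32_t ka = ((const struct pair *)a)->key;
    uint32_t kb = ((const struct pair *)b)->key;
    return (ka > kb) - (ka < kb);
}

/* Sort in[0..n) by key into out[0..n); in and out must not overlap. */
static void bucket_sort_pairs(const struct pair *in, struct pair *out, size_t n)
{
    size_t start[NBUCKETS + 1] = {0};
    size_t next[NBUCKETS];

    /* Pass 1: count the records that fall into each bucket. */
    for (size_t i = 0; i < n; i++)
        start[bucket_of(in[i].key) + 1]++;

    /* Prefix sums turn the counts into bucket start offsets. */
    for (unsigned b = 0; b < NBUCKETS; b++)
        start[b + 1] += start[b];

    /* Pass 2: scatter each record into its bucket (the "pre-sort"). */
    for (unsigned b = 0; b < NBUCKETS; b++)
        next[b] = start[b];
    for (size_t i = 0; i < n; i++)
        out[next[bucket_of(in[i].key)]++] = in[i];

    /* Pass 3: sort every bucket on its own. Because the buckets are
     * laid out in key order, the array as a whole is then sorted,
     * which corresponds to concatenating the sorted bucket files. */
    for (unsigned b = 0; b < NBUCKETS; b++)
        qsort(out + start[b], start[b + 1] - start[b], sizeof *out, cmp_pair);
}
```

Pass 3 is the part that corresponds to the "sort all buckets" step being timed; the first two passes only stand in for the slow, incremental writing of bucket files.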

Machine	cc -Os	cc -O6	cc -O6 w/o memcpy()
Athlon64 FX-51 2.2 GHz/1 MB L2	16.9s	12.5s	9.6s
Athlon64 x2 5600+ 2.8 GHz/2x 1 MB L2	12.5s	8.3s	7.1s
Pentium D 3.0 GHz/2 MB L2	9.6s	9.0s	8.8s

The first two variants used memcpy() inside the quicksort routine to swap two entries (in order to be prepared for a possible future variable data size); the last one used a single 64-bit instruction instead. There are two interesting observations here:

AMD is apparently slightly faster here: at -O6 the Athlon64 x2 beats the Pentium D, although the older FX-51 does not.
The -Os (optimize for size) GCC option is useless. I wonder why it is the default optimization option for kernel compiles nowadays.
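For reference, the two ways of swapping entries compared in the table could look roughly like this. This is a sketch and not the benchmarked code: struct kv, swap_memcpy(), swap_fixed() and the MAX_ENTRY_SIZE limit are hypothetical names introduced here. The fixed-size variant uses a plain struct assignment, which a compiler targeting a 64-bit machine can usually turn into single 64-bit loads and stores; that is an expectation, not something verified for every compiler and flag combination.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One 8-byte entry: a 4-byte key and a 4-byte value. */
struct kv {
    uint32_t key;
    uint32_t value;
};

/* Hypothetical upper bound on the entry size for the generic swap. */
#define MAX_ENTRY_SIZE 64

/* Generic swap: works for any entry size up to MAX_ENTRY_SIZE bytes,
 * at the cost of three memcpy() calls with a size known only at run
 * time. Returns 0 on success, -1 if the entry is too large. */
static int swap_memcpy(void *a, void *b, size_t size)
{
    unsigned char tmp[MAX_ENTRY_SIZE];

    if (size > sizeof tmp)
        return -1;
    memcpy(tmp, a, size);
    memcpy(a, b, size);
    memcpy(b, tmp, size);
    return 0;
}

/* Fixed-size swap: the entry size is known at compile time, so each
 * 8-byte struct copy can become one 64-bit load and one 64-bit store. */
static void swap_fixed(struct kv *a, struct kv *b)
{
    struct kv tmp = *a;
    *a = *b;
    *b = tmp;
}
```

The generic version keeps the door open for variable-size entries; the fixed version gives that up in exchange for the speedup seen in the last column.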

Another interesting part was the cache size effect: the four biggest buckets had 1088232, 1046624, 872792, and 776224 bytes, respectively. Sorting those four buckets took 2.26, 2.22, 0.63, and 0.21 seconds (on the above FX-51 machine). This means that somewhere around 800 KB of data, the algorithm could no longer fit the data into the L2 cache, resulting in a big slowdown: these four buckets together took more time to sort than the remaining 252 buckets, even though they contain only 2.23% of the total data size. I guess I will just use more (512?) buckets in the production version.
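The arithmetic behind this can be spelled out as follows. Two assumptions are mine: the per-bucket timings are taken to come from the 9.6 s run (cc -O6 without memcpy()) on the FX-51, which is the only FX-51 total in the table that the claim is consistent with, and the total data size is back-computed from the rounded 2.23% figure, so it is only approximate.

```latex
\begin{align*}
t_{\text{top 4}} &= 2.26 + 2.22 + 0.63 + 0.21 = 5.32\ \text{s} \\
t_{\text{rest}} &\approx 9.6 - 5.32 = 4.28\ \text{s} \\
S_{\text{top 4}} &= 1088232 + 1046624 + 872792 + 776224 = 3783872\ \text{bytes} \\
S_{\text{total}} &\approx \frac{3783872}{0.0223} \approx 1.7 \times 10^{8}\ \text{bytes} \\
\bar{S}_{256} &\approx \frac{1.7 \times 10^{8}}{256} \approx 6.6 \times 10^{5}\ \text{bytes}, \qquad
\bar{S}_{512} \approx \frac{1.7 \times 10^{8}}{512} \approx 3.3 \times 10^{5}\ \text{bytes}
\end{align*}
```

So with 256 buckets the average bucket is already fairly close to the size where the slowdown starts, while 512 buckets would roughly halve it, leaving much more headroom below the 1 MB L2 cache.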

So, what is your experience with compiler optimization settings, and with the speed of various CPUs and architectures?


3 replies for this story:

finn wrote:

As far as I know, -O3 is the best optimization level for gcc (you are using gcc, aren't you?).


