Benchmarking bitshifts


Introduction


Don't believe that our hand-coded assembler optimisation was worthwhile?


We always value feedback on our presentations, but we were more than a little surprised to hear that some readers were doubtful about the performance increase gained by our hand optimisation of the bit-shifting code in our Candlelit Console kernel patchset.


We were always confident of the benefits, but since people have shown an interest, we decided to take some time out to do some benchmarks, and see just what was going on.


The code in question


In the original article we mentioned:


> `The bit shifting calculation above is actually much easier in X86 assembler than it is in C.'


Find the original article here


Actually, the main point that we intended to make here was that the split register concept, which is largely specific to X86 CPUs, is not really taken advantage of by C compilers. It can nevertheless be made use of in cases where we are dealing with several 8-bit values packed into a 32-bit word, and where we happen to want to modify the element stored in the lower bits. In our case, that element was the blue channel of a 32-bit RGB value.
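

To make that concrete, the following is a hypothetical reconstruction in C of the kind of short test program involved, assuming the usual 0x00RRGGBB pixel layout with the blue channel in the lowest eight bits. It isn't the exact program we originally compiled, but it performs the same halving of the blue channel on a random value:


 #include <stdint.h>
 #include <stdio.h>
 #include <stdlib.h>

 int
 main(void)
 {
         uint32_t pixel, result;

         /* A random 24-bit RGB value, so that nothing can be constant folded away. */
         pixel = arc4random_uniform(0x1000000);

         /* Halve the blue channel, leaving the red and green channels untouched. */
         result = (pixel & 0xffff00) | ((pixel >> 1) & 0x7f);

         printf("%08x -> %08x\n", pixel, result);

         return (0);
 }

Inspecting the assembler that the compiler generates for something like this, for example with cc -O2 -S, gives listings of the kind shown below. Note that the compiler computes the same result with an addition, via lea, rather than a bitwise or; the two are equivalent here because the masks 0xffff00 and 0x7f don't overlap.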


The `best' code that the C compiler produced, given a short test program that did bit shifts on random numbers, seemed to be this:


 movl      %eax, %ecx              # Copy the value returned from arc4random_uniform from eax to ecx
 andl      $16776960, %ecx         # ecx & 0xffff00
 shrl      %eax                    # eax >> 1
 andl      $127, %eax              # eax & 0x7f
 leal      (%rax,%rcx), %esi       # esi = (rax + rcx)

It's difficult to believe that this could possibly be faster than a single opcode to shift the lower eight bits of the accumulator, even given the complex pipelining and caching that goes on within modern CPUs.
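

For reference, in the patchset the single-opcode version was written as in-line assembler within the C code. The sketch below is purely illustrative rather than the actual patch code, using GCC-style extended asm and a hypothetical function name:


 #include <stdint.h>

 /*
  * Shift only the lowest eight bits of a 32-bit value right by one place,
  * using a single shrb instruction on the low byte sub-register.
  * The upper 24 bits are left untouched.
  */
 static inline uint32_t
 halve_blue_asm(uint32_t pixel)
 {
         __asm__("shrb %b0" : "+q" (pixel));
         return (pixel);
 }

Compilers tend not to generate partial register writes like this themselves, presumably to avoid partial register stalls on some microarchitectures, which may be part of the reason the generated code above takes the longer route.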


Testing methodology


We'll test the two approaches in two different ways. First, a synthetic benchmark to test the raw speed of the corresponding instructions outside of a wider program. Next, we'll benchmark the bit-shifting within the actual rasops code where we originally put it for testing.


Note that in the final version of the Candlelit Console kernel patch, the bit-shifting was moved from the putchar32 function, where it was called for every single character output, into the device colour map initialisation code, which was then called every time our modified rasops_eraserows function was called. Whilst it was in the putchar function, the overhead could plausibly be significant; in the new location, any overhead would be almost undetectable on any reasonably modern hardware. Any performance gain from using assembler is therefore somewhat moot, but it's the principle that we're discussing here.


Synthetic benchmark take one


Building a purely synthetic benchmark is fairly straightforward. We just need to loop a fixed number of times and shift the lower eight bits right by one bit on each iteration of the loop.


 .section ".note", "a"
 .p2align 2
 .long 0x08
 .long 0x04
 .long 0x01
 .ascii "OpenBSD\0"
 .long 0x00
 .p2align 2
 .section .text
 .globl _start
 _start:
 /* Clear the accumulator */
 xor %rax, %rax
 /* %rbx is our loop counter */
 mov $0xFFFFFFFF, %rbx
 /* Now loop doing right shifts on the lower 8 bits of the accumulator: */
 _loop:
 shrb      %al
 dec       %rbx
 cmp       $0, %rbx
 jnz       _loop
 /* Exit */
 mov $0x01, %rax
 syscall

After assembling and linking the above program, we can time how long it takes to run to completion using the shell:


 # as -o benchmark1.o benchmark1.S
 # ld --Bstatic --no-pie benchmark1.o
 # time ./a.out
     0m00.95s real     0m00.95s user     0m00.00s system

Hand written assembler: 0m00.95s real
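

For context, 0xFFFFFFFF is roughly 4.3 thousand million iterations, so 0.95 seconds corresponds to about 4.5 thousand million iterations per second, or around 0.22 ns each. In other words, the simple loop is retiring in the region of one iteration per clock cycle, give or take the exact clock speed of the test machine.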


Synthetic benchmark take two


Now let's repeat the benchmark using code that follows what we saw the C compiler emit:


 .section ".note", "a"
 .p2align 2
 .long 0x08
 .long 0x04
 .long 0x01
 .ascii "OpenBSD\0"
 .long 0x00
 .p2align 2
 .section .text
 .globl _start
 _start:
 /* Clear the accumulator */
 xor %rax, %rax
 /* %rbx is our loop counter */
 mov $0xFFFFFFFF, %rbx
 /* Now loop doing right shifts on the lower 8 bits of the accumulator: */
 _loop:
 movl      %eax, %ecx
 andl      $0xffff00, %ecx
 shrl      %eax
 andl      $0x7f, %eax
 leal      (%rax, %rcx), %esi
 movl      %esi, %eax
 dec       %rbx
 cmp       $0, %rbx
 jnz       _loop
 /* Exit */
 mov $0x01, %rax
 syscall

Do we really expect this to be faster?


 # as -o benchmark2.o benchmark2.S
 # ld --Bstatic --no-pie benchmark2.o
 # time ./a.out
     0m03.02s real     0m03.03s user     0m00.00s system

Assembler based on C compiler output: 0m03.02s real
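

Running the same sums, 3.02 seconds over the same number of iterations is roughly 0.70 ns per iteration, about three times as long, which seems plausible for a loop body that now carries a chain of several dependent instructions.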


Analysing the results


The difference is quite obvious: the single op-code version of the bit-shifting code is about three times as fast on the test machine, 0.95 seconds against 3.02. Testing on two more X86 machines, the speed ratio was consistently almost identical, at about three to one.


Considering these results, we conclude that the hand assembler optimisation easily wins the synthetic benchmark on all of the test hardware.


Assembler: 1, C compiler: 0


Real world benchmarks


`But wait! That's a synthetic benchmark!'


OK, synthetic benchmarks are, after all, synthetic. Maybe real-world performance would be different. Maybe the shrb %al opcode will stall a pipeline, maybe our extract of assembler code in the second example wasn't typical of what the C compiler would produce in real usage, or perhaps the code just doesn't get called often enough in putchar32 to matter.


Fair enough, let's test the theory.


To do this, we'll use /bin/cat to display a very long text file on the terminal. Each character that is drawn causes a call to putchar32, which will invoke our bit-shifting code in the place where we originally put it for the first test of the colour shifting effect in the Candlelit Console patchset.


We'll perform the tests in single user mode, repeat them a total of five times, and discard the timings from the first run because the text file would have been read from disk rather than from the buffer cache.
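

The exact text file isn't important. As a hypothetical example, assuming that the filesystems have already been mounted read-write, the procedure amounts to something like this:


 # jot -b "the quick brown fox jumps over the lazy dog" 500000 > /tmp/longtext
 # time cat /tmp/longtext

The time cat step is then repeated until five sets of timings have been collected, with the first set discarded as described above.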


First, let's test performance without the bit-shifting code. Times listed are in seconds:


Kernel without bitshifting code:
Run #           real       user     system     notes
1               5.40       0        5.34       values discarded
2               5.27       0        5.28
3               5.27       0        5.28
4               5.32       0        5.33
5               5.37       0        5.38
Low:            5.27       0        5.28
High:           5.37       0        5.38
Mean average:   5.3075     0        5.3175

Now, with the inline assembler code added to putchar32:


Kernel with asm bitshifting code:
Run #           real       user     system     notes
1               5.54       0        5.49       values discarded
2               5.32       0        5.33
3               5.27       0        5.27
4               5.28       0        5.28
5               5.33       0        5.33
Low:            5.27       0        5.27
High:           5.33       0        5.33
Mean average:   5.3000     0        5.3025

Surprisingly, the benchmark runs very slightly faster with the extra opcode in place! The conclusion that we can probably draw from this result is that the difference is so small that it falls within the run-to-run margin of error. In other words, it's `in the noise'.
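

Indeed, the spread between the lowest and highest timings within each set of runs, between 0.06 and 0.10 seconds, is an order of magnitude larger than the 0.0075 second difference between the two mean values.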


Now, we'll test the C version of the code:


Kernel with C bitshifting code:
Run #            real      user     system     notes
1                5.40      0        5.33       values discarded
2                5.30      0        5.31
3                5.51      0        5.51
4                5.43      0        5.43
5                5.52      0        5.54
Low:             5.30      0        5.31
High:            5.52      0        5.54
Mean average:    5.4400    0        5.4475

The C implementation genuinely seems to be slower by about 2.5%.
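

Working from the mean values, 5.4400 seconds against 5.3000 for the assembler version is an increase of around 2.6%, and against 5.3075 for the kernel with no bit-shifting code at all, around 2.5%. Either way, the difference is far larger than the 0.0075 seconds that separated the other two kernels from each other.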


The C code is about 2.5% slower in real world usage than our hand-optimised assembler!


Closing thoughts


Our quick benchmarks seem to show that there really is performance to be gained by using hand-written in-line assembler in this particular case.


It's important to note, though, that moving the code from the putchar function to the device colour map initialisation function effectively makes this a moot point, as the bit-shifting is then no longer being done on a character-by-character basis, and so the performance of the frequently called putchar function should remain unaffected.


In general, micro-optimisations are best avoided if they detract in any significant way from the readability, clarity, or quality of the source code. In the case of the Candlelit Console patchset, a different approach eliminates the need for the in-line assembly, and we get the best of both worlds: performance and clearer source code. We left the assembly code in the final version of the patch mainly as an example for readers who are learning about kernel programming by reading it.


Nevertheless, there are times when absolute performance or code compactness is required, and in those moments there is often no substitute for hand-written assembler by experienced programmers. You know where to find us!


Assembler: 2, C compiler: 0


Copyright 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023 Exotic Silicon. All rights reserved.
