Tuesday, December 12, 2017

Benchmarking strchr vs memchr

Summary: memchr is faster, but the obvious implement seems to beat the builtin versions.

There are two related C functions for finding the next character in a string - strchr which assumes the string has a NUL character at the end, and memchr which takes the string length as an argument. For strings where you have the size and a NUL terminator, which is fastest? Using gcc 6.2.0 64bit MSYS2 on Windows 10, searching for a single byte 10M bytes along a string, the times were (fastest to slowest):

  • 11.05ms memchr implemented the obvious way.
  • 14.82ms strchr implemented the obvious way.
  • 14.96ms memchr provided by GCC.
  • 19.63ms strchr provided by GCC.

Trying on 3 different Windows computers, the results are all similar (but scaled).

Given the choice, you should prefer memchr over strchr.

Surprise result

The optimised implementations shipped with GCC are slower than the obvious C implementations taken from a wiki. I have absolutely no idea why. From what I can tell, the builtin versions are coded in assembly, operating on multiple bytes at a time, using SSE instructions. In contrast, the C variants operate on a single byte at a time, and aren't vectorised by the optimiser according to Godbolt. If anyone has an explanation I'd be keen to hear it.

Benchmark Code

To benchmark the variants I wrote a Haskell program using criterion. The full code and build instructions are available in this gist. I compiled the C code with -O3, using the gcc shipped with GHC 8.2.1. I've reproduced the Haskell code below, with some comments:

-- Import all the necessary pieces
import qualified Data.ByteString as BS
import qualified Data.ByteString.Unsafe as BS
import Criterion.Main
import Foreign
import Foreign.C.Types
import Data.Monoid

-- Make all the C imports
foreign import ccall unsafe "string.h memchr" memchr_std :: Ptr Word8 -> CInt -> CSize -> IO (Ptr Word8)
foreign import ccall unsafe "string.h strchr" strchr_std :: Ptr Word8 -> CInt -> IO (Ptr Word8)
foreign import ccall unsafe memchr_c :: Ptr Word8 -> CInt -> CSize -> IO (Ptr Word8)
foreign import ccall unsafe strchr_c :: Ptr Word8 -> CInt -> IO (Ptr Word8)

-- Method for ignoring the size when using strchr
ignoreSize f a b _ = f a b

-- Build a suitable string with an interesting character i bytes along
cstr i = BS.replicate i 32 <> BS.singleton 64 <> BS.replicate i 32 <> BS.singleton 0

-- The functions to benchmark
funs =
    [("memchr_std", memchr_std)
    ,("strchr_std", ignoreSize strchr_std)
    ,("memchr_c", memchr_c)
    ,("strchr_c", ignoreSize strchr_c)]

-- The main function, using Criterion
main = defaultMain
    [ seq bs $ bench (show i ++ " " ++ name) $ whnfIO $ test fun bs
    | i <- [1,10,100,1000,10000,100000,1000000,10000000]
    , let bs = cstr i
    , (name, fun) <- funs]

-- The function under test and input string
{-# NOINLINE test #-}
test fun bs =
    BS.unsafeUseAsCStringLen bs $ \(ptr,len) ->
        fun (castPtr ptr) 64 (fromIntegral len)

5 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. On my machine Intel(R) Core(TM) i5-3317U CPU @ 1.70GHz I obtains those results with gcc 7.1.1 on linux.


    benchmarking 10000000 memchr_std
    time 673.7 μs (667.4 μs .. 681.0 μs)
    0.999 R² (0.999 R² .. 1.000 R²)
    mean 670.0 μs (666.1 μs .. 674.0 μs)
    std dev 13.07 μs (11.38 μs .. 15.32 μs)

    benchmarking 10000000 strchr_std
    time 844.6 μs (831.2 μs .. 858.7 μs)
    0.998 R² (0.997 R² .. 0.999 R²)
    mean 830.4 μs (824.6 μs .. 839.2 μs)
    std dev 23.82 μs (18.38 μs .. 32.75 μs)
    variance introduced by outliers: 18% (moderately inflated)

    benchmarking 10000000 memchr_c
    time 8.169 ms (8.043 ms .. 8.325 ms)
    0.998 R² (0.997 R² .. 1.000 R²)
    mean 8.041 ms (7.995 ms .. 8.102 ms)
    std dev 155.3 μs (115.1 μs .. 223.3 μs)

    benchmarking 10000000 strchr_c
    time 8.027 ms (7.933 ms .. 8.142 ms)
    0.999 R² (0.999 R² .. 1.000 R²)
    mean 7.967 ms (7.932 ms .. 8.010 ms)
    std dev 113.2 μs (89.25 μs .. 142.7 μs)


    So the optimized version is indeed the best in my case.

    ReplyDelete
  3. and glibc version 2.25-2

    ReplyDelete
  4. Anonymous2:56 PM

    use 'gcc -dumpspecs' to see if you have some other default for march&mtune than 'native'
    if not, file a gcc bugreport :)
    (and maybe find which arch is the fastest on your machine)

    ReplyDelete
  5. On linux, You can also take a look into the Main binary to find the function that is called with
    objdump -t Main | grep memchr

    You will get something like
    00000000004038d0 F *UND* 0000000000000000 memchr@@GLIBC_2.2.5

    after do a
    ldd Main
    to get the librairy path
    ...
    libc.so.6 => /usr/lib/libc.so.6 (0x00007ff01c4d9000)
    ...

    and finally dissas the .so in order to the assembly code of the function
    objdump -d /usr/lib/libc.so.6 | grep -A500 ':' | less

    you will get
    0000000000085100 :
    85100: 66 48 0f 6e ce movq %rsi,%xmm1
    85105: 48 89 f9 mov %rdi,%rcx
    85108: 66 0f 60 c9 punpcklbw %xmm1,%xmm1
    8510c: 48 85 d2 test %rdx,%rdx
    8510f: 0f 84 3b 03 00 00 je 85450
    85115: 66 0f 60 c9 punpcklbw %xmm1,%xmm1
    85119: 48 83 e1 3f and $0x3f,%rcx
    8511d: 66 0f 70 c9 00 pshufd $0x0,%xmm1,%xmm1
    85122: 48 83 f9 30 cmp $0x30,%rcx
    ....

    It should be possible to do on Window if have mingw, but I don't know much windows.

    ReplyDelete