Tuesday, December 06, 2016

Undefined Behaviour in C

Summary: I tripped over undefined behaviour in C. It's annoying.

I've recently been writing some C code to parse XML quickly. While working on that project, I inadvertently wrote some code which is undefined according to the C language standard. The code compiled and ran fine using Visual Studio, but under gcc (even at -O0) it corrupted memory, sometimes leading to a segfault, but usually just leading to a wrong answer. The code in question was (see full code at GitHub):

d->nodes.nodes[0].nodes = parse_content(d);

To give some context, d is a structure that contains various pieces of state - what the string to be parsed is, how much we have parsed, along with a pointer to the output nodes. The parse_content function parses the bit inside an XML tag, returning the indicies in nodes which it used.

The complication comes from nodes not being a fixed size, but dynamically resized if the number of nodes exceeds the capacity. For big documents that means parse_content will reallocate d->nodes.nodes.

According to the C spec, the compiler can evaluate the LHS and RHS of an assignment in any order. Since gcc computes the location of d->nodes.nodes[0] before calling parse_content it uses the address of the node before reallocation. After reallocation the address will have changed, and the assignment will be made to the wrong location.

I spotted the bug by inserting printf statements, and in doing so, I had to rewrite the code to:

str content = parse_content(d);
d->nodes.nodes[0].nodes = content;

That fixes the issue, since now the evaluation order is strictly defined. As a simplified example of the same issue:

char* array;

char f() {
    array = malloc(42);
    return 'x';    
}

void test() {
    array = malloc(0);
    array[0] = f();
}

Here the line array[0] = f() might assign to either the result of malloc(0) or malloc(42), at the compilers discretion.

I manually checked if I had made any other such mistakes, and I couldn't find any. Naturally, I wanted to find a static checker that could detect such a mistake, so I tried a bunch of them. I wasn't very successful:

  • Visual Studio 2015 code analysis made me write assert after each malloc, but nothing further.
  • PVS Studio found nothing.
  • Clang undefined behaviour found nothing, and seemingly doesn't work on Windows.
  • GCC undefined behaviour found nothing, and seemingly doesn't work on Windows.
  • RV-Match hit a stack-overflow when running the program.

14 comments:

Anonymous said...

Maybe give KCC a try: https://github.com/kframework/c-semantics

Neil Mitchell said...

RV Match essentially involves running kcc, and is based on kcc, so I suspect with RV Match I covered everything KCC does, but I'm not 100% sure.

greg said...

I'm pretty sure ASan will find this at runtime.
http://clang.llvm.org/docs/AddressSanitizer.html

Neil Mitchell said...

greg, just tried it with both clang and gcc address sanitisers and no luck. The problem is that in truth I allocate a shared buffer that stores the document structure, a small number of attributes, and a small number of nodes, so that small parse trees require only one allocation. If the nodes grow, then I allocate some space just for some nodes, and if they grow more, I repeat that process. It means the initial small nodes buffer remains valid on reallocation, it just happens to point at somewhere I no longer care about, and the address sanitiser is perfectly happy.

Neil Mitchell said...

I was also hoping to find the undefined behaviour even if the compiler did pick RHS before LHS, as they are free to do, which would avoid any problems.

Bence Kodaj said...
This comment has been removed by the author.
Bence Kodaj said...

Did you try FlexeLint? I pasted your simplified example (after adding #include <stdlib.h>) into http://gimpel-online.com//cgi-bin/genPage.py?srcFile=diy.c&cgiScript=analyseCode.py&title=Do-It-Yourself+Example+%28C%29&intro=This+example+allows+you+to+specify+your+own+C+code.&compilerOption=online32.lnt&includeOption={{quotedIncludeOption}} , and it gave me a warning: "Warning 591: Variable 'array' depends on the order of evaluation; it is modified through function 'f(void)' via calls: f()"

greg said...

Hmm, then my only hope is UBSan
http://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html

If it doesn't detect the bug, the folks on the team will likely want to hear about it.

enobayram said...

Try running your debug binary under Valgrind, it'll immediately tell you which line is offending. I treat Valgrind like a last-resort antibiotics, as using it too often can develop lazy habits. In the world of C/C++ you have to be on your toes all the time. 60% of your brain should be spared for proving that what you're doing at a given moment is not UB.

George Colpitts said...

FWIW, http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html
Three part blog entry, What Every C Programmer Should Know About Undefined Behavior by LLVM founder Chris Lattner. C is a vey scary language.

Anonymous said...

Actually, the root cause of your issue is unspecified behavior, where the C standard specifies that the order of evaluation of the LHS with respect to the RHS is not specified. The undefined behavior then happens in only one of the two cases, which might explain why it is difficult for the static analysis tools to spot it.

Neil Mitchell said...

In fact, I have allocated memory at the point in both cases, so I think both cases are fully defined. I guess I had hoped that UBSan also found unspecified behaviour, but maybe not.

Ioasaph Vespertine said...

Scan-build was my first thought: http://clang-analyzer.llvm.org/ So, I aimed scan-build from checker-279 at your simplified example, and it didn't catch anything other than the `malloc(0)`. Enabling additional checkers that seemed somewhat related didn't change this.

Cranking up warnings on clang itself griped about missing prototypes and that the global should be either static or extern, but didn't call out what's effectively a dead store (one way or the other).

I hoped that inlining f using a GNU statement expression might make it gripe, but that also didn't.

Random Rust Fan said...

A fast parser with a C interface for Haskell -- almost perfect excuse to try out Rust IMO. I would refrain from asking for a rewrite in Rust tho since the project is already published. But honestly, check it out at https://www.rust-lang.org , and maybe try it out on another system-level project.