Monday, April 29, 2019

foldr under the hood

Summary: The foldr function seems simple, but is actually very complex, with lots of layers. This post dives through the layers.

The foldr function takes a list and replaces all : (cons) and [] (nil) values with functions and a final value. It's available in the Haskell Prelude and described on Wikipedia. As some examples:

sum = foldr (+) 0
map f = foldr (\x xs -> f x : xs) []

But the simple foldr described on Wikipedia is many steps away from the one in the Haskell Prelude. In this post we'll peel back the layers, learning why foldr is a lot more complicated under the hood.

Layer 1: Wikipedia definition

The definition on Wikipedia is:

foldr :: (a -> b -> b) -> b -> [a] -> b
foldr f z []     = z
foldr f z (x:xs) = f x (foldr f z xs)

This recursive definition directly describes what foldr does. Given a list [1,2,3] we get f 1 (f 2 (f 3 z)).

Layer 2: Static argument transformation

The problem with this definition is that it is recursive, and GHC doesn't like to inline recursive functions, which prevents a lot of optimisation. Taking a look at sum, it's a real shame that operations like (+) are passed as opaque higher-order functions, rather than specialised to the machine instruction ADD. To solve that problem, GHC defines foldr as:

foldr f z = go
    where go []     = z
          go (x:xs) = f x (go xs)

The arguments f and z are constant in all sucessive calls, so they are lifted out with a manually applied static argument transformation.

Now the function foldr is no longer recursive (it merely has a where that is recursive), so foldr can be inlined, and now + can meet up with go and everything can be nicely optimised.

Layer 3: Inline later

We now have foldr that can be inlined. However, inlining foldr is not always a good idea. In particular, GHC has an optimisation called list fusion based on the idea that combinations of foldr and build can be merged, sometimes known as short-cut deforestation. The basic idea is that if we see foldr applied to build we can get rid of both (see this post for details). We remove foldr using the GHC rewrite rule:

{-# RULES "my foldr/build" forall g k z. foldr k z (build g) = g k z #-}

The most interesting thing about this rule (for this post at least!) is that it matches foldr by name. Once we've inlined foldr we have thrown away the name, and the rule can't fire anymore. Since this rule gives significant speedups, we really want it to fire, so GHC adds an extra pragma to foldr:

{-# INLINE [0] foldr #-}

This INLINE pragma says don't try and inline foldr until the final stage of the compiler, but in that final stage, be very keen to inline it.

Layer 4: More polymorphism

However, the foldr function in the Prelude is not the one from GHC.List, but actually a more general one that works for anything Foldable. Why limit yourself to folding over lists, when you can fold over other types like Set. So now foldr is generailsed from [] to t with:

foldr :: Foldable t => (a -> b -> b) -> b -> t a -> b

Where foldr on [] is GHC.List.foldr.

Layer 5: A default implementation

But foldr is actually in the type class Foldable, not just defined on the outside. Users defining Foldable can define only foldr and have all the other methods defined for them. But they can equally define only foldMap, and have an implicit version of foldr defined as:

foldr :: (a -> b -> b) -> b -> t a -> b
foldr f z t = appEndo (foldMap (Endo . f) t) z

Where Endo is defined as:

newtype Endo = Endo {appEndo :: a -> a}

instance Monoid (Endo a) where
    mempty = Endo id
    Endo a <> Endo b = Endo (a . b)

The function foldMap f is equivalent to mconcat . map f, so given a list [1,2,3] the steps are:

  • First apply map (Endo . f) to each element to get [Endo (f 1), Endo (f 2), Endo (f 3)].
  • Next apply mconcat to the list to get Endo (f 1) <> Endo (f 2) <> Endo (f 3).
  • Inline all the <> definitions to get Endo (f 1 . f 2 . f 3).
  • Apply the appEndo at the beginning and z at the end for (f 1 . f 2 . f 3) z.
  • Inline all the . to give f 1 (f 2 (f 3 z)), which is what we had at layer 1.

Layer 6: Optimising the default implementation

The real default implementation of foldr is:

foldr f z t = appEndo (foldMap (Endo #. f) t) z

Note that the . after Endo has become #.. Let's first explain why it's correct, then why it might be beneficial. The definition of #. is:

(#.) :: Coercible b c => (b -> c) -> (a -> b) -> (a -> c)
(#.) _ = coerce

Note that it has the same type as . (plus a Coercible constraint), but ignores it's first argument entirely. The coerce function transforms a value of type a into a value of type b with zero runtime cost, provided they have the same underlying representation. Since Endo is a newtype, that means Endo (f 1) and f 1 are implemented identically in the runtime, so coerce switches representation "for free". Note that the first argument to #. only serves to pin down the types, so if we'd passed an interesting function as the first argument it would have been ignored.

Of course, in normal circumstances, a newtype is free anyway, with no runtime cost. However, in this case we don't have a newtype, but a function application with a newtype. You can see the gory details in GHC ticket 7542, but at one point this impeeded other optimisations.

I tried a simulated version of foldr and found that if GHC can't tell that [] is the Foldable the code looks pretty bad, but if it can, at -O1 and above, the two implementations are 100% equialent (to the point that common subexpression elimination makes them actually the same). It's possible this final layer is a vestigial optimisation, or perhaps it's still important in some circumstances.

Tuesday, April 16, 2019

Code Statistics and Measuring Contributions

Summary: The only way to understand a code base is to ask someone who works on it.

This weekend a relative asked me how can we tell who wrote the code behind the black hole image, and was interested in the stats available on GitHub. There are lots of available stats, but almost every stat can be misleading in some circumstances. The only people who have the context to interpret the stats are those who work on the project, hence my default approach to assessing a project is to ask someone who works on it, with the understanding that they may look at relevant stats on GitHub or similar. In this post lets go through some of the reasons that a simplistic interpretation of the stats is often wrong.

These remarks all apply whether you're trying to assign credit for a photo, trying to do performance reviews for an employee or trying to manage a software project.

What to measure

There are broadly two ways to measure activity on the code in a repo. The first is additions/deletions of lines of code, where a modified line of code is usually measured as an addition and deletion. The second is number of commits or pull requests, which measures how many batches of changes are made. The problem with the latter is that different people have different styles - some produce big batches, some tiny batches - a factor of 10 between different developers is not unusual. There are also policy reasons that commits may be misleading - some projects squash multiple commits down to one when merging. The number of lines of code gives a better measure of what has changed, but it's merely better, not good - the rest of this post assumes people are focusing on number of lines of code changed.

All code is equal

Treating number of lines changed as the contribution assumes that every line is equally hard - but that's far from the case. At a previous company I worked on code that ranged from the internals of a compiler, to intricate C++ libraries, to Haskell GUI's. I estimate that I could produce 100x the volume of Haskell GUI's compared to C++ libraries. Other colleagues worked only only on the compiler, or only on GUIs - vastly changing how much code they produced per hour.

Similarly, each line of code is not equally important. Last week I wrote a few 100 lines of code. Of those, nearly all were done on Monday, and the remainder of the week involved a single line that is ferociously difficult with lots of obscure side conditions (libraries and link order...). That one line is super valuable, but simplistic measuring suggests I napped all Tuesday and Wednesday.

Code is attributed properly

Developers typically have user handles or email addresses that are used for code contributions. I currently have at least two handles, and in the past when we did stats on a $WORK project there were 6 different email addresses that I claimed ownership of. As a consequence, my work shows up under lots of different names, and counting it can be difficult. The longer a project runs, the more chance of developers changing identity.

The person who changed code did the work

A big part of software engineering is making old code obsolete. I was recently involved in deleting many thousands of lines that was no longer necessary. With a small team, we created a new project, implemented it, converted 90% of the uses over to the new code, and then stopped. Separately, someone else did the last 10% of the conversion, and then was able to delete a huge number of lines of code. There was definitely work in deleting the final bit of code, but the "labour" involved in that final deletion was mostly carried out months ago by others.

Similarly, when copying a new project in (often called vendoring) there is a big commit to add a lot of new code that was originally written by others, but which gets attributed to a named individual.

All code is in one place

Often projects will incorporate library code. For example, the official contribution of Niklas Broberg to HLint is 8 lines. However, he's called out explicitly in the README as someone whose contributions were essential to the project. In this case, because he wrote a library called haskell-src-exts without which HLint could not exist, and then continued to improve it for the benefit of HLint for many years.

Furthermore, projects like HLint rely on a compiler, libraries, operating system, and even a version control system. Usually these get overlooked when giving credit since they are relatively old and shared between many projects - but they are an essential part of getting things to work.

More code is better

The only purpose of code is to do a thing - whatever that thing might be. In all other ways, code is a liability - it has to be read, tested, compiled etc. Given the choice between 10 lines or 1,000,000 lines of code, I would always prefer 10 lines if they did the same thing. A smarter programmer who can do more with less lines of code is better. The famous quote attributed to Bill Gates is still true many decades later:

Measuring programming progress by lines of code is like measuring aircraft building progress by weight.

Code is the essence

Measuring code suggests that code is the thing that matters. The code certainly does matter, but the code is just a representation of an underlying algorithm. The code follows a high-level design. Often much more significant contributions are made by picking the right design, algorithm, approach etc.

Code is all that matters

In a large project there is code, but the code doesn't exist in a vacuum. There are other code-adjacent tasks to be performed - discussions, mentoring, teaching, reporting issues, buying computers etc. Many of these are never reflected in the code, yet if omitted, the code wouldn't happen, or would happen slower.

Sunday, April 07, 2019

Code Review: Approve with Suggestions

Summary: Code review is not a yes/no decision - mostly I say yes with suggestions.

As I wrote previously, I didn't used to be a fan of code review for $WORK code, but now I am. After I review some code there are three responses I might give:

  • "Request changes" - this code has some fatal flaw. Maybe I saw a race condition. Maybe there's insufficient testing. Maybe it's just a bad idea. Please fix it or convince me I'm wrong and I'll review again.
  • "Approved" - this code is great. Let's merge it. If the CI has already passed, I'll probably merge it now myself.
  • "Approved with suggestions" - this code is fine, I'm happy for it to be merged, but I thought of a few ways to make it better.

I think I use "Approved with suggestions" about 80% of the time. To use this status I think the code is correct, readable, and will have no negative effects - I'm happy for it to be merged. At the same time, I can think of a few ways to improve it - e.g. using some utility function, simplifying things a bit, making the documentation clearer. If the original author disagrees with me, I'm not going to bother arguing. I'm happy for these commits to be in a follow up PR, or pushed on top of this PR, whatever suits them.

What's different about approved with suggestions, at least for trusted individuals (e.g. colleagues), is that if they make these tweaks I have no real interest in rereviewing. I'm happy for my approval to remain sticky and for them to seek out rereview only if they think they need it. Importantly, this guideline is consistent with my reasons for reviewing. After the first review, if the code doesn't change meaningfully, a rereview offers none of the benefits that make me want to review in the first place.

Since the programming language DAML that I work on is now open source I can point at a concrete example using a pull request to our GHC fork. Here Shayne added a function:

qualifyDesugar :: (String -> OccName) -> String -> RdrName
qualifyDesugar occName =
  (mkRdrQual $ mkModuleName "DA.Internal.Desugar") . occName

It matches the stated intention, but it seems to do a bit too much - it turns a String into an OccName using a supplied function, when it could have just taken the OccName directly. Simpler, better, more maintainable. So I suggested:

qualifyDesugar :: OccName -> RdrName
qualifyDesugar = mkRdrQual $ mkModuleName "DA.Internal.Desugar"

A nice improvement. However, Shayne is responsible enough to make such simple tweaks that it didn't require another review. A typical case of Approve with suggestions.