Sunday, May 31, 2020

HLint --cross was accidentally quadratic

Summary: HLint --cross was accidentally quadratic in the number of files.

One of my favourite blogs is Accidentally Quadratic, so when the Haskell linter HLint suffered such a fate, I felt duty bound to write it up. Most HLint hints work module-at-a-time (or smaller scopes), but there is one hint that can process multiple modules simultaneously - the duplication hint. If you write a sufficiently large repeated fragment in two modules, and pass --cross, then this hint will detect the duplication. The actual application of hints is HLint is governed by:

applyHintsReal :: [Setting] -> Hint -> [ModuleEx] -> [Idea]

Given a list of settings, a list of hints (which gets merged to a single composite Hint) and a list of modules, produce a list of ideas to suggest. Usually this function is called in parallel with a single module at a time, but when --cross is passed, all the modules being analysed get given in one go.

In HLint 3, applyHintsReal became quadratic in the number of modules. When you have 1 module, 1^2 = 1, and everything works fine, but --cross suffers a lot. The bug was simple. Given a Haskell list comprehension:

[(a,b) | a <- xs, b <- xs]

When given the list xs of [1,2] you get back the pairs [(1,1),(1,2),(2,1),(2,2)] - the cross product, which is quadratic in the size of xs. The real HLint code didn't look much different:

[ generateHints m m'
| m <- ms
, ...
, (nm',m') <- mns'
, ...
]
where
    mns' = map (\x -> (scopeCreate (GHC.unLoc $ ghcModule x), x)) ms

We map over ms to create mns' containing each module with some extra information. In the list comprehension we loop over each module ms to get m, then for each m in ms, loop over mns' to get m'. That means you take the cross-product of the modules, which is quadratic.

How did this bug come about? HLint used to work against haskell-src-exts (HSE), but now works against the GHC parser. We migrated the hints one by one, changing HLint to thread through both ASTs, and then each hint could pick which AST to use. The patch that introduced this behaviour left ms as the HSE AST, and made mns' the GHC AST. It should have zipped these two together, so for each module you have the HSE and GHC AST, but accidentally took the cross-product.

How did we spot it? Iustin Pop filed a bug report noting that each hint was repeated once per file being checked and performance had got significantly worse, hypothesising it was O(n^2). Iustin was right!

How did we fix it? By the time the bug was spotted, the HSE AST had been removed entirely, and both m and m' were the same type, so deleting one of the loops was easy. The fix is out in HLint version 3.1.4.

Should I be using --cross? If you haven't heard about --cross in HLint, I don't necessarily suggest you start experimenting with it. The duplicate detection hints are pretty dubious and I think most people would be better suited with a real duplicate detection tool. I've had good experiences with Simian in the past.

Wednesday, May 27, 2020

Fixing Space Leaks in Ghcide

Summary: A performance investigation uncovered a memory leak in unordered-containers and performance issues with Ghcide.

Over the bank holiday weekend, I decided to devote some time to a possible Shake build system performance issue in Ghcide Haskell IDE. As I started investigating (and mostly failed) I discovered a space leak which I eventually figured out, solved, and then (as a happy little accident) got a performance improvement anyway. This post is a tale of what I saw, how I tackled the problem, and how I went forward. As I'm writing the post, not all the threads have concluded. I wrote lots of code during the weekend, but most was only to experiment and has been thrown away - I've mostly left the code to the links. Hopefully the chaotic nature of development shines through.

Shake thread-pool performance

I started with a Shake PR claiming that simplifying the Shake thread pool could result in a performance improvement. Faster and simpler seems like a dream combination. Taking a closer look, simpler seemed like it was simpler because it supported less features (e.g. ability to kill all threads when one has an exception, some fairness/scheduling properties). But some of those features (e.g. better scheduling) were in the pursuit of speed, so if a simpler scheduler was 30% faster (the cost of losing randomised scheduling), that might not matter.

The first step was to write a benchmark. It's very hard to synthesise a benchmark that measures the right thing, but spawning 200K short tasks into the thread pool seemed a plausible start. As promised on the PR, the simpler version did indeed run faster. But interestingly, the simplifications weren't really responsible for the speed difference - switching from forkIO to forkOn explained nearly all the difference. I'm not that familiar with forkOn, so decided to micro-benchmark it - how long does it take to spawn off 1M threads with the two methods. I found two surprising results:

  • The performance of forkOn was quadratic! A GHC bug explains why - it doesn't look too hard to fix, but relying on forkOn is unusual, so its unclear if the fix is worth it.
  • The performance of forkIO was highly inconsistent. Often it took in the region of 1 second. Sometimes it was massively faster, around 0.1s. A StackOverflow question didn't shed much light on why, but did show that by using the PVar concurrency primitive it could be 10x faster. There is a GHC bug tracking the issue, and it seems as though the thread gets created them immediately switches away. There is a suggestion from Simon Peyton Jones of a heuristic that might help, but the issue remains unsolved.

My desire to switch the Shake thread-pool to a quadratic primitive which is explicitly discouraged is low. Trying to microbenchmark with primitives that have inconsistent performance is no fun. The hint towards PVar is super interesting, and I may follow up on it in future, but given the remarks in the GHC tickets I wonder if PVar is merely avoiding one small allocation, and avoiding an allocation avoids a context switch, so it's not a real signal.

At this point I decided to zoom out and try benchmarking all of Ghcide.

Benchmarking Ghcide

The thread about the Shake thread pool pointed at a benchmarking approach of making hover requests. I concluded that making a hover request with no file changes would benchmark the part of Shake I thought the improved thread-pool was most likely to benefit. I used the Shake source code as a test bed, and opened a file with 100 transitive imports, then did a hover over the listToMaybe function. I know that will require Shake validating that everything is up to date, and then doing a little bit of hover computation.

I knew I was going to be running Ghcide a lot, and the Cabal/Stack build steps are frustratingly slow. In particular, every time around Stack wanted to unregister the Ghcide package. Therefore, I wrote a simple .bat file that compiled Ghcide and my benchmark using ghc --make. So I could experiment quickly with changes to Shake, I pulled in all of Shake as source, not as a separate library, with an include path. I have run that benchmark 100's of times, so the fact it is both simple (no arguments) and as fast as I could get has easily paid off.

For the benchmark itself, I first went down the route of looking at the replay functionality in lsp-test. Sadly, that code doesn't link to anything that explains how to generate traces. After asking on the haskell-ide-engine IRC I got pointed at both the existing functionality of resCaptureFile. I also got pointed at the vastly improved version in a PR, which doesn't fail if two messages race with each other. Configuring that and running it on my benchmark in the IDE told me that the number of messages involved was tiny - pretty much an initialisation and then a bunch of hovers. Coding those directly in lsp-test was trivial, and so I wrote a benchmark. The essence was:

doc <- openDoc "src/Test.hs" "haskell"
(t, _) <- duration $ replicateM_ 100 $
    getHover doc $ Position 127 43
print t

Open a document. Send 100 hover requests. Print the time taken.

Profiling Ghcide

Now I could run 100 hovers, I wanted to use the GHC profiling mechanisms. Importantly, the 100 hover requests dominates the loading by a huge margin, so profiles would focus on the right thing. I ran a profile, but it was empty. Turns out the way lsp-test invokes the binary it is testing means it kills it too aggressively to allow GHC to write out profiling information. I changed the benchmark to send a shutdown request at the end, then sleep, and changed Ghcide to abort on a shutdown, so it could write the profiling information.

Once I had the profiling information, I was thoroughly uniformed. 10% went in file modification checking, which could be eliminated. 10% seemed to go to hash table manipulations, which seemed on the high side, but not too significant (turned out I was totally wrong, read to the end!). Maybe 40% went in the Shake monad, but profiling exaggerates those costs significantly, so it's unclear what the truth is. Nothing else stood out, but earlier testing when profiling forkIO operations had shown they weren't counted well, so that didn't mean much.

Prodding Ghcide

In the absence of profiling data, I started changing things and measuring the performance. I tried a bunch of things that made no difference, but some things did have an impact on the time to do 100 hovers:

  • Running normally: 9.77s. The baseline.
  • Switching to forkOn: 10.65s. Suggestive that either Ghcide has changed, or the project is different, or platform differences mean that forkOn isn't as advantageous.
  • Using only one Shake thread: 13.65s. This change had been suggested in one ticket, but made my benchmark worse.
  • Avoid spawning threads for things I think will be cheap: 7.49s. A useful trick, and maybe one that will be of benefit in future, but for such a significant change a 25% performance reduction seemed poor.
  • Avoid doing any Shake invalidation: 0.31s. An absolute lower bound if Shake cheats and does nothing.

With all that, I was a bit dejected - performance investigation reveals nothing of note was not a great conclusion from a days work. I think that other changes to Ghcide to run Shake less and cache data more will probably make this benchmark even less important, so the conclusion worsens - performance investigation of nothing of note reveals nothing of note. How sad.

But in my benchmark I did notice something - a steadily increasing memory size in process explorer. Such issues are pretty serious in an interactive program, and we'd fixed several issues recently, but clearly there were more. Time to change gears.

Space leak detection

Using the benchmark I observed a space leak. But the program is huge, and manual code inspection usually needs a 10 line code fragment to have a change. So I started modifying the program to do less, and continued until the program did as little as it could, but still leaked space. After I fixed a space leak, I zoomed out and saw if the space leak persisted, and then had another go.

The first investigation took me into the Shake Database module. I found that if I ran the Shake script to make everything up to date, but did no actions inside, then there was a space leak. Gradually commenting out lines (over the course of several hours) eventually took me to:

step <- pure $ case v of
    Just (_, Loaded r) -> incStep $ fromStepResult r
    _ -> Step 1

This code increments a step counter on each run. In normal Shake this counter is written to disk each time, which forces the value. In Ghcide we use Shake in memory, and nothing ever forced the counter. The change was simple - replace pure with evaluate. This fix has been applied to Shake HEAD.

Space leak detection 2

The next space leak took me to the Shake database reset function, which moves all Shake keys from Ready to Loaded when a new run starts. I determined that if you didn't run this function, the leak went away. I found a few places I should have put strictness annotations, and a function that mutated an array lazily. I reran the code, but the problem persisted. I eventually realised that if you don't call reset then none of the user rules run either, which was really what was fixing the problem - but I committed the improvements I'd made even though they don't fix any space leaks.

By this point I was moderately convinced that Shake wasn't to blame, so turned my attention to the user rules in Ghcide. I stubbed them out, and the leak went away, so that looked plausible. There were 8 types of rules that did meaningful work during the hover operation (things like GetModificationTime, DoesFileExist, FilesOfInterest). I picked a few in turn, and found they all leaked memory, so picked the simple DoesFileExist and looked at what it did.

For running DoesFileExist I wrote a very quick "bailout" version of the rule, equivalent to the "doing nothing" case, then progressively enabled more bits of the rule before bailing out, to see what caused the leak. The bailout looked like:

Just v <- getValues state key file
let bailout = Just $ RunResult ChangedNothing old $ A v

I progressively enabled more and more of the rule, but even with the whole rule enabled, the leak didn't recur. At that point, I realised I'd introduced a syntax error and that all my measurements for the last hour had been using a stale binary. Oops. I span up a copy of Ghcid, so I could see syntax errors more easily, and repeated the measurements. Again, the leak didn't recur. Very frustrating.

At that point I had two pieces of code, one which leaked and one which didn't, and the only difference was the unused bailout value I'd been keeping at the top to make it easier to quickly give up half-way through the function. Strange though it seemed, the inescapable conclusion was that getValues must somehow be fixing the space leak.

If getValues fixes a leak, it is a likely guess that setValues is causing the leak. I modified setValues to also call getValues and the problem went away. But, after hours of staring, I couldn't figure out why. The code of setValues read:

setValues state key file val = modifyVar_ state $ \vals -> do
    evaluate $ HMap.insert (file, Key key) (fmap toDyn val) vals

Namely, modify a strict HashMap from unordered-containers, forcing the result. After much trial and error I determined that a "fix" was to add:

case HMap.lookup k res of
    Nothing -> pure ()
    Just v -> void $ evaluate v

It's necessary to insert into the strict HashMap, then do a lookup, then evaluate the result that comes back, or there is a space leak. I duly raised a PR to Ghcide with the unsatisfying comment:

I'm completely lost, but I do have a fix.

It's nice to fix bugs. It's better to have some clue why a fix works.

Space leak in HashMap

My only conclusion was that HashMap must have a space leak. I took a brief look at the code, but it was 20+ lines and nothing stood out. I wrote a benchmark that inserted billions of values at 1000 random keys, but it didn't leak space. I puzzled it over in my brain, and then about a day later inspiration struck. One of the cases was to deal with collisions in the HashMap. Most HashMap's don't have any collisions, so a bug hiding there could survive a very long time. I wrote a benchmark with colliding keys, and lo and behold, it leaked space. Concretely, it leaked 1Gb/s, and brought my machine to its knees. The benchmark inserted three keys all with the same hash, then modified one key repeatedly. I posted the bug to the unordered-containers library.

I also looked at the code, figured out why the space leak was occurring, and a potential fix. However, the fix requires duplicating some code, and its likely the same bug exists in several other code paths too. The Lazy vs Strict approach of HashMap being dealt with as an outer layer doesn't quite work for the functions in question. I took a look at the PR queue for unordered-containers and saw 29 requests, with the recent few having no comments on them. That's a bad sign and suggested that spending time preparing a PR might be in vain, so I didn't.

Aside: Maintainers get busy. It's no reflection negatively on the people who have invested lots of time on this library, and I thank them for their effort! Given 1,489 packages on Hackage depend on it, I think it could benefit from additional bandwidth from someone.

Hash collisions in Ghcide

While hash collisions leading to space leaks is bad, having hash collisions at all is also bad. I augmented the code in Ghcide to print out hash collisions, and saw collisions between ("Path.hs", Key GetModificationTime) and ("Path.hs", Key DoesFileExist). Prodding a bit further I saw that the Hashable instance for Key only consulted its argument value, and given most key types are simple data Foo = Foo constructions, they all had the same hash. The solution was to mix in the type information stored by Key. I changed to the definition:

hashWithSalt salt (Key key) = hashWithSalt salt (typeOf key) `xor` hashWithSalt salt key

Unfortunately, that now gave hash collisions with different paths at the same key. I looked into the hashing for the path part (which is really an lsp-haskell-types NormalizedFilePath) and saw that it used an optimised hashing scheme, precomputing the hash, and returning it with hash. I also looked at the hashable library and realised the authors of lsp-haskell-types hadn't implemented hashWithSalt. If you don't do that, a generic instance is constructed which deeply walks the data structure, completely defeating the hash optimisation. A quick PR fixes that.

I also found that for tuples, the types are combined by using the salt argument. Therefore, to hash the pair of path information and Key, the Key hashWithSalt gets called with the hash of the path as its salt. However, looking at the definition above, you can imagine that both hashWithSalt of a type and hashWithSalt of a key expand to something like:

hashWithSalt salt (Key key) = salt `xor` hash (typeOf key) `xor` (salt `xor` 0)

Since xor is associative and commutative, those two salt values cancel out! While I wasn't seeing complete cancellation, I was seeing quite a degree of collision, so I changed to:

hashWithSalt salt (Key key) = hashWithSalt salt (typeOf key, key)

With that fix in Ghcide, all collisions went away, and all space leaks left with them. I had taken this implementation of hash combining from Shake, and while it's not likely to be a problem in the setting its used there, I've fixed it in Shake too.

Benchmarking Ghcide

With the hash collisions reduced, and the number of traversals when computing a hash reduced, I wondered what the impact was on performance. A rerun of the original benchmark showed the time had reduced to 9.10s - giving a speed up of about 5%. Not huge, but welcome.

Several days later we're left with less space leaks, more performance, and hopefully a better IDE experience for Haskell programmers. I failed in what I set out to do, but found some other bugs along the way, leading to 9 PRs/commits and 4 ongoing issues. I'd like to thank everyone in the Haskell IDE team for following along, making suggestions, confirming suspicions, and generally working as a great team. Merging the Haskell IDE efforts continues to go well, both in terms of code output, and team friendliness.

Saturday, May 23, 2020

Shake 0.19 - changes to process execution

Summary: The new version of Shake has some tweaks to how stdin works with cmd.

I've just released Shake 0.19, see the full change log. Most of the interesting changes in this release are around the cmd/command functions, which let you easily run command lines. As an example, Shake has always allowed:

cmd "gcc -c" [source] "-o" [output]

This snippet compiles a source file using gcc. The cmd function is variadic, and treats strings as space-separated arguments, and lists as literal arguments. It's overloaded by return type, so can work in the IO monad (entirely outside Shake) or the Shake Action monad (inside Shake). You can capture results and pass in options, e.g. to get the standard error and run in a different directory, you can do:

Stderr err <- cmd "gcc -c" [source] "-o" [output] (Cwd "src")

Shake is a dynamic build system with advanced dependency tracking features that let's you write your rules in Haskell. It just so happens that running commands is very common in build systems, so while not really part of a build system, it's a part of Shake that has had a lot of work done on it. Since the command is both ergonomic and featureful, I've taken to using the module Develoment.Shake.Command in non-Shake related projects.

Recent cmd changes

The first API breaking change only impacts users of the file access tracing. The resulting type is now polymorphic, and if you opt to for the FSATrace ByteString, you'll get your results a few milliseconds faster. Even if you stick with FSATrace FilePath, you'll get your results faster than the previous version. Performance of tracing happened to matter for a project I've been working on :-).

The other changes in this release are to process groups and the standard input. In Shake 0.18.3, changes were made to switch to create_group=True in the process library, as that improves the ability to cancel actions and clean up sub-processes properly. Unfortunately, on Linux that caused processes that read from standard input to hang. The correlation between these events, and the exact circumstances that triggered it, took a long time to track down - thanks to Gergő Érdi for some excellent bisection work. Most processes that are run in a build system should not access the standard input, and the only reports have come from docker (don't use -i) and ffmpeg (pass -nostdin), but hanging is a very impolite way to fail. In older versions of Shake we inherited the Shake stdin to the child (unless you specified the stdin explicitly with Stdin), but now we create a new pipe with no contents. There are now options NoProcessGroup and InheritStdin which let you change these settings independently. I suspect a handful of commands will need flag tweaks to stop reading the stdin, but they will probably fail saying the stdin is inaccessible, so debugging it should be relatively easy.

In another tale of cmd not working how you might hope, in Shake 0.15.2 we changed cmd to close file handles when spawning a process. Unfortunately, that step is O(n) in the number of potential handles on your system, where n is RLIMIT_NOFILE and can be quite big, so we switched back in 0.18.4. Since 0.18.4 you can pass CloseFileHandles if you definitely want handles to be closed. It's been argued that fork is a bad design, and this performance vs safety trade-off seems another point in favour of that viewpoint.

The amount of work that has gone into processes, especially around timeout and cross-platform differences, has been huge. I see 264 commits to these files, but the debugging time associated with them has been many weeks!

Other changes

This release contains other little tweaks that might be useful:

  • Time spent in the batch function is better accounted for in profiles.
  • Finally deleted the stuff that has been deprecated since 2014, particularly the *> operator. I think a six year deprecation cycle seems more than fair for a pre-1.0 library.
  • Optimised modification time on Linux.

Wednesday, May 20, 2020

GHC Unproposals

Summary: Four improvements to Haskell I'm not going to raise as GHC proposals.

Writing a GHC proposal is a lot of hard work. It requires you to fully flesh out your ideas, and then defend them robustly. That process can take many months. Here are four short proposals that I won't be raising, but think would be of benefit (if you raise a proposal for one of them, I'll buy you a beer next time we are physically co-located).

Use : for types

In Haskell we use : for list-cons and :: for types. That's the wrong way around - the use of types is increasing, the use of lists is decreasing, and type theory has always used :. This switch has been joke-proposed before. We actually switched these operators in DAML, and it worked very nicely. Having written code in both styles, I now write Haskell on paper with : for types instead of ::. Nearly all other languages use : for types, even Python. It's sad when Python takes the more academically pure approach than Haskell.

Is it practical: Maybe. The compiler diff is quite small, so providing it as an option has very little technical cost. The problem is it bifurcates the language - example code will either work with : for types or :: for types. It's hard to write documentation, text books etc. If available, I would switch my code.

Make recursive let explicit

Currently you can write let x = x + 1 and it means loop forever at runtime because x is defined in terms of itself. You probably meant to refer to the enclosing x, but you don't get a type error, and often don't even get a good runtime error message, just a hang. In do bindings, to avoid the implicit reference to self, it's common to write x <- pure $ x + 1. That can impose a runtime cost, and obscure the true intent.

In languages like OCaml there are two different forms of let - one which allows variables to be defined and used in a single let (spelled let rec) and one which doesn't (spelled let). Interestingly, this distinction is important in GHC Core, which has two different keywords, and a source let desugars differently based on whether it is recursive. I think Haskell should add letrec as a separate keyword and make normal let non-recursive. Most recursive bindings are done under a where, and these would continue to allow full recursion, so most code wouldn't need changing.

Is it practical: The simplest version of this proposal would be to add letrec as a keyword equivalent to let and add a warning on recursive let. Whether it's practical to go the full way and redefine the semantics of let to mean non-recursive binding depends on how strong the adoption of letrec was, but given that I suspect recursive let is less common, it seems like it could work. Making Haskell a superset of GHC Core is definitely an attractive route to pursue.

Allow trailing variables in bind

When writing Haskell code, I often have do blocks that I'm in the middle of fleshing out, e.g.:

do fileName <- getLine
   src <- readFile fileName

My next line will be to print the file or similar, but this entire do block, and every sub-part within it, is constantly a parse error until I put in that final line. When the IDE has a parse error, it can't really help me as much as I'd like. The reason for the error is that <- can't be the final line of a do block. I think we should relax that restriction, probably under a language extension that only IDE's turn on. It's not necessarily clear what such a construct should mean, but in many ways that isn't the important bit, merely that such a construct results in a valid Haskell program, and allows more interactive feedback.

Is it practical: Yes, just add a language extension - since it doesn't actually enable any new power it's unlikely to cause problems. Fleshing out the semantics, and whether it applies to let x = y statements in a do block too, is left as an exercise for the submitter. An alternative would be to not change the language, but make GHC emit the error slightly later on, much like -fdefer-type-errors, which still works for IDEs (either way needs a GHC proposal).

Add an exporting keyword

Currently the top of every Haskell file duplicates all the identifiers that are exported - unless you just export everything (which you shouldn't). That approach duplicates logic, makes refactorings like renamings more effort, and makes it hard to immediately know if the function you are working on is exposed. It would be much nicer if you could just declare things that were exported inline, e.g. with a pub keyword - so pub myfunc :: a -> a both defines and exports myfunc. Rust has taken this approach and it works out quite well, modulo some mistakes. The currently Haskell design has been found a bit wanting, with constructs like pattern Foo in the export list to differentiate when multiple names Foo might be in scope, when attaching the visibility to the identifier would be much easier.

Is it practical: Perhaps, provided someone doesn't try and take the proposal too far. It would be super tempting to differentiate between exports to of the package, and exports that are only inside this package (what Rust clumsily calls pub(crate)). And there are other things in the module system that could be improved. And maybe we should export submodules. I suspect everyone will want to pile more things into this design, to the point it breaks, but a simple exporting keyword would probably be viable.

Friday, May 15, 2020

File Access Tracing

Summary: It is useful to trace files accessed by a command. Shake and FSATrace provide some tools to do that.

When writing a build system, it's useful to see which files a command accesses. In the Shake build system, we use that information for linting, an auto-deps feature and a forward build mode. What we'd like is a primitive which when applied to a command execution:

  1. Reports which files are read/written.
  2. Reports the start and end time for when the files were accessed.
  3. Reports what file metadata is accessed, e.g. modification times and directory listing.
  4. Lets us pause a file access (so the dependencies can be built) or deny a file access (so dependency violations can be rejected early).
  5. Is computationally cheap.
  6. Doesn't require us to write/maintain too much low-level code.
  7. Works on all major OSs (Linux, Mac, Windows).
  8. Doesn't require sudo or elevated privilege levels.

While there are lots of approaches to tracing that get some of those features, it is currently impossible to get them all. Therefore, Shake has to make compromises. The first fours bullet points are about features -- we give up on 2 (timestamps) and 4 (pause/deny); 1 (read/writes) is essential, and we make 3 (metadata) optional, using the imperfect information when its available and tolerating its absence. The last four bullet points are about how it works -- we demand 7 (compatibility) and 8 (no sudo) because Shake must be easily available to its users. We strive for 5 (cheap) and 6 (easy), but are willing to compromise a bit on both.

Shake abstracts the result behind the cmd function with the FSATrace return type. As an example I ran in GHCi:

traced :: [FSATrace] <- cmd "gcc -c main.c"
print traced

Which compiles main.c with gcc, and on my machine prints 71 entries, including:

[ FSARead "C:\\ghc\\ghc-8.6.3\\mingw\\bin\\gcc.exe"
, FSARead "C:\\Neil\\temp\\main.c"
, FSAWrite "C:\\Users\\ndmit_000\\AppData\\Local\\Temp\\ccAadCiR.s"
, FSARead "C:\\ghc\\ghc-8.6.3\\mingw\\bin\\as.exe"
, FSARead "C:\\Users\\ndmit_000\\AppData\\Local\\Temp\\ccAadCiR.s"
, FSAWrite "C:\\Neil\\temp\\main.o"
, ...
]

Most of the remaining entries are dlls that gcc.exe uses, typically from the Windows directory. I've reordered the list to show the flow more clearly. First the process reads gcc.exe (so it can execute it), which reads main.c and writes a temporary file ccAadCiR.s. It then reads as.exe (the assembler) so it can run it, which in turn reads ccAadCiR.s and writes main.o.

Under the hood, Shake currently uses FSATrace, but that is an implementation detail -- in particular the BigBro library might one day also be supported. In order to understand the limitations of the above API, it's useful to understand the different approaches to file system tracing, and which ones FSATrace uses.

Syscall tracing On Linux, ptrace allows tracing every system call made, examining the arguments, and thus recording the files accessed. Moreover, by tracing the stat system call even file queries can be recorded. The syscall tracking approach can be made complete, but because every syscall must be hooked, can end up imposing high overhead. This approach is used by BigBro as well as numerous other debugging and instrumentation tools.

Library preload On both Linux and Mac most programs use a dynamically linked C library to make file accesses. By using LD_LIBRARY_PRELOAD it is possible to inject a different library into the program memory which intercepts the relevant C library calls, recording which files are read and written. This approach is simpler than hooking syscalls, but only works if all syscall access is made through the C library. While normally true, that isn't the case for Go programs (syscalls are invoked directly) or statically linked programs (the C library cannot be replaced).

While the technique works on a Mac, from Mac OS X 1.10 onwards system binaries can't be traced due to System Integrity Protection. As an example, the C compiler is typically installed as a system binary. It is possible to disable System Integrity Protection (but not recommended by Apple); or to use non-system binaries (e.g. those supplied by Nix); or to copy the system binary to a temporary directory (which works provided the binary does not afterwards invoke another system binary). The library preload mechanism is implemented by FSATrace and the copying system binaries trick on Mac is implemented in Shake.

File system tracing An alternative approach is to implement a custom file system and have that report which files are accessed. One such implementation for Linux is TracedFS, which is unfortunately not yet complete. Such an approach can track all accesses, but may require administrator privileges to mount a file system.

Custom Linux tracing On Linux, thanks to the open-source nature of the kernel, there are many custom file systems (e.g FUSE) and tracing mechanisms (e.g. eBPF), many of which can be used/configured/extended to perform some kind of system tracing. Unfortunately, most of these are restricted to Linux only.

Custom Mac tracing BuildXL uses a Mac sandbox based on KAuth combined with TrustedBSD Mandatory Access Control (MAC) to both detect which files are accessed and also block access to specific files. The approach is based on internal Mac OS X details which have been reversed engineered, some of which are deprecated and scheduled for removal.

Windows Kernel API hooking On Windows it is possible to hook the Kernel API, which can be used to detect when any files are accessed. Implementing such a hook is difficult, particularly around 32bit v 64bit differences, as custom assembly language trampolines must be used. Furthermore, some antivirus products (incorrectly) detect such programs as viruses. Windows kernel hooking is available in both FSATrace and BigBro (sharing the same source code), although without support for 32bit processes that spawn 64bit processes.

Current State

Shake currently uses FSATrace, meaning it uses library preloading on Linux/Mac and kernel hooking on Windows. The biggest practical limitations vary by OS:

  • On Linux it can't trace into Go programs (or other programs that use system calls directly) and statically linked binaries. Integrating BigBro as an alternative would address these issues.
  • On Mac it can't trace into system binaries called from other system binaries, most commonly the system C/C++ compiler. Using your own C/C++ installation, via Homebrew or Nix, is a workaround.
  • On Windows it can't trace 64bit programs spawned by 32bit programs. In most cases the 32bit binaries can easily be replaced by 64bit binaries. The only problem I've seen was caused by a five year-old version of sh hiding out in my C:\bin directory, which was easily remedied with a newer version. The code to fix this issue is available, but scares me too much to try integrating.

Overall, the tracing available in Shake has a simple API, is very useful for Shake, and has been repurposed in other build systems. But I do dearly wish such functionality could be both powerful and standardised!

Sunday, May 03, 2020

HLint 3.0

Summary: HLint 3.0 uses the GHC parser.

In June 2019 I posted about our intention to move HLint to the GHC parser. Since then a small group of us have been hard at work making the conversion -- first by parsing with both GHC and haskell-src-exts, and finally, with the newly released HLint 3.0, parsing only with GHC. As of now, if your code can be parsed with GHC, it can probably be parsed with HLint. As new GHC releases come out, with new features and new forms of syntax, HLint will follow along closely.

The change list for this release records 51 separate items, which is about as many as the last nine HLint releases combined. Of those changes, 11 are breaking changes (the ones marked with *). That count omits all the hint conversions, which (hopefully!) aren't user visible. The main API breaks are in the Language.Haskell.HLint API, which has switched from haskell-src-exts types to GHC ones. You can now take a GHC syntax tree and apply HLint to it, or (as before) give HLint the source and have it do the parsing for you. We also took the opportunity to simplify the API while we were at it -- but the underlying functionality remains much the same. We also deleted a small number of command line flags that were no longer useful, and were never used very much. If you have difficulty converting to the new API or relied on some removed functionality, raise a bug.

What was especially nice about this conversion process, and the development of HLint in general, is that it is increasingly becoming a real team, where my role is more reviewer than coder. There have been 21 distinct contributors since the start of the GHC conversion, but I'd like to particularly call out a few major pieces of work that have been completed:

  • Shayne Fletcher is responsible for the ghc-lib-parser library that makes it feasible to use a single GHC API across multiple GHC versions. Without that, using the GHC library would be at least double the work (and just wouldn't be feasible). Shayne also did a lot of the conversion, mapping many of the rule types from haskell-src-exts to GHC. These API's are surprisingly different given they have the same underlying representation.
  • Georgi Lyubenov did most of the conversions that Shayne didn't do.
  • Joseph C. Sible has made sure that while efforts were focused on a complete rewrite of the code base, the underlying hints have continued to improve, removing incorrect hints, adding useful additional hints.
  • Ziyang Liu has focused on the refactoring side of HLint. The initial refactoring work was completed by Matthew Pickering as part of GSoC 2015. Since then it's had mild attention at best. Ziyang has stepped into the gap, importantly adding tests, CI and improving the refactorings in lots of places. It now feels like a real part of HLint.

We hope you enjoy HLint 3.0 and beyond!

PS. You may spot we're already on HLint 3.0.2 on Hackage - thanks to Ryan Scott for already finding a few bugs.