Monday, October 13, 2014

Shake's Internal State

Summary: Shake is not like Make: it has different internal state, which leads to different behaviour. I also store the state in an optimised way.

Update: I'm keeping an up to date version of this post in the Shake repo, which includes a number of questions/answers at the bottom, and is likely to evolve over time to incorporate that information into the main text.

In order to understand the behaviour of Shake, it is useful to have a mental model of Shake's internal state. To be a little more concrete, let's talk about Files which are stored on disk, which have ModTime values associated with them, where modtime gives the ModTime of a FilePath (Shake is actually generalised over all those things). Let's also imagine we have the rule:

file *> \out -> do
    need [dependency]
    run

So file depends on dependency and rebuilds by executing the action run.

The Make Model

In Make there is no additional state, only the file-system. A file is considered dirty if it has a dependency such that:

modtime dependency > modtime file

As a consequence, run must update modtime file, or the file will remain dirty and rebuild in subsequent runs.
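
As a rough illustration (a sketch of the rule above, not how Make is implemented), the dirtiness test can be written in Haskell. ModTime is assumed to be a plain Int here, and makeDirty, stale and rebuilt are names made up for this example:

```haskell
type File = String
type ModTime = Int -- assumption: times modelled as plain integers

-- | Make's test: dirty if some dependency is strictly newer than the file.
makeDirty :: (File -> ModTime) -> File -> [File] -> Bool
makeDirty modtime file deps = any (\d -> modtime d > modtime file) deps

-- | Hypothetical disk state: dependency edited at time 5, file written at time 3.
stale :: File -> ModTime
stale "dependency" = 5
stale _            = 3

-- | The same disk after run has rewritten file at time 6.
rebuilt :: File -> ModTime
rebuilt "file" = 6
rebuilt f      = stale f
```

With these definitions makeDirty stale "file" ["dependency"] holds, while makeDirty rebuilt "file" ["dependency"] does not. If run left modtime file at 3, the file would stay dirty forever.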

The Shake Model

For Shake, the state is:

database :: File -> (ModTime, [(File, ModTime)])

Each File is associated with a pair containing the ModTime of that file, plus a list of each dependency and their modtimes, all from when the rule was last run. As part of executing the rule above, Shake records the association:

file -> (modtime file, [(dependency, modtime dependency)])

The file is considered dirty if any of the information is no longer current. In this example, that means either modtime file or modtime dependency has changed.
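
The simple model can be sketched as runnable Haskell. This is only an illustration under some assumptions: the database is a Data.Map, ModTime is an Int, a file with no entry counts as dirty, and record and dirty are names invented for the sketch rather than Shake functions:

```haskell
import qualified Data.Map as Map

type File = String
type ModTime = Int -- assumption: any type with equality would do

-- | What is remembered per file in the simple model.
type Database = Map.Map File (ModTime, [(File, ModTime)])

-- | Record the state observed right after running the rule for a file.
record :: (File -> ModTime) -> File -> [File] -> Database -> Database
record modtime file deps =
    Map.insert file (modtime file, [(d, modtime d) | d <- deps])

-- | Dirty if there is no entry, or if any recorded ModTime differs from
-- the current one. Only equality is used, never ordering.
dirty :: (File -> ModTime) -> Database -> File -> Bool
dirty modtime db file =
    case Map.lookup file db of
        Nothing -> True
        Just (mFile, deps) ->
            modtime file /= mFile || any (\(d, m) -> modtime d /= m) deps
```

Note that a dependency whose ModTime moves backwards still makes the file dirty, and that calling record again makes the file clean even if modtime file itself never changed.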

There are a few consequences of the Shake model:

  • There is no requirement for modtime file to change as a result of run. The file is dirty because something changed; after we run the rule and record the new information it becomes clean.
  • Since a file is not required to change its modtime, things that depend on file may not require rebuilding even if file rebuilds.
  • If you update an output file, Shake will rebuild that file, as the ModTime of a result is tracked.
  • Shake only ever performs equality tests on ModTime, never ordering, which means it generalises to other types of value and works even if your file-system sometimes has incorrect times.

These consequences allow two workflows that aren't pleasant in Make:

  • Generated files, where the generator changes often, but the output of the generator for a given input changes rarely. In Shake, you can rerun the generator regularly, and using a function that writes only on change (writeFileChanged in Shake) you don't rebuild further. This technique can reduce some rebuilds from hours to seconds.
  • Configuration file splitting, where you have a configuration file with lots of key/value pairs, and want certain rules to only depend on a subset of the keys. In Shake, you can generate a file for each key/value and depend only on that key. If the configuration file updates, but only a subset of keys change, then only a subset of rules will rebuild. Alternatively, using Development.Shake.Config you can avoid the file for each key, but the dependency model is the same.
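
To make the "write only on change" idea concrete, here is a minimal sketch in plain IO. It is not Shake's writeFileChanged, just a hypothetical helper (writeIfChanged, with readStrict as a supporting function) that shows the technique: if the content is already on disk, the file is left untouched, so its ModTime stays the same.

```haskell
import Control.Exception (IOException, evaluate, try)
import System.IO (IOMode (ReadMode), hGetContents, withFile)

-- | Read a whole file strictly, so the handle can be closed safely.
readStrict :: FilePath -> IO String
readStrict path = withFile path ReadMode $ \h -> do
    s <- hGetContents h
    _ <- evaluate (length s) -- force the lazy contents before closing
    pure s

-- | Hypothetical helper: write the new content only if it differs from
-- what is on disk. Returns True when the file was actually written.
writeIfChanged :: FilePath -> String -> IO Bool
writeIfChanged path new = do
    old <- try (readStrict path) :: IO (Either IOException String)
    case old of
        Right s | s == new -> pure False       -- identical: leave the file alone
        _ -> writeFile path new >> pure True   -- missing, unreadable or different
```

A generator rule that ends with such a write can rerun as often as it likes; rules that depend on the generated file only see a change when the content really differs.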

Optimising the Model

The above model expresses the semantics of Shake, but the implementation uses an optimised model. Note that the original Shake paper gives the optimised model, not the easy to understand model - that's because I only figured out the difference a few days ago (thanks to Simon Marlow, Simon Peyton Jones and Andrey Mokhov). To recap, we started with:

database :: File -> (ModTime, [(File, ModTime)])

We said that File is dirty if any of the ModTime values change. That's true, but what we are really doing is comparing the first ModTime with the ModTime on disk, and the list of second ModTimes with those in database. Assuming we are passed the current ModTime on disk, then a file is valid if:

valid :: File -> ModTime -> Bool
valid file mNow =
    mNow == mOld &&
    and [fst (database d) == m | (d,m) <- deps]
    where (mOld, deps) = database file

The problem with this model is that we store each File/ModTime pair once for the file itself, plus once for every dependency. That's a fairly large amount of information, and in Shake both File and ModTime can be arbitrarily large for user rules.

Let's introduce two assumptions:

Assumption 1: A File only has at most one ModTime per Shake run, since a file will only rebuild at most once per run. We use Step for the number of times Shake has run on this project.

Consequence 1: The ModTime for a file and the ModTime for its dependencies are all recorded in the same run, so they share the same Step.

Assumption 2: We assume that if the ModTime of a File changes, and then changes back to a previous value, we can still treat that as dirty. In the specific case of ModTime that would require time travel, but even for other values it is very rare.

Consequence 2: We only use historical ModTime values to compare them for equality with current ModTime values. We can instead record the Step at which the ModTime last changed, assuming all older Step values are unequal.

The result is:

database :: File -> (ModTime, Step, Step, [File])

valid :: File -> ModTime -> Bool
valid file mNow =
    mNow == mOld &&
    and [sBuilt >= changed (database d) | d <- deps]
    where (mOld, sBuilt, sChanged, deps) = database file
          changed (_, _, sChanged, _) = sChanged

For each File we store its most recently recorded ModTime, the Step at which it was built, the Step when the ModTime last changed, and the list of dependencies. We now check that the Step at which this file was built is greater than or equal to the Step at which each dependency last changed. Using the assumptions above, the original formulation is equivalent.

Note that instead of storing one ModTime for the file plus one per dependency, we now store exactly one ModTime plus two small Step values.
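
The optimised check can also be tried out as a small program. The sketch below makes a few assumptions that are not spelled out above: the database is a Data.Map, a missing entry is treated as invalid, and record (a name invented here) shows one plausible way the two Step values are updated when a rule reruns, with the changed Step only moving forward when the ModTime differs.

```haskell
import qualified Data.Map as Map

type File = String
type ModTime = Int
type Step = Int

-- | (most recent ModTime, Step built, Step last changed, dependencies)
type Database = Map.Map File (ModTime, Step, Step, [File])

-- | Valid if the ModTime is unchanged and no dependency has changed
-- since this file was built.
valid :: Database -> File -> ModTime -> Bool
valid db file mNow =
    case Map.lookup file db of
        Nothing -> False -- assumption: unknown files are invalid
        Just (mOld, sBuilt, _, deps) ->
            mNow == mOld && all (\d -> sBuilt >= changedAt d) deps
  where
    changedAt d = case Map.lookup d db of
        Just (_, _, sChanged, _) -> sChanged
        Nothing -> maxBound -- assumption: unknown dependency counts as changed

-- | Record a rebuild at the given Step, keeping the old changed Step
-- when the ModTime is the same as before.
record :: Step -> File -> ModTime -> [File] -> Database -> Database
record now file mNew deps db = Map.insert file (mNew, now, sChanged, deps) db
  where
    sChanged = case Map.lookup file db of
        Just (mOld, _, sOld, _) | mOld == mNew -> sOld
        _ -> now
```

If a dependency rebuilds at a later Step but produces the same ModTime, its changed Step stays put, so files built earlier remain valid. That is the "rebuild without triggering further rebuilds" behaviour from the simple model.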

We still store each file many times, but we reduce that by creating a bijection between File (arbitrarily large) and Id (small index) and only storing Id.
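
One way to picture the File/Id bijection is a pair of maps. This is only a sketch of the idea, not the data structure Shake uses; Intern, emptyIntern, intern and lookupFile are all hypothetical names:

```haskell
import qualified Data.Map as Map

type File = String

newtype Id = Id Int deriving (Eq, Ord, Show)

-- | Both directions of the mapping between large File keys and small Ids.
data Intern = Intern
    { toId   :: Map.Map File Id
    , fromId :: Map.Map Id File
    }

emptyIntern :: Intern
emptyIntern = Intern Map.empty Map.empty

-- | Return the Id for a File, allocating the next free index on first sight.
intern :: File -> Intern -> (Id, Intern)
intern file tbl =
    case Map.lookup file (toId tbl) of
        Just i -> (i, tbl)
        Nothing ->
            let i = Id (Map.size (toId tbl))
            in (i, Intern (Map.insert file i (toId tbl))
                          (Map.insert i file (fromId tbl)))

-- | Recover the File for an Id, if it has been allocated.
lookupFile :: Id -> Intern -> Maybe File
lookupFile i = Map.lookup i . fromId
```

Dependency lists then hold Id values only, so a File that is depended upon many times is stored in full just once.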

Implementing the Model

For those who like concrete details, which might change at any point in the future, the relevant definition is in Development.Shake.Database:

data Result = Result
    {result    :: Value   -- the result when last built
    ,built     :: Step    -- when it was actually run
    ,changed   :: Step    -- when the result last changed
    ,depends   :: [[Id]]  -- dependencies
    ,execution :: Float   -- duration of last run
    ,traces    :: [Trace] -- a trace of the expensive operations
    } deriving Show

The differences from the model are:

  • ModTime became Value, because Shake deals with lots of types of rules.
  • The dependencies are stored as a list of lists, so we still have access to the parallelism provided by need, and if we start rebuilding some dependencies we can do so in parallel.
  • We store execution and traces so we can produce profiling reports.
  • I haven't shown the File/Id mapping here - that lives elsewhere.
  • I removed all strictness/UNPACK annotations from the definition above, and edited a few comments.

As we can see, the code follows the optimised model quite closely.

8 comments:

mb14 said...

I'm still confused with the Shake model.

In your rule `File -> (ModTime, [(File, ModTime)])`. Is the time stored for a dependency
1 - the time the dependency has been last used
2 - the dependency last modification when the dependency has been used?

For example. Let's say B depends on A and A has been modified yesterday.
If I'm building B today: scenario (1) would be store

database B = (Today, [(A, Today)])

where as scenario (2) would store

database B = (Today, [(A, Yesterday)])

My understanding is scenario 2, in that case, ModTime could be easily replaced by a SHA. However, *Consequence 1*

The ModTime for a file and the ModTime for its dependencies are all recorded in the same run, so they share the same Step

suggests that we are using scenario (1).

Could you clarify please ?


Also, `valid` doesn't seem to be recursive, whereas you would expect an invalid dependency to invalidate all of its dependents.
Is this assumption wrong, or is the recursion *hidden* in the code?

/mb14

Neil Mitchell said...

mb14: Thanks for your comments - really interesting, and certainly helping me to understand things better.

In the simple model, the time stored for a dependency is the last modification when the dependency has been used. So the semantics are based on scenario 2.

In the complex model, I move to scenario 1, but using some fake notion of Step to be the time.

The key is that I couldn't record only the scenario 2 information in the simple model because I need to know if the time has changed. I solve that in the complex model by storing two Step values, and relying on some assumptions.

For valid, I am assuming that before you call valid on a File, you have already called valid on all its dependencies, and if necessary, built them so they have become valid. I had noted that in an earlier draft of this post, but it got lost in editing :(.

Anonymous said...

I'm even more confused now. I was hoping both models (the semantic one and the optimised one) would work in the same way. So which scenario is Shake implementing, 1 or 2?

About `valid`, that seems to be a strong assumption in my opinion ;-)

As you know, I'm really interested in using Shake (and building my own rules) but at the moment I still don't know if its model doesn't match my needs or if I don't understand the Shake model at all.

Neil Mitchell said...

I suspect the confusion comes from my poor explanations, and the fact that I'm using this post (and associated questions) to really figure out how to express the model best. Simon Peyton Jones had a bunch of questions when I showed it to him, which I recorded at https://github.com/ndmitchell/shake/blob/master/docs/Model.md#questionsanswers-simon-peyton-jones .

Both the model and the implementation work in the same way, if you can assume the assumptions (which I think is fair - the first is guaranteed by Shake, the second makes almost no practical difference). The model is exactly scenario 2. When thinking about how Shake works, think scenario 2.

Entirely separately to the model, you can implement it more efficiently by storing the step at which you last ran the rule. But it's really just a data encoding of the scenario 2 information, erasing the data you don't need (you only care if values are equal, not what the non-equal value is). So think of the model as a custom version of gzip on scenario 2, not as scenario 1. It just so happens that, at first glance, the description in scenario 1 is quite close to the optimised model. At some point I'd like to formally prove that given the model and the assumptions there is an equivalence to the implementation.

In this post I'm only really talking about what it means for something to be valid, leaving aside the question of how things get checked for validity. I cover that a bit in the paper, http://community.haskell.org/~ndm/downloads/paper-shake_before_building-10_sep_2012.pdf, figure 5. Essentially you can make the function that returns the value of the dependency also check the validity of the dependency, and then it's easy to show you can only see values which are valid. That's really a separate post, as you can parameterise the thing that ensures things are recursively valid over the thing that ensures things are one-step valid.

mb14 said...

Reading the Q/A from SPJ was really interesting. Scenario 2, where you record the "state" of a dependency when it has been used, is in my opinion the way Shake should work.
The state could be the mod time or a SHA, etc.

It feels, however, that even though you are trying to implement Scenario 2, the actual implementation is equivalent to Scenario 1 or something in between, which is where the confusion comes from.

A good way to see if Shake is compatible with Scenario 2 would be to replace modTime with SHA.

In that case we would have

database B = (sha B, [(A, sha A)])


Using SHA would probably be slower than modTime, but might still be fast enough (look at git for example).

Apart from that, the advantages of using a SHA are obvious: you don't need to rebuild things if the dependency content hasn't changed at all (even though the file might have been modified). Think of generated code. You can regenerate code without actually changing the content of generated files.

I'm saying that because at first sight the main difference between Shake and Make is that
Make uses (>) whereas Shake uses (!=), and I can feel that Shake is "losing" something against Make. However, using (!=) (and a database) makes it possible to use a SHA, so why not?

Neil Mitchell said...

I am fairly certain it's scenario 2. I have got file hashes in Shake, see http://neilmitchell.blogspot.co.uk/2014/06/shake-file-hashesdigests.html - you can turn them on with a flag. As you say, that means it would be hard to be scenario 1, and the encoding I've come up with must be an implementation of scenario 2.

Note you could define > rules in Shake if you wanted, it is powerful enough. However, I don't think they're very useful - they assume monotonic times, which (certainly on NFS filesystems) isn't guaranteed.

mb14 said...

Fair enough. Having that in mind, I will have another look at Shake documentation. I'm sure everything will be clearer.

Thanks for your explanation.

Neil Mitchell said...

Please yell if it isn't clear.

In response to your questioning I've started trying to concretely model the Shake build system, including things like when valid is called, and I would like to eventually prove things like the optimised model is consistent.