Summary: Shake is not like Make: it has different internal state, which leads to different behaviour. I also store that state in an optimised way.
Update: I'm keeping an up-to-date version of this post in the Shake repo, which includes a number of questions/answers at the bottom, and is likely to evolve over time to incorporate that information into the main text.
In order to understand the behaviour of Shake, it is useful to have a mental model of Shake's internal state. To be a little more concrete, let's talk about `File`s which are stored on disk, which have `ModTime` values associated with them, where `modtime` gives the `ModTime` of a `FilePath` (Shake is actually generalised over all those things). Let's also imagine we have the rule:
```haskell
file *> \out -> do
    need [dependency]
    run
```
So `file` depends on `dependency` and rebuilds by executing the action `run`.
The Make Model
In Make there is no additional state, only the file-system. A file is considered dirty if it has a dependency such that:

```haskell
modtime dependency > modtime file
```

As a consequence, `run` must update `modtime file`, or the file will remain dirty and rebuild in subsequent runs.
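The Make check above can be sketched as a pure predicate. This is an illustrative stand-in, not Make's actual code; any ordered type works for the times:

```haskell
-- A minimal sketch of Make's dirty check: a file is dirty if any
-- dependency has a newer modification time than the file itself.
dirtyMake :: Ord time => time -> [time] -> Bool
dirtyMake fileTime depTimes = any (> fileTime) depTimes
```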
The Shake Model
For Shake, the state is:
```haskell
database :: File -> (ModTime, [(File, ModTime)])
```
Each `File` is associated with a pair containing the `ModTime` of that file, plus a list of each dependency and their `modtime`s, all from when the rule was last run. As part of executing the rule above, Shake records the association:

```haskell
file -> (modtime file, [(dependency, modtime dependency)])
```
The `file` is considered dirty if any of that information is no longer current - in this example, if either `modtime file` changes, or `modtime dependency` changes.
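As a sketch of that dirty check, with `Int` standing in for `ModTime` and names that are illustrative rather than Shake's API:

```haskell
type ModTime = Int  -- stand-in; Shake generalises over the time type

-- Dirty if the file's own recorded ModTime, or any dependency's
-- recorded ModTime, no longer matches the current value on disk.
dirtyShake :: (ModTime, [(FilePath, ModTime)])  -- recorded (file time, dep times)
           -> ModTime                           -- current modtime of the file
           -> (FilePath -> ModTime)             -- current modtime of a dependency
           -> Bool
dirtyShake (mOld, deps) mNow modtimeNow =
    mNow /= mOld || any (\(d, m) -> modtimeNow d /= m) deps
```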
There are a few consequences of the Shake model:
- There is no requirement for `modtime file` to change as a result of `run`. The file is dirty because something changed; after we run the rule and record new information, it becomes clean.
- Since a file is not required to change its `modtime`, things that depend on `file` may not require rebuilding even if `file` rebuilds.
- If you update an output file, it will rebuild that file, as the `ModTime` of a result is tracked.
- Shake only ever performs equality tests on `ModTime`, never ordering, which means it generalises to other types of value and works even if your file-system sometimes has incorrect times.
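The last point is worth a tiny illustration. If a dependency is restored from backup so its modtime goes backwards (a hypothetical scenario, with `Int` standing in for `ModTime`), an ordering test misses the change while an equality test catches it:

```haskell
-- Make compares the dependency's current time against the target's time;
-- Shake compares the dependency's current time against what it recorded.
makeDirty :: Int -> Int -> Bool
makeDirty depNow fileTime = depNow > fileTime

shakeDirty :: Int -> Int -> Bool
shakeDirty depNow depRecorded = depNow /= depRecorded
```

With the dependency recorded at time 80, restored to time 50, and the target at time 100: `makeDirty 50 100` is `False` (Make sees nothing to do), while `shakeDirty 50 80` is `True` (Shake rebuilds).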
These consequences allow two workflows that aren't pleasant in Make:
- Generated files, where the generator changes often, but the output of the generator for a given input changes rarely. In Shake, you can rerun the generator regularly, and by using a function that writes only on change (`writeFileChanged` in Shake) you don't rebuild further. This technique can reduce some rebuilds from hours to seconds.
- Configuration file splitting, where you have a configuration file with lots of key/value pairs, and want certain rules to depend on only a subset of the keys. In Shake, you can generate a file for each key/value pair and depend only on that file. If the configuration file updates, but only a subset of keys change, then only a subset of rules will rebuild. Alternatively, using `Development.Shake.Config` you can avoid having a file for each key, but the dependency model is the same.
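A write-only-on-change helper can be sketched as follows. This is an illustrative stand-in for Shake's `writeFileChanged`, not its actual implementation:

```haskell
import Control.Monad (unless)
import System.Directory (doesFileExist)
import qualified Data.ByteString.Char8 as BS

-- Write the file only if its content would actually change, so the
-- ModTime is untouched when the generated output is identical.
writeFileIfChanged :: FilePath -> BS.ByteString -> IO ()
writeFileIfChanged path new = do
    exists <- doesFileExist path
    same <- if exists then (== new) <$> BS.readFile path else return False
    unless same $ BS.writeFile path new
```

Because the file's `ModTime` only moves when the bytes differ, downstream rules that depend on the generated file stay clean when the generator reruns but produces identical output.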
Optimising the Model
The above model expresses the semantics of Shake, but the implementation uses an optimised model. Note that the original Shake paper gives the optimised model, not the easy-to-understand model - that's because I only figured out the difference a few days ago (thanks to Simon Marlow, Simon Peyton Jones and Andrey Mokhov). To recap, we started with:
```haskell
database :: File -> (ModTime, [(File, ModTime)])
```
We said that a `File` is dirty if any of the `ModTime` values change. That's true, but what we are really doing is comparing the first `ModTime` with the `ModTime` on disk, and each of the second `ModTime`s with the first `ModTime` stored in `database` for that dependency. Assuming we are passed the current `ModTime` on disk, then a file is valid if:
```haskell
valid :: File -> ModTime -> Bool
valid file mNow =
    mNow == mOld &&
    and [fst (database d) == m | (d,m) <- deps]
    where (mOld, deps) = database file
```
The problem with this model is that we store each `File`/`ModTime` pair once for the file itself, plus once for every dependency. That's a fairly large amount of information, and in Shake both `File` and `ModTime` can be arbitrarily large for user rules.
Let's introduce two assumptions:
Assumption 1: A `File` has at most one `ModTime` per Shake run, since a file will rebuild at most once per run. We use `Step` for the number of times Shake has run on this project.
Consequence 1: The `ModTime` for a file and the `ModTime`s for its dependencies are all recorded in the same run, so they share the same `Step`.
Assumption 2: We assume that if the `ModTime` of a `File` changes, and then changes back to a previous value, we can still treat that as dirty. In the specific case of `ModTime` that would require time travel, but even for other values it is very rare.
Consequence 2: We only use historical `ModTime` values to compare them for equality with current `ModTime` values. We can instead record the `Step` at which the `ModTime` last changed, assuming all older `Step` values are unequal.
The result is:
```haskell
database :: File -> (ModTime, Step, Step, [File])

valid :: File -> ModTime -> Bool
valid file mNow =
    mNow == mOld &&
    and [sBuilt >= changed (database d) | d <- deps]
    where (mOld, sBuilt, sChanged, deps) = database file
          changed (_, _, sChanged, _) = sChanged
```
For each `File` we store its most recently recorded `ModTime`, the `Step` at which it was built, the `Step` at which the `ModTime` last changed, and the list of dependencies. We now check whether the `Step` at which this file was built is greater than or equal to the `Step` at which each dependency last changed. Using the assumptions above, this formulation is equivalent to the original.
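As a concrete sketch of the optimised check, with `Int` stand-ins for `ModTime` and `Step` and a hard-coded `database` (the names mirror the model above, not Shake's real code):

```haskell
type ModTime = Int
type Step = Int

-- (modtime, step at which built, step at which modtime last changed, deps)
database :: String -> (ModTime, Step, Step, [String])
database "file"       = (100, 3, 3, ["dependency"])
database "dependency" = ( 90, 3, 2, [])
database f            = error ("unknown file: " ++ f)

valid :: String -> ModTime -> Bool
valid file mNow =
    mNow == mOld &&
    and [sBuilt >= changed (database d) | d <- deps]
    where (mOld, sBuilt, _, deps) = database file
          changed (_, _, sChanged, _) = sChanged
```

Here `file` was built at step 3 and `dependency` last changed at step 2, so `3 >= 2` holds and `file` is valid as long as its on-disk `ModTime` is still 100.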
Note that instead of storing one `ModTime` per dependency, plus one for the file itself, we now store exactly one `ModTime` plus two small `Step` values.
We still store each file many times, but we reduce that by creating a bijection between `File` (arbitrarily large) and `Id` (a small index) and only storing `Id`s.
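The `File`-to-`Id` mapping is a standard interning trick; a sketch using `Data.Map` (illustrative, not Shake's implementation):

```haskell
import qualified Data.Map as Map

type Id = Int

-- Return the Id for a file, allocating a fresh one on first sight.
intern :: FilePath -> Map.Map FilePath Id -> (Id, Map.Map FilePath Id)
intern file m = case Map.lookup file m of
    Just i  -> (i, m)
    Nothing -> let i = Map.size m in (i, Map.insert file i m)
```

Repeated calls with the same file return the same `Id`, so the database can store compact `Id`s everywhere and keep a single copy of each `FilePath`.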
Implementing the Model
For those who like concrete details, which might change at any point in the future, the relevant definition is in `Development.Shake.Database`:
```haskell
data Result = Result
    {result :: Value -- the result when last built
    ,built :: Step -- when it was actually run
    ,changed :: Step -- when the result last changed
    ,depends :: [[Id]] -- dependencies
    ,execution :: Float -- duration of last run
    ,traces :: [Trace] -- a trace of the expensive operations
    } deriving Show
```
The differences from the model are:
- `ModTime` became `Value`, because Shake deals with lots of types of rules.
- The dependencies are stored as a list of lists, so we still have access to the parallelism provided by `need`, and if we start rebuilding some dependencies we can do so in parallel.
- We store `execution` and `traces` so we can produce profiling reports.
- I haven't shown the `File`/`Id` mapping here - that lives elsewhere.
- I removed all strictness/`UNPACK` annotations from the definition above, and edited a few comments.
As we can see, the code follows the optimised model quite closely.
10 comments:
I'm still confused with the Shake model.
In your rule `File -> (ModTime, [(File, ModTime)])`, is the time stored for a dependency
1 - the time the dependency was last used, or
2 - the dependency's last modification time when the dependency was used?
For example, let's say B depends on A and A was modified yesterday.
If I'm building B today, scenario (1) would store
`database B = (Today, [(A, Today)])`
whereas scenario (2) would store
`database B = (Today, [(A, Yesterday)])`
My understanding is scenario 2; in that case, ModTime could easily be replaced by a SHA. However, *Consequence 1* ("The ModTime for a file and the ModTime for its dependencies are all recorded in the same run, so they share the same Step") suggests we are using scenario (1).
Could you clarify please?
Also, `valid` doesn't seem to be recursive, whereas you would expect an invalid dependency to invalidate all of its dependents.
Is this assumption wrong, or is the recursion *hidden* in the code?
/mb14
mb14: Thanks for your comments - really interesting, and certainly helping me to understand things better.
In the simple model, the time stored for a dependency is the last modification when the dependency has been used. So the semantics are based on scenario 2.
In the complex model, I move to scenario 1, but using some fake notion of Step to be the time.
The key is that I couldn't record only the scenario 2 information in the simple model because I need to know if the time has changed. I solve that in the complex model by storing two Step values, and relying on some assumptions.
For valid, I am assuming that before you call valid on a File, you have already called valid on all its dependencies, and if necessary, built them so they have become valid. I had noted that in an earlier draft of this post, but it got lost in editing :(.
I'm even more confused now. I was hoping both models (the semantic one and the optimised one) would work in the same way. So which scenario is Shake implementing, 1 or 2?
About `valid`, that seems to be a strong assumption in my opinion ;-)
As you know, I'm really interested in using Shake (and building my own rules), but at the moment I still don't know whether its model doesn't match my needs or whether I don't understand the Shake model at all.
I suspect the confusion comes from my poor explanations, and the fact that I'm using this post (and associated questions) to really figure out how to express the model best. Simon Peyton Jones had a bunch of questions when I showed it to him, which I recorded at https://github.com/ndmitchell/shake/blob/master/docs/Model.md#questionsanswers-simon-peyton-jones .
Both the model and the implementation work in the same way, if you can assume the assumptions (which I think is fair - the first is guaranteed by Shake, the second makes almost no practical difference). The model is exactly scenario 2. When thinking about how Shake works, think scenario 2.
Entirely separately to the model, you can implement it more efficiently by storing the step at which you last ran the rule. But it's really just a data encoding of the scenario 2 information, erasing the data you don't need (you only care if values are equal, not what the non-equal value is). So think of the model as a custom version of gzip on scenario 2, not as scenario 1. It just so happens that, at first glance, the description in scenario 1 is quite close to the optimised model. At some point I'd like to formally prove that given the model and the assumptions there is an equivalence to the implementation.
In this post I'm only really talking about what it means for something to be valid, leaving aside the question of how things get checked for validity. I cover that a bit in the paper, http://community.haskell.org/~ndm/downloads/paper-shake_before_building-10_sep_2012.pdf, figure 5. Essentially you can make the function that returns the value of the dependency for also check the validity of the dependency, and then it's easy to show you can only see values which are valid. That's really a separate post, as you can parameterise the thing that ensures things are recursively valid over the thing that ensures things are one-step valid.
Reading the Q/A from SPJ was really interesting. Scenario 2, where you record the "state" of a dependency when it has been used, is in my opinion the way Shake should work.
The state could be the mod time or a SHA, etc.
It feels, however, that even though you are trying to implement scenario 2, the actual implementation is equivalent to scenario 1 or something in between, which is where the confusion comes from.
A good way to see if Shake is compatible with scenario 2 would be to replace modTime with a SHA.
In that case we would have
`database B = (sha B, [(A, sha A)])`
Using SHA would probably be slower than modTime, but might still be fast enough (look at git for example).
Apart from that, the advantages of using a SHA are obvious: you don't need to rebuild things if the dependency's content hasn't changed at all (even though the file might have been modified). Think of generated code: you can regenerate code without actually changing the content of the generated files.
I'm saying that because at first sight the main difference between Shake and Make is that Make uses (>) whereas Shake uses (!=), and I can feel that Shake is "losing" something compared to Make. However, using (!=) (and a database) makes it possible to use a SHA, so why not?
I am fairly certain it's scenario 2. I have got file hashes in Shake, see http://neilmitchell.blogspot.co.uk/2014/06/shake-file-hashesdigests.html - you can turn them on with a flag. As you say, that means it would be hard to be scenario 1, and the encoding I've come up with must be an implementation of scenario 2.
Note you could define > rules in Shake if you wanted, it is powerful enough. However, I don't think they're very useful - they assume monotonic times, which (certainly on NFS filesystems) isn't guaranteed.
Fair enough. Having that in mind, I will have another look at Shake documentation. I'm sure everything will be clearer.
Thanks for your explanation.
Please yell if it isn't clear.
In response to your questioning I've started trying to concretely model the Shake build system, including things like when valid is called, and I would like to eventually prove things like the optimised model is consistent.
The in-Shake-repo version of this document has moved. Its new location is: https://github.com/ndmitchell/shake/blob/master/model/Model.md
Thanks Strager Neds - I've updated the post with the new URL.