Summary: Shake is not like Make, it has different internal state, which leads to different behaviour. I also store the state in an optimised way.
Update: I'm keeping an up to date version of this post in the Shake repo, which includes a number of questions/answers at the bottom, and is likely to evolve over time to incorporate that information into the main text.
In order to understand the behaviour of Shake, it is useful to have a mental model of Shake's internal state. To be a little more concrete, let's talk about Files which are stored on disk, which have ModTime value's associated with them, where modtime gives the ModTime of a FilePath (Shake is actually generalised over all those things). Let's also imagine we have the rule:
file *> \out -> do
    need [dependency]
    run
So file depends on dependency and rebuilds by executing the action run.
The Make Model
In Make there is no additional state, only the file-system. A file is considered dirty if it has a dependency such that:
modtime dependency > modtime file
As a consequence, run must update modtime file, or the file will remain dirty and rebuild in subsequent runs.
The Shake Model
For Shake, the state is:
database :: File -> (ModTime, [(File, ModTime)])
Each File is associated with a pair containing the ModTime of that file, plus a list of each dependency and their modtimes, all from when the rule was last run. As part of executing the rule above, Shake records the association:
file -> (modtime file, [(dependency, modtime dependency)])
The file is considered dirty if any of the information is no longer current. In this example, if either modtime file changes, or modtime dependency changes.
There are a few consequences of the Shake model:
- There is no requirement for modtime fileto change as a result ofrun. The file is dirty because something changed, after we run the rule and record new information it becomes clean.
- Since a file is not required to change its modtime, things that depend onfilemay not require rebuilding even iffilerebuilds.
- If you update an output file, it will rebuild that file, as the ModTimeof a result is tracked.
- Shake only ever performs equality tests on ModTime, never ordering, which means it generalises to other types of value and works even if your file-system sometimes has incorrect times.
These consequences allow two workflows that aren't pleasant in Make:
- Generated files, where the generator changes often, but the output of the generator for a given input changes rarely. In Shake, you can rerun the generator regularly, and using a function that writes only on change (writeFileChangedin Shake) you don't rebuild further. This technique can reduce some rebuilds from hours to seconds.
- Configuration file splitting, where you have a configuration file with lots of key/value pairs, and want certain rules to only depend on a subset of the keys. In Shake, you can generate a file for each key/value and depend only on that key. If the configuration file updates, but only a subset of keys change, then only a subset of rules will rebuild. Alternatively, using Development.Shake.Configyou can avoid the file for each key, but the dependency model is the same.
Optimising the Model
The above model expresses the semantics of Shake, but the implementation uses an optimised model. Note that the original Shake paper gives the optimised model, not the easy to understand model - that's because I only figured out the difference a few days ago (thanks to Simon Marlow, Simon Peyton Jones and Andrey Mokhov). To recap, we started with:
database :: File -> (ModTime, [(File, ModTime)])
We said that File is dirty if any of the ModTime values change. That's true, but what we are really doing is comparing the first ModTime with the ModTime on disk, and the list of second ModTime's with those in database. Assuming we are passed the current ModTime on disk, then a file is valid if:
valid :: File -> ModTime -> Bool
valid file mNow =
    mNow == mOld &&
    and [fst (database d) == m | (d,m) <- deps]
    where (mOld, deps) = database file
The problem with this model is that we store each File/ModTime pair once for the file itself, plus once for every dependency. That's a fairly large amount of information, and in Shake both File and ModTime can be arbitrarily large for user rules.
Let's introduce two assumptions:
Assumption 1: A File only has at most one ModTime per Shake run, since a file will only rebuild at most once per run. We use Step for the number of times Shake has run on this project.
Consequence 1: The ModTime for a file and the ModTime for its dependencies are all recorded in the same run, so they share the same Step.
Assumption 2: We assume that if the ModTime of a File changes, and then changes back to a previous value, we can still treat that as dirty. In the specific case of ModTime that would require time travel, but even for other values it is very rare.
Consequence 2: We only use historical ModTime values to compare them for equality with current ModTime values. We can instead record the Step at which the ModTime last changed, assuming all older Step values are unequal.
The result is:
database :: File -> (ModTime, Step, Step, [File])
valid :: File -> ModTime -> Bool
valid file mNow =
    mNow == mOld &&
    and [sBuild >= changed (database d) | d <- deps]
    where (mOld, sBuilt, sChanged, deps) = database file
          changed (_, _, sChanged, _) = sChanged
For each File we store its most recently recorded ModTime, the Step at which it was built, the Step when the ModTime last changed, and the list of dependencies. We now check if the Step for this file is greater than the Step at which dependency last changed. Using the assumptions above, the original formulation is equivalent.
Note that instead of storing one ModTime per dependency+1, we now store exactly one ModTime plus two small Step values.
We still store each file many times, but we reduce that by creating a bijection between File (arbitrarily large) and Id (small index) and only storing Id.
Implementing the Model
For those who like concrete details, which might change at any point in the future, the relevant definition is in Development.Shake.Database:
data Result = Result
    {result    :: Value   -- the result when last built
    ,built     :: Step    -- when it was actually run
    ,changed   :: Step    -- when the result last changed
    ,depends   :: [[Id]]  -- dependencies
    ,execution :: Float   -- duration of last run
    ,traces    :: [Trace] -- a trace of the expensive operations
    } deriving Show
The differences from the model are:
- ModTimebecame- Value, because Shake deals with lots of types of rules.
- The dependencies are stored as a list of lists, so we still have access to the parallelism provided by need, and if we start rebuilding some dependencies we can do so in parallel.
- We store executionandtracesso we can produce profiling reports.
- I haven't shown the File/Idmapping here - that lives elsewhere.
- I removed all strictness/UNPACKannotations from the definition above, and edited a few comments.
As we can see, the code follows the optimised model quite closely.