## Tuesday, February 17, 2015

### nub considered harmful

Summary: Don't use `nub`. A much faster alternative is `nubOrd` from the `extra` package.

The Haskell `Data.List` module contains the function `nub`, which removes duplicate elements. As an example:

``````nub [1,2,1,3] ==  [1,2,3]
``````

The function `nub` has the type `Eq a => [a] -> [a]`. The complexity of `take i \$ nub xs` is O(length xs * i). Assuming all elements are distinct and you want them all, that is O(length xs ^ 2). If we only have an `Eq` instance, that's the best complexity we can achieve. The reason is that given a list `as ++ [b]`, to check if `b` should be in the output requires checking `b` for equality against `nub as`, which requires a linear scan. Since checking each element requires a linear scan, we end up with a quadratic complexity.

However, if we have an `Ord` instance (as we usually do), we have a complexity of O(length xs * log i) - a function that grows significantly slower. The reason is that we can build a balanced binary-tree for the previous elements, and check each new element in log time. Does that make a difference in practice? Yes. As the graph below shows, by the time we get to 10,000 elements, `nub` is 70 times slower. Even at 1,000 elements `nub` is 8 times slower.

The fact `nub` is dangerous isn't new information, and I even suggested changing the `base` library in 2007. Currently there seems to be a `nub` hit squad, including Niklas Hambüchen, who go around raising tickets against various projects suggesting they avoid `nub`. To make that easier, I've added `nubOrd` to my `extra` package, in the `Data.List.Extra` module. The function `nubOrd` has exactly the same semantics as `nub` (both strictness properties and ordering), but is asymptotically faster, so is almost a drop-in replacement (just the additional `Ord` context).

For the curious, the above graph was generated in Excel, with the code below. I expect the spikes in `nub` correspond to garbage collection, or just random machine fluctuations.

``````import Control.Exception
import Data.List.Extra
import Control.Monad
import System.Time.Extra

benchmark xs = do
n <- evaluate \$ length xs
(t1,_) <- duration \$ evaluate \$ length \$ nub xs
(t2,_) <- duration \$ evaluate \$ length \$ nubOrd xs
putStrLn \$ show n ++ "," ++ show t1 ++ "," ++ show t2

main = do
forM_ [0,100..10000] \$ \i -> benchmark \$ replicate i 1
forM_ [0,100..10000] \$ \i -> benchmark [1..i]
``````

#### 9 comments:

Franklin Chen said...

I'm not sure about nubOrd. nub is horrible, of course, but my response to that has always been that lists are overused and one should be using tree maps, hash maps, or whatever else is suitable. nubOrd provides good performance, but doesn't reuse a standard map library but instead implements its own tree map. Part of me feels like, it's best to simply discourage doing things like nub at all. If one really wants to be keeping track of uniqueness, one should not be sticking stuff into a list in the first place, right?

Neil Mitchell said...

Franklin: I avoided reusing the standard container library to cut down on dependencies, and also avoid overheads like a size in each node, which wouldn't be useful here. Note that things like Set/Map don't have ordering, but a list does, so there is a use for it. I often use nub when I have some operation that only needs running once (eg. createDirectory), so I don't want to keep track of uniqueness, but nub is a nice disk-access optimization.

That said, there are certainly plenty of examples of people using lists when they should be using set/map. While nub is useful on lists, things like intersect/union are a lot more dubious.

Edward Kmett said...

With Fritz Henglein's discrimination tricks you can get this down all the way to linear in length for many Haskell data types.

Andrey Mokhov said...

Edward, could you clarify? Just checking if a list contains a duplicate requires Θ(n log n) time if all we have is an Ord constraint: http://en.m.wikipedia.org/wiki/Element_distinctness_problem

Perhaps, you meant using the Hashable constraint instead (or something similar), which should be enough to get to O(n) time.

Neil Mitchell said...

Andrey: See http://www.diku.dk/hjemmesider/ansatte/henglein/papers/henglein2011a.pdf - you need more than an Ord constraint, and in fact more than just a Hashable constraint - but it's a very cool technique.

Andrey Mokhov said...

Thanks Neil. This looks really cool!

James Mansion said...

I found myself using nub just the other day. Its weird, after decades or shell-scripting blah | sort | uniq, I expected to find an equivalent that collapses *adjacent* duplicates.

Neil Mitchell said...

James: map head . group will collapse adjacent elements. In some (cough cough) libraries you might find a uniq function that collapses duplicates.

Simon Shine said...

> you need more than an Ord constraint

In a sense you need *less* than an Ord constraint.

The old PDF link to discrimination-based sorting is dead, so here are two new links:

http://hackage.haskell.org/package/discrimination -- Edward Kmett's Haskell implementation

https://www.cs.ox.ac.uk/projects/utgp/school/henglein2012c.pdf -- Fritz Henglein's article