Tuesday, February 17, 2015

nub considered harmful

Summary: Don't use nub. A much faster alternative is nubOrd from the extra package.

The Haskell Data.List module contains the function nub, which removes duplicate elements. As an example:

nub [1,2,1,3] == [1,2,3]

The function nub has the type Eq a => [a] -> [a]. The complexity of take i $ nub xs is O(length xs * i). Assuming all elements are distinct and you want them all, that is O(length xs ^ 2). If we only have an Eq instance, that's the best complexity we can achieve. The reason is that given a list as ++ [b], to check if b should be in the output requires checking b for equality against nub as, which requires a linear scan. Since checking each element requires a linear scan, we end up with a quadratic complexity.
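The argument above can be sketched in a few lines. This isn't how Data.List.nub is actually written (and nubEq is my name for the sketch), but it shows where the quadratic cost comes from: each new element is compared against everything emitted so far.

```haskell
-- Sketch of the Eq-only algorithm: the i-th output element costs
-- O(i) comparisons, because `elem` is a linear scan over the
-- elements already emitted.
nubEq :: Eq a => [a] -> [a]
nubEq = go []
  where
    go _    []     = []
    go seen (x:xs)
      | x `elem` seen = go seen xs           -- linear scan, O(i)
      | otherwise     = x : go (x:seen) xs   -- emit and remember x
```

Note that the result is produced lazily, element by element, which is why take i $ nub xs only pays for the prefix it demands.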

However, if we have an Ord instance (as we usually do), we can achieve a complexity of O(length xs * log i) - a function that grows significantly more slowly. The reason is that we can build a balanced binary tree of the previous elements, and check each new element in logarithmic time. Does that make a difference in practice? Yes. As the graph below shows, by the time we get to 10,000 elements, nub is 70 times slower. Even at 1,000 elements nub is 8 times slower.
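The Ord-based idea can be sketched using Data.Set from the containers package (the real nubOrd in extra rolls its own tree, partly to avoid the dependency, but the shape is the same; nubOrdSketch is my name):

```haskell
import qualified Data.Set as Set

-- Sketch of the Ord-based algorithm: a balanced tree of seen
-- elements makes each membership test O(log i), where i is the
-- number of distinct elements seen so far.
nubOrdSketch :: Ord a => [a] -> [a]
nubOrdSketch = go Set.empty
  where
    go _    []     = []
    go seen (x:xs)
      | x `Set.member` seen = go seen xs                  -- O(log i)
      | otherwise           = x : go (Set.insert x seen) xs
```

Like nub, this preserves the order of first occurrence and produces its output lazily.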

The fact nub is dangerous isn't new information, and I even suggested changing the base library in 2007. Currently there seems to be a nub hit squad, including Niklas Hambüchen, who go around raising tickets against various projects suggesting they avoid nub. To make that easier, I've added nubOrd to my extra package, in the Data.List.Extra module. The function nubOrd has exactly the same semantics as nub (both strictness properties and ordering), but is asymptotically faster, so is almost a drop-in replacement (just the additional Ord context).

For the curious, the above graph was generated in Excel, with the code below. I expect the spikes in nub correspond to garbage collection, or just random machine fluctuations.

import Control.Exception
import Data.List.Extra
import Control.Monad
import System.Time.Extra

benchmark xs = do
    n <- evaluate $ length xs  -- force the list before starting the clock
    (t1,_) <- duration $ evaluate $ length $ nub xs
    (t2,_) <- duration $ evaluate $ length $ nubOrd xs
    putStrLn $ show n ++ "," ++ show t1 ++ "," ++ show t2

main = do
    forM_ [0,100..10000] $ \i -> benchmark $ replicate i 1  -- all duplicates
    forM_ [0,100..10000] $ \i -> benchmark [1..i]           -- all distinct


Franklin Chen said...

I'm not sure about nubOrd. nub is horrible, of course, but my response to that has always been that lists are overused and one should be using tree maps, hash maps, or whatever else is suitable. nubOrd provides good performance, but doesn't reuse a standard map library - instead it implements its own tree map. Part of me feels like it's best to simply discourage doing things like nub at all. If one really wants to keep track of uniqueness, one should not be sticking stuff into a list in the first place, right?

Neil Mitchell said...

Franklin: I avoided reusing the standard container library to cut down on dependencies, and also avoid overheads like a size in each node, which wouldn't be useful here. Note that things like Set/Map don't have ordering, but a list does, so there is a use for it. I often use nub when I have some operation that only needs running once (eg. createDirectory), so I don't want to keep track of uniqueness, but nub is a nice disk-access optimization.

That said, there are certainly plenty of examples of people using lists when they should be using set/map. While nub is useful on lists, things like intersect/union are a lot more dubious.

Edward Kmett said...

With Fritz Henglein's discrimination tricks you can get this down all the way to linear in length for many Haskell data types.

Andrey Mokhov said...

Edward, could you clarify? Just checking if a list contains a duplicate requires Θ(n log n) time if all we have is an Ord constraint: http://en.m.wikipedia.org/wiki/Element_distinctness_problem

Perhaps, you meant using the Hashable constraint instead (or something similar), which should be enough to get to O(n) time.

Neil Mitchell said...

Andrey: See http://www.diku.dk/hjemmesider/ansatte/henglein/papers/henglein2011a.pdf - you need more than an Ord constraint, and in fact more than just a Hashable constraint - but it's a very cool technique.
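This isn't discrimination proper, but for a taste of what going beyond Ord buys you: for the concrete case of Int, Data.IntSet (a patricia tree keyed on the bits of the Int) gives membership and insertion in O(min(n, W)) for word size W - effectively constant - so the same loop becomes near-linear (nubIntSketch is my name for it):

```haskell
import qualified Data.IntSet as IntSet

-- For Int specifically, IntSet operations are bounded by the word
-- size rather than log n, so this runs in effectively linear time.
nubIntSketch :: [Int] -> [Int]
nubIntSketch = go IntSet.empty
  where
    go _    []     = []
    go seen (x:xs)
      | x `IntSet.member` seen = go seen xs
      | otherwise              = x : go (IntSet.insert x seen) xs
```

Henglein's technique generalizes this kind of bit-level treatment to a much wider class of types.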

Andrey Mokhov said...

Thanks Neil. This looks really cool!

James Mansion said...

I found myself using nub just the other day. It's weird: after decades of shell-scripting blah | sort | uniq, I expected to find an equivalent that collapses *adjacent* duplicates.

Neil Mitchell said...

James: map head . group will collapse adjacent elements. In some (cough cough) libraries you might find a uniq function that collapses duplicates.
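Spelled out (uniq is my name for the one-liner; some libraries export an equivalent under that name):

```haskell
import Data.List (group, sort)

-- Collapse *adjacent* duplicates only, like shell's uniq.
uniq :: Eq a => [a] -> [a]
uniq = map head . group
```

Composing with sort, uniq . sort gives the moral equivalent of sort | uniq.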

Simon Shine said...

> you need more than an Ord constraint

In a sense you need *less* than an Ord constraint.

The old PDF link to discrimination-based sorting is dead, so here are two new links:

http://hackage.haskell.org/package/discrimination -- Edward Kmett's Haskell implementation

https://www.cs.ox.ac.uk/projects/utgp/school/henglein2012c.pdf -- Fritz Henglein's article