Summary: Don't use nub. A much faster alternative is nubOrd from the extra package.
The Haskell Data.List module contains the function nub, which removes duplicate elements. As an example:
nub [1,2,1,3] == [1,2,3]
The function nub has the type Eq a => [a] -> [a]. The complexity of take i $ nub xs is O(length xs * i). Assuming all elements are distinct and you want them all, that is O(length xs ^ 2). If we only have an Eq instance, that's the best complexity we can achieve. The reason is that given a list as ++ [b], to check if b should be in the output requires checking b for equality against nub as, which requires a linear scan. Since checking each element requires a linear scan, we end up with a quadratic complexity.
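To make the quadratic behaviour concrete, here is a simplified Eq-only version of nub (an illustrative sketch, not the actual base definition): each new element is compared against everything already kept.

-- Illustrative sketch of an Eq-only nub (not the base implementation).
-- 'seen' holds the elements already emitted; each new element needs a
-- linear scan over 'seen', giving quadratic behaviour overall.
nubEq :: Eq a => [a] -> [a]
nubEq = go []
    where
        go _ [] = []
        go seen (x:xs)
            | x `elem` seen = go seen xs
            | otherwise = x : go (x:seen) xs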
However, if we have an Ord instance (as we usually do), we have a complexity of O(length xs * log i) - a function that grows significantly slower. The reason is that we can build a balanced binary tree of the previous elements, and check each new element in log time. Does that make a difference in practice? Yes. As the graph below shows, by the time we get to 10,000 elements, nub is 70 times slower. Even at 1,000 elements nub is 8 times slower.
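For comparison, a minimal sketch of the Ord-based idea using Data.Set from the containers package as the balanced tree (the real nubOrd in extra implements its own tree to cut down on dependencies, but the shape is the same):

import qualified Data.Set as Set

-- Ord-based sketch: membership tests and inserts on the Set are
-- O(log n), so the whole pass is O(length xs * log i).
nubOrdSketch :: Ord a => [a] -> [a]
nubOrdSketch = go Set.empty
    where
        go _ [] = []
        go seen (x:xs)
            | x `Set.member` seen = go seen xs
            | otherwise = x : go (Set.insert x seen) xs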
The fact nub is dangerous isn't new information, and I even suggested changing the base library in 2007. Currently there seems to be a nub hit squad, including Niklas Hambüchen, who go around raising tickets against various projects suggesting they avoid nub. To make that easier, I've added nubOrd to my extra package, in the Data.List.Extra module. The function nubOrd has exactly the same semantics as nub (both strictness properties and ordering), but is asymptotically faster, so is almost a drop-in replacement (just the additional Ord context).
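As a quick illustration that the semantics line up (ordering and laziness), both of the following should hold; the names here are just for the example:

import Data.List.Extra (nubOrd)

-- Keeps the first occurrence of each element, in order, just like nub
ordered = nubOrd [1,2,1,3] == [1,2,3]

-- Lazy like nub, so it copes with infinite lists
lazy = take 3 (nubOrd (cycle [1,2,3])) == [1,2,3]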
For the curious, the above graph was generated in Excel, with the code below. I expect the spikes in nub correspond to garbage collection, or just random machine fluctuations.
import Control.Exception
import Data.List.Extra
import Control.Monad
import System.Time.Extra
-- Time nub and nubOrd on the same list, printing one CSV row: length, nub time, nubOrd time
benchmark xs = do
    n <- evaluate $ length xs
    (t1,_) <- duration $ evaluate $ length $ nub xs
    (t2,_) <- duration $ evaluate $ length $ nubOrd xs
    putStrLn $ show n ++ "," ++ show t1 ++ "," ++ show t2

main = do
    forM_ [0,100..10000] $ \i -> benchmark $ replicate i 1   -- every element identical
    forM_ [0,100..10000] $ \i -> benchmark [1..i]             -- every element distinct
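To reproduce the data, compile the module with optimisations (e.g. ghc -O2) and redirect the CSV output to a file for plotting; the exact timings will of course vary by machine.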
9 comments:
I'm not sure about nubOrd. nub is horrible, of course, but my response to that has always been that lists are overused and one should be using tree maps, hash maps, or whatever else is suitable. nubOrd provides good performance, but rather than reusing a standard map library it implements its own tree map. Part of me feels like it's best to simply discourage doing things like nub at all. If one really wants to be keeping track of uniqueness, one should not be sticking stuff into a list in the first place, right?
Franklin: I avoided reusing the standard container library to cut down on dependencies, and also to avoid overheads like a size in each node, which wouldn't be useful here. Note that things like Set/Map don't have ordering, but a list does, so there is a use for it. I often use nub when I have some operation that only needs running once (e.g. createDirectory), so I don't want to keep track of uniqueness, but nub is a nice disk-access optimization.
That said, there are certainly plenty of examples of people using lists when they should be using set/map. While nub is useful on lists, things like intersect/union are a lot more dubious.
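For example, a hypothetical version of that pattern (the function name and use of createDirectoryIfMissing are invented for illustration, so re-runs don't fail on existing directories): deduplicate the target directories once, then do the disk operations.

import Data.List.Extra (nubOrd)
import System.Directory (createDirectoryIfMissing)

-- Hypothetical sketch: many output files may share a directory, so
-- deduplicate before touching the disk.
createOutputDirs :: [FilePath] -> IO ()
createOutputDirs dirs = mapM_ (createDirectoryIfMissing True) (nubOrd dirs)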
With Fritz Henglein's discrimination tricks you can get this down all the way to linear in length for many Haskell data types.
Edward, could you clarify? Just checking if a list contains a duplicate requires Θ(n log n) time if all we have is an Ord constraint: http://en.m.wikipedia.org/wiki/Element_distinctness_problem
Perhaps, you meant using the Hashable constraint instead (or something similar), which should be enough to get to O(n) time.
Andrey: See http://www.diku.dk/hjemmesider/ansatte/henglein/papers/henglein2011a.pdf - you need more than an Ord constraint, and in fact more than just a Hashable constraint - but it's a very cool technique.
Thanks Neil. This looks really cool!
I found myself using nub just the other day. It's weird, after decades of shell-scripting blah | sort | uniq, I expected to find an equivalent that collapses *adjacent* duplicates.
James: map head . group will collapse adjacent elements. In some (cough cough) libraries you might find a uniq function that collapses duplicates.
> you need more than an Ord constraint
In a sense you need *less* than an Ord constraint.
The old PDF link to discrimination-based sorting is dead, so here are two new links:
http://hackage.haskell.org/package/discrimination -- Edward Kmett's Haskell implementation
https://www.cs.ox.ac.uk/projects/utgp/school/henglein2012c.pdf -- Fritz Henglein's article
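For anyone who wants to try the linear-time approach, a minimal sketch against the discrimination package (assuming its Data.Discrimination module exports a nub with a Grouping constraint, as its Hackage documentation describes):

import qualified Data.Discrimination as D

-- Linear-time nub for types with a Grouping instance, intended to keep
-- the first occurrence of each element like Data.List.nub, while
-- sidestepping the comparison-based lower bound.
linearNub :: D.Grouping a => [a] -> [a]
linearNub = D.nub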