Wednesday, February 25, 2015

Making withSocketsDo unnecessary

Summary: Currently you have to call withSocketsDo before using the Haskell network library. In the next version you won't have to.

The Haskell network library has always had a weird and unpleasant invariant. Under Windows, you must call withSocketsDo before calling any other functions. If you forget, the error message isn't particularly illuminating (e.g. getAddrInfo, does not exist, error 10093). Calling withSocketsDo isn't harmful under Linux, but equally isn't necessary, and thus easy to accidentally omit. The network library has recently merged some patches so that in future versions there is no requirement to call withSocketsDo, even on Windows.

Existing versions of network

The reason for requiring withSocketsDo is so that the network library can initialise the Windows Winsock library. The code for withSocketsDo was approximately:

withSocketsDo :: IO a -> IO a
#if WINDOWS
withSocketsDo act = do
    initWinsock
    act `finally` termWinsock
#else
withSocketsDo act = act
#endif

Where initWinsock and termWinsock were C functions. Both checked a mutable variable so they only initialised/terminated once. The initWinsock function immediately initialised the Winsock library. The termWinsock function did not terminate the library, but merely installed an atexit handler, providing a function that ran when the program shut down which terminated the Winsock library.

As a result, in all existing versions of the network library, it is fine to nest calls to withSocketsDo, call withSocketsDo multiple times, and to perform networking operations after withSocketsDo has returned.

Future versions of network

My approach to removing the requirement to call withSocketsDo was to make it very cheap, then sprinkle it everywhere it might be needed. Making such a function cheap on non-Windows just required an INLINE pragma (although its very likely GHC would have always inlined the function anyway).

For Windows, I changed to:

withSocketsDo act = do evaluate withSocketsInit; act 

{-# NOINLINE withSocketsInit #-}
withSocketsInit = unsafePerformIO $ do
    initWinsock
    termWinsock

Now withSocketsDo is very cheap, with subsequent calls requiring no FFI calls, and thanks to pointer tagging, just a few cheap instructions. When placing additional withSocketsDo calls my strategy was to make sure I called it before constructing a Socket (which many functions take as an argument), and when taking one of the central locks required for the network library. In addition, I identified a few places not otherwise covered.

In newer versions of the network library it is probably never necessary to call withSocketsDo - if you find a place where one is necessary, let me know. However, for compatibility with older versions on Windows, it is good practice to always call withSocketsDo. Libraries making use of the network library should probably call withSocketsDo on their users behalf.

Tuesday, February 17, 2015

nub considered harmful

Summary: Don't use nub. A much faster alternative is nubOrd from the extra package.

The Haskell Data.List module contains the function nub, which removes duplicate elements. As an example:

nub [1,2,1,3] ==  [1,2,3]

The function nub has the type Eq a => [a] -> [a]. The complexity of take i $ nub xs is O(length xs * i). Assuming all elements are distinct and you want them all, that is O(length xs ^ 2). If we only have an Eq instance, that's the best complexity we can achieve. The reason is that given a list as ++ [b], to check if b should be in the output requires checking b for equality against nub as, which requires a linear scan. Since checking each element requires a linear scan, we end up with a quadratic complexity.

However, if we have an Ord instance (as we usually do), we have a complexity of O(length xs * log i) - a function that grows significantly slower. The reason is that we can build a balanced binary-tree for the previous elements, and check each new element in log time. Does that make a difference in practice? Yes. As the graph below shows, by the time we get to 10,000 elements, nub is 70 times slower. Even at 1,000 elements nub is 8 times slower.


The fact nub is dangerous isn't new information, and I even suggested changing the base library in 2007. Currently there seems to be a nub hit squad, including Niklas Hamb├╝chen, who go around raising tickets against various projects suggesting they avoid nub. To make that easier, I've added nubOrd to my extra package, in the Data.List.Extra module. The function nubOrd has exactly the same semantics as nub (both strictness properties and ordering), but is asymptotically faster, so is almost a drop-in replacement (just the additional Ord context).

For the curious, the above graph was generated in Excel, with the code below. I expect the spikes in nub correspond to garbage collection, or just random machine fluctuations.

import Control.Exception
import Data.List.Extra
import Control.Monad
import System.Time.Extra

benchmark xs = do
    n <- evaluate $ length xs
    (t1,_) <- duration $ evaluate $ length $ nub xs
    (t2,_) <- duration $ evaluate $ length $ nubOrd xs
    putStrLn $ show n ++ "," ++ show t1 ++ "," ++ show t2

main = do
    forM_ [0,100..10000] $ \i -> benchmark $ replicate i 1
    forM_ [0,100..10000] $ \i -> benchmark [1..i]

Tuesday, February 10, 2015

Why is the Hoogle index so out of date?

Summary: Hoogle 4 is out of date. The alpha version Hoogle 5 has fresh code and data every day (and isn't yet ready).

Someone recently asked why Hoogle's index is so out of date. Making the index both more current (updated daily) and larger (indexing all of Stackage) is one of the goals behind my Hoogle 5 rewrite (which still isn't finished). Let's compare the different update processes:

Hoogle 4 updates took about two hours to complete, if they went well, and often had to be aborted. I first compiled the Hoogle binary on the haskell.org machines, which often failed, as typically the version of GHC was very old. Once I'd got a compiled binary, I needed to generate the database, which took about 2 hours, and occasionally failed halfway through. Once I had the new binary and databases I moved everything to correct place for Apache, accepting a small window of downtime during the move. Assuming that worked, I did a few test searches and smiled. Often the new Hoogle binary failed to start (usually failure to find some files, sometimes permissions) and I had to switch back to the old copy. Fixing up such issues took up to an hour. I had a mix of Windows .bat and Linux .sh scripts to automate some of the steps, but they weren't very robust, and required babysitting.

Hoogle 5 updates happen automatically at 8pm every night, take 4 minutes, and have yet to fail. I have a cron script that checks out the latest code and runs an update script. That script clones a fresh repo, compiles Hoogle, builds the databases, runs the test suite, kills the old version and launches the new version. The Hoogle code is all tested on Travis, so I don't expect that to fail very often. The upgrade script is hard to test, but the two failure modes are upgrading to a broken version, or not upgrading. The upgrade script runs checks and fails if anything doesn't work as expected, so it errs on the side of not upgrading. I use Uptime Robot to run searches and check the server is working, along with a canary page which raises an error if no upgrade happens for two days.

Clearly, the Hoogle 5 version update story is better. But why didn't I do it that way with Hoogle 4? The answer is that Hoogle 4 came out over six years ago, and a lot has changed since then:

  • Hoogle 4 is a CGI binary, served through Apache, while Hoogle 5 is a Haskell Warp server. By moving the logic into Haskell, it's far easier for me to configure and manage. Warp was only released on Hackage in 2011.
  • Hoogle 4 runs on the on the main haskell.org server, where my mistakes can easily take out the haskell.org home page (as a result, the haskell.org home page once said "moo" for 10 minutes). Hoogle 5 runs on a dedicated VM where I have root, and no one else runs anything, so I can experiment with settings about swap files, IP tables and cron jobs.
  • My job has provided a lot of practice doing drive-by sysadmining over the last 6 years. I've also had a lot of practice doing critical releases on a nightly basis. In comparison, Hoogle is pretty simple.
  • The revised/rethought approach to Hoogle databases is a lot faster and uses a lot less memory, so it takes under a minute to generate databases, instead of over an hour. That time difference makes it much easier to experiment with different approaches.

When will Hoogle 5 be ready? It doesn't yet do type search, there is no offline version and no API. There are probably lots of other little pieces missing. If you want, feel free to use it now at hoogle.haskell.org. You can still use Hoogle 4 at haskell.org/hoogle, or the more up-to-date FP complete hosted Hoogle 4.

Thursday, February 05, 2015

Refactoring with Equational Reasoning

Summary: Haskell is great for refactoring, thanks to being able to reason about and transform programs with confidence.

I think one of Haskell's strengths as a practical language is that it's easy to refactor, and more importantly, easy to refactor safety. Programs in the real world often accumulate technical debt - code that is shaped more by its history than its current purpose. Refactoring is one way to address that technical debt, making the code simpler, but not changing any meaningful behaviour.

When refactoring, you need to think of which alternative forms of code are equivalent but better. In C++ I've removed unused variables, only to find they were RAII variables, and their mere presence had a semantic effect. In R I've removed redundant if expressions, only to find the apparently pure condition had the effect of coercing a variable and changing its type. In Haskell, it's equally possible to make refactorings that at first glance appear safe but aren't - however, in practice, it happens a lot less. I think there are a few reasons for that:

  • Haskell is pure and evaluation strategy is largely unobservable - moving a statement "before" or "after" another lexically is usually safe.
  • Refactorings that do go wrong, for example variables that accidentally get moved out of scope or types which are no longer as constrained, usually result in compiler errors.
  • The Haskell community cares about semantics and laws. The Monad laws are satisfied by almost all monads, flagrantly breaking those laws is rare.
  • Functions like unsafePerformIO, which could harm refactoring, are almost always used behind a suitable pure abstraction.

Note that these reasons are due to both the language, and the conventions of the Haskell community. (Despite these factors, there are a few features that can trip up refactorings, e.g. exceptions, record wildcards, space-leaks.)

To take a very concrete example, today I was faced with the code:

f = fromMaybe (not b) . select
if f v == b then opt1 else opt2

At one point the function f was used lots, had a sensible name and nicely abstracted some properties. Now f is used once, the semantics are captured elsewhere, and the code is just unclear. We can refactor this statement, focusing on the condition:

f v == b
-- inline f
(fromMaybe (not b) . select) v == b
-- remove brackets and inline (.)
fromMaybe (not b) (select v) == b
-- expand to a case statement
(case select v of Nothing -> not b; Just x -> x) == b
-- push the == down
case select v of Nothing -> not b == b; Just x -> x == b
-- simplify not b == b
case select v of Nothing -> False; Just x -> x == b
-- collapse back up
select v == Just b

And now substitute back in:

if select v == Just b then opt1 else opt2

Our code is now much simpler and more direct. Thanks to the guarantees I expect of Haskell programs, I also have a high degree of confidence this code really is equivalent - even if it isn't obvious just looking at beginning and end.

Tuesday, January 27, 2015

Hoogle 5 is coming

Summary: I'm working on Hoogle 5. If you like unfinished software, you can try it.

For the last month I've been working on the next version of Hoogle, version 5. It isn't finished, and it isn't the "official" version of Hoogle yet, but it's online, you can try it out, and anyone who wants to hack on Hoogle (or things like hoogle-index) should probably take a look now.

How do I try it?

The purpose of this blog post isn't to solicit beta testers, it's not ready for that - I'm mostly reaching out to Hoogle developers. That said, I know some people will want to try it, so it's online at hoogle.haskell.org. Beware, it isn't finished, and haskell.org/hoogle remains the "official" version to use (but hey, use whichever you want, or both, or neither).

What's new for users of the website?

  • It isn't finished. In particular, type search hasn't been implemented. But there are also lots of other pieces in all the corners that don't work properly.
  • It searches all the packages on Stackage by default.
  • There is a drop-down to allow you to restrict focus to only a single package, or an author, or a tag.

What's new for everyone else?

  • There is no library, it's not on Hackage and the command line tool isn't designed for users. These things will be coming over time.
  • It's hosted on a haskell.org virtual machine, served through Warp rather than as a CGI program. Thanks to the Haskell Infrastructure team for all their help. As a result, I'm free to experiment with Hoogle without accidentally making the Haskell homepage say "moo".
  • Generating database for all of Stackage (minus download time) takes about 40s and uses < 1Gb of memory. The databases are built directly out of the .tar.gz files, without unpacking them.

What's next?

Update: the hogle code is now on the master branch of the hoogle repo, and the hoogle v4 code is on branch hoogle4.

Hoogle 5 is a complete rewrite of Hoogle, stealing bits as they were useful. At the moment Hoogle 4 is hosted at https://github.com/ndmitchell/hoogle while version 5 is at https://github.com/ndmitchell/hogle. The next step is to rename and move github repos so they are sensible once again. My best guess is that I should rename hoogle to hoogle4, then rename hogle to hoogle and move the issue tickets. I'm open to other suggestions.

Once that's resolved, I need to continue fixing and improving Hoogle so all the things that aren't finished become finished. If you are interested in helping, I recommend you create a github ticket describing what you want to do and we can take it from there.

Tuesday, January 13, 2015

User Interfaces for Users

Summary: When designing a user interface, think about the user, and what they want to know. Don't just present the information you know.

As part of my job I've ended up writing a fair amount of user interface code, and feedback from users has given me an appreciation of some common mistakes. Many user interfaces are designed to present information, and one of the key rules is to think about what the user wants to see, not just showing the information you have easily to hand. I was reminded of this rule when I was expecting a parcel. On the morning of Saturday 10/1/2015 I used a track my parcel interface to find:

The interface presents a lot of information, but most of it is interesting to the delivery company, not to me. I have basically one question: when will my parcel arrive. From the interface I can see:

  • The parcel is being shipped with "Express AM" delivery. No link to what that means. Searching leads me to a page saying that guarantees delivery before 1pm on a business day (some places said noon, some said 1pm). That is useful information if I want to enter into a contract with the delivery service, but not when I'm waiting for a parcel. What happens on Saturday? Do they still deliver, just not guarantee the time? Do they wait until the next business day? Do they do either, as they see fit?
  • My parcel has been loaded onto a vehicle. Has the vehicle has left the depot, or is that a separate step? How many further steps are there between loading and delivery? This question is easy to answer after the parcel has been delivered, since the additional steps show up in the list, but difficult to answer before.

On Saturday morning my best guess about when the parcel would arrive was between then and Monday 1pm. Having been through the complete process, I now know the best answer was between some still unknown time on Monday morning and Monday 1pm. With that information, I'd have taken my son to the park rather than keeping him cooped up indoors.

I suggest the company augment their user interface with the line "Your parcel will be delivered on Monday, between 9am and 1pm". It's information they could have computed easily, and answers my question.

The eagle-eyed may notice that there is a plus to expand to show more information. Alas, all that shows is:

I think they're trying to say my puny iPhone can't cope with the bold font that is essential to tell me the status and I should get an iPhone plus... Checking the desktop website also showed no further information.

Wednesday, December 10, 2014

Beta testing the Windows Minimal GHC Installer

Summary: There is now a minimal Windows GHC installer that can install the network library.

Michael Snoyman and I are seeking beta testers for our new Windows GHC installer. The installer is very new, and we're keen to get feedback.

Update: If you want to install to Program Files (the default installation path), download the file, right click and hit "Run as administrator". This will be automatic in the next version.

One of the standard problems when upgrading Haskell libraries on Windows is that sometimes you have to upgrade a package which requires a configure script - typically the network library. When that happens, you have to take special measures, and even then, sometimes the binary will fail at runtime (this happens on about half my machines).

Our installer solves that problem by bundling GHC, Cabal and MSYS in one installer. You can install the network library out of the box without any special configuration and it just works.

Our installer does not provide all the packages included with the Haskell Platform, but it does provide an environment where you can easily install those packages. This minimal installer might be more suitable for developers.

Why might I want this installer

The installer is much easier to install than the GHC distribution (has a real installer, sets up the %PATH%, includes Cabal and can build the network library).

The installer has less libraries than the Haskell Platform installer, but includes MSYS so you can build everything in the Platform. The lack of additional packages and close correspondence with GHC (there is one installer for each GHC version) might make it more suitable for library developers. Once GHC 7.10.1 is released, we hope to make a GHC Minimal Installer available within days.

The installer also has more precise control over what is added to the %PATH%. You can add all the binaries it provides, or just add a single binary minghc-7.8.3 which when run at the command prompt temporarily adds all the installed binaries to the %PATH%. Using the switcher, you can easily switch GHC versions.