Showing posts with label hackage. Show all posts
Showing posts with label hackage. Show all posts

Monday, July 27, 2020

Which packages does Hoogle search?

Summary: Hoogle searches packages on Stackage.

Haskell (as of 27 July 2020) has 14,490 packages in the Hackage package repository. Hoogle (the Haskell API search engine) searches 2,463 packages. This post explains which packages are searched, why some packages are excluded, and thus, how you can ensure your package is searched.

The first filter is that Hoogle only searches packages on Stackage. Hoogle indexes any package which is either in the latest Stackage nightly or Stackage LTS, but always indexes the latest version that is on Hackage. If you want a Hoogle search that perfectly matches a given Stackage release, I recommend using the Stackage Hoogle search available from any snapshot page. There are two reasons for restricting to only packages on Stackage:

  • I want Hoogle results to be useful. The fact that the package currently builds with a recent GHC used by Stackage is a positive sign that the package is maintained and might actually work for a user who wants to use it. Most of the packages on Hackage probably don't build with the latest GHC, and thus aren't useful search results.
  • Indexing time and memory usage is proportional to the number of packages, and somewhat the size of those packages. By dropping over 10 thousand packages we can index more frequently and on more constrained virtual machines. With the recent release of Hoogle 5.0.18 the technical limitations on size were removed to enable indexing all of Hackage - but there is still no intention to do so.

There are 2,426 packages in Stackage Nightly, and 2,508 in Stackage LTS, with most overlapping. There are 2,580 distinct packages between these two sources, the Haskell Platform and a few custom packages Hoogle knows about (e.g. GHC).

Of the 2,580 eligible packages, 77 are executables only, so don't have any libraries to search, leaving 2,503 packages.

Of the remaining packages 2,503, 40 are missing documentation on Hackage, taking us down to 2,463. As for why a package might not have documentation:

  • Some are missing documentation because they are very GHC internal and are mentioned but not on Hackage, e.g. ghc-heap.
  • Some are Windows only and won't generate documentation on the Linux Hackage servers, e.g. Win32-notify.
  • Some have dependencies not installed on the Hackage servers, e.g. rocksdb-query.
  • Some have documentation that appears to have been generated without generating a corresponding Hoogle data file, e.g. array-memoize.
  • Some are just missing docs entirely on Hackage for no good reason I can see, e.g. bytestring-builder.

The Hoogle database is generated and deployed once per day, automatically. Occasionally a test failure or dependency outage will cause generation to fail, but I get alerted, and usually it doesn't get stale by more than a few days. If you add your package to Stackage and it doesn't show up on Hoogle within a few days, raise an issue.

Friday, November 23, 2018

Counting the Cost of Colons in Haskell

Summary: Haskell uses :: as the type operator. That was a mistake that costs us over 1 million characters of source code.

Haskell uses :: for type annotations, e.g. (1 :: Int). Most other FP languages with types use :, including Scala, OCaml, Agda, Idris and Elm. Haskell uses : for list cons, so you can write:

myList = 1:2:[] :: [Int]

Whereas in other languages you write:

myList = 1::2::[] : [Int]

Moreover, the reason Haskell preferred :: was the belief that if the cons operator was :: then people would quite naturally insert spaces around it, giving:

myList = 1 :: 2 :: [] : [Int]

The final program is now noticeably longer than the original. Back when people first invented Haskell, I imagine they were mainly list manipulation operations, whereas now plenty of libraries work mainly at the type level. That raises the question - would Hackage be shorter or longer if we used : for types?

Method

I downloaded the latest version of every Hackage package. For each .hs file, I excluded comments and strings, then counted the number of instances of : and ::, also noting whether there were spaces around the :. The code I used can be found here.

Results

  • Instances of :: = 1,991,631
  • Instances of : = 265,880 (of which 79,109 were surrounded by two spaces, and 26,931 had one space)

Discussion

Assuming we didn't add/remove any spaces, switching the symbols would have saved 1,725,751 characters. If we assume that everyone would have written :: with spaces, that saving drops to 137,9140 characters. These numbers are fairly small, representing about 0.17% of the total 993,793,866 characters on Hackage.

Conclusion

The Haskell creators were wrong in their assumptions - type should have been :.

Update: the first version of this post had numbers that were too low due to a few bugs, now fixed.

Thursday, November 22, 2018

Downloading all of Hackage

Summary: I wanted to download the latest version of every package in Hackage. Here's a script and explanation of how to do it.

Imagine you want the latest version of every package on Hackage. I found two tools that mirror every version of every package:

  • Using hackage-mirror you can do hackage-mirror --from="http://hackage.haskell.org" --to="C:/hackage-mirror". But this project is long deprecated and doesn't actually work anymore.
  • Using hackage-mirror-tool you might be able to do it, but it requires a new Cabal, isn't on Hackage, doesn't seem to work on Windows and doesn't say whether it downloads to disk or not.

Given it's a fairly simple problem, after investigating these options for an hour, I decided to cut my losses and write a script myself. Writing the script took a lot less than an hour, and I even wrote this blog post while the download was running. The complete script is at the bottom of this post, but I thought it might be instructive to explain how I went about developing it.

Step 0: Set up my working environment

I created a file named Download.hs where I was writing the source code, used ghcid Download.hs in a VS Code terminal to get fast error feedback using Ghcid, and opened another terminal to execute runhaskell Download.hs for testing.

Step 1: Find where a download link is

You can download a package from Hackage at http://hackage.haskell.org/package/shake/shake-0.17.tar.gz. You can also use https, but for my purposes and bulk downloading I figured http was fine. I hunted around to find a link which didn't contain the version number (as then I wouldn't have to compute the version number), but failed.

Step 2: Find a list of package versions

Looking at the cabal tool I found the cabal list --simple command, which prints a big list of packages in the form:

foo 1.0
foo 2.1
bar 1.0

For each package on Hackage I get all versions sequentially, with the highest version number last. I can execute this command using systemOutput_ "cabal list --simple" (where systemOutput_ comes from the extra library).

Step 3: Generate the list of URLs

Now I have the data as a big string I want to convert it into a list of URL's. The full pipeline is:

map (toUrl . last) . groupOn fst .  map word1 . lines

Reading from right to left, I split the output into a list of lines with lines, then split each line on its first space (using word1 from the extra library). I then use groupOn fst so that I get consecutive runs of each package (no points for guessing where groupOn comes from). For each list of versions for a package I take the last (since I know that's the highest one) and transform it into the URL using:

let toUrl (name, ver) = "http://hackage.haskell.org/package/" ++ name ++ "/" ++ name ++ "-" ++ ver ++ ".tar.gz"

Step 4: Download the URLs

I could make multiple calls to wget, but that's very slow, so instead I write them to a file and make a single call:

writeFile "_urls.txt" $ unlines urls
system_ "wget --input-file=_urls.txt"

I use the name _urls.txt so I can spot that special file in amongst all the .tar.gz files this command produces.

Step 5: Putting it all together

The complete script is:

import Data.List.Extra
import System.Process.Extra

main :: IO ()
main = do
    let toUrl (name, ver) = "http://hackage.haskell.org/package/" ++ name ++ "/" ++ name ++ "-" ++ ver ++ ".tar.gz"
    urls <- map (toUrl . last) . groupOn fst .  map word1 . lines <$> systemOutput_ "cabal list --simple"
    writeFile "_urls.txt" $ unlines urls
    system_ "wget --input-file=_urls.txt"

After waiting 46 minutes I had 13,258 packages weighing in at 861Mb.

Update: In the comments Janek Stolarek suggested the simpler alternative of cabal list --simple | cut -d' ' -f1 | sort | uniq | xargs cabal get (I had missed the existence of cabal get). Niklas Hambüchen also shares a script https://github.com/nh2/hackage-download which can download even faster.



Saturday, March 24, 2018

Adding Package Lower-bounds

Summary: I hacked Cabal so I could spot where I was missing package lower bounds. The approach has lots of limitations, but I did find one missing lower bound in HLint.

Cabal lets you constrain your dependencies with both upper bounds and lower bounds (for when you are using a feature not available in older versions). While there has been plenty of debate and focus on upper bounds, it feels like lower bounds have been somewhat neglected. As an experiment I decided to modify cabal to prefer older versions of packages, then tried to compile a few of my packages. The approach seems sound, but would require a fair bit of work to be generally usable.

Hacking Cabal

By default Cabal prefers to choose packages that are already installed and have the highest bound possible. The code to control that is in cabal-install/Distribution/Solver/Modular/Preference.hs and reads:

-- Prefer packages with higher version numbers over packages with
-- lower version numbers.
latest :: [Ver] -> POption -> Weight
latest sortedVersions opt =
    let l = length sortedVersions
        index = fromMaybe l $ L.findIndex (<= version opt) sortedVersions
    in  fromIntegral index / fromIntegral l

To change that to prefer lower versions I simply replaced the final expression with fromIntegral (l - index) / fromIntegral l. I also removed the section about giving preferences to currently installed versions, since I wanted the lowest bound to be chosen regardless.

So I didn't mess up my standard copy of Cabal I changed the .cabal file to call the executable kabal.

Testing Kabal on Extra

To test the approach, I used my extra library, and ran kabal new-build all. I used new-build to avoid poluting my global package database with these franken-packages, and all to build all targets. That failed with:

Failed to build Win32-2.2.2.0.
In file included from dist\build\Graphics\Win32\Window_hsc_make.c:1:0:
Window.hsc: In function 'main':
Window.hsc:189:16: error: 'GWL_USERDATA' undeclared (first use in this function)
C:\ghc\ghc-8.2.2/lib/template-hsc.h:38:10: note: in definition of macro 'hsc_const'
     if ((x) < 0)                                      \
          ^

So it seems that Win32-2.2.2.0 claims to work with GHC 8.2, but probably doesn't (unfortunately it's not on the Linux-based Hackage Matrix). We can work around that problem by constraining Win32 to the version that is already installed with --constraint=Win32==2.5.4.1. With that, we can successfully build extra. For bonus points we can also use --enable-test, checking the test suite has correct lower bounds, which also works.

Testing Kabal on Shake

For Shake we start with:

kabal new-build all --constraint=Win32==2.5.4.1 --enable-test

That worked perfectly - either I had sufficient lower bounds, or this approach doesn't do what I hoped...

Testing Kabal on HLint

Trying HLint with our standard recipe we get:

Failed to build ansi-terminal-0.6.2.
[3 of 6] Compiling System.Console.ANSI.Windows.Foreign ( System\Console\ANSI\Windows\Foreign.hs, dist\build\System\Console\ANSI\Windows\Foreign.o )
System\Console\ANSI\Windows\Foreign.hs:90:20: error:
    Ambiguous occurrence `SHORT'
    It could refer to either `System.Win32.Types.SHORT',
                             imported from `System.Win32.Types' at System\Console\ANSI\Windows\Foreign.hs:41:1-25
                          or `System.Console.ANSI.Windows.Foreign.SHORT',
                             defined at System\Console\ANSI\Windows\Foreign.hs:59:1

So it seems ansi-terminal-0.6.2 and Win32-2.5.4.1 don't cooperate. Let's fix that by restricting ansi-terminal==0.7 with another constraint. Now we get:

Preprocessing library for cmdargs-0.10.2..
Building library for cmdargs-0.10.2..
[ 1 of 25] Compiling Data.Generics.Any ( Data\Generics\Any.hs, dist\build\Data\Generics\Any.o )

Data\Generics\Any.hs:65:17: error:
    Variable not in scope: tyConString :: TyCon -> String
   |
65 | typeShellFull = tyConString . typeRepTyCon . typeOf
   |                 ^^^^^^^^^^^

Oh dear, now it's the fault of cmdargs, which is one of my packages! Checking the Hackage Matrix for cmdargs we see:

Namely that 0.10.2 to 0.10.9 don't compile with GHC 8.2. We solve that by going to the maintainers corner and editing the .cabal file of released versions to produce a revision with better bounds - replacing base == 4.* with base >= 4 && < 4.10. Finding the translation from GHC version 8.2 to base version 4.10 involved consulting the magic page of mappings.

After waiting 15 minutes for the package tarballs to update, then doing cabal update, I got to a real error in HLint:

src\Hint\Duplicate.hs:44:37: error:
    * Could not deduce (Default (String, String, SrcSpan))
        arising from a use of `duplicateOrdered'

Looking at the data-default library I see that the Default instance for triples was only introduced in version 0.3. Adding the bounds data-default >= 0.3 to the hlint.cabal dependencies fixes the issue, allowing HLint to compile cleanly.

Next, looking at the commit log, I noticed that I'd recently added a lower bound on the yaml package. I wondered if I removed that bound then it could be detected?

Resolving dependencies...
Error:
    Dependency on unbuildable library from yaml
    In the stanza 'library'
    In the inplace package 'hlint-2.1'

Alas not - Cabal says the library is unbuildable - I don't really know what that means.

Testing Kabal on Ghcid

Trying Ghcid with our standard recipe and accumulated constraints we get:

Preprocessing library for Win32-notify-0.2..
Building library for Win32-notify-0.2..
[1 of 2] Compiling System.Win32.FileNotify ( dist\build\System\Win32\FileNotify.hs, dist\build\System\Win32\FileNotify.o )

src\System\Win32\FileNotify.hsc:29:9: error:
    Ambiguous occurrence `fILE_LIST_DIRECTORY'

So it seems Win32-notify-0.2 and Win32-2.5.4.1 don't cooperate. With that discovery I had used up all the time I was willing to spend and stopped the experiment.

Conclusions

By modifying Cabal to select for older packages I was able to find and fix a lower bound. However, because all my dependencies aren't lower-bound safe, it became a somewhat manual process. To be practically useful the prinple of correct lower-bounds needs adopting widely. Some notes:

  • The Hackage Matrix provides a large amount of actionable intelligence - a great experience. However, fixing the issues it discovers (actually adding the bounds) is frustratingly manual, requiring lots of clicks and edits in a textbox.
  • Using cabal new-build caused each directory to gain a .ghc.environment.x86_64-mingw32-8.2.2 file, which silently recongfigured ghc and ghci in those directories so they stopped working as I expected. Not a pleasant experience!
  • I ran my tests on Windows, and most of the dependencies with incorrect bounds were Windows-specific issues. Maybe Linux would have had less lower-bound issues?
  • I used a pretty recent GHC, which excludes a lot of older versions of packages because they don't work on newer GHC versions - picking the oldest-supported GHC would probably have found more bounds.
  • Are lower bounds actually useful? If you ignore which packages are globally installed (which both stack and cabal new-build effectively do) then the only reason to be constrained to an older version is by upper bounds - in which case solving excessive upper-bounds is likely to give more actual benefit.
  • I'm not the first person to think of constraining cabal to use older versions - e.g. Cabal bug 2876 from 2015.
  • The Trustee tool can infer minimum bounds, but it's Linux only so doesn't work for me. It is probably better for people who want to do their own bound checking.
  • Compiling kabal required a bit of trial and error, I eventually settled on compiling each dependent Cabal package in turn into the global package database, which wasn't ideal, but did work.