Sunday, October 30, 2011

Calling Haskell from R

Summary: This post describes how to write a function in Haskell, and then call it from R.

R is a very popular language for statistics, particular with biologists (and computational paleobiologists). For writing high performance code, the R developers recommend the use of C or Fortran - not languages that are particularly easy for beginners. However, you can instead write a Haskell function that can be called directly from R. The basic idea is to create a C compatible library using Haskell (as described in the GHC users manual) and then call that library from R (as described in this document). As a simple example, let's write a function that adds up the square roots of a list of numbers.

Create an R-compatible Haskell library

In normal Haskell, we would define the function to add up the square roots of a list as:

sumRoots :: [Double] -> Double
sumRoots xs = sum (map sqrt xs)


However, to make a function that is compatible with R, we have to follow two rules:

  • Every argument must be a Ptr to a C compatible type, typically Int, Double or CString. (To be pedantic, we should probably use CInt or CDouble, but using GHC on Windows these types are equivalent - keeping things simpler.)
  • The result must be IO ()


Obeying these restrictions, we need to use the type:

sumRootsR :: Ptr Int -> Ptr Double -> Ptr Double -> IO ()
sumRootsR n xs result = ...


Instead of passing in the list xs, we now pass in:

  • n, the length of the list xs
  • xs, the elements of the list
  • result, a space to put the result


We can implement sumRootsR by using the functions available in the Foreign module:

sumRootsR :: Ptr Int -> Ptr Double -> Ptr Double -> IO ()
sumRootsR n xs result = do
n <- peek n
xs <- peekArray n xs
poke result $ sumRoots xs


This function first gets the value for n, then for each element in 0..n-1 gets the element out of the pointer array xs and puts it in a nice list. We then call the original sumRoots, and store the value in the space provided by result. As a general rule, you should put all the logic in one function (sumRoots), and the wrapping in another (sumRootsR). We can then export this function with the definition:

foreign export ccall sumRootsR :: Ptr Int -> Ptr Double -> Ptr Double -> IO ()


Putting everything together, we end up with the Haskell file:

-- SumRoots.hs
{-# LANGUAGE ForeignFunctionInterface #-}
module SumRoots where

import Foreign

foreign export ccall sumRootsR :: Ptr Int -> Ptr Double -> Ptr Double -> IO ()

sumRootsR :: Ptr Int -> Ptr Double -> Ptr Double -> IO ()
sumRootsR n xs result = do
n <- peek n
xs <- peekArray n xs
poke result $ sumRoots xs

sumRoots :: [Double] -> Double
sumRoots xs = sum (map sqrt xs)


We also need a C stub file. The one described in the GHC users guide works well:

// StartEnd.c
#include <Rts.h>

void HsStart()
{
int argc = 1;
char* argv[] = {"ghcDll", NULL}; // argv must end with NULL

// Initialize Haskell runtime
char** args = argv;
hs_init(&argc, &args);
}

void HsEnd()
{
hs_exit();
}


We can now compile our library with the commands:

ghc -c SumRoots.hs
ghc -c StartEnd.c
ghc -shared -o SumRoots.dll SumRoots.o StartEnd.o


This creates the library SumRoots.dll.

Calling Haskell from R

At the R command prompt, we can load the library with:

dyn.load("C:/SumRoots.dll") # use the full path to the SumRoots library
.C("HsStart")


We can now invoke our function:

input = c(9,3.5,5.58,64.1,12.54)
.C("sumRootsR", n=as.integer(length(input)), xs=as.double(input), result=as.double(0))$result


This prints out the answer 18.78046.

We can make this function easier to use on the R side by writing a wrapper, for example:

sumRoots <- function(input)
{
return(.C("sumRootsR", n=as.integer(length(input)), xs=as.double(input), result=as.double(0))$result)
}
Now we can write:
sumRoots(c(12,444.34))
And get back the answer 24.54348. With a small amount of glue code, it's easy to call Haskell libraries from R programs.

Update: See the comments below from Alex Davis for how to do such things on newer versions of Mac OS.

13 comments:

winterkoninkje said...

Sweet. Now we just need to write a nice tool for autogenerating all the glue, since FFI glue is usually the worst part for beginners. So what sorts of datatypes does R export other than CInt, CDouble, CString, and arrays a la (CInt,Ptr a) ? How does R's GC work in case we wanted to allocate arrays on the Haskell side?

Dominik said...

Hi,
thx for the post, the last time I tried that I didn't get it to work but I'll try again with your explanation.

R also calls an initialization routine after dyn.load() which you can have call HsStart
(see:
http://cran.r-project.org/doc/manuals/R-exts.html#dyn_002eload-and-dyn_002eunload). Then you could avoid explicitly calling HsStart from R. Doesnt make much difference though,



@winterkoninkje: .C() only exposes basic C datatypes, like double*, int*, char* and so on. To my knowledge you cant allocate objects on the C- or Haskell-side that way, you usually allocate all output objects in R code and pass them along to .C(). There's two other functions, .Call() and .External() that pass you real R objects and where you can also allocate R objects in C/Haskell code, but I think you have to use R API functions to create them and have R's gc handle these objects. I found that usually .C() is good enough and much easier to use.

Jesse said...

"...not languages that are particularly easy for beginners"

Really, Haskell falls into that category as well. The suggestion (for C/Fortran) probably comes from the fact that there is no wrapper code indirection necessary in that case.

The ability to use Haskell instead is still a pretty nice point, however.

scott said...

This seems interesting, but the Haskell you have to write to be compatible with R doesn't look much like idiomatic Haskell.

What's the point of using Haskell when you have to use IO to do something that shouldn't require it?

Audun said...

Instead of the manual loop over the array:
xs <- forM [0..n-1] $ \i ->
peekElemOff xs i

You could do the simpler:
xs <- peekArray n xs

Where peekArray is from Foreign.Marshal.Array.

Neil Mitchell said...

Dominik: Thanks for the tip on HsStart - good idea.

Audun: Thanks for the tip, I've updated the post - that's a much simpler solution.

scott/Jesse: The wrapping code is not idiomatic Haskell, but that's a few lines. The idea is you write a small wrapper Haskell/R wrapper (sumRootsR), and then can write lots of idiomatic Haskell that it calls.

winterkoninkje: I don't intend to do anything else on this work, but if you do anything my wife may be interested!

Dominik said...

BTW to compile on FreeBSD I had to use (and probably something similar on Linux):

ghc -c -fPIC StartEnd.c
ghc -c -fPIC SumRoots.hs
ghc -shared -dynamic -fPIC -o SumRoots.so SumRoots.o SumRoots_stub.o StartEnd.o -lHSrts-ghc7.0.3 -optl-Wl,-rpath,/usr/local/lib/ghc-7.0.3/

I found that on http://www.well-
typed.com/blog/30 . I wonder though if you still have to explicitly link against the right RTS.
For starters making a wrapper for .C would be cool; maybe one can steal some ideas from the inline and Rcpp R packages. Think I'll try that some time.

Neil Mitchell said...

Dominik: Useful information about the FreeBSD command line. I think if you were on GHC 7.2 you wouldn't need to include "SumRoots_stub.o"

James said...

Do you what would be involved in going the other way around (ie calling R code from Haskell code)?

Neil Mitchell said...

No idea I'm afraid, but the GHC manual does say how to call Haskell from Excel, and the same pattern might work, if R can bind to C dlls.

Alex Davis said...

For anybody trying this on Mac OS X Mavericks, here's what I had to do to get it working.

1. More flags to the compiler, just like Dominik found. I wish I had noticed his comment before figuring this out on my own. But, you need these flags:
-fPIC: generate "position independent code". Whatever that is, it's necessary for shared libraries outside of Windows, according to this page of the GHC manual.
-dynamic: Access Haskell libraries dynamically at runtime, instead of compiling them into the final result. This is because the libraries were not compiled with the -fPIC flag, according to this page of the GHC manual.
-lHSrts-ghc7.8.3: I imagine you should substitute in your own GHC version here. This links your final product to the Haskell run-time system. I find no mention in the user manual that this is necessary, but on my computer it's necessary, or you get "Symbol not found" errors.

That was enough for my friend Shea, but even after successfully compiling I was still getting errors. On my computer I get segmentation faults from SumRoots.hs as written. The problem, it turns out, is the use of Int instead of CInt: "n" is read in wrong, and the computer looks for too long of "xs" and gets a segfault. On my computer I need to use CInt. Here's the complete code, using CInt as well as CDouble:

-- SumRoots.hs
{-# LANGUAGE ForeignFunctionInterface #-}
module SumRoots where

import Foreign
import Foreign.C.Types

foreign export ccall sumRootsR :: Ptr CInt -> Ptr CDouble -> Ptr CDouble -> IO ()

sumRootsR :: Ptr CInt -> Ptr CDouble -> Ptr CDouble -> IO ()
sumRootsR n xs result = do
cN <- peek n
let haskellN = fromIntegral cN :: Int
cXs <- peekArray haskellN xs
let haskellXs = map realToFrac cXs :: [Double]
haskellResult = sumRoots haskellXs
cResult = realToFrac haskellResult :: CDouble
poke result cResult

sumRoots :: [Double] -> Double
sumRoots xs = sum (map sqrt xs)

This adds a lot of complexity - it was lucky for this exposition that CInt and Int are interchangable on Windows. But on my computer this is what it took.

Thanks to Shea Levy for walking me through this!

Alex Davis said...

PS Here's the actual compiler command I ended up using:

ghc -shared -fPIC -dynamic -lHSrts-ghc7.8.3 SumRoots.hs StartEnd.c -o SumRoots.dylib

Note "dylib" instead of "dll" - it's just the Mac equivalent of "dylib", according to this wiki page.

Neil Mitchell said...

Alex: Thanks for sharing, I've updated the post to point people at your comments.