GHC Performance / Replacement for R?

Wed Aug 31 13:31:04 UTC 2016

Hi Iavor,

Thank you very much for this. It's nice to know that we have the ability 
in Haskell to be as frugal (or profligate) with memory as R when working 
with data frames. I should say this number of fields is quite low in the 
data science world. Data sets with 500 columns are not uncommon and I 
did have one with 10,000 columns!

I know other folks have worked on producing something like data frames 
e.g. https://github.com/acowley/Frames and 
http://stla.github.io/stlapblog/posts/HaskellFrames.html for example but 
I wanted to remain in the world of relatively simple types and I haven't 
looked at its performance in terms of memory. On the plus side it did 
manage to read in the 10,000 column data set although ghc took about 5 
minutes to do the typechecking (I should say within ghci).

Just to mention that R is not the only language that has nice facilities 
for data exploration; python has a package called pandas: 
http://pandas.pydata.org.

I feel we still have a way to go to make Haskell provide as easy an 
environment for data exploration as R or Python but I shall continue on 
my crusade.

Many thanks once again, Dominic.

On 30/08/2016 22:05, Iavor Diatchki wrote:
> Hello,
>
> when you parse the CSV fully, you end up creating a lot of small 
> bytestring objects, and each of these adds some overhead.   The 
> vectors themselves add up some additional overhead.  All of this adds 
> up when you have as many fields as you do.   An alternative would be 
> to use a different representation for the data, which recomputes 
> things when needed.   While this might be a bit slower in some cases, 
> it could have significant saving in terms of memory use.   I wrote up 
> a small example to illustrate what I have in mind, which should be 
> attached to this e-mail.
>
> Basically, instead of parsing the CSV file fully, I just indexed where 
> the lines are (ref. the "rows" field of "CSV"). This allows me to 
> access each row quickly, and the when I need to get a specific field, 
> I simply parse the bytes of the row.
> One could play all kinds of games like that, and I imagine R does 
> something similar, although I have never looked at how it works.   To 
> test the approach I generated ~200Mb of sample data (generator is also 
> in the attached file), and I was able to filter it using ~240Mb, which 
> is comparable to what you reported about R.  One could probably 
> package all this up in library that supports "R like" operations.
>
> These are the stats I get from -s:
>
>    4,137,632,432 bytes allocated in the heap
>          925,200 bytes copied during GC
>      200,104,224 bytes maximum residency (2 sample(s))
>        6,217,864 bytes maximum slop
>              246 MB total memory in use (1 MB lost due to fragmentation)
>
>                                      Tot time (elapsed)  Avg pause 
>  Max pause
>   Gen  0      7564 colls,     0 par    0.024s   0.011s   0.0000s   
>  0.0001s
>   Gen  1         2 colls,     0 par    0.000s   0.001s   0.0003s   
>  0.0006s
>
>   INIT    time    0.000s  (  0.000s elapsed)
>   MUT     time    0.364s  (  0.451s elapsed)
>   GC      time    0.024s  (  0.011s elapsed)
>   EXIT    time    0.000s  (  0.001s elapsed)
>   Total   time    0.388s  (  0.463s elapsed)
>
>   %GC     time       6.2%  (2.5% elapsed)
>
>   Alloc rate    11,367,122,065 bytes per MUT second
>
>   Productivity  93.8% of total user, 78.6% of total elapsed
>
> -Iavor
>
>
>
>
> On Thu, Aug 25, 2016 at 3:31 AM, Simon Peyton Jones via 
> Glasgow-haskell-users <glasgow-haskell-users at haskell.org 
> <mailto:glasgow-haskell-users at haskell.org>> wrote:
>
>     Sounds bad.  But it'll need someone with bytestring expertise to
>     debug.  Maybe there's a GHC problem underlying; or maybe it's
>     shortcoming of bytestring.
>
>     Simon
>
>     |  -----Original Message-----
>     |  From: Glasgow-haskell-users [mailto:glasgow-haskell-users-
>     <mailto:glasgow-haskell-users->
>     | bounces at haskell.org <mailto:bounces at haskell.org>] On Behalf Of
>     Dominic Steinitz
>     |  Sent: 25 August 2016 10:11
>     |  To: GHC users <glasgow-haskell-users at haskell.org
>     <mailto:glasgow-haskell-users at haskell.org>>
>     |  Subject: GHC Performance / Replacement for R?
>     |
>     |  I am trying to use Haskell as a replacement for R but running
>     into two
>     |  problems which I describe below. Are there any plans to address the
>     |  performance issues I have encountered?
>     |
>     |   1. I seem to have to jump through a lot of hoops just to be
>     able to
>     |      select the data I am interested in.
>     |
>     |  {-# LANGUAGE ScopedTypeVariables #-}
>     |
>     |  {-# OPTIONS_GHC -Wall #-}
>     |
>     |  import Data.Csv hiding ( decodeByName )
>     |  import qualified Data.Vector as V
>     |
>     |  import Data.ByteString ( ByteString )
>     |  import qualified Data.ByteString.Char8 as B
>     |
>     |  import qualified Pipes.Prelude as P
>     |  import qualified Pipes.ByteString as Bytes import Pipes import
>     |  qualified Pipes.Csv as Csv import System.IO
>     |
>     |  import qualified Control.Foldl as L
>     |
>     |  main :: IO ()
>     |  main = withFile "examples/787338586_T_ONTIME.csv" ReadMode $ \h
>     -> do
>     |    let csvs :: Producer (V.Vector ByteString) IO ()
>     |        csvs = Csv.decode HasHeader (Bytes.fromHandle h) >-> P.concat
>     |        uvectors :: Producer (V.Vector ByteString) IO ()
>     |        uvectors = csvs  >-> P.map (V.foldr V.cons V.empty)
>     |    vec_vec <- L.impurely P.foldM  L.vector uvectors
>     |    print $ (vec_vec :: V.Vector (V.Vector ByteString)) V.! 17
>     |    print $ V.length vec_vec
>     |    let rockspring = V.filter (\x -> x V.! 8 == B.pack "RKS") vec_vec
>     |    print $ V.length rockspring
>     |
>     |  Here's the equivalent R:
>     |
>     |  df <- read.csv("787338586_T_ONTIME.csv")
>     |  rockspring <- df[df$ORIGIN == "RKS",]
>     |
>     |   2. Now I think I could improve the above to make an
>     environment that
>     |      is more similar to the one my colleagues are used to in R
>     but more
>     |      problematical is the memory usage.
>     |
>     |   * 112.5M file
>     |   * Just loading the source into ghci takes 142.7M
>     |   * > foo <- readFile "examples/787338586_T_ONTIME.csv" > length foo
>     |     takes me up to 4.75G. But we probably don't want to do this!
>     |   * Let's try again.
>     |   * > :set -XScopedTypeVariables
>     |   * > h <- openFile "examples/787338586_T_ONTIME.csv" ReadMode
>     |   * > let csvs :: Producer (V.Vector ByteString) IO () = Csv.decode
>     |  HasHeader (Bytes.fromHandle h) >-> P.concat
>     |   * > let uvectors :: Producer (V.Vector ByteString) IO () =
>     csvs  >->
>     |  P.map (V.map id) >-> P.map (V.foldr V.cons V.empty)
>     |   * > vec_vec :: V.Vector (V.Vector ByteString) <- L.impurely
>     P.foldM
>     |  L.vector uvectors
>     |   * Now I am up at 3.17G. In R I am under 221.3M.
>     |   * > V.length rockspring takes a long time to return 155 and
>     now I am
>     |     at 3.5G!!! In R > rockspring <- df[df$ORIGIN == "RKS",] seems
>     |     instantaneous and now uses only 379.5M.
>     |   * > length(rockspring) 37 > length(df$ORIGIN) 471949 i.e.
>     there are
>     |     37 columns and 471,949 rows.
>     |
>     |  Running this as an executable gives
>     |
>     |  ~/Dropbox/Private/labels $ ./examples/BugReport +RTS -s ["2014-01-
>     |  01","EV","20366","N904EV","2512","10747","1074702","30747",
>     |   "BRO","Brownsville, TX","Texas","11298","1129803","30194",
>     |    "DFW","Dallas/Fort Worth, TX","Texas","0720","0718",
>     |   
>     "-2.00","8.00","0726","0837","7.00","0855","0844","-11.00","0.00",
>     |    "","0.00","482.00","","","","","",""]
>     |  471949
>     |  155
>     |    14,179,764,240 bytes allocated in the heap
>     |     3,378,342,072 bytes copied during GC
>     |       786,333,512 bytes maximum residency (13 sample(s))
>     |        36,933,976 bytes maximum slop
>     |              1434 MB total memory in use (0 MB lost due to
>     |  fragmentation)
>     |
>     |                                       Tot time (elapsed)  Avg pause
>     |  Max pause
>     |    Gen  0     26989 colls,     0 par    1.423s  1.483s     0.0001s
>     |  0.0039s
>     |    Gen  1        13 colls,     0 par    1.005s  1.499s     0.1153s
>     |  0.6730s
>     |
>     |    INIT    time    0.000s  (  0.003s elapsed)
>     |    MUT     time    3.195s  (  3.193s elapsed)
>     |    GC      time    2.428s  (  2.982s elapsed)
>     |    EXIT    time    0.016s  (  0.138s elapsed)
>     |    Total   time    5.642s  (  6.315s elapsed)
>     |
>     |    %GC     time      43.0%  (47.2% elapsed)
>     |
>     |    Alloc rate    4,437,740,019 bytes per MUT second
>     |
>     |    Productivity  57.0% of total user, 50.9% of total elapsed
>     |
>     |  _______________________________________________
>     |  Glasgow-haskell-users mailing list
>     | Glasgow-haskell-users at haskell.org
>     <mailto:Glasgow-haskell-users at haskell.org>
>     |
>     https://na01.safelinks.protection.outlook.com/?url=http%3a%2f%2fmail.h
>     <https://na01.safelinks.protection.outlook.com/?url=http%3a%2f%2fmail.h>
>     | askell.org
>     <http://askell.org>%2fcgi-bin%2fmailman%2flistinfo%2fglasgow-haskell-
>     |  users&data=01%7c01%7csimonpj%40microsoft.com
>     <http://40microsoft.com>%7c5017a5fe26cb4df9c41d08d
>     | 
>     3ccc7b5bd%7c72f988bf86f141af91ab2d7cd011db47%7c1&sdata=2Ku1Fr5QttHRoj5
>     |  NSOJREZrt2Fsqhi63iJOpxmku68E%3d
>     _______________________________________________
>     Glasgow-haskell-users mailing list
>     Glasgow-haskell-users at haskell.org
>     <mailto:Glasgow-haskell-users at haskell.org>
>     http://mail.haskell.org/cgi-bin/mailman/listinfo/glasgow-haskell-users
>     <http://mail.haskell.org/cgi-bin/mailman/listinfo/glasgow-haskell-users>
>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/glasgow-haskell-users/attachments/20160831/88bbe628/attachment.html>