GHC Performance / Replacement for R?
Dominic Steinitz
dominic at steinitz.org
Wed Aug 31 13:31:04 UTC 2016
Hi Iavor,
Thank you very much for this. It's nice to know that we have the ability
in Haskell to be as frugal (or profligate) with memory as R when working
with data frames. I should say this number of fields is quite low in the
data science world. Data sets with 500 columns are not uncommon and I
did have one with 10,000 columns!
I know other folks have worked on producing something like data frames
e.g. https://github.com/acowley/Frames and
http://stla.github.io/stlapblog/posts/HaskellFrames.html for example but
I wanted to remain in the world of relatively simple types and I haven't
looked at its performance in terms of memory. On the plus side it did
manage to read in the 10,000 column data set although ghc took about 5
minutes to do the typechecking (I should say within ghci).
Just to mention that R is not the only language that has nice facilities
for data exploration; python has a package called pandas:
http://pandas.pydata.org.
I feel we still have a way to go to make Haskell provide as easy an
environment for data exploration as R or Python but I shall continue on
my crusade.
Many thanks once again, Dominic.
On 30/08/2016 22:05, Iavor Diatchki wrote:
> Hello,
>
> when you parse the CSV fully, you end up creating a lot of small
> bytestring objects, and each of these adds some overhead. The
> vectors themselves add up some additional overhead. All of this adds
> up when you have as many fields as you do. An alternative would be
> to use a different representation for the data, which recomputes
> things when needed. While this might be a bit slower in some cases,
> it could have significant saving in terms of memory use. I wrote up
> a small example to illustrate what I have in mind, which should be
> attached to this e-mail.
>
> Basically, instead of parsing the CSV file fully, I just indexed where
> the lines are (ref. the "rows" field of "CSV"). This allows me to
> access each row quickly, and the when I need to get a specific field,
> I simply parse the bytes of the row.
> One could play all kinds of games like that, and I imagine R does
> something similar, although I have never looked at how it works. To
> test the approach I generated ~200Mb of sample data (generator is also
> in the attached file), and I was able to filter it using ~240Mb, which
> is comparable to what you reported about R. One could probably
> package all this up in library that supports "R like" operations.
>
> These are the stats I get from -s:
>
> 4,137,632,432 bytes allocated in the heap
> 925,200 bytes copied during GC
> 200,104,224 bytes maximum residency (2 sample(s))
> 6,217,864 bytes maximum slop
> 246 MB total memory in use (1 MB lost due to fragmentation)
>
> Tot time (elapsed) Avg pause
> Max pause
> Gen 0 7564 colls, 0 par 0.024s 0.011s 0.0000s
> 0.0001s
> Gen 1 2 colls, 0 par 0.000s 0.001s 0.0003s
> 0.0006s
>
> INIT time 0.000s ( 0.000s elapsed)
> MUT time 0.364s ( 0.451s elapsed)
> GC time 0.024s ( 0.011s elapsed)
> EXIT time 0.000s ( 0.001s elapsed)
> Total time 0.388s ( 0.463s elapsed)
>
> %GC time 6.2% (2.5% elapsed)
>
> Alloc rate 11,367,122,065 bytes per MUT second
>
> Productivity 93.8% of total user, 78.6% of total elapsed
>
> -Iavor
>
>
>
>
> On Thu, Aug 25, 2016 at 3:31 AM, Simon Peyton Jones via
> Glasgow-haskell-users <glasgow-haskell-users at haskell.org
> <mailto:glasgow-haskell-users at haskell.org>> wrote:
>
> Sounds bad. But it'll need someone with bytestring expertise to
> debug. Maybe there's a GHC problem underlying; or maybe it's
> shortcoming of bytestring.
>
> Simon
>
> | -----Original Message-----
> | From: Glasgow-haskell-users [mailto:glasgow-haskell-users-
> <mailto:glasgow-haskell-users->
> | bounces at haskell.org <mailto:bounces at haskell.org>] On Behalf Of
> Dominic Steinitz
> | Sent: 25 August 2016 10:11
> | To: GHC users <glasgow-haskell-users at haskell.org
> <mailto:glasgow-haskell-users at haskell.org>>
> | Subject: GHC Performance / Replacement for R?
> |
> | I am trying to use Haskell as a replacement for R but running
> into two
> | problems which I describe below. Are there any plans to address the
> | performance issues I have encountered?
> |
> | 1. I seem to have to jump through a lot of hoops just to be
> able to
> | select the data I am interested in.
> |
> | {-# LANGUAGE ScopedTypeVariables #-}
> |
> | {-# OPTIONS_GHC -Wall #-}
> |
> | import Data.Csv hiding ( decodeByName )
> | import qualified Data.Vector as V
> |
> | import Data.ByteString ( ByteString )
> | import qualified Data.ByteString.Char8 as B
> |
> | import qualified Pipes.Prelude as P
> | import qualified Pipes.ByteString as Bytes import Pipes import
> | qualified Pipes.Csv as Csv import System.IO
> |
> | import qualified Control.Foldl as L
> |
> | main :: IO ()
> | main = withFile "examples/787338586_T_ONTIME.csv" ReadMode $ \h
> -> do
> | let csvs :: Producer (V.Vector ByteString) IO ()
> | csvs = Csv.decode HasHeader (Bytes.fromHandle h) >-> P.concat
> | uvectors :: Producer (V.Vector ByteString) IO ()
> | uvectors = csvs >-> P.map (V.foldr V.cons V.empty)
> | vec_vec <- L.impurely P.foldM L.vector uvectors
> | print $ (vec_vec :: V.Vector (V.Vector ByteString)) V.! 17
> | print $ V.length vec_vec
> | let rockspring = V.filter (\x -> x V.! 8 == B.pack "RKS") vec_vec
> | print $ V.length rockspring
> |
> | Here's the equivalent R:
> |
> | df <- read.csv("787338586_T_ONTIME.csv")
> | rockspring <- df[df$ORIGIN == "RKS",]
> |
> | 2. Now I think I could improve the above to make an
> environment that
> | is more similar to the one my colleagues are used to in R
> but more
> | problematical is the memory usage.
> |
> | * 112.5M file
> | * Just loading the source into ghci takes 142.7M
> | * > foo <- readFile "examples/787338586_T_ONTIME.csv" > length foo
> | takes me up to 4.75G. But we probably don't want to do this!
> | * Let's try again.
> | * > :set -XScopedTypeVariables
> | * > h <- openFile "examples/787338586_T_ONTIME.csv" ReadMode
> | * > let csvs :: Producer (V.Vector ByteString) IO () = Csv.decode
> | HasHeader (Bytes.fromHandle h) >-> P.concat
> | * > let uvectors :: Producer (V.Vector ByteString) IO () =
> csvs >->
> | P.map (V.map id) >-> P.map (V.foldr V.cons V.empty)
> | * > vec_vec :: V.Vector (V.Vector ByteString) <- L.impurely
> P.foldM
> | L.vector uvectors
> | * Now I am up at 3.17G. In R I am under 221.3M.
> | * > V.length rockspring takes a long time to return 155 and
> now I am
> | at 3.5G!!! In R > rockspring <- df[df$ORIGIN == "RKS",] seems
> | instantaneous and now uses only 379.5M.
> | * > length(rockspring) 37 > length(df$ORIGIN) 471949 i.e.
> there are
> | 37 columns and 471,949 rows.
> |
> | Running this as an executable gives
> |
> | ~/Dropbox/Private/labels $ ./examples/BugReport +RTS -s ["2014-01-
> | 01","EV","20366","N904EV","2512","10747","1074702","30747",
> | "BRO","Brownsville, TX","Texas","11298","1129803","30194",
> | "DFW","Dallas/Fort Worth, TX","Texas","0720","0718",
> |
> "-2.00","8.00","0726","0837","7.00","0855","0844","-11.00","0.00",
> | "","0.00","482.00","","","","","",""]
> | 471949
> | 155
> | 14,179,764,240 bytes allocated in the heap
> | 3,378,342,072 bytes copied during GC
> | 786,333,512 bytes maximum residency (13 sample(s))
> | 36,933,976 bytes maximum slop
> | 1434 MB total memory in use (0 MB lost due to
> | fragmentation)
> |
> | Tot time (elapsed) Avg pause
> | Max pause
> | Gen 0 26989 colls, 0 par 1.423s 1.483s 0.0001s
> | 0.0039s
> | Gen 1 13 colls, 0 par 1.005s 1.499s 0.1153s
> | 0.6730s
> |
> | INIT time 0.000s ( 0.003s elapsed)
> | MUT time 3.195s ( 3.193s elapsed)
> | GC time 2.428s ( 2.982s elapsed)
> | EXIT time 0.016s ( 0.138s elapsed)
> | Total time 5.642s ( 6.315s elapsed)
> |
> | %GC time 43.0% (47.2% elapsed)
> |
> | Alloc rate 4,437,740,019 bytes per MUT second
> |
> | Productivity 57.0% of total user, 50.9% of total elapsed
> |
> | _______________________________________________
> | Glasgow-haskell-users mailing list
> | Glasgow-haskell-users at haskell.org
> <mailto:Glasgow-haskell-users at haskell.org>
> |
> https://na01.safelinks.protection.outlook.com/?url=http%3a%2f%2fmail.h
> <https://na01.safelinks.protection.outlook.com/?url=http%3a%2f%2fmail.h>
> | askell.org
> <http://askell.org>%2fcgi-bin%2fmailman%2flistinfo%2fglasgow-haskell-
> | users&data=01%7c01%7csimonpj%40microsoft.com
> <http://40microsoft.com>%7c5017a5fe26cb4df9c41d08d
> |
> 3ccc7b5bd%7c72f988bf86f141af91ab2d7cd011db47%7c1&sdata=2Ku1Fr5QttHRoj5
> | NSOJREZrt2Fsqhi63iJOpxmku68E%3d
> _______________________________________________
> Glasgow-haskell-users mailing list
> Glasgow-haskell-users at haskell.org
> <mailto:Glasgow-haskell-users at haskell.org>
> http://mail.haskell.org/cgi-bin/mailman/listinfo/glasgow-haskell-users
> <http://mail.haskell.org/cgi-bin/mailman/listinfo/glasgow-haskell-users>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/glasgow-haskell-users/attachments/20160831/88bbe628/attachment.html>
More information about the Glasgow-haskell-users
mailing list