GHC Performance / Replacement for R?
Iavor Diatchki
iavor.diatchki at gmail.com
Tue Aug 30 21:05:07 UTC 2016
Hello,
when you parse the CSV fully, you end up creating a lot of small bytestring
objects, and each of these adds some overhead. The vectors themselves add
up some additional overhead. All of this adds up when you have as many
fields as you do. An alternative would be to use a different
representation for the data, which recomputes things when needed. While
this might be a bit slower in some cases, it could have significant saving
in terms of memory use. I wrote up a small example to illustrate what I
have in mind, which should be attached to this e-mail.
Basically, instead of parsing the CSV file fully, I just indexed where the
lines are (ref. the "rows" field of "CSV"). This allows me to access each
row quickly, and the when I need to get a specific field, I simply parse
the bytes of the row.
One could play all kinds of games like that, and I imagine R does something
similar, although I have never looked at how it works. To test the
approach I generated ~200Mb of sample data (generator is also in the
attached file), and I was able to filter it using ~240Mb, which is
comparable to what you reported about R. One could probably package all
this up in library that supports "R like" operations.
These are the stats I get from -s:
4,137,632,432 bytes allocated in the heap
925,200 bytes copied during GC
200,104,224 bytes maximum residency (2 sample(s))
6,217,864 bytes maximum slop
246 MB total memory in use (1 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max
pause
Gen 0 7564 colls, 0 par 0.024s 0.011s 0.0000s
0.0001s
Gen 1 2 colls, 0 par 0.000s 0.001s 0.0003s
0.0006s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.364s ( 0.451s elapsed)
GC time 0.024s ( 0.011s elapsed)
EXIT time 0.000s ( 0.001s elapsed)
Total time 0.388s ( 0.463s elapsed)
%GC time 6.2% (2.5% elapsed)
Alloc rate 11,367,122,065 bytes per MUT second
Productivity 93.8% of total user, 78.6% of total elapsed
-Iavor
On Thu, Aug 25, 2016 at 3:31 AM, Simon Peyton Jones via
Glasgow-haskell-users <glasgow-haskell-users at haskell.org> wrote:
> Sounds bad. But it'll need someone with bytestring expertise to debug.
> Maybe there's a GHC problem underlying; or maybe it's shortcoming of
> bytestring.
>
> Simon
>
> | -----Original Message-----
> | From: Glasgow-haskell-users [mailto:glasgow-haskell-users-
> | bounces at haskell.org] On Behalf Of Dominic Steinitz
> | Sent: 25 August 2016 10:11
> | To: GHC users <glasgow-haskell-users at haskell.org>
> | Subject: GHC Performance / Replacement for R?
> |
> | I am trying to use Haskell as a replacement for R but running into two
> | problems which I describe below. Are there any plans to address the
> | performance issues I have encountered?
> |
> | 1. I seem to have to jump through a lot of hoops just to be able to
> | select the data I am interested in.
> |
> | {-# LANGUAGE ScopedTypeVariables #-}
> |
> | {-# OPTIONS_GHC -Wall #-}
> |
> | import Data.Csv hiding ( decodeByName )
> | import qualified Data.Vector as V
> |
> | import Data.ByteString ( ByteString )
> | import qualified Data.ByteString.Char8 as B
> |
> | import qualified Pipes.Prelude as P
> | import qualified Pipes.ByteString as Bytes import Pipes import
> | qualified Pipes.Csv as Csv import System.IO
> |
> | import qualified Control.Foldl as L
> |
> | main :: IO ()
> | main = withFile "examples/787338586_T_ONTIME.csv" ReadMode $ \h -> do
> | let csvs :: Producer (V.Vector ByteString) IO ()
> | csvs = Csv.decode HasHeader (Bytes.fromHandle h) >-> P.concat
> | uvectors :: Producer (V.Vector ByteString) IO ()
> | uvectors = csvs >-> P.map (V.foldr V.cons V.empty)
> | vec_vec <- L.impurely P.foldM L.vector uvectors
> | print $ (vec_vec :: V.Vector (V.Vector ByteString)) V.! 17
> | print $ V.length vec_vec
> | let rockspring = V.filter (\x -> x V.! 8 == B.pack "RKS") vec_vec
> | print $ V.length rockspring
> |
> | Here's the equivalent R:
> |
> | df <- read.csv("787338586_T_ONTIME.csv")
> | rockspring <- df[df$ORIGIN == "RKS",]
> |
> | 2. Now I think I could improve the above to make an environment that
> | is more similar to the one my colleagues are used to in R but more
> | problematical is the memory usage.
> |
> | * 112.5M file
> | * Just loading the source into ghci takes 142.7M
> | * > foo <- readFile "examples/787338586_T_ONTIME.csv" > length foo
> | takes me up to 4.75G. But we probably don't want to do this!
> | * Let's try again.
> | * > :set -XScopedTypeVariables
> | * > h <- openFile "examples/787338586_T_ONTIME.csv" ReadMode
> | * > let csvs :: Producer (V.Vector ByteString) IO () = Csv.decode
> | HasHeader (Bytes.fromHandle h) >-> P.concat
> | * > let uvectors :: Producer (V.Vector ByteString) IO () = csvs >->
> | P.map (V.map id) >-> P.map (V.foldr V.cons V.empty)
> | * > vec_vec :: V.Vector (V.Vector ByteString) <- L.impurely P.foldM
> | L.vector uvectors
> | * Now I am up at 3.17G. In R I am under 221.3M.
> | * > V.length rockspring takes a long time to return 155 and now I am
> | at 3.5G!!! In R > rockspring <- df[df$ORIGIN == "RKS",] seems
> | instantaneous and now uses only 379.5M.
> | * > length(rockspring) 37 > length(df$ORIGIN) 471949 i.e. there are
> | 37 columns and 471,949 rows.
> |
> | Running this as an executable gives
> |
> | ~/Dropbox/Private/labels $ ./examples/BugReport +RTS -s ["2014-01-
> | 01","EV","20366","N904EV","2512","10747","1074702","30747",
> | "BRO","Brownsville, TX","Texas","11298","1129803","30194",
> | "DFW","Dallas/Fort Worth, TX","Texas","0720","0718",
> | "-2.00","8.00","0726","0837","7.00","0855","0844","-11.00","0.00",
> | "","0.00","482.00","","","","","",""]
> | 471949
> | 155
> | 14,179,764,240 bytes allocated in the heap
> | 3,378,342,072 bytes copied during GC
> | 786,333,512 bytes maximum residency (13 sample(s))
> | 36,933,976 bytes maximum slop
> | 1434 MB total memory in use (0 MB lost due to
> | fragmentation)
> |
> | Tot time (elapsed) Avg pause
> | Max pause
> | Gen 0 26989 colls, 0 par 1.423s 1.483s 0.0001s
> | 0.0039s
> | Gen 1 13 colls, 0 par 1.005s 1.499s 0.1153s
> | 0.6730s
> |
> | INIT time 0.000s ( 0.003s elapsed)
> | MUT time 3.195s ( 3.193s elapsed)
> | GC time 2.428s ( 2.982s elapsed)
> | EXIT time 0.016s ( 0.138s elapsed)
> | Total time 5.642s ( 6.315s elapsed)
> |
> | %GC time 43.0% (47.2% elapsed)
> |
> | Alloc rate 4,437,740,019 bytes per MUT second
> |
> | Productivity 57.0% of total user, 50.9% of total elapsed
> |
> | _______________________________________________
> | Glasgow-haskell-users mailing list
> | Glasgow-haskell-users at haskell.org
> | https://na01.safelinks.protection.outlook.com/?url=http%3a%2f%2fmail.h
> | askell.org%2fcgi-bin%2fmailman%2flistinfo%2fglasgow-haskell-
> | users&data=01%7c01%7csimonpj%40microsoft.com%7c5017a5fe26cb4df9c41d08d
> | 3ccc7b5bd%7c72f988bf86f141af91ab2d7cd011db47%7c1&sdata=2Ku1Fr5QttHRoj5
> | NSOJREZrt2Fsqhi63iJOpxmku68E%3d
> _______________________________________________
> Glasgow-haskell-users mailing list
> Glasgow-haskell-users at haskell.org
> http://mail.haskell.org/cgi-bin/mailman/listinfo/glasgow-haskell-users
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/glasgow-haskell-users/attachments/20160830/8e38334a/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.hs
Type: text/x-haskell
Size: 1986 bytes
Desc: not available
URL: <http://mail.haskell.org/pipermail/glasgow-haskell-users/attachments/20160830/8e38334a/attachment.hs>
More information about the Glasgow-haskell-users
mailing list