GHC Performance / Replacement for R?

Simon Peyton Jones simonpj at microsoft.com
Thu Aug 25 10:31:49 UTC 2016


Sounds bad.  But it'll need someone with bytestring expertise to debug.  Maybe there's a GHC problem underlying; or maybe it's a shortcoming of bytestring.

Simon

|  -----Original Message-----
|  From: Glasgow-haskell-users [mailto:glasgow-haskell-users-
|  bounces at haskell.org] On Behalf Of Dominic Steinitz
|  Sent: 25 August 2016 10:11
|  To: GHC users <glasgow-haskell-users at haskell.org>
|  Subject: GHC Performance / Replacement for R?
|  
|  I am trying to use Haskell as a replacement for R but running into two
|  problems which I describe below. Are there any plans to address the
|  performance issues I have encountered?
|  
|   1. I seem to have to jump through a lot of hoops just to be able to
|      select the data I am interested in.
|  
|  {-# LANGUAGE ScopedTypeVariables #-}
|  
|  {-# OPTIONS_GHC -Wall #-}
|  
|  import Data.Csv hiding ( decodeByName )
|  import qualified Data.Vector as V
|  
|  import Data.ByteString ( ByteString )
|  import qualified Data.ByteString.Char8 as B
|  
|  import qualified Pipes.Prelude as P
|  import qualified Pipes.ByteString as Bytes
|  import Pipes
|  import qualified Pipes.Csv as Csv
|  import System.IO
|  
|  import qualified Control.Foldl as L
|  
|  main :: IO ()
|  main = withFile "examples/787338586_T_ONTIME.csv" ReadMode $ \h -> do
|    let csvs :: Producer (V.Vector ByteString) IO ()
|        csvs = Csv.decode HasHeader (Bytes.fromHandle h) >-> P.concat
|        uvectors :: Producer (V.Vector ByteString) IO ()
|        uvectors = csvs  >-> P.map (V.foldr V.cons V.empty)
|    vec_vec <- L.impurely P.foldM  L.vector uvectors
|    print $ (vec_vec :: V.Vector (V.Vector ByteString)) V.! 17
|    print $ V.length vec_vec
|    let rockspring = V.filter (\x -> x V.! 8 == B.pack "RKS") vec_vec
|    print $ V.length rockspring
|  
|  Here's the equivalent R:
|  
|  df <- read.csv("787338586_T_ONTIME.csv")
|  rockspring <- df[df$ORIGIN == "RKS",]
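
For comparison with the R one-liner, here is a minimal sketch of selecting rows by column name rather than by position (the `V.! 8` above). It assumes the rows are already parsed into lists of fields; `Row`, `selectWhere` and `fieldAt` are hypothetical names, not part of cassava or pipes-csv, and the sample data is made up (only the ORIGIN column name and the "RKS" value come from the message).

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.ByteString (ByteString)
import Data.List       (elemIndex)

-- | A parsed CSV row: one ByteString per field (hypothetical alias).
type Row = [ByteString]

-- | Keep the rows whose value in the named column equals the wanted value.
-- Returns Nothing when the header does not contain that column name.
-- Rows too short to contain the column are dropped.
selectWhere :: ByteString -> ByteString -> Row -> [Row] -> Maybe [Row]
selectWhere column wanted header rows = do
  i <- elemIndex column header
  pure [ r | r <- rows, fieldAt i r == Just wanted ]

-- | Positional lookup that returns Nothing instead of crashing.
fieldAt :: Int -> Row -> Maybe ByteString
fieldAt i r = case drop i r of
  (x : _) -> Just x
  []      -> Nothing

main :: IO ()
main = do
  let header = ["CARRIER", "ORIGIN", "DEST"]   -- made-up header
      rows   = [ ["EV", "BRO", "DFW"]          -- made-up sample rows
               , ["OO", "RKS", "DEN"]
               , ["OO", "RKS", "SLC"] ]
  print (fmap length (selectWhere "ORIGIN" "RKS" header rows))
  print (selectWhere "NO_SUCH_COLUMN" "RKS" header rows)
```

Looking the column up once in the header keeps the predicate close to `df$ORIGIN == "RKS"` and avoids hard-coding the index; it does nothing about the memory question below.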
|  
|   2. Now I think I could improve the above to make an environment that
|      is more similar to the one my colleagues are used to in R but more
|      problematical is the memory usage.
|  
|   * 112.5M file
|   * Just loading the source into ghci takes 142.7M
|   * > foo <- readFile "examples/787338586_T_ONTIME.csv"
|     > length foo
|     takes me up to 4.75G. But we probably don't want to do this!
|   * Let's try again.
|   * > :set -XScopedTypeVariables
|   * > h <- openFile "examples/787338586_T_ONTIME.csv" ReadMode
|   * > let csvs :: Producer (V.Vector ByteString) IO () =
|         Csv.decode HasHeader (Bytes.fromHandle h) >-> P.concat
|   * > let uvectors :: Producer (V.Vector ByteString) IO () =
|         csvs >-> P.map (V.map id) >-> P.map (V.foldr V.cons V.empty)
|   * > vec_vec :: V.Vector (V.Vector ByteString) <-
|         L.impurely P.foldM L.vector uvectors
|   * Now I am up at 3.17G. In R I am under 221.3M.
|   * > V.length rockspring takes a long time to return 155 and now I am
|     at 3.5G!!! In R > rockspring <- df[df$ORIGIN == "RKS",] seems
|     instantaneous and now uses only 379.5M.
|   * > length(rockspring)
|     37
|     > length(df$ORIGIN)
|     471949
|     i.e. there are 37 columns and 471,949 rows.
|  
|  Running this as an executable gives
|  
|  ~/Dropbox/Private/labels $ ./examples/BugReport +RTS -s
|  ["2014-01-01","EV","20366","N904EV","2512","10747","1074702","30747",
|   "BRO","Brownsville, TX","Texas","11298","1129803","30194",
|   "DFW","Dallas/Fort Worth, TX","Texas","0720","0718",
|   "-2.00","8.00","0726","0837","7.00","0855","0844","-11.00","0.00",
|   "","0.00","482.00","","","","","",""]
|  471949
|  155
|    14,179,764,240 bytes allocated in the heap
|     3,378,342,072 bytes copied during GC
|       786,333,512 bytes maximum residency (13 sample(s))
|        36,933,976 bytes maximum slop
|              1434 MB total memory in use (0 MB lost due to fragmentation)
|  
|                                       Tot time (elapsed)  Avg pause  Max pause
|    Gen  0     26989 colls,     0 par    1.423s   1.483s     0.0001s    0.0039s
|    Gen  1        13 colls,     0 par    1.005s   1.499s     0.1153s    0.6730s
|  
|    INIT    time    0.000s  (  0.003s elapsed)
|    MUT     time    3.195s  (  3.193s elapsed)
|    GC      time    2.428s  (  2.982s elapsed)
|    EXIT    time    0.016s  (  0.138s elapsed)
|    Total   time    5.642s  (  6.315s elapsed)
|  
|    %GC     time      43.0%  (47.2% elapsed)
|  
|    Alloc rate    4,437,740,019 bytes per MUT second
|  
|    Productivity  57.0% of total user, 50.9% of total elapsed
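
To put the residency figure in proportion, a back-of-envelope sketch using only the numbers quoted above (471,949 rows, 37 columns, 786,333,512 bytes maximum residency, a 112.5M file). It assumes "112.5M" means 112.5 * 10^6 bytes and that maximum residency is dominated by the parsed fields; both are assumptions, not measurements.

```haskell
import Numeric (showFFloat)

-- Figures quoted in the message above.
rowCount, columnCount, maxResidencyBytes :: Integer
rowCount          = 471949
columnCount       = 37
maxResidencyBytes = 786333512

-- | Total number of fields held in the vector of vectors.
fieldCount :: Integer
fieldCount = rowCount * columnCount

-- | Average resident bytes per field at peak residency.
bytesPerField :: Double
bytesPerField = fromIntegral maxResidencyBytes / fromIntegral fieldCount

-- | On-disk size; assumes "112.5M" means 112.5 * 10^6 bytes.
fileBytes :: Double
fileBytes = 112.5e6

-- | Average bytes per field in the file, separators included.
fileBytesPerField :: Double
fileBytesPerField = fileBytes / fromIntegral fieldCount

main :: IO ()
main = do
  putStrLn ("fields:                   " ++ show fieldCount)
  putStrLn ("resident bytes per field: " ++ showFFloat (Just 1) bytesPerField "")
  putStrLn ("file bytes per field:     " ++ showFFloat (Just 1) fileBytesPerField "")
```

Under those assumptions this comes to roughly 45 resident bytes per field against roughly 6 bytes per field on disk, which would point at per-value overhead of a boxed vector of ByteStrings rather than at the payload itself. That reading is a guess and would need heap profiling to confirm.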
|  
|  _______________________________________________
|  Glasgow-haskell-users mailing list
|  Glasgow-haskell-users at haskell.org
|  http://mail.haskell.org/cgi-bin/mailman/listinfo/glasgow-haskell-users

