GHC Performance / Replacement for R?
Simon Peyton Jones
simonpj at microsoft.com
Thu Aug 25 10:31:49 UTC 2016
Sounds bad. But it'll need someone with bytestring expertise to debug. Maybe there's a GHC problem underlying; or maybe it's shortcoming of bytestring.
Simon
| -----Original Message-----
| From: Glasgow-haskell-users [mailto:glasgow-haskell-users-
| bounces at haskell.org] On Behalf Of Dominic Steinitz
| Sent: 25 August 2016 10:11
| To: GHC users <glasgow-haskell-users at haskell.org>
| Subject: GHC Performance / Replacement for R?
|
| I am trying to use Haskell as a replacement for R but running into two
| problems which I describe below. Are there any plans to address the
| performance issues I have encountered?
|
| 1. I seem to have to jump through a lot of hoops just to be able to
| select the data I am interested in.
|
| {-# LANGUAGE ScopedTypeVariables #-}
|
| {-# OPTIONS_GHC -Wall #-}
|
| import Data.Csv hiding ( decodeByName )
| import qualified Data.Vector as V
|
| import Data.ByteString ( ByteString )
| import qualified Data.ByteString.Char8 as B
|
| import qualified Pipes.Prelude as P
| import qualified Pipes.ByteString as Bytes import Pipes import
| qualified Pipes.Csv as Csv import System.IO
|
| import qualified Control.Foldl as L
|
| main :: IO ()
| main = withFile "examples/787338586_T_ONTIME.csv" ReadMode $ \h -> do
| let csvs :: Producer (V.Vector ByteString) IO ()
| csvs = Csv.decode HasHeader (Bytes.fromHandle h) >-> P.concat
| uvectors :: Producer (V.Vector ByteString) IO ()
| uvectors = csvs >-> P.map (V.foldr V.cons V.empty)
| vec_vec <- L.impurely P.foldM L.vector uvectors
| print $ (vec_vec :: V.Vector (V.Vector ByteString)) V.! 17
| print $ V.length vec_vec
| let rockspring = V.filter (\x -> x V.! 8 == B.pack "RKS") vec_vec
| print $ V.length rockspring
|
| Here's the equivalent R:
|
| df <- read.csv("787338586_T_ONTIME.csv")
| rockspring <- df[df$ORIGIN == "RKS",]
|
| 2. Now I think I could improve the above to make an environment that
| is more similar to the one my colleagues are used to in R but more
| problematical is the memory usage.
|
| * 112.5M file
| * Just loading the source into ghci takes 142.7M
| * > foo <- readFile "examples/787338586_T_ONTIME.csv" > length foo
| takes me up to 4.75G. But we probably don't want to do this!
| * Let's try again.
| * > :set -XScopedTypeVariables
| * > h <- openFile "examples/787338586_T_ONTIME.csv" ReadMode
| * > let csvs :: Producer (V.Vector ByteString) IO () = Csv.decode
| HasHeader (Bytes.fromHandle h) >-> P.concat
| * > let uvectors :: Producer (V.Vector ByteString) IO () = csvs >->
| P.map (V.map id) >-> P.map (V.foldr V.cons V.empty)
| * > vec_vec :: V.Vector (V.Vector ByteString) <- L.impurely P.foldM
| L.vector uvectors
| * Now I am up at 3.17G. In R I am under 221.3M.
| * > V.length rockspring takes a long time to return 155 and now I am
| at 3.5G!!! In R > rockspring <- df[df$ORIGIN == "RKS",] seems
| instantaneous and now uses only 379.5M.
| * > length(rockspring) 37 > length(df$ORIGIN) 471949 i.e. there are
| 37 columns and 471,949 rows.
|
| Running this as an executable gives
|
| ~/Dropbox/Private/labels $ ./examples/BugReport +RTS -s ["2014-01-
| 01","EV","20366","N904EV","2512","10747","1074702","30747",
| "BRO","Brownsville, TX","Texas","11298","1129803","30194",
| "DFW","Dallas/Fort Worth, TX","Texas","0720","0718",
| "-2.00","8.00","0726","0837","7.00","0855","0844","-11.00","0.00",
| "","0.00","482.00","","","","","",""]
| 471949
| 155
| 14,179,764,240 bytes allocated in the heap
| 3,378,342,072 bytes copied during GC
| 786,333,512 bytes maximum residency (13 sample(s))
| 36,933,976 bytes maximum slop
| 1434 MB total memory in use (0 MB lost due to
| fragmentation)
|
| Tot time (elapsed) Avg pause
| Max pause
| Gen 0 26989 colls, 0 par 1.423s 1.483s 0.0001s
| 0.0039s
| Gen 1 13 colls, 0 par 1.005s 1.499s 0.1153s
| 0.6730s
|
| INIT time 0.000s ( 0.003s elapsed)
| MUT time 3.195s ( 3.193s elapsed)
| GC time 2.428s ( 2.982s elapsed)
| EXIT time 0.016s ( 0.138s elapsed)
| Total time 5.642s ( 6.315s elapsed)
|
| %GC time 43.0% (47.2% elapsed)
|
| Alloc rate 4,437,740,019 bytes per MUT second
|
| Productivity 57.0% of total user, 50.9% of total elapsed
|
| _______________________________________________
| Glasgow-haskell-users mailing list
| Glasgow-haskell-users at haskell.org
| https://na01.safelinks.protection.outlook.com/?url=http%3a%2f%2fmail.h
| askell.org%2fcgi-bin%2fmailman%2flistinfo%2fglasgow-haskell-
| users&data=01%7c01%7csimonpj%40microsoft.com%7c5017a5fe26cb4df9c41d08d
| 3ccc7b5bd%7c72f988bf86f141af91ab2d7cd011db47%7c1&sdata=2Ku1Fr5QttHRoj5
| NSOJREZrt2Fsqhi63iJOpxmku68E%3d
More information about the Glasgow-haskell-users
mailing list