[Data-haskell] DataHaskell Newsletter #1, June 2018
Marco Zocca
zocca.marco at gmail.com
Sun Jun 17 12:47:53 UTC 2018
Hi all,
I'd like to share with you a few of the things that happened during
the past months on and around DataHaskell (DH), and a summary of the
current state of things. Time permitting, this will become a regular
newsletter, with the aim of keeping those who don't usually hang out
on our Gitter chatroom [1] up to date.
----------------------------------------------------------------------------------------------------------------
* Outreach activities and related meetings
ICFP 2017 : In early September 2017 we held an hour-long lunch
mini-workshop at ICFP, during which a few people presented how they use
Haskell in their numerical and data-crunching work. Opinions were
mostly positive, but we also heard a "contrarian" viewpoint from an
expert practitioner, who lamented the low performance of native Haskell
numerical code (e.g. the large memory footprint of boxed data). Michal
Gajda presented how he uses IHaskell for interactive data exploration
(btw, it's a thing and you can try it today!), Trevor McDonell
introduced Accelerate (the high-performance array library that can
target CUDA GPUs), Adam Scibior presented `monad-bayes` and Praveen
Narayan presented `hakaru` (two probabilistic programming languages
embedded in Haskell).
ZuriHac 2018 : mostly community bonding (read: no hacking was done).
We received more feedback of the usual sort, e.g. "how do I do X in
Haskell?" or "I was expecting a more advanced state of machine
learning capabilities in Haskell". There may or may not be a set of
Haskell bindings for Apache Arrow in the works, stay tuned!
ICFP 2018 : There will be two very interesting workshops (FHPC,
functional programming in high-performance computing and NPFL,
numerical programming in functional languages), for those who will be
attending the conference. One of the workshop chairs of NPFL is
Dominic Steinitz of Tweag, author of the excellent
idontgetoutmuch.wordpress.com blog.
----------------------------------------------------------------------------------------------------------------
* DH survey
On April 5 I published a survey [2] that tried to gauge the
community's interest and pain points related to doing "data science"
in our beloved language.
The survey is still open but I won't be collecting data anymore (70
people have taken the questionnaire and 62 have completed it), and
I've scraped and formatted the dataset (you can find it at
https://github.com/DataHaskell/surveys).
Some of the free text answers are particularly interesting.
I don't have much time currently to create the plots and publish a
blog post about it, but it would be a Good Thing to have, if anyone
wants to step up. The project backbone is ready for new contributors!
----------------------------------------------------------------------------------------------------------------
* Contributors wanted : External projects
At regular intervals, new exciting projects appear and it's becoming
quite hard to keep track of everything that's happening in this area.
I'd like to highlight here a few that are seeing significant activity
as of late and have large impact potential.
** General-purpose (e.g. numerics-related)
`numhask` [3] is an experiment in replacing the numerical typeclasses
of `base`, in particular those related to `Num`, with a finer-grained
and more principled hierarchy. It is already available and has been
receiving quite a bit of attention recently, but there are still a
number of areas that could use some help, for example property
checking, test coverage, the `accelerate` bindings, etc.
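
To make "finer-grained hierarchy" concrete, here is a minimal sketch of
the general idea. The class and operator names below (`Additive`,
`Subtractive`, `Multiplicative`, `.+.`, `.*.`) are made up for this
illustration and are not meant to reproduce the actual `numhask` API.
The point is only that a type such as a 2-d vector can implement
addition without being forced to define a product, `abs` or `signum`,
as `Num` would require.

```haskell
module Main where

-- Hypothetical class names, chosen for illustration only;
-- this is NOT the numhask API.
class Additive a where
  zero  :: a
  (.+.) :: a -> a -> a

class Additive a => Subtractive a where
  neg :: a -> a

class Multiplicative a where
  one   :: a
  (.*.) :: a -> a -> a

infixl 6 .+.
infixl 7 .*.

-- Int supports all three structures.
instance Additive Int where
  zero  = 0
  (.+.) = (+)

instance Subtractive Int where
  neg = negate

instance Multiplicative Int where
  one   = 1
  (.*.) = (*)

-- A 2-d vector can be added and negated, but it has no canonical
-- product, so it simply gets no Multiplicative instance.
data V2 = V2 Int Int
  deriving (Eq, Show)

instance Additive V2 where
  zero = V2 0 0
  V2 a b .+. V2 c d = V2 (a + c) (b + d)

instance Subtractive V2 where
  neg (V2 a b) = V2 (negate a) (negate b)

-- Generic code asks only for the structure it actually needs.
sumAll :: Additive a => [a] -> a
sumAll = foldr (.+.) zero

main :: IO ()
main = do
  print (sumAll [V2 1 2, V2 3 4])        -- prints: V2 4 6
  print (sumAll [1, 2, 3 :: Int] .*. 2)  -- prints: 12
```

With `Num`, the `V2` instance would need placeholder definitions for
`(*)`, `abs`, `signum` and `fromInteger`; splitting the hierarchy
avoids that.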
** Data science
`boke-hs` [4] : a native Haskell interface to generate Bokeh plots.
The project was mostly developed during the ZuriHac weekend; currently
it's functional and you can create line plots with it, but helping out
with the domain mapping would be a very valuable contribution. In
particular, the library emits JSON blobs that are interpreted in the
browser by BokehJS ; "domain mapping" means creating the Haskell data
that will serialize into JSON appropriately.
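
As a rough illustration of what "domain mapping" involves, the sketch
below defines a Haskell record for a line plot and serializes it to a
JSON blob. Everything here is an assumption made for the example: the
`LinePlot` type, its field names and the JSON keys are hypothetical and
follow neither the `boke-hs` API nor the real BokehJS schema. The JSON
renderer is hand-rolled so that the example only depends on `base`; an
actual binding would more likely use a JSON library.

```haskell
module Main where

import Data.List (intercalate)

-- A tiny JSON value type, just enough for this example.
data JSON
  = JNum Double
  | JStr String
  | JArr [JSON]
  | JObj [(String, JSON)]

-- Render a JSON value as text. Strings are escaped with Haskell's
-- `show`, which matches JSON escaping only for plain ASCII text;
-- good enough for a sketch, not for production use.
render :: JSON -> String
render (JNum x)   = show x
render (JStr s)   = show s
render (JArr xs)  = "[" ++ intercalate "," (map render xs) ++ "]"
render (JObj kvs) =
  "{" ++ intercalate "," [show k ++ ":" ++ render v | (k, v) <- kvs] ++ "}"

-- Hypothetical domain type: the fields are illustrative and do not
-- follow the real BokehJS schema.
data LinePlot = LinePlot
  { plotTitle :: String
  , lineColor :: String
  , points    :: [(Double, Double)]
  }

-- The "domain mapping": from the typed Haskell value to the JSON
-- structure that the browser-side library would interpret.
toJSON :: LinePlot -> JSON
toJSON p = JObj
  [ ("type",  JStr "Line")
  , ("title", JStr (plotTitle p))
  , ("color", JStr (lineColor p))
  , ("x",     JArr (map (JNum . fst) (points p)))
  , ("y",     JArr (map (JNum . snd) (points p)))
  ]

main :: IO ()
main = putStrLn (render (toJSON (LinePlot "demo" "navy" [(0, 1), (1, 3)])))
-- prints: {"type":"Line","title":"demo","color":"navy","x":[0.0,1.0],"y":[1.0,3.0]}
```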
A currently-unnamed library for linking databases and Haskell
dataframes. This is Gagandeep Bhatia's ongoing Summer of Haskell
project, and you can find a writeup and links to his current work
(which uses `beam` as a database binding library and `Frames` as
type-safe in-memory representation) here:
https://www.gagandeepbhatia.com/blog/ .
`haskell.do` [5] : a native Haskell notebook/interactive editor/IDE
for the browser. The project started well, has a website and a few
initial releases, but currently needs some love to achieve its full
potential. Interactive development with visual feedback is a crucial
part of data science work, and contributions to this project will be
very valuable.
** Machine learning
`hasktorch` [6] is in developer beta ! This is a set of Haskell
bindings for `torch` (the deep learning C library), and comes wrapped
in a typed interface that provides statically-checked vector
dimensions, and the like.
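
For readers unfamiliar with statically-checked dimensions, the general
technique can be sketched with a length-indexed vector. This is not the
`hasktorch` interface: the `Nat`, `Vec` and `dot` definitions below are
a self-contained toy written for this illustration, and a real library
would typically build on GHC's type-level naturals instead.

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE KindSignatures #-}

module Main where

-- Natural numbers, promoted to the type level by DataKinds and used
-- only as vector length indices.
data Nat = Z | S Nat

-- A vector whose length is part of its type.
data Vec (n :: Nat) a where
  Nil  :: Vec 'Z a
  Cons :: a -> Vec n a -> Vec ('S n) a

-- Dot product: both arguments must have the same length n, so a
-- dimension mismatch is a compile-time error, not a runtime one.
dot :: Num a => Vec n a -> Vec n a -> a
dot Nil         Nil         = 0
dot (Cons x xs) (Cons y ys) = x * y + dot xs ys

v2a, v2b :: Vec ('S ('S 'Z)) Int
v2a = Cons 1 (Cons 2 Nil)
v2b = Cons 3 (Cons 4 Nil)

main :: IO ()
main = print (dot v2a v2b)  -- prints: 11

-- The following would be rejected by the type checker, because the
-- second vector has length 1 rather than 2:
--   dot v2a (Cons 5 Nil)
```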
** Probabilistic programming languages
`deanie` [7] is a probabilistic EDSL. Currently it's lacking code for
its general-purpose inference engine (see
https://github.com/jtobin/deanie/blob/master/lib/Deanie/Inference/Comonadic.hs).
The implementation technique is related to "Co-Free for interpreters"
as shown in [8].
----------------------------------------------------------------------------------------------------------------
* Contributors wanted : DataHaskell internal projects
We do have a few projects set up that could use some collaborators. If
you are willing to contribute to any of these, please open a ticket on
the project issue tracker and the maintainers will be in touch
shortly.
* `type-providers` : a unified code generation library for accessing
structured data in a type-safe way :
https://github.com/DataHaskell/type-providers . Michal Gajda is
willing to mentor a student who wants to translate his own
`json-autotype` to a new library that can generate type-safe code for
XML, and can be queried in-memory via Frames. Contributing to this
library will likely require some knowledge of Template Haskell.
* A fork of the venerable `statistics` library, that splits out the
dense linear algebra code into a standalone library :
https://github.com/DataHaskell/statistics . There is currently a PR to
the upstream library (https://github.com/bos/statistics/pull/143);
longer-term plans include giving it a typeclass-based interface, for
example based on `numhask`.
* `numhask-linear-algebra` : this is a longer shot at unifying native
linear algebra libraries under a single typeclass representation, as
provided by `numhask` (project at
https://github.com/DataHaskell/numhask-linear-algebra ). This is
something that's currently unavailable in the Haskell ecosystem and
will be a highly relevant contribution to many.
----------------------------------------------------------------------------------------------------------------
* DataHaskell internal administration
In addition to helping out with the survey blogpost, some help in the
following areas would be much appreciated:
** Knowledge base : testers wanted !
The knowledge base (
http://www.datahaskell.org/docs/community/current-environment.html )
is a growing curated collection of libraries for doing data
science-related tasks; anything from manipulating storable data to
statistical inference.
It's pretty useful as it is, but I think the entries should be
annotated with additional information, such as the degree of
completeness and developer- and user-friendliness.
For this, some help in testing out the libraries and reporting back
would be very useful.
Ultimately, it would be very useful to contribute such an assessment
back to the "state of the ecosystem" document [9], which is highly
visible but not very up to date on data-science-related topics.
** Maintainers wanted !
Currently, Nikita and I are the only owners of the GitHub
organization. It would be great if additional people stepped up for
"tending the garden", e.g. keeping track of issue tickets, filling out
the documentation, lending a hand on the Gitter channel to address
newcomers' issues, etc.
Historically, data science and numerical computing have been
underserved niches of the Haskell ecosystem. Things are steadily
improving, and together we can speed up this change !
----------------------------------------------------------------------------------------------------------------
That's it for now, I hope you enjoyed this Newsletter; don't hesitate
to share your thoughts either here or on our Gitter chatroom! [1]
Marco
github.com/ocramz
----------------------------------------------------------------------------------------------------------------
References :
[1] DH Gitter chatroom : https://gitter.im/dataHaskell/Lobby
[2] DH user survey April 2018: https://www.surveymonkey.com/r/3FBBJWR
[3] numhask : https://github.com/tonyday567/numhask
[4] boke-hs , native Haskell bindings for Bokeh :
https://github.com/ahaym/boke-hs
[5] HaskellDO, the interactive Haskell editor: http://haskell.do/ ,
https://github.com/theam/haskell-do
[6] hasktorch, the Haskell bindings to Torch :
https://github.com/hasktorch/hasktorch
[7] deanie , probabilistic programming language :
https://github.com/jtobin/deanie
[8] "Free for DSLs, co-free for interpreters"
http://dlaing.org/cofun/posts/free_and_cofree.html
[9] State of the Haskell ecosystem :
https://github.com/Gabriel439/post-rfc/blob/master/sotu.md