From zocca.marco at gmail.com Sun Jun 17 12:47:53 2018
From: zocca.marco at gmail.com (Marco Zocca)
Date: Sun, 17 Jun 2018 14:47:53 +0200
Subject: [Data-haskell] DataHaskell Newsletter #1, June 2018
Message-ID:

Hi all,

I'd like to share with you a few of the things that happened over the past months on and around DataHaskell (DH), along with a summary of the current state of things. Time permitting, this will become a regular newsletter, with the idea of keeping up to date those who don't usually hang out on our Gitter chatroom [1].

----------------------------------------------------------------------------------------------------------------

* Outreach activities and related meetings

ICFP 2017 : In early September 2017 we had an hour-long lunch mini-workshop at ICFP, during which a few people presented how they use Haskell in their numerical and data-crunching work. Opinions were mostly positive, with one "contrarian" viewpoint from an expert practitioner who lamented the low performance of native Haskell numerical code (e.g. the large memory footprint of boxed data). Michal Gajda presented how he uses IHaskell for interactive data exploration (btw, it's a thing and you can try it today!), Trevor McDonell introduced Accelerate (the high-performance array library that can target CUDA GPUs), Adam Scibior presented `monad-bayes` and Praveen Narayan presented `hakaru` (two probabilistic programming languages embedded in Haskell).

ZuriHac 2018 : Community bonding mostly (read: no hacking was done). We received more feedback of the usual sort: "how do I do X in Haskell?", "I was expecting a more advanced state of machine learning capabilities in Haskell", etc. There may or may not be a set of Haskell bindings for Apache Arrow in the works, stay tuned!

ICFP 2018 : For those who will be attending the conference, there will be two very interesting workshops: FHPC (functional programming in high-performance computing) and NPFL (numerical programming in functional languages).
One of the workshop chairs of NPFL is Dominic Steinitz of Tweag, author of the excellent idontgetoutmuch.wordpress.com blog.

----------------------------------------------------------------------------------------------------------------

* DH survey

On April 5 I published a survey [2] that tried to gauge the community's interest and pain points related to doing "data science" in our beloved language. The survey is still open but I won't be collecting data anymore (70 people have taken the questionnaire and 62 have completed it), and I've scraped and formatted the dataset (you can find it at https://github.com/DataHaskell/surveys). Some of the free-text answers are particularly interesting. I don't currently have much time to create the plots and publish a blog post about it, but it would be a Good Thing to have, if anyone wants to step up. The project backbone is ready for new contributors!

----------------------------------------------------------------------------------------------------------------

* Contributors wanted : External projects

New exciting projects appear at regular intervals, and it's becoming quite hard to keep track of everything that's happening in this area. I'd like to highlight here a few that are seeing significant activity as of late and have large impact potential.

** General-purpose (e.g. numerics-related)

`numhask` [3] is an experiment in replacing the numerical typeclasses of `base`, in particular those related to `Num`, with a finer-grained and more principled hierarchy. It is already available and has been receiving quite a bit of attention recently, but there are still a number of areas that could use some help, for example property checking, test coverage, the `accelerate` bindings, etc.

** Data science

`boke-hs` [4] : a native Haskell interface to generate Bokeh plots.
The project was mostly developed during the ZuriHac weekend; it is currently functional and you can create line plots with it, but helping out with the domain mapping would be a very valuable contribution. In particular, the library emits JSON blobs that are interpreted in the browser by BokehJS; "domain mapping" means creating the Haskell data that will serialize into the appropriate JSON.

A currently unnamed library for linking databases and Haskell dataframes. This is Gagandeep Bhatia's ongoing Summer of Haskell project, and you can find a writeup and links to his current work (which uses `beam` as the database binding library and `Frames` as the type-safe in-memory representation) here: https://www.gagandeepbhatia.com/blog/ .

`haskell.do` [5] : a native Haskell notebook/interactive editor/IDE for the browser. The project started well, has a website and a few initial releases, but currently needs some love to achieve its full potential. Interactive development with visual feedback is a crucial part of data science work, and contributions to this project will be very valuable.

** Machine learning

`hasktorch` [6] is in developer beta ! This is a set of Haskell bindings for `torch` (the deep learning C library), and comes wrapped in a typed interface that provides statically checked vector dimensions and the like.

** Probabilistic programming languages

`deanie` [7] is a probabilistic EDSL. Currently it's lacking code for its general-purpose inference engine (see https://github.com/jtobin/deanie/blob/master/lib/Deanie/Inference/Comonadic.hs). The implementation technique is related to "Co-Free for interpreters" as shown in [8].

----------------------------------------------------------------------------------------------------------------

* Contributors wanted : DataHaskell internal projects

We do have a few projects set up that could use some collaborators.
If you are willing to contribute to any of these, please open a ticket on the project issue tracker and the maintainers will be in touch shortly.

* `type-providers` : a unified code generation library for accessing structured data in a type-safe way : https://github.com/DataHaskell/type-providers . Michal Gajda is willing to mentor a student who wants to translate his own `json-autotype` to a new library that can generate type-safe code for XML, and can be queried in-memory via `Frames`. Contributing to this library will likely require some knowledge of Template Haskell.

* A fork of the venerable `statistics` library that splits out the dense linear algebra code as a standalone library : https://github.com/DataHaskell/statistics . There is currently a PR to the upstream library (https://github.com/bos/statistics/pull/143); longer-term plans include giving it a typeclass-based interface, for example based on `numhask`.

* `numhask-linear-algebra` : this is a longer shot at unifying native linear algebra libraries under a single typeclass representation, as provided by `numhask` (project at https://github.com/DataHaskell/numhask-linear-algebra ). This is something that's currently unavailable in the Haskell ecosystem and will be a highly relevant contribution to many.

----------------------------------------------------------------------------------------------------------------

* DataHaskell internal administration

In addition to helping out with the survey blogpost, some help in the following areas would be much appreciated:

** Knowledge base : testers wanted !

The knowledge base ( http://www.datahaskell.org/docs/community/current-environment.html ) is a growing curated collection of libraries for doing data science-related tasks; anything from manipulating storable data to statistical inference.
It's pretty useful as it is, but I think the entries should be annotated with additional information, such as the degree of completeness and developer- and user-friendliness. For this, some help in testing out the libraries and reporting back would be very useful. Ultimately, it would be good to contribute such an assessment back to the "state of the ecosystem" document [9], which is highly visible but not very up to date on data science-related things.

** Maintainers wanted !

Currently, Nikita and I are the only owners of the GitHub organization. It would be great if additional people stepped up for "tending the garden", e.g. keeping track of issue tickets, filling out the documentation, lending a hand on the Gitter channel to address newcomers' issues, etc.

Historically, data science and numerical computing have been underserved niches of the Haskell ecosystem, and while things are steadily improving, together we can bring about this change soon !

----------------------------------------------------------------------------------------------------------------

That's it for now, I hope you enjoyed this Newsletter; don't hesitate to share your thoughts either here or on our Gitter chatroom!
[1]

Marco
github.com/ocramz

----------------------------------------------------------------------------------------------------------------

References :

[1] DH Gitter chatroom : https://gitter.im/dataHaskell/Lobby
[2] DH user survey, April 2018 : https://www.surveymonkey.com/r/3FBBJWR
[3] numhask : https://github.com/tonyday567/numhask
[4] boke-hs, native Haskell bindings for Bokeh : https://github.com/ahaym/boke-hs
[5] HaskellDO, the interactive Haskell editor : http://haskell.do/ , https://github.com/theam/haskell-do
[6] hasktorch, the Haskell bindings to Torch : https://github.com/hasktorch/hasktorch
[7] deanie, probabilistic programming language : https://github.com/jtobin/deanie
[8] "Free for DSLs, co-free for interpreters" : http://dlaing.org/cofun/posts/free_and_cofree.html
[9] State of the Haskell ecosystem : https://github.com/Gabriel439/post-rfc/blob/master/sotu.md