[Haskell-cafe] GSoC proposal: Haskell AST-based refactoring and API upgrading tool

Mon Apr 29 08:00:23 CEST 2013

I would like to propose the development of source code refactoring tool
that operates on Haskell source code ASTs and lets you formulate rewrite
rules written in Haskell.

Objective
---------

The goal is to make refactorings easier and allow global code changes
that might be incredibly tedious to do in a non-automated way.
By making these transformations convenient, we can make it easier to
maintain clean code, add new features or clean up leftovers faster, and
reduce the fear and effort to upgrade to newer versions of packages and
APIs.

Transformations
---------------

First, here are a few operations you would use this tool for. Some of
them are common operations you would also do in other programming
languages, some are more specific to Haskell.

* Changing all occurrences of "import Prelude hiding (catch)" to "import
qualified Control.Exception as E"

* Replacing all uses of a function with that function being imported
qualified or the other way around

* Adding a field to data constructor a record, setting user-supplied
defaults for construction and destruction:

    -- Suppose you want to change one of these
    data User = User { name :: String, age :: Int }
    data User = User String Int

    -- into one of these
    data User = User { name :: String, age :: Int, active :: Bool }
    data User = User String Int Bool

    -- the refactoring tool could perform, in all relevant locations:
    show (User name age) = ...
    show (User name age _) = ...

    -- and also this transformation:
    ... u { name = "deleted" } ...
    ... u { name = "deleted", active = False } ...

    -- or equivalently with records.

    -- Special cases could be taken care of as specified, such as
    --   "whenever an object of [this User type] has
    --    of its records passed into some function 'email', do this
    --    now only if the user is active, so modify all relevant code
    --        email (name u)
    --    to
    --        if (active u) then email (name u) else return ()

    -- Other examples include adding a position counter to attoparsec.

* Adding a type parameter to a type

    -- This happens a lot on monad transformer stacks, e.g.
    newtype MyMonad a b c = MyMonad (ReaderT a (WriterT b ...

    -- and as you would probably agree on, this is not the most
    -- comfortable change to make; in big project this can mean
    -- hour-long grinding.

    -- It has also recently happened in the basic underlying types
    -- of packages like conduit and pipes.

* Adding a new transformer around a monad

* Addressing problems like mentioned in
http://blog.ezyang.com/2012/01/modelling-io/:
  "There is one last problem with this approach: once the primitives
have been selected, huge swaths of the standard library have to be
redefined by “copy pasting” their definitions ..."

* Extracting a value into a let or where clause

* Renaming a variable, and all its occurrences that are semantically
same variable (based on its scope)

* Changing the way things are done, such as:

    * Replacing uses of fmap with <$>, also taking care of the
      corresponding import, and such cases were partial application
      is involved

    * Replacing uses of "when (isJust)" to "forM_"

* Making imports clearer by adding all functions used to the file to the
import list of the module that gets them in scope

* Finding all places where an exported function does not have all its
arguments haddock-documented.

* Performing whole-project refactorings instead of operating on single
files only, allowing operations like

    "Find me all functions of this type, e.g.
         Maybe a -> (a -> m a) -> m a
     in the project and extract them into this new module,
     with the name 'onJust'."

Some of the problems above can be tried to address using regex-based
search and replace, but this already fails in the simplest case of
"import Prelude hiding (catch)" in case there is more than that imported
from Prelude or newlines involved in the import list.

Transformation on the AST are much more powerful, and can guarantee that
the result is, at least syntactically, valid. No text base tool can do that.

Other uses
----------

In addition to being able to perform transformations as mentioned above,
the refactoring tool as a library can be leveraged to:

* Support or be the base of code formatting tools such as
haskell-stylish, linters, style/convention checkers, static analyzers,
test coverage tools etc.

* Implement automatic API upgrades.

  Imagine the author of a library you use deprecates some functions,
introduces replacements, adds type parameters. In these cases, it is
very clear and often well-documented which code has to be replaced by
what. The library author could, along with the new release, publish
"upgrade transformations" that you can apply to your code base to save
most of the manual work.

  These upgrade transformations could be either parts of the packages
themselves or be separately maintained and refined on the feedback of
users applying them to their code bases.
  This could allow the Haskell community to keep up their fast pace
while making API breakage a non-problem.

  Automation is what makes us deal well with test suites, source
control, error checking. Taking the pain out of this will increase
people's incentive to upgrade their code an to keep it well-maintained
by easy refactorings.

  This concept of automatic code upgrades would, to my knowledge, be
quite unique in the programming language world, and the possibility to
have this is, as many things, the result of Haskell's excellent type system.
  In comparison to dynamic and "unsafe" languages, we actually have a
lot of information around that we can use to write powerful and
expressive tools.

Splitting into GSoC tasks
-------------------------

I think that the project is too large for one summer and should be split
into two parts:

1. Implementation of a "full-source" transformation-enabling AST
2. The transformation engine and API / DSL

About 1: Creating a full-source AST

Existing parsers for Haskell tend to throw away information that is not
necessary for compiling the code. Of course this is not good for a
code-to-code transformation tool, stripping comments or whitespace the
programmer cares about is not an option.

haskell-src-exts has gained support for comments around version 1.1 (as
another GSoC project?), but still comments are treated as foreigners,
given to you in a list detached from the AST; their positions are
specified with integers, which means that when you modify the AST, you
have to take care to adjust the comments as well.

The problem seems to be that haskell-source-exts is made for parsing,
its AST is made for transformations. I have yet to see a tool that
actually *modifies* haskell-src-exts's AST - most of them use it for
parsing, and then apply some form of pretty-printing to get back to code.

We need an AST that contains *all* information about the original source
code, which means render . parse = id; this is why I called it "full
source" AST.
Also, this AST should have all elements available in a way that
encourages modification and does not discriminate against operations
like re-indentation or alignment.

I like to think of this as something that could eventually be part of
ghc, in a pipeline like

    GHC's parser
 -> full-source AST
 -> reduced AST that strips parts irrelevant for compilation
 -> usual compilation pipeline

I am thinking of this being in a GHC environment because it would
guarantee that it is up-to-date with the currently supported language
extensions etc.
However, I also see that this might slow down development and make
things harder, so at least for the beginning it might be better to do it
as as a normal library; I would love to get feedback on this.

It might be a good idea to start with GHC's or haskell-source-exts
parser and modify them to obtain the full-source AST. I believe though
that it would be beneficial if the AST would be to a certain extent be
decoupled from the parser, such that other people could write other
parsers (e.g. using other parsing libraries like Parsec, attoparsec,
uu-parsinglib, trifecta) that produce the same compatible AST type.

I believe that creating this AST makes a GSoC project for itself, and
that it would be beneficial for all efforts processing Haskell as a
source language.

About 2: Transformation engine and how to write transformations

Transformations should be Haskell code that operates on the AST.

They would be similar to how you write TemplateHaskell in a way, yet
much easier and more intuitive to use in most cases. I suggest the
creation of a monadic DSL that allows you to select and match those
parts of the code you are interested in, perform some case analysis and
computation to determine how you want to transform it, and then give you
convenient ways to express your changes in that DSL.

Functions in this DSL would roughly belong to one of the kinds
    * finding/matching
    * transforming/rewriting.

I have not thought about how this DSL would exactly look like, but I
could imagine myself writing a high-level transformation like this:

    is           <- getImports
    let matching = filter (importsFunction "catch")
                 . filter (hasModuleName "Prelude")
    mapM_ (rewrite . removeImportFunction "catch") matching

One of the priority goals of this GSoC project is making this API
convenient; transformations should read more naturally and be much
shorter than equivalent TemplateHaskell.

This extensive, convenience oriented monadic DSL should aim to have a
reasonably large set of functions, and "do things for you" even if you
could build them yourself with three or four combinators.
It should base itself on a minimal set of core transformations.
I think that the Haskell community has a lot of expertise, especially
from parsing libraries, in creating this core API + convenience
combinators duo, that could aid the GSoC student to get this right.

Along with this way to specify transformations, a transformation engine
has to be constructed which can perform them somewhat efficiently over
large code-bases.

While speed should should not be a main target for this project and
functionality and expressiveness are the main goals, the transformation
engine should be designed at least with speed in mind, which can be
worked on at a later stage outside of the GSoC project.

This second part of the proposal should result in a library to perform
transformations and an executable that can apply them to a folder of
source files.

Possible extensions and follow-up projects
------------------------------------------

* Interactive refactoring tool

The refactoring tool could be driven by an interactive application that
lets you write some transformations, apply them to your code base, view
the diffs and build the project; on build failures or otherwise
not-complete transformations it would allow to refine your
transformations, or write custom special cases for selected files, line
ranges or functions.
This way you could quickly upgrade your code base in a fast
write-check-repeat cycle.

* An API upgrade infrastructure

As mentioned above, transformations could be published along with new
software packages that allow their dependants to easily upgrade their
code to the latest API version with minimal manual effort.
An infrastructure or standard way of way of doing this could be established.
For now, I think that package maintainers would publish their upgrade
transformations in their projects' or separate code repositories, and
users would apply them with the refactoring tool as needed.
I can also imagine a more dedicated infrastructure though, where API
upgrades are stored on Hackage or a similar database.

Summary
-------

This proposal contains two GSoC projects that I believe to be reasonably
sized for one summer each.

I think that they make good projects according to GSoC standards as they
are focused, feasible, and aimed at creating real-world code that bring
a clearly visible benefit.

The entry barrier is not too high since knowledge of compiler or runtime
internals is not required. However, applicants should probably have a
fairly good understanding of parsers and how they are dealt with in
Haskell, and be somewhat familiar with the Haskell community in order to
find sources of good feedback and for being able to judge whether the
tools being created would be convenient for the community to use.

Discussion
----------

Now I would be glad to get some responses on this proposal; I have
written a bit of text, but it is still a very rough idea and I would
love to hear your thoughts about it.