Analyzing Haskell call graph (was: Thread on Discourse - HIE file processing)

Wed Aug 9 19:06:28 UTC 2023

On Mon, Jul 31, 2023 at 16:26 Tristan Cacqueray wrote:
> On Mon, Jul 31, 2023 at 11:05 David Christiansen via ghc-devs wrote:
>> Dear GHC devs,
>>
>> I think that having automated security advisory warnings from build tools
>> is important for Haskell adoption in certain industries. This can be done
>> based on build plans, but a package is really the wrong granularity - a
>> large, widely-used package might export a little-used definition that is
>> the subject of an advisory, and it would be good to warn only the users of
>> said definition (cf base and readFloat).
>>
>> Tristan is exploring using HIE files to do this check, but I don't know if
>> you read Discourse, where he posted the question:
>> https://discourse.haskell.org/t/rfc-using-hie-files-to-list-external-declarations-for-cabal-audit/7147
>>
>
> Thank you David for bringing this up here. One thing to note is that we
> would need hie files for ghc libraries, as proposed in:
>   https://gitlab.haskell.org/ghc/ghc/-/merge_requests/1337
>
> Cheers,
> -Tristan

Dear GHC devs,

To recap, the goal of this project is to check whether a given declaration
is used by a package. For example, I would like to check whether a
definition such as "package:Module.name" is reachable from another module.

In this post I list the considered options, and raise some questions
about using the simplified core from .hi files. 

I would appreciate it if you could have a look and help me figure out the
remaining blockers. Note that I'm not very familiar with the GHC
internals and how to properly read Core expressions, so any feedback
would be appreciated.

# Context and Problem Statement

We would like to check if a package is affected by a known
vulnerability. Instead of looking at the names and versions of the build
dependencies, we would like to search for individual functions. This is
particularly important to avoid false alarms when a given vulnerability
only appears in a rarely used declaration of a popular package.

Therefore, we need a way to search the whole call graph to assert with
confidence that a given declaration is not used (i.e. not reachable).
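
A minimal sketch of such a reachability check, assuming the call graph is
already available as a map from each declaration to the declarations it
references. The `CallGraph` type, the function names and the example
entries under `app:` and `lib:` are hypothetical and are not taken from
cabal-audit:

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

-- | A declaration name in the "package:Module.name" form.
type Decl = String

-- | Call graph: each declaration maps to the declarations it references.
type CallGraph = Map.Map Decl [Decl]

-- | All declarations reachable from the given roots (roots included).
reachableFrom :: CallGraph -> [Decl] -> Set.Set Decl
reachableFrom graph = go Set.empty
  where
    go seen [] = seen
    go seen (d : rest)
      | d `Set.member` seen = go seen rest
      | otherwise =
          go (Set.insert d seen) (Map.findWithDefault [] d graph ++ rest)

-- | Is the target used, directly or transitively, by any of the roots?
isReachable :: CallGraph -> [Decl] -> Decl -> Bool
isReachable graph roots target =
  target `Set.member` reachableFrom graph roots

-- | Hypothetical graph, for illustration only.
exampleGraph :: CallGraph
exampleGraph =
  Map.fromList
    [ ("app:Main.main", ["lib:Lib.safe"])
    , ("lib:Lib.safe", ["base:GHC.Base.const"])
    , ("lib:Lib.vulnerable", ["base:GHC.Base.const"])
    ]
```

With this graph, `isReachable exampleGraph ["app:Main.main"] "lib:Lib.vulnerable"`
evaluates to `False`, so an advisory on `lib:Lib.vulnerable` would not
need to be reported for `app`.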

# Considered Options

To obtain the call graph data, the following options are considered:

* .hie files produced when using the `-fwrite-ide-info` flag.
* .modpack files produced by the [wpc-plugin][grin].
* custom GHC plugin.
* .hi files containing the simplified core when using the
  `-fwrite-if-simplified-core` flag. 

# Pros and Cons of the Options

### Hie files

This option is similar to what [weeder][weeder] already implements.
However, this file format is designed for IDEs, and it may not be suitable
for our problem. For example, RULES, deriving, RebindableSyntax and
Template Haskell are not well captured.

[weeder]: https://github.com/ocharles/weeder/

### Modpack

This option appears to work, but it seems like overkill. I don't think we
need to reach for the STG representation.

[grin]: https://github.com/grin-compiler/ghc-whole-program-compiler-project

### Custom GHC plugin

This option enables extra metadata to be collected, but if using the
simplified core is enough, then it is just an extra step compared to
using .hi files.

### Hi files

Using .hi files is the only option that doesn't require extra
compilation artifacts: the necessary files are already part of the
packages.

To collect hie files or files generated by a GHC plugin, ghc/cabal/stack
all need some extra work:

- ghc libraries don't ship hie files ([issue#16901](https://gitlab.haskell.org/ghc/ghc/-/issues/16901)).
- cabal needs recent changes for hie files ([PR#9019](https://github.com/haskell/cabal/pull/9019)) and plugin artifacts ([PR#8662](https://github.com/haskell/cabal/pull/8662)).
- stack doesn't seem to install hie files for global libraries.

Moreover, creating artifacts with a plugin for ghc libraries may
require manual steps because these libraries are not built by the
end user.

Therefore, using .hi files is the most straightforward solution.

# Questions

In this section I present the current implementation of
[cabal-audit](https://github.com/TristanCacqueray/cabal-audit/).

## Collecting dependencies from core

In the [cabal-audit-core:CabalAudit.Core](https://github.com/TristanCacqueray/cabal-audit/blob/main/cabal-audit-core/src/CabalAudit/Core.hs)
module I implemented the logic to extract the call graph from Core
expressions into a list of declarations, each identified as
  `UnitId:ModuleName.OccName`, along with their dependencies.

Here is an example output for the [cabal-audit-test:CabalAudit.Test.User](https://github.com/TristanCacqueray/cabal-audit/blob/main/cabal-audit-test/src/CabalAudit/Test/User.hs) module:

```ShellSession
$ cabal run -O0 --write-ghc-environment=always cabal-audit-hi -- CabalAudit.Test.User
cabal-audit-test:CabalAudit.Test.Inline.fonctionInlined: base:GHC.Num.$fNumInt, base:GHC.Num.-, ghc-prim:GHC.Types.I#
cabal-audit-test:CabalAudit.Test.Instance.$fTestClassTea: cabal-audit-test:CabalAudit.Test.Instance.$ctasty1
cabal-audit-test:CabalAudit.Test.Instance.$fTestClassCofee: cabal-audit-test:CabalAudit.Test.Instance.$ctasty
cabal-audit-test:CabalAudit.Test.Instance.$ctasty: ghc-prim:GHC.Classes.&&, ghc-prim:GHC.Types.True
cabal-audit-test:CabalAudit.Test.Instance.$ctasty1: base:GHC.Base.., cabal-audit-test:CabalAudit.Test.Instance.alwaysTrue, ghc-prim:GHC.Classes.not
cabal-audit-test:CabalAudit.Test.Instance.alwaysTrue: base:GHC.Base.const, ghc-prim:GHC.Types.True
cabal-audit-test:CabalAudit.Test.User.monDoubleDecr: base:GHC.Num.$fNumInt, base:GHC.Num.-, cabal-audit-test:CabalAudit.Test.Inline.fonctionInlined, ghc-prim:GHC.Types.I#
cabal-audit-test:CabalAudit.Test.User.useAlwaysTrue: cabal-audit-test:CabalAudit.Test.Instance.Tea, cabal-audit-test:CabalAudit.Test.Instance.$fTestClassTea
cabal-audit-test:CabalAudit.Test.User.useCofeeInstance: cabal-audit-test:CabalAudit.Test.Instance.Cofee, cabal-audit-test:CabalAudit.Test.Instance.$fTestClassCofee
```

This appears correct, in particular:

- Type class instances are uniquely identified (that was not working well when using a custom plugin).
- Inlinable declarations are not inlined in the simplified core when built with `-O0`.
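
For post-processing, each line of the output above can be read back as a
pair of a declaration and its dependencies. The following is only a
sketch: it assumes that `": "` separates the declaration from its
dependencies and that `", "` separates the dependencies, which holds for
the lines shown here but is not a documented format. `splitOn` and
`parseLine` are hypothetical helpers:

```haskell
import Data.List (isPrefixOf)

-- | A declaration name in the "package:Module.name" form.
type Decl = String

-- | Split a string on every occurrence of a non-empty separator.
splitOn :: String -> String -> [String]
splitOn sep = go ""
  where
    go acc [] = [reverse acc]
    go acc s@(c : cs)
      | sep `isPrefixOf` s = reverse acc : go "" (drop (length sep) s)
      | otherwise = go (c : acc) cs

-- | Parse one "declaration: dep1, dep2" line.
-- ASSUMPTION: names never contain ": " or ", ".
parseLine :: String -> Maybe (Decl, [Decl])
parseLine line = case splitOn ": " line of
  [decl, deps]
    | not (null decl) ->
        Just (decl, filter (not . null) (splitOn ", " deps))
  _ -> Nothing
```

Splitting on `": "` rather than on `':'` keeps the `package:` prefix
intact, and operator names such as `base:GHC.Base..` or
`ghc-prim:GHC.Classes.&&` pass through unchanged.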

However, this collects extra definitions that are not part of the
source file. I understand that '$fTestClassTea' means the 'TestClass'
instance of 'Tea'. But it seems like the actual implementation is behind
the extra '$ctasty' declaration. Moreover, when analyzing the other test
modules, I see many declarations named 'lvlXX', which I guess are local
names that have been floated out.

This is not ideal because the resulting graph contains extra edges that
are not relevant for the end user. I tried to tidy this using
'isExportedId' and 'idDetails' from 'GHC.Types.Var' but I worry that
this is not a good strategy. So my question is: how can I recover the
original declaration context of Core expressions, so that the resulting
dependency graph only contains edges that are part of the source
declarations? I assume this can be done by dissolving the declarations
starting with '$' or 'lvl', but it would be good to know how to do that
reliably.
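
To make the "dissolving" idea concrete, here is a sketch that works purely
on names. It is a heuristic and not a reliable answer to the question
above: `isSynthetic` assumes that compiler-generated bindings are exactly
those whose name is `$` followed by a letter, or `lvl` followed by digits,
and all the function names are hypothetical:

```haskell
import Data.Char (isAlpha, isAlphaNum, isDigit, isUpper)
import Data.List (nub)
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

type Decl = String
type CallGraph = Map.Map Decl [Decl]

-- | Drop the "package:" prefix and the module qualifier,
-- e.g. "$ctasty" for "pkg:A.B.$ctasty" and "." for "base:GHC.Base..".
occName :: Decl -> String
occName = dropModules . drop 1 . dropWhile (/= ':')
  where
    dropModules s = case break (== '.') s of
      (c : cs, '.' : rest@(_ : _))
        | isUpper c && all isModChar cs -> dropModules rest
      _ -> s
    isModChar ch = isAlphaNum ch || ch == '_' || ch == '\''

-- | HEURISTIC (assumption): "$f...", "$c...", "$w..." and "lvl12" are
-- treated as generated; operators such as "$" or "$!" are not.
isSynthetic :: Decl -> Bool
isSynthetic decl = case occName decl of
  '$' : c : _ -> isAlpha c
  'l' : 'v' : 'l' : rest -> all isDigit rest
  _ -> False

-- | Remove synthetic declarations, attaching their dependencies to
-- whichever source-level declaration referenced them.
dissolve :: (Decl -> Bool) -> CallGraph -> CallGraph
dissolve synthetic graph =
  Map.fromList
    [ (decl, expand Set.empty deps)
    | (decl, deps) <- Map.toList graph
    , not (synthetic decl)
    ]
  where
    -- The visited set guards against cycles between synthetic bindings.
    expand visited deps = nub (concatMap step deps)
      where
        step d
          | not (synthetic d) = [d]
          | d `Set.member` visited = []
          | otherwise =
              expand (Set.insert d visited) (Map.findWithDefault [] d graph)
```

Applied to the output above, `useAlwaysTrue` would then depend directly on
`Tea`, `base:GHC.Base..`, `alwaysTrue` and `ghc-prim:GHC.Classes.not`. The
drawback is that the edge to the instance itself ('$fTestClassTea') is
lost, which matters if advisories need to target instances.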

## Handling inlined declarations

When compiling with `-O1`, declarations seem to be inlined in the
simplified core. In that case, is it possible to recover the original
inlined OccName?

If not, I guess we have to use a GHC plugin.
I investigated this strategy in [cabal-audit-plugin:CabalAudit.Plugin](https://github.com/TristanCacqueray/cabal-audit/blob/main/cabal-audit-plugin/src/CabalAudit/Plugin.hs).
However, I am not sure this is done correctly and I could use some
guidance on how to proceed.

## Loading hidden modules

If I understand correctly, accessing the ModIface mi_extra_decls to get
the simplified core requires an HscEnv. 
In the [cabal-audit-hi:GhcExtras](https://github.com/TristanCacqueray/cabal-audit/blob/main/cabal-audit-hi/src/GhcExtras.hs)
module, I put together the following helpers using GHC as a library:

```haskell
-- | Setup a Ghc session using the packages found in the local environment file
runGhcWithEnv :: Ghc a -> IO a

-- | Lookup a module and extract the simplified core.
getCoreBind :: ModuleName -> Maybe FastString -> Ghc (Maybe (Module, [CoreBind]))
```

However, this doesn't work for hidden modules: trying to load them with
'GHC.lookupModule' fails with this error:

```ShellSession
    Could not load module `GHC.Event.Thread'
    it is a hidden module in the package `base-4.18.0.0'
```

I tried to reset the hsc_env.hsc_dflags.hiddenModules but without luck.
Is there a trick to access the ModIface of hidden modules?

## Including simplified core in .hi files by default

In the cabal-audit flake, I am using a nix override to set the
`-fwrite-if-simplified-core` ghc-options by default and to patch the ghc
build phase to use the `+hi_core` Hadrian flavour transformer.

To avoid rebuilding the dependencies, it would be great to have the
simplified core in the hi file by default.
Is there an issue or a downside when enabling the flag by default?
Could the libraries shipped with GHC contain the simplified core in the
future?

## Declaration identifications

In the [cabal-audit-command:CabalAudit.Command](https://github.com/TristanCacqueray/cabal-audit/blob/main/cabal-audit-command/src/CabalAudit/Command.hs)
module, I implemented a proof of concept reverse lookup to find
reachable declarations. For example using this command:

```ShellSession
$ cabal-audit-hi --target GHC.Exception.throw CabalAudit.Test.Simple
base:GHC.Exception.throw
|
`- base:GHC.IO.Handle.Internals.ioe_finalizedHandle
   |
   `- base:GHC.IO.Handle.FD.$wstdHandleFinalizer
      |
      `- base:GHC.IO.Handle.FD.stdout
         |
         +- base:System.IO.putStrLn1
         |  |
         |  `- base:System.IO.putStrLn
         |     |
         |     `- cabal-audit-test:CabalAudit.Test.Simple.afficheNombre
         |
         `- base:System.IO.putStr1
            |
            `- base:System.IO.putStr
               |
               `- cabal-audit-test:CabalAudit.Test.Simple.maFonction
```
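
As an illustration of the reverse lookup (a sketch, not the code in
CabalAudit.Command), the edges of the call graph can be inverted and then
unfolded into a tree of users. `Data.Tree.drawTree` from the containers
package renders a tree in a layout similar to the one above. `CallGraph`,
`invert`, `usersTree` and `renderUsers` are hypothetical names:

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set
import Data.Tree (Tree (..), drawTree)

type Decl = String
type CallGraph = Map.Map Decl [Decl]

-- | Reverse every edge: map each declaration to its direct users.
invert :: CallGraph -> Map.Map Decl [Decl]
invert graph =
  Map.fromListWith
    (++)
    [ (dep, [user])
    | (user, deps) <- Map.toList graph
    , dep <- deps
    ]

-- | Tree of direct and transitive users of a target declaration.
-- A declaration already on the current path is not expanded again,
-- so recursive bindings cannot cause an infinite tree.
usersTree :: CallGraph -> Decl -> Tree Decl
usersTree graph = go Set.empty
  where
    users = invert graph
    go path decl =
      Node
        decl
        [ go path' user
        | user <- Map.findWithDefault [] decl users
        , user `Set.notMember` path'
        ]
      where
        path' = Set.insert decl path

-- | Render the users of a target as text.
renderUsers :: CallGraph -> Decl -> String
renderUsers graph = drawTree . usersTree graph
```

For example, `putStr (renderUsers graph "base:GHC.Exception.throw")` prints
the target at the root and its users below it. Note that a declaration
reachable through several paths appears once per path in the tree.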

In the event a vulnerability happens in a type class instance, how should
we identify the affected instance?
Instead of using 'package:Module.$fClassNameDataName', is there an
established format we could use (for example "Typeclass X instance of T")?

What about data types or type families, would it make sense to include
them in the graph? If so, how should we identify them in the advisory
database?

Please let me know if I missed something.
Thanks for your time!
-Tristan