[Haskell-cafe] Question on Lexing Haskell syntax

Tom Smeding x at tomsmeding.com
Thu Nov 2 20:50:01 UTC 2023


The fun (?) thing about C syntax is that you _cannot_ defer this. 
Consider the following (invalid) C program:

int main(void) {
   t * x;
   int t;
   t * x;
}

When I pass this through gcc, what I get is:

file.c: In function ‘main’:
file.c:2:3: error: unknown type name ‘t’
     2 |   t * x;
       |   ^
file.c:4:5: error: invalid operands to binary * (have ‘int’ and ‘int *’)
     4 |   t * x;
       |     ^

The first 't * x' statement was parsed as a declaration of the variable 
'x' with as type 't*'. The second such statement was parsed as a 
multiplication. The difference in behaviour is the declaration of 't' as 
a variable in between.

When starting this email I thought that the default was the other way 
round, i.e. 't * x' is parsed as a multiplication unless 't' is defined 
as a type; this would be accomplished by e.g. 'typedef int t;'. However 
it seems that the default, at least in gcc 13.2.1, is a variable 
declaration. Luckily (?), the point stands that to lex C, if you want to 
distinguish multiplication from the pointer type symbol, you need 
communication from the parser.

- Tom

On 01/11/2023 01:51, Oleg Grenrus wrote:
>
> In C, AFAIU you can (and probably should) defer `typedef` usage 
> recognition to a separate "renamer/ name resolution" pass. In Haskell 
> we are forced to do name resolution after parsing, as we don't need to 
> declare stuff before use. Even so, separate pass is usually a good 
> idea anyway, you are better equipped to produce good error messages. 
> In fact GHC does even more: it defers the unbound names reporting to 
> the type checking phase, so it can give the types to unbound 
> variables, like:
>
>     Prelude> x : "foo"
>     <interactive>:2:1: error: Variable not in scope: x :: Char
>
> - Oleg
>
> On 1.11.2023 2.32, Brandon Allbery wrote:
>> Feedback between lexer and parser isn't exactly unusual. Consider 
>> that parsing a C `typedef` generally needs to feed back to the lexer 
>> so uses will be recognized properly.
>>
>> On Wed, Nov 1, 2023 at 12:28 AM Oleg Grenrus <oleg.grenrus at iki.fi> wrote:
>>
>>     Yes, the "communication between lexer and parser" is exactly what
>>     GHC does.
>>
>>     Amelia has a nice post about it
>>     https://amelia.how/posts/parsing-layout.html which made it click
>>     it for me.
>>
>>     Note, you don't actually need to use alex and happy, you can do
>>     hand-written lexer and parsec (or alex and parsec, ...). The key
>>     insight is to have stateful lexer, and control it from the parser.
>>
>>     Amelia's post grammar is a bit too strict, e.g. GHC accepts real
>>     semis in virtual layout, and also empty "statements" in between,
>>     so we can write
>>
>>        \x y z -> case x of True -> y;;;;;; False -> z
>>
>>     but that's easy (at least in parsec) to adjust the parser grammar
>>     to accept those.
>>
>>     Or, you can *approximate* the parse-error rule with "alternative
>>     layout rule" [1], which can be implemented as a pass between
>>     lexing and parsing, or as a stateful lexer (but in this case
>>     parser won't need to adjust lexer's state). GHC has an
>>     undocumented AlternativeLayoutRule extension, so you can
>>     experiment with it to see what it accepts (look for tests in GHC
>>     source for examples). It handles let-in bindings well enough.
>>
>>     [1]
>>     https://www.mail-archive.com/haskell-prime@haskell.org/msg01938.html
>>     which can be imp
>>
>>     - Oleg
>>
>>     On 1.11.2023 0.31, Travis Athougies wrote:
>>>     According to the Haskell report [1] (See Note 5), a virtual `}`
>>>     token
>>>     is inserted if parsing the next token would cause a parse error
>>>     and the
>>>     indentation stack is non-empty.
>>>
>>>     I'm trying to lex and parse Haskell source and this sort of
>>>     interplay
>>>     (which requires two-way communication between lexer and parser)
>>>     makes
>>>     it very difficult to write a conformant implementation.
>>>
>>>     I can't change the standard (obviously), but I'm wondering if
>>>     this is
>>>     actually what GHC (de facto the only Haskell compiler) does, or
>>>     if it
>>>     applies some other rule. If so, does anyone know the exact
>>>     mechanism of
>>>     its implementation?
>>>
>>>     I've been programming Haskell for more than a decade, and while
>>>     I have
>>>     an intuitive understanding of the indentation rules, I would have
>>>     assumed the source could be lexed without also having a parser. In
>>>     particular, the note seems to imply that the main purpose of
>>>     this is to
>>>     properly lex `let`/`in` bindings. Perhaps there's an alternate
>>>     equivalent rule?
>>>
>>>     Curious to hear other's thoughts.
>>>
>>>     Travis
>>>
>>>     [1]
>>>     https://www.haskell.org/onlinereport/haskell2010/haskellch10.html#x17-17800010.3
>>     > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options
>>     or view archives go to: >
>>     http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >
>>     Only members subscribed via the mailman list are allowed to post.
>>     _______________________________________________
>>     Haskell-Cafe mailing list
>>     To (un)subscribe, modify options or view archives go to:
>>     http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>>     Only members subscribed via the mailman list are allowed to post.
>>
>>
>>
>> -- 
>> brandon s allbery kf8nh
>> allbery.b at gmail.com
>
> _______________________________________________
> Haskell-Cafe mailing list
> To (un)subscribe, modify options or view archives go to:
> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
> Only members subscribed via the mailman list are allowed to post.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20231102/8cd55861/attachment.html>


More information about the Haskell-Cafe mailing list