[Haskell-cafe] feasibility of implementing an awk interpreter.

Mon Aug 23 20:43:30 EDT 2010

On Aug 23, 2010, at 7:00 PM, Roel van Dijk wrote:

> On Mon, Aug 23, 2010 at 8:07 AM, Richard O'Keefe <ok at cs.otago.ac.nz> wrote:
>> But what _is_ "the core functionality".
>> The Single Unix Specification can be browsed on-line.
>> There is no part of it labelled "core"; it's all required
>> or it isn't AWK.  

[If -f progfile is specified, the application shall ensure that the
 files named by each of the progfile option-arguments are text files
 and their concatenation, in the same order as they appear in the
 arguments, is an awk program.
] is what I was referring to.

>> Is that "core"?  Who knows?
> 
> I say that that behaviour is not part of the language but of the runtime.

Actually, it's a *compile*-time thing.

> 
>> Whatever the "core functionality" might be, YOU will have to define
>> what that "core" is.  There's no standard, or even common, sublanguage.
> 
> One approach to find the core of a language is to find which parts can
> be implemented in terms of other parts. If part B can be expressed in
> terms of part A then B doesn't belong in the core.

Agreed.  But it's not clear that AWK *has* a non-trivial core in that
sense.  OK, so you can define != in terms of ==, and >, <=, >= in terms
of <, and you can define + and unary - in terms of infix -.  And you
can define (a,b,c,...) as (a SUBSEP b SUBSEP c SUBSEP ...).  But you
can't, for example, define
	print <number>
in terms of
	print (<number> "")
because printing a number and converting a number to a string use
different format variables (OFMT and CONVFMT respectively), and you can't
define the two of them in terms of sprintf() because there is no
way for an AWK program to _test_ whether a value is a number or a
string or an uninitialized value (which has defined properties) or
an uncommitted numeric string.
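
To make the OFMT/CONVFMT distinction concrete, here is a minimal Haskell
sketch of the two conversions as an interpreter might implement them.
The names Formats, printNumber and concatEmpty are hypothetical, and the
two formats are modelled as plain functions rather than printf-style
strings.  The sketch assumes that integral values are rendered as
integers in both cases and that only non-integral numbers go through
OFMT or CONVFMT.

```haskell
import Numeric (showFFloat)

-- Hypothetical interpreter settings.  Each format is a function here,
-- standing in for the printf-style string held in OFMT or CONVFMT.
data Formats = Formats
  { ofmt    :: Double -> String  -- used when print writes a number
  , convfmt :: Double -> String  -- used when a number becomes a string
  }

-- Just n when the number is exactly an integer, Nothing otherwise.
integral :: Double -> Maybe Integer
integral d
  | isInfinite d || isNaN d = Nothing
  | d == fromInteger n      = Just n
  | otherwise               = Nothing
  where n = truncate d

-- What `print x` writes for a numeric x.
printNumber :: Formats -> Double -> String
printNumber fs d = maybe (ofmt fs d) show (integral d)

-- What `print (x "")` writes: x is first converted to a string.
concatEmpty :: Formats -> Double -> String
concatEmpty fs d = maybe (convfmt fs d) show (integral d) ++ ""

-- Example settings standing in for OFMT="%.2f" and CONVFMT="%.4f".
example :: Formats
example = Formats (\d -> showFFloat (Just 2) d "")
                  (\d -> showFFloat (Just 4) d "")
```

With these example settings, printNumber example 3.14159 and
concatEmpty example 3.14159 give different strings, which is why the
one statement cannot be rewritten as the other.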

What you would have to do would be to define an *extended* 'core'
containing 
	case(E; U, x.I, x.F, x.UI, x.UF, x.S)
	U   - what to do for uninitialized value
	x.I - what to do for an integral value
	x.F - what to do for a non-integral number
	x.UI - what to do for an uncommitted maybe-integer-maybe-string
	x.UF - what to do for an uncommitted maybe-float-maybe-string
	x.S - what to do for a string
That is, the core you need contains operations that are NOT in the
source language.
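
In a Haskell interpreter, one way to sketch such an extended core is an
algebraic data type with one constructor per case, plus an eliminator
playing the role of case(E; ...).  The constructor names below and the
choice to keep the original text alongside the numeric reading of an
uncommitted value are assumptions made for illustration, not something
the standard prescribes.

```haskell
-- One constructor per case of the case(E; ...) construct.
data Value
  = Uninit                    -- U:    uninitialized value
  | IntV Integer              -- x.I:  integral number
  | FloatV Double             -- x.F:  non-integral number
  | MaybeInt Integer String   -- x.UI: numeric reading plus original text
  | MaybeFloat Double String  -- x.UF: numeric reading plus original text
  | StrV String               -- x.S:  string
  deriving (Eq, Show)

-- The eliminator: one continuation per case, in the same order.
caseValue
  :: r
  -> (Integer -> r)
  -> (Double -> r)
  -> (Integer -> String -> r)
  -> (Double -> String -> r)
  -> (String -> r)
  -> Value -> r
caseValue u i f ui uf s v = case v of
  Uninit         -> u
  IntV n         -> i n
  FloatV d       -> f d
  MaybeInt n t   -> ui n t
  MaybeFloat d t -> uf d t
  StrV t         -> s t

-- String conversion written against the eliminator.  The argument is a
-- stand-in for formatting with CONVFMT.
toText :: (Double -> String) -> Value -> String
toText convfmt = caseValue "" show convfmt (\_ t -> t) (\_ t -> t) id
```

Operations such as string conversion can then be defined on top of
caseValue, even though no AWK program can express the dispatch itself.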

Here's one of my favourite quotations from the Single Unix Specification
V3 description of AWK:

For example, with historical implementations the following program:
{
    a = "+2"
    b = 2
    if (NR % 2)
        c = a + b
    if (a == b)
        print "numeric comparison"
    else
        print "string comparison"
}
would perform a numeric comparison (and output numeric comparison)
for each odd-numbered line, but perform a string comparison (and
output string comparison) for each even-numbered line. IEEE Std
1003.1-2001 ensures that comparisons will be numeric if necessary.

I just tried four AWK implementations.
GNU AWK and Mike's AWK both wrote

string comparison
string comparison
string comparison
string comparison

as required by the standard.  But two others (one provided by a major
UNIX vendor, and the other provided by one of the inventors of AWK)
did indeed write

numeric comparison
string comparison
numeric comparison
string comparison

Now let's make an apparently tiny change to the program.
Let's replace
	a = "+2"
by
	a = ENVIRON["FOO"]
and do
	setenv FOO +2
in the shell.  Now all four implementations print
numeric comparison
four times.

Getting this right is not just a tiny tweak to the system,
it's a fundamental issue that affects the way you represent
AWK 'values' in your interpreter.
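
A minimal Haskell sketch of a value representation that reproduces the
two standard-conforming results above.  The type and function names are
hypothetical, and the comparison rule is a simplification: it assumes a
comparison is done as strings whenever either operand is a committed
string, and numerically otherwise.  Number-to-string conversion is also
simplified and ignores CONVFMT.

```haskell
data Value
  = Num Double            -- numeric constant or result of arithmetic
  | Str String            -- string constant or result of a string operation
  | StrNum Double String  -- numeric-looking input text (field, ENVIRON, ...)
  | Uninit                -- uninitialized value
  deriving (Eq, Show)

data Mode = NumericCmp | StringCmp
  deriving (Eq, Show)

isStr :: Value -> Bool
isStr (Str _) = True
isStr _       = False

-- Simplified rule: a committed string on either side forces a string
-- comparison; everything else is compared numerically.
compareMode :: Value -> Value -> Mode
compareMode a b
  | isStr a || isStr b = StringCmp
  | otherwise          = NumericCmp

toNum :: Value -> Double
toNum (Num d)      = d
toNum (StrNum d _) = d
toNum Uninit       = 0
toNum (Str _)      = 0  -- not reached from awkEq; real AWK parses a numeric prefix

-- Simplified number formatting (ignores CONVFMT).
numText :: Double -> String
numText d
  | isInfinite d || isNaN d = show d
  | d == fromInteger n      = show n
  | otherwise               = show d
  where n = truncate d :: Integer

toStr :: Value -> String
toStr (Num d)      = numText d
toStr (Str s)      = s
toStr (StrNum _ s) = s
toStr Uninit       = ""

-- The == operator, also reporting which kind of comparison was made.
awkEq :: Value -> Value -> (Mode, Bool)
awkEq a b = case compareMode a b of
  NumericCmp -> (NumericCmp, toNum a == toNum b)
  StringCmp  -> (StringCmp, toStr a == toStr b)
```

Here awkEq (Str "+2") (Num 2) is a string comparison, while
awkEq (StrNum 2 "+2") (Num 2) is a numeric one.  Because the tag is part
of the value and values are immutable, using a in c = a + b cannot
change how a later a == b is carried out.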

Then there are the undefined things.
Consider

BEGIN {
    echo = "echo"
    n = getline <echo
    print n | echo
    close(echo)
    ...
}

The third line opens an input stream reading from a file called
"echo".  The fourth line opens an output stream writing to a
pipe running the "echo" command.

What does the fifth line close?
