[Haskell-cafe] How would you hack it?
Andrew Coppin
andrewcoppin at btinternet.com
Wed Jun 4 14:33:00 EDT 2008
So anyway, today I found myself wanting to build a program that
generates some test data. I managed to make it *work* without too much
difficulty, but it wasn't exactly "elegant". I'm curios to know how
higher-order minds would approach this problem. It's not an especially
"hard" problem, and I'm sure there are several good solutions possible.
I have a file that contains several thousand words, seperated by white
space. [I gather that on Unix there's a standard location for this
file?] I want to end up with a file that contains a randomly-chosen
selection of words. Actually, I'd like the end result to be a LaTeX
file, containing sections, subsections, paragraphs and sentences.
(Although obviously the actual sentences will be gibberish.) I'd like to
be able to select how big the file should be, to within a few dozen
characters anyway. Exact precision is not required.
How would you do this?
The approach I came up with is to slurp up the words like so:
raw <- readFile "words.txt"
let ws = words raw
let n = length ws
let wa = listArray (1,n) ws
(I actually used lazy ByteStrings of characters.) So now I have an array
of packed ByteStrings, and I can pick array indexes at random and use
"unwords" to build my gibberish "sentences".
The fun part is making a sentence come out the right size. There are two
obvious possibilities:
- Assume that all words are approximately N characters long, and
estimate how many words a sentence therefore needs to contain to have
the required length.
- Start with an empty list and *actually count* the characters as you
add each word. (You can prepend them rather than append them for extra
efficiency.)
I ended up taking the latter approach - at least at the sentence level.
What I actually did was to write a function that builds lists of words.
There is then another function that builds several sublists of
approximately the prescribed lengths, inserts some commas, capitalises
the first letter and appends a fullstop. This therefore generates a
"sentence". After that, there's a function that builds several sentences
of random size with random numbers of commas and makes a "paragraph" out
of them. Next a function gathers several paragraphs and inserts a
randomly-generated subsection heading. A similar function takes several
subsections and adds a random section heading.
In my current implementation, all of this is in the IO monad (so I can
pick things randomly). Also, the whole thing looks very repetative and
really if I thought about it harder, it ought to be possible to factor
it out into something smaller and more ellegant. The clause function
builds clauses of "approximately" the right size, and each function
above that (sentence, paragraph, subsection, section) becomes
progressively less accurate in its sizing. On the other hand, the final
generated text always for example has 4 subsections to each section, and
they always look visually the same size. I'd like to make it more
random, but all the code works by estimating "roughly" how big a
particular construct is, and therefore how many of then are required to
full N characters. For larger and larger N, actually counting this stuff
would seem somewhat inefficient, so I'm estimating. But that makes it
hard to add more randomness without losing overall size control.
The final file can end up being a fair way different in size to what I
requested, and has an annoyingly regular internal structure. But
essentially, it works.
It would be nice to modify the code to generate HTML - but since it's
coded so simple-mindedly, that would be a copy & paste job.
Clearly, what I *should* have done is think more about a good
abstraction before writing miles of code. ;-) So how would you guys do this?
More information about the Haskell-Cafe
mailing list