[Haskell-cafe] How would you hack it?

Wed Jun 4 14:33:00 EDT 2008

So anyway, today I found myself wanting to build a program that 
generates some test data. I managed to make it *work* without too much 
difficulty, but it wasn't exactly "elegant". I'm curios to know how 
higher-order minds would approach this problem. It's not an especially 
"hard" problem, and I'm sure there are several good solutions possible.

I have a file that contains several thousand words, seperated by white 
space. [I gather that on Unix there's a standard location for this 
file?] I want to end up with a file that contains a randomly-chosen 
selection of words. Actually, I'd like the end result to be a LaTeX 
file, containing sections, subsections, paragraphs and sentences. 
(Although obviously the actual sentences will be gibberish.) I'd like to 
be able to select how big the file should be, to within a few dozen 
characters anyway. Exact precision is not required.

How would you do this?

The approach I came up with is to slurp up the words like so:

  raw <- readFile "words.txt"
  let ws = words raw
  let n = length ws
  let wa = listArray (1,n) ws

(I actually used lazy ByteStrings of characters.) So now I have an array 
of packed ByteStrings, and I can pick array indexes at random and use 
"unwords" to build my gibberish "sentences".

The fun part is making a sentence come out the right size. There are two 
obvious possibilities:
- Assume that all words are approximately N characters long, and 
estimate how many words a sentence therefore needs to contain to have 
the required length.
- Start with an empty list and *actually count* the characters as you 
add each word. (You can prepend them rather than append them for extra 
efficiency.)
I ended up taking the latter approach - at least at the sentence level.

What I actually did was to write a function that builds lists of words. 
There is then another function that builds several sublists of 
approximately the prescribed lengths, inserts some commas, capitalises 
the first letter and appends a fullstop. This therefore generates a 
"sentence". After that, there's a function that builds several sentences 
of random size with random numbers of commas and makes a "paragraph" out 
of them. Next a function gathers several paragraphs and inserts a 
randomly-generated subsection heading. A similar function takes several 
subsections and adds a random section heading.

In my current implementation, all of this is in the IO monad (so I can 
pick things randomly). Also, the whole thing looks very repetative and 
really if I thought about it harder, it ought to be possible to factor 
it out into something smaller and more ellegant. The clause function 
builds clauses of "approximately" the right size, and each function 
above that (sentence, paragraph, subsection, section) becomes 
progressively less accurate in its sizing. On the other hand, the final 
generated text always for example has 4 subsections to each section, and 
they always look visually the same size. I'd like to make it more 
random, but all the code works by estimating "roughly" how big a 
particular construct is, and therefore how many of then are required to 
full N characters. For larger and larger N, actually counting this stuff 
would seem somewhat inefficient, so I'm estimating. But that makes it 
hard to add more randomness without losing overall size control.

The final file can end up being a fair way different in size to what I 
requested, and has an annoyingly regular internal structure. But 
essentially, it works.

It would be nice to modify the code to generate HTML - but since it's 
coded so simple-mindedly, that would be a copy & paste job.

Clearly, what I *should* have done is think more about a good 
abstraction before writing miles of code. ;-) So how would you guys do this?