[Haskell-cafe] Re: String vs ByteString

Richard O'Keefe ok at cs.otago.ac.nz
Tue Aug 17 23:28:28 EDT 2010


On Aug 17, 2010, at 11:51 PM, Ketil Malde wrote:

> Yitzchak Gale <gale at sefer.org> writes:
> 
>> I don't think the genome is typical text.
> 
> I think the typical *large* collection of text is text-encoded data, and
> not, for lack of a better word, literature.  Genomics data is just an
> example.

I have a collection of 100,000 patents I'm working with.
5.5GB of XML, most of it (US-)English text.
After stripping out the XML markup, it's 4GB of text.
It's a random sample from some 14 million patents I could
have access to, but 100,000 was more than enough.




