[Haskell-cafe] Re: String vs ByteString
Richard O'Keefe
ok at cs.otago.ac.nz
Tue Aug 17 23:28:28 EDT 2010
On Aug 17, 2010, at 11:51 PM, Ketil Malde wrote:
> Yitzchak Gale <gale at sefer.org> writes:
>
>> I don't think the genome is typical text.
>
> I think the typical *large* collection of text is text-encoded data, and
> not, for lack of a better word, literature. Genomics data is just an
> example.
I have a collection of 100,000 patents I'm working with.
5.5GB of XML, most of it (US-)English text.
After stripping out the XML markup, it's 4GB of text.
It's a random sample from some 14 million patents I could
have access to, but 100,000 was more than enough.