[Haskell-cafe] Re: String vs ByteString

Ketil Malde ketil at malde.org
Tue Aug 17 07:02:49 EDT 2010

Ivan Lazar Miljenovic <ivan.miljenovic at gmail.com> writes:

> Seeing as how the genome just uses 4 base "letters",   

Yes, the bulk of the data is not really "text" at all, but each sequence
(it's fragmented due to the molecular division into chromosomes, and
due to incompleteness) also has a textual header.  Generally, the Fasta
format looks like this:

  >sequence-id some arbitrary metadata blah blah
  ..lines and lines of letters...
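
As a rough illustration of that layout, here is a minimal parsing sketch.
It assumes strict ByteStrings and a whole file held in memory; the
FastaRecord type and the parseFasta name are hypothetical, not taken from
any existing library.

```haskell
import qualified Data.ByteString.Char8 as B

-- Hypothetical record type, for illustration only.
data FastaRecord = FastaRecord
  { recHeader :: B.ByteString  -- header line without the leading '>'
  , recSeq    :: B.ByteString  -- sequence lines joined together
  } deriving (Eq, Show)

-- Split the input into records: each starts at a line beginning with '>'
-- and runs until the next such line. Anything before the first header
-- is ignored.
parseFasta :: B.ByteString -> [FastaRecord]
parseFasta = go . dropWhile (not . isHeader) . B.lines
  where
    isHeader = B.isPrefixOf (B.pack ">")
    go []         = []
    go (h : rest) =
      let (body, more) = break isHeader rest
      in  FastaRecord (B.drop 1 h) (B.concat body) : go more
```

Real genome files are large, so a lazy or streaming variant would probably
be preferable in practice; the strict version just keeps the sketch short.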

(As an aside, although there are only four nucleotides (ACGT), there are
occasional wildcard characters, the most common being N for aNy
nucleotide, but there are defined wildcards for all subsets of the alphabet.)
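
To make the wildcard idea concrete, the following sketch matches a pattern
containing IUPAC codes against a plain sequence of the same length.  The
function names iupac and matches are made up for this example, and the
table covers DNA only.

```haskell
import qualified Data.ByteString.Char8 as B
import Data.Char (toUpper)

-- Nucleotides that each IUPAC code stands for.
-- Unknown characters map to the empty set, so they never match.
iupac :: Char -> String
iupac c = case toUpper c of
  'A' -> "A"
  'C' -> "C"
  'G' -> "G"
  'T' -> "T"
  'R' -> "AG"
  'Y' -> "CT"
  'S' -> "CG"
  'W' -> "AT"
  'K' -> "GT"
  'M' -> "AC"
  'B' -> "CGT"
  'D' -> "AGT"
  'H' -> "ACT"
  'V' -> "ACG"
  'N' -> "ACGT"
  _   -> ""

-- True when every base in the sequence is allowed by the pattern's code
-- at the same position. Both inputs must have the same length.
matches :: B.ByteString -> B.ByteString -> Bool
matches pat s =
  B.length pat == B.length s
    && and [ toUpper base `elem` iupac code | (code, base) <- B.zip pat s ]
```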

> wouldn't it be better to not treat it as text but use something else?

I generally use ByteStrings, with the .Char8 interface if/when
appropriate.  This is actually a pretty good choice; even if people use
Unicode in the headers, I don't particularly want to care - as long as
it is transparent.  In some cases, I'd like to, say, search headers for
some specific string - in these cases, a nice, tidy, rich, and optimized
Data.ByteString(.Lazy).UTF8 would be nice.  (But obviously not terribly
essential at the moment, since I haven't bothered to test the available
options.)  I guess for my stuff, the (human consumable) text bits are
neither very performance intensive, nor large, so I could probably, and
fairly cheaply, wrap relevant operations or fields with
Data.Text.Encoding's {de,en}codeUtf8.  And in practice - partly due to
lacking software support, I'm sure - it's all ASCII anyway. :-)

It'd be nice to have efficient substring searches and regular
expressions, etc. for the sequence data, but often this will be better
addressed by more specific algorithms, and in any case, a .Char8
implementation is likely to be more efficient than any gratuitous
Unicode encoding.
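
For comparison, a plain substring search over the byte representation can
be sketched with breakSubstring from the bytestring package.  The
occurrences function is a made-up name for this example, and this is a
generic search rather than one of the more specific algorithms mentioned
above.

```haskell
import qualified Data.ByteString.Char8 as B

-- All 0-based offsets at which the pattern occurs in the sequence,
-- including overlapping occurrences. An empty pattern yields no results.
occurrences :: B.ByteString -> B.ByteString -> [Int]
occurrences pat s
  | B.null pat = []
  | otherwise  = go 0 s
  where
    go off rest =
      let (pre, match) = B.breakSubstring pat rest
      in  if B.null match
            then []
            else
              let pos = off + B.length pre
              in  pos : go (pos + 1) (B.drop 1 match)
```

For instance, searching for "ACA" in "ACACAGACA" gives the offsets 0, 2
and 6.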

> (in case someone is trying to do their mad genetic manipulation by
> hand)?

You'd be surprised what a determined biologist can achieve, armed only
with Word, Excel, and a reckless disregard for surmountability.

If I haven't seen further, it is by standing in the footprints of giants
