[Haskell-cafe] Re: String vs ByteString

Tue Aug 17 08:52:27 EDT 2010

Hello, Ketil Malde!

On Tue, Aug 17, 2010 at 8:02 AM, Ketil Malde <ketil at malde.org> wrote:
> Ivan Lazar Miljenovic <ivan.miljenovic at gmail.com> writes:
>
>> Seeing as how the genome just uses 4 base "letters",
>
> Yes, the bulk of the data is not really "text" at all, but each sequence
> (it's fragmented due to the molecular division into chromosomes, and
> due to incompleteness) also has a textual header.  Generally, the Fasta
> format looks like this:
>
>  >sequence-id some arbitrary metadata blah blah
>  ACGATATACGCGCATGCGAT...
>  ..lines and lines of letters...
>
> (As an aside, although there are only four nucleotides (ACGT), there are
> occasional wildcard characters, the most common being N for aNy
> nucleotide, but there are defined wildcards for all subsets of the alphabet.)

As someone who knows and uses your bio package, I'm almost
certain that Text really isn't the right data type for
representing everything.  Certainly *not* for the genomic data
itself.  In fact, an encoding using 4 bits per base (4
nucleotides plus 12 other characters, such as gaps and aNy) is
easy to implement using ByteStrings with two bases per byte and
should halve the space requirements.
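
To make the two-bases-per-byte idea concrete, here is a minimal
sketch.  Everything in it is an assumption of mine, not something
from the bio package: the names (Packed, packSeq, baseAt,
unpackSeq) are hypothetical, the code assignment (one bit each
for A, C, G, T, so that every wildcard is the union of the bases
it covers and a gap is the empty set) is just one possible
choice, unknown characters are mapped to N, and I use strict
ByteStrings to keep it short.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (toUpper)
import Data.Word (Word8)

-- Symbols indexed by their 4-bit code (assumed assignment):
-- bit 0 = A, bit 1 = C, bit 2 = G, bit 3 = T.
-- '-' (gap) is the empty set and N is all four bases.
alphabet :: String
alphabet = "-ACMGRSVTWYHKDBN"

-- Map a character to its 4-bit code; anything unknown becomes N (15).
encodeBase :: Char -> Word8
encodeBase c =
  maybe 15 fromIntegral (lookup (toUpper c) (zip alphabet [0 :: Int ..]))

-- Map the low 4 bits of a byte back to a symbol.
decodeBase :: Word8 -> Char
decodeBase w = alphabet !! fromIntegral (w .&. 0x0F)

-- The length is stored separately because an odd-length sequence
-- leaves the low nibble of the last byte unused.
data Packed = Packed
  { packedLength :: !Int
  , packedBytes  :: !B.ByteString
  }

-- Pack two bases per byte: even positions in the high nibble,
-- odd positions in the low nibble.
packSeq :: B.ByteString -> Packed
packSeq s = Packed (B.length s) (B.pack (go (map encodeBase (BC.unpack s))))
  where
    go (hi : lo : rest) = ((hi `shiftL` 4) .|. lo) : go rest
    go [hi]             = [hi `shiftL` 4]
    go []               = []

-- Random access costs an extra shift and mask per lookup.
-- No bounds check: the caller must keep i within packedLength.
baseAt :: Packed -> Int -> Char
baseAt (Packed _ bs) i =
  let byte = B.index bs (i `div` 2)
  in decodeBase (if even i then byte `shiftR` 4 else byte)

-- Expand back to one character per byte (case information is lost).
unpackSeq :: Packed -> B.ByteString
unpackSeq p = BC.pack [baseAt p i | i <- [0 .. packedLength p - 1]]
```

This goes through intermediate lists, so it is only meant to show
the layout; a real version would build the ByteString directly
and would have to decide what to do about lower-case (masked)
bases, which this sketch folds to upper case.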

However, the header of each sequence is text, in the sense of
human language text, and ideally should be represented using
Text.  In other words, the sequence data type[1] currently is
defined as:

  type SeqData = Data.ByteString.Lazy.ByteString
  type QualData = Data.ByteString.Lazy.ByteString
  data Sequence t = Seq !SeqData !SeqData !(Maybe QualData)

[1] http://hackage.haskell.org/packages/archive/bio/0.4.6/doc/html/Bio-Sequence-SeqData.html#t:Sequence

where the meaning is that in 'Seq header seqdata qualdata',
'header' would be something like "sequence-id some arbitrary
metadata blah blah" and 'seqdata' would be "ACGATATACGCGCATGCGAT".

But perhaps we should really have:

  type SeqData = Data.ByteString.Lazy.ByteString
  type QualData = Data.ByteString.Lazy.ByteString
  type HeaderData = Data.Text.Text -- strict is probably a good choice here
  data Sequence t = Seq !HeaderData !SeqData !(Maybe QualData)

Semantically, this is the right choice, putting Text where there
is text.  We can read everything with ByteStrings and then use[2]

  decodeUtf8 :: ByteString -> Text

[2] http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-Encoding.html#v:decodeUtf8

only for the header bits.  There is only one problem with this
approach: UTF-8 would be hardcoded as the encoding of the input
FASTA file.  Considering that probably nobody will be using
UTF-16 or UTF-32 for the whole FASTA file, there remains only
UTF-8 (of which ASCII is just a special case) and other 8-bit
encodings (such as ISO8859-1, Shift-JIS, etc.).  I haven't seen a
FASTA file with characters outside the ASCII range yet, so I
guess the choice of UTF-8 shouldn't be a big problem.
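
As a rough sketch of "read everything as ByteString, decode only
the header", something like the following should work.  The
function name parseRecord is hypothetical, it handles a single
record only, and it assumes reasonably recent text and bytestring
packages (for decodeUtf8With, lenientDecode and toStrict).  I use
the lenient decoder instead of plain decodeUtf8 so that a header
in some other 8-bit encoding doesn't throw an exception.

```haskell
import qualified Data.ByteString.Lazy.Char8 as BL
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)

type SeqData    = BL.ByteString
type HeaderData = T.Text

-- Split one FASTA record (">header\nSEQ\nSEQ...") into a Text
-- header and the raw sequence bytes.  Only the header line is
-- decoded; the sequence lines are concatenated untouched.
-- Returns Nothing if the record does not start with '>'.
parseRecord :: BL.ByteString -> Maybe (HeaderData, SeqData)
parseRecord record =
  case BL.lines record of
    (h : body)
      | Just ('>', rest) <- BL.uncons h ->
          Just ( decodeUtf8With lenientDecode (BL.toStrict rest)
               , BL.concat body )
    _ -> Nothing
```

With lenientDecode, bytes that are not valid UTF-8 are replaced
by U+FFFD instead of raising an error, so ASCII and UTF-8 headers
come through unchanged and anything else is at least readable.
Carriage returns from DOS-style line endings are not stripped
here.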

>> wouldn't it be better to not treat it as text but use something else?
>
> I generally use ByteStrings, with the .Char8 interface if/when
> appropriate.  This is actually a pretty good choice; even if people use
> Unicode in the headers, I don't particularly want to care - as long as
> it is transparent.  In some cases, I'd like to, say, search headers for
> some specific string - in these cases, a nice, tidy, rich, and optimized
> Data.ByteString(.Lazy).UTF8 would be nice.  (But obviously not terribly
> essential at the moment, since I haven't bothered to test the available
> options.  I guess for my stuff, the (human consumable) text bits are
> neither very performance intensive, nor large, so I could probably and
> fairly cheaply wrap relevant operations or fields with Data.Text's
> {de,en}codeUtf8.  And in practice - partly due to lacking software
> support, I'm sure - it's all ASCII anyway. :-)

Oh, so I didn't read this paragraph closely enough :).  In this
e-mail I'm basically agreeing with your thoughts here =).

And what do you think about creating a real SeqData data type
with two bases per byte?  In terms of processing speed I guess
there will be a small penalty, but if you need to have large
quantities of base pairs in memory this would double your
capacity =).

Cheers,

--
Felipe.