GC-Managed ByteArray Slicing

Sat Sep 8 16:34:30 UTC 2018

I wouldn't recommend doing this to the existing ByteArray#, but for a
separate type it could be nice with some caveats.

1.) It's probably not that you want to copy out slices when there are no
references to the whole array but rather that you can keep the smallest
containing sub-interval of the Bytes# that contain all references. In your
story, if I slice to take its tail and then take that things tail. I'd
double my storage requirement after GC for large Bytes# when I drop the
reference to the original. This is a bit of an orthogonal strategy to the
one you propose, but the one you propose has problems in that I probably
should carefully memoize the results per slice somehow to maximize sharing
in the result, which complicates the implementation a lot. If on the other
hand, these slices are not allowed to nest then you don't have either
problem, but that sounds like a nightmare waiting to happen to debug.

2.) You'd probably have a hard time doing the move during GC because you
don't really get to know the space of all references in before you have to
move to to-space. This is further complicated by multiple generations even
if you conspire some way to procrastinate and delay the move til after you
move everything else and have found the largest sub-interval for the
current generation's references. One way to this would be to do something
like track a min/max bound per generation in the object itself of the
largest containing intervals that are wanted, and during a gc when you copy
the thing from from space to to space you copy over the interval from the
min bound from of all generations to the max bound from all generations,
then reset the current generations bounds to be something like
(maxPos,minPos) and as you find more references to sub-slices it'll grow by
continually mutating these secret mutable fields inside the object to tell
the next GC how much it can clean things up based solely on the contents
referenced from this generation.

-Edward

On Sat, Sep 8, 2018 at 6:35 AM Andrew Martin <andrew.thaddeus at gmail.com>
wrote:

> In GHC Haskell, there are two common options for working with bytes in
> memory:
>
>     data ByteArray = ByteArray ByteArray#
>     data ByteSlice = ByteSlice ByteArray# Int# Int#
>
> The second one, ByteSlice, is roughly what the bytestring library does
> (but with a ForeignPtr instead of a ByteArray# inside of it). What's
> unfortunate is that it's difficult to know which of these two types is
> going to provide better performance for a given application. These
> heuristics are helpful for guiding the process:
>
>    1. The longer the data is going to be around, the better the unsliced
>    variant will be. This point is about space efficiency. The sliced variant
>    needs two extra machine words to hold those indices, plus it holds on to
>    extra bytes from the array that it is slicing into.
>    2. The shorter the data is going to be around, the better the sliced
>    variant is. This point is about speed of execution. The cost of the
>    unsliced variant is typically a memcpy. If the data is getting GCed quickly
>    anyway, the unsliced copy doesn't save you any space.
>    3. The shorter the data is, the better the unsliced variant is.
>    4. The more sharing of the data there is, the better the sliced
>    variant is.
>
> But, users have to commit to which one they want to use in their data type
> definitions, and the data type may be used in different contexts where one
> or the other is preferable. For example, consider the following:
>
>     data Student = Student
>       { name :: {-# UNPACK #-} !BytesType
>       , grade :: {-# UNPACK #-} !Int
>       }
>
> Which of the two types should we use for BytesType? (For this example,
> ignore the incorrectness of using bytes to represent text. With either of
> the two bytes types, it's easy to put a newtype on top that communicates
> that what's inside is guaranteed to be UTF-8 encoded text). It depends on
> the usage. For example, consider that we are parsing students from this xml
> file:
>
>     <students>
>       <student>
>         <name>Erica Chang</name>
>         <grade>87</grade>
>       </student>
>       <student>
>         <name>Lizbeth Cruz</name>
>         <grade>91</grade>
>       </student>
>       ...
>     </students>
>
> Let's say we've got about 1GB of this, and we run a parser on 64KB chunks
> as they are read from disk. There are two contasting scenarios to consider.
>
> If the file is parsed at startup and the resulting array of students is
> kept in memory for a long time afterward, then the unsliced variant is
> better. It allows us to throw away all of the bytes corresponding to the
> xml nodes, which are no longer relevant. The sliced variant, by constrast,
> would keep the entire contents of the original file in memory.
>
> However, if the file is parsed by a SAX-style streaming parser, the
> students being parsed, processed, and then discarded as they encountered,
> the sliced variant is better. Slicing wins here because the chunks from the
> file are discarded shortly after they are encountered. Read "shortly" as
> "before the next minor GC".
>
> In this example, Student was an application-specific data type, so the
> application author could figure out what they were actually doing with the
> data and define the data type accordingly. However, this isn't always a
> solution. What if they needed to use the same type differently in different
> contexts? What if the type was provided by a library and the library author
> doesn't know how it will be used?
>
> So far, this has just been an explanation of a problem that I've noticed.
> I understand the problem well, but now I'm going to propose a solution.
> There are a lot of holes in my understanding of GHC's runtime, so I don't
> understand if what I'm proposing is actually plausible.
>
> What if the GHC's runtime handled the copy decisions for us? What if we
> had these primitives:
>
>     data Bytes#
>     sliceBytes# :: Bytes# -> Int# -> Int# -> Bytes#
>     lengthBytes# :: Bytes# -> Int#
>
> Slicing causes the runtime to track that there is an additional reference
> to the byte array but that the offset and length are different. I'm not
> sure where this information would be stored though. When GC runs, it would
> copy slices from bytes if the there were no longer any references to the
> entire array. So a single type could serve the both purposes discussed in
> the scenario above. I'm not sure what kind of impact this might have on GC
> time.
>
> --
> -Andrew Thaddeus Martin
> _______________________________________________
> Libraries mailing list
> Libraries at haskell.org
> http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/libraries/attachments/20180908/7414591f/attachment.html>