[Haskell-cafe] Re: Space Efficiency When Sorting a List of Many Lists

Heinrich Apfelmus apfelmus at quantentunnel.de
Sun Jan 24 08:07:59 EST 2010


Peter Green wrote:
>
>>> Another interesting problem is starting from a file of single
>>> wagers and trying to compress them (i.e. the inverse of
>>> 'explosion'). I believe this problem is analogous to Set Cover and
>>> therefore NP-complete. Heuristics are the order of the day here.
>>
>> Interesting indeed! What kind of heuristics do you employ there? It
>> seems quite difficult to me, somehow.
>
> Essentially wager generation in multi-leg races works like this:
> 
> (1) We have estimated win probabilities for the entrants in each of the
> race legs. Assume that I have a black box which produces these estimated
> probabilities.
> 
> (2) We generate the Cartesian product (CP) of the entrant numbers in
> the race fields: Leg 1 Entrants x Leg 2 Entrants x ..
> For each combination generated by the CP, we apply some formula to the
> probabilities associated with the combination's entrant numbers to
> decide whether or not that particular combination represents a viable
> wager.
> 
> (3) If a combination does represent a viable wager, we then need to
> ensure that we have sized its wager amount in a correct and optimal manner.
> 
> This gives output looking like:
> 
> 1, 1, 2, 3 - $1.10
> 1, 1, 2, 4 - $1.20
> ..
> ..
> 15, 15, 15, 12 - $0.35
> 
> Unfortunately these are hard to compress well - and that is ignoring
> the issue of what to do about differing wager sizes when two
> combinations could otherwise be merged together.
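
(To fix notation for what follows, here is how I picture steps (2) and
(3) in Haskell. It's only a minimal sketch: the probabilities, the
viability test and the sizing rule are invented placeholders, not your
black boxes.)

  type Entrant = Int
  type Prob    = Double

  -- hypothetical estimated win probabilities, one list per leg
  legs :: [[(Entrant, Prob)]]
  legs = [ [(1,0.3),(2,0.2),(3,0.5)]
         , [(1,0.6),(2,0.4)] ]

  -- step (2): cartesian product of the legs, via the list monad
  combinations :: [[(Entrant, Prob)]]
  combinations = sequence legs

  -- placeholder viability test: combined probability above a threshold
  viable :: [(Entrant, Prob)] -> Bool
  viable c = product (map snd c) >= 0.15

  -- placeholder for step (3): size the wager from the probability
  wagerSize :: [(Entrant, Prob)] -> Double
  wagerSize c = 10 * product (map snd c)

  wagers :: [([Entrant], Double)]
  wagers = [ (map fst c, wagerSize c) | c <- combinations, viable c ]
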
> 
> So one possible hack is to insert step 1(a) where we k-means cluster the
> win probabilities in each leg, such that we end up with multiple entrant
> numbers per cluster mean (which is a probability). We then proceed
> through steps 2 and 3, but instead of generating single combinations, we
> are now directly producing wagers which look like:
> 
> 1+6+12/3+4+7/13+14 - $1.20
> 
> In this case, k-means clustering will have clustered 1,6,12 together
> with cluster mean/probability of (say) 0.1, etc.
> 
> This is a bit of a hack, as clearly it will result in the wagers not
> covering exactly the same single combinations as if we had generated
> pure single wagers without k-means clustering. Hence my interest in
> doing set comparisons!
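
(Expanding such a compressed wager is just  sequence  in the list monad
again, so those set comparisons can be sketched directly. Only a
sketch, of course; 'singles' and 'clustered' are hypothetical names for
the two covers being compared.)

  import qualified Data.Set as Set

  type Combination = [Int]

  -- expand a compressed wager like 1+6+12/3+4+7/13+14 into the
  -- single combinations it covers (a cartesian product);
  -- e.g. expand [[1,6,12],[3,4,7],[13,14]] covers 3*3*2 = 18 singles
  expand :: [[Int]] -> [Combination]
  expand = sequence

  -- singles covered by the pure approach but missed after clustering,
  -- and vice versa (the spurious ones)
  missing, spurious :: Set.Set Combination -> Set.Set Combination
                    -> Set.Set Combination
  missing  singles clustered = singles   `Set.difference` clustered
  spurious singles clustered = clustered `Set.difference` singles
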
> 
> Another brute-force greedy approach is to start with single combinations
> and do something like this:
> [...]
> 
> Unfortunately this approach never seems to result in more than a 10x
> compression factor. We can do much better with the k-means approach - at
> the cost of missing some theoretically desirable combinations and
> betting on some spurious combinations.
> 
> As always, would be very interested to hear if you have any
> ideas/insights on first seeing this kind of problem. I'm almost
> certainly too fixed in my thinking due to long association with the
> wagering problem domain and probably cannot see the wood for the trees!

I think that you're on the right track with trying to compress the
output before actually generating it, i.e. compressing after step (1)
instead of after step (3).


My reasoning is that I don't see a good reason why, a priori, a *random*
list of single wager combinations could be compressed well into
*Cartesian products*. I mean, if you have data without apparent
structure, the best you can do is to pipe it through  gzip , which has
nothing to do with Cartesian products at all.

So, the fact that your "compression" works at all is a strong hint that
your data does have a lot of structure, which you saw in the graphviz
visualizations and which you are wisely exploiting at creation time with
step (1a).

(The greedy algorithm you mentioned is doomed to failure because it
doesn't respect symmetries. The optimal result is invariant under

* interchanging columns
* renaming all numbers in one column

but the greedy algorithm gives completely different results when you
apply these operations to the input.)
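
Spelled out as code, with  compressedSize  a stand-in for the output
size of any candidate compression algorithm (hypothetical, of course):

  type Wager = [Int]   -- one entrant number per column (leg)

  -- interchange two columns in every wager
  swapColumns :: Int -> Int -> [Wager] -> [Wager]
  swapColumns i j = map swap
    where
      swap w = [ w !! to k | k <- [0 .. length w - 1] ]
      to k | k == i    = j
           | k == j    = i
           | otherwise = k

  -- rename all numbers in one column
  renameColumn :: Int -> (Int -> Int) -> [Wager] -> [Wager]
  renameColumn i f = map (zipWith rename [0..])
    where rename k x = if k == i then f x else x

  -- for every injective f, a good algorithm should satisfy
  --   compressedSize (swapColumns i j ws)  == compressedSize ws
  --   compressedSize (renameColumn i f ws) == compressedSize ws
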


To achieve better compression, you have to open the black boxes in (1)
and (2). The good news is that you don't need to open them completely;
instead, you just want to know properties such as the following:

* Prob(entrant A wins in leg 1) = Prob(entrant A wins in leg 2)
* Prob(entrant A wins) is independent of the other participants
* if {A,B} is a viable wager for leg 1, so are {A} and {B}
  ("monotonicity")
* Prob(A, B, C) = Prob(A in leg 1) * Prob(B in leg 2)
                  * Prob(C in leg 3)
  ("independence")
* etc...

In other words, the idea is that the probability of a race outcome with
multiple legs is, say, the product of the probabilities of the single
legs, which then leads to these hypercube structures in your list of
wagers.

(I don't know which properties you need exactly because I don't
understand the details about the probabilities, choosing wagers and
pricing them.)
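
For illustration only: if viability happened to factor into independent
per-leg thresholds (an assumption I'm making up, not something from
your description), then the set of viable wagers would literally be a
Cartesian product, and the compressed wager could be written down
without ever enumerating singles:

  type Entrant = Int
  type Prob    = Double

  -- entrants in one leg that clear that leg's threshold
  viableLeg :: Prob -> [(Entrant, Prob)] -> [Entrant]
  viableLeg threshold leg = [ e | (e, p) <- leg, p >= threshold ]

  -- the compressed wager, e.g. 1+6+12/3+4+7/13+14: one list per leg
  compressedWager :: [Prob] -> [[(Entrant, Prob)]] -> [[Entrant]]
  compressedWager = zipWith viableLeg
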

Armed with such properties, there are systematic ways to exploit them
for efficient algorithms. See for example the case study

  De Moor and Gibbons.
  Bridging the algorithm gap: A linear-time functional program for
  paragraph formatting.
  http://tinyurl.com/ydt5n5z  (This is a  .ps.gz  file!)


Understanding this might also lead to a better representation of
compressed wagers, one that includes wager sizes. In a sense, you're
storing a recipe for creating a list of single wagers instead of the
list itself, and  [Row [a]]  is just one possible "recipe format".
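
For instance, one conceivable recipe format keeps the wager size
attached to the compressed structure; the following type is made up,
purely to illustrate:

  type Entrant = Int
  type Money   = Rational

  data Recipe
      = Single  [Entrant]   Money   -- e.g. 1,1,2,3 - $1.10
      | Product [[Entrant]] Money   -- e.g. 1+6+12/3+4+7/13+14 - $1.20
      | Union   [Recipe]            -- several recipes side by side

  -- recover the explicit list of single wagers from a recipe
  explode :: Recipe -> [([Entrant], Money)]
  explode (Single  es m) = [(es, m)]
  explode (Product ls m) = [ (c, m) | c <- sequence ls ]
  explode (Union   rs)   = concatMap explode rs
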


Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com


