[Haskell-cafe] vector-space and standard API for vectors

Sat Nov 6 04:06:49 EDT 2010

On 11/5/10 7:54 AM, Alexey Khudyakov wrote:
>
>> We already know that there are noncommutative modules/vectorspaces of
>> interest (e.g., modules over quaternions and modules over graph paths),
>> why not support them from the beginning? It seems like you're going out
>> of your way to exclude things that would be trivial to include. This is
>> exactly why this is my standard complaint against the various proposals
>> out there for new numeric hierarchies. People who are used to only using
>> R^n think the proposals are just fine, but none of the proposals capture
>> the structures I work with daily. Which means the new proposals are no
>> better than the Prelude for me.
>
> Could you tell what data structures do you use? It's difficult to think
> about them without concrete examples.

Data structures? That varies a lot depending on the task: Data.Map, 
Data.Set, Data.IntMap, Data.IntSet, Data.Trie, Data.ByteString...

A lot of my concrete examples of semirings and modules come from natural 
language processing tasks. One example I already mentioned is the 
semiring of a collection of paths over a graph (so something like 
Data.Set (Data.Seq Arc)). Path collections show up, for example, when 
dealing with Markov chains and HMMs where the goal is to maximize or sum 
the weights over all paths.

To make it clearer, a Markov chain is a probabilistic version of a 
finite state automaton, so you have some set of nodes, and the arcs for 
transitioning from one node to another have probabilistic weights on 
them. An HMM is an extension of a Markov chain into a probabilistic 
version of a Moore machine, so in addition to the probabilistic 
transitions from state to state, we also have for each state a 
probability distribution over emitted symbols.

An interesting problem for HMMs is this: given some observed sequence of 
emitted symbols, reconstruct the most likely path of states which would 
cause the symbol sequence to be emitted. A nearly identical problem is: 
given some observed sequence of symbols, determine the total probability 
of all state sequences which could have generated it.

In addition to the perspective of HMMs as probabilistic Moore machines, 
there are two other perspectives which are helpful. One is the graphical 
model perspective where we have something that looks a bit like this (in 
fixed-width font):

     Q0 -> Q1 -> Q2 -> ... -> Qn
           |     |            |
           v     v            v
           S1    S2           Sn

Each Sk and Qk are random variables. The random variables Qk represent 
being in some particular state q at time k, and the choice of which 
state is drawn from a probability distribution based on the state 
Q(k-1). The random variables Sk represent emitting some particular 
symbol s at time k, and the choice of which symbol is drawn from a 
distribution based on the state Qk.

The third perspective, which is the most helpful one for solving our two 
problems, is if we take this graphical model and unfold it into a 
trellis graph (ignoring the Sk variables for now). Each node in the 
trellis represents an assignment of particular values to each of the 
random variables. So if Q1 could take on values qA, qB, and qC then we'd 
have three nodes for each of Q1=qA, Q1=qB, and Q1=qC. The arcs in the 
trellis are weighted with the probability of transitioning from one node 
to the next; so an arc Q1=q1 -> Q2=q2 has weight Pr(Q2=q2 | Q1=q1). A 
path through the trellis represents a variable assignment, which is to 
say a sequence of states in the Markov chain; and the weight of the path 
is the probability of the Markov chain taking that path.

There is a general algorithm for solving the two problems I mentioned, 
and ultimately they're the same algorithm except with different 
semirings. Note that a collection of paths between two points on a graph 
forms a semiring[1] where sum is the union of path collections and 
product is the extension of paths[2], so the answers we want can be 
gotten by semiring homomorphisms from the collection of paths to some 
other domain. To get the probability of all state sequences which could 
give rise to a given symbol sequence we can use the probability 
semiring[3] ---which we can simplify to the metric space [0..1] with (+) 
and (*), since the algorithm ensures that all events are disjoint. This 
version is called the "forward algorithm". To get the probability the 
most likely state sequence we can use the semiring [0..1] with max and 
(+), which is called the "Viterbi algorithm". In practice we tend to use 
the log version of these semirings in order to prevent underflow. 
There's also a variant of the Viterbi algorithm which stores 
backpointers to the most likely previous state, which makes it easier to 
recover the most likely state sequence instead of just the 
(log)probability of the sequence. The Viterbi algorithm with back 
pointers is also a semiring: Maybe(Prob, Maybe State) with Nothing as 
zero, Just(1,Nothing) as one, argmax as sum: mx<+>my = do { (px,_) <- mx 
; (py,_) <- my ; if px > py then mx else my }, and product: mx<*>my = do 
{ (px,x) <- mx ; (py,y) <- my ; Just (px*py, y `mplus` x)}.

The collection-of-paths semiring is noncommutative since extending the 
end of a path is different than extending the beginning. The Viterbi 
with backpointers semiring is noncommutative since we'll get different 
backpointers depending on the order of arguments to the product. If we 
extend our probabilities to use quantum probability theory then all of 
these probabilistic semirings become noncommutative because of order 
effects on quantum probabilities.

These HMM problems can also be thought of from the perspective of a 
grammar, where HMMs happen to be restricted to produce linear derivation 
"trees". Looking at other grammars like probabilistic CFGs allows us to 
extend the idea of generating sequences to the idea of generating trees. 
The forward-backward algorithm for sequences (of which the forward 
algorithm is half) generalizes to the inside-outside algorithm for 
trees. Ultimately, chart parsing algorithms like CKY are doing something 
very similar to the Viterbi algorithm in order to determine the most 
likely parse tree which would give rise to some observed sentence. So 
all these semirings have tree-based analogues as well. This use of 
semirings is just like the use of monoids in finger trees[4], except 
that we generalize the idea from a monoid on one tree into a semiring on 
a packed forest of trees. Naturally there are many such semirings which 
are noncommutative since adding a left-child to a tree is different than 
adding a right-child.

[1] Or actually a 2-semiring, much as a groupoid (2-group) is a 
generalization of groups and a category (2-monoid) is a generalization 
of a monoid.

[2] Note that the collection of paths semiring also has something like a 
module structure. The "scalars" are single arcs in the underlying graph. 
I only say that it's something like a module/vector space, because 
whether the family of arcs supports its own semiring/ring/field 
operations depends on the specific graph in question. If we assume 
certain closure properties on the family of arcs, similar to the 
transitive closure properties of composition in categories, then it is 
indeed a module. However, we do not have those closure properties for 
the specific example in question.

[3] That is the event space E with operations <+> and <*> with the empty 
event as zero and the certainty event as one. Colloquially <+> is the 
union of events and <*> is the intersection. For x,y in E their 
probabilities are defined by Pr(x<+>y) = Pr(x) + Pr(y) - Pr(x,y) and 
Pr(x<*>y) = Pr(x,y) where Pr(x,y) = Pr(x)*Pr(y) iff x and y are independent.

[4] http://apfelmus.nfshost.com/articles/monoid-fingertree.html

-- 
Live well,
~wren