[Haskell-cafe] Detecting overlay-ed signal patterns

Mon May 30 15:09:58 CEST 2011

Hello,
I am trying to detect signal patterns in a system with numerous signals
overlaying each other in time. I am aware that there is a whole class of
problem solving methods for this task, yet trying my own algorithm, that may
be wrong of course :)
Next I use a simple example to describe the problem and my algorithm. Maybe
someone will find this problem interesting enough to comment. Thanks in
advance!

In the following example signals from varying sources are represented by
chars -  unique signal = unique char.
Examples of signal patterns (Pk) to detect:
P1 = abcd
P2 = 12345
Examples of a mixture of signals (pattern signals are underscored):
M1 = s s n a_b_c_d_ r t 6 1_ 2_ 3_ 4_ 5_ 8 7 k l t
M2 = g h a_ r 1_ w e 2_ j k b_ c_ l m p 3_ d_ 4_ e r 5_ v x
M3 = ...
MN = ...

In the above examples two patterns P1 and P2 are interleaved with each other
and other patterns. All mixtures are recorded for 24 hours for 12 month. One
mixture corresponds to one 24 period (day).
To find signal patterns in all these mixtures (M1, M2) my algorithm goes
through these steps:
1) For every day:
        For every signal 'Y' create a list of 'weight pairs' such as:
        (Y, X1, 1),  (Y, X2, 1), .. (Y, XN, 1)
         where X1, X2, ... XN are signals  'accessible from' (or following)
signal 'Y' during this day

So, for example, given the sequence: "abcde", chars in a left column of a
table can access a list of chars in the right column:
a | b c d e
b | c d
e

c | d
e

d |
e

2) For every signal sum all its 'weight pairs' for all days of the year:
W(Y) = Sum((Y, Xk, 1), j=1... 364)
For the above example, M1+ M2 will double the value of weight pairs in
patterns:
P1 = abcd
P2 = 12345

such as:
(a,b,2) (a,c,2) (a,d,2)
(b,c,2) (b,d,2)  (b,e,2)
(c,d,2)  (c,e,2)
(d,e,2)
and also:
(1,2,2) (1,3,2) (1,4,2), (1,5,2)
(2,3,2) (2,4,2) (2,5,2)
etc...
On the other hand, after summation M1+M2,  weight pairs for other char
sequences, that do not form  patterns of often repeated signals, will not
increase much. To demonstrate:  M_1 and M_2 are sequences with repeating
patterns being removed:
M_1 = s s n r t 6 8 7 k l t
M_2 = g h r w e  j k l m p e r  v x
Here most pair weights will be small, same as before addition:
(s,s,2) (g,h,1), (w,e, 1) ...

3) Compute weight of all possible chains in this year.
Each chain consists of chars that can be accessed from one to another. Chars
are examined, proceeding from the beginning of the sequence to its end, and
never in the opposite direction. Going always in one direction and being
bound with sequence end, makes chains finite.

Chains for 'abcde' example will be:
abcde, acde, ade, ae,
bcde, bde, be,
cde, cd, ce,
de

Weight of the chain is computed as a sum of its pairs:
W(Ci) = Sum(Pj, j=1,..R)
where R  is a number of pairs that this chain is built from.
For example for chain a 'acde' the weight will be:
W('acde') = W(a,c)+ W(a,d) + W(a,e)

4) Split chains in groups by its length

5) For each group select chains with maximum weight.
With pattern P1 = 'abcd' spread over some signal mixture, my algorithm will
find (I hope) the following 4 classes of chains with maximum weight:
1) abcde, acde, ade, ae,
2) bcde, bde, be,
3) cde, cd, ce,
4) de
In the same way, chains for pattern P2 = 12345 should be found.

Thus, grounding on the idea of 'signal precedence' found in repeating
pattern of several signals, and using pair 'weights' to mark signal chains -
patterns are detected.
Summation of pair weights allows to stress out, or highlight the repeating
patterns. Inserting other signals inside a 'pattern pair' does not obscure
it because in a long run (after many summations)  its weight will make such
pair stand out of  the crowd.

I have not yet fully developed this algorithm and have not tested it on big
enough training sets. I hope it will work on training sets exhibiting some
regularities - which in this case means repeating patterns. Of course it
will not work on randomly generated data, where all pairs have approximately
the same weight.

Working with pairs and chains will allow for natural parallelism in
processing huge sequences of events, I hope.  It would be nice to use some
analog of Hadoop map-reduce for this task. Any experience with Haskell
analog of Apache Hadoop cluster (http://hadoop.apache.org/mapreduce/)? Does
such thing exist in Haskell world at all?
I would be happy to get criticism of my algorithm and comments on
'route-cause analysis' algorithms in general, thanks!

Dmitri
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20110530/cff03295/attachment.htm>