[Haskell-cafe] A "less" complex median

Tue May 28 14:54:26 UTC 2019

> The code works, at least in the simple cases I've tried.
> It seems, however, way too complex and not very composable.
> So I was wondering if anyone could tell me where I went wrong, and how 
> I can make the code simpler and smaller.
>
> Of course, it is completely possible to divide my insertion function, 
> but the goal is to do all of the steps in one iteration over the list. 
> This allows me to later compare what I have with divided functions to 
> discover if the compiler would optimize them into one traversal.
>
> But even with its current logic, I believe the code can be better 
> written. I just don't know how :p
>
> So I really appreciate any advice on how to make it better (or fix a 
> bug I missed, or maybe even the logic of the whole algorithm).
>
Hi Michael,

first of all, let's see what makes your code look so complex. Part of 
that is not inherent complexity, it's a lack of documentation. And by 
that I don't mean a lack of comments, but rather that the code is not 
self-documenting. Names are too abstract and/or too short. For example 
the name "getTo" means almost nothing on its own. You could also use 
records and types for this kind of documentation. The function-arguments 
of "goThrough", "flt" and "cmp", may be extension points in the future 
(see below), but right now they make the code more complex than 
necessary as well. Then there's monadic syntax for what is essentially 
pure computation and a few other bits and bobs.

All this is not bad for some quick and dirty experiment, but changing 
that is a great first step for a clean-up.

So here's one proposal how you could clean up the core part of the code. 
Everything is (almost) just renamed and reordered parts of the original, 
with obvious environmental stuff left out as an exercise to the reader. 
I also use -XUnicodeSyntax, but that's just my personal preference.

     data MedianFilter a = MF
         { kernelSize    ∷ !KernelSize
         , srcPosition   ∷ !Position
         , currentKernel ∷ ![(Position, a)]
         }
       deriving ( Show )

     stepMedianFilterS ∷ Ord a ⇒ a → State (MedianFilter a) a
     stepMedianFilterS = state . stepMedianFilter

     stepMedianFilter ∷ Ord v ⇒ v → MedianFilter v → (Median 
v,MedianFilter v)
     stepMedianFilter _   MF{..} | kernelSize < 1 || srcPosition < 0
                                 = error "stepMedianFilter: bad argument"
     stepMedianFilter v s at MF{..} = (median , s{ srcPosition = 
srcPosition + 1, currentKernel = newKernel } )
       where
         (newKernel, median) = updateKernelAndGetMedian medianIdx 
currentKernel

         -- medianIdx = kernelSize `div` 2, unless we're at the start of 
the stream
         medianIdx           = min srcPosition kernelSize `div` 2
         isTooOld x          = not $ validAt kernelSize srcPosition x
         v'                  = (srcPosition, v)

              -- kernel is empty: create new kernel
         updateKernelAndGetMedian _                 []    = ([v'], v)
              -- clean up old values
         updateKernelAndGetMedian stepsToMedian   (x:xs)  | isTooOld x = 
updateKernelAndGetMedian stepsToMedian xs
         updateKernelAndGetMedian stepsToMedian k@(x:xs)  = case (v < 
snd x, stepsToMedian) of
              -- the new value is also the median
             (True,0) → (v':k , v)
              -- the value belongs here, but the median has not been 
found yet
             (True,_) → (v':k , snd $ nthOrLast (stepsToMedian-1) k)
              -- the median is here, but the value must still be inserted
             (_   ,0) → (x :ys, snd x)
              -- neither median nor position have been found yet
             (_   ,_) → (x :ys, y)
           where
             (ys, y) = updateKernelAndGetMedian (stepsToMedian-1) xs

Of course this does little in the way of making the code more composable 
or simpler from a purely logical perspective. Nesting the functions does 
help in reducing the number of arguments that have to be passed around 
though. And without getting here I would not have been able to think 
about the code in any meaningful way as well.

Now let's look at a few other properties of the code:

There are few avenues that lead to dead ends, I believe.

  * Can the core function "goThrough" (my "updateKernelAndGetMedian") be
    replaced by a primitive like a fold, a scan, or similar? I don't
    think so.
  * Can the kernel be represented with a set, a priority queue, or any
    other type of tree? Yes, but the cleanup of old values would be even
    harder. Not worth the effort, unless you plan on using huge kernels.

Bugs: I think I found one. The 
"goThrough"/"updateKernelAndGetMedian"-function does three things at 
once: inserting the new value into the sorted list, cleaning up old 
values, and searching for the current median. But if the place to insert 
is found before the median is found, you drop the cleaning. Which means 
if there is an old, invalid value between the position of the new value 
and the true median, you should not be computing the right median. 
(Unless it's directly behind the position of the new value.) One example 
is the sequence [3,6,5,4,2,1]. With a kernel size of 5, your algorithm 
computes a median of 3 for the 6th position (the 1), where it seems you 
would want a 4.

On to composability. I agree that there's tension between efficiency and 
extendability, mostly inside "goThrough"/"updateKernelAndGetMedian" In 
particular, I can think of three important ways you might want to reuse 
parts of this algorithm.

 1.

    You might want to have a generic "iterate over a stream with a
    kernel-based function"-function. This way you could implement things
    like a Gaussian blur or edge detection. To get there you would have
    to replace your "cmp" and "idx==0" (my "v < snd x" and
    "stepsToMedian") with more complex functions and to replace "getTo"
    (my "nthOrLast") with another recursive call or similar. (This part
    is probably also the way to correct the bug mentioned above.) So
    your "flt" and "cmp" where not a bad idea, but not quite there either.

 2.

    You might want to handle start and end of the stream in more
    different ways. You could "extend" the first/last value, mirror or
    extrapolate the bordering values, fill with a constant, append
    secondary streams, ignore border values, … Right now both ends are
    handled differently. Also, both parts are hard coded into
    "goThrough"/"updateKernelAndGetMedian" (and "ret"/"medianIdx").
    Again, making it more complex. So pulling this handling out might
    also make the code simpler. The best way I can see to solve this is
    to detect start and end outside this function and direct all three
    to different, exchangeable functions – which, in turn, will probably
    interact with a simplified version of
    "goThrough"/"updateKernelAndGetMedian" again. You could also assume
    that some other function prepends/appends meaningful extensions to
    both ends and also cleans up the ends afterwards. The same way you
    cut the end right now, but expect some outside user to clean up the
    start. But then you should default to squashing the kernel at both ends.

 3.

    You might want to create a pipeline of different filters for every
    point of the stream. This gets easier if you separate the state from
    its monad like I did, because now you can compose states before
    putting them inside that monad. It also helps write nicer-looking
    functions – at least in my opinion.

Lastly, most of these extensibility problems should have been solved in 
OpenCV, so you could search for inspiration there.

Just a few idea. Hope they help!

Cheers,
MarLinn

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20190528/49341c40/attachment.html>