<div dir="ltr">Thanks Mark, this is very helpful. <div><br><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, May 24, 2018 at 6:36 PM, Mark Wotton <span dir="ltr"><<a href="mailto:mwotton@gmail.com" target="_blank">mwotton@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">there are a few things you can do.<br>

<br>

I ran it locally and got 3s cpu time and 3gb of allocation, more or less.<br>

<br>

first off, your subset implementation computes the subsets twice.<br>

taking that off took it down to 2s.<br>

<br>

After that it's mostly in the Set implementation. Using Map a ()<br>

instead gives you access to Data.Map.Merge.Strict.merge, which is a<br>

bit more efficient (going over the two data structures once only)<br>

<br>

```<br>

module Main where<br>

<br>

import Data.List<br>

import Data.Map.Strict (Map)<br>

import qualified Data.Map.Merge.Strict as Map<br>

import qualified Data.Map.Strict as Map<br>

<br>

nearest_k :: (Ord a) => Int -> [(Map a (), v)] -> Map a () -> [(Map a (), v)]<br>

<span class="">nearest_k k bs b = take k bs' where<br>

    bs' = sortOn (hamming b) bs<br>

<br>

</span>hamming :: (Ord a)=> Map a () -> (Map a (), v) -> Int<br>

<span class="">hamming x (y, _) = hamming_distance x y<br>

<br>

</span>hamming_distance :: (Ord a)=> Map a () -> Map a () -> Int<br>

hamming_distance xs ys = Map.size (Map.merge Map.preserveMissing<br>

Map.preserveMissing (Map.zipWithMaybeMatched (\_ _ _ -> Nothing)) xs<br>

ys)<br>

<span class=""><br>

subsets :: [a] -> [[a]]<br>

subsets []  = [[]]<br>

</span>subsets (x:xs) = let rs = subsets xs in rs  ++ map (x:) rs<br>

<span class=""><br>

int_lists :: [[Int]]<br>

int_lists = subsets [1..20]<br>

<br>

</span>values :: [(Map Int (), Int)]<br>

<span class="">values = map f (zip [1..] int_lists) where<br>

</span>    f (i, x) = (Map.fromList(zip  x (repeat ())), i)<br>

<br>

test = nearest_k 8 values (Map.fromList (zip [1,2,3] (repeat ())))<br>

<br>

<br>

main = print test<br>

```<br>

<br>

that took it to 1s, and now profiling indicates more than half the<br>

time is spent in generation of the test values, so I'll leave it<br>

there.<br>

<br>

I think if you wanted to do better than this you'd have to do some<br>

algorithmic changes - for instance, once your worst candidate is n<br>

steps away, you can stop calculating the hamming distance for anything<br>

else once it's > n, as it can't contribute usefully to the nearest<br>

neighbours.<br>

<br>

cheers,<br>

Mark<br>

<div class="HOEnZb"><div class="h5"><br>

On Thu, May 24, 2018 at 12:15 PM, Richard Evans<br>

<<a href="mailto:richardprideauxevans@gmail.com">richardprideauxevans@gmail.<wbr>com</a>> wrote:<br>

> Dear Haskell Cafe,<br>

><br>

> Given a set of sets, and a particular target set, I want to find the sets<br>

> that are nearest (in terms of Hamming distance) to the target set.<br>

><br>

> I am using the following code:<br>

><br>

> import Data.List<br>

> import qualified Data.Set as Set<br>

><br>

> nearest_k :: Ord a => Int -> [(Set.Set a, v)] -> Set.Set a -> [(Set.Set a,<br>

> v)]<br>

> nearest_k k bs b = take k bs' where<br>

>     bs' = sortOn (hamming b) bs<br>

><br>

> hamming :: Ord a => Set.Set a -> (Set.Set a, v) -> Int<br>

> hamming x (y, _) = hamming_distance x y<br>

><br>

> hamming_distance :: Ord a => Set.Set a -> Set.Set a -> Int<br>

> hamming_distance xs ys = Set.size (Set.difference xs ys) + Set.size<br>

> (Set.difference ys xs)<br>

><br>

><br>

><br>

> subsets :: [a] -> [[a]]<br>

> subsets []  = [[]]<br>

> subsets (x:xs) = subsets xs ++ map (x:) (subsets xs)<br>

><br>

> int_lists :: [[Int]]<br>

> int_lists = subsets [1..20]<br>

><br>

> values :: [(Set.Set Int, Int)]<br>

> values = map f (zip [1..] int_lists) where<br>

>     f (i, x) = (Set.fromList x, i)<br>

><br>

> test = nearest_k 8 values (Set.fromList [1,2,3])<br>

><br>

> ----<br>

><br>

> This works ok for the test above (with sets of ints), but is rather slow in<br>

> my actual application (in which the sets are large sets of ground atoms of<br>

> first-order logic). Is there some major optimization I should be doing here?<br>

><br>

> thanks,<br>

> Richard<br>

><br>

</div></div><div class="HOEnZb"><div class="h5">> ______________________________<wbr>_________________<br>

> Haskell-Cafe mailing list<br>

> To (un)subscribe, modify options or view archives go to:<br>

> <a href="http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe" rel="noreferrer" target="_blank">http://mail.haskell.org/cgi-<wbr>bin/mailman/listinfo/haskell-<wbr>cafe</a><br>

> Only members subscribed via the mailman list are allowed to post.<br>

<br>

<br>

<br>

</div></div><span class="HOEnZb"><font color="#888888">-- <br>

A UNIX signature isn't a return address, it's the ASCII equivalent of a<br>

black velvet clown painting. It's a rectangle of carets surrounding a<br>

quote from a literary giant of weeniedom like Heinlein or Dr. Who.<br>

        -- Chris Maeda<br>

</font></span></blockquote></div><br></div>