[Haskell-cafe] Optimizing a title matcher

Bulat Ziganshin bulat.ziganshin at gmail.com
Thu Sep 28 12:28:48 EDT 2006


Hello Lyle,

Wednesday, September 27, 2006, 12:44:05 AM, you wrote:

> It's supposed to match movie titles from an imported database to a
> reference database.

> The import files each have 3,000 records, and the reference table has 
> 137,986 records.

> Building the hash tables out of the files is quick - it just takes a few
> seconds. But doing the matching of title_id in one table to title_id in
> the other, in a nested loop between both tables, takes way too long.

That's not unexpected :)  I solve a similar problem in my program:
when it updates an archive, it has to join two file lists while avoiding
all the duplicates, and it processes 250k-entry lists in a few seconds ;)

Basically it uses two techniques. First, use a Hash to filter out
repetitions:

     -- assumes: import qualified Data.HashTable as Hash
     --          import Control.Monad (filterM)
     --          import Data.Maybe (isNothing)
     table <- Hash.new eqFunc hashFunc
     let insert_no_dups value = do
           let key  =  keyFunc value
           Hash.insert table key value
     mapM_ insert_no_dups main_list

     let select_file arcfile = do
           diskfile <- Hash.lookup table (keyFunc arcfile)
           return (isNothing diskfile)   -- keep it only if it is NOT in the table
     result_list <- filterM select_file checked_list

This code removes from checked_list the items that are already present in main_list.
     
The second technique is merging of sorted lists:

merge (sortOn keyFunc main_list)
      (sortOn keyFunc checked_list)
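
Here `merge` is a helper from my own program, not a library function. Just
for illustration, a minimal sketch of such a merge could look like the
following (the name mergeDropDups is made up for this mail; it is not the
code from my archiver). It drops from the second list every element whose
key also occurs in the first list:

    import Data.List (sortOn)

    -- illustrative sketch: keep only those elements of the second list
    -- whose key does not occur in the first list; both lists are sorted first
    mergeDropDups :: Ord k => (a -> k) -> [a] -> [a] -> [a]
    mergeDropDups key mainList checkedList =
        go (sortOn key mainList) (sortOn key checkedList)
      where
        go _  []  = []
        go [] bs  = bs
        go as@(a:as') bs@(b:bs')
          | key a <  key b = go as' bs            -- a cannot match anything later
          | key a == key b = go as  bs'           -- b is a duplicate, drop it
          | otherwise      = b : go as bs'        -- b has no counterpart, keep it

mergeDropDups keyFunc main_list checked_list gives the same result set as the
Hash version above, only in sorted order rather than the original order.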


      
... I got so interested in this problem that I've started my own
program that does the trick. I attached the first part - the one that
manages the Map but uses an extremely simple heuristic to find the closest
match. I tried it on the program source itself, and it was reproduced
exactly :)  You can try it yourself - the 'database' file should contain
the right titles, one per line, while the 'new' file should contain the
titles to check. Now I will work on the edit-distance algorithm. I have an
idea of taking the real distance between characters on my keyboard into
account, so that, for example, "sprry" can be a mistake for "sorry", but
not for "serry".

For real performance you should replace String with ByteString and
probably Map with Hash, but not before the algorithm itself is
finished. Also, your approach of splitting each line into fields is
wasteful (unless you have plans for the other fields) - using just an
index into a global list of full lines (as I do) is enough.
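
That indexing idea could be sketched like this (again only a sketch with
made-up names, not the attached code): keep the full lines in an array and
store only an Int index in the Map, keyed by a normalized title.

    import qualified Data.Map as Map
    import Data.Array (Array, listArray, (!))
    import Data.Char (toLower)

    -- illustrative sketch: full lines live in an array; the Map holds
    -- only an index, so nothing is split into fields
    buildIndex :: [String] -> (Array Int String, Map.Map String Int)
    buildIndex titles = (arr, idx)
      where
        arr = listArray (0, length titles - 1) titles
        idx = Map.fromList [ (map toLower t, i) | (i, t) <- zip [0..] titles ]

    -- on a hit we get back the original full line
    findTitle :: (Array Int String, Map.Map String Int) -> String -> Maybe String
    findTitle (arr, idx) t = fmap (arr !) (Map.lookup (map toLower t) idx)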

It's also interesting to think about a sort-and-merge approach, but at
first glance it is not obvious how it can be applied here.




-- 
Best regards,
 Bulat                            mailto:Bulat.Ziganshin at gmail.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: titles.hs
Type: application/octet-stream
Size: 2594 bytes
Desc: not available
Url : http://www.haskell.org/pipermail/haskell-cafe/attachments/20060928/2549b170/titles.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: new
Type: application/octet-stream
Size: 2585 bytes
Desc: not available
Url : http://www.haskell.org/pipermail/haskell-cafe/attachments/20060928/2549b170/new.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: database
Type: application/octet-stream
Size: 2594 bytes
Desc: not available
Url : http://www.haskell.org/pipermail/haskell-cafe/attachments/20060928/2549b170/database.obj

