[Haskell-cafe] design question: decision tree from "Programming Collective Intelligence"

Mon Jun 21 14:22:01 EDT 2010

Hi,

I'm having a little trouble figuring out precisely how to port the decision tree code from the book "Programming Collective Intelligence." You can see the code here: http://code.google.com/p/memothing/source/browse/trunk/PCI/ch7/treepredict.py?r=29

The design issue is that this code depends on dynamic typing pretty heavily. He has a table of heterogenous data. Each column has the same type, but that's implicit, not explicit. He depends on all of the values supporting "==", which he uses to get a list of distinct values from each column. These values are used by the divide set function which takes a column and a value as parameters. Based on the type of the value, he chooses a different partitioning function; for strings, it's just equality again, but for numeric types he uses >=. The decision tree nodes then collect not just the left and right branches but also the column number and value on which the split was performed.

I have thought about this for several days and can't seem to come to a design that I like, so I'm wondering how others would approach the problem. I guess you could implement it as a list of lists of some data type that is either a string or numeric, but this feels a bit like a cop-out. How would you support creating a decision tree with different types of data? I imagine possibilities using existential quantification, SYB, Data.Data and other stuff, but I don't know how to make it work. I wonder if there is an obvious functional design hiding in here that doesn't depend on any fancy stuff, but I'm blinded to it by having read and understood the Python version of the code.

Anybody have any suggestions? I'd greatly appreciate any help.

Thanks,

— 
Daniel Lyons