Naive question on lists of duplicates
Stecher, Jack
jack.stecher@retek.com
Sat, 7 Jun 2003 20:24:41 -0500
Thanks so much for the reply.
On Thu, Jun 06, 2003 at 08:06 PM, Dylan Thurston wrote:
> Was there actually a problem with the efficiency of your first code?
No -- it was untested, and as I developed more of the code in Haskell, I
found that there were slight differences between what I thought we
needed to do and what my co-worker had in mind. So, even if my
co-worker goes back to writing things in SAS, this has already been
helpful for clarifying requirements. In particular, it turns out that
he needs to know when promotions overlap for the same SKU, as this is an
indication that the data we've received are invalid.
I tested the mergeSort to make sure I didn't have any errors, though I
used the fairly standard one. It was much faster than a sort in SAS, at
least with a PROC SQL, though I ran into size limitations because I had
to compile on the Windows side. (We don't have gcc installed on our
UNIX side, so ghci will work but ghc won't compile my code.) Since I'm
analyzing roughly 600MB of data, I will probably add a step or two in
order to chunk the data. In any case, I'm not noticing any obvious
inefficiencies, but my Haskell programming is self-taught, and this is
my first attempt at solving a real problem at work using Haskell, so I
figured I'd get feedback from the list.
> (Unless this is really a one-shot deal, I suspect using Ints for dates
> is a bad decision...)
When I first read this message, I thought, "Well, this really is only a
one-shot deal." However, as I thought more about this, I decided that I
should write this so that we can use it in case another client comes
along with similar data for us to analyze on another project. So I will
change this and define a date type, or else I'll read the ghc
documentation and learn how to use what's in the time library.
> import List(sortBy, insertBy)
>
> data PromotionRec =3D PR {sku :: String, store :: String, startDate =
::
Int, endDate :: Int, amount::Float}
As it turns out, the same promotions are always on across all stores.
Thus, each SKU/startDate/endDate will seem to have a duplicate
promotion. Also, I don't need the amount information, so I wrote the
import function so that it would ignore the amount information. One
further change: the SKU and store are identified by their number, not
their descriptions.
The new data type is like this:
data PromoRec =3D PR {startDate, endDate, skuID, storeID :: Int}
deriving Show
though I will likely implement a better show function and will, as noted
above, change the dates to a date type.
The file is a fixed-width file, so I just do this:
loadData :: Handle -> IO [String]
loadData handle =3D do s <- hGetContents handle
return (lines s)
toPromoRecs :: [String] -> [PromoRec]
toPromoRecs [] =3D []
toPromoRecs (x:xs) =3D (PR (read (take 8 x)) (read (take 8 (drop
8 x)))
(read (take 20 (drop 16 x)))
(read (take 20 (drop 36 x)))) :
toPromoRecs xs
There's probably a much more clever way to do this, but it seems fairly
fast. The file allowed 20 spaces for the SKU and store numbers, but
they're at most 7 digits long. Again, I'll need to change the first and
second reads to some function that converts a string in the form
yyyymmdd into a date type (or I'll need to write one).
> compareStart, compareEnd :: PromotionRec -> PromotionRec -> Ordering
> compareStart x y =3D compare (startDate x) (startDate y)
> compareEnd x y =3D compare (endDate x) (endDate y)
This is helpful; I'll also need to add a function like this:
compareSku :: PromoRec -> PromoRec -> Ordering
compareSku x y =3D compare (skuID x) (skuID y)
> overlap :: [PromoRec] -> [[PromoRec]]
> overlap l =3D filter (lambda l. length l > 1)=20
> (overlap' [] (sortBy compareStart l))
>
> overlap' _ [] =3D []
> overlap' active (x:xs) =3D
> let {active' =3D dropWhile (lambda y. endDate y < startDate x) =
active}
in
> (x:active') : overlap' (insertBy compareEnd x active') xs
Okay; I see what additional modifications I should make. I need to add
a step sort by store and delete duplicate stores where the SKU, start
date, and end date are the same. But I should be able to get there from
here.
> The key is that, by keeping a list of the currently active promotions
> in order sorted by the ending date, we only need to discared an
> initial portion of the list.
Good insight. Many thanks.
> You could get a moderately more efficient implementation by keeping
> the active list as a heap rather than a list.
I had thought about that, and took the BinomialHeap.hs file from
Okasaki, but I must have a typo somewhere, because I was having typing
clashes that I couldn't easily clarify. At least, when I loaded the
BinomialHeap.hs into Hugs, it didn't complain, but when I tried to
create an empty heap using the heapEmpty function, Hugs screamed at me.
I got scared and fled the scene, retreating into the safety of lists.
Thanks again,
Jack