Naive question on lists of duplicates
Sarah Thompson
sarah@telergy.com
Thu, 05 Jun 2003 14:29:39 +0100
>I'm pretty confident that this will be more efficient than my colleague'=
s SAS code, as he was comparing each record to every other record (giving=
n (n-1) comparisons). It seems like this, in the worst case where every=
thing is on promotion at distinct times, will compare the first record to=
(n-1) records, the second to (n-2) records, etc., giving n (n-1)/2 compa=
risons. Thus, while this is worst-case O(n^2), it seems like it should h=
ave at most half as much work to do as the earlier approach in SAS. On t=
he other hand, in the best case, when everything is on promotion at the s=
ame time, there should be something like n-1 comparisons made and then th=
at should be it. So it seems like, depending on how frequently promotion=
s co-occur for this retailer, the above should be somewhere between O(n) =
and O(n^2) time complexity. Since these people are always having some sa=
le or other, I suspect this won't be much worse than O(n log n).
>=20
>Is there anything that is an obvious improvement in efficiency -- some c=
lever tricks with the fold functions, some way of loading the data into s=
ome sort of convenient structure first, etc -- that I could easily implem=
ent?
> =20
>
Firstly I'm assuming that you are working with a granularity of days, and=
that each promotion will always have a relatively small maximum number o=
f days. If so, how about something like the following:
1: Filter the existing data structure, resulting in a lazy list of tuples=
(a, b) where a is a day number and b is an identifier for the promotion.=
Where a promotion spans n days, the list will contain n entries, one for=
each day of the promotion. Complexity is O(M x N), where M is the (small=
) maximum number of days. If this is fixed *and* small, we can discard th=
is any regard the complexity as just O(N).
2: Sort the list with a as the primary key and b as the secondary key. Co=
mplexity should be near enough O(N log N)
3: Traverse the results of the sort, outputting a lazy list of lists such=
that the elements of each sub-list are references to the promotions that=
overlap for one specific day number. Where no overlap is detected for on=
e specific day, that day can simply be ignored. Complexity should be line=
ar.
4: Sort this list, and discard duplicates. Complexity should be O(N log N=
) for the sort and O(N) for the 'uniq'.
5: You are now left with a list which describes all overlapping promotion=
s.
Total complexity should effectively be O(N + N log N + N + N log N + N) w=
hich of course just collapses to O(N log N).
--=20
----------------------------------------------
/ __ + / Sarah Thompson **** /
/ (_ _ _ _ |_ / * /
/ __)(_|| (_|| ) / sarah@telergy.com * /
/ + / http://findatlantis.com/ /
----------------------------------------------