Naive question on lists of duplicates

Sarah Thompson sarah@telergy.com
Thu, 05 Jun 2003 14:29:39 +0100


>I'm pretty confident that this will be more efficient than my
>colleague's SAS code, as he was comparing each record to every other
>record (giving n(n-1) comparisons). It seems like this, in the worst
>case where everything is on promotion at distinct times, will compare
>the first record to (n-1) records, the second to (n-2) records, etc.,
>giving n(n-1)/2 comparisons. Thus, while this is worst-case O(n^2), it
>seems like it should have at most half as much work to do as the
>earlier approach in SAS. On the other hand, in the best case, when
>everything is on promotion at the same time, there should be something
>like n-1 comparisons made and then that should be it. So it seems
>like, depending on how frequently promotions co-occur for this
>retailer, the above should be somewhere between O(n) and O(n^2) time
>complexity. Since these people are always having some sale or other, I
>suspect this won't be much worse than O(n log n).
>
>Is there anything that is an obvious improvement in efficiency -- some
>clever tricks with the fold functions, some way of loading the data
>into some sort of convenient structure first, etc. -- that I could
>easily implement?
>
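
For reference, the all-pairs approach described above amounts to
something like the following rough Haskell sketch (the (start, end)
day-range representation is just an assumption for illustration):

    import Data.List (tails)

    -- All unordered pairs of a list: n(n-1)/2 of them.
    allPairs :: [a] -> [(a, a)]
    allPairs xs = [ (x, y) | (x:ys) <- tails xs, y <- ys ]

    -- Two inclusive (start, end) day ranges overlap when each starts
    -- no later than the other ends.
    overlaps :: (Int, Int) -> (Int, Int) -> Bool
    overlaps (s1, e1) (s2, e2) = s1 <= e2 && s2 <= e1

    -- Naive quadratic search for overlapping promotions.
    naiveOverlaps :: [(Int, Int)] -> [((Int, Int), (Int, Int))]
    naiveOverlaps = filter (uncurry overlaps) . allPairs
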
Firstly, I'm assuming that you are working with a granularity of days,
and that each promotion will always have a relatively small maximum
number of days. If so, how about something like the following (a rough
Haskell sketch of the whole pipeline appears after the steps):

1: Filter the existing data structure, resulting in a lazy list of
tuples (a, b) where a is a day number and b is an identifier for the
promotion. Where a promotion spans n days, the list will contain n
entries, one for each day of the promotion. Complexity is O(M x N),
where M is the (small) maximum number of days. If M is fixed *and*
small, we can discard it and regard the complexity as just O(N).

2: Sort the list with a as the primary key and b as the secondary key.
Complexity should be near enough O(N log N).

3: Traverse the results of the sort, outputting a lazy list of lists
such that the elements of each sub-list are references to the
promotions that overlap for one specific day number. Where no overlap
is detected for one specific day, that day can simply be ignored.
Complexity should be linear.

4: Sort this list, and discard duplicates. Complexity should be
O(N log N) for the sort and O(N) for the 'uniq'.

5: You are now left with a list which describes all overlapping
promotions.

Total complexity should effectively be O(N + N log N + N + N log N +
N), which of course just collapses to O(N log N).
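
In Haskell, the whole pipeline might look something like the sketch
below. The Promo type and its field names are assumptions made for
illustration; the input could equally be your existing structure:

    import Data.List (sort, group, groupBy)
    import Data.Function (on)

    -- A promotion: an identifier plus an inclusive range of day
    -- numbers. (Representation assumed for illustration.)
    data Promo = Promo { promoId :: Int, firstDay, lastDay :: Int }

    -- Step 1: one (day, id) tuple per day of each promotion.
    expand :: [Promo] -> [(Int, Int)]
    expand ps =
        [ (d, promoId p) | p <- ps, d <- [firstDay p .. lastDay p] ]

    -- Steps 2 and 3: sort by day (then id), group by day, and keep
    -- only the days on which more than one promotion is running.
    overlapsPerDay :: [(Int, Int)] -> [[Int]]
    overlapsPerDay =
        filter ((> 1) . length)
      . map (map snd)
      . groupBy ((==) `on` fst)
      . sort

    -- Steps 4 and 5: sort the groups and discard adjacent duplicates
    -- (the 'uniq'), leaving one entry per distinct overlapping set.
    uniq :: Eq a => [a] -> [a]
    uniq = map head . group

    overlappingPromos :: [Promo] -> [[Int]]
    overlappingPromos = uniq . sort . overlapsPerDay . expand

For example, overlappingPromos [Promo 1 1 5, Promo 2 3 8, Promo 3 10 12]
should evaluate to [[1,2]], since promotions 1 and 2 overlap on days 3
to 5 and promotion 3 overlaps with nothing.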

--
      ----------------------------------------------
     / __            + / Sarah Thompson      **** /
    / (_  _  _ _ |_   /                        * /
   /  __)(_|| (_|| ) / sarah@telergy.com      * /
  / +               / http://findatlantis.com/ /
 ----------------------------------------------