
Scott.Daniels at Acm
Jul 6, 2009, 6:59 AM
Post #24 of 27
(1060 views)
Permalink
|
|
Re: finding most common elements between thousands of multiple arrays.
[In reply to]
|
|
Peter Otten wrote: > Scott David Daniels wrote: > >> Scott David Daniels wrote: > >> t = timeit.Timer('sum(part[:-1]==part[1:])', >> 'from __main__ import part') > > What happens if you calculate the sum in numpy? Try > > t = timeit.Timer('(part[:-1]==part[1:]).sum()', > 'from __main__ import part') Good idea, I hadn't thought of adding numpy bools. (part[:-1]==part[1:]).sum() is only a slight improvement over len(part[part[:-1]==part[1:]]) when there are few elements, but it is almost twice as fast when there are a lot (reflecting the work of allocating and copying). >>> import numpy >>> import timeit >>> original = numpy.random.normal(0, 100, (1000, 1000)).astype(int) >>> data = original.flatten() >>> data.sort() >>> t = timeit.Timer('sum(part[:-1]==part[1:])', 'from __main__ import part') >>> u = timeit.Timer('len(part[part[:-1]==part[1:]])', 'from __main__ import part') >>> v = timeit.Timer('(part[:-1]==part[1:]).sum()', 'from __main__ import part') >>> part = data[::100] >>> (part[:-1]==part[1:]).sum() 9390 >>> t.repeat(3, 10) [0.56368281443587875, 0.55615057220961717, 0.55465764503594528] >>> u.repeat(3, 1000) [0.89576580263690175, 0.89276374511291579, 0.8937328626963108] >>> v.repeat(3, 1000) [0.24798598704592223, 0.24715431709898894, 0.24498979618920202] >>> >>> part = original.flatten()[::100] >>> (part[:-1]==part[1:]).sum() 27 >>> t.repeat(3, 10) [0.57576898739921489, 0.56410158274297828, 0.56988248506445416] >>> u.repeat(3, 1000) [0.27312186325366383, 0.27315007913011868, 0.27214492344683094] >>> v.repeat(3, 1000) [0.28410342655297427, 0.28374053126867693, 0.28318990262732768] >>> Net result: go back to former definition of candidates (a number, not the actual entries), but calculate that number as matches.sum(), not len(part[matches]). Now the latest version of this (compressed) code: > ... > sampled = data[::stride] > matches = sampled[:-1] == sampled[1:] > candidates = sum(matches) # count identified matches > while candidates > N * 10: # 10 -- heuristic > stride *= 2 # # heuristic increase > sampled = data[::stride] > matches = sampled[:-1] == sampled[1:] > candidates = sum(matches) > while candidates < N * 3: # heuristic slop for long runs > stride //= 2 # heuristic decrease > sampled = data[::stride] > matches = sampled[:-1] == sampled[1:] > candidates = sum(matches) > former = None > past = 0 > for value in sampled[matches]: > ... is: ... sampled = data[::stride] matches = sampled[:-1] == sampled[1:] candidates = matches.sum() # count identified matches while candidates > N * 10: # 10 -- heuristic stride *= 2 # # heuristic increase sampled = data[::stride] matches = sampled[:-1] == sampled[1:] candidates = matches.sum() while candidates < N * 3: # heuristic slop for long runs stride //= 2 # heuristic decrease sampled = data[::stride] matches = sampled[:-1] == sampled[1:] candidates = matches.sum() former = None past = 0 for value in sampled[matches]: ... Now I think I can let this problem go, esp. since it was mclovin's problem in the first place. --Scott David Daniels Scott.Daniels [at] Acm -- http://mail.python.org/mailman/listinfo/python-list
|