Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Python: Python

sorting 1172026 entries

 

 

Python python RSS feed   Index | Next | Previous | View Threaded


jmwebaze at gmail

May 6, 2012, 8:57 AM

Post #1 of 22 (435 views)
Permalink
sorting 1172026 entries

I have several lists with approx 1172026 entries. I have been trying to
sort the records, but have failed.. I tried lists.sort() i also trired
sorted python's inbuilt method. This has been running for weeks.

Any one knows of method that can handle such lists.

cheers



--
*Mob UG: +256 (0) 70 1735800 | NL +31 (0) 6 852 841 38 | Gtalk: jmwebaze |
skype: mwebazej | URL: www.astro.rug.nl/~jmwebaze

/* Life runs on code */*


benjamin at schollnick

May 6, 2012, 9:07 AM

Post #2 of 22 (435 views)
Permalink
Re: sorting 1172026 entries [In reply to]

On May 6, 2012, at 11:57 AM, J. Mwebaze wrote:

> I have several lists with approx 1172026 entries. I have been trying to sort the records, but have failed.. I tried lists.sort() i also trired sorted python's inbuilt method. This has been running for weeks.
>
> Any one knows of method that can handle such lists.

The issue there is the sheer size of the list. I can't think of an algorithium that wouldn't have a problem with a list of that size.

Two suggestions....

1) Is there no other way to organize this data, other than having it in a single list? You can't organize it by, for example, zip code, area code, year, or something, and make multiple lists? Reducing the size would speed the sort up.

2) Maybe consider a different storage method, for example, adding the data into a database? And then connecting to the database via python?

- Benjamin


jeanpierreda at gmail

May 6, 2012, 9:09 AM

Post #3 of 22 (436 views)
Permalink
Re: sorting 1172026 entries [In reply to]

On Sun, May 6, 2012 at 11:57 AM, J. Mwebaze <jmwebaze [at] gmail> wrote:
> I have several lists with approx 1172026 entries. I have been trying to sort
> the records, but have failed.. I tried lists.sort() i also trired sorted
> python's inbuilt method. This has been running for weeks.

Sorting 1172026 random floats takes about 1.5 seconds to do on my
machine. More complicated objects take more time, but usually not that
much more time. What exactly are you sorting?

-- Devin
--
http://mail.python.org/mailman/listinfo/python-list


jmwebaze at gmail

May 6, 2012, 9:10 AM

Post #4 of 22 (435 views)
Permalink
Re: sorting 1172026 entries [In reply to]

On Sun, May 6, 2012 at 6:07 PM, Benjamin Schollnick <benjamin [at] schollnick
> wrote:

>
> On May 6, 2012, at 11:57 AM, J. Mwebaze wrote:
>
> I have several lists with approx 1172026 entries. I have been trying to
> sort the records, but have failed.. I tried lists.sort() i also trired
> sorted python's inbuilt method. This has been running for weeks.
>
> Any one knows of method that can handle such lists.
>
>
> The issue there is the sheer size of the list. I can't think of an
> algorithium that wouldn't have a problem with a list of that size.
>
> Two suggestions....
>
> 1) Is there no other way to organize this data, other than having it in a
> single list? You can't organize it by, for example, zip code, area code,
> year, or something, and make multiple lists? Reducing the size would speed
> the sort up.
>
> 2) Maybe consider a different storage method, for example, adding the data
> into a database? And then connecting to the database via python?
>
> - Benjamin
>
>
I could try the database option..



--
*Mob UG: +256 (0) 70 1735800 | NL +31 (0) 6 852 841 38 | Gtalk: jmwebaze |
skype: mwebazej | URL: www.astro.rug.nl/~jmwebaze

/* Life runs on code */*


jmwebaze at gmail

May 6, 2012, 9:11 AM

Post #5 of 22 (435 views)
Permalink
Re: sorting 1172026 entries [In reply to]

On Sun, May 6, 2012 at 6:09 PM, Devin Jeanpierre <jeanpierreda [at] gmail>wrote:

> On Sun, May 6, 2012 at 11:57 AM, J. Mwebaze <jmwebaze [at] gmail> wrote:
> > I have several lists with approx 1172026 entries. I have been trying to
> sort
> > the records, but have failed.. I tried lists.sort() i also trired sorted
> > python's inbuilt method. This has been running for weeks.
>
> Sorting 1172026 random floats takes about 1.5 seconds to do on my
> machine. More complicated objects take more time, but usually not that
> much more time. What exactly are you sorting?
>
> -- Devin
>

[ (datatime, int) ] * 1172026
--
*Mob UG: +256 (0) 70 1735800 | NL +31 (0) 6 852 841 38 | Gtalk: jmwebaze |
skype: mwebazej | URL: www.astro.rug.nl/~jmwebaze

/* Life runs on code */*


jeanpierreda at gmail

May 6, 2012, 9:21 AM

Post #6 of 22 (435 views)
Permalink
Re: sorting 1172026 entries [In reply to]

On Sun, May 6, 2012 at 12:11 PM, J. Mwebaze <jmwebaze [at] gmail> wrote:
> [ (datatime, int) ] * 1172026

I can't duplicate slowness. It finishes fairly quickly here. Maybe you
could try posting specific code? It might be something else that is
making your program take forever.

>>> x = [(datetime.datetime.now() + datetime.timedelta(random.getrandbits(10)), random.getrandbits(32)) for _ in xrange(1172026)]
>>> random.shuffle(x)
>>> x.sort()
>>>

-- Devin
--
http://mail.python.org/mailman/listinfo/python-list


jmwebaze at gmail

May 6, 2012, 9:26 AM

Post #7 of 22 (435 views)
Permalink
Re: sorting 1172026 entries [In reply to]

I have attached one of the files, try to sort and let me know the results.
Kindly sort by date. ooops - am told the file exceed 25M.

below is the code

import glob
txtfiles =glob.glob('*.txt')
import dateutil.parser as parser


for filename in txtfiles:
temp=[]
f=open(filename)
for line in f.readlines():
line = line.strip()
line=line.split()
temp.append((parser.parse(line[0]), float(line[1])))
temp=sorted(temp)
with open(filename.strip('.txt')+ '.sorted', 'wb') as p:
for i, j in temp:
p.write('%s %s\n' %(str(i),j))


On Sun, May 6, 2012 at 6:21 PM, Devin Jeanpierre <jeanpierreda [at] gmail>wrote:

> On Sun, May 6, 2012 at 12:11 PM, J. Mwebaze <jmwebaze [at] gmail> wrote:
> > [ (datatime, int) ] * 1172026
>
> I can't duplicate slowness. It finishes fairly quickly here. Maybe you
> could try posting specific code? It might be something else that is
> making your program take forever.
>
> >>> x = [(datetime.datetime.now() +
> datetime.timedelta(random.getrandbits(10)), random.getrandbits(32)) for _
> in xrange(1172026)]
> >>> random.shuffle(x)
> >>> x.sort()
> >>>
>
> -- Devin
>



--
*Mob UG: +256 (0) 70 1735800 | NL +31 (0) 6 852 841 38 | Gtalk: jmwebaze |
skype: mwebazej | URL: www.astro.rug.nl/~jmwebaze

/* Life runs on code */*


jmwebaze at gmail

May 6, 2012, 9:29 AM

Post #8 of 22 (435 views)
Permalink
Re: sorting 1172026 entries [In reply to]

sorry see, corrected code


for filename in txtfiles:
temp=[]
f=open(filename)
for line in f.readlines():
line = line.strip()
line=line.split()
temp.append((parser.parse(line[0]), float(line[1])))
temp=sorted(temp)
with open(filename.strip('.txt')+ '.sorted', 'wb') as p:
for i, j in temp:
p.write('%s %s\n' %(str(i),j))


On Sun, May 6, 2012 at 6:26 PM, J. Mwebaze <jmwebaze [at] gmail> wrote:

> I have attached one of the files, try to sort and let me know the results.
> Kindly sort by date. ooops - am told the file exceed 25M.
>
> below is the code
>
> import glob
> txtfiles =glob.glob('*.txt')
> import dateutil.parser as parser
>
>
> for filename in txtfiles:
> temp=[]
> f=open(filename)
> for line in f.readlines():
> line = line.strip()
> line=line.split()
> temp.append((parser.parse(line[0]), float(line[1])))
> temp=sorted(temp)
> with open(filename.strip('.txt')+ '.sorted', 'wb') as p:
> for i, j in temp:
> p.write('%s %s\n' %(str(i),j))
>
>
> On Sun, May 6, 2012 at 6:21 PM, Devin Jeanpierre <jeanpierreda [at] gmail>wrote:
>
>> On Sun, May 6, 2012 at 12:11 PM, J. Mwebaze <jmwebaze [at] gmail> wrote:
>> > [ (datatime, int) ] * 1172026
>>
>> I can't duplicate slowness. It finishes fairly quickly here. Maybe you
>> could try posting specific code? It might be something else that is
>> making your program take forever.
>>
>> >>> x = [(datetime.datetime.now() +
>> datetime.timedelta(random.getrandbits(10)), random.getrandbits(32)) for _
>> in xrange(1172026)]
>> >>> random.shuffle(x)
>> >>> x.sort()
>> >>>
>>
>> -- Devin
>>
>
>
>
> --
> *Mob UG: +256 (0) 70 1735800 | NL +31 (0) 6 852 841 38 | Gtalk: jmwebaze
> | skype: mwebazej | URL: www.astro.rug.nl/~jmwebaze
>
> /* Life runs on code */*
>
>


--
*Mob UG: +256 (0) 70 1735800 | NL +31 (0) 6 852 841 38 | Gtalk: jmwebaze |
skype: mwebazej | URL: www.astro.rug.nl/~jmwebaze

/* Life runs on code */*


jmwebaze at gmail

May 6, 2012, 9:36 AM

Post #9 of 22 (437 views)
Permalink
Re: sorting 1172026 entries [In reply to]

I noticed the error in code please ignore this post..

On Sun, May 6, 2012 at 6:29 PM, J. Mwebaze <jmwebaze [at] gmail> wrote:

> sorry see, corrected code
>
>
> for filename in txtfiles:
> temp=[]
> f=open(filename)
> for line in f.readlines():
> line = line.strip()
> line=line.split()
> temp.append((parser.parse(line[0]), float(line[1])))
> temp=sorted(temp)
> with open(filename.strip('.txt')+ '.sorted', 'wb') as p:
> for i, j in temp:
> p.write('%s %s\n' %(str(i),j))
>
>
> On Sun, May 6, 2012 at 6:26 PM, J. Mwebaze <jmwebaze [at] gmail> wrote:
>
>> I have attached one of the files, try to sort and let me know the
>> results. Kindly sort by date. ooops - am told the file exceed 25M.
>>
>> below is the code
>>
>> import glob
>> txtfiles =glob.glob('*.txt')
>> import dateutil.parser as parser
>>
>>
>> for filename in txtfiles:
>> temp=[]
>> f=open(filename)
>> for line in f.readlines():
>> line = line.strip()
>> line=line.split()
>> temp.append((parser.parse(line[0]), float(line[1])))
>> temp=sorted(temp)
>> with open(filename.strip('.txt')+ '.sorted', 'wb') as p:
>> for i, j in temp:
>> p.write('%s %s\n' %(str(i),j))
>>
>>
>> On Sun, May 6, 2012 at 6:21 PM, Devin Jeanpierre <jeanpierreda [at] gmail>wrote:
>>
>>> On Sun, May 6, 2012 at 12:11 PM, J. Mwebaze <jmwebaze [at] gmail> wrote:
>>> > [ (datatime, int) ] * 1172026
>>>
>>> I can't duplicate slowness. It finishes fairly quickly here. Maybe you
>>> could try posting specific code? It might be something else that is
>>> making your program take forever.
>>>
>>> >>> x = [(datetime.datetime.now() +
>>> datetime.timedelta(random.getrandbits(10)), random.getrandbits(32)) for _
>>> in xrange(1172026)]
>>> >>> random.shuffle(x)
>>> >>> x.sort()
>>> >>>
>>>
>>> -- Devin
>>>
>>
>>
>>
>> --
>> *Mob UG: +256 (0) 70 1735800 | NL +31 (0) 6 852 841 38 | Gtalk: jmwebaze
>> | skype: mwebazej | URL: www.astro.rug.nl/~jmwebaze
>>
>> /* Life runs on code */*
>>
>>
>
>
> --
> *Mob UG: +256 (0) 70 1735800 | NL +31 (0) 6 852 841 38 | Gtalk: jmwebaze
> | skype: mwebazej | URL: www.astro.rug.nl/~jmwebaze
>
> /* Life runs on code */*
>
>


--
*Mob UG: +256 (0) 70 1735800 | NL +31 (0) 6 852 841 38 | Gtalk: jmwebaze |
skype: mwebazej | URL: www.astro.rug.nl/~jmwebaze

/* Life runs on code */*


gary.herron at islandtraining

May 6, 2012, 9:37 AM

Post #10 of 22 (432 views)
Permalink
Re: sorting 1172026 entries [In reply to]

On 05/06/2012 09:29 AM, J. Mwebaze wrote:
> sorry see, corrected code
>
>
> for filename in txtfiles:
> temp=[]
> f=open(filename)
> for line in f.readlines():
> line = line.strip()
> line=line.split()
> temp.append((parser.parse(line[0]), float(line[1])))
> temp=sorted(temp)
> with open(filename.strip('.txt')+ '.sorted', 'wb') as p:
> for i, j in temp:
> p.write('%s %s\n' %(str(i),j))

Don't do
temp = sorted(temp)
That will create a *new* copy of the list to sort, and the assignment
will free up the original list for deletion and garbage collection.

Instead do the in-place sort:
temp.sort()
Same result, less thrashing.

This will make your program slightly more efficient, HOWEVER, it is not
the solution of your week-long sort problem.


Gary Herron



>
>
> On Sun, May 6, 2012 at 6:26 PM, J. Mwebaze <jmwebaze [at] gmail
> <mailto:jmwebaze [at] gmail>> wrote:
>
> I have attached one of the files, try to sort and let me know the
> results. Kindly sort by date. ooops - am told the file exceed 25M.
>
> below is the code
>
> import glob
> txtfiles =glob.glob('*.txt')
> import dateutil.parser as parser
>
>
> for filename in txtfiles:
> temp=[]
> f=open(filename)
> for line in f.readlines():
> line = line.strip()
> line=line.split()
> temp.append((parser.parse(line[0]), float(line[1])))
> temp=sorted(temp)
> with open(filename.strip('.txt')+ '.sorted', 'wb') as p:
> for i, j in temp:
> p.write('%s %s\n' %(str(i),j))
>
>
> On Sun, May 6, 2012 at 6:21 PM, Devin Jeanpierre
> <jeanpierreda [at] gmail <mailto:jeanpierreda [at] gmail>> wrote:
>
> On Sun, May 6, 2012 at 12:11 PM, J. Mwebaze
> <jmwebaze [at] gmail <mailto:jmwebaze [at] gmail>> wrote:
> > [ (datatime, int) ] * 1172026
>
> I can't duplicate slowness. It finishes fairly quickly here.
> Maybe you
> could try posting specific code? It might be something else
> that is
> making your program take forever.
>
> >>> x = [(datetime.datetime.now() +
> datetime.timedelta(random.getrandbits(10)),
> random.getrandbits(32)) for _ in xrange(1172026)]
> >>> random.shuffle(x)
> >>> x.sort()
> >>>
>
> -- Devin
>
>
>
>
> --
> *Mob UG: +256 (0) 70 1735800 <tel:%2B256%20%280%29%2070%201735800>
> | NL +31 (0) 6 852 841 38
> <tel:%2B31%20%280%29%206%20852%20841%2038> | Gtalk: jmwebaze |
> skype: mwebazej | URL: www.astro.rug.nl/~jmwebaze
> <http://www.astro.rug.nl/%7Ejmwebaze>
>
> /* Life runs on code */*
>
>
>
>
> --
> *Mob UG: +256 (0) 70 1735800 | NL +31 (0) 6 852 841 38 | Gtalk:
> jmwebaze | skype: mwebazej | URL: www.astro.rug.nl/~jmwebaze
> <http://www.astro.rug.nl/%7Ejmwebaze>
>
> /* Life runs on code */*
>
>
>


thudfoo at gmail

May 6, 2012, 9:49 AM

Post #11 of 22 (436 views)
Permalink
Re: sorting 1172026 entries [In reply to]

On Sunday 2012 May 06 09:29, J. Mwebaze wrote:
>  temp=sorted(temp)

Change to:

temp.sort()

RTFM on sorted() and <list>.sort().

--
Yonder nor sorghum stenches shut ladle gulls stopper torque wet
strainers.

--
http://mail.python.org/mailman/listinfo/python-list


clp2 at rebertia

May 6, 2012, 10:03 AM

Post #12 of 22 (435 views)
Permalink
Re: sorting 1172026 entries [In reply to]

On Sun, May 6, 2012 at 9:29 AM, J. Mwebaze <jmwebaze [at] gmail> wrote:
> sorry see, corrected code
>
>
> for filename in txtfiles:
>    temp=[]
>    f=open(filename)

Why not use `with` here too?

>    for line in f.readlines():

readlines() reads *the entire file contents* into memory all at once!
Use `for line in f:` instead, which will read from the file one line
at a time.

>      line = line.strip()
>      line=line.split()
>      temp.append((parser.parse(line[0]), float(line[1])))
>    temp=sorted(temp)

As already pointed out, use temp.sort() instead.

>    with open(filename.strip('.txt')+ '.sorted', 'wb') as p:

strip() doesn't do quite what you think it does:
$ python
Python 2.7.1 (r271:86832, Jul 31 2011, 19:30:53)
>>> '.xtx.foo'.strip('.txt')
'foo'
>>>

Consult the docs.


Also, please avoid top-posting in the future.

Cheers,
Chris
--
http://mail.python.org/mailman/listinfo/python-list


stefan_ml at behnel

May 6, 2012, 10:15 AM

Post #13 of 22 (436 views)
Permalink
Re: sorting 1172026 entries [In reply to]

J. Mwebaze, 06.05.2012 18:29:
> sorry see, corrected code
>
> for filename in txtfiles:
> temp=[]
> f=open(filename)
> for line in f.readlines():
> line = line.strip()
> line=line.split()
> temp.append((parser.parse(line[0]), float(line[1])))
> temp=sorted(temp)
> with open(filename.strip('.txt')+ '.sorted', 'wb') as p:
> for i, j in temp:
> p.write('%s %s\n' %(str(i),j))
>
>> I have attached one of the files, try to sort and let me know the results.
>> Kindly sort by date. ooops - am told the file exceed 25M.
>>
>> below is the code
>>
>> import glob
>> txtfiles =glob.glob('*.txt')
>> import dateutil.parser as parser
>>
>>
>> for filename in txtfiles:
>> temp=[]
>> f=open(filename)
>> for line in f.readlines():
>> line = line.strip()
>> line=line.split()
>> temp.append((parser.parse(line[0]), float(line[1])))
>> temp=sorted(temp)
>> with open(filename.strip('.txt')+ '.sorted', 'wb') as p:
>> for i, j in temp:
>> p.write('%s %s\n' %(str(i),j))

How much memory do you have on your system? Does the list fit into memory
easily or is it swapping to disk while you are running the sort?

Stefan

--
http://mail.python.org/mailman/listinfo/python-list


alec.taylor6 at gmail

May 6, 2012, 12:11 PM

Post #14 of 22 (421 views)
Permalink
Re: sorting 1172026 entries [In reply to]

Also, is there a reason you are sorting the data-set after insert
rather than using a self-sorting data-structure?

A well chosen self-sorting data-structure is always more efficient
when full data flow is controlled.

I.e.: first insert can be modified to use the self-sorting data-structure

I cannot think of a situation requiring permanently sorted data-sets
which would be better served with a non-self-sorting data-structure
except those where you do not have full data-flow control.
--
http://mail.python.org/mailman/listinfo/python-list


breamoreboy at yahoo

May 6, 2012, 12:52 PM

Post #15 of 22 (418 views)
Permalink
Re: sorting 1172026 entries [In reply to]

On 06/05/2012 20:11, Alec Taylor wrote:
> Also, is there a reason you are sorting the data-set after insert
> rather than using a self-sorting data-structure?
>
> A well chosen self-sorting data-structure is always more efficient
> when full data flow is controlled.
>
> I.e.: first insert can be modified to use the self-sorting data-structure
>
> I cannot think of a situation requiring permanently sorted data-sets
> which would be better served with a non-self-sorting data-structure
> except those where you do not have full data-flow control.

Possibly blist from http://pypi.python.org/pypi/blist/1.3.4 ?

--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


drsalists at gmail

May 6, 2012, 3:30 PM

Post #16 of 22 (416 views)
Permalink
Re: sorting 1172026 entries [In reply to]

How much physical RAM (not the virtual memory, but the physical memory)
does your machine have available? We know the number of elements in your
dataset, but how big are the individual elements? If a sort is never
completing, you're probably swapping.

list.sort() is preferrable to sorted(list), if you need to conserve memory.

A self-sorting datastructure like a B Tree or whatever is rarely faster
than using a good sorting algorithm, assuming that you only need to sort
the data once per program invocation. If you're sorting the data in a
loop, then yes, a self-sorting datastructure is most likely a much better
idea. I've started a page about python trees and heaps at
http://stromberg.dnsalias.org/~strombrg/python-tree-and-heap-comparison/ -
but again, unless you're sorting in a loop, you're probably better off with
a good sorting algorithm, and Python's built-in Timsort is excellent for
all-in-RAM sorting.

If you have too much data to sort in RAM all at once, it's best to sort
sublists that -do- fit in memory, using something like list_.sort() (that's
Timsort by another name in CPython), and then write the sorted sublists to
disk for subsequent merging using the merge step of a mergesort.

If you just need something quick-to-code, GNU sort(1) is very highly
optimized (for a text-based sort), and will sort datasets that are larger
than your physical memory efficiently using an algorithm similar to (not
identical) what I described in the previous paragraph. It's /usr/bin/sort
on most Linux systems, and is available for a large number of other
operating systems including windows via http://cygwin.com/ or
https://github.com/bmatzelle/gow/wiki/

On Sun, May 6, 2012 at 8:57 AM, J. Mwebaze <jmwebaze [at] gmail> wrote:

> I have several lists with approx 1172026 entries. I have been trying to
> sort the records, but have failed.. I tried lists.sort() i also trired
> sorted python's inbuilt method. This has been running for weeks.
>
> Any one knows of method that can handle such lists.
>
> cheers
>
>
>
> --
> *Mob UG: +256 (0) 70 1735800 | NL +31 (0) 6 852 841 38 | Gtalk: jmwebaze
> | skype: mwebazej | URL: www.astro.rug.nl/~jmwebaze<http://www.astro.rug.nl/%7Ejmwebaze>
>
> /* Life runs on code */*
>
>
> --
> http://mail.python.org/mailman/listinfo/python-list
>
>


cs at zip

May 6, 2012, 4:54 PM

Post #17 of 22 (416 views)
Permalink
Re: sorting 1172026 entries [In reply to]

On 06May2012 18:36, J. Mwebaze <jmwebaze [at] gmail> wrote:
| > for filename in txtfiles:
| > temp=[]
| > f=open(filename)
| > for line in f.readlines():
| > line = line.strip()
| > line=line.split()
| > temp.append((parser.parse(line[0]), float(line[1])))

Have you timed the different parts of your code instead of the whole
thing?

Specificly, do you know the sort time is the large cost?

I would point out that the loop above builds the list by append(), one
item at a time. That should have runtime cost of the square of the list
length, 1172026 * 1172026. Though I've just done this:

[Documents/python]oscar1*> python
Python 2.7.3 (default, May 4 2012, 16:19:02)
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> L1 = []
>>> for i in range(1000000): L1.append(0)
...

and it only took a few seconds.

As pointed out by others, the readlines() is also a little expensive,
conceivably similarly so (it also needs to build a huge list).

Anyway, put some:

print time.time()

at various points. Not in the inner bits of the loops, but around larger
chunks, example:

from time import time
temp=[]
f=open(filename)
print "after open", time()
lines = f.readlines()
print "after readlines", time()
for line in lines:
line = line.strip()
line=line.split()
temp.append((parser.parse(line[0]), float(line[1])))
print "after read loop", time()

and so on. AT least then you will have more feel for what part of your
code is taking so long.

Ceers,
--
Cameron Simpson <cs [at] zip> DoD#743
http://www.cskk.ezoshosting.com/cs/

The shortest path between any two truths in the real domain passes through
the complex domain. - J. Hadamand
--
http://mail.python.org/mailman/listinfo/python-list


clp2 at rebertia

May 6, 2012, 5:10 PM

Post #18 of 22 (417 views)
Permalink
Re: sorting 1172026 entries [In reply to]

On Sun, May 6, 2012 at 4:54 PM, Cameron Simpson <cs [at] zip> wrote:
> On 06May2012 18:36, J. Mwebaze <jmwebaze [at] gmail> wrote:
> | > for filename in txtfiles:
> | >    temp=[]
> | >    f=open(filename)
> | >    for line in f.readlines():
> | >      line = line.strip()
> | >      line=line.split()
> | >      temp.append((parser.parse(line[0]), float(line[1])))
>
> Have you timed the different parts of your code instead of the whole
> thing?
>
> Specificly, do you know the sort time is the large cost?
>
> I would point out that the loop above builds the list by append(), one
> item at a time. That should have runtime cost of the square of the list
> length, 1172026 * 1172026. Though I've just done this:

Er, what? list.append() is O(1) amortized.
Perhaps you're confusing list.append() with list.insert(), which is indeed O(n)?

Cheers,
Chris
--
http://mail.python.org/mailman/listinfo/python-list


cs at zip

May 6, 2012, 5:31 PM

Post #19 of 22 (416 views)
Permalink
Re: sorting 1172026 entries [In reply to]

On 06May2012 17:10, Chris Rebert <clp2 [at] rebertia> wrote:
| On Sun, May 6, 2012 at 4:54 PM, Cameron Simpson <cs [at] zip> wrote:
| > On 06May2012 18:36, J. Mwebaze <jmwebaze [at] gmail> wrote:
| > | > for filename in txtfiles:
| > | >    temp=[]
| > | >    f=open(filename)
| > | >    for line in f.readlines():
| > | >      line = line.strip()
| > | >      line=line.split()
| > | >      temp.append((parser.parse(line[0]), float(line[1])))
| >
| > Have you timed the different parts of your code instead of the whole
| > thing?
| >
| > Specificly, do you know the sort time is the large cost?
| >
| > I would point out that the loop above builds the list by append(), one
| > item at a time. That should have runtime cost of the square of the list
| > length, 1172026 * 1172026. Though I've just done this:
|
| Er, what? list.append() is O(1) amortized.

I didn't mean per .append() call (which I'd expect to be O(n) for large
n), I meant overall for the completed list.

Don't the realloc()s make it O(n^2) overall for large n? The list
must get copied when the underlying space fills. I confess to being
surprised at how quick it went for me though. I suppose reallocing in
chunks (eg to double the available size) might conceal this for a while.
It should still be well over O(n) overall (not amortised per .append()
call).

| Perhaps you're confusing list.append() with list.insert(), which is indeed O(n)?

I wasn't, though .insert() is surely way more expensive.

Cheers,
--
Cameron Simpson <cs [at] zip> DoD#743
http://www.cskk.ezoshosting.com/cs/

The Borg assimilated my race and all I got was this lousy tagline.
- Cath Lawrence <Cath_Lawrence [at] premium>
--
http://mail.python.org/mailman/listinfo/python-list


rosuav at gmail

May 6, 2012, 6:02 PM

Post #20 of 22 (417 views)
Permalink
Re: sorting 1172026 entries [In reply to]

On Mon, May 7, 2012 at 10:31 AM, Cameron Simpson <cs [at] zip> wrote:
> I didn't mean per .append() call (which I'd expect to be O(n) for large
> n), I meant overall for the completed list.
>
> Don't the realloc()s make it O(n^2) overall for large n? The list
> must get copied when the underlying space fills. I confess to being
> surprised at how quick it went for me though. I suppose reallocing in
> chunks (eg to double the available size) might conceal this for a while.
> It should still be well over O(n) overall (not amortised per .append()
> call).

I haven't checked the CPython code, but the usual way to do
reallocations is to double the size (or add 50% or something) each
time, meaning that as n increases, the frequency of reallocations
decreases - hence the O(1) amortized time. In any case, it's the
recommended way to build large strings:

s = ""
while condition:
s += "some_value"

versus:

s = []
while condition:
s.append("some_value")
s = "".join(s)

so I would be very much surprised if this common wisdom is merely
replacing one O(n*n) operation with another O(n*n) operation.

ChrisA
--
http://mail.python.org/mailman/listinfo/python-list


cs at zip

May 7, 2012, 2:52 PM

Post #21 of 22 (413 views)
Permalink
Re: sorting 1172026 entries [In reply to]

On 07May2012 11:02, Chris Angelico <rosuav [at] gmail> wrote:
| On Mon, May 7, 2012 at 10:31 AM, Cameron Simpson <cs [at] zip> wrote:
| > I didn't mean per .append() call (which I'd expect to be O(n) for large
| > n), I meant overall for the completed list.
| >
| > Don't the realloc()s make it O(n^2) overall for large n? The list
| > must get copied when the underlying space fills. I confess to being
| > surprised at how quick it went for me though. I suppose reallocing in
| > chunks (eg to double the available size) might conceal this for a while.
| > It should still be well over O(n) overall (not amortised per .append()
| > call).
|
| I haven't checked the CPython code, but the usual way to do
| reallocations is to double the size

My example above:-)

| (or add 50% or something) each
| time, meaning that as n increases, the frequency of reallocations
| decreases - hence the O(1) amortized time.

Hmm, yes. But it is only O(1) for doubling. If one went with a smaller
increment (to waste less space in the end case where one stops growing the
array) then there are more copies and less .append efficiency, trading
potential memory bloat for compute time in .append(). If you went all
the way to adding only one item the cost goes to O(n) for .append()
and O(n^2) overall, with varying costs in between.

| In any case, it's the
| recommended way to build large strings:
|
| s = ""
| while condition:
| s += "some_value"
|
| versus:
|
| s = []
| while condition:
| s.append("some_value")
| s = "".join(s)
|
| so I would be very much surprised if this common wisdom is merely
| replacing one O(n*n) operation with another O(n*n) operation.

It sort of is (except that list.append may be specially tuned to realloc
in big chunks). The join-lots-of-strings standard example is based on
the length of the strings in characters/bytes being far more than the
number of strings. So you do a lot of tiny listappends instead, copying
nothing, then a big allocate-the-correct-size-and-copy-once with s.join.

Also, this:

s += "some_value"

is not a list extension. Because strings are immutable (and anyway in
general for "+" in python) you're making a new string and copying two
other string contents into it. Inherently a copy of the whole of "s" on
every "+" operation. Compared with list.append() or list.extend(), which
are in-place modifications to the original list.

So these are really apples and oranges here.
--
Cameron Simpson <cs [at] zip> DoD#743
http://www.cskk.ezoshosting.com/cs/

We are in great haste to construct a magnetic telegraph from Maine to Texas;
but Maine and Texas, it may be, have nothing important to communicate.
- H. D. Thoreau
--
http://mail.python.org/mailman/listinfo/python-list


ian.g.kelly at gmail

May 7, 2012, 11:15 PM

Post #22 of 22 (396 views)
Permalink
Re: sorting 1172026 entries [In reply to]

On Mon, May 7, 2012 at 3:52 PM, Cameron Simpson <cs [at] zip> wrote:
> | (or add 50% or something) each
> | time, meaning that as n increases, the frequency of reallocations
> | decreases - hence the O(1) amortized time.
>
> Hmm, yes. But it is only O(1) for doubling. If one went with a smaller
> increment (to waste less space in the end case where one stops growing the
> array) then there are more copies and less .append efficiency, trading
> potential memory bloat for compute time in .append(). If you went all
> the way to adding only one item the cost goes to O(n) for .append()
> and O(n^2) overall, with varying costs in between.

It's O(1) amortized as long as the increment is exponential. IIRC
Python actually grows the list by a factor of 1/8 in the limit, which
is still exponential and still O(1) amortized.

Cheers,
Ian
--
http://mail.python.org/mailman/listinfo/python-list

Python python RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.