
Mailing List Archive: Python: Python

Reading a large csv file

 

 



magawake at gmail

Jun 22, 2009, 8:17 PM

Post #1 of 13
Reading a large csv file

Hello All,

I have a very large csv file 14G and I am planning to move all of my
data to hdf5. I am using h5py to load the data. The biggest problem I
am having is, I am putting the entire file into memory and then
creating a dataset from it. This is very inefficient and it takes over
4 hours to create the hdf5 file.

The csv file has various types:
int4, int4, str, str, str, str, str

I was wondering if anyone knows of any techniques to load this file faster?

TIA
--
http://mail.python.org/mailman/listinfo/python-list


steven at REMOVE

Jun 22, 2009, 8:42 PM

Post #2 of 13
Re: Reading a large csv file

On Mon, 22 Jun 2009 23:17:22 -0400, Mag Gam wrote:

> Hello All,
>
> I have a very large csv file 14G and I am planning to move all of my
> data to hdf5.
[...]
> I was wondering if anyone knows of any techniques to load this file
> faster?

Faster than what? What are you using to load the file?



--
Steven


tkjthingone at gmail

Jun 22, 2009, 9:28 PM

Post #3 of 13
Re: Reading a large csv file

Do you even HAVE 14 gigs of memory? I can imagine that if the OS needs to
start writing to the page file, things are going to slow down.


magawake at gmail

Jun 22, 2009, 10:27 PM

Post #4 of 13
Re: Reading a large csv file

Yes, the system has 64Gig of physical memory.


What I meant was, is it possible to load to a hdf5 dataformat
(basically NumPy array) without reading the entire file at first? I
would like to splay to disk beforehand so it would be a bit faster
instead of having 2 copies in memory.







__peter__ at web

Jun 22, 2009, 11:47 PM

Post #5 of 13
Re: Reading a large csv file

Mag Gam wrote:

> Yes, the system has 64Gig of physical memory.
>
>
> What I meant was, is it possible to load to a hdf5 dataformat
> (basically NumPy array) without reading the entire file at first? I
> would like to splay to disk beforehand so it would be a bit faster
> instead of having 2 copies in memory.

It is certainly possible to read a csv file one line at a time. What exactly
are you doing to convert it into hdf5?

Showing some of your code might help.

Peter
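
Reading one record at a time, as suggested above, might look like the following minimal sketch. It assumes Python 3 and a plain, uncompressed CSV file; the function names `iter_rows` and `count_rows` are hypothetical and only illustrate that the file never has to be held in memory as a whole.

```python
import csv


def iter_rows(path):
    """Yield one parsed CSV record at a time; the file is never held whole."""
    # newline="" is what the csv module recommends for files passed to it.
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            yield row


def count_rows(path):
    """Example consumer: count the records using constant memory."""
    return sum(1 for _ in iter_rows(path))
```

Any per-record processing (type conversion, writing to another store) can consume `iter_rows` in the same way.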



tjreedy at udel

Jun 23, 2009, 12:09 AM

Post #6 of 13
Re: Reading a large csv file

Mag Gam wrote:
> Yes, the system has 64Gig of physical memory.

drool ;-).

> What I meant was, is it possible to load to a hdf5 dataformat
> (basically NumPy array) without reading the entire file at first? I
> would like to splay to disk beforehand so it would be a bit faster
> instead of having 2 copies in memory.

If you can write hdf5 a line at a time, you should be able to do something like

<open csv>
<open hdf5>
for line in csv:
    process line
    write hdf5 line

This assumes 1-1 lines.
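
The pseudocode above could be fleshed out roughly as follows. This is a sketch under stated assumptions, not the poster's code: it assumes Python 3, NumPy and a recent h5py, a plain uncompressed CSV with the seven columns described in the first post, and strings that fit in 32 bytes. The names `ROW_DTYPE`, `csv_to_hdf5`, the field names and the batch size are all hypothetical. It writes in batches rather than strictly line by line, so the dataset is resized once per batch.

```python
import csv
import itertools

import h5py
import numpy as np

# Hypothetical record layout: two 32-bit ints followed by five strings,
# stored as fixed-width byte strings (32 bytes is an assumed upper bound;
# longer values would be truncated by NumPy).
ROW_DTYPE = np.dtype(
    [("a", "i4"), ("b", "i4")] + [("s%d" % i, "S32") for i in range(5)]
)


def csv_to_hdf5(csv_path, h5_path, name="table", batch_rows=100000):
    """Stream csv_path into a resizable HDF5 dataset, one batch at a time."""
    total = 0
    with open(csv_path, newline="") as src, h5py.File(h5_path, "w") as out:
        dset = out.create_dataset(
            name,
            shape=(0,),
            maxshape=(None,),  # unlimited along axis 0, so it can grow
            dtype=ROW_DTYPE,
            chunks=True,
            compression="gzip",
        )
        reader = csv.reader(src)
        while True:
            # Pull at most batch_rows records; only this batch is in memory.
            batch = list(itertools.islice(reader, batch_rows))
            if not batch:
                break
            records = np.array(
                [
                    (int(r[0]), int(r[1]))
                    + tuple(s.encode("utf-8") for s in r[2:7])
                    for r in batch
                ],
                dtype=ROW_DTYPE,
            )
            # One resize and one write per batch, not per line.
            dset.resize(total + len(records), axis=0)
            dset[total:total + len(records)] = records
            total += len(records)
    return total
```

A call such as `csv_to_hdf5("in.csv", "out.h5")` returns the number of rows written. The batch size trades memory for fewer resize and write calls; a suitable value would need to be measured.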



python at bdurham

Jun 23, 2009, 11:05 AM

Post #7 of 13
Re: Reading a large csv file

Mag,

If your source data is clean, it may also be faster for you to parse
your input files directly vs. use the CSV module which may(?) add some
overhead.

Check out the struct module and/or use the split() method of strings.

We do a lot of ETL processing with flat files and, on a slow single-core
workstation, we can typically process 2 GB of data in ~5
minutes. I would think a worst-case processing time would be less than
an hour for 14 GB of data.

Malcolm
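
As a rough check of the estimate above: 2 GB in about 5 minutes is about 0.4 GB per minute, so 14 GB would take on the order of 35 minutes at the same rate. Parsing with `split()` instead of the csv module could look like the sketch below. It assumes Python 3 and the seven-column layout from the first post (two integers, then five strings), and it is only valid for clean data with no quoted fields or embedded commas. The name `parse_line` is hypothetical.

```python
def parse_line(line):
    """Split one clean record into (int, int, str, str, str, str, str)."""
    fields = line.rstrip("\r\n").split(",")
    if len(fields) != 7:
        # A quoted comma or a short line would end up here.
        raise ValueError("expected 7 fields, got %d" % len(fields))
    return (int(fields[0]), int(fields[1])) + tuple(fields[2:])
```

For example, `parse_line("1,2,a,b,c,d,e\n")` yields a tuple of two integers and five strings. Whether this is actually faster than the csv module would have to be measured on the real data.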


chris at simplistix

Jun 24, 2009, 12:13 AM

Post #8 of 13
Re: Reading a large csv file

Terry Reedy wrote:
> Mag Gam wrote:
>> Yes, the system has 64Gig of physical memory.
>
> drool ;-).

Well, except that, dependent on what OS he's using, the size of one
process may well still be limited to 2GB...

Chris
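
Whether a per-process limit of this kind applies depends largely on whether the interpreter is a 32-bit or 64-bit build. A quick check, sketched here for Python 3 with the hypothetical helper names `pointer_bits` and `is_64bit`, is:

```python
import struct
import sys


def pointer_bits():
    """Return the width of a C pointer in this interpreter build, in bits."""
    return struct.calcsize("P") * 8


def is_64bit():
    """True when the interpreter's native index type is wider than 32 bits."""
    return sys.maxsize > 2**32
```

On a 32-bit build the address space caps the process size no matter how much physical memory is installed; the exact cap depends on the operating system.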

--
Simplistix - Content Management, Zope & Python Consulting
- http://www.simplistix.co.uk


magawake at gmail

Jun 24, 2009, 4:38 AM

Post #9 of 13
Re: Reading a large csv file

Sorry for the delayed response. I was trying to figure this problem
out. The OS is Linux, BTW


Here is some code I have:
import numpy as np
from numpy import *

import gzip
import h5py
import re
import sys, string, time, getopt
import os

src=sys.argv[1]
fs = gzip.open(src)
x=src.split("/")
filename=x[len(x)-1]

#Get YYYY/MM/DD format
YYYY=(filename.rsplit(".",2)[0])[0:4]
MM=(filename.rsplit(".",2)[0])[4:6]
DD=(filename.rsplit(".",2)[0])[6:8]

f=h5py.File('/tmp/test_foo/FE.hdf5','w')

grp="/"+YYYY
try:
  f.create_group(grp)
except ValueError:
  print "Year group already exists"

grp=grp+"/"+MM
try:
  f.create_group(grp)
except ValueError:
  print "Month group already exists"

grp=grp+"/"+DD
try:
  group=f.create_group(grp)
except ValueError:
  print "Day group already exists"


str_type=h5py.new_vlen(str)
mydescriptor = {'names': ('gender','age','weight'), 'formats': ('S1',
'f4', 'f4')}
print "Filename is: ",src
fs = gzip.open(src)

dset = f.create_dataset ('Foo',data=arr,compression='gzip')

s=0

#Takes the longest here
for y in fs:
     continue
  a=y.split(',')
  s=s+1
  dset.resize(s,axis=0)
fs.close()

f.close()


This works but just takes a VERY long time.

Any way to optimize this?

TIA
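
The date handling in the script above can be condensed into one helper. This is a sketch, assuming Python 3 and a file name that begins with YYYYMMDD followed by two extensions (for example the hypothetical `20090622.csv.gz`); the name `date_group_path` is not from the original script.

```python
import os


def date_group_path(src):
    """Map '/data/20090622.csv.gz' to '/2009/06/22' (file name is hypothetical)."""
    # Same split as the original script: drop the last two extensions.
    stem = os.path.basename(src).rsplit(".", 2)[0]
    if len(stem) < 8 or not stem[:8].isdigit():
        raise ValueError("file name does not start with YYYYMMDD: %r" % src)
    return "/%s/%s/%s" % (stem[0:4], stem[4:6], stem[6:8])
```

The resulting path could then be passed to h5py's `require_group`, which is documented to return an existing group or create it, and so may stand in for the three try/except blocks.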




skip at pobox

Jun 24, 2009, 9:21 AM

Post #10 of 13
Re: Reading a large csv file

Mag> s=0

Mag> #Takes the longest here
Mag> for y in fs:
Mag>      continue
Mag>   a=y.split(',')
Mag>   s=s+1
Mag>   dset.resize(s,axis=0)
Mag> fs.close()

Mag> f.close()

Mag> This works but just takes a VERY long time.

Mag> Any way to optimize this?

I sort of suspect you're missing something there. Is there nothing between
the for loop and the overly indented continue statement?

At any rate, try using the csv module to read in your records:

import csv

reader = csv.reader(fs)
...

for s, row in enumerate(reader):
    dset.resize(s, axis=0)
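
A self-contained variant of the csv-module suggestion, reading straight from a gzip-compressed file, might look like this. It is a sketch assuming Python 3; the name `iter_gzip_csv` is hypothetical. Rows are numbered from 1 here so that the number can be used directly as a running row count.

```python
import csv
import gzip


def iter_gzip_csv(path, start=1):
    """Yield (row_number, row) pairs from a gzip-compressed CSV file."""
    # "rt" opens the archive in text mode; newline="" is what the csv
    # module recommends for file objects passed to csv.reader.
    with gzip.open(path, "rt", newline="") as handle:
        for numbered in enumerate(csv.reader(handle), start):
            yield numbered
```

Each `row` is a list of strings, so numeric columns still need an explicit `int()` conversion.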

--
Skip Montanaro - skip [at] pobox - http://www.smontanaro.net/
when i wake up with a heart rate below 40, i head right for the espresso
machine. -- chaos @ forums.usms.org


lie.1296 at gmail

Jun 24, 2009, 12:57 PM

Post #11 of 13
Re: Reading a large csv file

Mag Gam wrote:
> Sorry for the delayed response. I was trying to figure this problem
> out. The OS is Linux, BTW

Maybe I'm just being pedantic, but saying your OS is Linux means little,
as there are hundreds of variants (distros) of Linux. (Not to mention
that Linux is a kernel, not a full-blown OS, and people in GNU will
insist on calling a Linux-based OS GNU/Linux.)

> Here is some code I have:
> import numpy as np
> from numpy import *

Why are you importing numpy twice, as np and as *?

> import gzip
> import h5py
> import re
> import sys, string, time, getopt
> import os
>
> src=sys.argv[1]
> fs = gzip.open(src)
> x=src.split("/")
> filename=x[len(x)-1]
>
> #Get YYYY/MM/DD format
> YYYY=(filename.rsplit(".",2)[0])[0:4]
> MM=(filename.rsplit(".",2)[0])[4:6]
> DD=(filename.rsplit(".",2)[0])[6:8]

>
> f=h5py.File('/tmp/test_foo/FE.hdf5','w')

this particular line would make it impossible to have more than one
instance of the program open. May not be your concern...

>
> grp="/"+YYYY
> try:
>   f.create_group(grp)
> except ValueError:
>   print "Year group already exists"
>
> grp=grp+"/"+MM
> try:
>   f.create_group(grp)
> except ValueError:
>   print "Month group already exists"
>
> grp=grp+"/"+DD
> try:
>   group=f.create_group(grp)
> except ValueError:
>   print "Day group already exists"
>

> str_type=h5py.new_vlen(str)

> mydescriptor = {'names': ('gender','age','weight'), 'formats': ('S1',
> 'f4', 'f4')}
> print "Filename is: ",src
> fs = gzip.open(src)

> dset = f.create_dataset ('Foo',data=arr,compression='gzip')

What is `arr`?

> s=0
>
> #Takes the longest here
> for y in fs:
>      continue
>   a=y.split(',')

>   s=s+1
>   dset.resize(s,axis=0)

You increment s by 1 on each iteration; would this copy the dataset
every time? (I have never worked with h5py, so I don't know how it works.)


magawake at gmail

Jun 26, 2009, 3:47 AM

Post #12 of 13
Re: Reading a large csv file

Thank you everyone for the responses! I took some of your suggestions
and my loading sped up by 25%.





drozzy at gmail

Jul 15, 2009, 1:00 PM

Post #13 of 13
Re: Reading a large csv file

On Jun 26, 6:47 am, Mag Gam <magaw...@gmail.com> wrote:
> Thankyou everyone for the responses! I took some of your suggestions
> and my loading sped up by 25%

what a useless post...
