Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Python: Python

Multi thread reading a file

 

 

Python python RSS feed   Index | Next | Previous | View Threaded


magawake at gmail

Jun 30, 2009, 6:52 PM

Post #1 of 19 (605 views)
Permalink
Multi thread reading a file

Hello All,

I am very new to python and I am in the process of loading a very
large compressed csv file into another format. I was wondering if I
can do this in a multi thread approach.

Here is the pseudo code I was thinking about:

Let T = Total number of lines in a file, Example 1000000 (1 million files)
Let B = Total number of lines in a buffer, for example 10000 lines


Create a thread to read until buffer
Create another thread to read buffer+buffer ( So we have 2 threads
now. But since the file is zipped I have to wait until the first
thread is completed. Unless someone knows of a clever technique.
Write the content of thread 1 into a numpy array
Write the content of thread 2 into a numpy array

But I don't think we are capable of multiprocessing tasks for this....


Any ideas? Has anyone ever tackled a problem like this before?
--
http://mail.python.org/mailman/listinfo/python-list


gagsl-py2 at yahoo

Jun 30, 2009, 9:54 PM

Post #2 of 19 (583 views)
Permalink
Re: Multi thread reading a file [In reply to]

En Tue, 30 Jun 2009 22:52:18 -0300, Mag Gam <magawake[at]gmail.com> escribió:

> I am very new to python and I am in the process of loading a very
> large compressed csv file into another format. I was wondering if I
> can do this in a multi thread approach.

Does the format conversion involve a significant processing time? If not,
the total time is dominated by the I/O time (reading and writing the file)
so it's doubtful you gain anything from multiple threads.

> Here is the pseudo code I was thinking about:
>
> Let T = Total number of lines in a file, Example 1000000 (1 million
> files)
> Let B = Total number of lines in a buffer, for example 10000 lines
>
>
> Create a thread to read until buffer
> Create another thread to read buffer+buffer ( So we have 2 threads
> now. But since the file is zipped I have to wait until the first
> thread is completed. Unless someone knows of a clever technique.
> Write the content of thread 1 into a numpy array
> Write the content of thread 2 into a numpy array

Can you process each line independently? Is the record order important? If
not (or at least, some local dis-ordering is acceptable) you may use a few
worker threads (doing the conversion), feed them thru a Queue object, put
the converted lines into another Queue, and let another thread write the
results onto the destination file.

import Queue, threading, csv

def convert(in_queue, out_queue):
while True:
row = in_queue.get()
if row is None: break
# ... convert row
out_queue.put(converted_line)

def write_output(out_queue):
while True:
line = out_queue.get()
if line is None: break
# ... write line to output file

in_queue = Queue.Queue()
out_queue = Queue.Queue()
tlist = []
for i in range(4):
t = threading.Thread(target=convert, args=(in_queue, out_queue))
t.start()
tlist.append(t)
output_thread = threading.Thread(target=write_output, args=(out_queue,))
output_thread.start()

with open("...") as csvfile:
reader = csv.reader(csvfile, ...)
for row in reader:
in_queue.put(row)

for t in tlist: in_queue.put(None) # indicate end-of-work
for t in tlist: t.join() # wait until finished
out_queue.put(None)
output_thread.join() # wait until finished

--
Gabriel Genellina

--
http://mail.python.org/mailman/listinfo/python-list


ldo at geek-central

Jun 30, 2009, 10:07 PM

Post #3 of 19 (559 views)
Permalink
Re: Multi thread reading a file [In reply to]

In message <mailman.2420.1246413152.8015.python-list[at]python.org>, Mag Gam
wrote:

> I am very new to python and I am in the process of loading a very
> large compressed csv file into another format. I was wondering if I
> can do this in a multi thread approach.

Why bother?

--
http://mail.python.org/mailman/listinfo/python-list


stefan_ml at behnel

Jun 30, 2009, 10:30 PM

Post #4 of 19 (568 views)
Permalink
Re: Multi thread reading a file [In reply to]

Gabriel Genellina wrote:
> En Tue, 30 Jun 2009 22:52:18 -0300, Mag Gam <magawake[at]gmail.com> escribió:
>
>> I am very new to python and I am in the process of loading a very
>> large compressed csv file into another format. I was wondering if I
>> can do this in a multi thread approach.
>
> Does the format conversion involve a significant processing time? If
> not, the total time is dominated by the I/O time (reading and writing
> the file) so it's doubtful you gain anything from multiple threads.

Well, the OP didn't say anything about multiple processors, so multiple
threads may not help wrt. processing time. However, if the file is large
and the OS can schedule the I/O in a way that a seek disaster is avoided
(although that's hard to assure with today's hard disk storage density, but
SSDs may benefit), multiple threads reading multiple partial streams may
still reduce the overall runtime due to increased I/O throughput.

That said, the OP was mentioning that the data was compressed, so I doubt
that the I/O bandwidth is a problem here. As another poster put it: why
bother? Run a few benchmarks first to see where (and if!) things really get
slow, and then check what to do about the real problem.

Stefan
--
http://mail.python.org/mailman/listinfo/python-list


magawake at gmail

Jul 1, 2009, 4:41 AM

Post #5 of 19 (559 views)
Permalink
Re: Multi thread reading a file [In reply to]

Thanks for the response Gabriel.



On Wed, Jul 1, 2009 at 12:54 AM, Gabriel
Genellina<gagsl-py2[at]yahoo.com.ar> wrote:
> En Tue, 30 Jun 2009 22:52:18 -0300, Mag Gam <magawake[at]gmail.com> escribió:
>
>> I am very new to python and I am in the process of loading a very
>> large compressed csv file into another format.  I was wondering if I
>> can do this in a multi thread approach.
>
> Does the format conversion involve a significant processing time? If not,
> the total time is dominated by the I/O time (reading and writing the file)
> so it's doubtful you gain anything from multiple threads.

The format does inolve significant time processing each line.
>
>> Here is the pseudo code I was thinking about:
>>
>> Let T  = Total number of lines in a file, Example 1000000 (1 million
>> files)
>> Let B = Total number of lines in a buffer, for example 10000 lines
>>
>>
>> Create a thread to read until buffer
>> Create another thread to read buffer+buffer  ( So we have 2 threads
>> now. But since the file is zipped I have to wait until the first
>> thread is completed. Unless someone knows of a clever technique.
>> Write the content of thread 1 into a numpy array
>> Write the content of thread 2 into a numpy array
>
> Can you process each line independently? Is the record order important? If
> not (or at least, some local dis-ordering is acceptable) you may use a few
> worker threads (doing the conversion), feed them thru a Queue object, put
> the converted lines into another Queue, and let another thread write the
> results onto the destination file.

Yes, each line can be independent. The original file is a time series
file which I am placing it into a Numpy array therefore I don't think
the record order is important. The writing is actually done when I
place a value into a "dset" object.


Let me show you what I mean.
reader=csv.reader(open("100000.csv"))
for s,row in enumerate(reader):
if s!=0 and s%bufsize==0:
dset[s-bufsize:s] = t #here is where I am writing the data to
the data structure. Using a range or broadcasting.

#15 columns
if len(row) != 15:
break

t[s%bufsize] = tuple(row)

#Do this all the way at the end for flushing.
if (s%bufsize != 0):
dset[(s//bufsize)*bufsize:s]=t[0:s%bufsize]












>
> import Queue, threading, csv
>
> def convert(in_queue, out_queue):
>  while True:
>    row = in_queue.get()
>    if row is None: break
>    # ... convert row
>    out_queue.put(converted_line)
>
> def write_output(out_queue):
>  while True:
>    line = out_queue.get()
>    if line is None: break
>    # ... write line to output file
>
> in_queue = Queue.Queue()
> out_queue = Queue.Queue()
> tlist = []
> for i in range(4):
>  t = threading.Thread(target=convert, args=(in_queue, out_queue))
>  t.start()
>  tlist.append(t)
> output_thread = threading.Thread(target=write_output, args=(out_queue,))
> output_thread.start()
>
> with open("...") as csvfile:
>  reader = csv.reader(csvfile, ...)
>  for row in reader:
>    in_queue.put(row)
>
> for t in tlist: in_queue.put(None) # indicate end-of-work
> for t in tlist: t.join() # wait until finished
> out_queue.put(None)
> output_thread.join() # wait until finished
>
> --
> Gabriel Genellina
>
> --
> http://mail.python.org/mailman/listinfo/python-list
>
--
http://mail.python.org/mailman/listinfo/python-list


Scott.Daniels at Acm

Jul 1, 2009, 8:49 AM

Post #6 of 19 (566 views)
Permalink
Re: Multi thread reading a file [In reply to]

Gabriel Genellina wrote:
> ...
> def convert(in_queue, out_queue):
> while True:
> row = in_queue.get()
> if row is None: break
> # ... convert row
> out_queue.put(converted_line)

These loops work well with the two-argument version of iter,
which is easy to forget, but quite useful to have in your bag
of tricks:

def convert(in_queue, out_queue):
for row in iter(in_queue.get, None):
# ... convert row
out_queue.put(converted_line)

--Scott David Daniels
Scott.Daniels[at]Acm.Org
--
http://mail.python.org/mailman/listinfo/python-list


magawake at gmail

Jul 1, 2009, 6:28 PM

Post #7 of 19 (559 views)
Permalink
Re: Multi thread reading a file [In reply to]

LOL! :-)

Why not. I think I will take just do it single thread for now and if
performance is really that bad I can re investigate.

Either way, thanks everyone for your feedback! I think I like python a
lot because of the great user community and wiliness to help!



On Wed, Jul 1, 2009 at 1:07 AM, Lawrence
D'Oliveiro<ldo[at]geek-central.gen.new_zealand> wrote:
> In message <mailman.2420.1246413152.8015.python-list[at]python.org>, Mag Gam
> wrote:
>
>> I am very new to python and I am in the process of loading a very
>> large compressed csv file into another format.  I was wondering if I
>> can do this in a multi thread approach.
>
> Why bother?
>
> --
> http://mail.python.org/mailman/listinfo/python-list
>
--
http://mail.python.org/mailman/listinfo/python-list


stefan_ml at behnel

Jul 1, 2009, 9:50 PM

Post #8 of 19 (558 views)
Permalink
Re: Multi thread reading a file [In reply to]

Mag Gam wrote:
> On Wed, Jul 1, 2009 at 12:54 AM, Gabriel Genellina wrote:
>> Does the format conversion involve a significant processing time? If not,
>> the total time is dominated by the I/O time (reading and writing the file)
>> so it's doubtful you gain anything from multiple threads.
>
> The format does inolve significant time processing each line.
>
>>> Here is the pseudo code I was thinking about:
>>>
>>> Let T = Total number of lines in a file, Example 1000000 (1 million
>>> files)
>>> Let B = Total number of lines in a buffer, for example 10000 lines
>>>
>>>
>>> Create a thread to read until buffer
>>> Create another thread to read buffer+buffer ( So we have 2 threads
>>> now. But since the file is zipped I have to wait until the first
>>> thread is completed. Unless someone knows of a clever technique.
>>> Write the content of thread 1 into a numpy array
>>> Write the content of thread 2 into a numpy array
>> Can you process each line independently? Is the record order important? If
>> not (or at least, some local dis-ordering is acceptable) you may use a few
>> worker threads (doing the conversion), feed them thru a Queue object, put
>> the converted lines into another Queue, and let another thread write the
>> results onto the destination file.
>
> Yes, each line can be independent. The original file is a time series
> file which I am placing it into a Numpy array therefore I don't think
> the record order is important. The writing is actually done when I
> place a value into a "dset" object.
>
> Let me show you what I mean.
> reader=csv.reader(open("100000.csv"))

Have you tried if using the csv reader here is actually fast enough for
you? Maybe you can just .split(',') each line or something (no idea what
else the csv reader may do, but anyway...)


> for s,row in enumerate(reader):
> if s!=0 and s%bufsize==0:
> dset[s-bufsize:s] = t #here is where I am writing the data to
> the data structure. Using a range or broadcasting.

Erm, what's "t"? And where do you think the "significant" processing time
comes from in your example? From your code, that's totally non-obvious to me.


> #15 columns
> if len(row) != 15:
> break
>
> t[s%bufsize] = tuple(row)
>
> #Do this all the way at the end for flushing.
> if (s%bufsize != 0):
> dset[(s//bufsize)*bufsize:s]=t[0:s%bufsize]

If you *really* think your code is not I/O bound, but that the code for
writing the data into the NumPy array is the slow part (which I actually
doubt, but anyway), you can try writing your code in Cython, which has
direct support for writing to NumPy arrays through their buffer interface.

http://docs.cython.org/docs/numpy_tutorial.html

Stefan
--
http://mail.python.org/mailman/listinfo/python-list


gagsl-py2 at yahoo

Jul 2, 2009, 3:10 AM

Post #9 of 19 (550 views)
Permalink
Re: Multi thread reading a file [In reply to]

En Wed, 01 Jul 2009 12:49:31 -0300, Scott David Daniels
<Scott.Daniels[at]acm.org> escribió:

> Gabriel Genellina wrote:
>> ...
>> def convert(in_queue, out_queue):
>> while True:
>> row = in_queue.get()
>> if row is None: break
>> # ... convert row
>> out_queue.put(converted_line)
>
> These loops work well with the two-argument version of iter,
> which is easy to forget, but quite useful to have in your bag
> of tricks:
>
> def convert(in_queue, out_queue):
> for row in iter(in_queue.get, None):
> # ... convert row
> out_queue.put(converted_line)

Yep, I always forget about that variant of iter() -- very handy!

--
Gabriel Genellina

--
http://mail.python.org/mailman/listinfo/python-list


rylesny at gmail

Jul 2, 2009, 7:10 PM

Post #10 of 19 (539 views)
Permalink
Re: Multi thread reading a file [In reply to]

On Jul 2, 6:10 am, "Gabriel Genellina" <gagsl-...@yahoo.com.ar> wrote:
> En Wed, 01 Jul 2009 12:49:31 -0300, Scott David Daniels  
> <Scott.Dani...@acm.org> escribió:
> > These loops work well with the two-argument version of iter,
> > which is easy to forget, but quite useful to have in your bag
> > of tricks:
>
> >      def convert(in_queue, out_queue):
> >          for row in iter(in_queue.get, None):
> >              # ... convert row
> >              out_queue.put(converted_line)
>
> Yep, I always forget about that variant of iter() -- very handy!

Yes, at first glance using iter() here seems quite elegant and clever.
You might even pat yourself on the back, or treat yourself to an ice
cream cone, as I once did. There is one subtle distinction, however.
Please allow me to demonstrate.

>>> import Queue
>>>
>>> queue = Queue.Queue()
>>>
>>> queue.put(1)
>>> queue.put("la la la")
>>> queue.put(None)
>>>
>>> list(iter(queue.get, None))
[1, 'la la la']
>>>
>>> # Cool, it really works! I'm going to change all my old code to use this... new and *improved*
...
>>> # And then one day your user inevitably does something like this.
...
>>> class A(object):
... def __init__(self, value):
... self.value = value
...
... def __eq__(self, other):
... return self.value == other.value
...
>>> queue.put(A(1))
>>> queue.put(None)
>>>
>>> # And then this happens inside your 'generic' code (which probably even passed your unit tests).
...
>>> list(iter(queue.get, None))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 5, in __eq__
AttributeError: 'NoneType' object has no attribute 'value'
>>>
>>> # Oh... yeah. I really *did* want 'is None' and not '== None' which iter() will do. Sorry guys!

Please don't let this happen to you too ;)
--
http://mail.python.org/mailman/listinfo/python-list


http://phr.cx at NOSPAM

Jul 2, 2009, 7:20 PM

Post #11 of 19 (539 views)
Permalink
Re: Multi thread reading a file [In reply to]

ryles <rylesny[at]gmail.com> writes:
> >>> # Oh... yeah. I really *did* want 'is None' and not '== None'
> >>> which iter() will do. Sorry guys!
>
> Please don't let this happen to you too ;)

None is a perfectly good value to put onto a queue. I prefer
using a unique sentinel to mark the end of the stream:

sentinel = object()
--
http://mail.python.org/mailman/listinfo/python-list


rylesny at gmail

Jul 2, 2009, 8:09 PM

Post #12 of 19 (539 views)
Permalink
Re: Multi thread reading a file [In reply to]

On Jul 2, 10:20 pm, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
> ryles <ryle...@gmail.com> writes:
> > >>> # Oh... yeah. I really *did* want 'is None' and not '== None'
> > >>> which iter() will do. Sorry guys!
>
> > Please don't let this happen to you too ;)
>
> None is a perfectly good value to put onto a queue.  I prefer
> using a unique sentinel to mark the end of the stream:
>
>    sentinel = object()

I agree, this is cleaner than None. We're still in the same boat,
though, regarding iter(). Either it's 'item == None' or 'item == object
()', and depending on the type, __eq__ can introduce some avoidable
risk.

FWIW, even object() has its disadvantages. Namely, it doesn't work for
multiprocessing.Queue which pickles and unpickles, thus giving you a
new object. One way to deal with this is to define a "Stopper" class
and type check objects taken from the queue. This is not news to
anyone who's worked with multiprocessing.Queue, though.
--
http://mail.python.org/mailman/listinfo/python-list


http://phr.cx at NOSPAM

Jul 2, 2009, 8:15 PM

Post #13 of 19 (535 views)
Permalink
Re: Multi thread reading a file [In reply to]

ryles <rylesny[at]gmail.com> writes:
> >    sentinel = object()
>
> I agree, this is cleaner than None. We're still in the same boat,
> though, regarding iter(). Either it's 'item == None' or 'item == object ()'

Use "item is sentinel".
--
http://mail.python.org/mailman/listinfo/python-list


gagsl-py2 at yahoo

Jul 2, 2009, 8:47 PM

Post #14 of 19 (535 views)
Permalink
Re: Multi thread reading a file [In reply to]

En Fri, 03 Jul 2009 00:15:40 -0300, <//phr.cx[at]nospam.invalid>> escribió:

> ryles <rylesny[at]gmail.com> writes:
>> >    sentinel = object()
>>
>> I agree, this is cleaner than None. We're still in the same boat,
>> though, regarding iter(). Either it's 'item == None' or 'item == object
>> ()'
>
> Use "item is sentinel".

We're talking about the iter() builtin behavior, and that uses ==
internally.

It could have used an identity test, and that would be better for this
specific case. But then iter(somefile.read, '') wouldn't work. A
compromise solution is required; since one can customize the equality test
but not an identity test, the former has a small advantage. (I don't know
if this was the actual reason, or even if this really was a concious
decision, but that's why *I* would choose == to test against the sentinel
value).

--
Gabriel Genellina

--
http://mail.python.org/mailman/listinfo/python-list


http://phr.cx at NOSPAM

Jul 2, 2009, 8:55 PM

Post #15 of 19 (537 views)
Permalink
Re: Multi thread reading a file [In reply to]

"Gabriel Genellina" <gagsl-py2[at]yahoo.com.ar> writes:
> We're talking about the iter() builtin behavior, and that uses ==
> internally.

Oh, I see. Drat.

> It could have used an identity test, and that would be better for this
> specific case. But then iter(somefile.read, '') wouldn't work.

Yeah, it should allow supplying a predicate instead of using == on
a value. How about (untested):

from itertools import *
...
for row in takewhile(lambda x: x is sentinel,
starmap(in_queue.get, repeat(()))):
...
--
http://mail.python.org/mailman/listinfo/python-list


rylesny at gmail

Jul 2, 2009, 9:13 PM

Post #16 of 19 (536 views)
Permalink
Re: Multi thread reading a file [In reply to]

On Jul 2, 11:55 pm, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
> Yeah, it should allow supplying a predicate instead of using == on
> a value.  How about (untested):
>
>    from itertools import *
>    ...
>    for row in takewhile(lambda x: x is sentinel,
>                          starmap(in_queue.get, repeat(()))):
>       ...

Yeah, it's a small recipe I'm sure a lot of others have written as
well. My old version:

def iterwhile(callable_, predicate):
""" Like iter() but with a predicate instead of a sentinel. """
return itertools.takewhile(predicate, repeatfunc(callable_))

where repeatfunc is as defined here:

http://docs.python.org/library/itertools.html#recipes

I wish all of these little recipes made their way into itertools or a
like module; itertools seems a bit tightly guarded.
--
http://mail.python.org/mailman/listinfo/python-list


ldo at geek-central

Jul 4, 2009, 5:12 PM

Post #17 of 19 (510 views)
Permalink
Re: Multi thread reading a file [In reply to]

In message <1beffd94-cfe6-4cf6-
bd48-2ccac8637796[at]j32g2000yqh.googlegroups.com>, ryles wrote:

>>>> # Oh... yeah. I really *did* want 'is None' and not '== None' which
>>>> # iter() will do. Sorry guys!
>
> Please don't let this happen to you too ;)

Strange. others have got told off for using "== None" instead of "is None"
<http://groups.google.co.nz/group/comp.lang.python/msg/a1f3170fa202af57>,
and yet it turns out Python itself does exactly the same thing.

--
http://mail.python.org/mailman/listinfo/python-list


steve at REMOVE-THIS-cybersource

Jul 4, 2009, 6:35 PM

Post #18 of 19 (510 views)
Permalink
Re: Multi thread reading a file [In reply to]

On Sun, 05 Jul 2009 12:12:22 +1200, Lawrence D'Oliveiro wrote:

> In message <1beffd94-cfe6-4cf6-
> bd48-2ccac8637796[at]j32g2000yqh.googlegroups.com>, ryles wrote:
>
>>>>> # Oh... yeah. I really *did* want 'is None' and not '== None' which
>>>>> # iter() will do. Sorry guys!
>>
>> Please don't let this happen to you too ;)
>
> Strange. others have got told off for using "== None" instead of "is
> None"
> <http://groups.google.co.nz/group/comp.lang.python/msg/
a1f3170fa202af57>,
> and yet it turns out Python itself does exactly the same thing.

That's not "strange", that's a bug. Did you report it to the tracker?



--
Steven
--
http://mail.python.org/mailman/listinfo/python-list


ldo at geek-central

Jul 5, 2009, 2:52 AM

Post #19 of 19 (490 views)
Permalink
Re: Multi thread reading a file [In reply to]

In message <025ff4f1$0$20657$c3e8da3[at]news.astraweb.com>, Steven D'Aprano wrote:

> On Sun, 05 Jul 2009 12:12:22 +1200, Lawrence D'Oliveiro wrote:
>
>> In message <1beffd94-cfe6-4cf6-bd48-2ccac8637796[at]j32g2000yqh.googlegroups.com>, ryles wrote:
>>
>>>>>> # Oh... yeah. I really *did* want 'is None' and not '== None' which
>>>>>> # iter() will do. Sorry guys!
>>>
>>> Please don't let this happen to you too ;)
>>
>> Strange. others have got told off for using "== None" instead of "is
>> None" <http://groups.google.co.nz/group/comp.lang.python/msg/a1f3170fa202af57>,
>> and yet it turns out Python itself does exactly the same thing.
>
> That's not "strange", that's a bug.

It's not a bug, as Gabriel Genellina has pointed out.

--
http://mail.python.org/mailman/listinfo/python-list

Python python RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact lists@gossamer-threads.com
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.