Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Python: Python

Cpython optimization

 

 

Python python RSS feed   Index | Next | Previous | View Threaded


qreesp at gmail

Oct 21, 2009, 10:28 AM

Post #1 of 15 (449 views)
Permalink
Cpython optimization

Hello

As my Master's dissertation I chose Cpython optimization. That's why
i'd like to ask what are your suggestions what can be optimized. Well,
I know that quite a lot. I've downloaded the source code (I plan to
work on Cpython 2.6 and I've downloaded 2.6.3 release). By looking at
the code I've found comment's like "this can be optimized by..." etc.
but maybe you guide me what should I concentrate on in my work?

I've 6-7 month for this and If I create something decent I can
publish it.

Thank you in advance for any help
--
http://mail.python.org/mailman/listinfo/python-list


python at mrabarnett

Oct 22, 2009, 8:27 AM

Post #2 of 15 (406 views)
Permalink
Re: Cpython optimization [In reply to]

Olof Bjarnason wrote:
[snip]
> A short question after having read through most of this thread, on the
> same subject (time-optimizing CPython):
>
> http://mail.python.org/pipermail/python-list/2007-September/098964.html
>
> We are experiencing multi-core processor kernels more and more these
> days. But they are all still connected to the main memory, right?
>
> To me that means, even though some algorithm can be split up into
> several threads that run on different cores of the processor, that any
> algorithm will be memory-speed limited. And memory access is a quite
> common operation for most algorithms.
>
> Then one could ask oneself: what is the point of multiple cores, if
> memory bandwidth is the bottleneck? Specifically, what makes one expect
> any speed gain from parallelizing a sequential algorithm into four
> threads, say, when the memory shuffling is the same speed in both
> scenarios? (Assuming memory access is much slower than ADDs, JMPs and
> such instructions - a quite safe assumption I presume)
>
> [. If every core had it's own primary memory, the situation would be
> different. It would be more like the situation in a distributed/internet
> based system, spread over several computers. One could view each core as
> a separate computer actually ]
>
Don't forget about the on-chip cache! :-)
--
http://mail.python.org/mailman/listinfo/python-list


stefan_ml at behnel

Oct 22, 2009, 12:16 PM

Post #3 of 15 (418 views)
Permalink
Re: Cpython optimization [In reply to]

Qrees wrote:
>> http://www.cython.org/
>
> I was thinking about sacrificing some flexibility of Python and thank
> you for pointing me to this project. I didn't about it.
> [...]
> BTW: My seminar deals with object oriented programming.

It's actually not hard to extend the set of optimisations in Cython, and
it's plain object oriented programming against the parse tree. Basically
just finding a code pattern and transforming it into something that does
the same thing, but faster.

It's also a lot of fun, and it certainly feels good to know that your tiny
tweak just made tons of other programs a bit faster.

Stefan
--
http://mail.python.org/mailman/listinfo/python-list


debatem1 at gmail

Oct 22, 2009, 1:27 PM

Post #4 of 15 (417 views)
Permalink
Re: Cpython optimization [In reply to]

On Wed, Oct 21, 2009 at 1:28 PM, Qrees <qreesp [at] gmail> wrote:
> Hello
>
> As my Master's dissertation I chose Cpython optimization. That's why
> i'd like to ask what are your suggestions what can be optimized. Well,
> I know that quite a lot. I've downloaded the source code (I plan to
> work on Cpython 2.6 and I've downloaded 2.6.3 release). By looking at
> the code I've found comment's like "this can be optimized by..." etc.
> but maybe you guide me what should I concentrate on in my work?
>
> I've 6-7 month  for this and If I create something decent I can
> publish it.
>
> Thank you in advance for any help

Could always port it to CUDA ;)

Geremy Condra
--
http://mail.python.org/mailman/listinfo/python-list


tjreedy at udel

Oct 23, 2009, 12:14 AM

Post #5 of 15 (406 views)
Permalink
Re: Cpython optimization [In reply to]

Qrees wrote:
> Hello
>
> As my Master's dissertation I chose Cpython optimization. That's why
> i'd like to ask what are your suggestions what can be optimized. Well,
> I know that quite a lot. I've downloaded the source code (I plan to
> work on Cpython 2.6 and I've downloaded 2.6.3 release). By looking at
> the code I've found comment's like "this can be optimized by..." etc.
> but maybe you guide me what should I concentrate on in my work?

2.6.3 is about to be replaced with 2.6.4, but even that will not have
improvements that have already been added to 3.x or 2.7. I strongly
suggest that you work with the 3.2 code, since there is then a maximal
chance that your improvements will actually be adopted.
tjr

--
http://mail.python.org/mailman/listinfo/python-list


olof.bjarnason at gmail

Oct 23, 2009, 12:45 AM

Post #6 of 15 (406 views)
Permalink
Re: Cpython optimization [In reply to]

2009/10/22 MRAB <python [at] mrabarnett>

> Olof Bjarnason wrote:
> [snip]
>
> A short question after having read through most of this thread, on the
>> same subject (time-optimizing CPython):
>>
>> http://mail.python.org/pipermail/python-list/2007-September/098964.html
>>
>> We are experiencing multi-core processor kernels more and more these days.
>> But they are all still connected to the main memory, right?
>>
>> To me that means, even though some algorithm can be split up into several
>> threads that run on different cores of the processor, that any algorithm
>> will be memory-speed limited. And memory access is a quite common operation
>> for most algorithms.
>>
>> Then one could ask oneself: what is the point of multiple cores, if memory
>> bandwidth is the bottleneck? Specifically, what makes one expect any speed
>> gain from parallelizing a sequential algorithm into four threads, say, when
>> the memory shuffling is the same speed in both scenarios? (Assuming memory
>> access is much slower than ADDs, JMPs and such instructions - a quite safe
>> assumption I presume)
>>
>> [. If every core had it's own primary memory, the situation would be
>> different. It would be more like the situation in a distributed/internet
>> based system, spread over several computers. One could view each core as a
>> separate computer actually ]
>>
>> Don't forget about the on-chip cache! :-)
>

Sorry for continuing slightly OT:

Yes, that makes matters even more interesting.

Caches for single-cpu-boards speed up memory access quite dramatically. Are
caches for multi-core boards shared among the cores? Or do each core have a
separate cache? I can only imagine how complicated the read/write logic must
be of these tiny electronic devices, in any case.

Of course caches makes the memory access-operations must faster, but I'm
guessing register instructions are still orders of magnitude faster than
(cached) memory access. (or else registers would not really be needed - you
could just view the whole primary memory as an array of registers!)

So I think my first question is still interesting: What is the point of
multiple cores, if memory is the bottleneck?
(it helps to think of algorithms such as line-drawing or ray-tracing, which
is easy to parallellize, yet I believe are still faster using a single core
instead of multiple because of the read/write-to-memory-bottleneck. It does
help to bring more workers to the mine if only one is allowed access at a
time, or more likely, several are allowed yet it gets so crowded that
queues/waiting is inevitable)

--

> http://mail.python.org/mailman/listinfo/python-list
>



--
twitter.com/olofb
olofb.wordpress.com
olofb.wordpress.com/tag/english


olof.bjarnason at gmail

Oct 23, 2009, 12:46 AM

Post #7 of 15 (404 views)
Permalink
Re: Cpython optimization [In reply to]

2009/10/23 Olof Bjarnason <olof.bjarnason [at] gmail>

>
>
> 2009/10/22 MRAB <python [at] mrabarnett>
>
> Olof Bjarnason wrote:
>> [snip]
>>
>> A short question after having read through most of this thread, on the
>>> same subject (time-optimizing CPython):
>>>
>>> http://mail.python.org/pipermail/python-list/2007-September/098964.html
>>>
>>> We are experiencing multi-core processor kernels more and more these
>>> days. But they are all still connected to the main memory, right?
>>>
>>> To me that means, even though some algorithm can be split up into several
>>> threads that run on different cores of the processor, that any algorithm
>>> will be memory-speed limited. And memory access is a quite common operation
>>> for most algorithms.
>>>
>>> Then one could ask oneself: what is the point of multiple cores, if
>>> memory bandwidth is the bottleneck? Specifically, what makes one expect any
>>> speed gain from parallelizing a sequential algorithm into four threads, say,
>>> when the memory shuffling is the same speed in both scenarios? (Assuming
>>> memory access is much slower than ADDs, JMPs and such instructions - a quite
>>> safe assumption I presume)
>>>
>>> [. If every core had it's own primary memory, the situation would be
>>> different. It would be more like the situation in a distributed/internet
>>> based system, spread over several computers. One could view each core as a
>>> separate computer actually ]
>>>
>>> Don't forget about the on-chip cache! :-)
>>
>
> Sorry for continuing slightly OT:
>
> Yes, that makes matters even more interesting.
>
> Caches for single-cpu-boards speed up memory access quite dramatically. Are
> caches for multi-core boards shared among the cores? Or do each core have a
> separate cache? I can only imagine how complicated the read/write logic must
> be of these tiny electronic devices, in any case.
>
> Of course caches makes the memory access-operations must faster, but I'm
> guessing register instructions are still orders of magnitude faster than
> (cached) memory access. (or else registers would not really be needed - you
> could just view the whole primary memory as an array of registers!)
>
> So I think my first question is still interesting: What is the point of
> multiple cores, if memory is the bottleneck?
> (it helps to think of algorithms such as line-drawing or ray-tracing, which
> is easy to parallellize, yet I believe are still faster using a single core
> instead of multiple because of the read/write-to-memory-bottleneck. It does
> help to bring more workers to the mine if
>

um typo "It does NOT help to .."

only one is allowed access at a time, or more likely, several are allowed
> yet it gets so crowded that queues/waiting is inevitable)
>
> --
>
>> http://mail.python.org/mailman/listinfo/python-list
>>
>
>
>
> --
> twitter.com/olofb
> olofb.wordpress.com
> olofb.wordpress.com/tag/english
>
>


--
twitter.com/olofb
olofb.wordpress.com
olofb.wordpress.com/tag/english


stefan_ml at behnel

Oct 23, 2009, 2:22 AM

Post #8 of 15 (404 views)
Permalink
Re: Cpython optimization [In reply to]

> Olof Bjarnason wrote:
> [snip]
>> A short question after having read through most of this thread, on the
>> same subject (time-optimizing CPython):
>>
>> http://mail.python.org/pipermail/python-list/2007-September/098964.html
>>
>> We are experiencing multi-core processor kernels more and more these
>> days. But they are all still connected to the main memory, right?
>>
>> To me that means, even though some algorithm can be split up into
>> several threads that run on different cores of the processor, that any
>> algorithm will be memory-speed limited. And memory access is a quite
>> common operation for most algorithms.
>>
>> Then one could ask oneself: what is the point of multiple cores, if
>> memory bandwidth is the bottleneck? Specifically, what makes one
>> expect any speed gain from parallelizing a sequential algorithm into
>> four threads, say, when the memory shuffling is the same speed in both
>> scenarios? (Assuming memory access is much slower than ADDs, JMPs and
>> such instructions - a quite safe assumption I presume)

Modern (multi-core) processors have several levels of caches that interact
with the other cores in different ways.

You should read up on NUMA.

http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access

Stefan
--
http://mail.python.org/mailman/listinfo/python-list


olof.bjarnason at gmail

Oct 23, 2009, 4:50 AM

Post #9 of 15 (400 views)
Permalink
Re: Cpython optimization [In reply to]

2009/10/23 Stefan Behnel <stefan_ml [at] behnel>

> > Olof Bjarnason wrote:
> > [snip]
> >> A short question after having read through most of this thread, on the
> >> same subject (time-optimizing CPython):
> >>
> >> http://mail.python.org/pipermail/python-list/2007-September/098964.html
> >>
> >> We are experiencing multi-core processor kernels more and more these
> >> days. But they are all still connected to the main memory, right?
> >>
> >> To me that means, even though some algorithm can be split up into
> >> several threads that run on different cores of the processor, that any
> >> algorithm will be memory-speed limited. And memory access is a quite
> >> common operation for most algorithms.
> >>
> >> Then one could ask oneself: what is the point of multiple cores, if
> >> memory bandwidth is the bottleneck? Specifically, what makes one
> >> expect any speed gain from parallelizing a sequential algorithm into
> >> four threads, say, when the memory shuffling is the same speed in both
> >> scenarios? (Assuming memory access is much slower than ADDs, JMPs and
> >> such instructions - a quite safe assumption I presume)
>
> Modern (multi-core) processors have several levels of caches that interact
> with the other cores in different ways.
>
> You should read up on NUMA.
>
> http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access
>

Thanks that was a good read. Basically NUMA addresses the problems I mention
by creating several primary memories (called banks) - making the motherboard
contain several computers (=cpu+primary memory) instead of one primary
memory and several cores, if I read it correctly.


> Stefan
> --
> http://mail.python.org/mailman/listinfo/python-list
>



--
twitter.com/olofb
olofb.wordpress.com
olofb.wordpress.com/tag/english


solipsis at pitrou

Oct 23, 2009, 5:30 AM

Post #10 of 15 (399 views)
Permalink
Re: Cpython optimization [In reply to]

Le Fri, 23 Oct 2009 09:45:06 +0200, Olof Bjarnason a écrit :
>
> So I think my first question is still interesting: What is the point of
> multiple cores, if memory is the bottleneck?

Why do you think it is, actually? Some workloads are CPU-bound, some
others are memory- or I/O-bound.

You will find plenty of benchmarks on the Web showing that some workloads
scale almost linearly to the number of CPU cores, while some don't.

--
http://mail.python.org/mailman/listinfo/python-list


olof.bjarnason at gmail

Oct 23, 2009, 6:53 AM

Post #11 of 15 (399 views)
Permalink
Re: Cpython optimization [In reply to]

2009/10/23 Antoine Pitrou <solipsis [at] pitrou>

> Le Fri, 23 Oct 2009 09:45:06 +0200, Olof Bjarnason a écrit :
> >
> > So I think my first question is still interesting: What is the point of
> > multiple cores, if memory is the bottleneck?
>
> Why do you think it is, actually? Some workloads are CPU-bound, some
> others are memory- or I/O-bound.
>

Big I/O operations have even less chance of gaining anything from being
parallellized (quite the opposite I would guess due to synchronization
overhead).

Operations that uses very little memory, say some mathematical computations
that can fit in 16 registers, have a lot to gain in being parallellized, of
course. Maybe computing small-formulae fractals is a good example of this?

My question was regarding the most hard-to-tell and, sadly also the most
common, trying to speed up algorithms that work over big data structures in
primary memory. Things like drawing lines, doing image processing,
multiplication of large matrices, ray tracing.... Basically anything with
large amounts of input information/buffers/structures, stored in primary
memory.

This would be way to speed up things in an image processing algorithm:
1. divide the image into four subimages
2. let each core process each part independently
3. fix&merge (along split lines for example) into a resulting, complete
image

Notice that no gain would be achieved if the memory was shared among the
cores, at least not if the operation-per-pixel is faster than the read/write
into memory.


> You will find plenty of benchmarks on the Web showing that some workloads
> scale almost linearly to the number of CPU cores, while some don't.
>
> --
> http://mail.python.org/mailman/listinfo/python-list
>



--
twitter.com/olofb
olofb.wordpress.com
olofb.wordpress.com/tag/english


solipsis at pitrou

Oct 23, 2009, 10:42 AM

Post #12 of 15 (393 views)
Permalink
Re: Cpython optimization [In reply to]

Le Fri, 23 Oct 2009 15:53:02 +0200, Olof Bjarnason a écrit :
>
> This would be way to speed up things in an image processing algorithm:
> 1. divide the image into four subimages 2. let each core process each
> part independently 3. fix&merge (along split lines for example) into a
> resulting, complete image

Well, don't assume you're the first to think about that.
I'm sure that performance-conscious image processing software already has
this kind of tile-based optimizations.
(actually, if you look at benchmarks of 3D rendering which are regularly
done by "enthusiast" websites, it shows exactly that)

Antoine.

--
http://mail.python.org/mailman/listinfo/python-list


olof.bjarnason at gmail

Oct 23, 2009, 11:31 AM

Post #13 of 15 (389 views)
Permalink
Re: Cpython optimization [In reply to]

>
> >
> > This would be way to speed up things in an image processing algorithm:
> > 1. divide the image into four subimages 2. let each core process each
> > part independently 3. fix&merge (along split lines for example) into a
> > resulting, complete image
>
> Well, don't assume you're the first to think about that.
> I'm sure that performance-conscious image processing software already has
> this kind of tile-based optimizations.
> (actually, if you look at benchmarks of 3D rendering which are regularly
> done by "enthusiast" websites, it shows exactly that)
>
> No I didn't assume I was the first to think about that - I wanted to learn
more about how optimization at all is possible/viable with multi-core
motherboards, when the memory speed is the bottleneck anyway, regardless of
smart caching technologies.

I still have not received a convincing answer :)

Antoine.

--

> http://mail.python.org/mailman/listinfo/python-list
>



--
twitter.com/olofb
olofb.wordpress.com
olofb.wordpress.com/tag/english


jasonsewall at gmail

Oct 23, 2009, 11:53 AM

Post #14 of 15 (387 views)
Permalink
Re: Cpython optimization [In reply to]

On Fri, Oct 23, 2009 at 2:31 PM, Olof Bjarnason
<olof.bjarnason [at] gmail> wrote:
>> >
>> > This would be way to speed up things in an image processing algorithm:
>> > 1. divide the image into four subimages 2. let each core process each
>> > part independently 3. fix&merge (along split lines for example) into a
>> > resulting, complete image
>>
>> Well, don't assume you're the first to think about that.
>> I'm sure that performance-conscious image processing software already has
>> this kind of tile-based optimizations.
>> (actually, if you look at benchmarks of 3D rendering which are regularly
>> done by "enthusiast" websites, it shows exactly that)

This is indeed a tried-and-true method for parallelizing certain image
and other grid-based algorithms, but it is in fact not appropriate for
a wide variety of techniques. Things like median filters, where f(A|B)
!= f(A)|f(B) (with | as some sort of concatenation), will not be able
to generate correct results given the scheme you outlined.

> No I didn't assume I was the first to think about that - I wanted to learn
> more about how optimization at all is possible/viable with multi-core
> motherboards, when the memory speed is the bottleneck anyway, regardless of
> smart caching technologies.
>
> I still have not received a convincing answer :)

Give Ulrich Drepper's "What Every Programmer Should Know about Memory"
a read (http://people.redhat.com/drepper/cpumemory.pdf) and you'll
hear all you want to know (and more) about how the memory hierarchy
plays with multi-core.

I don't contribute to CPython, but I am virtually certain that they
are not interested in having the compiler/interpreter try to apply
some generic threading to arbitrary code. The vast majority of Python
code wouldn't benefit from it even if it worked well, and I'm _very_
skeptical that there is any silver bullet for parallelizing general
code. If you think of one, tell Intel, they'll hire you.

_Perhaps_ the numpy or scipy people (I am not associated with either
of them) would be interested in some threading for certain array
operations. Maybe you could write something on top of what they have
to speed up array ops.

Jason
--
http://mail.python.org/mailman/listinfo/python-list


patrol at invisibleroads

Nov 6, 2009, 9:33 AM

Post #15 of 15 (304 views)
Permalink
Re: Cpython optimization [In reply to]

On 10/21/2009 01:28 PM, Qrees wrote:
> As my Master's dissertation I chose Cpython optimization. That's why
> i'd like to ask what are your suggestions what can be optimized. Well,
> I know that quite a lot. I've downloaded the source code (I plan to
> work on Cpython 2.6 and I've downloaded 2.6.3 release). By looking at
> the code I've found comment's like "this can be optimized by..." etc.
> but maybe you guide me what should I concentrate on in my work?
>
> I've 6-7 month for this and If I create something decent I can
> publish it.
>

Hi Qrees,

Have you heard of the Unladen Swallow project? It is an optimization
branch of CPython. Perhaps you can talk to some of the people working on
the project. http://code.google.com/p/unladen-swallow/

RHH


--
Get customized Python tutorials and live instruction for your business,
research or programming problem
Tuesday November 17 2009 6:30pm – 10pm
NYC Seminar and Conference Center (71 W 23rd St @ 6th Ave, New York, NY)
11/12 slots available for $30 each
http://invisibleroads.com
--
http://mail.python.org/mailman/listinfo/python-list

Python python RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.