
Mailing List Archive: Python: Dev

C-level duck typing

 

 



mark at hotpy
May 17, 2012, 3:38 AM
Post #26 of 34
Re: C-level duck typing

Dag Sverre Seljebotn wrote:
> On 05/16/2012 10:24 PM, Robert Bradshaw wrote:
>> On Wed, May 16, 2012 at 11:33 AM, "Martin v.
>> Löwis"<martin [at] v> wrote:
>>>> Does this use case make sense to everyone?
>>>>
>>>> The reason why we are discussing this on python-dev is that we are
>>>> looking
>>>> for a general way to expose these C level signatures within the Python
>>>> ecosystem. And Dag's idea was to expose them as part of the type
>>>> object,
>>>> basically as an addition to the current Python level tp_call() slot.
>>>
>>> The use case makes sense, yet there is also a long-standing solution
>>> already
>>> to expose APIs and function pointers: the capsule objects.
>>>
>>> If you want to avoid dictionary lookups on the server side, implement
>>> tp_getattro, comparing addresses of interned strings.
>>
>> Yes, that's an idea worth looking at. The point about implementing
>> tp_getattro to avoid dictionary lookup overhead is a good one, worth
>> trying at least. One drawback is that this approach does require the
>> GIL (as does _PyType_Lookup).
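
For concreteness, a rough sketch of what that interned-string fast path
in tp_getattro could look like; the attribute name and the wrapped C
function below are made up purely for illustration:

    #include <Python.h>

    /* Hypothetical C-level function that the type wants to expose. */
    static double example_impl(double x)
    {
        return x * x;
    }

    /* Interned once, then compared by address in provider_getattro(). */
    static PyObject *interned_name = NULL;

    static PyObject *
    provider_getattro(PyObject *self, PyObject *name)
    {
        if (interned_name == NULL) {
            interned_name = PyUnicode_InternFromString("_c_signature");
            if (interned_name == NULL)
                return NULL;
        }
        /* Interned strings are unique objects, so when the caller's name
           is interned as well, a pointer comparison replaces the dict
           lookup. */
        if (name == interned_name)
            return PyCapsule_New((void *)example_impl,
                                 "example._c_signature", NULL);
        /* Everything else goes through the normal attribute machinery. */
        return PyObject_GenericGetAttr(self, name);
    }
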
>>
>> Regarding the C function being faster than the dictionary lookup (or
>> at least close enough that the lookup takes time), yes, this happens
>> all the time. For example one might be solving differential equations
>> and the "user input" is essentially a set of (usually simple) double
>> f(double) and its derivatives.
>
> To underline how this is performance critical to us, perhaps a full
> Cython example is useful.
>
> The following Cython code is a real world usecase. It is not too
> contrived in the essentials, although simplified a little bit. For
> instance undergrad engineering students could pick up Cython just to
> play with simple scalar functions like this.
>
> from numpy import sin
> # assume sin is a Python callable and that NumPy decides to support
> # our spec to also support getting a "double (*sinfuncptr)(double)".
>
> # Our mission: avoid having the user manually import "sin" from C,
> # but allow just using the NumPy object and still be fast.
>
> # define a function to integrate
> cpdef double f(double x):
>     return sin(x * x)  # guess on signature and use "fastcall"!
>
> # the integrator
> def integrate(func, double a, double b, int n):
>     cdef double s = 0
>     cdef double dx = (b - a) / n
>     for i in range(n):
>         # This is also a fastcall, but can be cached so doesn't
>         # matter...
>         s += func(a + i * dx)
>     return s * dx
>
> integrate(f, 0, 1, 1000000)
>
> There are two problems here:
>
> - The "sin" global can be reassigned (monkey-patched) between each call
> to "f", no way for "f" to know. Even "sin" could do the reassignment. So
> you'd need to check for reassignment to do caching...

Since Cython allows static typing why not just declare that func can
treat sin as if it can't be monkeypatched?
Moving the load of a global variable out of the loop does seem to be a
rather obvious optimisation, if it were declared to be legal.

>
> - The fastcall inside of "f" is separated from the loop in "integrate".
> And since "f" is often in another module, we can't rely on static full
> program analysis.
>
> These problems with monkey-patching disappear if the lookup is negligible.
>
> Some rough numbers:
>
> - The overhead with the tp_flags hack is a 2 ns overhead (something
> similar with a metaclass, the problems are more how to synchronize that
> metaclass across multiple 3rd party libraries)

Does your approach handle subtyping properly?

>
> - Dict lookup 20 ns

Did you time _PyType_Lookup() ?

>
> - The sin function is about 35 ns. And, "f" is probably only 2-3 ns,
> and there could very easily be multiple such functions, defined in
> different modules, in a chain, in order to build up a formula.
>

Such micro timings are meaningless, because the working set often tends
to fit in the hardware cache. A level 2 cache miss can take hundreds of cycles.


Cheers,
Mark.


stefan_ml at behnel
May 17, 2012, 5:14 AM
Post #27 of 34
Re: C-level duck typing

Mark Shannon, 17.05.2012 12:38:
> Dag Sverre Seljebotn wrote:
>> On 05/16/2012 10:24 PM, Robert Bradshaw wrote:
>>> On Wed, May 16, 2012 at 11:33 AM, "Martin v. Löwis"<martin [at] v>
>>> wrote:
>>>>> Does this use case make sense to everyone?
>>>>>
>>>>> The reason why we are discussing this on python-dev is that we are
>>>>> looking
>>>>> for a general way to expose these C level signatures within the Python
>>>>> ecosystem. And Dag's idea was to expose them as part of the type object,
>>>>> basically as an addition to the current Python level tp_call() slot.
>>>>
>>>> The use case makes sense, yet there is also a long-standing solution
>>>> already
>>>> to expose APIs and function pointers: the capsule objects.
>>>>
>>>> If you want to avoid dictionary lookups on the server side, implement
>>>> tp_getattro, comparing addresses of interned strings.
>>>
>>> Yes, that's an idea worth looking at. The point about implementing
>>> tp_getattro to avoid dictionary lookup overhead is a good one, worth
>>> trying at least. One drawback is that this approach does require the
>>> GIL (as does _PyType_Lookup).
>>>
>>> Regarding the C function being faster than the dictionary lookup (or
>>> at least close enough that the lookup takes time), yes, this happens
>>> all the time. For example one might be solving differential equations
>>> and the "user input" is essentially a set of (usually simple) double
>>> f(double) and its derivatives.
>>
>> To underline how this is performance critical to us, perhaps a full
>> Cython example is useful.
>>
>> The following Cython code is a real world usecase. It is not too
>> contrived in the essentials, although simplified a little bit. For
>> instance undergrad engineering students could pick up Cython just to play
>> with simple scalar functions like this.
>>
>> from numpy import sin
>> # assume sin is a Python callable and that NumPy decides to support
>> # our spec to also support getting a "double (*sinfuncptr)(double)".
>>
>> # Our mission: avoid having the user manually import "sin" from C,
>> # but allow just using the NumPy object and still be fast.
>>
>> # define a function to integrate
>> cpdef double f(double x):
>>     return sin(x * x)  # guess on signature and use "fastcall"!
>>
>> # the integrator
>> def integrate(func, double a, double b, int n):
>>     cdef double s = 0
>>     cdef double dx = (b - a) / n
>>     for i in range(n):
>>         # This is also a fastcall, but can be cached so doesn't
>>         # matter...
>>         s += func(a + i * dx)
>>     return s * dx
>>
>> integrate(f, 0, 1, 1000000)
>>
>> There are two problems here:
>>
>> - The "sin" global can be reassigned (monkey-patched) between each call
>> to "f", no way for "f" to know. Even "sin" could do the reassignment. So
>> you'd need to check for reassignment to do caching...
>
> Since Cython allows static typing why not just declare that func can treat
> sin as if it can't be monkeypatched?

You'd simply say

cdef object sin # declare it as a C variable of type 'object'
from numpy import sin

That's also the one obvious way to do it in Cython.


> Moving the load of a global variable out of the loop does seem to be a
> rather obvious optimisation, if it were declared to be legal.

My proposal was to simply extract any C function pointers at assignment
time, i.e. at import time in the example above. Signature matching can then
be done at the first call and the result can be cached as long as the
object variable isn't changed. All of that is local to the module and can
thus easily be controlled at code generation time.
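
To make that concrete, here is roughly what the generated C for the
example module could do. The Provider_GetFuncPtr() helper and the "d(d)"
signature string are placeholders for whatever the spec ends up defining,
so treat this as a sketch rather than actual Cython output:

    #include <Python.h>

    typedef double (*scalar_func)(double);

    /* Placeholder for the spec's lookup: returns a C function pointer if
       'obj' exposes one matching 'signature', NULL otherwise. */
    extern void *Provider_GetFuncPtr(PyObject *obj, const char *signature);

    static PyObject   *sin_obj   = NULL;  /* the module-level 'sin' */
    static scalar_func sin_cfunc = NULL;  /* cache, valid until reassignment */

    /* Runs on every assignment to the module-level 'sin'. */
    static void
    assign_sin(PyObject *value)
    {
        Py_XINCREF(value);
        Py_XDECREF(sin_obj);
        sin_obj = value;
        sin_cfunc = NULL;                 /* invalidate the cached pointer */
    }

    static double
    call_sin(double x)
    {
        PyObject *arg, *res;
        double result;

        if (sin_cfunc == NULL)
            /* First call since the last assignment: match the signature
               once and remember the result. */
            sin_cfunc = (scalar_func)Provider_GetFuncPtr(sin_obj, "d(d)");
        if (sin_cfunc != NULL)
            return sin_cfunc(x);          /* fast path: plain C call */

        /* Slow path: the ordinary Python call protocol. */
        arg = PyFloat_FromDouble(x);
        if (arg == NULL)
            return -1.0;                  /* error handling elided here */
        res = PyObject_CallFunctionObjArgs(sin_obj, arg, NULL);
        Py_DECREF(arg);
        if (res == NULL)
            return -1.0;
        result = PyFloat_AsDouble(res);
        Py_DECREF(res);
        return result;
    }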

Stefan



d.s.seljebotn at astro
May 17, 2012, 11:13 AM
Post #28 of 34
Re: C-level duck typing

Mark Shannon <mark [at] hotpy> wrote:
>Dag Sverre Seljebotn wrote:
>> from numpy import sin
>> # assume sin is a Python callable and that NumPy decides to support
>> # our spec to also support getting a "double (*sinfuncptr)(double)".
>>
>> # Our mission: avoid having the user manually import "sin" from C,
>> # but allow just using the NumPy object and still be fast.
>>
>> # define a function to integrate
>> cpdef double f(double x):
>>     return sin(x * x)  # guess on signature and use "fastcall"!
>>
>> # the integrator
>> def integrate(func, double a, double b, int n):
>>     cdef double s = 0
>>     cdef double dx = (b - a) / n
>>     for i in range(n):
>>         # This is also a fastcall, but can be cached so doesn't
>>         # matter...
>>         s += func(a + i * dx)
>>     return s * dx
>>
>> integrate(f, 0, 1, 1000000)
>>
>> There are two problems here:
>>
>> - The "sin" global can be reassigned (monkey-patched) between each
>call
>> to "f", no way for "f" to know. Even "sin" could do the reassignment.
>So
>> you'd need to check for reassignment to do caching...
>
>Since Cython allows static typing why not just declare that func can
>treat sin as if it can't be monkeypatched?

If you want to manually declare stuff, you can always use a C function
pointer too...

>Moving the load of a global variable out of the loop does seem to be a
>rather obvious optimisation, if it were declared to be legal.

In case you didn't notice, there were no global variable loads inside the
loop...

You can keep chasing this, but there are *always* cases where such
optimisations don't apply (and you need to save the situation by manual
typing).

Anyway: We should really discuss Cython on the Cython list. If my
motivating example wasn't good enough for you there's really nothing I
can do.

>> Some rough numbers:
>>
>> - The overhead with the tp_flags hack is a 2 ns overhead (something
>> similar with a metaclass, the problems are more how to synchronize
>that
>> metaclass across multiple 3rd party libraries)
>
>Does your approach handle subtyping properly?

Not really.

>>
>> - Dict lookup 20 ns
>
>Did you time _PyType_Lookup() ?

No, didn't get around to it yet (and thanks for pointing it out).
(Though the GIL requirement is an issue too for Cython.)

>> - The sin function is about 35 ns. And, "f" is probably only 2-3 ns,
>> and there could very easily be multiple such functions, defined in
>> different modules, in a chain, in order to build up a formula.
>>
>
>Such micro timings are meaningless, because the working set often tends
>to fit in the hardware cache. A level 2 cache miss can take hundreds of
>cycles.

I find this sort of response arrogant -- do you know the details of
every usecase for a programming language under the sun?

Many Cython users are scientists. And in scientific computing in
particular you *really* have the whole range of problems and working
sets. Honestly. In some codes you only really care about the speed of
the disk controller. In other cases you can spend *many seconds* working
almost only in L1 or perhaps L2 cache (for instance when integrating
ordinary differential equations in a few variables, which is not
entirely different in nature from the example I posted). (Then, those
many seconds are replicated many million times for different parameters
on a large cluster, and a 2x speedup translates directly into large
amounts of saved money.)

Also, with numerical codes you block up the problem so that loads to L2
are amortized over sufficient FLOPs (when you can).

Every time Cython becomes able to do stuff more easily in this domain,
people thank us that they didn't have to dig up Fortran but can stay
closer to Python.

Sorry for going off on a rant. I find that people will give well-meant
advice about performance, but that advice is just generalizing from
computer programs in entirely different domains (web apps?), and
sweeping generalizations have a way of giving the wrong answer.

Dag


d.s.seljebotn at astro
May 17, 2012, 11:34 AM
Post #29 of 34
Re: C-level duck typing

On 05/17/2012 08:13 PM, Dag Sverre Seljebotn wrote:
> Mark Shannon <mark [at] hotpy> wrote:
>> Dag Sverre Seljebotn wrote:
>>> from numpy import sin
>>> # assume sin is a Python callable and that NumPy decides to support
>>> # our spec to also support getting a "double (*sinfuncptr)(double)".
>>>
>>> # Our mission: avoid having the user manually import "sin" from C,
>>> # but allow just using the NumPy object and still be fast.
>>>
>>> # define a function to integrate
>>> cpdef double f(double x):
>>>     return sin(x * x)  # guess on signature and use "fastcall"!
>>>
>>> # the integrator
>>> def integrate(func, double a, double b, int n):
>>>     cdef double s = 0
>>>     cdef double dx = (b - a) / n
>>>     for i in range(n):
>>>         # This is also a fastcall, but can be cached so doesn't
>>>         # matter...
>>>         s += func(a + i * dx)
>>>     return s * dx
>>>
>>> integrate(f, 0, 1, 1000000)
>>>
>>> There are two problems here:
>>>
>>> - The "sin" global can be reassigned (monkey-patched) between each
>> call
>>> to "f", no way for "f" to know. Even "sin" could do the reassignment.
>> So
>>> you'd need to check for reassignment to do caching...
>>
>> Since Cython allows static typing why not just declare that func can
>> treat sin as if it can't be monkeypatched?
>
> If you want to manually declare stuff, you can always use a C function
> pointer too...
>
>> Moving the load of a global variable out of the loop does seem to be a
>> rather obvious optimisation, if it were declared to be legal.
>
> In case you didn't notice, there were no global variable loads inside the
> loop...
>
> You can keep chasing this, but there are *always* cases where such
> optimisations don't apply (and you need to save the situation by manual
> typing).
>
> Anyway: We should really discuss Cython on the Cython list. If my
> motivating example wasn't good enough for you there's really nothing I
> can do.
>
>>> Some rough numbers:
>>>
>>> - The overhead with the tp_flags hack is a 2 ns overhead (something
>>> similar with a metaclass, the problems are more how to synchronize
>> that
>>> metaclass across multiple 3rd party libraries)
>>
>> Does your approach handle subtyping properly?
>
> Not really.
>
>>>
>>> - Dict lookup 20 ns
>>
>> Did you time _PyType_Lookup() ?
>
> No, didn't get around to it yet (and thanks for pointing it out).
> (Though the GIL requirement is an issue too for Cython.)
>
>>> - The sin function is about 35 ns. And, "f" is probably only 2-3 ns,
>>> and there could very easily be multiple such functions, defined in
>>> different modules, in a chain, in order to build up a formula.
>>>
>>
>> Such micro timings are meaningless, because the working set often tends
>> to fit in the hardware cache. A level 2 cache miss can take hundreds of
>> cycles.

I'm sorry; if my rant wasn't clear: such micro-benchmarks do in fact
mimic very closely what you'd do if you were, say, integrating an ordinary
differential equation. You *do* have a tight loop like that, just
hammering on floating point numbers. Making that specific use case more
convenient was actually the original use case that spawned this
discussion on the NumPy list over a month ago...

Dag

>
> I find this sort of response arrogant -- do you know the details of
> every usecase for a programming language under the sun?
>
> Many Cython users are scientists. And in scientific computing in
> particular you *really* have the whole range of problems and working
> sets. Honestly. In some codes you only really care about the speed of
> the disk controller. In other cases you can spend *many seconds* working
> almost only in L1 or perhaps L2 cache (for instance when integrating
> ordinary differential equations in a few variables, which is not
> entirely different in nature from the example I posted). (Then, those
> many seconds are replicated many million times for different parameters
> on a large cluster, and a 2x speedup translates directly into large
> amounts of saved money.)
>
> Also, with numerical codes you block up the problem so that loads to L2
> are amortized over sufficient FLOPs (when you can).
>
> Every time Cython becomes able to do stuff more easily in this domain,
> people thank us that they didn't have to dig up Fortran but can stay
> closer to Python.
>
> Sorry for going off on a rant. I find that people will give well-meant
> advice about performance, but that advice is just generalizing from
> computer programs in entirely different domains (web apps?), and
> sweeping generalizations have a way of giving the wrong answer.
>
> Dag



rdmurray at bitdance
May 17, 2012, 11:48 AM
Post #30 of 34
Re: C-level duck typing

On Thu, 17 May 2012 20:13:41 +0200, Dag Sverre Seljebotn <d.s.seljebotn [at] astro> wrote:
> Every time Cython becomes able to do stuff more easily in this domain,
> people thank us that they didn't have to dig up Fortran but can stay
> closer to Python.
>
> Sorry for going off on a rant. I find that people will give well-meant
> advice about performance, but that advice is just generalizing from
> computer programs in entirely different domains (web apps?), and
> sweeping generalizations has a way of giving the wrong answer.

I don't have opinions on the specific topic under discussion, since I
don't get involved in the C level stuff unless I have to, but I do have
some small amount of background in scientific computing (many years ago).
I just want to chime in to say that I think it benefits the whole Python
community to extend welcoming arms to the scientific Python community
and see what we can do to help them (without, of course, compromising
Python).

I think it is safe to assume that they do have significant experience
with real applications where timings at this level of detail do matter.
The scientific computing community is pretty much by definition pushing
the limits of what's possible.

--David


d.s.seljebotn at astro
May 17, 2012, 1:23 PM
Post #31 of 34
Re: C-level duck typing

On 05/17/2012 05:00 AM, Greg Ewing wrote:
> On 17/05/12 12:17, Robert Bradshaw wrote:
>
>> This is exactly what was proposed to start this thread (with minimal
>> collusion to avoid conflicts, specifically partitioning up a global ID
>> space).
>
> Yes, but I think this part of the mechanism needs to be spelled out in
> more detail, perhaps in the form of a draft PEP. Then there will be
> something concrete to discuss in python-dev.
>

Well, we weren't 100% sure what the best mechanism would be, so the point
really was to solicit input, even if I got a bit argumentative along the
way. Thanks to all of you!

If we in the end decide that we would like to propose the PEP, does
anyone feel the odds are anything but very, very slim? I don't think
I've heard a single positive word about the proposal so far except from
Cython devs, so I'm reluctant to spend my own and your time on fleshing
out a full PEP for that reason.

In a PEP, the proposal would likely be an additional pointer to a table
of "custom PyTypeObject extensions"; not a flag bit. The whole point
would be to only do that once, and after that PyTypeObject would be
infinitely extensible for custom purposes without collisions (even as a
way of pre-testing PEPs about PyTypeObject in the wild before final
approval!). Of course, one more pointer per type object is a bigger burden
to push on others.
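
Purely as illustration (none of these names exist anywhere), such an
extension table could be as simple as:

    /* One entry per custom extension; IDs would be partitioned among
       projects so they never collide. An id of 0 terminates the table. */
    typedef struct {
        unsigned long id;   /* globally unique extension ID */
        void *data;         /* extension-specific payload */
    } PyTypeObjectExtension;

    /* Hypothetically, PyTypeObject would grow a single new slot,
           PyTypeObjectExtension *tp_extensions;
       and a consumer would scan it like this: */
    static void *
    find_type_extension(PyTypeObjectExtension *table, unsigned long wanted)
    {
        if (table == NULL)
            return NULL;
        for (; table->id != 0; table++) {
            if (table->id == wanted)
                return table->data;
        }
        return NULL;
    }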

The thing is, you *can* just use a subtype of PyType_Type for this
purpose (or any purpose); it's just my opinion that it's not the best
solution here: it means many different libraries need a common
dependency for this reason alone (or have to dynamically handshake on a
base class at runtime). You could just stick that base class in CPython,
which would be OK I guess but not great (using the type hierarchy is
quite intrusive in general; you didn't subclass PyType_Type to stick in
tp_as_buffer either).
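
For completeness, a sketch of how a consumer would use such a shared base
(meta)class; every name below is invented and is only meant to show the
shape of the check:

    #include <Python.h>
    #include <string.h>

    /* Invented layout: classes created with the shared metaclass are a
       normal heap type followed by a table of C-level signatures. */
    typedef struct {
        const char *signature;   /* e.g. "d(d)" for double (*)(double) */
        void *funcptr;
    } SigEntry;

    typedef struct {
        PyHeapTypeObject base;
        SigEntry *entries;       /* terminated by funcptr == NULL */
    } SignatureTypeObject;

    /* The shared metaclass, obtained e.g. via a runtime handshake. */
    static PyTypeObject *SignatureMeta = NULL;

    static void *
    lookup_c_function(PyObject *callable, const char *signature)
    {
        PyTypeObject *tp = Py_TYPE(callable);
        SigEntry *e;

        /* The duck-typing test: is the type of the *type* our shared
           metaclass (or a subclass of it)? */
        if (SignatureMeta == NULL ||
            !PyObject_TypeCheck((PyObject *)tp, SignatureMeta))
            return NULL;

        e = ((SignatureTypeObject *)tp)->entries;
        if (e == NULL)
            return NULL;
        for (; e->funcptr != NULL; e++) {
            if (strcmp(e->signature, signature) == 0)
                return e->funcptr;
        }
        return NULL;
    }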

Dag


ncoghlan at gmail
May 17, 2012, 3:57 PM
Post #32 of 34
Re: C-level duck typing

I think the main things we'd be looking for would be:
- a clear explanation of why a new metaclass is considered too complex a
solution
- what the implications are for classes that have nothing to do with the
SciPy/NumPy ecosystem
- how subclassing would behave (both at the class and metaclass level)

Yes, defining a new metaclass for fast signature exchange has its
challenges - but it means that *our* concerns about maintaining consistent
behaviour in the default object model and avoiding adverse effects on code
that doesn't need the new behaviour are addressed automatically.

Also, I'd consider a functioning reference implementation using a custom
metaclass a requirement before we considered modifying type anyway, so I
think that's the best thing to pursue next rather than a PEP. It also has
the virtue of letting you choose which Python versions to target and
iterating at a faster rate than CPython.

Cheers,
Nick.
--
Sent from my phone, thus the relative brevity :)


martin at v
May 17, 2012, 5:49 PM
Post #33 of 34
Re: C-level duck typing

> If we in the end decide that we would like to propose the PEP, does
> anyone feel the odds are anything but very, very slim? I don't think
> I've heard a single positive word about the proposal so far except
> from Cython devs, so I'm reluctant to spend my own and your time on
> fleshing out a full PEP for that reason.

Before you do that, it might be useful to publish a precise, reproducible,
complete benchmark first, to support the performance figures you have been
quoting.

I'm skeptical by nature, so I don't believe any of the numbers you have given
until I can reproduce them myself. More precisely, I fail to understand what
they mean without seeing the source code that produced them (perhaps along
with an indication of what hardware, operating system, compiler version,
and Python version were used to produce them).

Regards,
Martin




d.s.seljebotn at astro
May 18, 2012, 1:30 AM
Post #34 of 34
Re: C-level duck typing

On 05/18/2012 12:57 AM, Nick Coghlan wrote:
> I think the main things we'd be looking for would be:
> - a clear explanation of why a new metaclass is considered too complex a
> solution
> - what the implications are for classes that have nothing to do with the
> SciPy/NumPy ecosystem
> - how subclassing would behave (both at the class and metaclass level)
>
> Yes, defining a new metaclass for fast signature exchange has its
> challenges - but it means that *our* concerns about maintaining
> consistent behaviour in the default object model and avoiding adverse
> effects on code that doesn't need the new behaviour are addressed
> automatically.
>
> Also, I'd consider a functioning reference implementation using a custom
> metaclass a requirement before we considered modifying type anyway, so I
> think that's the best thing to pursue next rather than a PEP. It also
> has the virtue of letting you choose which Python versions to target and
> iterating at a faster rate than CPython.

This seems right on target. I could make a utility C header for
such a metaclass, and then the different libraries can all include it
and handshake on which implementation becomes the real one through
sys.modules during module initialization. That way an eventual PEP will
only be a natural incremental step to make things more polished, whether
that happens by making such a metaclass part of the standard library or
by extending PyTypeObject.
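
Roughly, the sys.modules handshake could look like this (the key name and
the helper are of course just an illustration of the idea, not a worked-out
design):

    #include <Python.h>

    /* Called during module init by every library that ships the utility
       header: the first one to run publishes its metaclass under an
       agreed-upon key, everyone else picks up that shared copy. */
    static PyObject *
    get_shared_metaclass(PyObject *candidate)
    {
        const char *key = "_c_signature_metaclass_v1";
        PyObject *modules = PyImport_GetModuleDict();   /* borrowed */
        PyObject *existing = PyDict_GetItemString(modules, key);

        if (existing != NULL) {
            /* Someone else got there first: use their implementation. */
            Py_INCREF(existing);
            return existing;
        }
        /* We are first: publish our candidate for everyone else. */
        if (PyDict_SetItemString(modules, key, candidate) < 0)
            return NULL;
        Py_INCREF(candidate);
        return candidate;
    }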

Thanks,

Dag
