Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Python: Dev

Optionally using GMP to implement long if available

 

 

First page Previous page 1 2 Next page Last page  View All Python dev RSS feed   Index | Next | Previous | View Threaded


martin at v

Nov 4, 2008, 11:53 AM

Post #26 of 33 (781 views)
Permalink
Re: Optionally using GMP to implement long if available [In reply to]

skip [at] pobox wrote:
> >> OTOH, it should be no big deal to drop a zip archive of the GMP
> >> sources which correspond to the code bound into the DLL.
>
> Martin> How would end users then extract the sources from the DLL? How
> Martin> would they learn that they even have them in the first place?
>
> I think you missed an implied comma in my statement. I meant:
>
> 1. distribute pythonxy.dll which binds to GMP.
> 2. distribute a gmpmn.zip which contains the source for the version used
> in #1.

Ah, I see. That *is* a big deal to many Python users, which want to
distribute pythonxy.dll as part of their application, in a single file
(e.g. py2exe), and then don't want to worry about shipping another zip
file along with their single-file executable (plus including a third
file explaining why you need to do this).

Regards,
Martin
_______________________________________________
Python-Dev mailing list
Python-Dev [at] python
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/list-python-dev%40lists.gossamer-threads.com


dickinsm at gmail

Nov 4, 2008, 12:59 PM

Post #27 of 33 (792 views)
Permalink
Re: Optionally using GMP to implement long if available [In reply to]

On Tue, Nov 4, 2008 at 6:33 PM, Tim Peters <tim.peters [at] gmail> wrote:
> Whenever
> two digits are multiplied, the code intends to cast (at least) one of
> them to "twodigits" first (if you ever see a spot where this doesn't
> happen, it's a bug). "twodigits" is typedef'ed to C long. C89
> guarantees that a long holds at least 32 bits.

There are indeed one or two spots that seem to be missing a cast,
for example the line "rem -= hi*n" in inplace_divrem1. I've fixed
all those I found, in the issue 4258 patch.

>> And for 32x32 -> 64, can't this simply be replaced by "(uint64_t)i * j",
> 1. That's C99, not C89, and therefore less portable.

Understood; my thought was to use uint32_t and uint64_t for
digit and twodigits when available (with longs being stored in
base 2**30), falling back to the current ushort/ulong/base 2**15
otherwise. On Unix, autoconf makes this easy by providing
macros like 'AC_TYPE_INT32_T', which makes sure that
int32_t is defined to be an integer type of exact width 32,
when available.

> 2. On platforms that support it, this is at least 64x64->64 multiplication,
> potentially much more expensive than the 32x32->64 (or 31x31->62?)
> flavor you /intend/ to move to.
>
> 3. There's no way to know exactly what compilers will do with this short of
> staring at generated code.

Yes---I've done the staring for gcc, so I now have precisely *one*
data point, which is that various flavours of gcc on x86/x86_64
*are* clever enough to turn

(uint64_t)my_uint32 * my_other_uint32

into the appropriate HW instruction. Unfortunately I don't have
easy access to other compilers or platforms right now. :-(
Am still working on the benchmarking, but I'm definitely seeing
speedup on gcc/x86---between 10% and 100% depending
on the operations.

> FYI, in a previous life working in speech recognition, under
> Microsoft's Visual Studio 6 the only way we found to get at the
> Pentium's 32x32->64 HW ability efficiently was via using inline
> assembler.

Urk. We definitely don't want to go there. Though I guess this
is how packages like gmp and GP/Pari manage.

> C89). That's why it's impossible here to write portable code in C
> that's also efficient. Even what Python does now is vulnerable on the

But maybe it's possible to write portable code (by providing fallbacks)
that turns out to be efficient on the majority of mainstream systems?
The extent of the ifdef'ery in the patch is really rather small: one
(compound) #ifdef in longintrepr.h for defining digit, twodigits, stwodigits
etc, and a couple more for the places where digits are read and written
in marshal.c.

>> I agree that very-long-integer optimizations probably don't really belong in
>> Python,
>
> Depends in part on whether Python can attract as many obsessed
> maintainers and porters for such gonzo algorithms as GMP attracts ;-)

Well, you can count me in. :)

Mark
_______________________________________________
Python-Dev mailing list
Python-Dev [at] python
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/list-python-dev%40lists.gossamer-threads.com


tim.peters at gmail

Nov 9, 2008, 8:11 PM

Post #28 of 33 (780 views)
Permalink
Re: Optionally using GMP to implement long if available [In reply to]

[Tim Peters]
>> ..
>> Whenever two digits are multiplied, the code intends to
>> cast (at least) one of them to "twodigits" first (if you
>> ever see a spot where this doesn't happen, it's a bug).

[Mark Dickinson]
> There are indeed one or two spots that seem to be missing a
> cast, for example the line "rem -= hi*n" in inplace_divrem1.

Definitely a bug! Alas, it's not surprising -- there are no platforms
on which this bug has a visible consequence (because `digit` is
currently type `unsigned short`, C coerces `hi` and `n` to `int`
before multiplying, and on all platforms to date a C int is at least
32 bits, so that the multiply is at least 32x32->32 despite the lack
of a `twodigts` cast).

>> ...
>> 3. There's no way to know exactly what compilers will do with
>> this short of staring at generated code.

> Yes---I've done the staring for gcc, so I now have precisely *one*
> data point, which is that various flavours of gcc on x86/x86_64
> *are* clever enough to turn
>
> (uint64_t)my_uint32 * my_other_uint32
>
> into the appropriate HW instruction.

Nice! Is this a documented feature? "The problem" is that it
probably depends on a combination of gcc version and compilation
flags, and because /knowing/ whether it works requires staring at
generated code, there's probably no sane way to automate detection of
when, if, and under what conditions it breaks. "Serious" packages use
assembler to avoid all this uncertainty.

> Unfortunately I don't have easy access to other compilers or
> platforms right now. :-(

Sorry, neither do I. If you can dream up a way to automate testing
for generation of the desired HW instructions, you could post the test
program here and I bet a lot of people would run it. Maybe even if
you just described what to look for "by eyeball" in the generated
code.


> Am still working on the benchmarking, but I'm definitely seeing
> speedup on gcc/x86---between 10% and 100% depending
> on the operations.

Sure -- it's not surprising that HW crunching more bits at a time is
significantly faster.

>> FYI, in a previous life working in speech recognition, under
>> Microsoft's Visual Studio 6 the only way we found to get at the
>> Pentium's 32x32->64 HW ability efficiently was via using inline
>> assembler.

> Urk. We definitely don't want to go there. Though I guess this
> is how packages like gmp and GP/Pari manage.

1. I have no idea what versions of Microsoft's compiler
after MSVC 6 do here; perhaps it's no longer an issue
(the Windows Python distro no longer uses MSVC 6).

2. If this is thought to be worth having, then on very
widely used platforms I think a good case /can/ be
made for incorporating some assembler.

3. GMP is "speed at any cost" -- they use assembler
even when it's a stupid idea ;-)

> ..
> But maybe it's possible to write portable code (by providing fallbacks)
> that turns out to be efficient on the majority of mainstream systems?

If "it works" under the gcc and Windows compilers du jour on x86
systems, that probably covers over 90% of Python installations. Good
enough -- stop before it gets pointlessly hard ;-)


> The extent of the ifdef'ery in the patch is really rather small: one
> (compound) #ifdef in longintrepr.h for defining digit, twodigits, stwodigits
> etc, and a couple more for the places where digits are read and written
> in marshal.c.

But so far it only works with an unknown subset of gcc versions,
right? These things don't get simpler, alas :-(


>>> I agree that very-long-integer optimizations probably don't really belong in
>>> Python,

>> Depends in part on whether Python can attract as many obsessed
>> maintainers and porters for such gonzo algorithms as GMP attracts ;-)

> Well, you can count me in. :)

Last I looked (which was at least 3 years ago), GMP's source code was
bigger than all of Python's combined. For a start, I'll have the PSF
draw up a contract obligating you to lifetime servitude :-)
_______________________________________________
Python-Dev mailing list
Python-Dev [at] python
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/list-python-dev%40lists.gossamer-threads.com


greg.ewing at canterbury

Nov 9, 2008, 8:52 PM

Post #29 of 33 (768 views)
Permalink
Re: Optionally using GMP to implement long if available [In reply to]

Tim Peters wrote:
> because /knowing/ whether it works requires staring at
> generated code, there's probably no sane way to automate detection of
> when, if, and under what conditions it breaks.

Maybe it could be tested by compiling two small test programs,
one using (uint64_t)my_uint32 * my_other_uint32 and the other
using my_uint64 * my_other_uint64, and seeing whether one is
faster than the other?

--
Greg
_______________________________________________
Python-Dev mailing list
Python-Dev [at] python
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/list-python-dev%40lists.gossamer-threads.com


nick at craig-wood

Nov 10, 2008, 8:26 AM

Post #30 of 33 (754 views)
Permalink
Re: Optionally using GMP to implement long if available [In reply to]

Tim Peters <tim.peters [at] gmail> wrote:
> > And for 32x32 -> 64, can't this simply be replaced by "(uint64_t)i * j",
> > where uint64_t is as in C99? I'd hope that most compilers would
> > manage to turn this into the appropriate 32x32-bit hardware multiply.
>
> 1. That's C99, not C89, and therefore less portable.
>
> 2. On platforms that support it, this is at least 64x64->64 multiplication,
> potentially much more expensive than the 32x32->64 (or 31x31->62?)
> flavor you /intend/ to move to.
>
> 3. There's no way to know exactly what compilers will do with this short of
> staring at generated code.

I've relied in the past for the compiler generating a 32*32->64 bit
multiply for this code

#include <stdint.h>

uint64_t mul(uint32_t a, uint32_t b)
{
return a*b;
}

Looking at the assembler it produces (x86)

mul:
pushl %ebp
xorl %edx, %edx
movl %esp, %ebp
movl 12(%ebp), %eax
imull 8(%ebp), %eax
popl %ebp
ret

Which I'm pretty sure is a 32x32->64 bit mul (though my x86 assembler
foo is weak).

I think a compiler would have to be pretty stupid not to take this
optimisation... But then there are some pretty stupid compilers out
there!

--
Nick Craig-Wood <nick [at] craig-wood> -- http://www.craig-wood.com/nick
_______________________________________________
Python-Dev mailing list
Python-Dev [at] python
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/list-python-dev%40lists.gossamer-threads.com


dickinsm at gmail

Nov 10, 2008, 9:42 AM

Post #31 of 33 (755 views)
Permalink
Re: Optionally using GMP to implement long if available [In reply to]

On Mon, Nov 10, 2008 at 4:26 PM, Nick Craig-Wood <nick [at] craig-wood> wrote:
>
> Looking at the assembler it produces (x86)
>
> mul:
> pushl %ebp
> xorl %edx, %edx
> movl %esp, %ebp
> movl 12(%ebp), %eax
> imull 8(%ebp), %eax
> popl %ebp
> ret
>
> Which I'm pretty sure is a 32x32->64 bit mul (though my x86 assembler
> foo is weak).

My x86 assembler is also weak (or perhaps I should say nonexistent),
but I think this does exactly what the C standard says it should: that is,
it returns just the low 32-bits of the product.

Looking at the assembler, I think the imull does a 32-bit by
32-bit multiply and puts its (truncated) result back into the 32-bit
register eax. I'd guess that the 64-bit result is being returned to
the calling routine in registers edx (high 32 bits) and eax (low 32 bits);
this explains why edx has to be zeroed with the 'xorl' instruction.

And if we were really expecting a 64-bit result then there should
be an unsigned multiply (mull) there instead of a signed multiply
(imull); of course they're the same modulo 2**32, so for a 32-bit
result it doesn't matter which is used.

Mark
_______________________________________________
Python-Dev mailing list
Python-Dev [at] python
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/list-python-dev%40lists.gossamer-threads.com


dickinsm at gmail

Nov 10, 2008, 9:57 AM

Post #32 of 33 (767 views)
Permalink
Re: Optionally using GMP to implement long if available [In reply to]

Here's the test code I was using, modeled on the basic operation
that longobject needs: multiply two digits together and extract
high and low digits from the result.


typedef uint32_t digit;
typedef uint64_t twodigits;
#define SHIFT 30
#define MASK (((digit)1 << SHIFT) - 1)

/* multiply a and b, and split into high digit (returned)
and low digit (put into *low) */

extern digit
digit_mul(digit *low, digit a, digit b) {
twodigits prod;
prod = (twodigits)a * b;
*low = (digit)(prod & MASK);
return (digit)(prod >> SHIFT);
}

Using gcc 4.0.1 on 32-bit x86 and compiling with "gcc -O1 -S test.c"
gives, for me, a file test.s that looks like:

.text
.globl _digit_mul
_digit_mul:
pushl %ebp
movl %esp, %ebp
pushl %esi
subl $4, %esp
movl 16(%ebp), %eax
mull 12(%ebp)
movl %eax, %esi
andl $1073741823, %esi
movl 8(%ebp), %ecx
movl %esi, (%ecx)
shrdl $30, %edx, %eax
addl $4, %esp
popl %esi
leave
ret
.subsections_via_symbols

There's only a single mull instruction there, showing that gcc
is doing the right thing, at least when optimization is turned on.
Without optimization, gcc produces three separate
multiplications, two of which are multiplications
by zero.

But if I compile with the -m64 flag to force 64-bit then the
multiply becomes:

imulq %rsi, %rdx

which looks a lot like a 64 x 64 -> 64 multiply to me. This
seems inefficient, when a 32 x 32 -> 64 bit multiply ought
to be good enough. But maybe there isn't a significant
performance difference on x86_64?

Mark
_______________________________________________
Python-Dev mailing list
Python-Dev [at] python
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/list-python-dev%40lists.gossamer-threads.com


greg at krypto

Nov 15, 2008, 3:17 PM

Post #33 of 33 (711 views)
Permalink
Re: Optionally using GMP to implement long if available [In reply to]

On Tue, Nov 4, 2008 at 10:33 AM, Tim Peters <tim.peters [at] gmail> wrote:
>
> 2. On platforms that support it, this is at least 64x64->64 multiplication,
> potentially much more expensive than the 32x32->64 (or 31x31->62?)
> flavor you /intend/ to move to.

Thats a good point, thanks!

I am not averse to including a tiny bit of platform (i386) specific
inline asm (in its own header file as a macro to make it easy to
maintain and easy to turn off and easy to add versions for someone
elses favorite 32bit platform) to get that when compiled to use 30bit
digits since the C language has no way to express it directly.

-gps
_______________________________________________
Python-Dev mailing list
Python-Dev [at] python
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/list-python-dev%40lists.gossamer-threads.com

First page Previous page 1 2 Next page Last page  View All Python dev RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.