Mailing List Archive: Python: Python

PEP 450 Adding a statistics module to Python

steve+comp.lang.python at pearwood

Aug 9, 2013, 6:10 PM

Post #1 of 13
PEP 450 Adding a statistics module to Python

I am seeking comments on PEP 450, Adding a statistics module to Python's
standard library:

http://www.python.org/dev/peps/pep-0450/

Please read the FAQs before asking anything :-)


Also relevant:

http://bugs.python.org/issue18606



--
Steven
--
http://mail.python.org/mailman/listinfo/python-list


skip at pobox

Aug 9, 2013, 8:14 PM

Post #2 of 13
Re: PEP 450 Adding a statistics module to Python [In reply to]

On Fri, Aug 9, 2013 at 8:10 PM, Steven D'Aprano
<steve+comp.lang.python [at] pearwood> wrote:
> I am seeking comments on PEP 450, Adding a statistics module to Python's
> standard library:
>
> http://www.python.org/dev/peps/pep-0450/
>
> Please read the FAQs before asking anything :-)

Given that installing numpy or scipy is generally no more difficult
than executing "pip install (scipy|numpy)" I'm not really feeling the
need for a battery here... (Of course, I use this stuff at work from
time to time, so maybe I'm more in the "nuclear reactor of batteries"
camp anyway.)

Skip


ben+python at benfinney

Aug 9, 2013, 10:05 PM

Post #3 of 13
Re: PEP 450 Adding a statistics module to Python [In reply to]

Skip Montanaro <skip [at] pobox> writes:

> Given that installing numpy or scipy is generally no more difficult
> than executing "pip install (scipy|numpy)" I'm not really feeling the
> need for a battery here...

NumPy and SciPy are not available for many Python users, including those
using a Python implementation for which there is no NumPy support
<URL:http://new.scipy.org/faq.html#python-version-support> and those for
whom large, dependency-heavy third-party packages are too much burden.

See the Rationale of PEP 450 for more reasons why “install NumPy” is not
a feasible solution for many use cases, and why having ‘statistics’ as a
pure-Python, standard-library package is desirable.

--
\ “Dad always thought laughter was the best medicine, which I |
`\ guess is why several of us died of tuberculosis.” —Jack Handey |
_o__) |
Ben Finney



stefan_ml at behnel

Aug 10, 2013, 12:55 AM

Post #4 of 13
Re: PEP 450 Adding a statistics module to Python [In reply to]

Ben Finney, 10.08.2013 07:05:
> Skip Montanaro writes:
>> Given that installing numpy or scipy is generally no more difficult
>> than executing "pip install (scipy|numpy)" I'm not really feeling the
>> need for a battery here...
>
> See the Rationale of PEP 450 for more reasons why “install NumPy” is not
> a feasible solution for many use cases, and why having ‘statistics’ as a
> pure-Python, standard-library package is desirable.

The rationale suggests that the module is meant as a simple toolset for
non-NumPy users. Are the APIs (class model, function names, etc.) similar
enough to make it easy to switch, preferably in both directions?

It would be good if a stdlib statistics module could be used as a SciPy
fallback for the "simple" things, and if users of the stdlib module could
easily switch their code to SciPy if they need more speed/features/whatever
at some point, without having to relearn the name of each single function.

I'm not asking for compatibility (doesn't sound reasonable without NumPy
arrays), but I think that a similarity in terms of API naming (as far as it
makes sense) should be clearly stated, e.g. in the Design Decisions section.
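
As a rough sketch of the fallback idea, assuming the stdlib module ends up being importable as `statistics` and exposes a `mean` function as the PEP proposes (the wrapper name `average` and the `BACKEND` flag are made up for illustration):

```python
try:
    # Prefer the NumPy implementation when it is installed.
    from numpy import mean as _mean
    BACKEND = "numpy"
except ImportError:
    # Otherwise fall back to the pure-Python module proposed in PEP 450.
    from statistics import mean as _mean
    BACKEND = "statistics"


def average(data):
    """Return the arithmetic mean as a float, whichever backend is in use."""
    return float(_mean(list(data)))


print(BACKEND, average([1, 2, 3, 4]))
```

Calling code only ever sees `average`, so switching in either direction is cheap as long as the two libraries agree on names and basic semantics.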

Stefan




roy at panix

Aug 10, 2013, 4:50 AM

Post #5 of 13
Re: PEP 450 Adding a statistics module to Python [In reply to]

In article <mailman.417.1376104455.1251.python-list [at] python>,
Skip Montanaro <skip [at] pobox> wrote:

> Given that installing numpy or scipy is generally no more difficult
> than executing "pip install (scipy|numpy)" I'm not really feeling the
> need for a battery here...

I just tried installing numpy in a fresh virtualenv on an Ubuntu Precise
box. I ran "pip install numpy". It took 1.5 minutes. It printed
almost 1800 lines of build crap, including 383 warnings and 83 errors.
For a newbie, that can be pretty intimidating.

That's for the case where I've already installed numpy elsewhere on that
box, so I already had the fortran compiler, and the rest of the build
chain. For fun, I just spun up a new Ubuntu Precise instance in AWS.
It came pre-installed with Python 2.7.3. I tried "pip install numpy",
which told me that pip was not installed.

At least it told me what I needed to do to get pip installed.
Unfortunately, I didn't read the message carefully enough and typed
"sudo apt-get install pip", which of course got me another error because
the correct name of the package is python-pip. Doing "sudo apt-get
install python-pip" finally got me to the point where I could start to
install numpy.

Of course, if I didn't have sudo privs on the box (most corporate
environments), I never would have gotten that far.

At this point, "sudo pip install numpy" got me a bunch of errors
culminating in "RuntimeError: Broken toolchain: cannot link a simple C
program", and no indication of how to get any further.

At this point, most people would give up. I don't remember the full set
of steps I needed to do the first time. Obviously, I would start with
installing gcc, but I seem to remember there were additional steps
needed to get fortran support.

Having some simple statistics baked into the standard python package
would be a big win. As shown above, installing numpy can be an
insurmountable hurdle for people with insufficient sysadmin-fu.

PEP-450 makes cogent arguments why rolling your own statistics routines
is fraught with peril. Looking over our source tree, I see we've
implemented std deviation in python at least twice. I'm sure they're
both naive implementations of the sort PEP-450 warns about.
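
For illustration, here is the kind of failure in question. This is a minimal sketch, not the code from any real source tree: both function names are made up, and the data set is a standard textbook-style example of a small spread sitting on a large offset.

```python
def naive_variance(data):
    """One-pass 'textbook' sample variance: (sum(x^2) - (sum x)^2 / n) / (n - 1)."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in data:
        n += 1
        total += x
        total_sq += x * x  # squares near 1e18 no longer fit exactly in a double
    return (total_sq - total * total / n) / (n - 1)


def two_pass_variance(data):
    """Sample variance computed from deviations about the mean."""
    data = list(data)
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)


# Small spread on top of a large offset; the exact sample variance is 30.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

print(naive_variance(data))     # badly wrong (negative on IEEE 754 doubles)
print(two_pass_variance(data))  # 30.0
```

The one-pass formula subtracts two huge, nearly equal numbers, so the rounding error in the squares swamps the answer. The two-pass version subtracts the mean first and keeps the intermediate values small.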

And, yes, backporting to 2.7 would be a big win too. I know the goal is
to get everybody onto 3.x, but my pip external dependency list includes
40 modules. It's going to be a long and complicated road to get to the
point where I can move to 3.x, and I imagine most non-trivial projects
are in a similar situation.


oscar.j.benjamin at gmail

Aug 10, 2013, 5:23 AM

Post #6 of 13
Re: PEP 450 Adding a statistics module to Python [In reply to]

On 10 August 2013 12:50, Roy Smith <roy [at] panix> wrote:
> In article <mailman.417.1376104455.1251.python-list [at] python>,
> Skip Montanaro <skip [at] pobox> wrote:
>
>> Given that installing numpy or scipy is generally no more difficult
>> than executing "pip install (scipy|numpy)" I'm not really feeling the
>> need for a battery here...
>
> I just tried installing numpy in a fresh virtualenv on an Ubuntu Precise
> box. I ran "pip install numpy". It took 1.5 minutes. It printed
> almost 1800 lines of build crap, including 383 warnings and 83 errors.
> For a newbie, that can be pretty intimidating.
>
> That's for the case where I've already installed numpy elsewhere on that
> box, so I already had the fortran compiler, and the rest of the build
> chain. For fun, I just spun up a new Ubuntu Precise instance in AWS.
> It came pre-installed with Python 2.7.3. I tried "pip install numpy",
> which told me that pip was not installed.
>
> At least it told me what I needed to do to get pip installed.
> Unfortunately, I didn't read the message carefully enough and typed
> "sudo apt-get install pip", which of course got me another error because
> the correct name of the package is python-pip. Doing "sudo apt-get
> install python-pip" finally got me to the point where I could start to
> install numpy.
>
> Of course, if I didn't have sudo privs on the box (most corporate
> environments), I never would have gotten that far.
>
> At this point, "sudo pip install numpy" got me a bunch of errors
> culminating in "RuntimeError: Broken toolchain: cannot link a simple C
> program", and no indication of how to get any further.

You should use apt-get for numpy/scipy on Ubuntu. Although
unfortunately IIRC this doesn't work as well as it should since Ubuntu
doesn't install the appropriate BLAS/LAPACK libraries by default
(leaving you with numpy's fallback libraries).

On Windows you should use the MSI installer (or easy_install).
Hopefully numpy/scipy will start distributing wheels soon and pip
install numpy will actually work.


Oscar


roy at panix

Aug 10, 2013, 5:43 AM

Post #7 of 13
Re: PEP 450 Adding a statistics module to Python [In reply to]

Skip Montanaro <skip [at] pobox> wrote:
> >> installing numpy or scipy is generally no more difficult
> >> than executing "pip install (scipy|numpy)"

I described the problems I had trying to follow that advice.

In article <mailman.425.1376137459.1251.python-list [at] python>,
Oscar Benjamin <oscar.j.benjamin [at] gmail> wrote:

> You should use apt-get for numpy/scipy on Ubuntu. Although
> unfortunately IIRC this doesn't work as well as it should since Ubuntu
> doesn't install the appropriate BLAS/LAPACK libraries by default
> (leaving you with numpy's fallback libraries).

That really kind of proves my point. It's *not* easy to install.
There's a choice of methods, some of which work in some environments,
some of which work in others. And even if apt-get is the preferred
install method on Ubuntu, it's a method which is unavailable to people
without root access (and may be undesirable if you rely on virtualenv to
keep multiple projects cleanly separated).

And, what happens if you don't have the right libraries? Do you end up
with an install which is missing some functionality, or one where all
the calls work, but they're slower, or numerically unstable, or what?

All these questions go away if it's packaged with the standard library.

I'm not sure where the line should be drawn between "basic stuff that
should be included" and "advanced stuff that you need an add-on to get",
but certainly mean and std-dev should be in the default distribution.


oscar.j.benjamin at gmail

Aug 10, 2013, 6:17 AM

Post #8 of 13
Re: PEP 450 Adding a statistics module to Python [In reply to]

On 10 August 2013 13:43, Roy Smith <roy [at] panix> wrote:
>
> In article <mailman.425.1376137459.1251.python-list [at] python>,
> Oscar Benjamin <oscar.j.benjamin [at] gmail> wrote:
>
>> You should use apt-get for numpy/scipy on Ubuntu. Although
>> unfortunately IIRC this doesn't work as well as it should since Ubuntu
>> doesn't install the appropriate BLAS/LAPACK libraries by default
>> (leaving you with numpy's fallback libraries).
>
> That really kind of proves my point. It's *not* easy to install.
> There's a choice of methods, some of which work in some environments,
> some of which work in others. And even if apt-get is the preferred
> install method on Ubuntu, it's a method which is unavailable to people
> without root access (and may be undesirable if you rely on virtualenv to
> keep multiple projects cleanly separated).
>
> And, what happens if you don't have the right libraries? Do you end up
> with an install which is missing some functionality, or one where all
> the calls work, but they're slower, or numerically unstable, or what?

AFAIK not having separate BLAS/LAPACK libraries just means that
certain operations are a lot slower. If there are differences in
accuracy then they aren't significant enough that I've noticed.

I think that the reason Ubuntu doesn't install them by default is
because it's not sure which ones you want to use. Possibly the best
free setup comes from using ATLAS but this is optimised in a
CPU-specific way at build time. Ubuntu doesn't provide binaries for it
as using generic x86 executables would defeat much of the point of the
library (they do make it a lot easier by providing a source package
though).


Oscar


skip at pobox

Aug 11, 2013, 4:50 AM

Post #9 of 13
Re: PEP 450 Adding a statistics module to Python [In reply to]

> See the Rationale of PEP 450 for more reasons why “install NumPy” is not
> a feasible solution for many use cases, and why having ‘statistics’ as a
> pure-Python, standard-library package is desirable.

I read that before posting but am not sure I agree. I don't see the
screaming need for this package. Why can't it continue to live on
PyPI, where, once again, it is available as "pip install ..."?

S


nicholas.cole at gmail

Aug 11, 2013, 5:27 AM

Post #10 of 13
Re: PEP 450 Adding a statistics module to Python [In reply to]

On Sun, Aug 11, 2013 at 12:50 PM, Skip Montanaro <skip [at] pobox> wrote:

> > See the Rationale of PEP 450 for more reasons why “install NumPy” is not
> > a feasible solution for many use cases, and why having ‘statistics’ as a
> > pure-Python, standard-library package is desirable.
>
> I read that before posting but am not sure I agree. I don't see the
> screaming need for this package. Why can't it continue to live on
> PyPI, where, once again, it is available as "pip install ..."?


Well, I *do* think this module would be a wonderful addition to the
standard library. I've often used python to do analysis of data, nothing
complicated enough to need NumPy, but certainly things where I've needed to
find averages etc. I've rolled my own functions for these projects, and I'm
sure they are fragile. Besides, it was just a pain to do them.

PyPI is terrific. There are lots of excellent modules on there. It's a
wonderful resource. But I think that the standard library is also a
wonderful thing, and where there are clearly defined modules, that serve a
general, well-defined function and where development does not need to be
very rapid, I think they should go into the Standard Library.

I'm aware that my opinion is just that of one user, but I read this PEP and
I thought, "Thank Goodness! That looks great. About time too."

N.


steve+comp.lang.python at pearwood

Aug 11, 2013, 6:33 AM

Post #11 of 13
Re: PEP 450 Adding a statistics module to Python [In reply to]

On Sun, 11 Aug 2013 06:50:36 -0500, Skip Montanaro wrote:

>> See the Rationale of PEP 450 for more reasons why “install NumPy” is
>> not a feasible solution for many use cases, and why having ‘statistics’
>> as a pure-Python, standard-library package is desirable.
>
> I read that before posting but am not sure I agree. I don't see the
> screaming need for this package. Why can't it continue to live on PyPI,
> where, once again, it is available as "pip install ..."?


The same could be said about any module, really. And indeed, some
languages have that philosophy: they provide no libraries to speak of, and
if you want anything you have to either write it yourself or get it from
somebody else.

Not everyone has the luxury of being able, or allowed, to run "pip
install" to get additional, non-standard packages, for example in
corporate environments. But I've already said that in the PEP.


--
Steven


roy at panix

Aug 11, 2013, 7:02 AM

Post #12 of 13
Re: PEP 450 Adding a statistics module to Python [In reply to]

In article <mailman.479.1376221844.1251.python-list [at] python>,
Skip Montanaro <skip [at] pobox> wrote:

> > See the Rationale of PEP 450 for more reasons why “install NumPy” is not
> > a feasible solution for many use cases, and why having ‘statistics’ as a
> > pure-Python, standard-library package is desirable.
>
> I read that before posting but am not sure I agree. I don't see the
> screaming need for this package. Why can't it continue to live on
> PyPI, where, once again, it is available as "pip install ..."?

My previous comments on this topic were along the lines of "installing
numpy is a non-starter if all you need are simple mean/std-dev". You
do, however, make a good point here. Running "pip install statistics"
is a much lower barrier to entry than getting numpy going, especially if
statistics is pure python and thus has no dependencies on compiler tool
chains which may be missing.

Still, I see two classes of function in PEP-450. Class 1 is the really
basic stuff:

* mean
* std-dev

Class 2 are the more complicated things like:

* linear regression
* median
* mode
* functions for calculating the probability of random variables
from the normal, t, chi-squared, and F distributions
* inference on the mean
* anything that differentiates between population and sample

I could see leaving class 2 stuff in an optional pure-python module to
be installed by pip, but for (as the PEP phrases it), the simplest and
most obvious statistical functions (into which I lump mean and std-dev),
having them in the standard library would be a big win.
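
To make the "class 1" case concrete, here is a short sketch of what that usage might look like. It assumes the module is importable as `statistics` and provides `mean`, `stdev` (sample) and `pstdev` (population), which are the names the PEP proposes; the data values are arbitrary.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)

# Sample standard deviation: divides by n - 1.
sample_sd = statistics.stdev(data)

# Population standard deviation: divides by n.
population_sd = statistics.pstdev(data)

print(mean, sample_sd, population_sd)
```

Even this tiny example touches the population-versus-sample distinction, which is part of why a vetted implementation is safer than a hand-rolled one.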


buzzard at invalid

Aug 11, 2013, 8:44 AM

Post #13 of 13
Re: PEP 450 Adding a statistics module to Python [In reply to]

On 11/08/13 15:02, Roy Smith wrote:
> In article <mailman.479.1376221844.1251.python-list [at] python>,
> Skip Montanaro <skip [at] pobox> wrote:
>
>>> See the Rationale of PEP 450 for more reasons why “install NumPy” is not
>>> a feasible solution for many use cases, and why having ‘statistics’ as a
>>> pure-Python, standard-library package is desirable.
>>
>> I read that before posting but am not sure I agree. I don't see the
>> screaming need for this package. Why can't it continue to live on
>> PyPI, where, once again, it is available as "pip install ..."?
>
> My previous comments on this topic were along the lines of "installing
> numpy is a non-starter if all you need are simple mean/std-dev". You
> do, however, make a good point here. Running "pip install statistics"
> is a much lower barrier to entry than getting numpy going, especially if
> statistics is pure python and thus has no dependencies on compiler tool
> chains which may be missing.
>
> Still, I see two classes of function in PEP-450. Class 1 is the really
> basic stuff:
>
> * mean
> * std-dev
>
> Class 2 are the more complicated things like:
>
> * linear regression
> * median
> * mode
> * functions for calculating the probability of random variables
> from the normal, t, chi-squared, and F distributions
> * inference on the mean
> * anything that differentiates between population and sample
>
> I could see leaving class 2 stuff in an optional pure-python module to
> be installed by pip, but for (as the PEP phrases it), the simplest and
> most obvious statistical functions (into which I lump mean and std-dev),
> having them in the standard library would be a big win.
>

I would probably move other descriptive statistics (median, mode,
correlation, ...) into Class 1.

I roll my own statistical tests as I need them - simply to avoid having
a dependency on R. But I generally do end up with a dependency on scipy
because I need scipy.stats.distributions. So I guess a distinct library
for probability distributions would be handy - but maybe it should not
be in the standard library.

Once we move on to statistical modelling (e.g. linear regression) I
think the case for inclusion in the standard library becomes weaker
still. Cheers.

Duncan
