
Mailing List Archive: Python: Dev

Re: [Python-ideas] itertools.chunks(iterable, size, fill=None)

 

 



raymond.hettinger at gmail

Jul 1, 2012, 12:07 AM

Post #1 of 13 (396 views)
Re: [Python-ideas] itertools.chunks(iterable, size, fill=None)

On Jun 30, 2012, at 10:44 PM, Stefan Behnel wrote:

>>
>> Another addition could be a new subsection on grouping (chunking) that
>> would discuss post-processing of grouper (as discussed above), as well as
>> other recipes, including ones specific to strings and sequences. It would
>> essentially be a short how-to. Call it 9.1.3 "Grouping, Blocking, or
>> Chunking Sequences and Iterables". The synonyms will help external
>> searching. A toc would let people who have found this doc know to look for
>> this at the bottom.
>
> If it really is such an important use case for so many people, I agree that
> it's worth special casing it in the docs. It's not a trivial algorithmic
> step from a sequential iterable to a grouped iterable.

I'm not too keen on adding a section like this to the itertools docs.

Instead, I would be open to adding a "further reading" section with external links
to interesting iterator writeups in blogs, cookbooks, Stack Overflow answers, wikis, etc.

If one of you wants to craft an elegant blog post on "Grouping, Blocking, or
Chunking Sequences and Iterables", I would be happy to link to it.


Raymond


stefan_ml at behnel

Jul 1, 2012, 5:01 AM

Post #2 of 13 (388 views)
Re: [Python-ideas] itertools.chunks(iterable, size, fill=None) [In reply to]

Hi Raymond,

Raymond Hettinger, 01.07.2012 09:07:
> On Jun 30, 2012, at 10:44 PM, Stefan Behnel wrote:
>>> Another addition could be a new subsection on grouping (chunking) that
>>> would discuss post-processing of grouper (as discussed above), as well as
>>> other recipes, including ones specific to strings and sequences. It would
>>> essentially be a short how-to. Call it 9.1.3 "Grouping, Blocking, or
>>> Chunking Sequences and Iterables". The synonyms will help external
>>> searching. A toc would let people who have found this doc know to look for
>>> this at the bottom.
>>
>> If it really is such an important use case for so many people, I agree that
>> it's worth special casing it in the docs. It's not a trivial algorithmic
>> step from a sequential iterable to a grouped iterable.
>
> I'm not too keen on adding a section like this to the itertools docs.

I've only just seen that the recipes section is part of the same page since
the 2.6 documentation was sphinxified. I had remembered it being on a
separate page before. That resolves most of my original concerns. Sorry,
should have looked earlier.

To address the main problem of users not finding what they need, what about
simply extending the docstring of the grouper() function with a sentence
like this:

"This functionality is also called 'chunking' or 'blocking' and can be used
for load distribution and sharding."

That would make it easy for users to find what they are looking for when
they search the page for "chunk". I find that a much more common and less
ambiguous name than "grouping", which reminds me more of "group by".

It might be a good idea in general to add a short comment on a use case to
each recipe where it's not immediately obvious or where there is a use case
with a well-known name, simply to aid in text searches over the page.


> Instead, I would be open adding "further reading" section with external links
> to interesting iterator writeups in blogs, cookbooks, stack overflow answers, wikis, etc.
>
> If one of you wants to craft an elegant blog post on "Grouping, Blocking, or
> Chunking Sequences and Iterables", I would be happy to link to it.

That could be done in addition, but it bears the risk of bit rot in the
documentation as links die, blogs move, or texts change.

Stefan

_______________________________________________
Python-Dev mailing list
Python-Dev [at] python
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/list-python-dev%40lists.gossamer-threads.com


raymond.hettinger at gmail

Jul 2, 2012, 9:23 PM

Post #3 of 13 (390 views)
Re: [Python-ideas] itertools.chunks(iterable, size, fill=None) [In reply to]

On Jul 1, 2012, at 5:01 AM, Stefan Behnel wrote:

> To address the main problem of users not finding what they need, what about
> simply extending the docstring of the grouper()


Here's a small change to the docstring: http://hg.python.org/cpython/rev/d32f21d87363

FWIW, if you're interested in load balancing applications, George Sakkis's itertools
recipe for roundrobin() may be of interest.

Another interesting iterator technique that is not well known is the two-argument
form of iter() which is a marvel for transforming callables into iterators:

for block in iter(partial(f.read, 1024), ''):
    ...

for diceroll in iter(partial(randrange, 1, 7), 4):
    ...
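
A minimal runnable sketch of the two-argument form, assuming an in-memory io.StringIO as a stand-in for an open text file and a seeded random.Random so the run is repeatable (both are illustrative choices, not part of the snippets above). The callable is invoked repeatedly until it returns the sentinel, which is never yielded.

```python
import io
import random
from functools import partial

# Read fixed-size blocks until read() returns the sentinel '' (end of file).
f = io.StringIO('ABCDEFG')  # stand-in for an open text file (assumption)
blocks = list(iter(partial(f.read, 3), ''))
print(blocks)  # ['ABC', 'DEF', 'G']

# Roll a die until a 4 comes up; the sentinel value itself is not yielded.
rng = random.Random(2012)  # seeded only to make the run repeatable
rolls = list(iter(partial(rng.randrange, 1, 7), 4))
print(rolls)
```

For a file opened in binary mode the sentinel would be b'' rather than ''.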


Raymond


techtonik at gmail

Jul 4, 2012, 2:57 AM

Post #4 of 13 (389 views)
Re: [Python-ideas] itertools.chunks(iterable, size, fill=None) [In reply to]

On Fri, Jun 29, 2012 at 11:32 PM, Georg Brandl <g.brandl [at] gmx> wrote:
> On 26.06.2012 10:03, anatoly techtonik wrote:
>>
>> Now that Python 3 is all about iterators (which is a user killer
>> feature for Python according to StackOverflow -
>> http://stackoverflow.com/questions/tagged/python) would it be nice to
>> introduce more first class functions to work with them? One function
>> to be exact to split string into chunks.
>>
>> itertools.chunks(iterable, size, fill=None)
>>
>> Which is the 33th most voted Python question on SO -
>>
>> http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python/312464
>>
>> P.S. CC'ing to python-dev@ to notify about the thread in python-ideas.
>>
>
> Anatoly, so far there were no negative votes -- would you care to go
> another step and propose a patch?

Was about to say "no problem", but in fact there is one. Sorry for the
whining on my side, and thanks for the nudge. The mere thought that a
simple task of copy/pasting relevant code from
http://docs.python.org/library/itertools.html?highlight=itertools#recipes
will require a few hours of waiting for a download (not everybody has
high-speed internet yet) makes me switch to other, less time-consuming
tasks before getting around to it. Those tasks become more important
within a few hours, and I've been through this many times
before. It then becomes quite hard to switch back.

I absolutely don't mind someone else being credited for the idea,
because ideas are usually worthless without implementation. It will be
interesting to design how the process could work in a separate thread.
For now the best thing I can do (I don't even risk mentioning anything
to do with 3.3) is to copy/paste code from the docs here:

from itertools import izip_longest
def chunks(iterable, size, fill=None):
    """Split an iterable into blocks of fixed-length"""
    # chunks('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * size
    return izip_longest(fillvalue=fill, *args)

BTW, this doesn't work as expected (at least for strings). Expected is:
chunks('ABCDEFG', 3, 'x') --> 'ABC' 'DEF' 'Gxx'
got:
chunks('ABCDEFG', 3, 'x') --> ('A' 'B' 'C') ('D' 'E' 'F') ('G' 'x' 'x')

Needs more round tuits definitely.
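
For reference, a hedged Python 3 sketch of the same recipe, assuming zip_longest in place of the Python 2 izip_longest. The str_chunks helper is a hypothetical name introduced here; it only shows how the tuples can be joined back into strings when both the input and the fill value are strings.

```python
from itertools import zip_longest

def chunks(iterable, size, fill=None):
    """Group an iterable into fixed-length tuples, padding the last one."""
    args = [iter(iterable)] * size
    return zip_longest(*args, fillvalue=fill)

def str_chunks(text, size, fill=''):
    """Hypothetical helper: join each tuple back into a string."""
    return (''.join(group) for group in chunks(text, size, fill))

tuples = list(chunks('ABCDEFG', 3, 'x'))
strings = list(str_chunks('ABCDEFG', 3, 'x'))
print(tuples)   # [('A', 'B', 'C'), ('D', 'E', 'F'), ('G', 'x', 'x')]
print(strings)  # ['ABC', 'DEF', 'Gxx']
```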


stefan_ml at behnel

Jul 4, 2012, 5:37 AM

Post #5 of 13 (393 views)
Re: [Python-ideas] itertools.chunks(iterable, size, fill=None) [In reply to]

anatoly techtonik, 04.07.2012 11:57:
> On Fri, Jun 29, 2012 at 11:32 PM, Georg Brandl wrote:
>> On 26.06.2012 10:03, anatoly techtonik wrote:
>>> Now that Python 3 is all about iterators (which is a user killer
>>> feature for Python according to StackOverflow -
>>> http://stackoverflow.com/questions/tagged/python) would it be nice to
>>> introduce more first class functions to work with them? One function
>>> to be exact to split string into chunks.
>>>
>>> itertools.chunks(iterable, size, fill=None)
>>>
>>> Which is the 33th most voted Python question on SO -
>>>
>>> http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python/312464
>>>
>>> P.S. CC'ing to python-dev@ to notify about the thread in python-ideas.
>>>
>>
>> Anatoly, so far there were no negative votes -- would you care to go
>> another step and propose a patch?
>
> Was about to say "no problem", but in fact - there is. Sorry from
> whining from my side and thanks for nudging. The only thought that a
> simple task of copy/pasting relevant code from
> http://docs.python.org/library/itertools.html?highlight=itertools#recipes
> will require a few hours waiting of download (still not everybody has
> a high-speed internet) makes me switch to other less time consuming
> tasks before getting around to it. These tasks become more important
> in a few hours, and basically I've passed through this many times
> before. It then becomes quite hard to switch back.
>
> I absolutely don't mind someone else being credited for the idea,
> because ideas usually worthless without implementation. It will be
> interesting to design how the process could work in a separate thread.
> For now the best thing I can do (I don't risk even to mention anything
> with 3.3) is to copy/paste code from the docs here:
>
> from itertools import izip_longest
> def chunks(iterable, size, fill=None):
>     """Split an iterable into blocks of fixed-length"""
>     # chunks('ABCDEFG', 3, 'x') --> ABC DEF Gxx
>     args = [iter(iterable)] * size
>     return izip_longest(fillvalue=fill, *args)

I think Raymond's change fixes this issue quite nicely; there is no need to
touch the module code.

Stefan



tjreedy at udel

Jul 4, 2012, 11:31 AM

Post #6 of 13 (389 views)
Re: [Python-ideas] itertools.chunks(iterable, size, fill=None) [In reply to]

On 7/4/2012 5:57 AM, anatoly techtonik wrote:
> On Fri, Jun 29, 2012 at 11:32 PM, Georg Brandl <g.brandl [at] gmx> wrote:

>> Anatoly, so far there were no negative votes -- would you care to go
>> another step and propose a patch?
>
> Was about to say "no problem",

Did you read that there *are* strong negative votes? And that this idea
has been rejected before? I summarized the objections in my two
responses and pointed to the tracker issues. One of the objections is
that there are 4 different things one might want if the sequence length
is not an even multiple of the chunk size. Your original 'idea' did not
specify.

> For now the best thing I can do (I don't risk even to mention anything
> with 3.3) is to copy/paste code from the docs here:
>
> from itertools import izip_longest
> def chunks(iterable, size, fill=None):
>     """Split an iterable into blocks of fixed-length"""
>     # chunks('ABCDEFG', 3, 'x') --> ABC DEF Gxx
>     args = [iter(iterable)] * size
>     return izip_longest(fillvalue=fill, *args)

Python ideas is about Python 3 ideas. Please post Python 3 code.

This is actually a one liner

    return zip_longest(*[iter(iterable)]*size, fillvalue=fill)

We don't generally add such to the stdlib.

> BTW, this doesn't work as expected (at least for strings). Expected is:
> chunks('ABCDEFG', 3, 'x') --> 'ABC' 'DEF' 'Gxx'
> got:
> chunks('ABCDEFG', 3, 'x') --> ('A' 'B' 'C') ('D' 'E' 'F') ('G' 'x' 'x')

One of the problems with the idea of 'add a chunker' is that there are at
least a dozen variants that different people want. I discussed the
return-type issue in my responses. I showed how to get the
'expected' response above using grouper, but also suggested that it is
the wrong basis for splitting strings. Repeated slicing makes more sense
for concrete sequence types.

def seqchunk_odd(s, size):
    # include odd size left over
    for i in range(0, len(s), size):
        yield s[i:i+size]

print(list(seqchunk_odd('ABCDEFG', 3)))
#
['ABC', 'DEF', 'G']

def seqchunk_even(s, size):
    # only include even chunks
    for i in range(0, size*(len(s)//size), size):
        yield s[i:i+size]

print(list(seqchunk_even('ABCDEFG', 3)))
#
['ABC', 'DEF']

def strchunk_fill(s, size, fill):
    # fill odd chunks
    q, r = divmod(len(s), size)
    even = size * q
    for i in range(0, even, size):
        yield s[i:i+size]
    if r:  # only pad when there is a leftover piece
        yield s[even:] + fill * (size - r)

print(list(strchunk_fill('ABCDEFG', 3, 'x')))
#
['ABC', 'DEF', 'Gxx']

Because the 'fill' value is necessarily a sequence for strings,
strchunk_fill would only work for lists and tuples if the fill value
were either required to be given as a tuple or list of length 1 or if it
were internally converted inside the function. Skipping that for now.
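
One way the skipped step might look, as a sketch only: the hypothetical seqchunk_fill below wraps a single fill item in the input's own type so that padding works for lists and tuples as well as strings. It assumes the input type can be constructed from a one-element list, which holds for list and tuple but not for every sequence type.

```python
def seqchunk_fill(s, size, fill):
    """Slice s into chunks of `size`, padding the last one with `fill`."""
    q, r = divmod(len(s), size)
    even = size * q
    for i in range(0, even, size):
        yield s[i:i + size]
    if r:
        # Strings pad with the fill string itself; other sequences get the
        # fill item wrapped in a one-element sequence of the same type.
        pad = fill if isinstance(s, str) else type(s)([fill])
        yield s[even:] + pad * (size - r)

print(list(seqchunk_fill('ABCDEFG', 3, 'x')))          # ['ABC', 'DEF', 'Gxx']
print(list(seqchunk_fill([1, 2, 3, 4, 5, 6, 7], 3, 0)))  # [[1, 2, 3], [4, 5, 6], [7, 0, 0]]
print(list(seqchunk_fill((1, 2, 3, 4), 3, None)))      # [(1, 2, 3), (4, None, None)]
```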

Having written the fill version based on the even version, it is easy to
select among the three behaviors by modifying the fill version.

def strchunk(s, size, fill=NotImplemented):
    # fill odd chunks
    q, r = divmod(len(s), size)
    even = size * q
    for i in range(0, even, size):
        yield s[i:i+size]
    if r and fill is not NotImplemented:
        yield s[even:] + fill * (size - r)

print(*strchunk('ABCDEFG', 3))
print(*strchunk('ABCDEFG', 3, ''))
print(*strchunk('ABCDEFG', 3, 'x'))
#
ABC DEF
ABC DEF G
ABC DEF Gxx

I already described how something similar could be done by checking each
grouper output tuple for a fill value, but that requires that the fill
value be a sentinel that could not otherwise appear in the tuple. One
could modify grouper to fill with a private object() and check the last
item of each group for that sentinel and act accordingly (delete,
truncate, or replace). A generic API needs some thought, though.
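
A rough sketch of that sentinel idea, under stated assumptions: grouper_tail and its tail parameter ('keep', 'drop', 'error') are hypothetical names chosen for illustration, not a proposed API. The private object() can never appear in user data, so finding it in the last position reliably marks an incomplete final group.

```python
from itertools import zip_longest

_MISSING = object()  # private sentinel, cannot occur in caller data

def grouper_tail(iterable, size, tail='keep'):
    """Group into tuples of `size`; `tail` decides what happens to a short last group."""
    if tail not in ('keep', 'drop', 'error'):
        raise ValueError("tail must be 'keep', 'drop' or 'error'")
    for group in zip_longest(*[iter(iterable)] * size, fillvalue=_MISSING):
        if group[-1] is not _MISSING:
            yield group
        elif tail == 'keep':
            # strip the padding and yield the shorter remainder
            yield tuple(x for x in group if x is not _MISSING)
        elif tail == 'error':
            # only detectable once the input is exhausted
            raise ValueError('iterable length is not a multiple of size')
        # tail == 'drop': silently discard the incomplete group

print(list(grouper_tail('ABCDEFG', 3)))
# [('A', 'B', 'C'), ('D', 'E', 'F'), ('G',)]
print(list(grouper_tail('ABCDEFG', 3, tail='drop')))
# [('A', 'B', 'C'), ('D', 'E', 'F')]
```

Because this is a generator, the 'error' mode raises only after all complete groups have been yielded.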

---
An issue I did not previously mention is that people sometimes want
overlapping chunks rather than contiguous disjoint chunks. The slice
approach trivially adapts to that.

def seqlap(s, size):
    for i in range(len(s)-size+1):
        yield s[i:i+size]

print(*seqlap('ABCDEFG', 3))
#
ABC BCD CDE DEF EFG

A sliding window for a generic iterable requires a deque or ring buffer
approach that is quite different from the zip-longest -- grouper approach.
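
A minimal sketch of the deque approach, assuming a bounded collections.deque as the buffer; sliding_window is a name chosen here for illustration. Unlike seqlap, it needs neither len() nor slicing, so it also accepts one-shot iterators.

```python
from collections import deque
from itertools import islice

def sliding_window(iterable, size):
    """Yield overlapping tuples of `size` consecutive items."""
    it = iter(iterable)
    # Prime the buffer; maxlen makes each append drop the oldest item.
    window = deque(islice(it, size), maxlen=size)
    if len(window) == size:
        yield tuple(window)
    for item in it:
        window.append(item)
        yield tuple(window)

windows = list(sliding_window(iter('ABCDEFG'), 3))
print(*(''.join(w) for w in windows))  # ABC BCD CDE DEF EFG
```

An input shorter than the window size yields nothing, matching seqlap.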

--
Terry Jan Reedy





techtonik at gmail

Jul 5, 2012, 6:36 AM

Post #7 of 13 (383 views)
Re: [Python-ideas] itertools.chunks(iterable, size, fill=None) [In reply to]

Before anything else I must apologize for the significant lags in my
replies. I cannot read all of them and hold them in my head, so I reply one
by one as I go, trying not to miss a single point. It
would be much easier to do this in a unified interface for threaded
discussions, but for now there are no capabilities for that in either
Mailman or GMail. And when it turns out that the amount of text is
too big, I spend a lot of time trying to squeeze it down, and then
it becomes pointless to send at all.

Now back on the topic:

On Sun, Jul 1, 2012 at 12:09 AM, Terry Reedy <tjreedy [at] udel> wrote:
> On 6/29/2012 4:32 PM, Georg Brandl wrote:
>>
>> On 26.06.2012 10:03, anatoly techtonik wrote:
>>>
>>> Now that Python 3 is all about iterators (which is a user killer
>>> feature for Python according to StackOverflow -
>>> http://stackoverflow.com/questions/tagged/python) would it be nice to
>>> introduce more first class functions to work with them? One function
>>> to be exact to split string into chunks.
>
> Nothing special about strings.

It seemed so, but it just appeared that grouper recipe didn't work for me.

>>> itertools.chunks(iterable, size, fill=None)
>
> This is a renaming of itertools.grouper in 9.1.2. Itertools Recipes. You
> should have mentioned this. I think of 'blocks' rather than 'chunks', but I
> notice several SO questions with 'chunk(s)' in the title.

I guess `block` gives too low a signal/noise ratio in search results.
That's probably why it is also called chunks in other languages, where
`block` stands for something else (I am thinking of Ruby blocks).

>>> Which is the 33th most voted Python question on SO -
>>>
>>> http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python/312464
>
> I am curious how you get that number. I do note that there are about 15
> other Python SO questions that seem to be variations on the theme. There
> might be more if 'blocks' and 'groups' were searched for.

It's easy:
1. Go http://stackoverflow.com/
2. Search [python]
3. Click `votes` tab
4. Choose `30 per page` at the bottom
5. Jump to the second page, there it is 4th from the top:
http://stackoverflow.com/questions/tagged/python?page=2&sort=votes&pagesize=30

As for duplicates - feel free to mark them as such. SO allows
everybody to do this (unlike Roundup).

>> Anatoly, so far there were no negative votes -- would you care to go
>> another step and propose a patch?
>
> That is because Raymond H. is not reading either list right now ;-)
> Hence the Cc:. Also because I did not yet respond to a vague, very
> incomplete idea.
>
> From Raymond's first message on http://bugs.python.org/issue6021 , add
> grouper:
>
> "This has been rejected before.

I quite often see such arguments, and I can't help repeating that
these are not arguments. It is good to know, but when people use that
as a reason to close tickets - that's just disgusting. To
Raymond's credit, he cares to explain.

> * It is not a fundamental itertool primitive. The recipes section in
> the docs shows a clean, fast implementation derived from zip_longest().

What is the definition of 'fundamental primitive'?
To me, the fact that the top answer for chunking strings on SO has more
than twice as many votes as the itertools versions is a clear 5 sigma
indicator that something is wrong with this Standard Model without a
chunks boson.

> * There is some debate on a correct API for odd lengths. Some people
> want an exception, some want fill-in values, some want truncation, and
> some want a partially filled-in tuple. That alone is reason enough not
> to set one behavior in stone.

use case 3.1: odd lengths exception (CHOOSE ONE)
1. I see that no itertools function throws exceptions, check manually:
len(iterable) / float(size) == len(iterable) // float(size)
2. Explicitly
- itertools.chunks(iterable, size, fill=None)
+ itertools.chunks(iterable, size, fill=None, exception=False)

use case 3.2: fill-in value. It is here (SOLVED)

use case 3.3: truncation
no itertools function supports truncation, so do it manually
chunks(iter, size)[:len(iter)//size]

use case 4: partially filled-in tuple
What should be there?
>>> chunks('ABCDEFG', 3, 'x')
>>> |


More replies and workarounds to some of the raised points are below.

> * There is an issue with having too many itertools. The module taken as
> a whole becomes more difficult to use as new tools are added."

There can be only two reasons for that:
* the chosen basis is bad (many functions that are rarely used or easily emulated)
* the basis is good, but insufficient, because the iterators universe is more
complicated than we think

> This is not to say that the question should not be re-considered. Given the
> StackOverflow experience in addition to that of the tracker and python-list
> (and maybe python-ideas), a special exception might be made in relation to
> points 1 and 3.

--[offtopic about Python enhancements / proposals feedback]--
Yes, without SO I probably wouldn't have triggered this at all, because the
tracker doesn't help with raising importance - there are no votes, no
feature proposals, no "stars". And what I "like" the most is that very
"nice" resolution status - "committed/rejected" - which doesn't say
anything at all. Python list? I try not to disrupt the frequency
there. Python ideas? Too low a participation level for gathering
signals. There are many people who read and support, but don't want to
reply (they don't want to stand out or are just lazy). There are many outside
who don't want to be subscribed at all. There are 2000+ people
spending time at Python conferences all over the world each year, yet we
see only a couple of reactions for every Python idea here. Quite often
there are mistakes and omissions that would be nice to correct, and you
can't. So StackOverflow really helps here, but it is a Q&A tool, which
is still much better than MLs that are solely for chatting,
brainstorming and all the crazy reading / writing stuff. They don't
help to develop ideas collaboratively. Quite often I am just lost in
the amount of text to handle.
--[/offtopic]--

> In regard to point 2: many 'proposals', including Anatoly's, neglect this
> detail. But the function has to do *something* when seqlen % grouplen != 0.
> So an 'idea' is not really a concrete programmable proposal until
> 'something' is specified.
>
> Exception -- not possible for an itertool until the end of the iteration
> (see below). To raise immediately for sequences, one could wrap grouper.
>
> def exactgrouper(sequence, k): # untested
>     if len(sequence) % k:
>         raise ValueError('Sequence length {} must be a multiple of group length {}'
>                          .format(len(sequence), k))
>     else:
>         return grouper(k, sequence)  # grouper is the docs recipe, not an itertools function

Right. An iterator is not a sequence, because it doesn't know the length
of its sequence. The method should not belong to itertools at all
then.

Python 3 has definitely become more complicated. I'd prefer to keep
away from the iterator stuff, but it seems harder with every
iteration.

> Of course, sequences can also be directly sequentially sliced (but should
> the result be an iterable or sequence of blocks?). But we do not have a
> seqtools module and I do not think there should be another method added to
> the seq protocol.

I'd expect strings chunked into strings and lists into lists. Don't
want to know anything about protocols.

> Fill -- grouper always does this, with a default of None.
>
> Truncate, Remainder -- grouper (zip_longest) cannot directly do this and no
> recipes are given in the itertools docs. (More could be, see below.)
>
> Discussions on python-list gives various implementations either for
> sequences or iterables. For the latter, one approach is "it =
> iter(iterable)" followed by repeated islice of the first n items. Another is
> to use a sentinel for the 'fill' to detect a final incomplete block (tuple
> for grouper).
>
> def grouper_x(n, iterable): # untested
>     sentinel = object()
>     for g in grouper(n, iterable, sentinel):
>         if g[-1] is not sentinel:
>             yield g
>         else:
>             # pass to truncate
>             # yield g[:g.index(sentinel)] for remainder
>             # raise ValueError for delayed exception
>             pass

We need a simple function to split a sequence into chunks(). Now we
are faced with the problem of applying that technique to a sequence of
infinite length, at the point when the last element of the infinite
sequence is encountered. You might be thinking now that this is a reduction to
absurdity. But I'd say it is an exit from the trap. Mathematically
this problem can't be solved. I am not ignoring your solution - I
think it's quite feasible, but isn't it an overcomplication?

I mean 160 people out of 149 who upvoted the question are pretty happy
with an answer that just outputs the last chunk as-is:
http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python

chunks('ABCDEFG', 3) --> 'ABC' 'DEF' 'G'

And it is quite a nice solution to me, because you're free to do
anything you'd like if you expect your data to be odd:

for chunk in chunks('ABCDEFG', size):
    if len(chunk) < size:
        raise Tail

You can make a helper iterator out of it too.
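
Such a helper might look like the sketch below. Both names are illustrative: chunks is the slice-based version that emits the last chunk as-is, and strict_chunks is a hypothetical wrapper that raises ValueError (standing in for the undefined Tail above) when the final chunk is short.

```python
def chunks(seq, size):
    """Slice a sequence into pieces of `size`; the last piece may be shorter."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def strict_chunks(seq, size):
    """Hypothetical helper: like chunks(), but reject a short final chunk."""
    for chunk in chunks(seq, size):
        if len(chunk) < size:
            raise ValueError('last chunk has only {} items'.format(len(chunk)))
        yield chunk

print(list(chunks('ABCDEFG', 3)))        # ['ABC', 'DEF', 'G']
print(list(strict_chunks('ABCDEF', 3)))  # ['ABC', 'DEF']
```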

> ---
> The above discussion of point 2 touches on point 4, which Raymond neglected
> in the particular message above but which has come up before: What are the
> allowed input and output types? An idea is not a programmable proposal until
> the domain, range, and mapping are specified.

Domain? Mapping? I am not ignoring existing knowledge and experience.
I just don't want to complicate and don't see appropriate `import
usecase` in current context, so I won't try to guess what this means.

in string -> out list of strings
in list -> out list of lists

> Possible inputs are a specific sequence (string, for instance), any
> sequence, any iterable. Possible outputs are a sequence or iterator of
> sequence or iterator. The various python-list and stackoverflow posts
> questions asks for various combinations. zip_longest and hence grouper takes
> any iterable and returns an iterator of tuples. (An iterator of maps might
> be more useful as a building block.) This is not what one usually wants with
> string input, for instance, nor with range input. To illustrate:

All right. Got it. Sequences have a length and can be sliced with
[i:j]; an iterator can't be sliced (and hence no chunks can be made). So
this function doesn't belong in itertools - it is a missing string or
sequence method. We can't have a chunk with an iterator, because an
iterator over a string decomposes it into a group of pieces with no
reverse function. We can have a group and then join the group into
something. But this requires knowledge of the appropriate join()
function for the iterator, and is probably not efficient. As there is no
such function (it must be that Mapping you referenced above), the
recomposition into chunks is impossible.
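
To make the "appropriate join() function" point concrete, here is a sketch in which the caller supplies that function explicitly. The name regroup and its join parameter are assumptions for illustration; the iterator itself carries no knowledge of the original container type, which is why the caller has to provide it.

```python
from itertools import islice

def regroup(iterable, size, join=tuple):
    """Take `size` items at a time and rebuild each piece with `join`."""
    it = iter(iterable)
    while True:
        piece = list(islice(it, size))
        if not piece:
            return
        yield join(piece)

print(list(regroup(iter('ABCDEFG'), 3, ''.join)))      # ['ABC', 'DEF', 'G']
print(list(regroup(iter([1, 2, 3, 4, 5]), 2, list)))   # [[1, 2], [3, 4], [5]]
```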

> import itertools as it
>
> def grouper(n, iterable, fillvalue=None):
>     "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
>     args = [iter(iterable)] * n
>     return it.zip_longest(*args, fillvalue=fillvalue)
>
> print(*(grouper(3, 'ABCDEFG', 'x'))) # probably not wanted
> print(*(''.join(g) for g in grouper(3, 'ABCDEFG', 'x')))
> #
> ('A', 'B', 'C') ('D', 'E', 'F') ('G', 'x', 'x')
> ABC DEF Gxx
>
> --
> What to do? One could easily write 20 different functions. So more thought
> is needed before adding anything. -1 on the idea as is.

I've learned a new English type of argument - "straw man" (I used to
call this "hijacking"). This -1 doesn't belong to original idea. It
belongs to proposal of itertools.chunks() with a long list of above
points and completely different user stories (i.e. not "split string
into chunks"). I hope you still +1 with 160 people on SO that think
Python needs an easy way to chunk sequences.

> For the doc, I think it would be helpful here and in most module subchapters
> if there were a subchapter table of contents at the top (under 9.1 in this
> case). Even though just 2 lines here (currently, but see below), it would
> let people know that there *is* a recipes section. After the appropriate
> tables, mention that there are example uses in the recipe section. Possibly
> add similar tables in the recipe section.

Unfortunately, it appeared that grouper() is not chunks(). Given a
string as input, it delivers groups of single chars instead of
string chunks.

> Another addition could be a new subsection on grouping (chunking) that would
> discuss post-processing of grouper (as discussed above), as well as other
> recipes, including ones specific to strings and sequences. It would
> essentially be a short how-to. Call it 9.1.3 "Grouping, Blocking, or
> Chunking Sequences and Iterables". The synonyms will help external
> searching. A toc would let people who have found this doc know to look for
> this at the bottom.

This makes matters pretty ugly. In an ideal language there should be
fewer docs, not more.


techtonik at gmail

Jul 5, 2012, 6:47 AM

Post #8 of 13 (379 views)
Re: [Python-ideas] itertools.chunks(iterable, size, fill=None) [In reply to]

On Sun, Jul 1, 2012 at 3:01 PM, Stefan Behnel <stefan_ml [at] behnel> wrote:
>
> To address the main problem of users not finding what they need, what about
> simply extending the docstring of the grouper() function with a sentence
> like this:
>
> "This functionality is also called 'chunking' or 'blocking' and can be used
> for load distribution and sharding."
>
> That would make it easy for users to find what they are looking for when
> they search the page for "chunk". I find that a much more common and less
> ambiguous name than "grouping", which reminds me more of "group by".

It appeared that "chunking" and "grouping" are different kinds of
tasks. You can chunk a sequence (string) by slicing it into smaller
sequences, but you cannot chunk an iterable - you can only group it.

There is a loss of information about the structure when a sequence
(string) becomes an iterator:
chunks ABCDE -> AB CD E
group ABCDE -> A B C D E -> (A B) (C D) (E None)
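
The distinction can be seen side by side in a short sketch. The function names chunk and group are chosen here to mirror the wording above; group assumes the zip_longest-based recipe with a fill value of None.

```python
from itertools import zip_longest

def chunk(seq, size):
    """Slice a sequence: each piece keeps the type of the input."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def group(iterable, size, fill=None):
    """Group any iterable: pieces are tuples, the last one padded with `fill`."""
    return list(zip_longest(*[iter(iterable)] * size, fillvalue=fill))

print(chunk('ABCDE', 2))        # ['AB', 'CD', 'E']
print(group(iter('ABCDE'), 2))  # [('A', 'B'), ('C', 'D'), ('E', None)]
```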


barry at python

Jul 5, 2012, 6:52 AM

Post #9 of 13 (380 views)
Re: [Python-ideas] itertools.chunks(iterable, size, fill=None) [In reply to]

On Jul 05, 2012, at 04:36 PM, anatoly techtonik wrote:

>It would be much easier to do this in unified interface for threaded
>discussions, but for now there is no capabilities for that neither in Mailman
>nor in GMail.

You might like to read the mailing lists via NNTP on Gmane.

Cheers,
-Barry


techtonik at gmail

Jul 5, 2012, 7:33 AM

Post #10 of 13 (378 views)
Re: [Python-ideas] itertools.chunks(iterable, size, fill=None) [In reply to]

On Wed, Jul 4, 2012 at 9:31 PM, Terry Reedy <tjreedy [at] udel> wrote:
> On 7/4/2012 5:57 AM, anatoly techtonik wrote:
>>
>> On Fri, Jun 29, 2012 at 11:32 PM, Georg Brandl <g.brandl [at] gmx> wrote:
>
>
>>> Anatoly, so far there were no negative votes -- would you care to go
>>> another step and propose a patch?
>>
>>
>> Was about to say "no problem",
>
>
> Did you read that there *are* strong negative votes? And that this idea has
> been rejected before? I summarized the objections in my two responses and
> pointed to the tracker issues. One of the objections is that there are 4
> different things one might want if the sequence length is not an even
> multiple of the chunk size. Your original 'idea' did not specify.

I actually meant that there is a problem with proposing a patch, in the
sense of getting a checkout, working on a diff, and sending it as an
attachment to the bug tracker, as the developer guide says.

>> For now the best thing I can do (I don't risk even to mention anything
>> with 3.3) is to copy/paste code from the docs here:
>>
>> from itertools import izip_longest
>> def chunks(iterable, size, fill=None):
>>     """Split an iterable into blocks of fixed-length"""
>>     # chunks('ABCDEFG', 3, 'x') --> ABC DEF Gxx
>>     args = [iter(iterable)] * size
>>     return izip_longest(fillvalue=fill, *args)
>
>
> Python ideas is about Python 3 ideas. Please post Python 3 code.
>
> This is actually a one liner
>
>     return zip_longest(*[iter(iterable)]*size, fillvalue=fill)
>
> We don't generally add such to the stdlib.

Can you figure out from the code what this stuff does?
It doesn't give chunks of strings.

>> BTW, this doesn't work as expected (at least for strings). Expected is:
>> chunks('ABCDEFG', 3, 'x') --> 'ABC' 'DEF' 'Gxx'
>> got:
>> chunks('ABCDEFG', 3, 'x') --> ('A' 'B' 'C') ('D' 'E' 'F') ('G' 'x' 'x')
>
>
> One of the problems with idea of 'add a chunker' is that there are at least
> a dozen variants that different people want.

That's not the problem. People always want something extra. The
problem is that we don't have a real wish distribution. If 1000 people
want chunks and 1 wants groups with an exception, we still count these
as equal variants.

Therefore my idea is deliberately limited to the "string to chunks" user
story and the SO implementation proposal.

> I discussed the return-type issue in my responses. I showed how to get
> the 'expected' response above using grouper, but also suggested that it is
> the wrong basis for splitting strings. Repeated slicing makes more sense
> for concrete sequence types.
>
> def seqchunk_odd(s, size):
>     # include odd size left over
>     for i in range(0, len(s), size):
>         yield s[i:i+size]
>
> print(list(seqchunk_odd('ABCDEFG', 3)))
> #
> ['ABC', 'DEF', 'G']

Right. That's the top answer on SO that people think should be in the
stdlib. Great, we are actually talking about the same thing.

> def seqchunk_even(s, size):
>     # only include even chunks
>     for i in range(0, size*(len(s)//size), size):
>         yield s[i:i+size]
>
> print(list(seqchunk_even('ABCDEFG', 3)))
> #
> ['ABC', 'DEF']

This is deducible from seqchunk_odd(s, size)

> def strchunk_fill(s, size, fill):
>     # fill odd chunks
>     q, r = divmod(len(s), size)
>     even = size * q
>     for i in range(0, even, size):
>         yield s[i:i+size]
>     if r:  # pad only when a partial final chunk is left over
>         yield s[even:] + fill * (size - r)
>
> print(list(strchunk_fill('ABCDEFG', 3, 'x')))
> #
> ['ABC', 'DEF', 'Gxx']

Also deducible from seqchunk_odd(s, size)
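
A rough sketch of what "deducible" could mean here, assuming the slicing generator `seqchunk_odd` quoted above as the only primitive. The bodies below are illustrative rewrites, not code from the thread, and `fill` is assumed to be a one-character string.

```python
def seqchunk_odd(s, size):
    """Yield consecutive slices of s; the last one may be shorter than size."""
    for i in range(0, len(s), size):
        yield s[i:i+size]

def seqchunk_even(s, size):
    """Keep only the full-size chunks produced by seqchunk_odd."""
    return (chunk for chunk in seqchunk_odd(s, size) if len(chunk) == size)

def strchunk_fill(s, size, fill):
    """Pad a short final chunk; full chunks get fill * 0, i.e. nothing."""
    for chunk in seqchunk_odd(s, size):
        yield chunk + fill * (size - len(chunk))

print(list(seqchunk_even('ABCDEFG', 3)))       # ['ABC', 'DEF']
print(list(strchunk_fill('ABCDEFG', 3, 'x')))  # ['ABC', 'DEF', 'Gxx']
```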

> Because the 'fill' value is necessarily a sequence for strings,
> strchunk_fill would only work for lists and tuples if the fill value were
> either required to be given as a tuple or list of length 1 or if it were
> internally converted inside the function. Skipping that for now.
>
> Having written the fill version based on the even version, it is easy to
> select among the three behaviors by modifying the fill version.
>
> def strchunk(s, size, fill=NotImplemented):
>     # fill odd chunks unless no fill value was given
>     q, r = divmod(len(s), size)
>     even = size * q
>     for i in range(0, even, size):
>         yield s[i:i+size]
>     if r and fill is not NotImplemented:
>         yield s[even:] + fill * (size - r)
>
> print(*strchunk('ABCDEFG', 3))
> print(*strchunk('ABCDEFG', 3, ''))
> print(*strchunk('ABCDEFG', 3, 'x'))
> #
> ABC DEF
> ABC DEF G
> ABC DEF Gxx

I now don't even think that a fill value is needed as an argument.
if len(chunk) < size:
    chunk.extend([fill] * (size - len(chunk)))

> I already described how something similar could be done by checking each
> grouper output tuple for a fill value, but that requires that the fill value
> be a sentinel that could not otherwise appear in the tuple. One could modify
> grouper to fill with a private object() and check the last item of each
> group for that sentinel and act accordingly (delete, truncate, or replace).
> A generic api needs some thought, though.
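
The sentinel idea in the quoted paragraph could look roughly like this. It is a sketch of the "truncate" behaviour only; the names `_MISSING` and `grouper_truncate` are made up for illustration and do not come from the thread or the itertools docs.

```python
from itertools import zip_longest

_MISSING = object()  # private sentinel; it cannot appear in caller data

def grouper_truncate(iterable, size):
    """Yield tuples of up to size items; the last tuple is truncated, not padded."""
    args = [iter(iterable)] * size
    for group in zip_longest(*args, fillvalue=_MISSING):
        if group[-1] is _MISSING:
            # Only the final group can contain the sentinel: drop the padding.
            group = tuple(item for item in group if item is not _MISSING)
        yield group

print(list(grouper_truncate('ABCDEFG', 3)))
# [('A', 'B', 'C'), ('D', 'E', 'F'), ('G',)]
print(list(grouper_truncate([1, None, 3, None], 3)))
# [(1, None, 3), (None,)]
```

Because the sentinel is compared by identity, values such as `None` in the input survive untouched, which a `fill=None` default could not guarantee.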

I just need to chunk strings and sequences. A generic API is too complex
without enumerating all the use cases and iterating over them.

> An issue I did not previously mention is that people sometimes want
> overlapping chunks rather than contiguous disjoint chunks. The slice
> approach trivially adapts to that.
>
> def seqlap(s, size):
>     for i in range(len(s)-size+1):
>         yield s[i:i+size]
>
> print(*seqlap('ABCDEFG', 3))
> #
> ABC BCD CDE DEF EFG
>
> A sliding window for a generic iterable requires a deque or ring buffer
> approach that is quite different from the zip-longest -- grouper approach.
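
For comparison, one possible shape of the deque approach for an arbitrary iterable (one that supports neither `len()` nor slicing). This is a sketch under the assumption that windows are returned as tuples; the name `sliding_window` is illustrative and not taken from the thread.

```python
from collections import deque
from itertools import islice

def sliding_window(iterable, size):
    """Yield overlapping tuples of length size from any iterable."""
    it = iter(iterable)
    # Prime the window with the first `size` items.
    window = deque(islice(it, size), maxlen=size)
    if len(window) == size:
        yield tuple(window)
    for item in it:
        window.append(item)  # maxlen discards the oldest item automatically
        yield tuple(window)

print(*(''.join(w) for w in sliding_window(iter('ABCDEFG'), 3)))
# ABC BCD CDE DEF EFG
```

An input shorter than `size` yields nothing, matching the behaviour of `seqlap` above.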

That's why I'd like to drastically reduce the scope of the proposal.
itertools doesn't seem to be the best place anymore. How about a
sequence method?

string.chunks(size) -> ABC DEF G
list.chunks(size) -> [A,B,C], [D,E,F], [G]

If somebody needs a keyword argument - this can come later without
breaking compatibility.


steve at pearwood

Jul 5, 2012, 8:57 AM

Post #11 of 13 (389 views)
Permalink
Re: [Python-ideas] itertools.chunks(iterable, size, fill=None) [In reply to]

anatoly techtonik wrote:
> On Wed, Jul 4, 2012 at 9:31 PM, Terry Reedy <tjreedy [at] udel> wrote:

>> A sliding window for a generic iterable requires a deque or ring buffer
>> approach that is quite different from the zip-longest -- grouper approach.
>
> That's why I'd like to drastically reduce the scope of the proposal.
> itertools doesn't seem to be the best place anymore. How about a
> sequence method?
>
> string.chunks(size) -> ABC DEF G
> list.chunks(size) -> [A,B,C], [D,E,F], [G]

-1

This is a fairly trivial problem to solve, and there are many variations on
it. Many people will not find the default behaviour helpful, and will need to
write their own. Why complicate the API for all sequence types with this?

I don't believe that we should enshrine one variation as a built-in method,
without any evidence that it is the most useful or common variation. Even if
there is one variation far more useful than the others, that doesn't
necessarily mean we ought to make it a builtin method unless it is a
fundamental sequence operation, has wide applicability, and is genuinely hard
to write. I don't believe chunking meets *any* of those criteria, let alone
all three.

Not every six line function needs to be a builtin.

I believe that splitting a sequence (or a string) into fixed-size chunks is
more of a programming exercise problem than a genuinely useful tool. That does
not mean that there are never any real use cases for splitting into fixed-size
chunks, only that this is a function that *seems* more useful in theory than
it turns out to be in practice.

Compare this with more useful sequence/iteration tools, like (say) zip. You
can hardly write a hundred lines of code without using zip at least once. But
I bet you can write tens of thousands of lines of code without needing to
split sequences into fixed chunks like this.

Besides, the name "chunks" is more general than how you are using it. For
example, I consider chunking to be splitting a sequence up at various
delimiters or separators, not at fixed character positions. E.g. "the third
word of item two of the fourth line" is a chunk.

This fits more with the non-programming use of the term chunk or chunking, and
has precedent in Apple's HyperTalk language, which literally allowed you to
talk about words, items and lines of text, each of which is described as a chunk.
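
To illustrate that delimiter-based notion of a chunk, here is a rough sketch. It assumes lines are newline-separated, items comma-separated (HyperTalk's default item delimiter) and words whitespace-separated; the function name `chunk_of` and the sample text are made up for the example.

```python
def chunk_of(text, line, item, word):
    """Return the word-th word of the item-th item of the line-th line (1-based)."""
    line_text = text.splitlines()[line - 1]
    item_text = line_text.split(',')[item - 1]
    return item_text.split()[word - 1]

sample = "a b c, d e f\ng h i, j k l\nm n o, p q r\ns t u, v w x"
# "the third word of item two of the fourth line"
print(chunk_of(sample, line=4, item=2, word=3))  # x
```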

This might be a good candidate for a utility module made up of assorted useful
functions, but not for the string and sequence APIs.



--
Steven



steve at pearwood

Jul 5, 2012, 9:09 AM

Post #12 of 13 (378 views)
Permalink
Re: [Python-ideas] itertools.chunks(iterable, size, fill=None) [In reply to]

anatoly techtonik wrote:
>>>> Which is the 33rd most voted Python question on SO -
>>>>
>>>> http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python/312464
>> I am curious how you get that number. I do note that there are about 15
>> other Python SO questions that seem to be variations on the theme. There
>> might be more if 'blocks' and 'groups' were searched for.
>
> It's easy:
> 1. Go to http://stackoverflow.com/
> 2. Search [python]
> 3. Click `votes` tab
> 4. Choose `30 per page` at the bottom
> 5. Jump to the second page, there it is 4th from the top:
> http://stackoverflow.com/questions/tagged/python?page=2&sort=votes&pagesize=30

Yes. I don't think this is particularly significant. Have a look at some of
the questions with roughly the same number of votes:

#26 "How can I remove (chomp) a newline in Python?" 176 votes

#33 "How do you split a list into evenly sized chunks in Python?" 149 votes

#36 "Accessing the index in Python for loops" 144 votes


Being the 33rd most voted question doesn't really mean much.


By the way, why is this discussion going to both python-dev and python-ideas?



--
Steven


stefan_ml at behnel

Jul 5, 2012, 9:50 AM

Post #13 of 13 (377 views)
Permalink
Re: [Python-ideas] itertools.chunks(iterable, size, fill=None) [In reply to]

anatoly techtonik, 05.07.2012 15:36:
> On Sun, Jul 1, 2012 at 12:09 AM, Terry Reedy wrote:
>> From Raymond's first message on http://bugs.python.org/issue6021 , add
>> grouper:
>>
>> "This has been rejected before.
>
> I quite often see such arguments and I can't stand to repeat that
> these are not arguments. It is good to know, but when people use that
> as a reason to close tickets - that's just disgusting.

The *real* problem is that people keep bringing up topics (and even spell
them out in the bug tracker) without searching for existing discussions
and/or tickets first. That's why those who do such a search (or who know
what they are talking about anyway) close these tickets with the remark
"this has been rejected before", instead of repeating an entire heap of
arguments all over again to feed a discussion that would only lead to the
same result as it did before, often several times before.

Stefan
