
Mailing List Archive: Python: Bugs

[issue15216] Support setting the encoding on a text stream after creation

 

 



report at bugs

Aug 2, 2012, 9:11 AM


Changes by Atsuo Ishimoto <ishimoto [at] gembook>:


----------
nosy: +ishimoto

_______________________________________
Python tracker <report [at] bugs>
<http://bugs.python.org/issue15216>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com


report at bugs

Aug 5, 2012, 3:52 AM


Martin v. Löwis added the comment:

I fail to see why this is a release blocker; no rationale is given in the original message, nor in the quoted message. So unblocking.

----------
nosy: +loewis
priority: release blocker -> normal



report at bugs

Aug 6, 2012, 6:09 PM


Changes by INADA Naoki <songofacandy [at] gmail>:


----------
nosy: +naoki



report at bugs

Aug 6, 2012, 7:01 PM


STINNER Victor added the comment:

Here is a Python implementation of TextIOWrapper.set_encoding().

The main limitation is that it is not possible to set the encoding on a non-seekable stream after the first read (if the read buffer is not empty, i.e. if there are pending decoded characters).

+        # flush read buffer, may require to seek backward in the underlying
+        # file object
+        if self._decoded_chars:
+            if not self.seekable():
+                raise UnsupportedOperation(
+                    "It is not possible to set the encoding "
+                    "of a non seekable file after the first read")
+            assert self._snapshot is not None
+            dec_flags, next_input = self._snapshot
+            offset = self._decoded_chars_used - len(next_input)
+            if offset:
+                self.buffer.seek(offset, SEEK_CUR)
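The read-ahead behaviour behind this limitation can be observed on an in-memory stream. The following is only an illustrative sketch, not part of the patch; the helper name `bytes_consumed_after_first_line` is made up for this example. It shows that a single `readline()` makes the wrapper pull far more than one line out of the underlying buffer, which is why a non-seekable stream cannot be "rewound" once reading has started.

```python
import io


def bytes_consumed_after_first_line(data, encoding="utf-8"):
    """Return (first_line, raw_position) after a single readline() call.

    Hypothetical helper for illustration only.
    """
    raw = io.BytesIO(data)
    text = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    first_line = text.readline()
    # The wrapper reads a whole chunk from the buffer, not just one line,
    # so the underlying position is already past the first line.
    position = raw.tell()
    text.detach()  # leave `raw` open when the wrapper goes away
    return first_line, position


line, pos = bytes_consumed_after_first_line(b"line1\nline2\n")
print(repr(line), pos)
```

Only six bytes belong to the first line, yet the underlying position reported here is the end of the data: the remaining text is sitting, already decoded, in the wrapper's internal buffer.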

--

I don't have a use case for setting the encoding of sys.stdout/stderr after startup, but I would like to be able to control the *error handler* after startup! My implementation permits to change both (encoding, errors, encoding and errors).

For example, Lib/test/regrtest.py uses the following function to force the backslashreplace error handler on sys.stdout. It changes the error handler to avoid UnicodeEncodeError when displaying the result of tests.

def replace_stdout():
    """Set stdout encoder error handler to backslashreplace (as stderr error
    handler) to avoid UnicodeEncodeError when printing a traceback"""
    import atexit

    stdout = sys.stdout
    sys.stdout = open(stdout.fileno(), 'w',
        encoding=stdout.encoding,
        errors="backslashreplace",
        closefd=False,
        newline='\n')

    def restore_stdout():
        sys.stdout.close()
        sys.stdout = stdout
    atexit.register(restore_stdout)

The doctest module uses another trick to change the error handler:

save_stdout = sys.stdout
if out is None:
    encoding = save_stdout.encoding
    if encoding is None or encoding.lower() == 'utf-8':
        out = save_stdout.write
    else:
        # Use backslashreplace error handling on write
        def out(s):
            s = str(s.encode(encoding, 'backslashreplace'), encoding)
            save_stdout.write(s)
sys.stdout = self._fakeout
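Both workarounds have the same observable effect, which can be checked without touching the real sys.stdout. The sketch below uses an in-memory buffer in place of the terminal; `write_with_handler` and `doctest_style_escape` are hypothetical names chosen for this example and do not appear in regrtest or doctest.

```python
import io


def write_with_handler(text, encoding, errors):
    """Write `text` through a TextIOWrapper and return the bytes produced."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors,
                              newline='\n')
    stream.write(text)
    stream.flush()
    data = raw.getvalue()
    stream.detach()  # keep `raw` independent of the wrapper's lifetime
    return data


def doctest_style_escape(text, encoding):
    """Pre-escape unencodable characters, as the doctest snippet does."""
    return str(text.encode(encoding, 'backslashreplace'), encoding)


# regrtest style: the stream itself uses backslashreplace.
print(write_with_handler('caf\xe9', 'ascii', 'backslashreplace'))

# doctest style: the text is escaped before it reaches a strict stream.
escaped = doctest_style_escape('caf\xe9', 'ascii')
print(write_with_handler(escaped, 'ascii', 'strict'))
```

In both cases the non-ASCII character reaches the output as a backslash escape instead of raising UnicodeEncodeError; writing the unescaped text to the strict stream would fail.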

----------
keywords: +patch
nosy: +haypo
Added file: http://bugs.python.org/file26715/set_encoding.patch



report at bugs

Aug 6, 2012, 7:04 PM


STINNER Victor added the comment:

> That will be fragile. A bit of premature input or output
> (for whatever reason) and your program breaks.

I agree that it is not the most pure solution, but sometimes practicality beats purity (rule #9) ;-) We can add an ugly big red warning in the doc ;-)

----------



report at bugs

Aug 6, 2012, 7:09 PM


STINNER Victor added the comment:

> My implementation permits to change both (encoding, errors, encoding and errors).

We may also add a set_errors() method:

def set_errors(self, errors):
    self.set_encoding(self.encoding, errors)

----------



report at bugs

Aug 6, 2012, 10:34 PM


Nick Coghlan added the comment:

The reason I marked this as a release blocker for 3.4 is because it's a key piece of functionality for writing command line apps which accept an encoding argument. I'll use "high" instead.

An interesting proposal was posted to the python-ideas thread [1]: using self.detach() and self.__init__() to reinitialise the wrapper *in-place*.

With that approach, the pure Python version of set_encoding() would look something like this:

_sentinel = object()
def set_encoding(self, encoding=_sentinel, errors=_sentinel):
    if encoding is _sentinel:
        encoding = self.encoding
    if errors is _sentinel:
        errors = self.errors
    # Keyword arguments: positionally, newline comes before line_buffering
    self.__init__(self.detach(),
                  encoding, errors,
                  newline=self._readnl,
                  line_buffering=self._line_buffering,
                  write_through=self._write_through)

(The pure Python version currently has no self._write_through attribute - see #15571)

Note that this approach addresses my main concern with the use of detach() for this: because the wrapper is reinitialised in place, old references (such as the sys.__std*__ attributes) will also see the change.

Yes, such a function would need a nice clear warning to say "Calling this may cause data loss or corruption if used without due care and attention", but it should *work*. (Without automatic garbage collection, the C implementation would need an explicit internal "reinitialise" function rather than being able to just use the existing init function directly, but that shouldn't be a major problem).
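The concern about old references can be made concrete with the plain detach-and-rewrap approach, which uses only the public io API. This is a sketch for illustration; `rewrap` is a hypothetical helper, not the proposed set_encoding(). It creates a *new* wrapper, so any other name still bound to the old one (the role played by sys.__stdout__) is left pointing at a detached, unusable object.

```python
import io


def rewrap(stream, encoding=None, errors=None):
    """Return a new TextIOWrapper over the same buffer (hypothetical helper).

    The old wrapper is detached and can no longer be used afterwards.
    """
    encoding = stream.encoding if encoding is None else encoding
    errors = stream.errors if errors is None else errors
    line_buffering = stream.line_buffering
    buffer = stream.detach()  # flushes pending text, then detaches
    return io.TextIOWrapper(buffer, encoding=encoding, errors=errors,
                            line_buffering=line_buffering)


raw = io.BytesIO()
old = io.TextIOWrapper(raw, encoding='ascii')
alias = old                      # stands in for sys.__stdout__
old.write('abc')

new = rewrap(old, encoding='utf-8')
new.write('\xe9')
new.flush()
print(raw.getvalue())

try:
    alias.write('x')             # the old reference is now unusable
except ValueError as exc:
    print('old reference broken:', exc)
```

Reinitialising the existing object in place, as proposed above, avoids this because `alias` and `new` would be the same object.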

[1] http://mail.python.org/pipermail/python-ideas/2012-August/015898.html

----------
priority: normal -> high



report at bugs

Aug 8, 2012, 6:25 PM


Changes by rurpy the second <rurpy [at] yahoo>:


----------
nosy: +rurpy2



report at bugs

Aug 8, 2012, 7:08 PM


Nick Coghlan added the comment:

To bring back Victor's comments from the list:

- stdout/stderr are fairly easy to handle, since the underlying buffers can be flushed before switching the encoding and error settings. Yes, there's a risk of creating mojibake, but that's unavoidable and, for this use case, trumped by the pragmatic need to support overriding the output encoding in a robust fashion (i.e. not breaking sys.__stdout__ or sys.__stderr__, and not crashing if something else displays output during startup, for example, when running under "python -v")

- stdin is more challenging, since it isn't entirely clear yet how to handle the case where data is already buffered internally. Victor proposes that it's acceptable to simply disallow changing the encoding of a stream that isn't seekable. My feeling is that such a restriction would largely miss the point, since the original use case that prompted the creation of this was shell pipeline processing, where stdin will often be a PIPE

I think the guiding use case here really needs to be this one: "How do I implement the equivalent of 'iconv' as a Python 3 script, without breaking internal interpreter state invariants?"
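For comparison, an iconv-like filter can already be written by wrapping the binary streams directly rather than changing the encoding of sys.stdin in place. The sketch below is one possible shape for that, with a hypothetical `transcode` function; it deliberately avoids the in-place problem discussed in this issue, so it is not a substitute for set_encoding().

```python
import io


def transcode(src, dst, from_encoding, to_encoding,
              errors='strict', chunk_size=8192):
    """Copy text from binary stream `src` to binary stream `dst`,
    decoding with `from_encoding` and encoding with `to_encoding`.
    """
    # newline='' disables newline translation in both directions.
    reader = io.TextIOWrapper(src, encoding=from_encoding, errors=errors,
                              newline='')
    writer = io.TextIOWrapper(dst, encoding=to_encoding, errors=errors,
                              newline='')
    try:
        while True:
            chunk = reader.read(chunk_size)
            if not chunk:
                break
            writer.write(chunk)
        writer.flush()
    finally:
        # Detach so that the caller's binary streams stay open.
        reader.detach()
        writer.detach()


src = io.BytesIO('h\xe9llo\r\n'.encode('latin-1'))
dst = io.BytesIO()
transcode(src, dst, 'latin-1', 'utf-8')
print(dst.getvalue())
```

In a script, `sys.stdin.buffer` and `sys.stdout.buffer` could be passed as `src` and `dst`, provided nothing has read from sys.stdin beforehand.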

My current thought is that, instead of seeking, the input case can better be handled by manipulating the read ahead buffer directly. Something like (for the pure Python version):

self._encoding = new_encoding
if self._decoder is not None:
    old_data = self._get_decoded_chars().encode(old_encoding)
    old_data += self._decoder.getstate()[0]
    decoder = self._get_decoder()
    new_chars = ''
    if old_data:
        new_chars = decoder.decode(old_data)
    self._set_decoded_chars(new_chars)
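The same idea can be tried outside TextIOWrapper with the incremental decoders from the codecs module. This is a standalone sketch under the assumption that the old encoding round-trips losslessly; `redecode_pending` is a hypothetical helper and not part of any patch.

```python
import codecs


def redecode_pending(pending_chars, old_decoder, old_encoding,
                     new_encoding, errors='strict'):
    """Re-encode already-decoded characters with the old encoding, add the
    bytes the old decoder was still holding, and decode everything again
    with the new encoding. Returns (new_chars, new_decoder).
    """
    old_data = pending_chars.encode(old_encoding)
    old_data += old_decoder.getstate()[0]   # undecoded trailing bytes
    new_decoder = codecs.getincrementaldecoder(new_encoding)(errors)
    new_chars = new_decoder.decode(old_data) if old_data else ''
    return new_chars, new_decoder


# The stream was wrongly read as latin-1; the bytes were really UTF-8.
old_decoder = codecs.getincrementaldecoder('latin-1')()
pending = old_decoder.decode(b'\xc3\xa9')        # two latin-1 characters
chars, decoder = redecode_pending(pending, old_decoder, 'latin-1', 'utf-8')
print(repr(pending), '->', repr(chars))
```

Here the two wrongly decoded characters turn back into the single character the UTF-8 bytes stood for. The approach depends on the old encoding being able to reproduce the original bytes exactly.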

(A similar mechanism could actually be used to support an "initial_data" parameter to TextIOWrapper, which would help in general encoding detection situations where changing encoding *in-place* isn't needed, but the application would like an easy way to "put back" the initial data for inclusion in the text stream without making assumptions about the underlying buffer implementation)

Also, StringIO should implement this new API as a no-op.

----------



report at bugs

Aug 9, 2012, 1:31 AM


STINNER Victor added the comment:

> Victor proposes that it's acceptable to simply disallow changing the encoding of a stream that isn't seekable.

That is not what I said. My patch raises an exception only if you have
already started to read stdin. It should work fine if stdin is a pipe but
the read buffer is empty.

----------



report at bugs

Aug 9, 2012, 4:25 AM


Nick Coghlan added the comment:

Ah, you're right - peeking into the underlying buffer would be enough to
handle encoding detection.

----------



report at bugs

Aug 9, 2012, 1:42 PM


STINNER Victor added the comment:

Oh, set_encoding.patch is wrong:

+ offset = self._decoded_chars_used - len(next_input)

self._decoded_chars_used is a number of Unicode characters, whereas len(next_input) is a number of bytes. The computation only works with 7- and 8-bit encodings like ascii or latin1, but not with multibyte encodings like utf8 or ucs-4.
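The mismatch is easy to see by comparing the two lengths for the same text under different encodings. A small sketch follows; the helper name `char_and_byte_lengths` is made up for this example.

```python
def char_and_byte_lengths(text, encoding):
    """Return (number of characters, number of encoded bytes)."""
    return len(text), len(text.encode(encoding))


for enc in ('latin-1', 'utf-8', 'utf-32-le'):
    print(enc, char_and_byte_lengths('h\xe9llo', enc))
```

Only for the single-byte encoding do the two counts agree, so subtracting a byte count from a character count gives a meaningless seek offset in the other cases.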

> peeking into the underlying buffer would be enough to
> handle encoding detection.

I wrote a new patch using this idea. It does not work (yet?) with non-seekable streams. The raw read buffer (bytes string) is not stored in the _snapshot attribute if the stream is not seekable. It may be changed to solve this issue.

set_encoding-2.patch is still a work-in-progress. It does not patch the _io module for example.

----------
Added file: http://bugs.python.org/file26750/set_encoding-2.patch



report at bugs

Aug 9, 2012, 1:46 PM


STINNER Victor added the comment:

Note: it is not possible to reencode the buffer of decoded characters to compute the offset in bytes. Some codecs are not bijective.

Examples:

* b'\x00'.decode('utf7').encode('utf7') == b'+AAA-'
* b'\xff'.decode('ascii', 'replace').encode('ascii', 'replace') == b'?'
* b'\xff'.decode('ascii', 'ignore').encode('ascii', 'ignore') == b''
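These examples can be wrapped in a small check that reports whether a given byte string survives a decode/encode round trip. The function name `roundtrips` is hypothetical and only serves this sketch.

```python
def roundtrips(data, encoding, errors='strict'):
    """Return True if decoding then re-encoding reproduces `data` exactly."""
    return data.decode(encoding, errors).encode(encoding, errors) == data


samples = [
    (b'abc', 'utf-8', 'strict'),
    (b'\x00', 'utf7', 'strict'),
    (b'\xff', 'ascii', 'replace'),
    (b'\xff', 'ascii', 'ignore'),
]
for data, encoding, errors in samples:
    print(data, encoding, errors, roundtrips(data, encoding, errors))
```

Whenever the check is false, the number of re-encoded bytes says nothing reliable about how many bytes were originally consumed from the underlying stream.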

----------

