csv and mixed lists of unicode and numbers

 

 

nulla.epistola at web

Nov 24, 2009, 8:42 AM

Post #1 of 8 (386 views)

Hello,

I want to put data from a database into a tab separated text file. This
looks like a typical application for the csv module, but there is a
snag: the rows I get from the database module (kinterbasdb in this case)
contain unicode objects and numbers. And of course the unicode objects
contain lots of non-ascii characters.

If I try to use csv.writer as is, I get UnicodeEncodeErrors. If I use
the UnicodeWriter from the module documentation, I get TypeErrors with
the numbers. (I'm using Python 2.6 - upgrading to 3.1 on this machine
would cause other complications.)

So do I have to process the rows myself and treat numbers and text
fields differently? Or what's the best way?

Here is a small example:

########################################################################
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import csv, codecs, cStringIO
import tempfile

cData = [u'Ärger', u'Ödland', 5, u'Süßigkeit', u'élève', 6.9, u'forêt']

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

def writewithcsv(outfile, datalist):
    wrt = csv.writer(outfile, dialect=csv.excel)
    wrt.writerow(datalist)

def writeunicode(outfile, datalist):
    wrt = UnicodeWriter(outfile)
    wrt.writerow(datalist)

def main():
    with tempfile.NamedTemporaryFile() as csvfile:
        print "CSV file:", csvfile.name
        print "Try with csv.writer"
        try:
            writewithcsv(csvfile, cData)
        except UnicodeEncodeError as e:
            print e
        print "Try with UnicodeWriter"
        writeunicode(csvfile, cData)
    print "Ready."

if __name__ == "__main__":
    main()


##############################################################################

Hoping for advice,

Sibylle
--
http://mail.python.org/mailman/listinfo/python-list


benjamin.kaplan at case

Nov 24, 2009, 9:50 AM

Post #2 of 8 (376 views)
Re: csv and mixed lists of unicode and numbers [In reply to]

On Tue, Nov 24, 2009 at 11:42 AM, Sibylle Koczian <nulla.epistola [at] web> wrote:
> class UnicodeWriter:
>    """
>    A CSV writer which will write rows to CSV file "f",
>    which is encoded in the given encoding.
>    """
>
>    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
>        # Redirect output to a queue
>        self.queue = cStringIO.StringIO()
>        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
>        self.stream = f
>        self.encoder = codecs.getincrementalencoder(encoding)()
>
>    def writerow(self, row):
>        self.writer.writerow([s.encode("utf-8") for s in row])

Try [s.encode("utf-8") if isinstance(s, unicode) else s for s in row] here instead.
That way, only the unicode strings are encoded; the numbers are passed
through unchanged and csv.writer converts them itself.
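
A minimal sketch of that suggestion as a standalone helper. The name `encode_text_fields` and the `text_type` alias are illustrative assumptions, not part of the csv module; the alias is only there so the same lines run on Python 2 (where text is `unicode`) and on Python 3 (where it is `str`).

```python
# -*- coding: utf-8 -*-
try:
    text_type = unicode  # Python 2: text fields are unicode objects
except NameError:
    text_type = str      # Python 3: text fields are str objects


def encode_text_fields(row, encoding="utf-8"):
    """Encode only the text fields of a row; leave numbers untouched."""
    return [s.encode(encoding) if isinstance(s, text_type) else s for s in row]


encoded = encode_text_fields([u"Ärger", 5, 6.9])
```

On Python 2.6 the resulting list can be handed to csv.writer as is. On Python 3 the csv module expects str fields, so this encoding step would not be needed there.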




__peter__ at web

Nov 24, 2009, 11:04 AM

Post #3 of 8 (377 views)
Re: csv and mixed lists of unicode and numbers [In reply to]

Sibylle Koczian wrote:

> I want to put data from a database into a tab separated text file. This
> looks like a typical application for the csv module, but there is a
> snag: the rows I get from the database module (kinterbasdb in this case)
> contain unicode objects and numbers. And of course the unicode objects
> contain lots of non-ascii characters.
>
> If I try to use csv.writer as is, I get UnicodeEncodeErrors. If I use
> the UnicodeWriter from the module documentation, I get TypeErrors with
> the numbers. (I'm using Python 2.6 - upgrading to 3.1 on this machine
> would cause other complications.)
>
> So do I have to process the rows myself and treat numbers and text
> fields differently? Or what's the best way?

I'd preprocess the rows as I tend to prefer the simplest approach I can come
up with. Example:

import csv
import sys

def recode_rows(rows, source_encoding, target_encoding):
    def recode(field):
        if isinstance(field, unicode):
            # unicode text: encode directly
            return field.encode(target_encoding)
        elif isinstance(field, str):
            # byte string: decode from the source encoding first
            return unicode(field, source_encoding).encode(target_encoding)
        # anything else (numbers etc.): convert to text, then encode
        return unicode(field).encode(target_encoding)

    return (map(recode, row) for row in rows)

rows = [[1.23], [u"äöü"], [u"ÄÖÜ".encode("latin1")], [1, 2, 3]]
writer = csv.writer(sys.stdout)
writer.writerows(recode_rows(rows, "latin1", "utf-8"))

The only limitation I can see: target_encoding probably has to be a superset
of ASCII.
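
One way to see where that limitation comes from, as far as I understand it: in Python 2 csv.writer emits the delimiter and line ending as plain single bytes between the already encoded fields. The sketch below imitates that with a hypothetical helper, `join_fields`, which is an assumption for illustration and not how the csv module is implemented.

```python
def join_fields(fields, encoding):
    """Join separately encoded fields with a single ASCII comma byte."""
    return b",".join(field.encode(encoding) for field in fields)


utf8_line = join_fields([u"a", u"b"], "utf-8")
utf16_line = join_fields([u"a", u"b"], "utf-16-le")

# The UTF-8 line is still valid UTF-8 after the comma byte was added.
utf8_text = utf8_line.decode("utf-8")

# The UTF-16 line is not: the one-byte comma shifts the two-byte units.
try:
    utf16_line.decode("utf-16-le")
    utf16_ok = True
except UnicodeDecodeError:
    utf16_ok = False
```

With an ASCII-compatible target such as UTF-8 or Latin-1 the separator bytes mean the same thing in the target encoding, so the output stays consistent.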

Peter



nulla.epistola at web

Nov 24, 2009, 1:01 PM

Post #4 of 8 (376 views)
Re: csv and mixed lists of unicode and numbers [In reply to]

Peter Otten wrote:
> I'd preprocess the rows as I tend to prefer the simplest approach I can come
> up with. Example:
>
> def recode_rows(rows, source_encoding, target_encoding):
>     def recode(field):
>         if isinstance(field, unicode):
>             return field.encode(target_encoding)
>         elif isinstance(field, str):
>             return unicode(field, source_encoding).encode(target_encoding)
>         return unicode(field).encode(target_encoding)
>
>     return (map(recode, row) for row in rows)
>

For this case isinstance really seems to be quite reasonable. And it was
silly of me not to think of sys.stdout as file object for the example!

> rows = [[1.23], [u"äöü"], [u"ÄÖÜ".encode("latin1")], [1, 2, 3]]
> writer = csv.writer(sys.stdout)
> writer.writerows(recode_rows(rows, "latin1", "utf-8"))
>
> The only limitation I can see: target_encoding probably has to be a superset
> of ASCII.
>

Coping with umlauts and accents is quite enough for me.

This problem really goes away with Python 3 (tried it on another
machine), but something else changes too: in Python 2.6 the
documentation for the csv module explicitly says "If csvfile is a file
object, it must be opened with the ‘b’ flag on platforms where that
makes a difference." The documentation for Python 3.1 doesn't have this
sentence, and if I open the file that way in Python 3.1, I get the following
error for all sorts of data, even for a list with only one integer literal:

TypeError: must be bytes or buffer, not str

I don't really understand that.

Regards,
Sibylle


__peter__ at web

Nov 24, 2009, 2:02 PM

Post #5 of 8 (374 views)
Re: csv and mixed lists of unicode and numbers [In reply to]

Sibylle Koczian wrote:

> This problem really goes away with Python 3 (tried it on another
> machine), but something else changes too: in Python 2.6 the
> documentation for the csv module explicitly says "If csvfile is a file
> object, it must be opened with the ‘b’ flag on platforms where that
> makes a difference." The documentation for Python 3.1 doesn't have this
> sentence, and if I do that in Python 3.1 I get for all sorts of data,
> even for a list with only one integer literal:
>
> TypeError: must be bytes or buffer, not str

Read the documentation for open() at

http://docs.python.org/3.1/library/functions.html#open

There are significant changes with respect to 2.x; you won't even get a file
object anymore:

>>> open("tmp.txt", "w")
<_io.TextIOWrapper name='tmp.txt' encoding='UTF-8'>
>>> _.write("yadda")
5
>>> open("tmp.dat", "wb")
<_io.BufferedWriter name='tmp.dat'>
>>> _.write("yadda")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: write() argument 1 must be bytes or buffer, not str
>>> open("tmp.dat", "wb").write(b"yadda")
5

If you specify the "b" flag in 3.x the write() method expects bytes, not
str. The translation of newlines is now controlled by the "newline"
argument.
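
For Python 3, a minimal sketch of the text-mode approach described above: the file is opened without the "b" flag, the encoding is given to open(), and `newline=""` leaves the line endings to the csv module. The temporary file and the `excel_tab` dialect are choices made for this illustration only; the row is the sample data from the original post.

```python
import csv
import os
import tempfile

row = ["Ärger", "Ödland", 5, "Süßigkeit", "élève", 6.9, "forêt"]

# Create a scratch file; the path is only used within this example.
fd, path = tempfile.mkstemp(suffix=".tsv")
os.close(fd)
try:
    # Text mode: the csv module writes str, open() does the encoding.
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f, dialect=csv.excel_tab).writerow(row)
    # Read the raw bytes back to check what ended up on disk.
    with open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)

text = raw.decode("utf-8")
```

Text and numbers can be mixed freely here, because the writer converts the numbers to str and the file object encodes everything in one place.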

Peter



tjreedy at udel

Nov 24, 2009, 2:55 PM

Post #6 of 8 (375 views)
Re: csv and mixed lists of unicode and numbers [In reply to]

Sibylle Koczian wrote:
> This problem really goes away with Python 3 (tried it on another
> machine), but something else changes too: in Python 2.6 the
> documentation for the csv module explicitly says "If csvfile is a file
> object, it must be opened with the ‘b’ flag on platforms where that
> makes a difference." The documentation for Python 3.1 doesn't have this
> sentence, and if I do that in Python 3.1 I get for all sorts of data,
> even for a list with only one integer literal:
>
> TypeError: must be bytes or buffer, not str
>
> I don't really understand that.

In Python 3, a file opened in 'b' mode is for reading and writing bytes
with no encoding/decoding. I believe csv works with files in text mode,
as it returns and expects strings/text for reading and writing. Perhaps
the csv doc should say that the file must not be opened in 'b' mode. Not sure.

tjr



mal at egenix

Nov 25, 2009, 8:53 AM

Post #7 of 8 (360 views)
Re: csv and mixed lists of unicode and numbers [In reply to]

Sibylle Koczian wrote:
> Hello,
>
> I want to put data from a database into a tab separated text file. This
> looks like a typical application for the csv module, but there is a
> snag: the rows I get from the database module (kinterbasdb in this case)
> contain unicode objects and numbers. And of course the unicode objects
> contain lots of non-ascii characters.
>
> If I try to use csv.writer as is, I get UnicodeEncodeErrors. If I use
> the UnicodeWriter from the module documentation, I get TypeErrors with
> the numbers. (I'm using Python 2.6 - upgrading to 3.1 on this machine
> would cause other complications.)
>
> So do I have to process the rows myself and treat numbers and text
> fields differently? Or what's the best way?

It's best to convert all data to plain strings before passing it
to the csv module.

There are many situations where you may want to use a different
string format than the standard str(obj) format, so this is best
done with a set of format methods - one for each type and format
you need, e.g. one for integers, floats, monetary values, Unicode
text, plain text, etc.

The required formatting also depends on the consumers of the
generated csv or tsv file and their locale.
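
A rough sketch of such a set of format functions, written for Python 3. Everything named here (`FORMATTERS`, `format_float`, `format_row`) is hypothetical, and the decimal comma is just one example of a consumer-specific convention, not something the csv module prescribes.

```python
import csv
import io


def format_float(value):
    # Assumed convention for this example: two decimals, decimal comma.
    return format(value, ".2f").replace(".", ",")


# One formatter per exact type; anything not listed falls back to str().
FORMATTERS = {
    int: str,
    float: format_float,
    str: lambda value: value,
}


def format_row(row):
    """Convert every field to text with the formatter for its type."""
    return [FORMATTERS.get(type(field), str)(field) for field in row]


buffer = io.StringIO()
csv.writer(buffer, delimiter="\t").writerow(format_row(["Ärger", 5, 6.9]))
line = buffer.getvalue()
```

Keeping the formatters in one table makes it easy to swap in a different convention per consumer without touching the code that writes the file.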

--
Marc-Andre Lemburg
eGenix.com



nulla.epistola at web

Nov 25, 2009, 12:31 PM

Post #8 of 8 (358 views)
Re: csv and mixed lists of unicode and numbers [In reply to]

Terry Reedy wrote:

> In Python 3, a file opened in 'b' mode is for reading and writing bytes
> with no encoding/decoding. I believe csv works with files in text mode
> as it returns and expects strings/text for reading and writing. Perhaps
> the csv doc should say must not be opened in 'b' mode. Not sure.
>

I think that might really be better, because for version 2.6 they
explicitly stated 'b' mode was necessary. The results I couldn't
understand, even after reading the documentation for open():

>>> import csv
>>> acsv = open(r"d:\home\sibylle\temp\tmp.csv", "wb")
>>> row = [b"abc", b"def", b"ghi"]
>>> wtr = csv.writer(acsv)
>>> wtr.writerow(row)
Traceback (most recent call last):
  File "<pyshell#22>", line 1, in <module>
    wtr.writerow(row)
TypeError: must be bytes or buffer, not str

Same error message with row = [5].

But I think I understand it now: csv.writer takes mixed lists of
text and numbers - that's exactly why I like to use it - so it has to
convert them before writing. And it converts everything into text - even
bytes - which a file opened in 'b' mode then refuses to accept. Right?
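
That reading can be checked in Python 3 without touching the disk. The sketch below uses in-memory streams from the io module as stand-ins for files opened in "wb" and "w" mode; that substitution is an assumption made to keep the example self-contained.

```python
import csv
import io

# Binary target: the writer produces str, which a bytes stream rejects.
try:
    csv.writer(io.BytesIO()).writerow([5])
    binary_error = None
except TypeError as exc:
    binary_error = exc

# Text target: every non-string field is converted with str(),
# so a bytes object ends up as its printed representation.
text_buffer = io.StringIO()
csv.writer(text_buffer).writerow([b"abc", 5, 6.9])
text_line = text_buffer.getvalue()
```

So bytes fields are not written as raw bytes at all; they would need to be decoded to str before being passed to the writer.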

Thank you, everybody, for explaining.

Sibylle
