Opening multiple Files in Different Encoding

subhabangalore at gmail

Jul 10, 2012, 10:46 AM

Post #1 of 6
Opening multiple Files in Different Encoding

Dear Group,

I have a good number of files in a folder, and I want to read all of
them. They are in different formats and different encodings. Using
listdir/glob.glob I am able to get the list of files, but how do I open,
read, or process them when their encodings differ?

I would be grateful if anyone could help me out. I am using Python 3.2 on Windows.

Regards,
Subhabrata Banerjee.
--
http://mail.python.org/mailman/listinfo/python-list


python at mrabarnett

Jul 10, 2012, 12:26 PM

Post #2 of 6
Re: Opening multiple Files in Different Encoding

On 10/07/2012 18:46, Subhabrata wrote:
> Dear Group,
>
> I kept a good number of files in a folder. Now I want to read all of
> them. They are in different formats and different encoding. Using
> listdir/glob.glob I am able to find the list but how to open/read or
> process them for different encodings?
>
> If any one can help me out.I am using Python3.2 on Windows.
>
You could try different encodings. If decoding raises a UnicodeDecodeError,
then it's the wrong encoding. Otherwise, just look at the decoded
result and see whether it "looks" OK.
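
The trial-and-error approach described above might look something like the sketch below. The function name read_with_fallback and the list of candidate encodings are assumptions made for illustration; they are not from the thread. Note that latin-1 maps every possible byte to a character, so it never raises and belongs last in the list. A clean decode only shows the bytes were valid in that encoding, not that the text is right, so the "does it look OK" check still applies.

```python
def read_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    The candidate list is an assumption; order it from strictest to loosest.
    """
    with open(path, "rb") as f:
        raw = f.read()  # read bytes once, then try each decoding in turn
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # wrong encoding, try the next candidate
    raise ValueError("none of the candidate encodings could decode %r" % path)
```

Typical use would be `text, enc = read_with_fallback(name)` inside the loop over the glob.glob results.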

I believe that one method of guessing an encoding automatically is to
compare the frequency distribution of characters with that of sample texts.


steve+comp.lang.python at pearwood

Jul 10, 2012, 11:22 PM

Post #3 of 6
Re: Opening multiple Files in Different Encoding

On Tue, 10 Jul 2012 10:46:08 -0700, Subhabrata wrote:

> Dear Group,
>
> I kept a good number of files in a folder. Now I want to read all of
> them. They are in different formats and different encoding. Using
> listdir/glob.glob I am able to find the list but how to open/read or
> process them for different encodings?

open('first file', encoding='utf-8')
open('second file', encoding='latin1')

How you decide which encoding to use is up to you. Perhaps you can keep a
mapping of {filename: encoding} somewhere.
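
A {filename: encoding} mapping of this kind could be used roughly as follows. This is a minimal sketch: the file names, the ENCODINGS table, the read_known helper, and the UTF-8 default are all hypothetical placeholders.

```python
import os

# Hypothetical table; real file names and encodings would go here.
ENCODINGS = {
    "first.txt": "utf-8",
    "second.txt": "latin-1",
}

def read_known(path, encodings=ENCODINGS, default="utf-8"):
    """Open a text file with the encoding recorded for its base name."""
    enc = encodings.get(os.path.basename(path), default)
    with open(path, encoding=enc) as f:
        return f.read()
```

The table could just as well live in a small config file next to the data; a dict in the script is simply the shortest way to show the idea.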

Or perhaps you can try auto-detecting the encodings. The chardet module
should help you there.
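
For the auto-detection route, a rough sketch using chardet's detect() function is shown below. chardet is a third-party package and has to be installed separately; the guess_encoding helper and the sample size are assumptions for illustration. The result is a statistical guess that comes with a confidence score, so it can be wrong, especially on short files.

```python
try:
    import chardet  # third-party package, not in the standard library
except ImportError:
    chardet = None

def guess_encoding(path, sample_size=65536):
    """Return (encoding, confidence) as guessed by chardet, or None if
    chardet is not installed. The encoding itself may be None when
    chardet cannot make a guess."""
    if chardet is None:
        return None
    with open(path, "rb") as f:
        raw = f.read(sample_size)  # a sample is usually enough for a guess
    result = chardet.detect(raw)
    return result["encoding"], result["confidence"]
```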



--
Steven


subhabangalore at gmail

Jul 11, 2012, 11:15 AM

Post #4 of 6
Re: Opening multiple Files in Different Encoding

On Tuesday, July 10, 2012 11:16:08 PM UTC+5:30, Subhabrata wrote:
> Dear Group,
>
> I kept a good number of files in a folder. Now I want to read all of
> them. They are in different formats and different encoding. Using
> listdir/glob.glob I am able to find the list but how to open/read or
> process them for different encodings?
>
> If any one can help me out.I am using Python3.2 on Windows.
>
> Regards,
> Subhabrata Banerjee.
Dear Group,

No, I am generally familiar with glob.glob and with encodings, as I work a lot with non-ASCII material. But I recently came across an interesting issue: suppose there are .doc, .docx, .txt, .xls, and .pdf files with different encodings.
1) First, I have to determine the file type on the fly.
2) I cannot assign encoding="..." in advance; whatever the encoding is, I have to read the file.

Any ideas? I am still thinking about it.

Thanks in Advance,
Regards,
Subhabrata Banerjee.



oscar.j.benjamin at gmail

Jul 11, 2012, 1:55 PM

Post #5 of 6
Re: Opening multiple Files in Different Encoding

On 11 July 2012 19:15, <subhabangalore [at] gmail> wrote:

> On Tuesday, July 10, 2012 11:16:08 PM UTC+5:30, Subhabrata wrote:
> > Dear Group,
> >
> > I kept a good number of files in a folder. Now I want to read all of
> > them. They are in different formats and different encoding. Using
> > listdir/glob.glob I am able to find the list but how to open/read or
> > process them for different encodings?
> >
> > If any one can help me out.I am using Python3.2 on Windows.
> >
> > Regards,
> > Subhabrata Banerjee.
> Dear Group,
>
> No generally I know the glob.glob or the encodings as I work lot on
> non-ASCII stuff, but I recently found an interesting issue, suppose there
> are .doc,.docx,.txt,.xls,.pdf files with different encodings.


Some of the formats you have listed are not text-based. What do you mean by
the encoding of e.g. a .doc or .xls file?

My understanding is that these are binary files. You won't be able to read
them without the help of a special module (I don't know of one that can).


> 1) First I have to determine on the fly the file type.
> 2) I can not assign encoding="..." whatever be the encoding I have to read
> it.
>

Perhaps you just want to open the file as binary? The following will read
the contents of any file, binary or text, regardless of encoding or anything
else:

with open('spreadsheet.xls', 'rb') as f:
    data = f.read()  # returns bytes rather than text


>
> Any idea. Thinking.
>
> Thanks in Advance,
> Regards,
> Subhabrata Banerjee.


steve+comp.lang.python at pearwood

Jul 11, 2012, 4:22 PM

Post #6 of 6
Re: Opening multiple Files in Different Encoding

On Wed, 11 Jul 2012 11:15:02 -0700, subhabangalore wrote:

> On Tuesday, July 10, 2012 11:16:08 PM UTC+5:30, Subhabrata wrote:
>> Dear Group,
>>
>> I kept a good number of files in a folder. Now I want to read all of
>> them. They are in different formats and different encoding. Using
>> listdir/glob.glob I am able to find the list but how to open/read or
>> process them for different encodings?
>>
>> If any one can help me out.I am using Python3.2 on Windows.
>>
>> Regards,
>> Subhabrata Banerjee.
> Dear Group,
>
> No generally I know the glob.glob or the encodings as I work lot on
> non-ASCII stuff, but I recently found an interesting issue, suppose
> there are .doc,.docx,.txt,.xls,.pdf files with different encodings.

You can have text files with different encodings, but not the others.

.doc .docx .xls and .pdf are all binary files. You don't specify an
encoding when you read them, because they aren't text -- encodings are
for mapping bytes to text, not bytes to binary formats.

In particular, .docx is a zip archive of XML files, so once you have
uncompressed it, the contents are XML, which is normally UTF-8.
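
To illustrate the point about .docx, here is a minimal sketch that pulls the main XML part out of the archive with the standard library. It assumes the usual layout in which the body text lives in the member "word/document.xml"; the docx_root helper name is made up for this example. The XML parser reads the encoding from the XML declaration itself, so no encoding= argument is involved.

```python
import zipfile
import xml.etree.ElementTree as ET

def docx_root(path, member="word/document.xml"):
    """Return the parsed root element of the main XML part of a .docx.

    The member name is the conventional location of the document body.
    """
    with zipfile.ZipFile(path) as z:
        raw = z.read(member)  # bytes, still encoded
    # Parsing from bytes lets the parser honour the declared encoding.
    return ET.fromstring(raw)
```

This only gets you the raw markup; extracting readable text from it is a separate job.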


> 1) First I have to determine on the fly the file type.

Which is a different problem from your first post.

On Windows, you determine the file type using the file extension.

import os
name, ext = os.path.splitext("my_file_name.bmp")

will give you ext = ".bmp".

Then what do you expect to do? You can open the file as a binary blob,
but what do you expect then?

f = open("my_file_name.bmp", "rb")

Now what do you want to do with it?
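
If the extension cannot be trusted, one alternative (not suggested in the thread itself, so treat it as an assumption) is to look at the first few bytes of the file. PDF files start with "%PDF-", zip-based formats such as .docx start with "PK\x03\x04", and the older .doc/.xls formats use the OLE2 container signature. The sniff helper and its labels below are hypothetical, and the sketch only tells you the container, not how to extract text from it.

```python
import os

# (leading bytes, label) pairs; the labels are arbitrary names for this sketch.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),                          # .docx, .xlsx and others
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),   # legacy .doc, .xls
]

def sniff(path):
    """Return (extension, container label) for a file."""
    ext = os.path.splitext(path)[1].lower()
    with open(path, "rb") as f:
        head = f.read(8)  # the longest signature above is 8 bytes
    for magic, kind in SIGNATURES:
        if head.startswith(magic):
            return ext, kind
    return ext, "unknown"  # possibly plain text; try decoding it
```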


> 2) I can not assign
> encoding="..." whatever be the encoding I have to read it.

You can't set the encoding when you open files in binary mode, but that
doesn't matter: binary files don't have an encoding.



--
Steven
