
How to know if a file is a text file

 

 



luca at keul

Nov 14, 2009, 8:02 AM

Post #1 of 7
How to know if a file is a text file

Hi all.

I'm looking for a way to load an arbitrary file from the
system and determine whether it is plain text.
The mimetypes module has some nice methods, but, for example, it does
not work for files without an extension.

Any suggestions?

--
-- luca
--
http://mail.python.org/mailman/listinfo/python-list


philip at semanchuk

Nov 14, 2009, 9:51 AM

Post #2 of 7
Re: How to know if a file is a text file

On Nov 14, 2009, at 11:02 AM, Luca Fabbri wrote:

> Hi all.
>
> I'm looking for a way to load an arbitrary file from the
> system and determine whether it is plain text.
> The mimetypes module has some nice methods, but, for example, it does
> not work for files without an extension.

Hi Luca,
You have to define what you mean by "text" file. It might seem
obvious, but it's not.

Do you mean just ASCII text? Or will you accept Unicode too? Unicode
text can be more difficult to detect because you have to guess the
file's encoding (unless it has a BOM; most don't).
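
The BOM case mentioned above can be sketched as follows. This is a minimal illustration assuming Python 3; `sniff_bom` is a hypothetical helper name, not something proposed in this thread, and a missing BOM tells you nothing about the encoding.

```python
import codecs

# The UTF-32 marks are listed before the UTF-16 marks because the
# little-endian UTF-32 BOM begins with the little-endian UTF-16 BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data):
    """Return a codec name if `data` starts with a known BOM, else None."""
    for bom, codec_name in _BOMS:
        if data.startswith(bom):
            return codec_name
    return None
```

Only the first four bytes of the file are needed for this check, so it stays cheap even for very large files.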

And do you need to verify that every single byte in the file is
"text"? What if the file is 1GB, do you still want to examine every
single byte?

If you give us your own (specific!) definition of what "text" means,
or perhaps a description of the problem you're trying to solve, then
maybe we can help you better.

Cheers
Philip


nobody at nowhere

Nov 15, 2009, 4:06 AM

Post #3 of 7
Re: How to know if a file is a text file

On Sat, 14 Nov 2009 17:02:29 +0100, Luca Fabbri wrote:

> I'm looking for a way to load an arbitrary file from the
> system and determine whether it is plain text.
> The mimetypes module has some nice methods, but, for example, it does
> not work for files without an extension.
>
> Any suggestions?

You could use the "file" command. It's normally installed by default on
Unix systems, but you can get a Windows version from:

http://gnuwin32.sourceforge.net/packages/file.htm
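
Calling the command from Python could look roughly like this. It is a sketch that assumes Python 3 and a `file` build that supports the `--brief` and `--mime-type` options (older builds may not); `file_mime_type` is a hypothetical helper name.

```python
import shutil
import subprocess


def file_mime_type(path):
    """Return the MIME type reported by `file`, or None if it cannot be run."""
    if shutil.which("file") is None:
        return None  # the command is not installed or not on PATH
    result = subprocess.run(
        ["file", "--brief", "--mime-type", "--", path],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None
```

A caller can then treat any result starting with `text/` as text, and fall back to another check when the function returns None.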



clp2 at rebertia

Nov 15, 2009, 4:34 AM

Post #4 of 7
Re: How to know if a file is a text file

On Sun, Nov 15, 2009 at 4:06 AM, Nobody <nobody [at] nowhere> wrote:
> On Sat, 14 Nov 2009 17:02:29 +0100, Luca Fabbri wrote:
>
>> I'm looking for a way to load an arbitrary file from the
>> system and determine whether it is plain text.
>> The mimetypes module has some nice methods, but, for example, it does
>> not work for files without an extension.
>>
>> Any suggestions?
>
> You could use the "file" command. It's normally installed by default on
> Unix systems, but you can get a Windows version from:

FWIW, IIRC the heuristic `file` uses to check whether a file is text
or not is whether it contains any null bytes; if it does, it
classifies it as binary (i.e. not text).
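
The null-byte heuristic as described here is easy to sketch in Python 3. This illustrates the heuristic only and is not a claim about what `file` actually does; `contains_nul` and the 8192-byte sample size are illustrative choices.

```python
def contains_nul(path, sample_size=8192):
    """Return True if the first `sample_size` bytes contain a NUL byte."""
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    return b"\x00" in sample
```

Reading a bounded sample keeps the check fast on large files, at the cost of missing NUL bytes that only appear later.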

Cheers,
Chris
--
http://blog.rebertia.com


lucafbb at gmail

Nov 15, 2009, 4:49 AM

Post #5 of 7
Re: How to know if a file is a text file

On Sat, Nov 14, 2009 at 6:51 PM, Philip Semanchuk <philip [at] semanchuk> wrote:
> Hi Luca,
> You have to define what you mean by "text" file. It might seem obvious, but
> it's not.
>
> Do you mean just ASCII text? Or will you accept Unicode too? Unicode text
> can be more difficult to detect because you have to guess the file's
> encoding (unless it has a BOM; most don't).
>
> And do you need to verify that every single byte in the file is "text"? What
> if the file is 1GB, do you still want to examine every single byte?
>
> If you give us your own (specific!) definition of what "text" means, or
> perhaps a description of the problem you're trying to solve, then maybe we
> can help you better.
>

Thanks all.

I was fairly sure that this would not be a simple task. Restricting
the check to ASCII is not enough for me (my native language uses
characters outside ASCII :-)
Checking every single byte could be a good solution...

I can start with the mimetypes module and, if the file has no
extension, check the bytes one by one, as the "file" command does.
Better: I can use the "file" command if it is available.
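
The plan above might be sketched like this in Python 3. The names `is_probably_text`, `sample_size` and `encoding` are hypothetical, and treating "decodes as UTF-8 and has no NUL bytes" as "text" is an assumption, not a definition agreed on in this thread.

```python
import codecs
import mimetypes


def is_probably_text(path, sample_size=8192, encoding="utf-8"):
    """Guess from the extension first, then fall back to inspecting bytes."""
    mime, _ = mimetypes.guess_type(path)
    if mime is not None:
        return mime.startswith("text/")

    # No usable extension: look at the start of the file instead.
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    if b"\x00" in sample:
        return False

    # An incremental decoder with final=False tolerates a multi-byte
    # character that was cut in half at the end of the sample.
    decoder = codecs.getincrementaldecoder(encoding)()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True
```

Note that the extension check never opens the file, so a misnamed file is classified by its name alone.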

Again: thanks all!

--
-- luca


nobody at nowhere

Nov 15, 2009, 10:50 AM

Post #6 of 7
Re: How to know if a file is a text file

On Sun, 15 Nov 2009 04:34:10 -0800, Chris Rebert wrote:

>>> I'm looking for a way to load an arbitrary file from the
>>> system and determine whether it is plain text.
>>> The mimetypes module has some nice methods, but, for example, it does
>>> not work for files without an extension.
>>>
>>> Any suggestions?
>>
>> You could use the "file" command. It's normally installed by default on
>> Unix systems, but you can get a Windows version from:
>
> FWIW, IIRC the heuristic `file` uses to check whether a file is text
> or not is whether it contains any null bytes; if it does, it
> classifies it as binary (i.e. not text).

"file" provides more granularity than that, recognising many specific
formats, both text and binary.

First, it uses "magic number" checks, checking for known signature bytes
(e.g. "#!" or "JFIF") at the beginning of the file. If those checks fail
it checks for common text encodings. If those also fail, it reports "data".

Also, UTF-16-encoded text is recognised as text, even though it may
contain a high proportion of NUL bytes.
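
The staged approach described above (signatures first, then text encodings, then "data") can be imitated in a few lines of Python 3. This is only a toy sketch: the signature table, the labels and the name `classify` are invented for illustration, and the real `file` database is far larger.

```python
import codecs

# A tiny, illustrative signature table.
_SIGNATURES = [
    (b"#!", "script text"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
]


def classify(data):
    """Classify bytes: known signature, then text encoding, then 'data'."""
    for magic, label in _SIGNATURES:
        if data.startswith(magic):
            return label

    # UTF-16 text legitimately contains NUL bytes, so test it before
    # applying any NUL-based rejection.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        try:
            data.decode("utf-16")
            return "utf-16 text"
        except UnicodeDecodeError:
            pass

    if b"\x00" not in data:
        for encoding in ("ascii", "utf-8"):
            try:
                data.decode(encoding)
            except UnicodeDecodeError:
                continue
            return encoding + " text"
    return "data"
```

For example, `"ciao".encode("utf-16")` contains NUL bytes yet is still classified as text here, which is the case a plain null-byte test gets wrong.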




nobody at nowhere

Nov 15, 2009, 10:56 AM

Post #7 of 7
Re: How to know if a file is a text file

On Sun, 15 Nov 2009 13:49:54 +0100, Luca wrote:

> I was fairly sure that this would not be a simple task. Restricting
> the check to ASCII is not enough for me (my native language uses
> characters outside ASCII :-)
> Checking every single byte could be a good solution...
>
> I can start with the mimetypes module and, if the file has no
> extension, check the bytes one by one, as the "file" command does.
> Better: I can use the "file" command if it is available.

Another possible solution:

Universal Encoding Detector
Character encoding auto-detection in Python 2 and 3

http://chardet.feedparser.org/
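
A minimal usage sketch, assuming Python 3 and that the third-party `chardet` package is installed; `guess_encoding` is a hypothetical wrapper name. The detector returns a guess with a confidence value, not a guarantee.

```python
try:
    import chardet  # third-party package, not part of the standard library
except ImportError:
    chardet = None


def guess_encoding(data):
    """Return chardet's best-guess encoding for `data`, or None."""
    if chardet is None:
        return None  # detector not available
    return chardet.detect(data)["encoding"]
```

If the guess is None, or decoding with the guessed encoding fails, the file can be treated as binary for practical purposes.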

