
Mailing List Archive: Python: Python

convert unicode characters to visibly similar ascii characters

 

 



peter.bulychev at gmail

Jul 1, 2008, 11:31 AM

Post #1 of 14 (764 views)
Permalink
convert unicode characters to visibly similar ascii characters

Hello.

I want to convert Unicode characters into ASCII ones.
The method ".encode('ASCII')" can convert only those Unicode characters
that fall in the ASCII range (code points 0..127).

But there are still lots of characters beyond this range that can be
manually converted to visibly similar ASCII characters. For instance,
Unicode has several quotation marks, which can all be converted into the
ASCII quotation mark.

Can this conversion be performed automatically? After googling, I've
only found that there is a Unicode database, which stores the human-readable
names of all Unicode characters (
ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt), and that there is
a Python interface to this database (
http://docs.python.org/lib/module-unicodedata.html). Using this database I
can do something like `if notation.find('QUOTATION')!=-1:\n\treturn "'"`. I
believe there is a more elegant way. Am I right?
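
For reference, a minimal sketch of the behaviour described above. It assumes Python 3, where `str` is already Unicode (the thread itself was written for Python 2), and the sample string is made up for illustration.

```python
# Hypothetical sample text with curly quotes and an en dash.
text = "\u201cquoted\u201d \u2013 text"

# Strict encoding fails on the first character outside the ASCII range.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print("cannot encode:", repr(exc.object[exc.start:exc.end]))

# The built-in error handlers do not pick lookalikes: they drop or mask.
ignored = text.encode("ascii", errors="ignore")    # non-ASCII characters removed
replaced = text.encode("ascii", errors="replace")  # each one becomes "?"
print(ignored)
print(replaced)
```

Neither handler produces a visibly similar character, which is why the question of a lookalike mapping comes up at all.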

Thanks.

--
Best regards,
Peter Bulychev.


gandalf at shopzeus

Jul 1, 2008, 11:47 AM

Post #2 of 14 (758 views)
Permalink
Re: convert unicode characters to visibly similar ascii characters [In reply to]

Peter Bulychev wrote:
> Hello.
>
> I want to convert unicode character into ascii one.
> The method ".encode('ASCII') " can convert only those unicode
> characters, which fit into 0..128 range.
>
> But there are still lots of characters beyond this range, which can be
> manually converted to some visibly similar ascii characters. For
> instance, there are several quotation marks in unicode, which can be
> converted into ascii quotation mark.
Please be more specific. There is no general solution. Unicode can
handle Latin, Cyrillic (Russian), Chinese, Japanese and Arabic characters
in the same string. There are thousands of possible non-ASCII characters,
and many of them are not similar to any ASCII character.

If you only want this to work for a subset, please define that subset.

Laszlo

--
http://mail.python.org/mailman/listinfo/python-list


peter.bulychev at gmail

Jul 1, 2008, 11:54 AM

Post #3 of 14 (762 views)
Permalink
Re: convert unicode characters to visibly similar ascii characters [In reply to]

Thank you for your answer.

> If you only want this to work for a subset, please define that subset.

Actually, I want to convert only punctuation (dots, commas, hyphens and so
on).

--
Best regards,
Peter Bulychev.


gandalf at shopzeus

Jul 1, 2008, 12:16 PM

Post #4 of 14 (763 views)
Permalink
Re: convert unicode characters to visibly similar ascii characters [In reply to]

Peter Bulychev wrote:
> Thank you for you answer.
>
> If you only want this to work for a subset, please define that subset.
>
> Actually, I want to convert only punctuations (dots, commas, hyphens
> and so on).
Then make your translation table manually and apply this method:

unicode.translate

Finally

print s.encode('ascii')

If you get a UnicodeEncodeError, it means the original string contained
other non-ASCII characters that were not translated.
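
A minimal sketch of that two-step approach, assuming Python 3 (where the method is `str.translate` rather than `unicode.translate`). The table name, the helper `to_ascii`, and the choice of entries are illustrative, not a complete or standard list.

```python
# Hand-made table: code point -> ASCII replacement (illustrative entries only).
PUNCTUATION_TABLE = {
    0x2018: "'",    # LEFT SINGLE QUOTATION MARK
    0x2019: "'",    # RIGHT SINGLE QUOTATION MARK
    0x201C: '"',    # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',    # RIGHT DOUBLE QUOTATION MARK
    0x2010: "-",    # HYPHEN
    0x2013: "-",    # EN DASH
    0x2026: "...",  # HORIZONTAL ELLIPSIS
}


def to_ascii(text):
    """Translate known punctuation, then encode strictly as ASCII.

    Raises UnicodeEncodeError if an untranslated non-ASCII character remains.
    """
    return text.translate(PUNCTUATION_TABLE).encode("ascii")


print(to_ascii("\u201cwell\u2010known\u201d\u2026"))
```

Keeping the final encode strict means anything missing from the table shows up as an error instead of slipping through silently.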

Best,

Laszlo



tjreedy at udel

Jul 1, 2008, 12:46 PM

Post #5 of 14 (748 views)
Permalink
Re: convert unicode characters to visibly similar ascii characters [In reply to]

Peter Bulychev wrote:
> Hello.
>
> I want to convert unicode character into ascii one.
> The method ".encode('ASCII') " can convert only those unicode
> characters, which fit into 0..128 range.
>
> But there are still lots of characters beyond this range, which can be
> manually converted to some visibly similar ascii characters. For
> instance, there are several quotation marks in unicode, which can be
> converted into ascii quotation mark.
>
> Can this conversion be performed in automatic manner? After googling
> I've only found that there exists Unicode database, which stores
> human-readable information on notation of all unicode characters
> (ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt). And there also
> exists the Python adapter for this database
> (http://docs.python.org/lib/module-unicodedata.html). Using this
> database I can do something like `if
> notation.find('QUOTATION')!=-1:\n\treturn "'"`. I believe there is more
> elegant way. Am I right?

I believe you will have to make up your own translation dictionary for
the translations *you* want. You should then be able to use that with
the .translate() method.

tjr



jim.hefferon at gmail

Jul 1, 2008, 4:55 PM

Post #6 of 14 (752 views)
Permalink
Re: convert unicode characters to visibly similar ascii characters [In reply to]

Peter Bulychev wrote:
> I want to convert unicode character into ascii one.
You have to make some arbitrary choices of what to translate. Based
on some materials on effbot's site, and a recipe, I made
ftp://alan.smcvt.edu/hefferon/unicode2ascii.py
which has at least some of what you are looking for.
$ grep HYPHEN unicode2ascii.py
u'\N{SOFT HYPHEN}':u'-',
u'\N{HYPHEN}':u'-',
u'\N{NON-BREAKING HYPHEN}':u'-',
u'\N{SOFT HYPHEN}': '-',
No doubt I have some terrible gaffes and some things missing.
Corrections appreciated.

Jim



sjmachin at lexicon

Jul 1, 2008, 5:29 PM

Post #8 of 14 (747 views)
Permalink
Re: convert unicode characters to visibly similar ascii characters [In reply to]

On Jul 2, 9:55 am, Jim <jim.heffe...@gmail.com> wrote:
> Peter Bulychev wrote:
> > I want to convert unicode character into ascii one.
>
> You have to make some arbitrary choices of what to translate. Based
> on some materials on effbot's site, and a recipe, I made
> ftp://alan.smcvt.edu/hefferon/unicode2ascii.py
> which has at least some of what you are looking for.
> $ grep HYPHEN unicode2ascii.py
> u'\N{SOFT HYPHEN}':u'-',
> u'\N{HYPHEN}':u'-',
> u'\N{NON-BREAKING HYPHEN}':u'-',
> u'\N{SOFT HYPHEN}': '-',
> No doubt I have some terrible gaffes and some things missing.
> Corrections appreciated.

Comments on the above grep output:
1. You have SOFT HYPHEN twice, mapping it to u'-' and '-'
2. The idea of a soft hyphen is as a hint to a hyphenator about where
to insert a hyphen if one is necessary and the hyphenator is suspected
of acting cluelessly without the hint. IMHO, asciification should
substitute u'', not u'-'.
3. Read PEP 8. s/:/: /
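
The difference in point 2 can be sketched as follows, assuming Python 3 and `str.translate`; the table names and the sample word are made up for illustration.

```python
SOFT_HYPHEN = "\N{SOFT HYPHEN}"

# Mapping a code point to None makes str.translate() delete it.
drop_soft_hyphen = {ord(SOFT_HYPHEN): None}
keep_as_hyphen = {ord(SOFT_HYPHEN): "-"}

# A word carrying two invisible hyphenation hints.
word = "hy" + SOFT_HYPHEN + "phen" + SOFT_HYPHEN + "ation"

dropped = word.translate(drop_soft_hyphen)
hyphenated = word.translate(keep_as_hyphen)
print(dropped)
print(hyphenated)
```

On point 1, note that a key repeated in a dict literal is not an error: the last entry silently wins, so a duplicated entry can hide a conflicting replacement.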

Cheers,
John


jim.hefferon at gmail

Jul 1, 2008, 5:42 PM

Post #9 of 14 (750 views)
Permalink
Re: convert unicode characters to visibly similar ascii characters [In reply to]

On Jul 1, 8:29 pm, John Machin <sjmac...@lexicon.net> wrote:
> On Jul 2, 9:55 am, Jim <jim.heffe...@gmail.com> wrote:
>
> Comments on the above grep output:
> 1. You have SOFT HYPHEN twice, mapping it to u'-' and '-'
Hmph. I'll correct that. Thanks.
> 2. The idea of a soft hyphen is as a hint to a hyphenator about where
> to insert a hyphen if one is necessary and the hyphenator is suspected
> of acting cluelessly without the hint. IMHO, asciification should
> substitute u'', not u'-'.
Thanks also here. I'll think about it.
> 3. Read PEP 8. s/:/: /
I don't like the spacing in 8, personally.

Thanks,
Jim



bignose+hates-spam at benfinney

Jul 1, 2008, 5:47 PM

Post #11 of 14 (750 views)
Permalink
Re: convert unicode characters to visibly similar ascii characters [In reply to]

Jim <jim.hefferon [at] gmail> writes:

> I don't like the spacing in [PEP 8], personally.

Nevertheless, your Python code will be much less effort to read by
others (and yourself in future) if it is written in conformance with
PEP 8.

Writing all your Python code to conform with that standard is the
simplest step you can take to ensure that your code won't cause other
Python programmers undue reading effort.

--
\ “There's no excuse to be bored. Sad, yes. Angry, yes. |
`\ Depressed, yes. Crazy, yes. But there's no excuse for boredom, |
_o__) ever.” —Viggo Mortensen |
Ben Finney


mal at egenix

Jul 2, 2008, 1:39 AM

Post #12 of 14 (739 views)
Permalink
Re: convert unicode characters to visibly similar ascii characters [In reply to]

On 2008-07-01 20:31, Peter Bulychev wrote:
> Hello.
>
> I want to convert unicode character into ascii one.
> The method ".encode('ASCII') " can convert only those unicode characters,
> which fit into 0..128 range.
>
> But there are still lots of characters beyond this range, which can be
> manually converted to some visibly similar ascii characters. For instance,
> there are several quotation marks in unicode, which can be converted into
> ascii quotation mark.
>
> Can this conversion be performed in automatic manner? After googling I've
> only found that there exists Unicode database, which stores human-readable
> information on notation of all unicode characters (
> ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt). And there also exists
> the Python adapter for this database (
> http://docs.python.org/lib/module-unicodedata.html). Using this database I
> can do something like `if notation.find('QUOTATION')!=-1:\n\treturn "'"`. I
> believe there is more elegant way. Am I right?

You could write a codec which translates Unicode into ASCII
lookalike characters, but AFAIK there is no standard for doing
this.

I guess the best choice is to use the Unicode code point names
as a basis. These can be accessed via unicodedata.name(). You can
then create a mapping which can be processed by the character
map codec.
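
A rough sketch of the name-based idea, assuming Python 3. The helpers `ascii_lookalike` and `build_table` and the keyword rules are hypothetical choices for illustration, and the result is applied with `str.translate` rather than through the character map codec.

```python
import sys
import unicodedata


def ascii_lookalike(name):
    """Pick an ASCII replacement from a character name, or None if no rule fits."""
    if "QUOTATION MARK" in name:
        return '"' if "DOUBLE" in name else "'"
    if "HYPHEN" in name or "DASH" in name:
        return "-"
    return None


def build_table():
    """Scan non-ASCII punctuation and map it by keywords in its Unicode name."""
    table = {}
    for code in range(0x80, sys.maxunicode + 1):
        char = chr(code)
        # General categories starting with "P" are punctuation; this skips
        # letters, symbols, and format characters such as SOFT HYPHEN.
        if not unicodedata.category(char).startswith("P"):
            continue
        replacement = ascii_lookalike(unicodedata.name(char, ""))
        if replacement is not None:
            table[code] = replacement
    return table


table = build_table()
print("\u201cwell\u2013known\u201d".translate(table))
```

Keyword matching on names is crude, so the generated table is best treated as a starting point to review by hand.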

--
Marc-Andre Lemburg
eGenix.com



peter.bulychev at gmail

Jul 2, 2008, 7:27 AM

Post #13 of 14 (738 views)
Permalink
Re: convert unicode characters to visibly similar ascii characters [In reply to]

Thank you.

That is exactly what I was looking for.

2008/7/2 Jim <jim.hefferon [at] gmail>:

> Peter Bulychev wrote:
> > I want to convert unicode character into ascii one.
> You have to make some arbitrary choices of what to translate. Based
> on some materials on effbot's site, and a recipe, I made
> ftp://alan.smcvt.edu/hefferon/unicode2ascii.py
> which has at least some of what you are looking for.
> $ grep HYPHEN unicode2ascii.py
> u'\N{SOFT HYPHEN}':u'-',
> u'\N{HYPHEN}':u'-',
> u'\N{NON-BREAKING HYPHEN}':u'-',
> u'\N{SOFT HYPHEN}': '-',
> No doubt I have some terrible gaffes and some things missing.
> Corrections appreciated.
>
> Jim
> --
> http://mail.python.org/mailman/listinfo/python-list
>



--
Best regards,
Peter Bulychev.


jim.hefferon at gmail

Jul 2, 2008, 9:42 AM

Post #14 of 14 (735 views)
Permalink
Re: convert unicode characters to visibly similar ascii characters [In reply to]

On Jul 1, 8:42 pm, Jim <jim.heffe...@gmail.com> wrote:
> On Jul 1, 8:29 pm, John Machin <sjmac...@lexicon.net> wrote:
> > Comments on the above grep output:
> > 1. You have SOFT HYPHEN twice, mapping it to u'-' and '-'
>
> Hmph. I'll correct that. Thanks.
Well, maybe not. I forgot that I got the by-hand conversions from
three different sources and that's why that character appears in two
different places. (I thought that listing all cases for each source
was less confusing. Arguable, for sure.)

> > 2. The idea of a soft hyphen is as a hint to a hyphenator about where
> > to insert a hyphen if one is necessary and the hyphenator is suspected
> > of acting cluelessly without the hint. IMHO, asciification should
> > substitute u'', not u'-'.
>
> Thanks also here. I'll think about it.
Googling "soft hyphen" showed me that the question is not perfectly
clear-- some people seem to have very elaborate opinions on the
topic-- but I've gone with your suggestion. Thank you.

Again, I'd appreciate additional corrections. Not only do I speak only
ASCII :-( but I admit to entering the data while watching a basketball
game, so no doubt there are some real blunders.

Thanks,
Jim
