Unicode chr(150) en dash
hdante at gmail.com
Fri Apr 18 05:57:21 CEST 2008
On Apr 17, 12:10 pm, marexpo... at googlemail.com wrote:
> Thank you Martin and John, for your excellent explanations.
> I think I understand the basic principles of Unicode; what confuses me is how different applications make use of it.
> For example, I got that EN DASH out of a web page which states <?xml version="1.0" encoding="ISO-8859-1"?> at the beginning. That's why I went for that encoding. But if the browser can
There's a trick here. Blame lax web standards and companies that
don't like standards.
There's no EN DASH in ISO-8859-1. The first 256 code points in Unicode
are the same as ISO-8859-1, but the EN DASH code point is U+2013
(decimal 8211), well outside that range.
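A quick way to check both claims from the interpreter. This is only a sketch, and it assumes Python 3 (where str is Unicode) rather than the 2.x that was current when this was posted; the variable names are made up for the example.

```python
import unicodedata

# Decode all 256 possible byte values with the ISO-8859-1 codec and check
# that each one lands on the Unicode code point with the same number.
all_bytes = bytes(range(256))
latin1_text = all_bytes.decode("iso-8859-1")
same_numbers = all(ord(ch) == i for i, ch in enumerate(latin1_text))

# EN DASH lives at U+2013, so it cannot be encoded as ISO-8859-1.
en_dash = "\u2013"
en_dash_name = unicodedata.name(en_dash)
en_dash_code = ord(en_dash)

try:
    en_dash.encode("iso-8859-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

print(same_numbers, en_dash_name, hex(en_dash_code), fits_in_latin1)
```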
The character code in question (which is present in the page), 150,
doesn't exist in ISO-8859-1. See
http://en.wikipedia.org/wiki/ISO/IEC_8859-1 (the entry for 150 is
The character 150 exists in Windows-1252, however, which is a
non-standard clone of ISO-8859-1. There, byte 150 (0x96) is EN DASH.
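Here is a small sketch of the difference, again assuming Python 3. Note that Python's own iso-8859-1 codec does not raise an error for byte 150; it maps it to the invisible control code point U+0096, which is why nothing useful shows up. The function decode_like_browser is a hypothetical helper written for this example, not something from the standard library, and it only imitates the browser trick of reading pages declared as ISO-8859-1 as if they were Windows-1252.

```python
import unicodedata

raw = bytes([150])  # the single byte 0x96 found in the page

as_cp1252 = raw.decode("cp1252")      # EN DASH, U+2013
as_latin1 = raw.decode("iso-8859-1")  # U+0096, a C1 control character
latin1_category = unicodedata.category(as_latin1)  # "Cc" means control


def decode_like_browser(data, declared_encoding):
    """Hypothetical helper: treat a declared ISO-8859-1 as Windows-1252."""
    if declared_encoding.lower() in ("iso-8859-1", "latin-1", "latin1"):
        declared_encoding = "cp1252"
    # cp1252 leaves a few byte values undefined, so replace instead of failing.
    return data.decode(declared_encoding, errors="replace")


page_text = decode_like_browser(b"2007 \x96 2008", "ISO-8859-1")
print(repr(as_cp1252), repr(as_latin1), latin1_category, repr(page_text))
```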
Who is wrong?
- The guy who wrote the web site
- The browser that does the trick
- Everybody for using a non-standard encoding
- Everybody for using an outdated 8-bit encoding.
Don't use old 8-bit encodings. Use UTF-8.
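For comparison, a minimal sketch (Python 3 assumed, sample text invented for the example) showing that UTF-8 handles the EN DASH without any guessing: it becomes a three-byte sequence and decodes back to the same string.

```python
text = "2007\u20132008"  # sample string containing an EN DASH

utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

# The EN DASH occupies three bytes: 0xE2 0x80 0x93.
dash_bytes = "\u2013".encode("utf-8")

print(utf8_bytes, dash_bytes, round_trip == text)
```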
> I might need to go for Python's htmllib to avoid this, not sure. But if I don't, if I only want to just copy and paste some web pages' text contents into a tkinter Text widget, what should I do to successfully make every single character go all the way from the widget and out of tkinter into a Python string variable? How did my browser know it should render an EN DASH instead of a circumflexed lowercase u?
> This is the webpage in case you are interested, 4th line of first paragraph, there is the EN DASH: http://www.pagina12.com.ar/diario/elmundo/subnotas/102453-32303-2008-...
> Thanks a lot.