[Tutor] encoding question

spir denis.spir at gmail.com
Sun Jan 5 11:06:34 CET 2014


On 01/05/2014 03:31 AM, Alex Kleider wrote:
> I've been maintaining both a Python3 and a Python2.7 version.  The latter has
> actually opened my eyes to more complexities. Specifically the need to use
> unicode strings rather than Python2.7's default ascii.

So-called Unicode strings are not the solution to all problems. Take your 'á':
it can be represented either by 1 "precomposed" code (Unicode code point) 0xe1,
or by a sequence of 2 codes (one for the "base" 'a', one for the "combining"
'´'). Imagine you search for "Bogotá": how do you know which representation is
used in the text you search? How do you know at all that there are multiple
representations, and what they are? The routine will work iff, by chance, your
*programming editor* (!) used the same representation as the software used to
create the searched text...
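
To make this concrete, here is a minimal sketch, assuming Python 3 (in 2.7 the
literals would need the u"" prefix). The sample sentence is made up for
illustration; the escapes spell out which representation each string uses, so
the result does not depend on the editor:

```python
# Two representations of the same visible word "Bogotá" (made-up sample data).
precomposed = "Bogot\u00e1"    # 'á' as the single code point U+00E1
decomposed = "Bogota\u0301"    # base 'a' followed by combining acute U+0301

# They look the same on screen but are different sequences of code points.
same_codes = precomposed == decomposed
lengths = (len(precomposed), len(decomposed))

# A naive search fails when needle and text use different representations.
text = "I live in " + decomposed + "."
found_naive = precomposed in text

print(same_codes, lengths, found_naive)   # False (6, 7) False
```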

Usually this is the case, because most text-creation software uses precomposed
codes, when they exist, for composite characters. (But this fact just makes the
issue rarer, harder to be aware of, and thus more difficult to cope with
correctly in code. As far as I know, nearly no software does so.)
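
One possible way to cope, sketched here under assumptions: bring both the
needle and the searched text to the same normalization form before comparing,
using the standard unicodedata module. The helper name normalized_find is
hypothetical, and the choice of NFC (precomposed where possible) is just one
option. Note that the returned index refers to the normalized text, not to the
original:

```python
import unicodedata

def normalized_find(needle, haystack, form="NFC"):
    """Hypothetical helper: search after normalizing both strings alike."""
    needle = unicodedata.normalize(form, needle)
    haystack = unicodedata.normalize(form, haystack)
    # The index is relative to the *normalized* haystack.
    return haystack.find(needle)

text = "I live in Bogota\u0301."   # decomposed 'á' (made-up sample data)
needle = "Bogot\u00e1"             # precomposed 'á'

print(text.find(needle))               # -1: plain search misses it
print(normalized_find(needle, text))   # 10: found once both are NFC
```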

Denis

