unicode direction control characters
Robin Becker
robin at reportlab.com
Tue Jan 2 10:36:10 EST 2018
On 02/01/2018 15:18, Chris Angelico wrote:
> On Wed, Jan 3, 2018 at 1:30 AM, Robin Becker <robin at reportlab.com> wrote:
>> I'm seeing some strange characters in web responses, e.g.
>>
>> u'\u200e28\u200e/\u200e09\u200e/\u200e1962'
>>
>> for a date of birth. The code \u200e is LEFT-TO-RIGHT MARK according to
>> unicodedata.name. I tried unicodedata.normalize, but it leaves those
>> characters there. Is there any standard way to deal with these?
>>
>> I assume that some browser+settings combination is putting these in, e.g.
>> perhaps the language is normally right-to-left but numbers are not.
>
> Unicode normalization is a different beast altogether. You could
> probably just remove the LTR marks and run with the rest, though, as
> they don't seem to be important in this string.
>
> ChrisA
>
I guess what I'm really wondering is whether the BIDI control characters carry
any semantic meaning. Most numbers seem to be LTR.
If I saw u'\u200f12' (RIGHT-TO-LEFT MARK followed by '12'), it seems to imply
that the characters should be displayed as '21', but I don't know whether the
number is 12 or 21.
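
For reference, a minimal sketch of the "just remove them" approach suggested
above, assuming the marks carry no meaning for this field. The names
BIDI_CONTROLS and strip_bidi_controls are hypothetical helpers, not part of
any library. The sketch also checks that normalization leaves the marks in
place and prints the bidirectional class unicodedata reports for each
character. The string itself is stored in logical order, so '1' precedes '2'
in u'\u200f12' however it is rendered.

```python
import unicodedata

# Explicit bidi formatting characters. This list is an assumption about
# what is worth stripping from a plain date field.
BIDI_CONTROLS = (
    u"\u200e\u200f"                    # LRM, RLM
    u"\u061c"                          # ALM
    u"\u202a\u202b\u202c\u202d\u202e"  # LRE, RLE, PDF, LRO, RLO
    u"\u2066\u2067\u2068\u2069"        # LRI, RLI, FSI, PDI
)

# Mapping each code point to None makes translate() delete it.
_STRIP_TABLE = dict.fromkeys(map(ord, BIDI_CONTROLS))


def strip_bidi_controls(text):
    """Return text with the bidi formatting characters above removed."""
    return text.translate(_STRIP_TABLE)


dob = u"\u200e28\u200e/\u200e09\u200e/\u200e1962"

# Normalization does not touch the marks: they have no decomposition.
print(unicodedata.normalize("NFKC", dob) == dob)  # True

print(strip_bidi_controls(dob))  # 28/09/1962

# Bidirectional classes: the mark is strongly right-to-left ('R'),
# the digits are European numbers ('EN').
for ch in u"\u200f12":
    print(repr(ch), unicodedata.bidirectional(ch))
```

Stripping is only safe where the text is known to be a plain value such as a
date or a number; in free text mixing LTR and RTL scripts the marks can affect
how the string is displayed.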
--
Robin Becker