[issue5240] time.strptime fails to match data and format with Unicode whitespaces (Py3)

Ezio Melotti report at bugs.python.org
Fri Feb 13 04:16:39 CET 2009


New submission from Ezio Melotti <ezio.melotti at gmail.com>:

On Python 3, time.strptime raises a ValueError for some Unicode
whitespace characters, even when the same character appears at the same
position in both the 'string' and 'format' arguments:
>>> from time import strptime
>>> strptime("Thu\x20Feb", "%a\x20%b") # normal space, works fine
time.struct_time(tm_year=1900, tm_mon=2, tm_mday=1, tm_hour=0, tm_min=0,
tm_sec=0, tm_wday=3, tm_yday=32, tm_isdst=-1)
>>> strptime("Thu\xa0Feb", "%a\xa0%b") # no-break space, fails
ValueError: time data 'Thu\xa0Feb' does not match format '%a\xa0%b'

I wrote a small script to find the other characters for which it fails
(it takes ~5 minutes to run):
>>> l = []
>>> for char in map(chr, range(0xFFFF)):
...   try: x = strptime('Thu{0}Feb'.format(char), '%a{0}%b'.format(char))
...   except ValueError: l.append(char)
...
>>> l
['\x1c', '\x1d', '\x1e', '\x1f', '%', '\x85', '\xa0', '\u1680',
'\u2000', '\u2001', '\u2002', '\u2003', '\u2004', '\u2005', '\u2006',
'\u2007', '\u2008', '\u2009', '\u200a', '\u200b', '\u2028', '\u2029',
'\u202f', '\u205f', '\u3000']
>>> [char.strip() for char in l]
['', '', '', '', '%', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '']
>>> import unicodedata
>>> [unicodedata.category(char) for char in l]
['Cc', 'Cc', 'Cc', 'Cc', 'Po', 'Cc', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs',
'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Cf', 'Zl', 'Zp', 'Zs', 'Zs',
'Zs']
>>> [unicodedata.name(char, '???') for char in l]
['???', '???', '???', '???', 'PERCENT SIGN', '???', 'NO-BREAK SPACE',
'OGHAM SPACE MARK', 'EN QUAD', 'EM QUAD', 'EN SPACE', 'EM SPACE',
'THREE-PER-EM SPACE', 'FOUR-PER-EM SPACE', 'SIX-PER-EM SPACE', 'FIGURE
SPACE', 'PUNCTUATION SPACE', 'THIN SPACE', 'HAIR SPACE', 'ZERO WIDTH
SPACE', 'LINE SEPARATOR', 'PARAGRAPH SEPARATOR', 'NARROW NO-BREAK
SPACE', 'MEDIUM MATHEMATICAL SPACE', 'IDEOGRAPHIC SPACE']
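The scan above can also be written as a standalone function, which makes
it easier to rerun on a narrower range or against another parser. This
is only a sketch: the name find_failing_chars and its parameters are
made up for illustration and are not part of the time module.

```python
import time


def find_failing_chars(parse, code_points):
    """Return the characters for which parse(data, fmt) raises ValueError
    when the same character separates the two fields in both arguments.

    Hypothetical helper, written only to restate the interactive scan.
    """
    failing = []
    for code_point in code_points:
        char = chr(code_point)
        data = 'Thu{0}Feb'.format(char)
        fmt = '%a{0}%b'.format(char)
        try:
            parse(data, fmt)
        except ValueError:
            failing.append(char)
    return failing


# A narrow range keeps the run short; range(0x10000) would cover the
# whole Basic Multilingual Plane. The result depends on the Python
# version in use, so no expected output is shown here.
print(find_failing_chars(time.strptime, range(0x20, 0x100)))
```

Note that '%' is expected to show up regardless of the whitespace
problem, because inserting it turns the format into '%a%%b', where '%%'
is a literal percent sign and the trailing 'b' is no longer a directive.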

Apart from %, all of these characters are removed by the .strip()
method (most are whitespace, a few are control characters), so I guess
that something similar happens in strptime too.
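To make the comparison with .strip() concrete, the sketch below checks
how Python 3 strings and the re module classify a character. It does not
claim to show what strptime does internally; it only illustrates that
str.strip() and the regex class \s (for str patterns) both treat
no-break space as whitespace, while \s with re.ASCII does not. The
helper name whitespace_views is an assumption made for this example.

```python
import re


def whitespace_views(char):
    """Return (stripped_by_str_strip, matches_unicode_s, matches_ascii_s)
    for a single character. Hypothetical helper for illustration only."""
    stripped = char.strip() == ''
    unicode_ws = re.fullmatch(r'\s', char) is not None
    ascii_ws = re.fullmatch(r'\s', char, re.ASCII) is not None
    return stripped, unicode_ws, ascii_ws


for char in ['\x20', '\xa0', '\u2003', '%']:
    print(repr(char), whitespace_views(char))
```

On Python 3 the plain space is whitespace in all three views, the
no-break space and em space only in the first two, and % in none.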

The Unicode categories are:
"Cc" = "Other, Control"
"Zs" = "Separator, Space"
"Cf" = "Other, Format"
"Zl" = "Separator, Line"
"Zp" = "Separator, Paragraph"
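
The category codes above come from unicodedata.category(). As a small
sketch, the failing characters can be grouped by category to see at a
glance which classes are affected. The function name group_by_category
is made up for this example.

```python
import unicodedata
from collections import defaultdict


def group_by_category(chars):
    """Map each Unicode general category code to the characters in
    `chars` that belong to it, preserving input order.
    Hypothetical helper for illustration only."""
    groups = defaultdict(list)
    for char in chars:
        groups[unicodedata.category(char)].append(char)
    return dict(groups)


# A few of the characters from the list in the report.
sample = ['\x85', '%', '\xa0', '\u2003', '\u200b', '\u2028', '\u2029']
for category, members in group_by_category(sample).items():
    print(category, members)
```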

Everything seems to work fine on Python 2.x (tested on 2.4 and 2.6).

----------
components: Library (Lib), Unicode
messages: 81859
nosy: ezio.melotti
severity: normal
status: open
title: time.strptime fails to match data and format with Unicode whitespaces (Py3)
type: behavior
versions: Python 3.0, Python 3.1

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue5240>
_______________________________________

