python3, regular expression and bytes text
Eko palypse
ekopalypse at gmail.com
Sat Oct 12 14:08:34 EDT 2019
What needs to be set in order to use a re search within
utf8 encoded bytes?
My test, run on a Windows PC with a cp1252 setup, looks like this:
import re
import locale
cp1252 = 'Ärger im Paradies'.encode('cp1252')
utf8 = 'Ärger im Paradies'.encode('utf-8')
print('cp1252:', cp1252)
print('utf8 :', utf8)
print('-'*80)
print("search for 'Ärger'.encode('cp1252') in cp1252 encoded text")
for m in re.finditer('Ärger'.encode('cp1252'), cp1252):
    print(m)
print('-'*80)
print("search for 'Ärger'.encode('') in utf8 encoded text")
for m in re.finditer('Ärger'.encode(), utf8):
    print(m)
print('-'*80)
print("search for '\\w+'.encode('cp1252') in cp1252 encoded text")
for m in re.finditer('\\w+'.encode('cp1252'), cp1252):
    print(m)
print('-'*80)
print("search for '\\w+'.encode('') in utf8 encoded text")
for m in re.finditer('\\w+'.encode(), utf8):
    print(m)
locale.setlocale(locale.LC_ALL, '')
print('-'*80)
print("search for '\\w+'.encode('cp1252') using re.LOCALE in cp1252 encoded text")
for m in re.finditer('\\w+'.encode('cp1252'), cp1252, re.LOCALE):
    print(m)
print('-'*80)
print("search for '\\w+'.encode('') using ??? in utf8 encoded text")
for m in re.finditer('\\w+'.encode(), utf8):
    print(m)
If you run this, you will get something like:
cp1252: b'\xc4rger im Paradies'
utf8 : b'\xc3\x84rger im Paradies'
--------------------------------------------------------------------------------
search for 'Ärger'.encode('cp1252') in cp1252 encoded text
<re.Match object; span=(0, 5), match=b'\xc4rger'>
--------------------------------------------------------------------------------
search for 'Ärger'.encode('') in utf8 encoded text
<re.Match object; span=(0, 6), match=b'\xc3\x84rger'>
--------------------------------------------------------------------------------
These two are OK, BUT the result for \w+ shows a difference:
search for '\w+'.encode('cp1252') in cp1252 encoded text
<re.Match object; span=(1, 5), match=b'rger'>
<re.Match object; span=(6, 8), match=b'im'>
<re.Match object; span=(9, 17), match=b'Paradies'>
--------------------------------------------------------------------------------
search for '\w+'.encode('') in utf8 encoded text
<re.Match object; span=(2, 6), match=b'rger'>
<re.Match object; span=(7, 9), match=b'im'>
<re.Match object; span=(10, 18), match=b'Paradies'>
--------------------------------------------------------------------------------
It doesn't find the Ä, which is expected according to the documentation.
The documentation hints at using the locale, so let's do that. The results are:
search for '\w+'.encode('cp1252') using re.LOCALE in cp1252 encoded text
<re.Match object; span=(0, 5), match=b'\xc4rger'>
<re.Match object; span=(6, 8), match=b'im'>
<re.Match object; span=(9, 17), match=b'Paradies'>
--------------------------------------------------------------------------------
This works for cp1252 BUT does not work for utf8:
search for '\w+'.encode('') using ??? in utf8 encoded text
<re.Match object; span=(2, 6), match=b'rger'>
<re.Match object; span=(7, 9), match=b'im'>
<re.Match object; span=(10, 18), match=b'Paradies'>
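As a side check, here is a minimal sketch (not part of the test above) of why the plain bytes pattern skips the Ä. Without re.LOCALE, \w in a bytes pattern only matches ASCII letters, digits and the underscore, and utf8 encodes Ä as two bytes that both lie outside that range. The variable names are only illustrative.

```python
import re

# Collect every single byte value that \w accepts in a bytes pattern
# (no flags given, so ASCII-only matching applies).
word_bytes = [b for b in range(256) if re.fullmatch(rb'\w', bytes([b]))]

# utf8 encodes 'Ä' as two bytes, 0xC3 and 0x84.
a_umlaut = 'Ä'.encode('utf-8')

print(len(word_bytes), max(word_bytes))
print(a_umlaut, [b in word_bytes for b in a_umlaut])
```

Since re.LOCALE is documented as working only with 8-bit locales, it presumably cannot cover a two-byte sequence like this one.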
So how can I make it work with utf8 encoded text?
Note: decoding it to a string isn't preferred, as this would mean
allocating the contents of the bytes buffer a 2nd time, and a
buffer might be several hundred MB or even several GB in size.
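One possible direction, sketched here under assumptions and not a confirmed answer: spell out the utf8 byte ranges in the pattern itself, so that the search stays on the bytes buffer. The name UTF8_WORD is made up for this sketch. The pattern simply treats every non-ASCII code point as a word character, which is broader than what \w means for str patterns (it would also accept a character such as '€').

```python
import re

# Hypothetical pattern: ASCII word characters, or a utf8 lead byte
# followed by the matching number of continuation bytes.
UTF8_WORD = re.compile(
    rb'(?:[0-9A-Za-z_]'
    rb'|[\xc2-\xdf][\x80-\xbf]'
    rb'|[\xe0-\xef][\x80-\xbf]{2}'
    rb'|[\xf0-\xf4][\x80-\xbf]{3})+'
)

utf8 = 'Ärger im Paradies'.encode('utf-8')
for m in UTF8_WORD.finditer(utf8):
    # Spans are byte offsets into the buffer, not character offsets.
    print(m.span(), m.group())
```

Because the pattern is a bytes pattern, nothing is decoded. It should also be usable on other bytes-like objects such as a bytearray or an mmap, though that is not tested here.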
Thank you
Eren