reg exp and octal notation

Peter Otten __peter__ at web.de
Fri Mar 5 09:54:44 EST 2004


Lucas Branca wrote:

> Could someone explain me the difference between the results below?
> 
> ## $cat octals.txt
> ## \006\034abc
> 
> import re
> 
> a= "\006\034abc"
> preg= re.compile(r'([\0-\377]*)')
> res = preg.search(a)
> print res.groups()
> 
> loader = open('./octals.txt', 'r')
> b = loader.readline()
> preg= re.compile(r'([\0-\377]*)')
> res = preg.search(b)
> print res.groups()
> 
> 
> RESULTS
> 
> ('\x06\x1cabc',)
> 
> ('\\006\\034abc\n',)

a and b are two entirely different strings. Whatever similarity there
appears to be is an artifact of how Python treats escape sequences: they
are interpreted only in string literals in source code, not in data read
from an arbitrary file.

Your literal string:

>>> s = "\006\034\n"
>>> s
'\x06\x1c\n'

What you read from the text file:

>>> t = "\\006\\034\n"
>>> t
'\\006\\034\n'

Maybe it helps to see what's really inside these two strings, so let's
have a look at the ASCII codes:

>>> map(ord, s)
[6, 28, 10]
>>> map(ord, t)
[92, 48, 48, 54, 92, 48, 51, 52, 10]
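The session above is Python 2, where map() returns a list. As a minimal
sketch of the same inspection that should also run on Python 3 (the names
literal and from_file are made up for this illustration, and the raw string
stands in for what readline() would return):

```python
# The string as written in source code: the escapes are interpreted
# when the literal is compiled, so each one becomes a single character.
literal = "\006\034\n"

# The string as it arrives from the file: a raw string keeps every
# backslash and digit as an ordinary character.
from_file = r"\006\034" + "\n"

literal_codes = [ord(ch) for ch in literal]
from_file_codes = [ord(ch) for ch in from_file]

print(literal_codes)    # [6, 28, 10]
print(from_file_codes)  # [92, 48, 48, 54, 92, 48, 51, 52, 10]
print(len(literal), len(from_file))  # 3 9
```

Code 92 is the backslash and 48 to 57 are the digits 0 to 9, so the second
string really does contain the escape sequences spelled out as text.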

Another example: in source code you can write the newline as

>>> a = """
... """
>>> b = "\n"
>>> c = "\x0a"
>>> d = "\012" 
>>> a,b,c,d
('\n', '\n', '\n', '\n')

But if read from a file, \n, \x0a and \012 would just be sequences of two
or four ordinary characters, each starting with a literal backslash.
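If the goal were to turn such escape text from a file into the characters
it names, that conversion has to be requested explicitly. A rough sketch
for Python 3, assuming the file holds only ASCII text with Python-style
escapes (the variable names are hypothetical, and whether this is what the
original poster needs is a guess):

```python
# Text as it would arrive from the file: literal backslashes and
# digits, followed by a real newline.
from_file = r"\006\034abc" + "\n"

# Assumption: the data is pure ASCII. Encoding to bytes and decoding
# with the "unicode_escape" codec interprets the escape sequences.
decoded = from_file.encode("ascii").decode("unicode_escape")

print(repr(from_file))  # '\\006\\034abc\n'
print(repr(decoded))    # '\x06\x1cabc\n'
print(decoded == "\006\034abc\n")  # True
```

After decoding, the string compares equal to the literal written in source
code, which is the similarity the two printed results only appeared to have.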

Only when you have understood the above should you return to regular
expressions. Your regexp always matches the whole string, i.e. it is
redundant (and probably not what you want, but you would need to explain
that in another post).

[\0-\377] is just a fancy way of writing "match any character" (newline
included, since the range covers every character code from 0 to 255)
* means "repeat the preceding item as often as possible" (including zero
times)
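To see that the pattern simply swallows whatever it is given, here is a
small sketch that applies it to both kinds of string. It assumes every
character has a code of 255 or less (in Python 3 a str may contain higher
codes, which this class would not match); the names are made up for the
example:

```python
import re

# The pattern from the question: a class spanning character codes 0..255,
# repeated zero or more times and captured as group 1.
any_char = re.compile(r'([\0-\377]*)')

literal = "\006\034abc"             # escapes interpreted in source code
from_file = r"\006\034abc" + "\n"   # what readline() would hand back

literal_match = any_char.search(literal)
file_match = any_char.search(from_file)

# In both cases the captured group is the complete input string.
print(literal_match.group(1) == literal)   # True
print(file_match.group(1) == from_file)    # True

# For comparison: '.' with re.DOTALL also matches newline, so it
# captures the whole string here as well.
dot_all = re.compile(r'(.*)', re.DOTALL)
dot_match = dot_all.search(from_file)
print(dot_match.group(1) == from_file)     # True
```

The regular expression therefore plays no part in the difference between
the two results; they differ only because the input strings differ.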

Peter



