unicode strings and such

Garth Grimm garth_grimm at hp.com
Thu Sep 13 11:46:57 EDT 2001


Martin von Loewis wrote:

> Garth Grimm <garth_grimm at hp.com> writes:
> 
> 
>><!--$#-*-mode:python;tab-width:8;py-indent-offset:4;indent-tabs-mode:nil-*-
>>
> 
> What kind of programming language is this? It is not Python, I can
> tell that much. It looks like the language supports embedding Python,
> though.
> 
These are actually Python scripts that run in a third-party web application server (a search engine, to be more precise).



>>a) Use UTF-8 encoding on the data file and use u'^ã??ã? ã?.ã?"$'
>>notation in it.  This would create two-element tuples of unicode
>>strings.
>>
> 
> Since I don't know the programming language you are using, it is hard
> to understand why putting UTF-8 in the first line might have any
> effect. However, if the embedded Python text is passed to a Python
> interpreter, I can tell you that the Unicode literal does *not* have
> the desired effect - it is treated as a Latin-1 string. If this is
> really UTF-8 for some Japanese text (which I cannot tell, just looking
> at the bytes), you'd need to write
> 
>    unicode('^ã??ã? ã?.ã?"$', 'utf-8')
> 
> It's not clear to me why the str() call is needed; what happens if you
> leave it out?
> 
> Regards,
> Martin
> 
Thanks Martin.  Writing the data file as:


repairList = [
    ( unicode('(ヘルプ)(\d{3})', 'utf-8'), unicode('\g<1> \g<2>', 'utf-8') ),
    ( unicode('ください.*ヘルプ', 'utf-8'), unicode('hit me', 'utf-8') ),
    ( unicode('^ください$', 'utf-8'), unicode('\g<0> hit me again', 'utf-8') ),
]


does indeed allow me to write the application logic as:

for (pattern, patch) in repairList:
    # Flags are bit masks: combine them with |, since "or" would pass only re.UNICODE.
    patternRegex = re.compile(pattern, re.UNICODE | re.IGNORECASE)
    qt = patternRegex.sub(patch, qt)

which is certainly easier to follow.  The data file is generated with an XSLT stylesheet, so
modifying the XSL to wrap the "real" data in more language syntax is rather trivial.
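
For anyone reading this on Python 3, here is a minimal sketch of the same data file and loop. It assumes a Python 3 interpreter, where string literals are already Unicode and the unicode(...) wrapper no longer exists. The function name repair and the snake_case variable names are made up for the illustration; they are not from the application.

```python
import re

# Python 3 sketch: str literals are Unicode, so no unicode(..., 'utf-8') wrapper.
# Raw strings keep \d and \g<...> intact for the re module.
repair_list = [
    (r'(ヘルプ)(\d{3})', r'\g<1> \g<2>'),
    (r'ください.*ヘルプ', 'hit me'),
    (r'^ください$', r'\g<0> hit me again'),
]


def repair(qt):
    """Apply each (pattern, patch) pair, in order, to the query text."""
    for pattern, patch in repair_list:
        # Flags are bit masks, so | combines them; "a or b" would just yield a.
        pattern_regex = re.compile(pattern, re.UNICODE | re.IGNORECASE)
        qt = pattern_regex.sub(patch, qt)
    return qt


print(repair('ヘルプ123'))
print(repair('ください'))
```

The pairs are applied in sequence, so a later pattern sees the output of the earlier substitutions, not the original text.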

I think I'm still going to try to figure out what was actually happening with the original
approach that we tried, because there seem to be some language concepts I'd like to get
cleared up.
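
One way to picture the effect Martin describes (UTF-8 bytes being read as Latin-1) is the sketch below. It is written for Python 3, where bytes and text are separate types, and the sample string is an assumption, since the original bytes in this thread are not recoverable. It illustrates the general mechanism only, not the actual behaviour of the application server.

```python
# Sample text is an assumption chosen for illustration.
text = 'ください'
raw = text.encode('utf-8')         # the bytes as they would sit in a UTF-8 data file

as_latin1 = raw.decode('latin-1')  # one character per byte: the wrong text
as_utf8 = raw.decode('utf-8')      # the intended Japanese text

# Each of these four characters takes three bytes in UTF-8.
print(len(raw), len(as_latin1), len(as_utf8))  # 12 12 4
print(as_utf8 == text, as_latin1 == text)      # True False
```

A regular expression compiled from the Latin-1 reading would be matching twelve unrelated characters rather than four Japanese ones, which may explain why the original patterns did not behave as expected.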

Thanks again,
Garth



