[XML-SIG] 4XPath: parsing Unicode string

Sat, 25 Nov 2000 19:27:43 -0700

[crossposted to string SIG & i18n SIG since we need help from anywhere we can 
get it]

> I've used 4Suite 0.9.2 together with Python 2.0 and PyXML 0.6.2.
> 
> I have a problem that I cannot pass a Unicode string containing
> Japanese characters to the 4XPath parser.  Following reproduces
> the problem:

[snip]

Uh oh.  Tick tick tick BOOM!

This is one of those problems I knew would come up, but was hoping to put off 
dealing with.

The problem is that we use FLEX and Bison for XPath parsing.  This is old 
school code that doesn't know about wchar_t, let alone Unicode.

The solution is for us to uniformly translate things to UTF-8 for the scanner, 
and then update the scanner so that it handles UTF-8 encoded sequences.  But 
this is a very ugly mound of work.
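
To make the UTF-8 idea concrete, here is a rough sketch in present-day 
Python 3 (not the Python 2.0 under discussion).  It is only a stand-in for a 
byte-oriented scanner, not the 4XPath scanner itself: NAME_BYTES and 
scan_names are hypothetical names, and the name rule is deliberately looser 
than the real XML name rules.  It relies on the fact that in UTF-8 every byte 
of a multi-byte sequence is 0x80 or above, so a byte scanner that accepts 
those bytes as name characters never splits an encoded character.

```python
import re

# Hypothetical byte-level name rule (an assumption for illustration).
# Bytes 0x80-0xFF occur only inside multi-byte UTF-8 sequences, so accepting
# them as name bytes keeps every encoded character inside a single token.
NAME_BYTES = re.compile(rb"[A-Za-z_\x80-\xff][A-Za-z0-9_.\-\x80-\xff]*")


def scan_names(expr):
    """Encode expr as UTF-8, scan the bytes, and decode each name token."""
    data = expr.encode("utf-8")
    return [m.group().decode("utf-8") for m in NAME_BYTES.finditer(data)]


if __name__ == "__main__":
    print(scan_names("/文書/章"))
```

The scanner only ever sees bytes; decoding happens at the boundary, once a 
token has been recognized.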

Or does anyone out there know of an i18n-friendly scanner?

Or does anyone out there have _any_ other ideas?  We're pretty much at the end 
of the tether with FLEX and Bison for other problems (concurrency), but we're 
not coming up with other good parsing solutions.  We've looked at most of the 
Python tools available at the Vaults of Parnassus, and they don't really cut 
it.  We could probably do it with re and some glue code to handle the 
non-regular portions of XPath, but this would also be a huge task.
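
For a sense of what the re-plus-glue route might look like, here is a minimal 
sketch in present-day Python 3, where re is Unicode-aware for str patterns.  
It covers only tokenizing, not parsing; TOKEN, follows_operand and tokenize 
are hypothetical names, and namespace prefixes (the single colon) are left 
out.  The glue shown is the XPath 1.0 lexical rule that "*" is a multiply 
operator, and and/or/mod/div are operator names, only when a preceding token 
exists and is not @, ::, (, [, a comma or an operator.

```python
import re

# One alternative per token class; order matters (".." before ".").
TOKEN = re.compile(r"""
    (?P<space>\s+)
  | (?P<number>\d+(?:\.\d*)?|\.\d+)
  | (?P<literal>"[^"]*"|'[^']*')
  | (?P<op>::|//|\.\.|!=|<=|>=|[/|+\-=<>()\[\]@,.])
  | (?P<star>\*)
  | (?P<name>[^\W\d][\w.\-]*)
""", re.VERBOSE)

OPERATOR_NAMES = {"and", "or", "mod", "div"}


def follows_operand(tokens):
    """True if the previous token can end an operand (the context rule)."""
    if not tokens:
        return False
    kind, text = tokens[-1]
    if kind == "op":
        return text in (")", "]", ".", "..")
    return kind in ("name", "number", "literal", "nametest")


def tokenize(expr):
    """Split an XPath expression into (kind, text) pairs."""
    tokens, pos = [], 0
    while pos < len(expr):
        m = TOKEN.match(expr, pos)
        if m is None:
            raise ValueError(f"unexpected character {expr[pos]!r} at {pos}")
        pos = m.end()
        kind, text = m.lastgroup, m.group()
        if kind == "space":
            continue
        if kind == "star":
            # Glue: the regex alone cannot tell multiply from a wildcard.
            kind = "multiply" if follows_operand(tokens) else "nametest"
        elif kind == "name" and text in OPERATOR_NAMES and follows_operand(tokens):
            kind = "op"
        tokens.append((kind, text))
    return tokens


if __name__ == "__main__":
    for token in tokenize("//章[position() * 2 = 4]/*"):
        print(token)
```

The nested predicates and parenthesized expressions would still need a 
hand-written parser on top of this token stream, which is where most of the 
work would go.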

Any help or ideas are appreciated.

-- 
Uche Ogbuji                               Principal Consultant
uche.ogbuji@fourthought.com               +1 303 583 9900 x 101
Fourthought, Inc.                         http://Fourthought.com 
4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA
Software-engineering, knowledge-management, XML, CORBA, Linux, Python