[XML-SIG] 4XPath: parsing Unicode string
uche.ogbuji@fourthought.com
Sat, 25 Nov 2000 19:27:43 -0700
[crossposted to string SIG & i18n SIG since we need help from anywhere we can
get it]
> I've used 4Suite 0.9.2 together with Python 2.0 and PyXML 0.6.2.
>
> I have a problem that I cannot pass a Unicode string containing
> Japanese characters to the 4XPath parser. Following reproduces
> the problem:
[snip]
Uh oh. Tick tick tick BOOM!
This is one of those problems I knew would come up, but was hoping to put off
dealing with.
The problem is that we use FLEX and Bison for XPath parsing. This is old
school code that doesn't know about wchar_t, let alone Unicode.
The solution is for us to uniformly translate things to UTF-8 for the scanner,
and then update the scanner so that it handles UTF-8 encoded sequences. But
this is a very ugly mound of work.
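
To make the UTF-8 route a little more concrete, here is a rough sketch in
present-day Python rather than in the FLEX scanner itself. The names
to_scanner_input and scan_name are hypothetical, and treating every byte
>= 0x80 as a name character is a simplifying assumption, not the real XML
name rules. It only illustrates that a byte-oriented scanner can pass
multi-byte sequences through once the input is uniformly UTF-8.

```python
def to_scanner_input(expr):
    """Encode a Unicode XPath expression as UTF-8 bytes for a byte-oriented scanner."""
    return expr.encode("utf-8")


def scan_name(data, pos):
    """Return the offset just past a name that starts at data[pos].

    Assumption for this sketch: ASCII letters, digits, '_', '-' and '.' are
    name characters, and so is every byte >= 0x80 (all lead and continuation
    bytes of UTF-8 multi-byte sequences have the high bit set).
    """
    end = pos
    while end < len(data):
        b = data[end]
        if b >= 0x80 or chr(b).isalnum() or b in b"_-.":
            end += 1
        else:
            break
    return end


data = to_scanner_input("//日本語/child")
end = scan_name(data, 2)           # start just after the leading "//"
name = data[2:end].decode("utf-8")  # decode the token back to Unicode
```

A real scanner would also have to reject malformed UTF-8 and apply the
proper name-start rules, which is where most of the ugly work would be.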
Or does anyone out there know of an i18n-friendly scanner?
Or does anyone out there have _any_ other ideas? We're pretty much at the end
of our tether with FLEX and Bison because of other problems (concurrency), but we're
not coming up with other good parsing solutions. We've looked at most of the
Python tools available at the Vaults of Parnassus, and they don't really cut
it. We could probably do it with re and some glue code to handle the
non-regular portions of XPath, but this would also be a huge task.
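
As a rough idea of what the re half of that approach might look like, here
is a minimal tokenizer sketch in present-day Python, where \w matches
Unicode word characters for str patterns. The token categories and the
names TOKEN_RE and tokenize are hypothetical and cover only a small part of
the XPath lexical grammar. The non-regular parts (nested predicates,
operator precedence) would still need glue code such as a hand-written
recursive-descent parser consuming these tokens.

```python
import re

# Simplified token classes; the alternatives are tried in the order listed.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+(?:\.\d*)?|\.\d+)
  | (?P<LITERAL>"[^"]*"|'[^']*')
  | (?P<NAME>[^\W\d][\w.\-]*)
  | (?P<OP>//|::|\.\.|!=|<=|>=|[/\[\]()@,|=<>+*.\-])
  | (?P<WS>\s+)
""", re.VERBOSE)


def tokenize(expr):
    """Split a Unicode XPath expression into (kind, text) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(expr):
        m = TOKEN_RE.match(expr, pos)
        if m is None:
            raise ValueError(f"unexpected character {expr[pos]!r} at offset {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


tokens = tokenize("//書籍[@題名 = '吾輩は猫である']")
```

Note that this sketch treats '*' and '-' purely lexically. Deciding whether
'*' is a wildcard or multiplication, or whether a name is an axis, function
or element name, depends on the neighbouring tokens and belongs in the glue
code.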
Any help or ideas are appreciated.
--
Uche Ogbuji Principal Consultant
uche.ogbuji@fourthought.com +1 303 583 9900 x 101
Fourthought, Inc. http://Fourthought.com
4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA
Software-engineering, knowledge-management, XML, CORBA, Linux, Python