[I18n-sig] Test Suite for the Unicode codecs

M.-A. Lemburg mal@lemburg.com
Mon, 03 Apr 2000 10:31:45 +0200

[CC:ing to i18n-sig -- hope this is ok]

Andy Robinson wrote:
> >
> > That would be great of course... but how do we get native
> > script readers for all those code pages ?
> I suspect we won't.  Unicode fonts with all 45k glyphs are not exactly
> common; there is one, but it was full of holes last time I checked.  There
> are two approaches to viewing the CJK ones:
> 1. Use IE5 or Netscape.  IE5 comes with lots of font packs for most
> languages, especially the Asian ones.  One makes up preformatted text files
> designed to mirror the vendor's or standards organisation's code chart, puts
> it through a round trip, and tells the browser to display it - possibly side
> by side with the original.  If you feel clever, you can use tables and
> highlight things which fail the round trip.  Of course, this depends on the
> fonts you have installed, and these vary.
> 2. (Some months off)  Use Acrobat 4.0 and the Language Packs from Adobe.
> This is the first really platform-independent viewing technology; I have
> wrapped up the Japanese one in ReportLab and used it very successfully at
> Fidelity Investments last year to prove round trips from AS400, but have to
> rewrite that code as it was done in-house for them.  I write a loop to print
> about fifteen pages of charts which are laid out exactly like the relevant
> Appendix in "CJKV Information Processing", run it through some
> transformations, then sit staring at all 6879 glyphs for a couple of hours.
> Sometimes, while bored, I did plots to show how code points mapped from one
> encoding to another; we had to reverse engineer an AS400 encoding.   Adobe's
> CID fonts include their own mapping tables and conversion at the PostScript
> level; if I ask for the font "Mincho-UTF8",  I get it encoded that way and
> can feed it UTF8 strings; if I ask for the font "Mincho-SJIS" I get a
> Shift-JIS encoded font.

This looks like an awful lot of work. Isn't there some better
way to get this done ? (There might be a problem due to different
composition of characters, but I think we could handle it by implementing
the normalization algorithm for Unicode.)
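
To illustrate the normalization point: a minimal sketch, assuming a
modern Python 3 whose standard unicodedata module provides normalize()
(that function postdates this thread). The helper name same_text is
hypothetical, chosen only for this example. It shows two strings that
differ in composition comparing equal once both are normalized.

```python
import unicodedata


def same_text(a, b, form="NFC"):
    """Return True if a and b are equal after Unicode normalization.

    'form' is one of the standard normalization forms accepted by
    unicodedata.normalize: "NFC", "NFD", "NFKC" or "NFKD".
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# The same visible character, composed in two different ways.
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# A plain comparison sees two different code point sequences ...
print(precomposed == decomposed)
# ... while the normalized comparison treats them as the same text.
print(same_text(precomposed, decomposed))
```

A codec test could use such a comparison so that a round trip which
only changes the composition of a character is not reported as a
mapping error.
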
> This is actually my main interest in the Unicode stuff; to build a global
> reporting engine, we have to handle data in any encoding and feed it to the
> font engine in an encoding the font can handle.
> The great thing about PDF code charts is that they are immutable and not
> dependent on your PC setup.
> >
> > > For testing, I think the best approach is to compare output to another
> > > well-known mapping utility.  The most convenient I know of is
> > > uniconv.exe from http://www.basistech.com/ - not Open Source and
> > > Windows-only, but it is a straightforward goal for us to write a
> > > uniconv.py that perfectly mimics its behaviour.
> >
> > Ok, I've just downloaded it (it's a bit hidden as a demo of
> > their C++ Unicode class lib) and will give it a try next week.
> >
> > > Marc-Andre, do you have any preferences for where a test suite and
> > > bunch of add-on tools live?  Do you want something which fits into the
> > > standard distribution, or can we handle it outside?
> >
> > Hmm, tests for the builtin codecs should live in Lib/test
> > with the output in Lib/test/output. Tools etc. are probably
> > best placed somewhere into the Tools/ directory (e.g. the
> > gencodec.py script lives in Tools/scripts). Perhaps we need
> > a separate Tools/unicode if there are going to be many different
> > scripts...
> I must admit, I was thinking of an actual web server test framework which
> kept a database of sample text files, did round trip tests on demand, and
> could hand out HTML and PDF files to anyone who asked - probably a bit much
> for the standard Python library.  One needs knowledge of each individual
> code page and some quite devious test files to test out double-byte codecs.
> For single-byte, we need a reliable way to see all the code points before we
> dare rely on full round trip tests and assertions.   I think we need some
> separate project on starship, sourceforge or wherever to mess around with
> this stuff, and then you can decide what is worth including in the main
> distribution.

Ok. For now I'll leave the current cp codecs in place and
simply wait for people to report bugs in the mapping tables...
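
For the single-byte code pages, the round trip test discussed above
could be sketched roughly as follows. This assumes a modern Python 3
with bytes.decode() and str.encode(); the function name
roundtrip_failures is hypothetical, and the code pages listed are just
examples. Note that it only checks that the codec is consistent with
itself. It cannot tell whether the mapping table matches the vendor's
code chart, which is why comparison against a second source is still
needed.

```python
def roundtrip_failures(encoding):
    """Decode each of the 256 byte values and encode the result again.

    Returns a list of (byte value, decoded text, re-encoded bytes)
    tuples for every byte that does not survive the round trip.
    Bytes the codec treats as undefined are skipped.
    """
    failures = []
    for value in range(256):
        raw = bytes([value])
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # byte is not mapped in this code page
        try:
            back = text.encode(encoding)
        except UnicodeEncodeError:
            back = None  # decoded character cannot be encoded again
        if back != raw:
            failures.append((value, text, back))
    return failures


# Example code pages; any single-byte codec name known to Python works.
for name in ("latin-1", "cp1252", "cp437"):
    print(name, roundtrip_failures(name))
```

An empty list means every defined byte maps to a character that
encodes back to the same byte.
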

Marc-Andre Lemburg
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/