unicode and dbf files

Tue Oct 27 11:51:52 EDT 2009

John Machin wrote:
> On Oct 27, 7:15 am, Ethan Furman <et... at stoneleaf.us> wrote:
>
>>Let me rephrase -- say I get a dbf file with an LDID of \x0f that maps
>>to a cp437, and the file came from a german oem machine... could that
>>file have upper-ascii codes that will not map to anything reasonable on
>>my \x01 cp437 machine?  If so, is there anything I can do about it?
> 
> ASCII is defined over the first 128 codepoints; "upper-ascii codes" is
> meaningless. As for the rest of your question, if the file's encoded
> in cpXXX, it's encoded in cpXXX. If either the creator or the reader
> or both are lying, then all bets are off.

My confusion is this -- is there a difference between any of the various 
cp437s?  Going down the list at ESRI: 0x01, 0x09, 0x0b, 0x0d, 0x0f, 
0x11, 0x15, 0x18, 0x19, and 0x1b all map to cp437, and they have names 
such as US, Dutch, Finnish, French, German, Italian, Swedish, Spanish, 
English (Britain & US)... are these all the same?
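
As far as decoding goes, the question can be framed like this: if every
one of those LDIDs really does resolve to the same codec, the bytes come
out the same whichever LDID is in the header. A minimal sketch, assuming
the ESRI list is accurate; LDID_TO_CODEC and decode_field are
hypothetical names, not part of any existing library:

```python
# LDID bytes that the ESRI list (as quoted above) maps to cp437.
CP437_LDIDS = (0x01, 0x09, 0x0b, 0x0d, 0x0f, 0x11, 0x15, 0x18, 0x19, 0x1b)

# Hypothetical lookup table: LDID byte -> Python codec name.
LDID_TO_CODEC = {ldid: "cp437" for ldid in CP437_LDIDS}


def decode_field(raw, ldid):
    """Decode raw field bytes with the codec the LDID maps to."""
    return raw.decode(LDID_TO_CODEC[ldid])


raw = b"M\x81nchen"  # 0x81 is u-umlaut in cp437

# Same codec, so the "US" and "German" LDIDs give the same text.
same = decode_field(raw, 0x01) == decode_field(raw, 0x0f)
```

This only shows that the lookup collapses to one codec; it says nothing
about whether the file's creator actually wrote cp437 bytes.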

>>>BTW, what are you planning to do with an LDID of 0x00?
>>
>>Hmmm.  Well, logical choices seem to be either treating it as plain
>>ascii, and barfing when high-ascii shows up; defaulting to \x01; or
>>forcing the user to choose one on initial access.
> 
> It would be more useful to allow the user to specify an encoding than
> an LDID.

I plan on using the same technique used in xlrd and xlwt, and allowing 
an encoding to be specified when the table is opened.  If not specified, 
it will use whatever the table has in the LDID field.
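
A rough sketch of that fallback, assuming a partial LDID table and a
hypothetical helper named resolve_encoding (the real module's names and
table may differ):

```python
import codecs

# Hypothetical, partial table: only the two LDIDs mentioned in this thread.
LDID_TO_CODEC = {0x01: "cp437", 0x0f: "cp437"}


def resolve_encoding(ldid, encoding=None):
    """Pick the codec for a table.

    An explicitly supplied encoding wins; otherwise the LDID byte from
    the table header is looked up. Returns None when neither is usable.
    """
    if encoding is not None:
        codecs.lookup(encoding)  # raises LookupError for an unknown codec
        return encoding
    return LDID_TO_CODEC.get(ldid)
```

With this shape, resolve_encoding(0x0f) gives "cp437", while
resolve_encoding(0x0f, "cp1252") ignores the LDID and gives "cp1252".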

> You need to be able to read files created not only by software like
> VFP or dBase but also scripts using third-party libraries. It would be
> useful to allow an encoding to override an LDID that is incorrect e.g.
> the LDID implies cp1251 but the data is actually encoded in koi8[ru]
> 
> Read this: http://en.wikipedia.org/wiki/Code_page_437
> With no LDID in the file and no encoding supplied, I'd be inclined to
> make it barf if any codepoint not in range(32, 128) showed up.

Sounds reasonable -- especially when the encoding can be overridden.
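
One way the strict check could look, as a sketch only; the function name
decode_without_encoding is made up for illustration, and the accepted
range is the range(32, 128) suggested above:

```python
def decode_without_encoding(raw):
    """Decode bytes when neither an LDID nor an encoding is available.

    Only byte values in range(32, 128) are accepted; anything else
    raises ValueError so the caller knows to supply an encoding.
    """
    for offset, byte in enumerate(raw):
        if byte not in range(32, 128):
            raise ValueError(
                "byte 0x%02x at offset %d is outside range(32, 128); "
                "specify an encoding" % (byte, offset)
            )
    return raw.decode("ascii")
```

Passing an explicit encoding when opening the table would bypass this
check entirely, which is what makes the strict default tolerable.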

~Ethan~