unicode and dbf files

Mon Oct 26 16:15:38 EDT 2009

John Machin wrote:
> On Oct 27, 3:22 am, Ethan Furman <et... at stoneleaf.us> wrote:
> 
>>John Machin wrote:
>>
>>>Try this:
>>>http://webhelp.esri.com/arcpad/8.0/referenceguide/
>>
>>Wow.  Question, though:  all those codepages mapping to 437 and 850 --
>>are they really all the same?
> 
> 437 and 850 *are* codepages. You mean "all those language driver IDs
> mapping to codepages 437 and 850". A codepage merely gives an
> encoding. An LDID is like a locale; it includes other things besides
> the encoding. That's why many Western European languages map to the
> same codepage, first 437 then later 850 then 1252 when Windows came
> along.

Let me rephrase -- say I get a dbf file with an LDID of \x0f that maps
to cp437, and the file came from a German OEM machine... could that
file have upper-ASCII codes that will not map to anything reasonable on
my \x01 cp437 machine?  If so, is there anything I can do about it?
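
As far as the byte-to-character step goes, the distinction John draws
above can be sketched as follows. This is a minimal illustration, not
code from the thread: the table and the `decode_field` helper are
hypothetical, and only the two LDID values and their cp437 mapping come
from the question. If both IDs really name the same code page, the
decoded text is identical; whatever else the LDID implies is not
modelled here.

```python
# Hypothetical two-entry table; only these IDs and the code page
# they map to are taken from the question above.
LDID_TO_CODEC = {
    0x01: "cp437",  # the "\x01 cp437 machine"
    0x0F: "cp437",  # the file from the German OEM machine
}


def decode_field(raw, ldid):
    """Decode raw field bytes using the codec registered for this LDID."""
    return raw.decode(LDID_TO_CODEC[ldid])


# 0x81 and 0xE1 are u-umlaut and sharp s in cp437.
raw = b"Gr\x81\xe1e"

# Same codec name, so the same bytes give the same text under either LDID.
assert decode_field(raw, 0x0F) == decode_field(raw, 0x01)
print(decode_field(raw, 0x0F))
```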

>>>>    '\x68' : ('cp895', 'Kamenicky (Czech) MS-DOS'),     # iffy
>>
>>>Indeed iffy. Python doesn't have a cp895 encoding, and it's probably
>>>not alone. I suggest that you omit Kamenicky until someone actually
>>>wants it.
>>
>>Yeah, I noticed that.  Tentative plan was to implement it myself (more
>>for practice than anything else), and also to be able to raise a more
>>specific error ("Kamenicky not currently supported" or some such).
> 
> 
> The error idea is fine, but I don't get the "implement it yourself for
> practice" bit ... practice what? You plan a long and fruitful career
> implementing codecs for YAGNI codepages?

ROFL.  Playing with code; the unicode/code page interactions.  Possibly
looking at constructs I might not otherwise see.  Since this would
almost certainly (I don't like saying "absolutely" and "never" -- been
troubleshooting for too many years for that!-) be a YAGNI, implementing
it is very low priority.
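
The "more specific error" idea might look something like the sketch
below. It is an illustration under assumptions, not the actual dbf
module: the `UnsupportedCodePage` class, the `CODE_PAGES` table, and
`codec_for` are all hypothetical names, and the Japanese entry uses
cp932 as corrected further down the thread. The only library call is
`codecs.lookup`, which raises `LookupError` for an encoding Python
does not know.

```python
import codecs


class UnsupportedCodePage(LookupError):
    """Raised when a file names a code page Python has no codec for."""


# Hypothetical excerpt of the LDID table discussed in the thread.
CODE_PAGES = {
    b"\x68": ("cp895", "Kamenicky (Czech) MS-DOS"),
    b"\x7b": ("cp932", "Japanese Windows"),
}


def codec_for(ldid):
    """Return the codec name for an LDID byte, or raise a specific error."""
    name, description = CODE_PAGES[ldid]
    try:
        codecs.lookup(name)  # LookupError if the codec is not available
    except LookupError:
        raise UnsupportedCodePage(
            "%s not currently supported" % description
        ) from None
    return name


print(codec_for(b"\x7b"))
try:
    codec_for(b"\x68")
except UnsupportedCodePage as exc:
    print(exc)
```

Checking at lookup time rather than hard-coding a list of unsupported
entries means the error goes away by itself if a cp895 codec is ever
registered.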

>>>>    '\x7b' : ('iso2022_jp', 'Japanese Windows'),        # wag
>>
>>>Try cp936.
>>
>>You mean 932?
> 
> 
> Yes.
> 
> 
>>Very helpful indeed.  Many thanks for reviewing and correcting.
> 
> 
> You're welcome.
> 
> 
>>Learning to deal with unicode is proving more difficult for me than
>>learning Python was to begin with!  ;D
> 
> 
> ?? As far as I can tell, the topic has been about mapping from
> something like a locale to the name of an encoding, i.e. all about the
> pre-Unicode mishmash and nothing to do with dealing with unicode ...

You are, of course, correct.  Once it's all unicode, life will be easier
(he says, all innocent-like).  And dbf files even bigger, lol.

> BTW, what are you planning to do with an LDID of 0x00?

Hmmm.  Well, logical choices seem to be either treating it as plain 
ascii, and barfing when high-ascii shows up; defaulting to \x01; or 
forcing the user to choose one on initial access.

I am definitely open to ideas!
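
To make the three options concrete, here is a rough sketch. The
function name, the policy strings, and the `user_choice` parameter are
all made up for illustration; the only thing taken from the thread is
that \x01 corresponds to cp437.

```python
def codec_for_missing_ldid(policy="ascii", user_choice=None):
    """Pick a codec name for a file whose LDID byte is 0x00.

    policy is one of:
      "ascii"   - treat the data as plain ASCII; high bytes fail on decode
      "default" - fall back to the \\x01 code page (cp437)
      "ask"     - require the caller to supply a codec name
    """
    if policy == "ascii":
        return "ascii"
    if policy == "default":
        return "cp437"
    if policy == "ask":
        if user_choice is None:
            raise ValueError("file records no code page; please choose one")
        return user_choice
    raise ValueError("unknown policy: %r" % (policy,))


raw = b"caf\x82"  # 0x82 is e-acute in cp437

# Option 1: strict ASCII rejects the high byte.
try:
    raw.decode(codec_for_missing_ldid("ascii"))
except UnicodeDecodeError as exc:
    print("ascii policy failed:", exc.reason)

# Option 2: defaulting to cp437 decodes it, rightly or wrongly.
print(raw.decode(codec_for_missing_ldid("default")))

# Option 3: the caller decides on first access.
print(raw.decode(codec_for_missing_ldid("ask", user_choice="cp850")))
```

The first policy never silently mis-decodes anything, while the second
always produces some text and may guess wrong.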

> Cheers,
> 
> John