unicode and dbf files
ethan at stoneleaf.us
Mon Oct 26 21:15:38 CET 2009
John Machin wrote:
> On Oct 27, 3:22 am, Ethan Furman <et... at stoneleaf.us> wrote:
>>John Machin wrote:
>>Wow. Question, though: all those codepages mapping to 437 and 850 --
>>are they really all the same?
> 437 and 850 *are* codepages. You mean "all those language driver IDs
> mapping to codepages 437 and 850". A codepage merely gives an
> encoding. An LDID is like a locale; it includes other things besides
> the encoding. That's why many Western European languages map to the
> same codepage, first 437, then later 850, then 1252 when Windows came along.
Let me rephrase -- say I get a dbf file with an LDID of \x0f that maps
to cp437, and the file came from a German OEM machine... could that
file have upper-ascii codes that will not map to anything reasonable on
my \x01 cp437 machine? If so, is there anything I can do about it?
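To make my worry concrete (byte value picked just for illustration): the
same high byte can decode to entirely different characters under two OEM
codepages, so text written under one and read under another silently
changes meaning.

```python
# The same "upper-ascii" byte under two different OEM codepages:
data = bytes([0x9B])
print(data.decode('cp437'))   # 'ยข' (cent sign)
print(data.decode('cp850'))   # 'รธ' (o with stroke)
```

No error is raised in either case, which is the scary part -- the decode
"succeeds" both ways.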
>>>> '\x68' : ('cp895', 'Kamenicky (Czech) MS-DOS'), # iffy
>>>Indeed iffy. Python doesn't have a cp895 encoding, and it's probably
>>>not alone. I suggest that you omit Kamenicky until someone actually
>>Yeah, I noticed that. Tentative plan was to implement it myself (more
>>for practice than anything else), and also to be able to raise a more
>>specific error ("Kamenicky not currently supported" or some such).
> The error idea is fine, but I don't get the "implement it yourself for
> practice" bit ... practice what? You plan a long and fruitful career
> implementing codecs for YAGNI codepages?
ROFL. Playing with code; the unicode/code page interactions. Possibly
looking at constructs I might not otherwise. Since this would almost
certainly (I don't like saying "absolutely" and "never" -- been
troubleshooting for too many years for that!-) be a YAGNI, implementing
it is very low priority.
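The error idea would look something like this (a sketch -- the table and
function names are made up, and the LDID entries shown are just the ones
we've been discussing): mark entries whose codec Python doesn't ship,
and raise a clear message instead of a bare LookupError.

```python
import codecs

# Partial LDID -> codec-name table (illustrative subset only)
LDID_MAP = {
    0x01: 'cp437',
    0x02: 'cp850',
    0x68: 'cp895',   # Kamenicky -- no codec in the stdlib
}

def codec_for_ldid(ldid):
    name = LDID_MAP.get(ldid)
    if name is None:
        raise ValueError('unknown LDID 0x%02x' % ldid)
    try:
        return codecs.lookup(name)
    except LookupError:
        raise NotImplementedError(
            '%s (LDID 0x%02x) not currently supported' % (name, ldid))
```

So cp437/cp850 work as usual, while \x68 gets "cp895 (LDID 0x68) not
currently supported" until someone actually needs Kamenicky.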
>>>> '\x7b' : ('iso2022_jp', 'Japanese Windows'), # wag
>>You mean 932?
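Right -- cp932 (Windows-31J, a Shift_JIS superset) is what Japanese
Windows actually uses, whereas iso2022_jp is an escape-sequence-based
mail encoding that would be a surprising thing to find inside a dbf
field. A quick sanity check:

```python
# cp932 stores Japanese text as plain double-byte sequences,
# with no escape sequences -- suitable for fixed-width dbf fields.
text = '\u65e5\u672c'               # 'ๆ—ฅๆœฌ'
print(text.encode('cp932'))          # b'\x93\xfa\x96{'
```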
>>Very helpful indeed. Many thanks for reviewing and correcting.
> You're welcome.
>>Learning to deal with unicode is proving more difficult for me than
>>learning Python was to begin with! ;D
> ?? As far as I can tell, the topic has been about mapping from
> something like a locale to the name of an encoding, i.e. all about the
> pre-Unicode mishmash and nothing to do with dealing with unicode ...
You are, of course, correct. Once it's all unicode life will be easier
(he says, all innocent-like). And dbf files even bigger, lol.
> BTW, what are you planning to do with an LDID of 0x00?
Hmmm. Well, the logical choices seem to be: treat it as plain ascii and
barf when high-ascii shows up; default to \x01; or force the user to
choose an encoding on initial access.
I am definitely open to ideas!
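Something like this sketch, maybe (function and parameter names are made
up, and the LDID dispatch is deliberately oversimplified to the two
codepages we've been talking about):

```python
def decode_field(raw, ldid, override=None):
    """Decode a raw dbf field, treating LDID 0x00 as 'unspecified'."""
    if override is not None:
        # user chose an encoding on initial access
        return raw.decode(override)
    if ldid == 0x00:
        # plain ascii -- barfs (UnicodeDecodeError) on upper-ascii
        return raw.decode('ascii')
    return raw.decode('cp437' if ldid == 0x01 else 'cp850')
```

That way the safe strict behavior is the default, and anyone who knows
better can override it up front.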