unicode and dbf files
Ethan Furman
ethan at stoneleaf.us
Mon Oct 26 16:15:38 EDT 2009
John Machin wrote:
> On Oct 27, 3:22 am, Ethan Furman <et... at stoneleaf.us> wrote:
>
>>John Machin wrote:
>>
>>>Try this:
>>>http://webhelp.esri.com/arcpad/8.0/referenceguide/
>>
>>Wow. Question, though: all those codepages mapping to 437 and 850 --
>>are they really all the same?
>
> 437 and 850 *are* codepages. You mean "all those language driver IDs
> mapping to codepages 437 and 850". A codepage merely gives an
> encoding. An LDID is like a locale; it includes other things besides
> the encoding. That's why many Western European languages map to the
> same codepage, first 437 then later 850 then 1252 when Windows came
> along.
Let me rephrase -- say I get a dbf file with an LDID of \x0f that maps
to cp437, and the file came from a German OEM machine... could that
file have upper-ascii codes that will not map to anything reasonable on
my \x01 cp437 machine? If so, is there anything I can do about it?
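
To make the question concrete, here is a minimal sketch of the decoding
step only. The two-entry table and the helper name decode_field are
made up for illustration; the only facts assumed are the ones above
(both \x0f and \x01 map to cp437). If that holds, the same codec is used
either way, so the bytes decode identically; whether the file was really
written in that codepage is a separate question.

```python
# Illustrative two-entry table, not a complete or official LDID list.
LDID_TO_CODEC = {
    0x01: 'cp437',   # the "\x01 cp437" machine mentioned above
    0x0f: 'cp437',   # the LDID on the incoming file, also cp437
}

def decode_field(raw, ldid):
    """Decode raw field bytes with the codec registered for this LDID.

    Hypothetical helper: the LDID only selects the codec name here;
    anything else an LDID implies (collation etc.) is ignored.
    """
    return raw.decode(LDID_TO_CODEC[ldid])

# 0xE1 is the sharp s in cp437, so a German word survives the round trip
# no matter which of the two LDIDs selected the codec.
raw = b'Stra\xe1e'
same = decode_field(raw, 0x0f) == decode_field(raw, 0x01)
```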
>>>> '\x68' : ('cp895', 'Kamenicky (Czech) MS-DOS'), # iffy
>>
>>>Indeed iffy. Python doesn't have a cp895 encoding, and it's probably
>>>not alone. I suggest that you omit Kamenicky until someone actually
>>>wants it.
>>
>>Yeah, I noticed that. Tentative plan was to implement it myself (more
>>for practice than anything else), and also to be able to raise a more
>>specific error ("Kamenicky not currently supported" or some such).
>
>
> The error idea is fine, but I don't get the "implement it yourself for
> practice" bit ... practice what? You plan a long and fruitful career
> implementing codecs for YAGNI codepages?
ROFL. Playing with code; the unicode/code page interactions. Possibly
looking at constructs I might not otherwise. Since this would almost
certainly (I don't like saying "absolutely" and "never" -- been
troubleshooting for too many years for that!-) be a YAGNI, implementing
it is very low priority.
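
In the meantime, the "more specific error" could look roughly like the
sketch below. The exception class UnsupportedCodepage and the function
codec_for are names invented here; the only library call relied on is
codecs.lookup, which raises LookupError for an encoding Python does not
know.

```python
import codecs

class UnsupportedCodepage(LookupError):
    """Raised when a table entry names a codepage Python has no codec for."""

def codec_for(name, description):
    """Return the CodecInfo for `name`, or raise a readable error.

    `description` is the human-readable label from the LDID table and is
    only used to build the error message.
    """
    try:
        return codecs.lookup(name)
    except LookupError:
        raise UnsupportedCodepage(
            '%s (%s) not currently supported' % (description, name))
```

With something like this, a table entry whose codec is missing could
stay in the table and fail only when a file actually asks for it.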
>>>> '\x7b' : ('iso2022_jp', 'Japanese Windows'), # wag
>>
>>>Try cp936.
>>
>>You mean 932?
>
>
> Yes.
>
>
>>Very helpful indeed. Many thanks for reviewing and correcting.
>
>
> You're welcome.
>
>
>>Learning to deal with unicode is proving more difficult for me than
>>learning Python was to begin with! ;D
>
>
> ?? As far as I can tell, the topic has been about mapping from
> something like a locale to the name of an encoding, i.e. all about the
> pre-Unicode mishmash and nothing to do with dealing with unicode ...
You are, of course, correct. Once it's all unicode life will be easier
(he says, all innocent-like). And dbf files even bigger, lol.
> BTW, what are you planning to do with an LDID of 0x00?
Hmmm. Well, logical choices seem to be either treating it as plain
ascii, and barfing when high-ascii shows up; defaulting to \x01; or
forcing the user to choose one on initial access.
I am definitely open to ideas!
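
For comparison, the three choices above could be sketched as one
function with a policy switch. Everything here is hypothetical (the
function name, the policy strings, the table argument); it is only meant
to show how the options differ, not to settle which is right.

```python
def codec_for_ldid(ldid, table, policy='ascii', user_choice=None):
    """Pick a codec name for an LDID byte; 0x00 is the special case.

    table       -- dict mapping LDID ints to codec names (assumed shape)
    policy      -- what to do for 0x00: 'ascii', 'default' or 'ask'
    user_choice -- codec name supplied by the user, used by 'ask'
    """
    if ldid != 0x00:
        return table[ldid]
    if policy == 'ascii':
        # Strict: decoding any byte >= 0x80 raises UnicodeDecodeError.
        return 'ascii'
    if policy == 'default':
        # Fall back to whatever the table says for \x01.
        return table[0x01]
    if policy == 'ask':
        # Force an explicit choice on initial access.
        if user_choice is None:
            raise ValueError(
                'LDID is 0x00; a codepage must be chosen explicitly')
        return user_choice
    raise ValueError('unknown policy: %r' % (policy,))
```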
> Cheers,
>
> John