unicode and dbf files
Ethan Furman
ethan at stoneleaf.us
Mon Oct 26 12:22:22 EDT 2009
John Machin wrote:
> On Oct 24, 4:14 am, Ethan Furman <et... at stoneleaf.us> wrote:
>
>>John Machin wrote:
>>
>>>On Oct 23, 3:03 pm, Ethan Furman <et... at stoneleaf.us> wrote:
>>
>>>>John Machin wrote:
>>
>>>>>On Oct 23, 7:28 am, Ethan Furman <et... at stoneleaf.us> wrote:
>>
>>>>>>Greetings, all!
>>
>>>>>>I would like to add unicode support to my dbf project. The dbf header
>>>>>>has a one-byte field to hold the encoding of the file. For example,
>>>>>>\x01 is code-page 437 MS-DOS.
>>
>>>>>>My google-fu is apparently not up to the task of locating a complete
>>>>>>resource that has a list of the 256 possible values and their
>>>>>>corresponding code pages.
>>
>>>>>What makes you imagine that all 256 possible values are mapped to code
>>>>>pages?
>>
>>>>I'm just wanting to make sure I have whatever is available, and
>>>>preferably standard. :D
>>
>>>>>>So far I have found this, plus variations: http://support.microsoft.com/kb/129631
>>
>>>>>>Does anyone know of anything more complete?
>>
>>>>>That is for VFP3. Try the VFP9 equivalent.
>>
>>>>>dBase 5,5,6,7 use others which are not defined in publicly available
>>>>>dBase docs AFAICT. Look for "language driver ID" and "LDID". Secondary
>>>>>source: ESRI support site.
>>
>>>>Well, a couple hours later and still not more than I started with.
>>>>Thanks for trying, though!
>>
>>>Huh? You got tips to (1) the VFP9 docs (2) the ESRI site (3) search
>>>keywords and you couldn't come up with anything??
>>
>>Perhaps "nothing new" would have been a better description. I'd already
>>seen the clicketyclick site (good info there)
>
>
> Do you think so? My take is that it leaves out most of the codepage
> numbers, and these two lines are wrong:
> 65h Nordic MS-DOS code page 865
> 66h Russian MS-DOS code page 866
That was the site I used to get my whole project going, so ignoring the
unicode aspect, it has been very helpful to me.
>>and all I found at ESRI
>>were folks trying to figure it out, plus one link to a list that was no
>>different from the vfp3 list (or was it that the list did not give the
>>hex values? Either way, of no use to me.)
>
>
> Try this:
> http://webhelp.esri.com/arcpad/8.0/referenceguide/
Wow. Question, though: all those codepages mapping to 437 and 850 --
are they really all the same?
>>I looked at dbase.com, but came up empty-handed there (not surprising,
>>since they are a commercial company).
>
>
> MS and ESRI have docs ... does that mean that they are non-commercial
> companies?
I don't know enough about ESRI to make an informed comment, so I'll just
say I'm grateful they have them! MS is a complete mystery... perhaps
they are finally seeing the light? Hard to believe, though, from a
company that has consistently changed their file formats with every release.
>>I searched some more on Microsoft's site in the VFP9 section, and was
>>able to find the code page section this time. Sadly, it only added
>>about seven codes.
>>
>>At any rate, here is what I have come up with so far. Any corrections
>>and/or additions greatly appreciated.
>>
>>code_pages = {
>> '\x01' : ('ascii', 'U.S. MS-DOS'),
>
>
> All of the sources say codepage 437, so why ascii instead of cp437?
Hard to say, really. Adjusted.
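To make the difference concrete: ascii only covers bytes 0x00 through 0x7f, while cp437 assigns a character to every byte value. A minimal sketch, written in Python 3 syntax; the sample bytes are made up for illustration and do not come from a real dbf file:

```python
# Made-up field data containing one byte above 0x7f.
raw = b'caf\x82'

# cp437 maps 0x82 to 'é', so the decode succeeds.
as_cp437 = raw.decode('cp437')

# ascii has no mapping for bytes above 0x7f, so the same data fails.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```

So an 'ascii' entry would work only for files that never use the upper half of the code page.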
>> '\x02' : ('cp850', 'International MS-DOS'),
>> '\x03' : ('cp1252', 'Windows ANSI'),
>> '\x04' : ('mac_roman', 'Standard Macintosh'),
>> '\x64' : ('cp852', 'Eastern European MS-DOS'),
>> '\x65' : ('cp866', 'Russian MS-DOS'),
>> '\x66' : ('cp865', 'Nordic MS-DOS'),
>> '\x67' : ('cp861', 'Icelandic MS-DOS'),
>> '\x68' : ('cp895', 'Kamenicky (Czech) MS-DOS'), # iffy
>
>
> Indeed iffy. Python doesn't have a cp895 encoding, and it's probably
> not alone. I suggest that you omit Kamenicky until someone actually
> wants it.
Yeah, I noticed that. Tentative plan was to implement it myself (more
for practice than anything else), and also to be able to raise a more
specific error ("Kamenicky not currently supported" or some such).
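A rough sketch of what such a specific error could look like, in Python 3 syntax. The names `UnsupportedCodePage` and `codec_for` are hypothetical and are not part of the dbf project; the only real API used is `codecs.lookup`, which raises `LookupError` for an encoding Python does not know:

```python
import codecs


class UnsupportedCodePage(LookupError):
    """Hypothetical error for code pages that Python cannot decode."""


def codec_for(name, description):
    """Return Python's CodecInfo for `name`, or raise a descriptive error."""
    try:
        return codecs.lookup(name)
    except LookupError:
        # Replace the generic "unknown encoding" message with the
        # human-readable description from the code page table.
        raise UnsupportedCodePage(
            '%s not currently supported' % description) from None
```

With this, `codec_for('cp895', 'Kamenicky')` would raise `UnsupportedCodePage`, while `codec_for('cp852', 'Eastern European MS-DOS')` returns the codec as usual.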
>> '\x69' : ('cp852', 'Mazovia (Polish) MS-DOS'), # iffy
>
>
> Look 5 lines back. cp852 is 'Eastern European MS-DOS'. Mazovia
> predates and is not the same as cp852. In any case, I suggest that you
> omit Mazovia until someone wants it. Interesting reading:
>
> http://www.jastra.com.pl/klub/ogonki.htm
Very interesting reading.
>> '\x6a' : ('cp737', 'Greek MS-DOS (437G)'),
>> '\x6b' : ('cp857', 'Turkish MS-DOS'),
>> '\x78' : ('big5', 'Traditional Chinese (Hong Kong SAR, Taiwan)\
>
>
> big5 is *not* the same as cp950. The products that create DBF files
> were designed for Windows. So when your source says that LDID 0xXX
> maps to Windows codepage YYY, I would suggest that all you should do
> is translate that without thinking to python encoding cpYYY.
Ack. Not sure how I missed 'Windows' at the end of that description.
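The "translate without thinking" rule can be sketched as follows, in Python 3 syntax. The function name `encoding_for_windows_codepage` is made up for illustration; `codecs.lookup` is used only to confirm that Python actually ships a codec under the resulting name:

```python
import codecs


def encoding_for_windows_codepage(number):
    """Map a Windows code page number YYY to the Python encoding 'cpYYY'."""
    name = 'cp%d' % number
    # Raises LookupError if this Python has no codec with that name.
    codecs.lookup(name)
    return name


# The Traditional Chinese Windows entry is code page 950, not big5.
traditional_chinese = encoding_for_windows_codepage(950)
```

Python treats big5 and cp950 as separate codecs, which is why the table entry should name cp950 directly.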
>> Windows'), # wag
>
> What does "wag" mean?
wag == 'wild ass guess'
>> '\x79' : ('iso2022_kr', 'Korean Windows'), # wag
>
> Try cp949.
Done.
>> '\x7a' : ('iso2022_jp_2', 'Chinese Simplified (PRC, Singapore)\
>> Windows'), # wag
>
>
> Very wrong. iso2022_jp_2 is supposed to include basic Japanese, basic
> (1980) Chinese (GB2312) and a basic Korean kit. However to quote from
> "CJKV Information Processing" by Ken Lunde, "... from a practical
> point of view, ISO-2022-JP-2 ..... [is] equivalent to ISO-2022-JP-1
> encoding." i.e. no Chinese support at all. Try cp936.
Done.
>> '\x7b' : ('iso2022_jp', 'Japanese Windows'), # wag
>
>
> Try cp936.
You mean 932?
>> '\x7c' : ('cp874', 'Thai Windows'), # wag
>> '\x7d' : ('cp1255', 'Hebrew Windows'),
>> '\x7e' : ('cp1256', 'Arabic Windows'),
>> '\xc8' : ('cp1250', 'Eastern European Windows'),
>> '\xc9' : ('cp1251', 'Russian Windows'),
>> '\xca' : ('cp1254', 'Turkish Windows'),
>> '\xcb' : ('cp1253', 'Greek Windows'),
>> '\x96' : ('mac_cyrillic', 'Russian Macintosh'),
>> '\x97' : ('mac_latin2', 'Macintosh EE'),
>> '\x98' : ('mac_greek', 'Greek Macintosh') }
>
>
> HTH,
> John
Very helpful indeed. Many thanks for reviewing and correcting.
Learning to deal with unicode is proving more difficult for me than
learning Python was to begin with! ;D
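For reference, a small sketch of how the corrected table might be used, in Python 3 syntax. Several things here are assumptions rather than anything stated in this thread: the language driver ID is taken to sit at byte offset 29 of the header, the keys are integers instead of the one-character strings used above (indexing bytes in Python 3 yields integers), and `encoding_from_header` is a hypothetical helper. Only a subset of the entries is shown, with the corrections from this thread applied:

```python
# Subset of the table, using the corrected encodings from this thread.
code_pages = {
    0x01: ('cp437', 'U.S. MS-DOS'),
    0x02: ('cp850', 'International MS-DOS'),
    0x03: ('cp1252', 'Windows ANSI'),
    0x78: ('cp950', 'Traditional Chinese (Hong Kong SAR, Taiwan) Windows'),
    0x79: ('cp949', 'Korean Windows'),
    0x7a: ('cp936', 'Chinese Simplified (PRC, Singapore) Windows'),
    0x7b: ('cp932', 'Japanese Windows'),
}

# Assumption: position of the language driver ID byte in the dbf header.
LDID_OFFSET = 29


def encoding_from_header(header):
    """Return the Python encoding name for a raw dbf header (bytes)."""
    ldid = header[LDID_OFFSET]
    try:
        return code_pages[ldid][0]
    except KeyError:
        raise LookupError('unknown language driver ID 0x%02x' % ldid) from None


# A fake, otherwise empty 32-byte header marked as Japanese Windows.
fake_header = bytearray(32)
fake_header[LDID_OFFSET] = 0x7b
encoding = encoding_from_header(bytes(fake_header))
```

Text fields read from such a file would then be decoded with `raw_bytes.decode(encoding)`.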
~Ethan~