[Tutor] Re: How to read unicode strings from a binary file and display them as plain ascii?
R. Alan Monroe
amonroe at columbus.rr.com
Tue Mar 1 12:39:22 CET 2005
> R. Alan Monroe wrote:
>> I started writing a program to parse the headers of truetype fonts to
>> examine their family info. But I can't manage to print out the strings
>> without the zero bytes in between each character (they display as a
>> black block labeled 'NUL' in Scite's output pane)
>>
>> I tried:
>> stuff = f.read(nlength)
>> stuff = unicode(stuff, 'utf-8')
> If there are embedded 0's in the string, it won't be utf8; it could be
> utf16 or 32.
> Try:
> unicode(stuff, 'utf-16')
> or
> stuff.decode('utf-16')
>> print type(stuff), 'stuff', stuff.encode()
>> This prints:
>>
>> <type 'unicode'> stuff [NUL]C[NUL]o[NUL]p[NUL]y[NUL]r[NUL]i[NUL]g[NUL]
> I don't understand what you tried to accomplish here.
That's evidence of what I failed to accomplish. My expected result
was to print the word "Copyright" and whatever other strings are
present in the font, with no intervening NUL characters.
> Try the other encodings. It probably is utf-16.
Aha, after some trial and error I see that I'm running into an endian
problem. It's "\x00C" in the file, which needs to be swapped to
"C\x00". I cheated temporarily by just adding 1 to the file pointer
:^)
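
A minimal sketch of the endian issue, assuming a made-up byte string laid
out the same way as the font data (the name `raw` is hypothetical, and the
syntax is Python 3 rather than the Python 2 of the attached script). Naming
the byte order in the codec should make the file pointer trick unnecessary:

```python
# Hypothetical sample: "Copy" as it sits in the file, high byte first.
raw = b'\x00C\x00o\x00p\x00y'

# Stating the byte order explicitly decodes the pairs as \x00 + character.
text = raw.decode('utf-16-be')
print(text)

# Little-endian pairs the same bytes the wrong way round, giving
# unrelated code points instead of the expected letters.
wrong = raw.decode('utf-16-le')
print(repr(wrong))

# Once decoded, the text can be shown as plain ASCII; anything outside
# ASCII is replaced with '?' rather than raising an error.
plain = text.encode('ascii', 'replace')
print(plain)
```

The plain 'utf-16' codec looks for a byte order mark and otherwise falls
back to the machine's native order, which is why it only seemed to work
after shifting the read position by one byte.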
Alan
-------------- next part --------------
#~ 11/30/1998 03:45 PM 38,308 FUTURAB.TTF
#~ 11/30/1998 03:45 PM 38,772 FUTURABI.TTF
#~ 12/10/1998 06:24 PM 32,968 FUTURAK.TTF
#~ 12/30/1998 05:15 AM 36,992 FUTURAL.TTF
#~ 12/15/1998 11:39 PM 37,712 FUTURALI.TTF
#~ 01/05/1999 03:59 AM 38,860 FUTURAXK.TTF
#~ The OpenType font starts with the Offset Table. If the font file contains only one font, the Offset Table will begin at byte 0 of the file. If the font file is a TrueType collection, the beginning point of the Offset Table for each font is indicated in the TTCHeader.
#~ Offset Table
#~ Type     Name            Description
#~ Fixed    sfnt version    0x00010000 for version 1.0.
#~ USHORT   numTables       Number of tables.
#~ USHORT   searchRange     (Maximum power of 2 <= numTables) x 16.
#~ USHORT   entrySelector   Log2(maximum power of 2 <= numTables).
#~ USHORT   rangeShift      numTables x 16 - searchRange.
import struct
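def grabushort():
    # read one big-endian unsigned 16-bit value from the open font file
    global f
    data = f.read(2)
    return int(struct.unpack('>H',data)[0])

def grabulong():
    # read one big-endian unsigned 32-bit value from the open font file
    global f
    data = f.read(4)
    return int(struct.unpack('>L',data)[0])
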
f=open('c:/windows/fonts/futurak.ttf', 'rb')
version=f.read(4)
numtables = grabushort()
print numtables
f.read(6) #skip searchrange, entryselector, rangeshift
#~ Table Directory
#~ Type    Name       Description
#~ ULONG   tag        4-byte identifier.
#~ ULONG   checkSum   CheckSum for this table.
#~ ULONG   offset     Offset from beginning of TrueType font file.
#~ ULONG   length     Length of this table.
#for x in range(numtables):
for x in range(numtables):
    tag=f.read(4)
    checksum =grabulong()
    offset = grabulong()
    tlength = grabulong()
    print 'tag', tag, 'offset', offset, 'tlength', tlength
    if tag=='name':
        # remember where the naming table lives for the second pass below
        nameoffset = offset
        namelength = tlength
        print 'nameoffset', nameoffset, 'namelength', namelength
#~ The Naming Table is organized as follows:
#~ Type         Name                Description
#~ USHORT       format              Format selector (=0).
#~ USHORT       count               Number of name records.
#~ USHORT       stringOffset        Offset to start of string storage (from start of table).
#~ NameRecord   nameRecord[count]   The name records where count is the number of records.
#~ (Variable)                       Storage for the actual string data.
#~ Each NameRecord looks like this:
#~ Type     Name         Description
#~ USHORT   platformID   Platform ID.
#~ USHORT   encodingID   Platform-specific encoding ID.
#~ USHORT   languageID   Language ID.
#~ USHORT   nameID       Name ID.
#~ USHORT   length       String length (in bytes).
#~ USHORT   offset       String offset from start of storage area (in bytes).
print
f.seek(nameoffset)
format = grabushort()
count = grabushort()
stringoffset = grabushort()
print 'format', format, 'count', count, 'stringoffset', stringoffset
for x in range(count):
    platformid = grabushort()
    encodingid = grabushort()
    languageid = grabushort()
    nameid = grabushort()
    nlength = grabushort()
    noffset = grabushort()
    print 'platformid', platformid, 'encodingid', encodingid, 'languageid', languageid, 'nameid', nameid, 'nlength', nlength, 'noffset', noffset
    if platformid==3:# microsoft
        # remember the position in the record list before jumping to the string
        bookmark = f.tell()
        print 'bookmark', bookmark
        # the +1 is the temporary cheat: it skips the leading \x00 so the
        # big-endian data happens to line up as little-endian
        f.seek(nameoffset+stringoffset+noffset+1)
        stuff = f.read(nlength)
        #stuff = unicode(stuff, 'utf-16')
        stuff = stuff.decode( 'utf-16')
        print type(stuff), 'stuff', stuff
        # return to the record list for the next NameRecord
        f.seek(bookmark)
f.close()
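
A possible way to drop the +1 cheat, sketched in Python 3 syntax under a few
assumptions: the function name `read_name_strings` and the in-memory test
table are made up for illustration and are not taken from a real font. It
relies on the platform 3 (Microsoft) strings being stored big-endian, which
matches the "\x00C" byte order seen in the file:

```python
import io
import struct

def read_name_strings(f, table_offset):
    """Return (nameID, text) pairs for platform 3 (Microsoft) name records."""
    f.seek(table_offset)
    # Naming table header: format, count, stringOffset (big-endian USHORTs).
    fmt, count, string_offset = struct.unpack('>HHH', f.read(6))
    # Read all 12-byte NameRecords up front, so no bookmark is needed later.
    records = [struct.unpack('>6H', f.read(12)) for _ in range(count)]
    names = []
    for platform_id, encoding_id, language_id, name_id, length, offset in records:
        if platform_id != 3:
            continue
        # Seek to the exact start of the string, with no +1 shift.
        f.seek(table_offset + string_offset + offset)
        raw = f.read(length)
        # Name the byte order explicitly instead of relying on 'utf-16'.
        names.append((name_id, raw.decode('utf-16-be')))
    return names

# Synthetic naming table for illustration only: one Microsoft record.
text = 'Copyright'.encode('utf-16-be')
header = struct.pack('>HHH', 0, 1, 6 + 12)           # strings start after 1 record
record = struct.pack('>6H', 3, 1, 0x0409, 0, len(text), 0)
fake_table = io.BytesIO(header + record + text)

result = read_name_strings(fake_table, 0)
print(result)
```

With a real font, the file object opened in 'rb' mode and the `nameoffset`
found in the table directory would take the place of `fake_table` and 0.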