[Tutor] Re: How to read unicode strings from a binary file and display them as plain ascii?

Tue Mar 1 12:39:22 CET 2005

> R. Alan Monroe wrote:
>> I started writing a program to parse the headers of truetype fonts to
>> examine their family info. But I can't manage to print out the strings
>> without the zero bytes in between each character (they display as a
>> black block labeled 'NUL' in Scite's output pane)
>> 
>> I tried:
>>      stuff = f.read(nlength)
>>      stuff = unicode(stuff, 'utf-8')

>    If there are embedded 0's in the string, it won't be utf8; it could be
> utf16 or 32.
>    Try:
>         unicode(stuff, 'utf-16')
> or
>         stuff.decode('utf-16')
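
A minimal sketch of why the codec matters, using a hand-built byte string as a stand-in for the data read from the font (the sample bytes are an assumption, not taken from a real file). It only compares character counts, because the characters that a bare 'utf-16' produces depend on the byte order it assumes:

```python
# Made-up stand-in for f.read(nlength): "Copy" with a zero byte before
# each letter. Not read from a real font file.
raw = b'\x00C\x00o\x00p\x00y'

# UTF-8 maps every byte below 0x80 to its own character, so each zero
# byte survives as a NUL character in the decoded string.
as_utf8 = raw.decode('utf-8')
print(len(as_utf8))            # 8 characters, 4 of them NUL
print(as_utf8.count(u'\x00'))  # 4

# UTF-16 consumes two bytes per character, so the NULs disappear as
# separate characters. Without a byte order mark, plain 'utf-16' falls
# back to the machine's native byte order, which may not match the file.
as_utf16 = raw.decode('utf-16')
print(len(as_utf16))           # 4 characters
```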

>>      print type(stuff), 'stuff', stuff.encode()
>> This prints:
>> 
>>      <type 'unicode'> stuff [NUL]C[NUL]o[NUL]p[NUL]y[NUL]r[NUL]i[NUL]g[NUL]

>    I don't understand what you tried to accomplish here.

That's evidence of what I failed to accomplish. My expected result
was to print the word "Copyright" and whatever other strings are
present in the font, with no intervening NUL characters.

>    Try the other encodings. It probably is utf-16.

Aha, after some trial and error I see that I'm running into an endian
problem. It's "\x00C" in the file, which needs to be swapped to
"C\x00". I cheated temporarily by just adding 1 to the file pointer
:^)
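
A small sketch of the byte-order issue, again on made-up bytes rather than a real font. It compares decoding from the true offset with an explicit big-endian codec against the "add 1" workaround. The buffer contents and the name `storage` are assumptions for illustration:

```python
# Made-up buffer: "Copy" in big-endian UTF-16, followed by the first two
# bytes of whatever string is stored next (here the letter "r").
storage = b'\x00C\x00o\x00p\x00y\x00r'
nlength = 8  # byte length of the string, as a name record would report it

# Read from the true offset and name the byte order explicitly.
proper = storage[0:nlength].decode('utf-16-be')
print(proper)

# The "+1" workaround: shifting by one byte makes the data look
# little-endian, but the last byte read now belongs to the next string.
# It only decodes cleanly here because that byte happens to be zero.
shifted = storage[1:1 + nlength].decode('utf-16-le')
print(shifted)
```

Both calls yield the same text for this buffer, but only the first stays inside the string's own bytes, so 'utf-16-be' is the safer choice once the workaround is removed.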

Alan
-------------- next part --------------
#~ 11/30/1998  03:45 PM            38,308 FUTURAB.TTF
#~ 11/30/1998  03:45 PM            38,772 FUTURABI.TTF
#~ 12/10/1998  06:24 PM            32,968 FUTURAK.TTF
#~ 12/30/1998  05:15 AM            36,992 FUTURAL.TTF
#~ 12/15/1998  11:39 PM            37,712 FUTURALI.TTF
#~ 01/05/1999  03:59 AM            38,860 FUTURAXK.TTF

#~ The OpenType font starts with the Offset Table. If the font file contains only one font, the Offset Table will begin at byte 0 of the file. If the font file is a TrueType collection, the beginning point of the Offset Table for each font is indicated in the TTCHeader.

#~ Offset Table
#~ Type 	Name 	Description
#~ Fixed 	sfnt version 	0x00010000 for version 1.0.
#~ USHORT 	numTables 	Number of tables.
#~ USHORT 	searchRange 	(Maximum power of 2 <= numTables) x 16.
#~ USHORT 	entrySelector 	Log2(maximum power of 2 <= numTables).
#~ USHORT 	rangeShift 	numTables x 16 - searchRange.

import struct

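def grabushort():
    # Read a big-endian unsigned 16-bit integer (USHORT) from the open file f.
    data = f.read(2)
    return int(struct.unpack('>H',data)[0])

def grabulong():
    # Read a big-endian unsigned 32-bit integer (ULONG) from the open file f.
    data = f.read(4)
    return int(struct.unpack('>L',data)[0])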

f=open('c:/windows/fonts/futurak.ttf', 'rb')

version=f.read(4)

numtables = grabushort()
print numtables

f.read(6) #skip searchrange, entryselector, rangeshift

#~ Table Directory
#~ Type 	Name 	Description
#~ ULONG 	tag 	4-byte identifier.
#~ ULONG 	checkSum 	CheckSum for this table.
#~ ULONG 	offset 	Offset from beginning of TrueType font file.
#~ ULONG 	length 	Length of this table.

for x in range(numtables):
    tag=f.read(4)
    checksum =grabulong()
    offset = grabulong()
    tlength = grabulong()
    print 'tag', tag,  'offset', offset, 'tlength', tlength
    if tag=='name':
        nameoffset = offset
        namelength = tlength

print 'nameoffset', nameoffset, 'namelength', namelength

#The Naming Table is organized as follows:
#~ Type 	Name 	Description
#~ USHORT 	format 	Format selector (=0).
#~ USHORT 	count 	Number of name records.
#~ USHORT 	stringOffset 	Offset to start of string storage (from start of table).
#~ NameRecord 	nameRecord[count] 	The name records where count is the number of records.
#~ (Variable) 		Storage for the actual string data.

#~ Each NameRecord looks like this:
#~ Type 	Name 	Description
#~ USHORT 	platformID 	Platform ID.
#~ USHORT 	encodingID 	Platform-specific encoding ID.
#~ USHORT 	languageID 	Language ID.
#~ USHORT 	nameID 	Name ID.
#~ USHORT 	length 	String length (in bytes).
#~ USHORT 	offset 	String offset from start of storage area (in bytes).
print

f.seek(nameoffset)
format = grabushort()
count = grabushort()
stringoffset = grabushort()
print 'format', format, 'count', count, 'stringoffset', stringoffset

for x in range(count):
    platformid = grabushort()
    encodingid = grabushort()
    languageid = grabushort()
    nameid = grabushort()
    nlength = grabushort()
    noffset = grabushort()
    print 'platformid', platformid, 'encodingid', encodingid, 'languageid', languageid, 'nameid', nameid, 'nlength', nlength, 'noffset', noffset
    if platformid==3:# microsoft
        bookmark = f.tell()
        print 'bookmark', bookmark
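        # Microsoft-platform strings are stored as big-endian UTF-16, so seek
        # to the true start of the string and name the byte order explicitly
        # instead of skipping one byte.
        f.seek(nameoffset+stringoffset+noffset)
        stuff = f.read(nlength)
        stuff = stuff.decode('utf-16-be')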
        print type(stuff), 'stuff', stuff
        f.seek(bookmark)

f.close()
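
The naming-table layout described in the comments above can also be exercised without a font file. The following is a sketch only: `parse_name_table` and the hand-built `sample` table are hypothetical names and data introduced for illustration, and the decoding assumes the Microsoft-platform strings are big-endian UTF-16 (the usual case for the Unicode encoding IDs).

```python
import struct

def parse_name_table(table):
    """Return (platformID, nameID, text) tuples from raw 'name' table bytes."""
    # Header: format, count, stringOffset, all big-endian USHORTs.
    fmt, count, string_offset = struct.unpack('>HHH', table[:6])
    results = []
    for i in range(count):
        start = 6 + i * 12  # each NameRecord is six USHORTs = 12 bytes
        (platform_id, encoding_id, language_id,
         name_id, length, offset) = struct.unpack('>6H', table[start:start + 12])
        begin = string_offset + offset
        raw = table[begin:begin + length]
        if platform_id == 3:  # Microsoft platform; assumed UTF-16 big-endian
            results.append((platform_id, name_id, raw.decode('utf-16-be')))
    return results

# Hypothetical one-record table built by hand, not taken from a real font.
text = u'Futura'.encode('utf-16-be')
header = struct.pack('>HHH', 0, 1, 6 + 12)      # strings start after one record
record = struct.pack('>6H', 3, 1, 0x0409, 1, len(text), 0)
sample = header + record + text
print(parse_name_table(sample))
```

Working on an in-memory byte string keeps the parsing logic separate from file handling, so the same function could be fed the bytes read from a real table at nameoffset with length namelength.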