How to read fonts in python
robert.kern at gmail.com
Mon Nov 17 21:22:51 CET 2008
Steve Holden wrote:
> ganesh gajre wrote:
>> Hello all,
>> I am writing a program to convert Indic TrueType fonts to Unicode. For
>> this I need to know how to read any file (e.g. a text, Doc, or Excel
>> file) in Python and identify the font in which that file is written, so
>> that a map file can be used to convert the file to Unicode.
> You are getting too ambitious. Text files don't have any font
> information associated with them. Not only that, but the encoding of
> Unicode character data is independent of the font used to render the
> readable glyphs as text.
> This makes it look as though you don't really know what you are doing.
> Perhaps you should start more slowly, and try explaining the real problem.
> I'm not even sure what "converting a font to Unicode" means, so you
> might start by explaining that.
Fonts associate numbers with glyphs. Using Unicode code points for most of this
mapping is reasonably common nowadays, but there are many older fonts that use
any number of other mappings. Sometimes they used fairly standard text encodings
like the ISO-8859-* series, but sometimes they used ad hoc mappings in order to
make use of Latin keyboards easily.
For some older WYSIWYG word processor documents using these fonts, the text's
"encoding" is specified in an ad hoc fashion only by the font. The word
processor file may say that the tenth character is the ASCII 'A' (or at least, the byte
0x41), but the font may map 0x41 to some Indic glyph. The only thing in the file
which says that the byte 0x41 should be interpreted as that Indic glyph is the
font. As you say, this is irrelevant to real text files, but might be useful for
Word documents which use these hacky fonts.
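To make that concrete, here is a minimal sketch of the kind of remapping involved. The table below is invented purely for illustration (the two entries do not come from any real font); an actual table has to be compiled per font by looking at which glyph sits in which slot. Real converters usually also need to handle glyphs that expand to several code points or that must be reordered, which this sketch ignores.

```python
# Hypothetical mapping for one legacy font: byte value -> Unicode text.
# These entries are made up for illustration; a real table must be
# built by hand (or from a published map file) for each font.
HACKY_FONT_TO_UNICODE = {
    0x41: "\u0905",  # the 'A' slot holds DEVANAGARI LETTER A in this made-up font
    0x42: "\u092c",  # the 'B' slot holds DEVANAGARI LETTER BA
}


def legacy_to_unicode(data, table):
    """Translate bytes typed in a legacy font into a Unicode string.

    Bytes that are missing from the table are passed through as the
    Latin-1 character with the same value.
    """
    return "".join(table.get(byte, chr(byte)) for byte in data)


print(legacy_to_unicode(b"AB 1", HACKY_FONT_TO_UNICODE))
```

The point is only that the bytes in the document stay the same; what changes is the table used to interpret them.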
Ganesh, you should take a look at FontTools to handle parsing TTF files.
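As a rough sketch of what that looks like, the following reads a font's character map with FontTools and inverts it so you can see which codes point at each glyph. The function names load_cmap and codes_by_glyph are my own, not part of FontTools, and the font path is whatever file you have on hand.

```python
def load_cmap(font_path):
    """Return {code: glyph name} from the font's preferred Unicode cmap."""
    # Third-party dependency: pip install fonttools
    from fontTools.ttLib import TTFont

    font = TTFont(font_path)
    return font["cmap"].getBestCmap()


def codes_by_glyph(cmap):
    """Invert a {code: glyph name} dict into {glyph name: sorted codes}."""
    inverted = {}
    for code, glyph_name in cmap.items():
        inverted.setdefault(glyph_name, []).append(code)
    return {name: sorted(codes) for name, codes in inverted.items()}
```

Note that getBestCmap() only considers Unicode subtables and may return None for a font that carries just a symbol or Mac-encoded table, which is quite possible with the hacky fonts described above. In that case you would have to walk font["cmap"].tables yourself and inspect each subtable's platformID, platEncID and cmap dict.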
In order to read specific document types, you will need to find parsers for each
of the file types you want to read. Be aware that many of these parsers don't parse
the font information, as they are geared more toward just extracting the text.
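For one format where the font information is easy to get at without a dedicated parser, here is a hedged sketch for .docx files (the zipped XML format, not the older binary .doc). It assumes the usual layout with the body in word/document.xml and font names stored as attributes on w:rFonts elements; the function name fonts_in_docx is my own.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by .docx files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def fonts_in_docx(path_or_file):
    """Return the set of font names set directly on runs in a .docx body."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    names = set()
    for rfonts in root.iter("{%s}rFonts" % W_NS):
        # A run can name different fonts for different script ranges.
        for attr in ("ascii", "hAnsi", "cs", "eastAsia"):
            value = rfonts.get("{%s}%s" % (W_NS, attr))
            if value:
                names.add(value)
    return names
```

This only sees fonts applied directly to runs in the body. Fonts that come from styles live elsewhere in the archive (word/styles.xml), so treat the result as a starting point rather than a complete answer.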
"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco