hyphenation &c (was Re: C's syntax)

Alex Martelli aleaxit at yahoo.com
Wed Nov 1 08:46:34 EST 2000


"Boris Borcic" <borcis at geneva-link.ch> wrote in message
news:39FF1FDB.A84B1376 at geneva-link.ch...
> > Alex Martelli wrote:
> > >
> > > "hyphenation in Italian is best performed algorithmically,
> > > given the strong regularity of the language's syllabification
> > > rules; trying to adapt to Italian hyphenation algorithms
> > > designed for other languages, adjusting only a data table
> > > to account for the language, is vastly sub-optimal"
    [snip]
> your claim *still* is such that it implies something you might
> very well be wrong about : that a design can't exist, such that
> "data table flexibility" is adequate to encode "algorithmic
> syllabification rules" such that optimal treatment of italian
> requires.

Oh, I can and do encode algorithms as data structures, of
course.  For hyphenating Italian, for example, a couple of
simple lookup tables (one to classify letters into vowels,
semivowels, and 2 kinds of consonants; one to give the
'kind' of letter-to-letter transition for each pair of letters)
reduce the executable-code part of the algorithm to a few
lines (basically a simple FSM).  But it's still a language
specific approach: it produces a 'conservative hyphenator'
(in roughly the same sense in which Boehm's garbage
collection is 'conservative'...) with less than a kilobyte's
worth of data, plus a few lines of code.  But it doesn't
generalize to hyphenation rules for other languages; only
languages with very easy spelling-to-phonetics rules, and
hyphenation rules based on phonetics, may benefit (I guess
there may be other such languages besides Italian, but
English definitely isn't one).
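
A minimal sketch of the two-table idea described above, in Python.
Everything concrete here is an assumption made for illustration and is
not the tables from the post: the choice of liquids/nasals (l, m, n, r)
as the second kind of consonant, the four pair kinds, and all the names.
The scan over pair kinds plays the role of the simple FSM.  It is
deliberately conservative: it only proposes a break for V|CV and for
VC|CV where the consonant pair is doubled or starts with l/m/n/r, and it
never breaks between vowels or inside any other cluster.

```python
# Table 1: letter classes (assumed split; unknown letters get no class).
VOWELS = "aeoàèéìòù"
SEMIVOWELS = "iu"
LIQUIDS = "lmnr"          # assumed "second kind" of consonant
OTHERS = "bcdfgpqstvz"

LETTER_CLASS = {}
for letters, cls in ((VOWELS, "V"), (SEMIVOWELS, "S"),
                     (LIQUIDS, "L"), (OTHERS, "C")):
    for ch in letters:
        LETTER_CLASS[ch] = cls


def _pair_kind(a, b):
    """Classify the transition from letter a to letter b."""
    ca, cb = LETTER_CLASS[a], LETTER_CLASS[b]
    a_vowel, b_vowel = ca in "VS", cb in "VS"
    if a_vowel and not b_vowel:
        return "VC"
    if not a_vowel and b_vowel:
        return "CV"
    if not a_vowel and not b_vowel and (a == b or ca == "L"):
        return "SPLIT"    # doubled consonant, or l/m/n/r + consonant
    return "OTHER"        # vowel-vowel, or a cluster we will not touch


# Table 2: one entry per pair of known letters.
PAIR_KIND = {(a, b): _pair_kind(a, b)
             for a in LETTER_CLASS for b in LETTER_CLASS}


def hyphenation_points(word):
    """Return indices i such that word[:i] / word[i:] is a proposed break."""
    w = word.lower()
    kinds = [PAIR_KIND.get(pair, "OTHER") for pair in zip(w, w[1:])]
    points = []
    for j, kind in enumerate(kinds):
        prev = kinds[j - 1] if j > 0 else None
        nxt = kinds[j + 1] if j + 1 < len(kinds) else None
        if kind == "VC" and nxt == "CV":                        # V|CV
            points.append(j + 1)
        elif kind == "SPLIT" and prev == "VC" and nxt == "CV":  # VC|CV
            points.append(j + 1)
    return points


def hyphenate(word, sep="-"):
    """Insert sep at every proposed break point."""
    pieces, start = [], 0
    for p in hyphenation_points(word):
        pieces.append(word[start:p])
        start = p
    pieces.append(word[start:])
    return sep.join(pieces)
```

With these tables, hyphenate("gatto") gives "gat-to" and
hyphenate("casa") gives "ca-sa", while "pasta" comes back unbroken: the
valid break pa-sta is missed, which is the cheap kind of mistake, rather
than risking a break that would not be OK.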

That the results of such an approach are fully acceptable
for Italian is the part that somebody with no knowledge of
Italian would have to take 'on faith'.  The same goes, I
guess, for the fact that of the two possible kinds of
mistake (missing a syllable-break that would in fact be OK;
allowing a syllable-break that would NOT be OK), the second
has a FAR higher cost in terms of perceived quality of the
hyphenation in typical uses (such as optimal line-filling
in typographical layout packages).

That other ways to produce the same results cannot be as
effective has little to do with Italian and a lot to do with
other languages' hyphenation rules and idiosyncrasies.

Take, for example, TeX's approach to hyphenation (a good
starting point -- it's better than most hyphenation stuff
around, and, being open-source, one can easily examine
it) and consider what other algorithm/data-structure
approaches could possibly reproduce _those_ results in
cheaper or simpler ways.  For _that_ part, you just need
knowledge of _English_ hyphenation rules and general
computer-science skills.  For an introduction,
see, e.g., http://www.talo.nl/ (a company that
specialized in hyphenation software) and specifically
http://www.talo.nl/talo/hrules.html for a list of compact
collections of rules for various languages.
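
For readers who have not looked inside TeX: its hyphenator is driven by
a table of letter patterns with digits between the letters.  Every
pattern that matches somewhere in the word contributes its digits, the
largest digit wins at each gap, and an odd result allows a break.  The
following is a rough Python sketch of that matching step only.  The
nine patterns are the ones commonly used to illustrate the word
"hyphenation" and should be treated as illustrative, not as TeX's
actual pattern file; the function names are made up, and the minimum
fragment lengths and exception list that TeX also applies are left out.

```python
import re

# Illustrative patterns only (assumption: not a real TeX pattern file).
PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at",
            "1na", "n2at", "1tio", "2io", "o2n"]


def parse_pattern(pattern):
    """Split 'hy3ph' into ('hyph', [0, 0, 3, 0, 0]).

    values[k] is the digit standing before letters[k]; the last entry
    is the digit after the final letter.  Missing digits count as 0.
    """
    letters = re.sub(r"\d", "", pattern)
    values = [0] * (len(letters) + 1)
    k = 0
    for ch in pattern:
        if ch.isdigit():
            values[k] = int(ch)
        else:
            k += 1
    return letters, values


TABLE = dict(parse_pattern(p) for p in PATTERNS)


def pattern_break_points(word, table=TABLE):
    """Return indices i such that word[:i] / word[i:] is an allowed break."""
    work = "." + word.lower() + "."      # dots mark the word boundaries
    scores = [0] * (len(work) + 1)       # scores[k]: gap before work[k]
    for start in range(len(work)):
        for end in range(start + 1, len(work) + 1):
            values = table.get(work[start:end])
            if values:
                for offset, v in enumerate(values):
                    k = start + offset
                    scores[k] = max(scores[k], v)
    # Keep only gaps strictly inside the word; odd score means "break OK".
    return [k - 1 for k in range(2, len(work) - 1) if scores[k] % 2 == 1]
```

With this table, pattern_break_points("hyphenation") returns [2, 6],
i.e. hy-phen-ation.  Note that all the language knowledge sits in the
pattern list, while the code itself knows nothing about vowels or
syllables.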

A complex and general approach is no doubt needed
to cover hyphenation for any and all languages with
the _same_ executable-code -- pushing all the variation
down into data-tables.  Either the 'language' in which
those 'data' are expressed is an algorithmic language
powerful enough to let you encode the FSM-like approach
I sketched above, and then the question of optimality
becomes one of how convenient it is to code in that
language rather than in, say, C, Pascal, or Python; or,
you have to take a roundabout, more complex route -- in
which case, clearly, optimality is not there.

If the general package that needs 'hyphenation services'
for a generic natural language is able to 'shell out' in
some way to natural-language-dependent executable code,
rather than limiting the customization possibilities to
"data tables" (representing, at best, a probably
idiosyncratic and limited 'programming language'...),
the situation is much rosier.  For example, if the general
package embedded Python as an 'extension language',
and let you set some Python code to handle hyphenation
needs, you'd be all set:-).
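
A sketch of what such a hook might look like, assuming a host package
that keeps a registry of per-language callables.  All of this is
hypothetical: register_hyphenator, break_points, the "it-toy" language
tag and the toy hook itself are invented names for illustration, not
the API of any real package.  The toy hook only splits doubled
consonants between vowels, to keep the example short.

```python
from typing import Callable, Dict, List

# A hook maps a word to the indices where a break is allowed.
Hyphenator = Callable[[str], List[int]]

_HYPHENATORS: Dict[str, Hyphenator] = {}   # hypothetical host-side registry


def register_hyphenator(language: str, func: Hyphenator) -> None:
    """Let language-specific code plug itself into the host package."""
    _HYPHENATORS[language] = func


def break_points(word: str, language: str) -> List[int]:
    """Host-side entry point: delegate to the hook registered for language."""
    func = _HYPHENATORS.get(language)
    # No hook registered: propose no breaks at all (the safe default).
    return func(word) if func else []


def doubled_consonant_breaks(word: str) -> List[int]:
    """Toy hook: break only between doubled consonants flanked by vowels."""
    vowels = "aeiou"
    w = word.lower()
    return [i for i in range(2, len(w) - 1)
            if w[i - 1] == w[i] and w[i].isalpha() and w[i] not in vowels
            and w[i - 2] in vowels and w[i + 1] in vowels]


register_hyphenator("it-toy", doubled_consonant_breaks)
```

Here break_points("gatto", "it-toy") returns [3] (gat-to), and asking
for a language with no registered hook returns an empty list.  The host
needs to know nothing about the rules; it only calls whatever code was
plugged in.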

No real need to know anything about Italian specifically
to understand these issues, and/or debate them...


Alex





