split string with hieroglyphs

Sun Dec 24 00:04:44 EST 2006

On Sat, 23 Dec 2006 19:28:48 -0800, Belize wrote:

> Hi.
> Essence of problem in the following:
> Here is lines in utf8 of this form "BZ???TV%??DVD"
> Is it possible to split them into the fragments that contain only latin
> printable symbols (alphabet + "?#" etc)

Of course it is possible, but there probably isn't a built-in function to
do it. Write a program to do it.

> and fragments with the hieroglyphs, so it could be like this
> ['BZ?', '\xe3\x83\x84\xe3\x83\xbc\xe3\x83\xaa', 'TV%',
> '\xe3\x83\x84\xe3\x82\xad', 'DVD'] ?

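def split_fragments(s):
    """Split a string s into Latin and non-Latin fragments."""
    # Warning -- untested.
    # s should be a unicode string (e.g. s.decode('utf-8')); looping over
    # a UTF-8 byte string yields single bytes, not whole characters.
    fragments = []  # hold the string fragments
    latin = []  # temporary accumulator for Latin fragment
    nonlatin = []  # temporary accumulator for non-Latin fragment
    for c in s:
        if islatin(c):
            if nonlatin:
                fragments.append(''.join(nonlatin))
                nonlatin = []
            latin.append(c)
        else:
            if latin:
                fragments.append(''.join(latin))
                latin = []
            nonlatin.append(c)
    # Flush whichever accumulator still holds the final fragment.
    if latin:
        fragments.append(''.join(latin))
    if nonlatin:
        fragments.append(''.join(nonlatin))
    return fragments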

I leave it to you to write the function islatin.

Hints: 

There is a Perl module to guess the encoding:
http://search.cpan.org/~dankogai/Encode-2.18/lib/Encode/Guess.pm

You might like to read this too:
http://effbot.org/pyfaq/what-does-unicodeerror-ascii-decoding-encoding-error-ordinal-not-in-range-128-mean.htm

I also recommend you read this recipe:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/251871

And look at the module unicodedata.
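
As a starting point, here is a minimal sketch of islatin built on
unicodedata. It rests on two assumptions of mine, not anything from your
post: that the text has already been decoded from utf8 into a unicode
string, and that "Latin" means printable ASCII (which covers "?#%" and
friends) plus any character whose Unicode name starts with LATIN. Adjust
the definition to suit your data.

```python
import unicodedata

def islatin(c):
    """Return True if the single character c counts as 'Latin'.

    Assumption: 'Latin' means printable ASCII (letters, digits, space,
    punctuation such as ?, # and %) or any character whose Unicode name
    starts with LATIN, for example accented letters.
    """
    # Printable ASCII runs from space (0x20) to tilde (0x7e).
    if u' ' <= c <= u'~':
        return True
    # Unnamed characters (e.g. control codes) get '' and count as non-Latin.
    return unicodedata.name(c, '').startswith('LATIN')
```

With this definition the katakana characters in your example are
classified as non-Latin, while "B", "?" and "%" are Latin.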

> Then, after translate of hieroglyphs, necessary to join line, so it
> could be like this 
> "BZ? navigation TV% display DVD"

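def join_fragments(fragments):
    accumulator = []
    for fragment in fragments:
        # Each fragment is all Latin or all non-Latin, so testing its
        # first character is enough (islatin takes a single character).
        if islatin(fragment[0]):
            accumulator.append(fragment)
        else:
            accumulator.append(translate_hieroglyphics(fragment))
    # Join with spaces to get "BZ? navigation TV% display DVD".
    return ' '.join(accumulator)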

I leave it to you to write the function translate_hieroglyphics.
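
If a simple lookup is all you need, one possible shape is a dictionary
keyed on the decoded fragment. This is only a sketch: the table name
TRANSLATIONS is made up, its two entries are taken from the example in
your post, and returning an unknown fragment unchanged is my choice
rather than a requirement.

```python
# Hypothetical lookup table. The two entries come from the example in
# the question; a real table would need many more.
TRANSLATIONS = {
    u'\u30c4\u30fc\u30ea': u'navigation',  # '\xe3\x83\x84\xe3\x83\xbc\xe3\x83\xaa' in utf8
    u'\u30c4\u30ad': u'display',           # '\xe3\x83\x84\xe3\x82\xad' in utf8
}

def translate_hieroglyphics(fragment):
    """Return the translation of a unicode fragment.

    Fragments missing from the table are returned unchanged.
    """
    return TRANSLATIONS.get(fragment, fragment)
```

Remember to decode the utf8 bytes before the lookup, since the keys are
unicode strings.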

-- 
Steven.