split string with hieroglyphs
Steven D'Aprano
steve at REMOVE.THIS.cybersource.com.au
Sun Dec 24 00:04:44 EST 2006
On Sat, 23 Dec 2006 19:28:48 -0800, Belize wrote:
> Hi.
> Essence of problem in the following:
> Here is lines in utf8 of this form "BZ???TV%??DVD"
> Is it possible to split them into the fragments that contain only latin
> printable symbols (aplhabet + "?#" etc)
Of course it is possible, but there probably isn't a built-in function to
do it. Write a program to do it.
> and fragments with the hieroglyphs, so it could be like this
> ['BZ?', '\xe3\x83\x84\xe3\x83\xbc\xe3\x83\xaa', 'TV%',
> '\xe3\x83\x84\xe3\x82\xad', 'DVD'] ?
def split_fragments(s):
    """Split a string s into Latin and non-Latin fragments."""
    # Warning -- untested.
    fragments = []  # hold the string fragments
    latin = []  # temporary accumulator for Latin fragment
    nonlatin = []  # temporary accumulator for non-Latin fragment
    for c in s:
        if islatin(c):
            if nonlatin:
                fragments.append(''.join(nonlatin))
                nonlatin = []
            latin.append(c)
        else:
            if latin:
                fragments.append(''.join(latin))
                latin = []
            nonlatin.append(c)
    # Flush whichever accumulator still holds the final fragment.
    if latin:
        fragments.append(''.join(latin))
    if nonlatin:
        fragments.append(''.join(nonlatin))
    return fragments
I leave it to you to write the function islatin.
Hints:
There is a Perl module to guess the encoding:
http://search.cpan.org/~dankogai/Encode-2.18/lib/Encode/Guess.pm
You might like to read this too:
http://effbot.org/pyfaq/what-does-unicodeerror-ascii-decoding-encoding-error-ordinal-not-in-range-128-mean.htm
I also recommend you read this recipe:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/251871
And look at the module unicodedata.
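As a starting point, here is one possible sketch of islatin built on unicodedata. It assumes that "Latin" means printable ASCII plus any character whose Unicode name begins with "LATIN"; that definition is my guess at what is wanted, not something stated in the question. It also assumes the UTF-8 bytes have already been decoded to text, so that each c is one character rather than one byte.

```python
import unicodedata


def islatin(c):
    """Return True if the single character c counts as 'Latin'.

    Assumption: 'Latin' means printable ASCII (letters, digits, space and
    punctuation such as "?#%") or any character whose Unicode name starts
    with 'LATIN', e.g. accented letters.
    """
    if ' ' <= c <= '~':
        return True
    # The second argument is returned for characters that have no name,
    # so unnamed characters are simply treated as non-Latin.
    return unicodedata.name(c, '').startswith('LATIN')


# Decode the raw bytes first, then test character by character.
text = b'BZ?\xe3\x83\x84'.decode('utf-8')
flags = [islatin(c) for c in text]
```

You may want a narrower or wider definition (for example, excluding the space character), in which case only the first test needs to change.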
> Then, after translate of hieroglyphs, necessary to join line, so it
> could be like this
> "BZ? navigation TV% display DVD"
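def join_fragments(fragments):
    accumulator = []
    for fragment in fragments:
        # Each fragment is all Latin or all non-Latin, so test its first character.
        if islatin(fragment[0]):
            accumulator.append(fragment)
        else:
            accumulator.append(translate_hieroglyphics(fragment))
    # Join with spaces to match the requested output.
    return ' '.join(accumulator)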
I leave it to you to write the function translate_hieroglyphics.
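If a fixed vocabulary is enough, translate_hieroglyphics could be as simple as a dictionary lookup. The sketch below is self-contained and only illustrative: the TRANSLATIONS table is hypothetical, its two entries are inferred from the example in the question, and the helpers is_ascii_printable and translate_line are names I made up. It uses itertools.groupby to do the splitting in one pass, which is a different technique from the accumulator loop above.

```python
from itertools import groupby

# Hypothetical lookup table. The two entries are inferred from the example
# in the question; a real table would come from your own data.
TRANSLATIONS = {
    b'\xe3\x83\x84\xe3\x83\xbc\xe3\x83\xaa'.decode('utf-8'): 'navigation',
    b'\xe3\x83\x84\xe3\x82\xad'.decode('utf-8'): 'display',
}


def translate_hieroglyphics(fragment):
    """Look the fragment up, falling back to the untranslated text."""
    return TRANSLATIONS.get(fragment, fragment)


def is_ascii_printable(c):
    """Assumed stand-in for islatin: printable ASCII only."""
    return ' ' <= c <= '~'


def translate_line(text):
    """Split text into runs, translate the non-Latin runs, and rejoin."""
    parts = []
    # groupby yields consecutive runs of characters sharing the same key.
    for latin, chars in groupby(text, key=is_ascii_printable):
        fragment = ''.join(chars)
        parts.append(fragment if latin else translate_hieroglyphics(fragment))
    return ' '.join(parts)


raw = b'BZ?\xe3\x83\x84\xe3\x83\xbc\xe3\x83\xaaTV%\xe3\x83\x84\xe3\x82\xadDVD'
result = translate_line(raw.decode('utf-8'))
print(result)  # BZ? navigation TV% display DVD
```

Unknown fragments pass through unchanged here; raising an exception instead may be safer if silent gaps in the table would go unnoticed.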
--
Steven.