[python-uk] Tell us what you did with Python this year....

Edward Hartley ed.hartley at gmail.com
Mon Dec 20 21:09:59 CET 2010





On 20 Dec 2010, at 16:46, Tim Golden <mail at timgolden.me.uk> wrote:

> On 20/12/2010 16:08, Alec Battles wrote:
>> I still have no idea why tokenizing Hungarian text and tokenizing
>> German text are not fundamentally the same operation
> 
Those languages have different grammatical structures, with different inflexion and stemming rules, amongst other things.
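To make that concrete, here is a rough sketch (Python 3, untested against real corpora). Splitting text into words can be shared between the two languages, but the suffix rules cannot. The suffix lists and the names tokenize and naive_stem are my own illustrative assumptions and are nowhere near complete. A real stemmer needs far more than suffix stripping.

```python
import re

# Illustrative, deliberately incomplete suffix lists (assumptions for this
# sketch only): a few common German endings and a few Hungarian case/plural
# suffixes such as -ban/-ben ("in") and -nak/-nek (dative).
SUFFIXES = {
    "de": ("en", "er", "e", "n", "s"),
    "hu": ("ban", "ben", "nak", "nek", "val", "vel", "ok", "ek"),
}


def tokenize(text):
    """Split text into lowercase word tokens; \\w is Unicode-aware for str."""
    return re.findall(r"\w+", text.lower())


def naive_stem(word, lang, min_stem=3):
    """Strip the longest matching suffix for lang, keeping at least min_stem chars."""
    for suffix in sorted(SUFFIXES[lang], key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


hungarian_tokens = tokenize("A házban vagyok.")
german_tokens = tokenize("Mit den Kindern.")
```

With these toy rules, naive_stem("házban", "hu") gives "ház", whereas applying the German list to the same word gives "házba". The tokenizing step is identical, but what you do with the tokens afterwards depends on the language.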
HTH
> I have no idea why they're not:
> 
> <code - untested>
> import codecs
> 
> with codecs.open ("german.txt", "rb", encoding="utf8") as f:
>  german_text = f.read ()
> 
> with codecs.open ("hungarian.txt", "rb", encoding="utf8") as f:
>  hungarian_text = f.read ()
> 
> # do_stuff_with (german_text)
> # do_stuff_with (hungarian_text)
> 
> </code>
> 
> Of course, I'm assuming that you know what encoding has been
> used to serialise the text, but if you don't then it's not
> Python's fault ;)
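On the encoding point, a small sketch of what goes wrong when the guess is off. This assumes Python 3, where the built-in open takes an encoding argument. The helper name read_text and the sample file are made up for the example.

```python
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a whole text file, decoding it with the given encoding."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


# A Hungarian test phrase whose accented letters all exist in ISO-8859-2.
sample = "Árvíztűrő tükörfúrógép"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hungarian.txt")
    # Serialise the text as ISO-8859-2 (Latin-2), not UTF-8.
    with open(path, "w", encoding="iso-8859-2") as f:
        f.write(sample)

    # Reading with the encoding that was actually used round-trips the text.
    right = read_text(path, encoding="iso-8859-2")

    # Guessing UTF-8 fails, because the Latin-2 bytes are not valid UTF-8.
    try:
        read_text(path)
        wrong_guess_failed = False
    except UnicodeDecodeError:
        wrong_guess_failed = True
```

Here the wrong guess fails loudly. Other wrong guesses (Latin-1, say) decode without complaint and quietly hand you the wrong characters, which is worse.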
> 
> TJG
> _______________________________________________
> python-uk mailing list
> python-uk at python.org
> http://mail.python.org/mailman/listinfo/python-uk

