From JoyceUlysses.txt -- words occurring exactly once

dieter.maurer at online.de dieter.maurer at online.de
Tue Jun 4 12:13:47 EDT 2024


Edward Teach wrote at 2024-6-3 10:47 +0100:
> ...
>The Gutenburg Project publishes "plain text".  That's another problem,
>because "plain text" means UTF-8....and that means unicode...and that
>means running some sort of unicode-to-ascii conversion in order to get
>something like "words".  A couple of hours....a couple of hundred lines
>of C....problem solved!

Unicode supports the notion of a "word" even better than ASCII does.
For example, the `\w` (word character) regular expression wildcard
works for Unicode just as it does for ASCII (of course with a much
larger set of letters, digits, etc.)
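
A minimal sketch of this, assuming Python 3, where `\w` in a `str`
pattern is Unicode-aware by default. The function name
`words_occurring_once` and the sample text are only illustrations,
not something from the original thread:

```python
import re
from collections import Counter


def words_occurring_once(text):
    """Return the sorted list of words that occur exactly once in text."""
    # \w+ matches runs of Unicode letters, digits and underscores,
    # so accented words stay in one piece without any ASCII conversion.
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)
    return sorted(word for word, n in counts.items() if n == 1)


if __name__ == "__main__":
    sample = "Caf\u00e9 d\u00e9j\u00e0 vu, caf\u00e9 na\u00efve"
    print(words_occurring_once(sample))

    # For comparison: with re.ASCII, \w only knows [a-zA-Z0-9_],
    # so a non-ASCII letter splits the word.
    print(re.findall(r"\w+", "Stra\u00dfe", re.ASCII))
```

Note that `\w+` treats apostrophes and hyphens as separators, so
"don't" is counted as "don" and "t"; a real word count would likely
need a slightly richer pattern.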

