From JoyceUlysses.txt -- words occurring exactly once
Chris Angelico
rosuav at gmail.com
Tue Jun 4 18:02:26 EDT 2024
On Wed, 5 Jun 2024 at 02:49, Edward Teach via Python-list
<python-list at python.org> wrote:
>
> On Mon, 03 Jun 2024 14:58:26 -0400 (EDT)
> Grant Edwards <grant.b.edwards at gmail.com> wrote:
>
> > On 2024-06-03, Edward Teach via Python-list <python-list at python.org>
> > wrote:
> >
> > > The Gutenberg Project publishes "plain text". That's another
> > > problem, because "plain text" means UTF-8....and that means
> > > unicode...and that means running some sort of unicode-to-ascii
> > > conversion in order to get something like "words". A couple of
> > > hours....a couple of hundred lines of C....problem solved!
> >
> > I'm curious. Why does it need to be converted from Unicode to ASCII?
> >
> > When you read it into Python, it gets converted right back to
> > Unicode...
> >
>
> Well.....when using the file linux.words as a useful master list of
> "words".....linux.words is strict ASCII........
>
Whatever gave you that idea? I have a large number of dictionaries in
/usr/share/dict, all of them encoded UTF-8 except one (and I don't
know why that is). Even the English ones aren't entirely ASCII.
There is no need to "convert from Unicode to ASCII", which makes no sense.
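
As a rough illustration of working in Unicode throughout, here is a minimal
sketch, not a definitive implementation. The function names, the word regex,
and the file paths mentioned in the comments are assumptions made for the
example; the only thing it relies on is that both the book and the word list
are opened with encoding="utf-8".

```python
import re
import unicodedata
from collections import Counter

# Assumption: a "word" is a run of Unicode letters, optionally joined by
# apostrophes (don't, o'clock). Digits and underscores are excluded.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def read_text(path):
    """Read a file as UTF-8; Python hands back a Unicode str."""
    with open(path, encoding="utf-8") as f:
        return f.read()


def load_wordlist(path):
    """Load a one-word-per-line dictionary file into a casefolded set."""
    return {line.strip().casefold()
            for line in read_text(path).splitlines()
            if line.strip()}


def words_occurring_once(text, known_words=None):
    """Return the sorted words that appear exactly once in text.

    If known_words is given, only words in that set are reported.
    """
    # NFC so that precomposed and combining forms of "é" compare equal.
    text = unicodedata.normalize("NFC", text)
    counts = Counter(w.casefold() for w in WORD_RE.findall(text))
    once = (w for w, n in counts.items() if n == 1)
    if known_words is not None:
        once = (w for w in once if w in known_words)
    return sorted(once)


# Hypothetical usage (paths are examples only):
#   known = load_wordlist("/usr/share/dict/words")
#   print(words_occurring_once(read_text("JoyceUlysses.txt"), known))
```

Non-ASCII words such as "café" or "naïve" are counted like any others, and
they match the dictionary as long as both files are decoded the same way.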
ChrisA