From JoyceUlysses.txt -- words occurring exactly once
Edward Teach
hackbeard at linuxmail.org
Mon Jun 3 05:47:42 EDT 2024
On Sat, 1 Jun 2024 13:34:11 -0600
Mats Wichmann <mats at wichmann.us> wrote:
> On 5/31/24 11:59, Dieter Maurer via Python-list wrote:
>
> hmmm, I "sent" this but there was some problem and it remained
> unsent. Just in case it hasn't All Been Said Already, here's the
> retry:
>
> > HenHanna wrote at 2024-5-30 13:03 -0700:
> >>
> >> Given a text file of a novel (JoyceUlysses.txt) ...
> >>
> >> could someone give me a pretty fast (and simple) Python program
> >> that'd give me a list of all words occurring exactly once?
> >
> > Your task can be split into several subtasks:
> > * parse the text into words
> >
> > This depends on your notion of "word".
> > In the simplest case, a word is any maximal sequence of
> > non-whitespace characters. In this case, you can use `split` for
> > this task
>
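A bare-bones sketch of that split-based approach (assuming the file is
named JoyceUlysses.txt) might look like this; note that punctuation
stays attached to the "words":

from collections import Counter

# naive tokenization: whitespace-separated chunks, case-folded
with open("JoyceUlysses.txt", "r", encoding="utf-8") as f:
    counts = Counter(f.read().lower().split())

# tokens that occur exactly once
once = sorted(word for word, count in counts.items() if count == 1)
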
> This piece is by far "the hard part", because of the ambiguity. For
> example, if I just say non-whitespace, then a word with trailing
> punctuation attached counts as a distinct word. What about
> hyphenation - of which there are both the compound-word forms and the
> ones at the ends of lines if the source text has been formatted that
> way? Are all-lowercase words different from the same word starting
> with a capital? What about non-initial capitals, as happens a fair
> bit in modern usage with acronyms, trademarks (perhaps not in
> Ulysses? :-) ), etc.? What about accented letters?
>
> If you want what's at least a quick starting point to play with, you
> could use a very simple regex - a fair amount of thought has gone
> into what a "word character" is (\w), so it deals with excluding both
> punctuation and whitespace.
>
> import re
> from collections import Counter
>
> with open("JoyceUlysses/txt", "r") as f:
> wordcount = Counter(re.findall(r'\w+', f.read().lower()))
>
> Now you have a Counter object counting all the "words" with their
> occurrence counts (by this definition) in the document. You can fish
> through that to answer the questions asked (find entries with a count
> of 1, 2, 3, etc.)
>
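For the specific question asked, pulling the count-1 entries out of
that Counter might look something like this (using the wordcount
object from the snippet above):

once = sorted(word for word, count in wordcount.items() if count == 1)
print(len(once), "words occur exactly once")
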
> Some people Go Big and use something that actually tries to recognize
> the language, as opposed to making assumptions from ranges of
> characters. nltk is a choice there. But at this point it's not
> really "simple" any longer (though nltk experts might end up
> disagreeing with that).
>
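If someone does reach for nltk, a rough sketch might look like the
following (this assumes nltk is installed and its tokenizer data has
been downloaded, e.g. via nltk.download("punkt")):

from collections import Counter
import nltk

with open("JoyceUlysses.txt", "r", encoding="utf-8") as f:
    tokens = nltk.word_tokenize(f.read().lower())

# keep alphabetic tokens only, dropping punctuation tokens
counts = Counter(t for t in tokens if t.isalpha())
once = sorted(w for w, c in counts.items() if c == 1)
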
>
The Gutenberg Project publishes "plain text". That's another problem,
because "plain text" means UTF-8... and that means Unicode... and that
means running some sort of Unicode-to-ASCII conversion in order to get
something like "words". A couple of hours... a couple of hundred lines
of C... problem solved!
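
For what it's worth, that sort of crude Unicode-to-ASCII folding can
also be done in a few lines of Python with the standard unicodedata
module. A sketch, assuming it's acceptable to simply drop anything
that doesn't decompose to an ASCII character:

import unicodedata

def to_ascii(text):
    # decompose accented characters (NFKD), then drop whatever
    # still isn't plain ASCII
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")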