From JoyceUlysses.txt -- words occurring exactly once

Sat Jun 1 15:34:11 EDT 2024

On 5/31/24 11:59, Dieter Maurer via Python-list wrote:

hmmm, I "sent" this but there was some problem and it remained unsent. 
Just in case it hasn't All Been Said Already, here's the retry:

> HenHanna wrote at 2024-5-30 13:03 -0700:
>>
>> Given a text file of a novel (JoyceUlysses.txt) ...
>>
>> could someone give me a pretty fast (and simple) Python program that'd
>> give me a list of all words occurring exactly once?
> 
> Your task can be split into several subtasks:
>   * parse the text into words
> 
>     This depends on your notion of "word".
>     In the simplest case, a word is any maximal sequence of non-whitespace
>     characters. In this case, you can use `split` for this task

This piece is by far "the hard part", because of the ambiguity. For 
example, if I just say non-whitespace, then words followed by 
punctuation get counted as distinct words. What about hyphenation - 
there are both the compound word forms and the ones at the end of lines, 
if the source text has been formatted that way?  Are all-lowercase words 
different from the same word starting with a capital?  What about 
non-initial capitals, as happens a fair bit in modern usage with 
acronyms, trademarks (perhaps not in Ulysses? :-) ), etc.?  What about 
accented letters?
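
To make the punctuation and capitalization problem concrete, here is a 
small sketch. The sample sentence is made up for illustration (it is not 
read from the file), and stripping ASCII punctuation plus lowercasing is 
just one possible normalization, not the "right" one:

```python
import string

# Made-up sample sentence for illustration, not read from the file.
text = "yes I said yes I will Yes."

# Naive approach: every maximal run of non-whitespace is a "word".
naive = text.split()

# One possible normalization (an assumption, not the only choice):
# strip leading/trailing ASCII punctuation and fold to lowercase.
normalized = [t.strip(string.punctuation).lower() for t in naive]

# With plain split(), "yes" and "Yes." are different words.
print(sorted(set(naive)))
# After normalizing, they collapse into one.
print(sorted(set(normalized)))
```

Note that this still does nothing about hyphenation or about non-ASCII 
punctuation such as curly quotes.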

If you want what's at least a quick starting point to play with, you 
could use a very simple regex - a fair amount of thought has gone into 
what a "word character" is (\w), so it deals with excluding both 
punctuation and whitespace.

import re
from collections import Counter

# Lowercase everything so "The" and "the" count as the same word.
with open("JoyceUlysses.txt", "r") as f:
    wordcount = Counter(re.findall(r'\w+', f.read().lower()))

Now you have a Counter object counting all the "words" with their 
occurrence counts (by this definition) in the document. You can fish 
through that to answer the questions asked (find entries with a count of 
1, 2, 3, etc.)
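
For instance, fishing out the once-only words might look something like 
this. It's only a sketch: `sample` is a short stand-in string so the 
snippet runs without the file, and the name `once` is just my choice. 
With the real text you'd build `wordcount` from f.read() instead.

```python
import re
from collections import Counter

# Short stand-in for the contents of the file (an assumption for the demo).
sample = "Stately, plump Buck Mulligan came from the stairhead. Buck came."

# Same "word" definition as before: runs of \w characters, lowercased.
wordcount = Counter(re.findall(r"\w+", sample.lower()))

# Keep only the words whose count is exactly 1, sorted for readability.
once = sorted(word for word, n in wordcount.items() if n == 1)

print(once)
```

Changing the `n == 1` test gets you the words occurring exactly twice, 
three times, and so on.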

Some people Go Big and use something that actually tries to recognize 
the language, as opposed to making assumptions from ranges of 
characters.  nltk is a choice there.  But at this point it's not really 
"simple" any longer (though nltk experts might end up disagreeing with 
that).