From JoyceUlysses.txt -- words occurring exactly once
Edward Teach
hackbeard at linuxmail.org
Mon Jun 3 05:47:42 EDT 2024
On Sat, 1 Jun 2024 13:34:11 -0600
Mats Wichmann <mats at wichmann.us> wrote:
> On 5/31/24 11:59, Dieter Maurer via Python-list wrote:
>
> hmmm, I "sent" this but there was some problem and it remained
> unsent. Just in case it hasn't All Been Said Already, here's the
> retry:
>
> > HenHanna wrote at 2024-5-30 13:03 -0700:
> >>
> >> Given a text file of a novel (JoyceUlysses.txt) ...
> >>
> >> could someone give me a pretty fast (and simple) Python program
> >> that'd give me a list of all words occurring exactly once?
> >
> > Your task can be split into several subtasks:
> > * parse the text into words
> >
> > This depends on your notion of "word".
> > In the simplest case, a word is any maximal sequence of
> > non-whitespace characters. In this case, you can use `split` for
> > this task
>
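A bare-bones sketch of that split-based approach (assuming the file is
named JoyceUlysses.txt) might look like this; note that punctuation
stays attached to the "words":

from collections import Counter

# naive tokenization: whitespace-separated chunks, case-folded
with open("JoyceUlysses.txt", "r", encoding="utf-8") as f:
    counts = Counter(f.read().lower().split())

# tokens that occur exactly once
once = sorted(word for word, count in counts.items() if count == 1)
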
> This piece is by far "the hard part", because of the ambiguity. For
> example, if I just say non-whitespace, then a word with trailing
> punctuation attached counts as a distinct word. What about
> hyphenation - of which there are both the compound-word forms and the
> ones at the ends of lines if the source text has been formatted that
> way? Are all-lowercase words different from the same word starting
> with a capital? What about non-initial capitals, as happens a fair
> bit in modern usage with acronyms, trademarks (perhaps not in
> Ulysses? :-) ), etc.? What about accented letters?
>
> If you want what's at least a quick starting point to play with, you
> could use a very simple regex - a fair amount of thought has gone
> into what a "word character" is (\w), so it deals with excluding both
> punctuation and whitespace.
>
> import re
> from collections import Counter
>
> with open("JoyceUlysses/txt", "r") as f:
> wordcount = Counter(re.findall(r'\w+', f.read().lower()))
>
> Now you have a Counter object counting all the "words" with their
> occurrence counts (by this definition) in the document. You can fish
> through that to answer the questions asked (find entries with a count
> of 1, 2, 3, etc.)
>
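For the specific question asked, pulling the count-1 entries out of
that Counter might look something like this (using the wordcount
object from the snippet above):

once = sorted(word for word, count in wordcount.items() if count == 1)
print(len(once), "words occur exactly once")
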
> Some people Go Big and use something that actually tries to recognize
> the language, as opposed to making assumptions from ranges of
> characters. nltk is a choice there. But at this point it's not
> really "simple" any longer (though nltk experts might end up
> disagreeing with that).
>
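If someone does reach for nltk, a rough sketch might look like the
following (this assumes nltk is installed and its tokenizer data has
been downloaded, e.g. via nltk.download("punkt")):

from collections import Counter
import nltk

with open("JoyceUlysses.txt", "r", encoding="utf-8") as f:
    tokens = nltk.word_tokenize(f.read().lower())

# keep alphabetic tokens only, dropping punctuation tokens
counts = Counter(t for t in tokens if t.isalpha())
once = sorted(w for w, c in counts.items() if c == 1)
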
>
The Gutenberg Project publishes "plain text". That's another problem,
because "plain text" means UTF-8... and that means Unicode... and that
means running some sort of Unicode-to-ASCII conversion in order to get
something like "words". A couple of hours... a couple of hundred lines
of C... problem solved!
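
For what it's worth, that sort of crude Unicode-to-ASCII folding can
also be done in a few lines of Python with the standard unicodedata
module. A sketch, assuming it's acceptable to simply drop anything
that doesn't decompose to an ASCII character:

import unicodedata

def to_ascii(text):
    # decompose accented characters (NFKD), then drop whatever
    # still isn't plain ASCII
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")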