[Tutor] Segmenting bash help docs
Julius Hamilton
juliushamilton100 at gmail.com
Sat Dec 25 15:46:02 EST 2021
Hey,
I am pretty close to pulling this off and I thought I’d reach out to fill
in a few gaps.
I can imagine writing a Bash function which takes the help page and
displays it sentence by sentence.
If we consider code for doing this step by step:
$ help wait > wait.txt
$ python3
>>> import re, spacy
>>> f = open("wait.txt").read()
>>> sen = spacy.load("en_core_web_sm")(f)
The sentences are pretty well segmented. There is a thread that fascinates
me about separating on an even more fine-grained level (
https://stackoverflow.com/questions/65227103/clause-extraction-long-sentence-segmentation-in-python),
but I won’t venture there yet.
What I can do is first split the current list on any newline regions
greater than one newline.
>>> sen = [s.text.strip() for s in sen.sents]
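The line above only strips each sentence; the split on runs of more than one newline is a separate step. A minimal sketch of that step, assuming the raw help text is already in a plain string (the helper name split_blocks and the sample text are made up for illustration, not taken from the session above):

```python
import re

def split_blocks(text):
    """Split text wherever two or more newlines occur in a row."""
    # Hypothetical helper: drops pieces that are empty or whitespace-only.
    return [block for block in re.split(r"\n{2,}", text) if block.strip()]

# Made-up sample loosely shaped like a help page, with a blank-line gap.
sample = (
    "wait: wait [-n] [id ...]\n"
    "    Wait for job completion.\n"
    "\n\n"
    "    Exit Status:\n"
    "    Returns the status of ID."
)
blocks = split_blocks(sample)
```

Each entry in blocks still contains its single newlines, so broken lines inside a block are left for the whitespace step that follows.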
Then I can compress broken sentences into single lines by replacing
extended whitespace with a single whitespace:
>>> sen = [re.sub(r"\s+", " ", s) for s in sen]
Not sure if anybody has any opinions on this, or ways they’d improve it.
It’d be cool to find a single simple, elegant way to clean, strip and
conjoin the sentences/lines in one line of code, and also not to use the
“re” module if not necessary.
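One possible way to do the strip and the whitespace collapse together without "re" is str.split() with no arguments, which splits on any run of whitespace and ignores leading and trailing whitespace. A rough sketch on plain strings (the function name tidy and the sample strings are invented for illustration; with spaCy sentence spans it would be s.text.split() instead of s.split()):

```python
def tidy(sentences):
    """Strip each string and collapse internal whitespace runs to one space."""
    # split() with no arguments breaks on any whitespace, including newlines;
    # joining with a single space rebuilds the sentence on one line.
    # Strings that are empty or whitespace-only are skipped.
    return [" ".join(s.split()) for s in sentences if s.strip()]

# Made-up examples of sentences broken across lines.
raw = [
    "  Waits for each process\n    identified by an ID.  ",
    "\n\nExit   Status:\n",
    "   \n ",
]
clean = tidy(raw)
```

This keeps the cleanup to one list comprehension, at the cost of treating every kind of whitespace the same way.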
Thanks,
Julius