[Tutor] Segmenting bash help docs

Julius Hamilton juliushamilton100 at gmail.com
Sat Dec 25 15:46:02 EST 2021


Hey,

I am pretty close to pulling this off and I thought I’d reach out to fill
in a few gaps.

I can imagine writing a Bash function which takes the help page and
displays it sentence by sentence.

If we consider code for doing this step by step:

$ help wait > wait.txt

$ python3

>>> import re, spacy

>>> f = open("wait.txt").read()

>>> sen = spacy.load("en_core_web_sm")(f)

The sentences are pretty well segmented. There is a thread that fascinates
me about separating on an even more fine-grained level (
https://stackoverflow.com/questions/65227103/clause-extraction-long-sentence-segmentation-in-python),
but I won’t venture there yet.

What I can do is first split the current list on any newline regions
greater than one newline.

>>> sen = [s.text.strip() for s in sen.sents]
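
The line above only strips each sentence. Splitting on regions of more than one newline, as described, could be sketched roughly like this. The function name split_paragraphs and the sample text are made up for illustration, and this version works on the raw text with the standard library only, not on the spaCy output:

```python
from itertools import groupby


def split_paragraphs(text):
    """Split text into blocks separated by one or more blank lines."""
    lines = text.splitlines()
    # Group consecutive lines by whether they contain any non-whitespace.
    groups = groupby(lines, key=lambda line: bool(line.strip()))
    # Keep only the non-blank groups and rejoin each one into a block.
    return ["\n".join(block).strip() for nonblank, block in groups if nonblank]


# Hypothetical sample, loosely shaped like `help wait` output.
sample = (
    "wait: wait [-n] [id ...]\n"
    "    Wait for job completion.\n"
    "\n"
    "\n"
    "    Waits for each process.\n"
)
print(split_paragraphs(sample))
```

Each block still contains its internal line breaks, so the whitespace compression step is still needed afterwards.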

Then I can compress broken sentences into single lines by replacing
each run of whitespace with a single space:

>>> sen = [re.sub(r"\s+", " ", s) for s in sen]

Not sure if anybody has any opinions on this, or ways they’d improve it.
It’d be cool to find a single simple, elegant way to clean, strip and
conjoin the sentences/lines in one line of code, and also not to use the
“re” module if not necessary.
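
One possible way to do the cleaning without "re", as a minimal sketch: str.split() with no arguments splits on any run of whitespace (spaces, tabs, newlines) and discards leading and trailing whitespace, so joining the pieces with a single space strips and compresses in one step. The function name clean_sentences and the sample list are hypothetical:

```python
def clean_sentences(sentences):
    """Collapse whitespace runs to single spaces and drop empty entries."""
    # s.split() splits on any whitespace run; " ".join(...) rebuilds one line.
    return [" ".join(s.split()) for s in sentences if s.strip()]


# Hypothetical sample standing in for the sentence strings.
raw = [
    "  Waits for each process identified by an ID,\n    which may be a process ID.  ",
    "\n\n",
    "Returns the status\tof the last ID.",
]
print(clean_sentences(raw))
```

Applied to the spaCy result, the same idea would presumably read [" ".join(s.text.split()) for s in doc.sents], where doc is the object returned by the loaded pipeline.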

Thanks,
Julius

