Can't match str/unicode

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sun Jan 8 01:17:41 EST 2017


On Sunday 08 January 2017 15:33, CM wrote:

> On Saturday, January 7, 2017 at 7:59:01 PM UTC-5, Steve D'Aprano wrote:
[...]
>> Start by printing repr(candidate_text) and see what you really have.
> 
> Yes, that did it. The repr of that one was, in fact:
> 
> u'match /r'

Are you sure it is a forward-slash /r rather than backslash \r?
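
The difference matters, because the two spellings are different strings. A quick sketch to compare them (the variable names here are made up for illustration, not taken from your code):

```python
# Illustrative names only; neither string comes from the real document.
with_slash = u"match /r"   # space, forward slash, letter r
with_cr = u"match \r"      # space, then a single carriage return character

print(repr(with_slash))
print(len(with_slash))     # the "/r" counts as two characters

print(repr(with_cr))
print(len(with_cr))        # the "\r" counts as one character
```

If the repr shows a backslash, you have one control character at the end, not a slash followed by the letter r.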


> Thanks, that did it. Do you happen to know why that /r was appended to the
> unicode object in this case?

*shrug* 

You're the only one that can answer that, because only you know where the text 
came from. The code you showed:

candidate_text = Paragraph.Range.Text.encode('utf-8')

is a mystery, because we don't know what Paragraph is or where it comes from, 
or what .Range.Text does.

You mention "scraping a Word docx", but we have no idea how you're scraping it. 
If I had to guess, I'd guess:

- you actually mean \r rather than /r;

- paragraphs in Word docs always end with a carriage return \r;

- and whoever typed the paragraph accidentally hit the spacebar after typing 
the word "match".


But it's just a guess. For all I know, the software you are using to scrape the 
docx file inserts space/r after everything.
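
If the guess is right, stripping trailing whitespace before comparing would probably be enough. A minimal sketch, assuming the text really ends in a space and a \r, with a hard-coded string standing in for whatever Paragraph.Range.Text returns:

```python
# Hypothetical stand-in for the scraped paragraph text.
candidate_text = u"match \r"

# rstrip() with no argument removes trailing whitespace,
# which covers both the space and the carriage return.
cleaned = candidate_text.rstrip()

print(repr(cleaned))
print(cleaned == u"match")
```

Check the repr of the real text first, though; if the trailing characters turn out to be something other than whitespace, rstrip() won't touch them.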



-- 
Steven
"Ever since I learned about confirmation bias, I've been seeing 
it everywhere." - Jon Ronson


