Can't match str/unicode
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Sun Jan 8 01:17:41 EST 2017
On Sunday 08 January 2017 15:33, CM wrote:
> On Saturday, January 7, 2017 at 7:59:01 PM UTC-5, Steve D'Aprano wrote:
[...]
>> Start by printing repr(candidate_text) and see what you really have.
>
> Yes, that did it. The repr of that one was, in fact:
>
> u'match /r'
Are you sure it is a forward-slash /r rather than backslash \r?
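A quick way to tell the two apart, as a minimal sketch (the sample strings are made up for illustration; they are not the real text from your document):

```python
# Hypothetical sample values, not taken from the actual docx.
with_slash = u'match /r'   # a space, a forward slash, then the letter r
with_cr = u'match \r'      # a space followed by a single carriage return

for text in (with_slash, with_cr):
    # repr() displays a carriage return as the escape sequence \r,
    # while a literal forward slash and "r" are shown as-is.
    print(repr(text), len(text))
```

The lengths differ because \r is one character, whereas /r is two.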
> Thanks, that did it. Do you happen to know why that /r was appended to the
> unicode object in this case?
*shrug*
You're the only one that can answer that, because only you know where the text
came from. The code you showed:
    candidate_text = Paragraph.Range.Text.encode('utf-8')
is a mystery, because we don't know what Paragraph is or where it comes from,
or what .Range.Text does.
You mention "scraping a Word docx", but we have no idea how you're scraping it.
If I had to guess, I'd guess:
- you actually mean \r rather than /r;
- paragraphs in Word docs always end with a carriage return \r;
- and whoever typed the paragraph accidentally hit the spacebar after typing
the word "match".
But it's just a guess. For all I know, the software you are using to scrape the
docx file inserts space/r after everything.
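If that guess is right, stripping trailing whitespace before comparing should be enough. A rough sketch, assuming the text really does end in a space plus \r (the value below is a stand-in, since I don't know what your Paragraph.Range.Text actually returns):

```python
# Hypothetical value standing in for the scraped paragraph text.
candidate_text = u'match \r'

# strip() with no arguments removes leading and trailing whitespace,
# which covers both the stray space and the carriage return.
cleaned = candidate_text.strip()

print(repr(cleaned))
print(cleaned == u'match')
```

If you only want to drop the paragraph terminator and keep other whitespace, rstrip(u'\r') is the narrower choice.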
--
Steven
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." - Jon Ronson