replace nbsp with space
I use the following code to replace nbsp with space. Is it the best way to do so in lxml? Thanks.

    from lxml import html

    doc = html.parse(sys.stdin, parser=html.HTMLParser(encoding='utf-8'))
    for x in doc.iter():
        if x.text is not None:
            x.text = x.text.replace(u'\xa0', ' ')
        if x.tail is not None:
            x.tail = x.tail.replace(u'\xa0', ' ')
    sys.stdout.write(html.tostring(doc).encode('utf-8'))

--
Regards,
Peng
Peng Yu wrote on 22.08.19 at 21:22:
> I use the following code to replace nbsp with space. Is it the best way to do so in lxml? Thanks.
>
>     from lxml import html
>
>     doc = html.parse(sys.stdin, parser=html.HTMLParser(encoding='utf-8'))
>     for x in doc.iter():
>         if x.text is not None:
>             x.text = x.text.replace(u'\xa0', ' ')
>         if x.tail is not None:
>             x.tail = x.tail.replace(u'\xa0', ' ')
Looks good to me, although this slightly uglier variant should be faster:

    for el in doc.iter():
        text = el.text
        if text and '\xa0' in text:
            el.text = text.replace('\xa0', ' ')
        tail = el.tail
        if tail and '\xa0' in tail:
            el.tail = tail.replace('\xa0', ' ')
> sys.stdout.write(html.tostring(doc).encode('utf-8'))

This should read

    html.tostring(doc, encoding='utf-8')

You are probably using Python 2. Python 3 would have caught this bug.

Stefan
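For readers who want to try the faster loop in isolation, here is a minimal Python 3 sketch. The helper name `replace_nbsp` is hypothetical and not from the thread. It relies only on `.iter()`, `.text` and `.tail`, so the demo uses the standard library's `xml.etree.ElementTree` as a stand-in to stay dependency-free; the assumption is that an lxml tree, which exposes the same three members, can be passed in the same way.

```python
import xml.etree.ElementTree as ET

NBSP = "\xa0"  # U+00A0 NO-BREAK SPACE, what &nbsp; becomes after parsing


def replace_nbsp(root):
    """Replace no-break spaces with plain spaces in all text and tail strings.

    Hypothetical helper: assumes `root` offers .iter() and that each
    element has .text and .tail attributes (ElementTree-style API).
    """
    for el in root.iter():
        text = el.text
        # Only assign when there is something to change.
        if text and NBSP in text:
            el.text = text.replace(NBSP, " ")
        tail = el.tail
        if tail and NBSP in tail:
            el.tail = tail.replace(NBSP, " ")
    return root


# Stand-in document: one NBSP in the element text, one in the tail of <br/>.
root = ET.fromstring("<p>a\xa0b<br/>c\xa0d</p>")
replace_nbsp(root)
```

The membership test before each `replace` means elements without a no-break space are never written to, which is presumably where the speedup in the variant above comes from.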
> > sys.stdout.write(html.tostring(doc).encode('utf-8'))
>
> This should read
>
>     html.tostring(doc, encoding='utf-8')
>
> You are probably using Python 2. Python 3 would have caught this bug.
In Python 3, it seems that the above code should be changed to this instead. Is that right?

    sys.stdout.buffer.write(html.tostring(doc, encoding = 'utf-8'))

Otherwise, I get this error.

    Traceback (most recent call last):
      File "../htmlnbsp2spc0.py", line 22, in <module>
        sys.stdout.write(html.tostring(doc, encoding = 'utf-8'))
    TypeError: write() argument must be str, not bytes

--
Regards,
Peng
Peng Yu wrote on 24.08.19 at 15:49:
> > > sys.stdout.write(html.tostring(doc).encode('utf-8'))
> >
> > This should read
> >
> >     html.tostring(doc, encoding='utf-8')
> >
> > You are probably using Python 2. Python 3 would have caught this bug.
>
> In python3, it seems that the above code should be changed to this instead. Is it?
>
>     sys.stdout.buffer.write(html.tostring(doc, encoding = 'utf-8'))
Exactly.

Stefan
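To illustrate why the `.buffer` form is needed in Python 3, here is a small sketch that reproduces the `TypeError` from the traceback without touching the real `sys.stdout`. It assumes an `io.TextIOWrapper` around an `io.BytesIO` behaves like `sys.stdout` for this purpose; `fake_stdout`, `payload` and `write_bytes` are made-up names, and `payload` stands in for the bytes that `html.tostring(doc, encoding='utf-8')` returns.

```python
import io


def write_bytes(stream, data):
    """Write bytes through the binary buffer underneath a text stream.

    Hypothetical helper: assumes `stream` has a .buffer attribute,
    as sys.stdout normally does in Python 3.
    """
    stream.flush()            # push out any pending text first
    stream.buffer.write(data)
    stream.buffer.flush()


# Stand-in for sys.stdout: a text layer on top of an in-memory byte buffer.
raw = io.BytesIO()
fake_stdout = io.TextIOWrapper(raw, encoding="utf-8")

# Stand-in for the result of html.tostring(doc, encoding='utf-8').
payload = "<p>a b</p>".encode("utf-8")

# The text layer only accepts str, so writing bytes fails as in the traceback.
try:
    fake_stdout.write(payload)
    text_write_failed = False
except TypeError:
    text_write_failed = True

# Writing to the underlying binary buffer works.
write_bytes(fake_stdout, payload)
written = raw.getvalue()
```

In a real script the same call would be `sys.stdout.buffer.write(...)`, as confirmed above.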
participants (2)
- Peng Yu
- Stefan Behnel