encoding problem with BeautifulSoup - problem when writing parsed text to file

jmfauth wxjmfauth at gmail.com
Thu Oct 6 13:41:57 EDT 2011


On 6 oct, 06:39, Greg <gregor.hochsch... at googlemail.com> wrote:
> Brilliant! It worked. Thanks!
>
> Here is the final code for those who are struggling with similar
> problems:
>
> ## open and decode file
> # In this case, the encoding comes from the charset argument in a meta
> # tag, e.g. <meta charset="iso-8859-2">
> fileObj = open(filePath,"r").read()
> fileContent = fileObj.decode("iso-8859-2")
> fileSoup = BeautifulSoup(fileContent)
>
> ## Do some BeautifulSoup magic and preserve unicode; presume the result
> ## is saved in 'text'
>
> ## write extracted text to file
> f = open(outFilePath, 'w')
> f.write(text.encode('utf-8'))
> f.close()
>



Or, using io.open, which works on both Python 2 and Python 3:

>>> import io
>>> with io.open('abc.txt', 'r', encoding='iso-8859-2') as f:
...     r = f.read()
...
>>> r
u'a\nb\nc\n'
>>> with io.open('def.txt', 'w', encoding='utf-8-sig') as f:
...     t = f.write(r)
...
>>> f.closed
True
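
The session above can be wrapped into a small helper. This is only a sketch: the name transcode and its parameters are made up for illustration, and the default encodings simply mirror the ones discussed in this thread.

```python
import io


def transcode(src_path, dst_path, src_encoding="iso-8859-2", dst_encoding="utf-8"):
    """Read src_path as src_encoding and rewrite its text as dst_encoding.

    Hypothetical helper, not part of any library. io.open decodes on read
    and encodes on write, so no manual .decode()/.encode() calls are needed.
    """
    with io.open(src_path, "r", encoding=src_encoding) as src:
        text = src.read()  # unicode on Python 2, str on Python 3
    with io.open(dst_path, "w", encoding=dst_encoding) as dst:
        dst.write(text)
    return text
```

Passing dst_encoding="utf-8-sig", as in the session above, makes the output file start with a UTF-8 byte order mark; plain "utf-8" writes none.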

jmf



