[Tutor] encode question!

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Tue Sep 27 08:58:26 CEST 2005

On Tue, 27 Sep 2005, 铁石 wrote:

>     I am trying to write a script that extracts jpg files
>  from an html file I had downloaded. I encountered a problem with
>  a Big5 charset html file. Big5 is used in Hong Kong and Taiwan.
>     In this html file there's a jpg named "xvg_h%202.jpg" in vi, the tag
> of the image is <img src="...../xvg_h%25202.jpg>, and when python read it
> out, it was "xvg_h%2525202.jpg". When I tried to test the size of this
> image, python reported that the file "xvg_h%2525202.jpg" doesn't exist.

It'll help if you show us your program so far, and point us to an example
file in big5 format.

I do not suspect that an encoding issue is in play here; the file name you
are showing us looks regular enough that there may be some other issue.
There are any number of reasons why programs don't work: I'd rather remove
the ambiguity by seeing real code.
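
One thing that does stand out in the three names quoted above is that each
one carries one more "25" than the last, and "%25" is how a literal "%" is
written in URL percent-encoding. This is only a guess until we see the real
program, but here is a minimal sketch of peeling those layers off one at a
time. It assumes a modern Python 3 (urllib.parse) rather than the Python 2
used elsewhere in this message, and unquote_steps is a made-up helper name:

```python
from urllib.parse import unquote


def unquote_steps(name, max_rounds=5):
    """Percent-decode `name` repeatedly, recording every intermediate form.

    Stops when another round of decoding no longer changes the string,
    or after `max_rounds` rounds, whichever comes first.
    """
    steps = [name]
    for _ in range(max_rounds):
        decoded = unquote(steps[-1])
        if decoded == steps[-1]:
            break
        steps.append(decoded)
    return steps


# The name that python reported as missing, taken from the question.
steps = unquote_steps("xvg_h%2525202.jpg")
for step in steps:
    print(repr(step))
```

If the file on disk matches one of the intermediate forms, that would
suggest the name is being percent-encoded once too often somewhere between
the html and the file system, which has nothing to do with big5.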

Let me check something quickly...

>>> import codecs
>>> from StringIO import StringIO
>>> sampleText = "xvg_h%202.jpg"
>>> sampleFile = StringIO(sampleText)
>>> translatedFile = codecs.EncodedFile(sampleFile, "big5")
>>> translatedFile.readline()

As far as I can tell, the filename that you're showing us, 'xvg_h%202.jpg'
doesn't have characters that trigger an alternative interpretation in
big5.  I'm unfamiliar enough with big5, though, that I could be mistaken.
Please point us to a sample text file with big5, and we can do more
realistic tests on this end.
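
A more direct way to check the same thing is to compare the bytes. The
sketch below assumes Python 3, where str.encode and bytes.decode take the
codec name, and the variable names are only for illustration. It shows that
a plain-ASCII file name comes out byte-for-byte the same under big5, while
actual Chinese text does not:

```python
name = "xvg_h%202.jpg"

# Encode the same file name under both codecs and compare the raw bytes.
as_ascii = name.encode("ascii")
as_big5 = name.encode("big5")
same_bytes = (as_ascii == as_big5)

# Decoding the big5 bytes should give back the original name unchanged.
round_trip = as_big5.decode("big5")

# For contrast: each of these two Chinese characters takes two bytes in big5.
chinese_len = len("中文".encode("big5"))

print(same_bytes, round_trip == name, chinese_len)
```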

Let me try another experiment with a big5-encoded file and a third-party
html parser called 'BeautifulSoup':

>>> import urllib
>>> import BeautifulSoup
>>> f = urllib.urlopen('http://chinese.yahoo.com')
>>> soup = BeautifulSoup.BeautifulSoup(f)
>>> images = soup('img')
>>> for img in images:
...     print img['src']

That looks sorta ok, although I know I'm completely ignoring big5 issues.
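
For what it's worth, here is a rough sketch of the same idea that does not
ignore the encoding: decode the bytes as big5 first, then pull out the img
tags. To keep it self-contained it uses the standard library's
html.parser from Python 3 instead of BeautifulSoup, and it works on a small
made-up page rather than a real one. ImgSrcCollector, img_sources and the
sample html are all assumptions for illustration, not your actual data:

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for key, value in attrs:
                if key == "src" and value is not None:
                    self.sources.append(value)


def img_sources(raw_bytes, encoding="big5"):
    """Decode raw html bytes with the given codec and list the img sources."""
    text = raw_bytes.decode(encoding, errors="replace")
    parser = ImgSrcCollector()
    parser.feed(text)
    parser.close()
    return parser.sources


# A made-up page containing Chinese text, stored as big5 bytes.
sample_html = (
    '<html><body><p>中文</p>'
    '<img src="pics/xvg_h%25202.jpg">'
    '<img src="a.jpg" alt="中文">'
    '</body></html>'
)
raw = sample_html.encode("big5")

sources = img_sources(raw)
print(sources)
```

Note that the parser hands back each src exactly as it is written in the
html; it does not percent-decode anything on its own.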

Give us a file to work on, and we'll see what we can do to help you parse
it.  Good luck!
