[Tutor] encode question!

Tue Sep 27 08:58:26 CEST 2005

On Tue, 27 Sep 2005, 铁石 wrote:

>     I am trying to write a script that extracts jpg files
>  from an html file I had downloaded. I encountered a problem with
>  a Big5 charset html file. Big5 is used in Hong Kong and Taiwan.
>     In this html file there's a jpg named "xvg_h%202.jpg" in vi; the tag
> of the image is <img src="...../xvg_h%25202.jpg>, and when Python reads it
> out, it is "xvg_h%2525202.jpg". When I try to test the size of this
> image, Python reports that the file "xvg_h%2525202.jpg" doesn't exist.

It'll help if you show us your program so far, and point us to an example
file in big5 format.

I don't suspect that an encoding issue is in play here; the file name you
are showing us looks regular enough that there may be some other issue.
There are any number of reasons why programs don't work: I'd rather remove
the ambiguity by seeing real code.
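
Here is one guess at what that "other issue" might be, and it is only a
guess until we see your script.  The three names in your message differ
from one another in exactly the way that one extra round of URL
percent-encoding would make them differ.  The sketch below is in current
Python 3 syntax (in Python 2 the same functions are urllib.quote and
urllib.unquote), and the three strings are copied from your message;
nothing here comes from your actual program.

```python
from urllib.parse import quote, unquote

# The three names, copied from the question.
name_on_disk = "xvg_h%202.jpg"        # the name as seen in vi
name_in_tag = "xvg_h%25202.jpg"       # the value inside the <img> tag
name_reported = "xvg_h%2525202.jpg"   # the name the script printed

# Each round of quoting turns every '%' into '%25'.
print(quote(name_on_disk) == name_in_tag)
print(quote(name_in_tag) == name_reported)

# One round of unquoting takes the tag value back to the name on disk.
print(unquote(name_in_tag) == name_on_disk)
```

If all three lines print True on your machine, it might be worth checking
whether your script quotes the src value somewhere when it ought to be
unquoting it.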

Let me check something quickly...

######
>>> import codecs
>>> from StringIO import StringIO
>>> sampleText = "xvg_h%202.jpg"
>>> sampleFile = StringIO(sampleText)
>>> translatedFile = codecs.EncodedFile(sampleFile, "big5")
>>> translatedFile.readline()
'xvg_h%202.jpg'
######

As far as I can tell, the filename that you're showing us, 'xvg_h%202.jpg',
doesn't have characters that trigger an alternative interpretation in
big5.  I'm unfamiliar enough with big5, though, that I could be mistaken.
Please point us to a sample text file with big5, and we can do more
realistic tests on this end.
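
A more direct version of the same check, as a small sketch in Python 3
syntax: encode the file name as ASCII and as big5 and compare the bytes.
It relies only on the "big5" codec that ships with Python, and the file
name is the one from your message.

```python
name = "xvg_h%202.jpg"

# Encode the same text both ways and compare the raw bytes.
as_ascii = name.encode("ascii")
as_big5 = name.encode("big5")
print(as_ascii == as_big5)

# Decoding the big5 bytes should give back the original text.
print(as_big5.decode("big5") == name)
```

If both comparisons come out True, then big5 is not changing this
particular name, and the extra '%25' pieces have to be coming from
somewhere else.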

Let me try another experiment with a big5-encoded file and a third-party
html parser called 'BeautifulSoup':

######
>>> import urllib
>>> import BeautifulSoup
>>> f = urllib.urlopen('http://chinese.yahoo.com')
>>> soup = BeautifulSoup.BeautifulSoup(f)
>>> images = soup('img')
>>> for img in images:
...     print img['src']
...
http://us.i1.yimg.com/us.yimg.com/i/b5/home/m6v1.gif
http://hk.yimg.com/i/search/mglass.gif
http://us.i1.yimg.com/us.yimg.com/i/hk/new2.gif
http://hk.yimg.com/i/icon/16/3.gif
http://us.i1.yimg.com/us.yimg.com/i/hk/spc.gif
http://hk.yimg.com/i/home/tabbt.gif
http://hk.yimg.com/i/home/tabp.gif
http://hk.yimg.com/i/home/tabp.gif
http://hk.yimg.com/i/home/tabp.gif
######

That looks sorta ok, although I know I'm completely ignoring big5 issues.
*grin*
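
If you'd rather not ignore the encoding, here is a rough sketch that
decodes the bytes as big5 first and only then looks for img tags.  It uses
Python 3's standard html.parser module instead of BeautifulSoup.  The class
name ImgSrcCollector, the sample page, and the "pics/" path are all made up
for illustration; they are not from your file.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "img":
            for key, value in attrs:
                if key == "src" and value is not None:
                    self.sources.append(value)


# A made-up page, encoded as big5 to stand in for a downloaded file.
page = '<html><body><p>\u4e2d\u6587</p><img src="pics/xvg_h%202.jpg"></body></html>'
raw_bytes = page.encode("big5")

# Decode explicitly before parsing, so the parser only ever sees text.
text = raw_bytes.decode("big5")

collector = ImgSrcCollector()
collector.feed(text)
collector.close()
print(collector.sources)
```

For a real file you would read it in binary mode and decode those bytes the
same way.  Note that the parser hands back the src value as written in the
page, percent signs and all.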

Give us a file to work on, and we'll see what we can do to help you parse
it.  Good luck!