[Tutor] encode question!

ZIYAD A. M. AL-BATLY zamb at saudi.net.sa
Tue Sep 27 09:08:05 CEST 2005

On Tue, 2005-09-27 at 14:01 +0800, 铁石 wrote: 
>     I am trying to write a script that extracts jpg files
>  from an HTML file I downloaded. I ran into a problem with
>  a Big5 charset HTML file. Big5 is used in Hong Kong and Taiwan.
>     In this HTML file there's a jpg named "xvg_h%202.jpg".
> In vi, the tag of the image is <img src="...../xvg_h%25202.jpg>,
> and when Python read it out, it was "xvg_h%2525202.jpg".
> When I try to test the size of this image, Python reports that
> the file "xvg_h%2525202.jpg" doesn't exist.
>      I think %25 means the char "%", so the "%2525" was equal to
> the "%25" in the HTML and "%" in the file name listed in the shell.
> I have no idea about this encoding! Does the HTML tag use Big5 too?
> Why should '%' be coded as Big5? I thought it was an ASCII char before
> today!
>      So, how can I read this file name correctly?
>         winglion1 at 163.com
>           2005-09-27
This has nothing to do with Big5 encoding!  This is how URLs are sent in
HTTP requests.  As an example: a space character " " becomes "%20".
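
A small sketch of that round trip, assuming Python 3, where these
functions live in "urllib.parse" rather than directly in "urllib":

```python
from urllib.parse import quote, unquote

# Percent-encoding replaces an unsafe character with "%" plus two hex digits.
encoded_space = quote(" ")
encoded_name = quote("xvg_h 2.jpg")

# unquote() reverses exactly one round of encoding.
decoded_name = unquote(encoded_name)

print(encoded_space)  # %20
print(encoded_name)   # xvg_h%202.jpg
print(decoded_name)   # xvg_h 2.jpg
```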

In your example above the file is probably named "xvg_h 2.jpg".
So, what's going on?  When you viewed the image in your browser (or the
application you used to download it, even if you were using a Python
script) the request was for something like "xvg_h%202.jpg",
which translates to "xvg_h 2.jpg", which is right.  However, when you
saved the HTML file along with the image, the application interpreted
the "%20" as the character "percent sign" followed by the characters
"20".  It also encoded them as such (just like when sending HTTP
requests), turning the "%" into "%25", and that resulted in
"xvg_h%25202.jpg"!
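
To see how the names pile up, here is a rough sketch (Python 3 again)
that encodes the guessed original name several times in a row.  The
starting name "xvg_h 2.jpg" is only my assumption about what the file
was called, and which program did each round of encoding is a guess
too:

```python
from urllib.parse import quote

original = "xvg_h 2.jpg"   # assumed original file name

once = quote(original)     # the space becomes "%20"
twice = quote(once)        # the "%" itself becomes "%25"
thrice = quote(twice)      # and once more: "%25" becomes "%2525"

print(once)    # xvg_h%202.jpg
print(twice)   # xvg_h%25202.jpg
print(thrice)  # xvg_h%2525202.jpg
```

Each round only touches the "%" signs, so the three results match the
three spellings quoted in the question.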

How to fix this?  Use "unquote()" from the "urllib" module twice!

        >>> import urllib
        >>> urllib.unquote(urllib.unquote('xvg_h%25202.jpg'))
        'xvg_h 2.jpg'
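
If you are not sure how many times a name was encoded, one possible
approach is to keep decoding until nothing changes.  This is only a
sketch for Python 3; "fully_unquote" and "max_rounds" are names I made
up, not part of any library.  Be careful: it will also decode a file
whose real name contains something like "%20" on purpose.

```python
from urllib.parse import unquote


def fully_unquote(name, max_rounds=10):
    """Hypothetical helper: unquote repeatedly until the string is stable."""
    for _ in range(max_rounds):
        decoded = unquote(name)
        if decoded == name:
            # Nothing left to decode.
            break
        name = decoded
    return name


print(fully_unquote("xvg_h%25202.jpg"))    # xvg_h 2.jpg
print(fully_unquote("xvg_h%2525202.jpg"))  # xvg_h 2.jpg
```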

I hope this is the right explanation and that it will work for you.  If
anyone has a better opinion, please don't be shy and help us all.
