[Python-ideas] Support data: URLs in urllib

Mathias Panzenböck grosser.meister.morti at gmx.net
Sat Nov 3 00:47:41 CET 2012


On 10/31/2012 08:54 AM, Paul Moore wrote:
> On Wednesday, 31 October 2012, Mathias Panzenböck wrote:
>
>     Sometimes it would be handy to read data:-urls just like any other url. While it is pretty easy
>     to parse a data: url yourself I think it would be nice if urllib could do this for you.
>
>     Example data url parser:
>
> [...]
>
> IIUC, this should be possible with a custom opener. While it might be nice to have this in the
> stdlib, it would also be a really useful recipe to have in the docs, showing how to create and
> install a simple custom opener into the default set of openers (so that urllib.request gains the
> ability to handle data URLs automatically). Would you be willing to submit a doc patch to cover this?
>
> Paul


Ok, I wrote something here:
https://gist.github.com/4004353

I wrote two versions: one that just returns an io.BytesIO, and one that returns a DataResponse 
(derived from io.BytesIO) that has a few properties/methods like HTTPResponse (msg, headers, length, 
getheader and getheaders) plus an additional mediatype.
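
For anyone who doesn't want to click through to the gist, the simple io.BytesIO version boils down 
to roughly the following sketch. The class name DataHandler and the parsing details here are only 
illustrative and may differ from the gist; only the <scheme>_open convention is dictated by 
urllib.request:

    import base64
    import io
    import urllib.parse
    import urllib.request

    class DataHandler(urllib.request.BaseHandler):
        # OpenerDirector dispatches "data:" requests to a method named data_open
        def data_open(self, req):
            # data:[<mediatype>][;base64],<data>
            header, sep, data = req.get_full_url()[len("data:"):].partition(",")
            if not sep:
                raise ValueError("malformed data URL: no ',' found")
            if header.endswith(";base64"):
                body = base64.b64decode(data.encode("ascii"))
            else:
                body = urllib.parse.unquote_to_bytes(data)
            return io.BytesIO(body)

    opener = urllib.request.build_opener(DataHandler())
    urllib.request.install_opener(opener)
    print(urllib.request.urlopen("data:;base64,SGVsbG8=").read())  # b'Hello'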

I also added two examples: one that writes the binary data to stdout (with stdout reopened as "wb") 
and one that reads the text data in the declared encoding (this requires the version with the 
DataResponse) and writes it to stdout as a string.
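
Roughly, the two examples amount to something like this (illustrative only: it assumes a data: 
handler like the one above is installed, uses sys.stdout.buffer instead of reopening stdout as "wb", 
and the mediatype attribute only exists on the DataResponse variant):

    import sys
    import urllib.request

    # binary example: copy the decoded bytes straight to stdout
    with urllib.request.urlopen("data:;base64,SGVsbG8gd29ybGQ=") as resp:
        sys.stdout.buffer.write(resp.read())

    # text example: honour the charset declared in the mediatype
    # (RFC 2397 says the default is US-ASCII)
    with urllib.request.urlopen("data:text/plain;charset=utf-8,Gr%C3%BC%C3%9Fe") as resp:
        charset = "US-ASCII"
        for param in resp.mediatype.split(";")[1:]:
            key, _, value = param.partition("=")
            if key.strip().lower() == "charset":
                charset = value.strip()
        print(resp.read().decode(charset))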

Which version do you think is best for the recipe? I guess losing the mediatype (and thus the 
charset) is not so good, so the version with the DataResponse is better? Maybe with a note that if 
you don't need the mediatype you can simply return an io.BytesIO instead? How does one submit a doc 
patch anyway? Is there an hg repo for the documentation and a web interface through which one can 
submit a pull request?

Note:
My handling of malformed data URLs is strict: e.g. missing padding characters at the end of the URL 
raise an exception. Browsers like Firefox and Chrome correct the padding (Chrome only if the padding 
is completely missing; Firefox corrects/ignores any garbage at the end). I could correct the padding 
as well, but I'd rather not perform such magic.
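
(If one did want to emulate Chrome here, it would only take a small helper along these lines; this 
is purely illustrative and not part of the gist:)

    import base64

    def lenient_b64decode(payload):
        # re-append any missing "=" padding before decoding, which is
        # roughly what Chrome does when the padding is completely absent
        return base64.b64decode(payload + b"=" * (-len(payload) % 4))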

RFC 4648[1] (Base64 Data Encoding) states that specifications referring to it have to explicitly 
state if there are characters that can be ignored or if the padding is not required. RFC 2397[2] 
(the data URL scheme) states no such thing, but it doesn't refer to RFC 4648 either (it was written 
before RFC 4648). Chrome and Firefox ignore any kind of whitespace in data URLs. I think that is a 
good idea, because it lets you wrap long data URLs in image tags. binascii.a2b_base64 ignores 
whitespace anyway, so I don't have to do anything there.
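
For example:

    >>> import binascii
    >>> binascii.a2b_base64(b"SGVs\nbG8=")   # embedded whitespace is ignored
    b'Hello'
    >>> binascii.a2b_base64(b"SGVsbG8")      # missing padding still raises
    Traceback (most recent call last):
      ...
    binascii.Error: Incorrect padding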

Firefox and Chrome both allow %-encoding of base64 characters like "/", "+" and "=". The data URL 
RFC doesn't mention that this should work, but I think one can assume as much.
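
For example, percent-decoding the payload before base64-decoding it makes such URLs work:

    >>> import base64, urllib.parse
    >>> base64.b64decode(urllib.parse.unquote_to_bytes("SGVsbG8%3D"))
    b'Hello'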

Also note that a minimal base64 data URL is "data:;base64," and not "data:base64," (note the ";"). 
The latter would specify the (illegal) MIME type "base64", not a base64 encoding. This is handled 
correctly by my example code.


	-panzi

[1] http://tools.ietf.org/html/rfc4648#section-3
[2] http://tools.ietf.org/html/rfc2397


