On 10/31/2012 08:54 AM, Paul Moore wrote:
On Wednesday, 31 October 2012, Mathias Panzenböck wrote:
Sometimes it would be handy to read data: URLs just like any other URL. While it is pretty easy to parse a data: URL yourself, I think it would be nice if urllib could do this for you. Example data URL parser:
IIUC, this should be possible with a custom opener. While it might be nice to have this in the stdlib, it would also be a really useful recipe to have in the docs, showing how to create and install a simple custom opener into the default set of openers (so that urllib.request gains the ability to handle data URLs automatically). Would you be willing to submit a doc patch to cover this?
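A minimal sketch of what such a custom opener could look like, assuming Python 3's urllib.request. The class name SketchDataHandler is hypothetical and this is not the code from the gist mentioned in this thread; it only illustrates the handler mechanism (a method named `<scheme>_open` on a BaseHandler subclass) and skips most validation.

```python
import base64
import io
import urllib.error
import urllib.parse
import urllib.request


class SketchDataHandler(urllib.request.BaseHandler):
    """Hypothetical handler for data: URLs (illustration only)."""

    # Lower numbers are tried first; this puts the handler ahead of
    # any other handler registered for the same scheme.
    handler_order = 100

    def data_open(self, req):
        # A data URL looks like: data:[<mediatype>][;base64],<data>
        header, sep, payload = req.full_url[len("data:"):].partition(",")
        if not sep:
            raise urllib.error.URLError("data URL has no comma")
        raw = urllib.parse.unquote_to_bytes(payload)
        if header.endswith(";base64"):
            raw = base64.b64decode(raw)
        # Simplest possible response object: the media type is lost here.
        return io.BytesIO(raw)


# Build an opener that knows the handler and use it directly ...
opener = urllib.request.build_opener(SketchDataHandler)
body = opener.open("data:text/plain;base64,SGVsbG8=").read()

# ... or install it so that plain urlopen() uses it as well.
urllib.request.install_opener(opener)
plain = urllib.request.urlopen("data:,hi").read()
```

Note that install_opener changes process-wide state, so a recipe would probably want to mention opener.open() as the less intrusive option.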
Ok, I wrote something here: https://gist.github.com/4004353
I wrote two versions: one that just returns an io.BytesIO, and one that returns a DataResponse (derived from io.BytesIO) that has a few properties/methods like HTTPResponse (msg, headers, length, getheader and getheaders) and also an additional mediatype.
I also added two examples: one that writes the binary data it reads to stdout (stdout reopened as "wb"), and one that reads the text data in the defined encoding (requires the version with the DataResponse) and writes it to stdout as a string.
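To make the difference between the two versions concrete, here is a rough sketch of a response class along those lines. The name DataResponseSketch and its exact attributes are assumptions for illustration, not the gist's code; it implements only a subset (headers, length, getheader, mediatype) and uses email.message.Message to hold the headers.

```python
import io
from email.message import Message


class DataResponseSketch(io.BytesIO):
    """Hypothetical BytesIO subclass that remembers the media type."""

    def __init__(self, data, mediatype):
        super().__init__(data)
        self.mediatype = mediatype
        self.length = len(data)
        self.headers = Message()
        self.headers["Content-Type"] = mediatype
        self.headers["Content-Length"] = str(len(data))

    def getheader(self, name, default=None):
        return self.headers.get(name, default)


# Decode text using the charset carried in the media type, falling back
# to US-ASCII, which is the default charset for data URLs (RFC 2397).
resp = DataResponseSketch("h\xe9llo".encode("utf-8"), "text/plain;charset=utf-8")
charset = resp.headers.get_content_charset("us-ascii")
text = resp.read().decode(charset)
```

With a bare io.BytesIO the charset lookup above has nothing to work with, which is the argument for the DataResponse variant.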
Which version do you think is the best for the recipe? I guess losing the mediatype (and thus the charset) is not so good, therefore the version with the DataResponse is better? Maybe with a note that if you don't need the mediatype you can simply return an io.BytesIO as well? How does one submit a doc patch anyway? Is there a hg repo for the documentation and a web interface through which one can submit a pull request?
Note: Handling of buggy data URLs is buggy. E.g. missing padding characters at the end of the URL raise an exception. Browsers like Firefox and Chrome correct the padding (Chrome only if the padding is completely missing, Firefox corrects/ignores any garbage at the end). I could correct the padding as well, but I'd rather not perform such magic.
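For illustration, this is roughly what the strict behaviour and the kind of padding "magic" described above look like with binascii. Both function names are hypothetical, and the lenient variant is only a sketch of what a browser-like correction might do for the simple case of missing "=" characters; it is not something the example code does.

```python
import binascii


def decode_strict(payload: bytes) -> bytes:
    """Decode as-is; missing padding raises binascii.Error."""
    return binascii.a2b_base64(payload)


def decode_lenient(payload: bytes) -> bytes:
    """Append the missing '=' characters before decoding (assumes the
    payload contains no whitespace or other ignorable characters)."""
    return binascii.a2b_base64(payload + b"=" * (-len(payload) % 4))


try:
    decode_strict(b"SGVsbG8")      # one '=' short
    strict_failed = False
except binascii.Error:
    strict_failed = True

fixed = decode_lenient(b"SGVsbG8")
```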
RFC 4648 (Base64 Data Encoding) states that specifications referring to it have to explicitly state if there are characters that can be ignored or if the padding is not required. RFC 2397 (data URL scheme) does not state any such thing, but it doesn't specifically refer to RFC 4648 either (as it was written before RFC 4648). Chrome and Firefox ignore any kind of white space in data URLs. I think that is a good idea, because it lets you wrap long data URLs in image tags. binascii.a2b_base64 ignores white space anyway, so I don't have to do anything there.
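A small check of the white space behaviour, assuming binascii.a2b_base64 is called in its default (non-strict) mode; the payload is just an example string wrapped over several lines the way one might wrap a long data URL in markup.

```python
import binascii

# "Hello, world" in base64, broken up with newlines and indentation.
wrapped = b"SGVs\n  bG8s\n  IHdv\n  cmxk"
decoded = binascii.a2b_base64(wrapped)
```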
Firefox and Chrome both allow %-encoding of base64 characters like "/", "+" and "=". That this should work is not mentioned in the data URL RFC, but I think one can assume as much.
Also note that a minimal base64 data URL is "data:;base64," and not "data:base64," (note the ";"). The latter would specify the (illegal) mime type "base64" and not a base64 encoding. This is handled correctly by my example code.
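The last two points can be sketched as a small parser. The function name parse_data_url and its return shape are assumptions made for this illustration, not the example code referred to above; it percent-decodes the payload before base64-decoding it and only treats a trailing ";base64" in the header as the encoding marker.

```python
import binascii
import urllib.parse


def parse_data_url(url):
    """Return (mediatype, is_base64, data) for a data: URL (sketch)."""
    scheme, sep, rest = url.partition(":")
    if not sep or scheme.lower() != "data":
        raise ValueError("not a data URL")
    header, sep, payload = rest.partition(",")
    if not sep:
        raise ValueError("data URL has no comma")
    # Only a trailing ";base64" selects base64; a bare "base64" header
    # is read as a (bogus) media type.
    is_base64 = header.endswith(";base64")
    mediatype = header[:-len(";base64")] if is_base64 else header
    # Undo %-encoding first, so "%3D" becomes "=" before base64 decoding.
    data = urllib.parse.unquote_to_bytes(payload)
    if is_base64:
        data = binascii.a2b_base64(data)
    # RFC 2397 default when no media type is given.
    return mediatype or "text/plain;charset=US-ASCII", is_base64, data


minimal = parse_data_url("data:;base64,")
bogus = parse_data_url("data:base64,SGk=")
escaped = parse_data_url("data:;base64,SGk%3D")
```

Here `bogus` comes back with the media type "base64" and the payload left undecoded, which matches the distinction drawn above.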