Sometimes it would be handy to read data: URLs just like any other URL. While it is pretty easy to parse a data: URL yourself, I think it would be nice if urllib could do this for you. Example data URL parser:
import base64
import urllib.parse

def read_data_url(url):
    # split off the scheme; maxsplit=1 so a ":" inside the data survives
    scheme, data = url.split(":", 1)
    assert scheme == "data", "unsupported scheme: " + scheme
    # the (possibly empty) mediatype is separated from the payload by ","
    mimetype, data = data.split(",", 1)
    if mimetype.endswith(";base64"):
        return mimetype[:-7] or None, base64.b64decode(data.encode("UTF-8"))
    else:
        # decode percent-encoding straight to bytes so binary payloads survive
        return mimetype or None, urllib.parse.unquote_to_bytes(data)
See also: http://tools.ietf.org/html/rfc2397

-panzi
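For illustration, a hypothetical interactive session with this parser (the example URLs are illustrative, not from the original mail):

>>> read_data_url("data:text/plain;base64,SGVsbG8sIFdvcmxkIQ==")
('text/plain', b'Hello, World!')
>>> read_data_url("data:,A%20brief%20note")
(None, b'A brief note')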
On Wednesday, 31 October 2012, Mathias Panzenböck wrote:
Sometimes it would be handy to read data: URLs just like any other URL. While it is pretty easy to parse a data: URL yourself, I think it would be nice if urllib could do this for you.
Example data URL parser:
[...]

IIUC, this should be possible with a custom opener. While it might be nice to have this in the stdlib, it would also be a really useful recipe to have in the docs, showing how to create and install a simple custom opener into the default set of openers (so that urllib.request gains the ability to handle data URLs automatically). Would you be willing to submit a doc patch to cover this?

Paul
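A minimal sketch of the custom-opener approach Paul describes, reusing the read_data_url parser from the first mail (the DataHandler name and the bare io.BytesIO return are assumptions; nothing like this existed in the stdlib at the time):

import io
import urllib.request

class DataHandler(urllib.request.BaseHandler):
    # urllib.request looks for a <scheme>_open method on each handler
    def data_open(self, req):
        mimetype, data = read_data_url(req.full_url)
        # a bare BytesIO is file-like but has no info()/geturl()
        return io.BytesIO(data)

# install the handler into the default set of openers, so that plain
# urllib.request.urlopen() can handle data: URLs
urllib.request.install_opener(urllib.request.build_opener(DataHandler()))

After this, urlopen("data:,hello") would return a readable object, though without the response metadata discussed later in the thread.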
On 10/31/2012 08:54 AM, Paul Moore wrote:
On Wednesday, 31 October 2012, Mathias Panzenböck wrote:
Sometimes it would be handy to read data: URLs just like any other URL. While it is pretty easy to parse a data: URL yourself, I think it would be nice if urllib could do this for you.
Example data URL parser:
[...]
IIUC, this should be possible with a custom opener. While it might be nice to have this in the stdlib, it would also be a really useful recipe to have in the docs, showing how to create and install a simple custom opener into the default set of openers (so that urllib.request gains the ability to handle data URLs automatically). Would you be willing to submit a doc patch to cover this?
Paul
Ok, I wrote something here: https://gist.github.com/4004353

I wrote two versions. One just returns an io.BytesIO, and one returns a DataResponse (derived from io.BytesIO) that has a few properties/methods like HTTPResponse: msg, headers, length, getheader and getheaders, and also an additional mediatype. I also added two examples: one that writes the binary data read to stdout (stdout reopened as "wb"), and one that reads the text data in the declared encoding (requires the version with the DataResponse) and writes it to stdout as a string.

Which version do you think is best for the recipe? I guess losing the mediatype (and thus the charset) is not so good, so the version with the DataResponse is better? Maybe with a note that if you don't need the mediatype you can simply return an io.BytesIO as well?

How does one submit a doc patch anyway? Is there a hg repo for the documentation and a web interface through which one can submit a pull request?

Note: Handling of buggy data URLs is buggy. E.g. missing padding characters at the end of the URL raise an exception. Browsers like Firefox and Chrome correct the padding (Chrome only if the padding is completely missing; Firefox corrects/ignores any garbage at the end). I could correct the padding as well, but I'd rather not perform such magic. RFC 4648 [1] (base64 data encoding) states that specifications referring to it have to explicitly state whether there are characters that can be ignored, or whether the padding is not required. RFC 2397 [2] (the data URL scheme) does not state any such thing, but it doesn't specifically refer to RFC 4648 either (as it was written before RFC 4648).

Chrome and Firefox ignore any kind of white space in data URLs. I think that is a good idea, because it lets you wrap long data URLs in image tags. binascii.a2b_base64 ignores white space anyway, so I don't have to do anything there. Firefox and Chrome both allow %-encoding of base64 characters like "/", "+" and "=". That this should work is not mentioned in the data URL RFC, but I think one can assume as much.

Also note that a minimal base64 data URL is "data:;base64," and not "data:base64," (note the ";"). The latter would specify the (illegal) mime type "base64" and not a base64 encoding. This is handled correctly by my example code.

-panzi

[1] http://tools.ietf.org/html/rfc4648#section-3
[2] http://tools.ietf.org/html/rfc2397
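For reference, the kind of lenient padding correction Chrome performs could be sketched in a couple of lines (illustrative only; as said above, the gist deliberately does not do this):

def fix_padding(data):
    # append "=" until the length is a multiple of 4, as base64 requires
    return data + "=" * (-len(data) % 4)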
On 2 November 2012 23:47, Mathias Panzenböck <grosser.meister.morti@gmx.net> wrote:
Which version do you think is best for the recipe? I guess losing the mediatype (and thus the charset) is not so good, so the version with the DataResponse is better? Maybe with a note that if you don't need the mediatype you can simply return an io.BytesIO as well?

How does one submit a doc patch anyway? Is there a hg repo for the documentation and a web interface through which one can submit a pull request?
You should probably be consistent with urllib's behaviour for other URLs - from the documentation of urlopen:

"""
This function returns a file-like object that works as a context manager, with two additional methods from the urllib.response module:

geturl() — return the URL of the resource retrieved, commonly used to determine if a redirect was followed

info() — return the meta-information of the page, such as headers, in the form of an email.message_from_string() instance (see Quick Reference to HTTP Headers)

Raises URLError on errors.
"""

To create a doc patch, open a feature request on bugs.python.org and attach a patch. The documentation is in the core Python repository, from hg.python.org. You can clone that and use Mercurial to generate a patch, but there's no "pull request" mechanism that I know of.

Paul
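One way to match that documented interface without writing a response class by hand is urllib.response.addinfourl, which wraps a file object and supplies the info() and geturl() methods. A sketch (the make_data_response name and the default mediatype handling are assumptions, not code from the gist):

import email
import io
import urllib.response

def make_data_response(url, mediatype, data):
    # RFC 2397 defines this default when no mediatype is given
    mediatype = mediatype or "text/plain;charset=US-ASCII"
    headers = email.message_from_string(
        "Content-Type: %s\nContent-Length: %d\n" % (mediatype, len(data)))
    return urllib.response.addinfourl(io.BytesIO(data), headers, url)

A handler's data_open() could then return make_data_response(req.full_url, mimetype, data) instead of a bare io.BytesIO.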
On 11/03/2012 11:40 AM, Paul Moore wrote:
On 2 November 2012 23:47, Mathias Panzenböck <grosser.meister.morti@gmx.net> wrote:
Which version do you think is best for the recipe? I guess losing the mediatype (and thus the charset) is not so good, so the version with the DataResponse is better? Maybe with a note that if you don't need the mediatype you can simply return an io.BytesIO as well? How does one submit a doc patch anyway? Is there a hg repo for the documentation and a web interface through which one can submit a pull request?
You should probably be consistent with urllib's behaviour for other URLs - from the documentation of urlopen:
""" This function returns a file-like object that works as a context manager, with two additional methods from the urllib.response module
geturl() — return the URL of the resource retrieved, commonly used to determine if a redirect was followed info() — return the meta-information of the page, such as headers, in the form of an email.message_from_string() instance (see Quick Reference to HTTP Headers) Raises URLError on errors. """
Ok, I added the two methods. Now there are three ways to get the headers: req.headers, req.msg, and req.info(). Shouldn't there be *one* obvious way to do this? req.headers?
To create a doc patch, open a feature request on bugs.python.org and attach a patch. The documentation is in the core Python repository, from hg.python.org. You can clone that and use Mercurial to generate a patch, but there's no "pull request" mechanism that I know of.
Paul
On Sunday, 4 November 2012, Mathias Panzenböck wrote:
Shouldn't there be *one* obvious way to do this? req.headers
Well, I'd say that the stdlib docs imply that req.info is the required way, so that's the "one obvious way". If you want to add extra methods for convenience, fair enough, but code that doesn't already know it is handling a data URL can't use them, so I don't see the point, personally. But others may have different views...

Paul
Ok, I've opened an issue in the Python bug tracker and attached a doc patch for the recipe: http://bugs.python.org/issue16423

On 11/04/2012 09:28 AM, Paul Moore wrote:
On Sunday, 4 November 2012, Mathias Panzenböck wrote:
Shouldn't there be *one* obvious way to do this? req.headers
Well, I'd say that the stdlib docs imply that req.info is the required way, so that's the "one obvious way". If you want to add extra methods for convenience, fair enough, but code that doesn't already know it is handling a data URL can't use them, so I don't see the point, personally.
But others may have different views...
Paul
Had not known about the 'data' URL scheme. Thanks for the pointer ( http://tools.ietf.org/html/rfc2397 ) and the documentation patch.

BTW, a documentation patch is easy to get in, but should support in a more natural form, where the data URL is parsed internally by the module and the expected results are returned, be considered? That could be targeted for 3.4, and the docs recipe serves for all the other releases.

Thank you,
Senthil

On Tue, Nov 6, 2012 at 7:45 PM, Mathias Panzenböck <grosser.meister.morti@gmx.net> wrote:
Ok, I've written an issue in the python bug tracker and attached a doc patch for the recipe:
http://bugs.python.org/issue16423
On 11/04/2012 09:28 AM, Paul Moore wrote:
On Sunday, 4 November 2012, Mathias Panzenböck wrote:
Shouldn't there be *one* obvious way to do this? req.headers
Well, I'd say that the stdlib docs imply that req.info is the required way, so that's the "one obvious way". If you want to add extra methods for convenience, fair enough, but code that doesn't already know it is handling a data URL can't use them, so I don't see the point, personally.
But others may have different views...
Paul
Sorry, I don't quite understand.

On 11/07/2012 06:08 AM, Senthil Kumaran wrote:
Had not known about the 'data' URL scheme. Thanks for the pointer ( http://tools.ietf.org/html/rfc2397 ) and the documentation patch. BTW, a documentation patch is easy to get in, but should support in a more natural form, where the data URL is parsed internally by the module
Do you mean the parse_data_url function should be removed and put into DataResponse (or DataHandler)?
and the expected results are returned, be considered?
What expected results? And in what way should they be considered? Considered for what?
That could be targeted for 3.4, and the docs recipe serves for all the other releases.
Thank you, Senthil
On Tue, Nov 6, 2012 at 7:45 PM, Mathias Panzenböck <grosser.meister.morti@gmx.net> wrote:
Ok, I've written an issue in the python bug tracker and attached a doc patch for the recipe:
http://bugs.python.org/issue16423
On 11/04/2012 09:28 AM, Paul Moore wrote:
On Sunday, 4 November 2012, Mathias Panzenböck wrote:
Shouldn't there be *one* obvious way to do this? req.headers
Well, I'd say that the stdlib docs imply that req.info is the required way, so that's the "one obvious way". If you want to add extra methods for convenience, fair enough, but code that doesn't already know it is handling a data URL can't use them, so I don't see the point, personally.
But others may have different views...
Paul
On Wed, Nov 7, 2012 at 9:24 AM, Mathias Panzenböck <grosser.meister.morti@gmx.net> wrote:
Sorry, I don't quite understand. Do you mean the parse_data_url function should be removed and put into DataResponse (or DataHandler)?
and the expected results are returned, be considered?
What expected results? And in what way should they be considered? Considered for what?
I meant, urlopen("data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg==") should work out of the box, whereby the DataHandler example that is in the documentation is made available in request.py and added to OpenerDirector by default. I find it hard to gauge the utility, but documentation is of course a +1.

Thanks,
Senthil
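For what it's worth, this is the behaviour that eventually shipped in Python 3.4 as urllib.request.DataHandler; with such a handler installed by default, usage is simply:

from urllib.request import urlopen

with urlopen("data:text/plain;charset=UTF-8,Hello%20World") as resp:
    print(resp.info().get_content_type())  # text/plain
    print(resp.read())                     # b'Hello World'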
On 11/08/2012 07:11 AM, Senthil Kumaran wrote:
On Wed, Nov 7, 2012 at 9:24 AM, Mathias Panzenböck <grosser.meister.morti@gmx.net> wrote:
Sorry, I don't quite understand. Do you mean the parse_data_url function should be removed and put into DataResponse (or DataHandler)?
and the expected results are returned, be considered?
What expected results? And in what way should they be considered? Considered for what?
I meant, urlopen("data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg==")
should work out of the box, whereby the DataHandler example that is in the documentation is made available in request.py and added to OpenerDirector by default. I find it hard to gauge the utility, but documentation is of course a +1.
Thanks, Senthil
Yes, I would also be in favor of including this in Python, but I was told here that I should write it as a recipe in the documentation. It is useful e.g. for crawlers/spiders that analyze web pages including their images.
On 15 November 2012 22:45, Mathias Panzenböck <grosser.meister.morti@gmx.net> wrote:
Yes, I would also be in favor of including this in Python, but I was told here that I should write it as a recipe in the documentation.
It is useful e.g. for crawlers/spiders that analyze web pages including their images.
It would be good in the stdlib. By all means submit a patch for adding it.

Paul
On 11/15/2012 11:48 PM, Paul Moore wrote:
On 15 November 2012 22:45, Mathias Panzenböck <grosser.meister.morti@gmx.net> wrote:
Yes, I would also be in favor of including this in Python, but I was told here that I should write it as a recipe in the documentation.
It is useful e.g. for crawlers/spiders that analyze web pages including their images.
It would be good in the stdlib. By all means submit a patch for adding it.

Paul
Ok, I added a patch that adds this to the stdlib to this issue: http://bugs.python.org/issue16423

I changed my code so it is more aligned with the existing code in urllib.request.

-panzi
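For context, a handler in this style — close to what eventually landed as urllib.request.DataHandler in Python 3.4 via this issue — might look like the following (a sketch, not the verbatim patch):

import base64
import email
import io
import urllib.request
import urllib.response
from urllib.parse import unquote_to_bytes

class DataHandler(urllib.request.BaseHandler):
    def data_open(self, req):
        url = req.full_url
        scheme, data = url.split(":", 1)
        mediatype, data = data.split(",", 1)
        # even base64-encoded payloads may additionally be percent-encoded
        data = unquote_to_bytes(data)
        if mediatype.endswith(";base64"):
            data = base64.decodebytes(data)
            mediatype = mediatype[:-7]
        if not mediatype:
            # RFC 2397 default when no mediatype is given
            mediatype = "text/plain;charset=US-ASCII"
        headers = email.message_from_string(
            "Content-Type: %s\nContent-Length: %d\n" % (mediatype, len(data)))
        return urllib.response.addinfourl(io.BytesIO(data), headers, url)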
participants (3)
- Mathias Panzenböck
- Paul Moore
- Senthil Kumaran