Fwd: Simple curl/wget-like download functionality in urllib (like http.server offers a simple server)

In the context of building Docker images, it is often necessary to download stuff. If curl/wget is available, great, but slim images often don't include them. urllib could provide a very simple download functionality (just as http.server provides a simple server):

    from urllib.request import urlopen

    data = urlopen('https://.../install-poetry.py').read()
    # print or save data

On 18Oct2021 21:25, Tom P <thomas.pohl@gmail.com> wrote:
Well, it could, I suppose, but it is probably better to reach straight for the "requests" module (get it with "pip install requests"). It offers much functionality; in particular, its response object has an iter_content() method for fetching payload content iteratively.

The problem with a "download()" method is that it is almost never what you need. There are too many ways to want to do it, and one almost _never_ wants to suck the download itself into memory as you do above with read(), because downloads are often large, sometimes very large. You also don't always want to put it into a file. As a result, a "download()" method would rapidly grow a bunch of options trying to accommodate many simple but differing uses.

Instead you might be better off with a method which returns the contents in a loop to process as you see fit, which is what iter_content() does. Untested example (but based on some real world download code I've got right here):

    rsp = requests.get(url)
    with open("filename", "wb") as f:
        for chunk in rsp.iter_content():
            f.write(chunk)

Now, if you find yourself doing that _specific_ variation often, write yourself a function to do it and keep it in a module of your own.

Cheers, Cameron Simpson <cs@cskk.id.au>
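Cameron's suggestion, wrapping the specific variation in a helper of your own, could look like the sketch below. It uses only the standard library (rather than requests) so it is self-contained; the name download_to_file is hypothetical, and the demo downloads from a local http.server instance so that no external network access is needed:

```python
import http.server
import os
import tempfile
import threading
from urllib.request import urlopen

def download_to_file(url, path, chunk_size=1024 * 1024):
    """Stream `url` into `path` in fixed-size chunks, so the whole
    payload never has to fit in memory (hypothetical helper name)."""
    with urlopen(url) as response, open(path, "wb") as f:
        while chunk := response.read(chunk_size):
            f.write(chunk)

# Demo: serve a 3 MB file from a temporary directory with http.server
# on an ephemeral port, then download it with the helper above.
workdir = tempfile.mkdtemp()
os.chdir(workdir)
with open("payload.bin", "wb") as f:
    f.write(b"x" * 3_000_000)

server = http.server.ThreadingHTTPServer(
    ("127.0.0.1", 0), http.server.SimpleHTTPRequestHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_address[1]}/payload.bin"
download_to_file(url, "copy.bin", chunk_size=64 * 1024)
server.shutdown()
```

With a chunk size of 64 KiB, the 3 MB demo payload is written in roughly 46 writes, and peak memory use stays at one chunk regardless of file size.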

On Tue, Oct 19, 2021 at 9:00 AM Cameron Simpson <cs@cskk.id.au> wrote:
OTOH, if you *do* want to put it into a file, it should be possible to take advantage of zero-copy APIs to reduce unnecessary transfers. I'm not sure if there's a way to do that with requests. Ideally, what you want is os.sendfile() but it'd need to be cleanly wrapped by the library itself. ChrisA
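For illustration only, the zero-copy primitive being discussed can be sketched with plain, non-TLS sockets (assuming Linux; a socketpair stands in for a real network connection), which is exactly the case where userspace TLS gets in the way:

```python
import os
import socket
import tempfile

payload = b"zero-copy example payload\n" * 100

with tempfile.NamedTemporaryFile() as src:
    src.write(payload)
    src.flush()  # make sure the bytes are on disk before sendfile reads them

    # A socketpair stands in for a real TCP connection between two peers.
    left, right = socket.socketpair()
    with left, right:
        sent = 0
        while sent < len(payload):
            # os.sendfile(out_fd, in_fd, offset, count): the kernel copies
            # file bytes straight to the socket, with no userspace buffer.
            sent += os.sendfile(left.fileno(), src.fileno(),
                                sent, len(payload) - sent)
        left.shutdown(socket.SHUT_WR)

        received = b""
        while chunk := right.recv(65536):
            received += chunk
```

With HTTPS the encryption happens in userspace inside OpenSSL, so there is no kernel-visible plaintext stream to splice, which is the limitation Christian describes below.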

Performance is not an issue in the use case I envision. This is about downloading small installation scripts (e.g., for installing poetry in a container) or a few megabytes of data. I just tested the improved script (with just a 1 MB read buffer) and it could easily saturate my 100 Mbit/s connection while downloading a 1.9 GB file in under three minutes (other clients were using the same connection). Tom

On 19/10/2021 00.06, Chris Angelico wrote:
Splicing APIs like sendfile() require a kernel socket. You cannot do sendfile() with userspace sockets like OpenSSL sockets, e.g. for HTTPS. The latest Linux kernel and OpenSSL 3.0.0 have a new feature called kTLS. Kernel TLS uses OpenSSL to establish the TLS connection and then handles payload transfer in the kernel, enabling zero-copy sendfile(). Christian

On Wed, Oct 20, 2021 at 1:23 AM Christian Heimes <christian@python.org> wrote:
Ah, of course, forgot about that. But obviously other people have seen the same problem, and come up with a solution. In any case, it cements the need for an actual API for "download this into a file", despite the limitations of it. Obviously you also need the API of "give me the next lot of bytes as a string", but being able to download directly to a file *is* of significant value. ChrisA

I am aware of requests (or httpx, ...), but the idea is to do all of this with the standard library. If I have to install stuff, I could just install wget/curl and be done.

Feature creep is an issue here, but just like http.server, one could be really strict about covering only 90% of the use cases (download stuff and print or save it) and not try to handle any corner cases.

The first code snippet was not supposed to be production-ready. Here's an improved version which only downloads 1 MB at a time and prints it. The only parameter could be the URL:

    from urllib.request import urlopen
    from sys import stdout

    with urlopen("https://coherentminds.de/") as response:
        while data := response.read(1024 * 1024):
            stdout.buffer.write(data)

The user of this function could still decide to divert stdout into a file, so both use cases, printing and saving, would be covered. IMHO, the benefit-cost ratio is quite good:

* can be a lifesaver (just like http.server) every once in a while, in particular in a container or testing context
* low implementation effort
* easy to test and to maintain

Tom

You are absolutely right, the functionality is there, but the idea is to make it easily available from the command line. Here is a line (with shortened URL) from a Dockerfile which installs poetry as suggested in the docs:

    RUN python -c "from urllib.request import urlopen; print(urlopen('https://.../install-poetry.py').read().decode())" | python

With the proposed functionality, urllib could provide a download entry point which would make the line look like this:

    RUN python -m urllib.download "https://.../install-poetry.py" | python

This is less error-prone, would also work for rather large downloads, and wouldn't be much effort to implement/test/maintain.

On Tue, Oct 19, 2021 at 7:32 AM Tom Pohl <thomas.pohl@gmail.com> wrote:
You are absolutely right, the functionality is there, but the idea is to make it easily available from the command line.
I don't know about the others, but I missed that you were talking about an entry point in your initial post. I think that's a great idea -- and we could add a modest set of command-line options as well. Folks are always resistant to adding new things, but once written, this would provide a nice feature and not be difficult to maintain nor disruptive in any way. +1 for sure! -CHB
-- Christopher Barker, PhD (Chris) Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

Thanks, finally a +1! \o/ It's funny that "entry point" triggered your reaction, because I think it's not the correct technical term. What I'm proposing is very similar to http.server: https://github.com/python/cpython/blob/3.10/Lib/http/server.py#L1257 Just like "python -m http.server" you could then use "python -m urllib.request.download <URL>".

A question for the Python experts: What is the correct technical term for a functionality like "http.server", i.e., a module with an actual "main" function?

Thanks. Not as catchy as I would have hoped, though. ;-) One person besides me in favor of this idea. Any other feedback? How to proceed?

On 10/25/2021 11:21 AM, Tom Pohl wrote:
> Thanks. Not as catchy as I would have hoped, though. ;-)

When you respond to a message, could you keep a little of the context that you're replying to? I'm not sure what this refers to.

> One person besides me in favor of this idea. Any other feedback? How to proceed?

I think it's a good idea. I think the next step would be to create a PR, including documentation. IIRC, there's already an issue open for this. Eric

Thanks for the nudge. If anyone is interested (or could approve to make the pipeline run), here's the PR: https://github.com/python/cpython/pull/29217

Tom Pohl wrote:
A question for the Python experts: What is the correct technical term for a functionality like "http.server", i.e., a module with an actual "main" function?
There are some details about it here: https://docs.python.org/3/library/__main__.html#idiomatic-usage

participants (10)
- Cameron Simpson
- Chris Angelico
- Christian Heimes
- Christopher Barker
- Eric Fahlgren
- Eric V. Smith
- Steven D'Aprano
- Thomas Grainger
- Tom P
- Tom Pohl