urllib.parrrrse does not supporrrrt bytes
Aye, me mateys,
In Python 3 the parrrsing functions of urllib do not work with bytes. What's the prrrroblem? I tell you: U'RLs only have a charrrrrset rrecommendation and sometimes you have to deal with URL encoded stuff that does not contain unicode data.
I tried to crrreate a patch for urllib but it appears that you have to rrrrrreplicate ParseResult for byte strrrrings which seems wrong to me. Does anyone rrremember the rrreasons why urllib was not designed to work on bytes interrrnally and only convert to unicode before/after converrrrrsion?
But maybe we could also add IRI functions to that module, or add an irilib that allows the conversion between U'RIs and I'RIs.
urllib depending on unicode strrrrings is for me the biggest rrreason to base WSGI for Python 3 exclusively on unicode.
Yo ho, that's it frrom me. Shiver my timbers!
Arrrrrmin
Initially I thought your 'r' key was having sticking issues. I really do hate Talk Like A Pirate Day.
On Sat, Sep 19, 2009 at 12:27, Armin Ronacher <armin.ronacher@active-4.com> wrote:
Aye, me mateys,
In Python 3 the parrrsing functions of urllib do not work with bytes. What's the prrrroblem? I tell you: U'RLs only have a charrrrrset rrecommendation and sometimes you have to deal with URL encoded stuff that does not contain unicode data.
I tried to crrreate a patch for urllib but it appears that you have to rrrrrreplicate ParseResult for byte strrrrings which seems wrong to me. Does anyone rrremember the rrreasons why urllib was not designed to work on bytes interrrnaly and only convert to unicode before/after converrrrrsion?
See, you are assuming any design went into it other than to make the thing pass the unit tests. Most modules did not go through some rigorous design discussion to decide how to make them work with bytes. Someone just took it upon themselves to make the thing work and that was that. I am willing to guess this is more or less what happened with urllib, especially since it was a bit tricky to get it merged between urllib and urllib2.
-Brett
Hi,
Brett Cannon wrote:
See, you are assuming any design went into it other than to make the thing pass the unit tests. Most modules did not go through some rigorous design discussion to decide how to make it work with bytes.
So I suppose there would be nothing wrong with providing a patch that makes it work with bytes internally?
Regards, Armin
Armin Ronacher wrote:
Hi,
Brett Cannon wrote:
See, you are assuming any design went into it other than to make the thing pass the unit tests. Most modules did not go through some rigorous design discussion to decide how to make it work with bytes.
So I suppose there would be nothing wrong with providing a patch that makes it work with bytes internally?
Aye, assuming no one else speaks up in defense of the current one. (But please don't provide the patch until tomorrrrow.)
Georg
--
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out.
On Sat, Sep 19, 2009 at 13:59, Armin Ronacher <armin.ronacher@active-4.com> wrote:
Hi,
Brett Cannon wrote:
See, you are assuming any design went into it other than to make the thing pass the unit tests. Most modules did not go through some rigorous design discussion to decide how to make it work with bytes.
So I suppose there would be nothing wrong with providing a patch that makes it work with bytes internally?
As long as the external API continues to work as expected, then no, there are no problems with making the code more bytes/unicode friendly internally.
-Brett
On Saturday, 19 September 2009 at 22:59 +0200, Armin Ronacher wrote:
Hi,
Brett Cannon wrote:
See, you are assuming any design went into it other than to make the thing pass the unit tests. Most modules did not go through some rigorous design discussion to decide how to make it work with bytes.
So I suppose there would be nothing wrong with providing a patch that makes it work with bytes internally?
Can you please clarify what you are trying to do?
If you just want to replace the implementation without changing the API to support bytes, I'm not sure what the point is.
If you want the public API to support bytes, perhaps it would be worth discussing it (on python-dev perhaps, since I'm not sure everyone concerned is here)? Although since HTTP is bytes at the protocol level anyway, it doesn't seem very controversial to me...
Regards
Antoine.
Antoine Pitrou wrote:
On Saturday, 19 September 2009 at 22:59 +0200, Armin Ronacher wrote:
Hi,
Brett Cannon wrote:
See, you are assuming any design went into it other than to make the thing pass the unit tests. Most modules did not go through some rigorous design discussion to decide how to make it work with bytes.
So I suppose there would be nothing wrong with providing a patch that makes it work with bytes internally?
Can you please clarify what you are trying to do?
If you just want to replace the implementation without changing the API to support bytes, I'm not sure what the point is.
Uhm... the point being to support bytes? :-)
If you want the public API to support bytes, perhaps it would be worth discussing it (on python-dev perhaps, since I'm not sure everyone concerned is here)? Although since HTTP is bytes at the protocol level anyway, it doesn't seem very controversial to me...
I don't see why it should be controversial at all - and I'm sure Armin is aware enough of the issues to do a good job. Probably the way forward is posting a patch and having a discussion on the bug tracker.
Michael
Regards
Antoine.
Hi,
Antoine Pitrou wrote:
Can you please clarify what you are trying to do?
If you just want to replace the implementation without changing the API to support bytes, I'm not sure what the point is.
What I'm trying to do, and the point of my changes, is support for bytes: a feature that disappeared in Python 3.0 and is widely used.
Problems I've seen in the code so far that go beyond the point of what I intend to change:
unused and undocumented interfaces:
- urllib.parse.unwrap: not sure what that is for, I guess testing
- urllib.parse.to_bytes: also unused, I guess it was in use for an older implementation
- urllib.parse.test and urllib.parse.test_data: unused test suite that is also in the real test suite. Can we delete it?
It's also pretty inconsistent currently: urllib.parse.quote and urllib.parse.quote_plus already work with bytes (the code adds explicit byte support, one just calls into urllib.parse.quote_from_bytes). However the reverse functions do not work with bytes at all.
What is quote_from_bytes used for when quote just calls into it? Why did this ever become a public interface? Also once unquote and unquote_plus are fixed, unquote_to_bytes is similarly useless, especially because both of them perform some weird utf-8 conversion if non-bytes are passed to them.
I would just fix those and deprecate the explicit unquote_to_bytes and quote_from_bytes for future versions of Python 3.
Any comments on that?
Regards,
Armin
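The asymmetry described in this message can be illustrated with a minimal sketch. It only uses quote, quote_from_bytes and unquote_to_bytes, all of which are named in the thread; the sample byte string is made up for illustration. The claim that unquote rejects bytes is about Python 3.1 as discussed here and may not hold on later releases, so the sketch deliberately avoids passing bytes to unquote.

```python
from urllib.parse import quote, quote_from_bytes, unquote_to_bytes

# Illustrative input (an assumption): Latin-1 encoded bytes that are
# not valid UTF-8.
raw = b"caf\xe9/menu"

# quote() accepts bytes and produces the same str as quote_from_bytes().
quoted = quote(raw)
same_as_explicit = quoted == quote_from_bytes(raw)

# Going back to bytes requires the explicit bytes-returning function.
restored = unquote_to_bytes(quoted)

print(quoted, same_as_explicit, restored)
```

The round trip ends with the original bytes, but only because the caller picked the bytes-specific function for the reverse direction.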
On Sunday, 20 September 2009 at 20:48 +0200, Armin Ronacher wrote:
What is quote_from_bytes used for when quote just calls into it? Why did this ever become a public interface? Also once unquote and unquote_plus are fixed, unquote_to_bytes is similarly useless, especially because both of them perform some weird utf-8 conversion if non-bytes are passed to them.
You should try a search in the python-dev archives. It was introduced recently, and after quite a bit of discussion I think.
I would just fix those and deprecate the explicit unquote_to_bytes and quote_from_bytes for future versions of Python 3.
Any comments on that?
Regardless of whether the semantics would be better or not, we can't change APIs every 2 years, it will make our users furious.
Regards
Antoine.
Hi,
Antoine Pitrou wrote:
You should try a search in the python-dev archives. It was introduced recently, and after quite a bit of discussion I think.
Just found the discussion. Apparently the reason there are different functions is something Guido wanted.
I guess the best idea is to finish the patch and discuss that one on python-dev then.
Regardless of whether the semantics would be better or not, we can't change APIs every 2 years, it will make our users furious.
Right, but how many people are using Python 3?
Regards, Armin
On 20 Sep, 2009, at 20:58, Armin Ronacher wrote:
Regardless of whether the semantics would be better or not, we can't change APIs every 2 years, it will make our users furious.
Right, but how many people are using Python 3?
This is irrelevant. Python 3.x is bound by the same API stability rules as the 2.x releases. One way to ensure that 3.x isn't used a lot is to keep making incompatible changes; that way anyone who wants to use Python 3 in production code will stay away.
Ronald
Hi,
Ronald Oussoren wrote:
This is irrelevant. Python 3.x is bound by the same API stability rules as the 2.x releases.
Right now the number of Python 3 adopters that are using Python for web-related applications is a *lot* lower than the number of Python 2 users.
When it comes to actual users using it I think it's better to compare it to the early Python 1.5 days.
One way to ensure that 3.x isn't used a lot is to keep making incompatible changes; that way anyone who wants to use Python 3 in production code will stay away.
So you think the broken behavior is the best way to encourage people working with Python 3? Sounds about right.
Regards, Armin
So you think the broken behavior is the best way to encourage people working with Python 3? Sounds about right.
You know, it would be better if you demonstrated that the behaviour is broken, rather than asserting it. The fact that a long discussion led to the current API is a good hint that it is probably not as broken as you make it to be.
Antoine Pitrou wrote:
So you think the broken behavior is the best way to encourage people working with Python 3? Sounds about right.
You know, it would be better if you demonstrated that the behaviour is broken, rather than asserting it. The fact that a long discussion led to the current API is a good hint that it is probably not as broken as you make it to be.
Agreed. The quote/unquote() APIs implement everything you need to support non-ASCII URLs - they even give you a choice of using a different encoding for the %-escaped parts of the URL.
Armin, what else do you think you need?
If you do know the encoding of the URL, then you can easily convert a bytes URL into a Unicode one. If not, then it's better to stick to the standards and have the functions raise an exception if needed.
If you don't want to see errors, use the errors="replace" error handler or just use "latin-1" as encoding for both unquote() and quote() - while it's not necessarily correct, it should get you pretty close to the bytes-behavior you see in Python 2.x.
--
Marc-Andre Lemburg
eGenix.com Professional Python Services directly from the Source (#1, Sep 20 2009)
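The two options mentioned in this reply (the "latin-1" encoding and the errors handler) can be sketched as follows. This is only an illustration under assumptions: the escaped sample string is invented, and the variable names are not from the thread. It relies on the encoding and errors parameters of quote() and unquote().

```python
from urllib.parse import quote, unquote

# Illustrative input (an assumption): percent-escapes that do not form
# valid UTF-8 once decoded to bytes.
escaped = "%FF%FEname"

# Option 1: treat the escaped bytes as Latin-1. Every byte maps to one
# code point, so quoting with the same encoding restores the input.
as_text = unquote(escaped, encoding="latin-1")
round_tripped = quote(as_text, encoding="latin-1")

# Option 2: decode as UTF-8 but replace undecodable bytes with U+FFFD.
# This never raises, but the original bytes are lost.
replaced = unquote(escaped, encoding="utf-8", errors="replace")

# With errors="strict" the same input raises instead.
try:
    unquote(escaped, encoding="utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

The Latin-1 route keeps the data reversible without claiming to know the real character set, which is the trade-off the reply describes as "not necessarily correct".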
Hi,
Antoine Pitrou wrote:
You know, it would be better if you demonstrated that the behaviour is broken, rather than asserting it. The fact that a long discussion led to the current API is a good hint that it is probably not as broken as you make it to be.
So if you look at a current version of urllib.parse on Python 3.1 you can observe the following behavior:
- quote and quote_plus are written to support bytes by forwarding the call to
- the unquote and unquote_plus functions do not work with bytes at all. You can currently quote to bytes, but can't go the other way.
- None of the URL parsing functions currently work with bytes.
- None of the URL joining functions currently work with bytes.
- You cannot specify the URL encoding to urlencode, parse_qs and parse_qsl. Currently you can only decode/encode utf-8 data with these functions.
Regards,
Armin
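For readers wondering what handling a bytes URL with a str-only parser looks like, here is a minimal sketch of one possible workaround: decode as Latin-1, parse, and encode each part back. This is not the patch discussed in the thread. The helper name urlsplit_bytes and the sample URL are hypothetical, and the approach assumes the parser leaves the characters it is given untouched, which may not hold for whitespace or control characters.

```python
from urllib.parse import urlsplit

def urlsplit_bytes(url):
    """Hypothetical helper: split a bytes URL into bytes components.

    Latin-1 maps each byte to exactly one code point, so decoding and
    re-encoding with it does not alter the byte values.
    """
    parts = urlsplit(url.decode("latin-1"))
    # SplitResult holds (scheme, netloc, path, query, fragment) as str.
    return tuple(part.encode("latin-1") for part in parts)

# Illustrative input (an assumption): a raw non-UTF-8 byte in the query.
parts = urlsplit_bytes(b"http://example.com/caf%E9?name=\xe9#top")
print(parts)
```

A wrapper like this has to be repeated for every parsing and joining function, which is part of why the question of bytes support inside urllib.parse itself comes up here.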
On Sun, Sep 20, 2009 at 08:55:51PM +0200, Antoine Pitrou wrote:
On Sunday, 20 September 2009 at 20:48 +0200, Armin Ronacher wrote:
What is quote_from_bytes used for when quote just calls into it? Why did this ever become a public interface? Also once unquote and unquote_plus are fixed, unquote_to_bytes is similarly useless, especially because both of them perform some weird utf-8 conversion if non-bytes are passed to them.
You should try a search in the python-dev archives. It was introduced
Well, I was discussing this with Armin Ronacher on IRC and I pointed him to this: http://bugs.python.org/issue3300
Yes, it did go through a considerable amount of discussion.
recently, and after quite a bit of discussion I think.
I would just fix those and deprecate the explicit unquote_to_bytes and quote_from_bytes for future versions of Python 3.
Any comments on that?
Regardless of whether the semantics would be better or not, we can't change APIs every 2 years, it will make our users furious.
More so for commonly used ones.
--
Senthil
Smoking is one of the leading causes of statistics. -- Fletcher Knebel
Hi,
Senthil Kumaran wrote:
Well, I was discussing with Armin Ronacher at IRC and I pointed him to this: http://bugs.python.org/issue3300
However this is a different issue. That ticket covers the *unicode*-related parts of urllib; I only work on the *byte* parts, which currently either do not work or are not sufficiently tested and break easily.
I just aim to make sure urllib in Python 3 will also work on bytes.
Regards,
Armin
On Sun, Sep 20, 2009 at 08:48:04PM +0200, Armin Ronacher wrote:
unused and undocumented interfaces:
Not really unused.
- urllib.parse.unwrap not sure what that is for, I guess testing
Head over to request.py and see how the Request class unwraps the URL presented in the format it handles. This has been there since the 2.x urlparse module; it is a parsing function and does not belong to urllibx.py. I am not sure that people are not using this interface in their own code, because it has been present in urlparse for a long time.
- urllib.parse.to_bytes also unused, I guess it was in use for an
Again used in request.py to validate that URL as str
- urllib.parse.test and urllib.parse.test_data unused test suite that is also in the real test suite. Can we delete it?
These are just hanging around as a no-harm thing :) They can be removed, as we have tests which cover them. They have been there because there is no harm in keeping them around.
--
Senthil
I believe a little incompatibility is the spice of life, particularly if he has income and she is pattable. -- Ogden Nash
participants (8)
- Antoine Pitrou
- Armin Ronacher
- Brett Cannon
- Georg Brandl
- M.-A. Lemburg
- Michael Foord
- Ronald Oussoren
- Senthil Kumaran