_socket efficiencies ideas

I have been in discussion recently with Martin v. Loewis about an idea I have been thinking about for a while to improve the efficiency of the connect method in the _socket module. I posted the original suggestion to the python suggestions tracker on sourceforge as item 706392. A bit of history and justification: I am doing a lot of work using python to develop almost-real-time distributed data acquisition and control systems for running laboratory apparatus. In this environment, I do a lot of sun-rpc calls as part of the vxi-11 protocol to allow TCP/IP access to gpib-like devices. As a part of this, I do a lot of socket.connect() calls, often with the connections being quite transient. The problem is that the current python _socket module makes a DNS call to try to resolve each address before connect is called, which, if I am connecting/disconnecting many times a second, results in pathological and gratuitous network activity. Incidentally, I am in the process of creating a sourceforge project, pythonlabtools (just approved this morning), in which I will start maintaining a repository of the tools I have been working on. My first solution to this, for which I submitted a patch to the tracker system (with guidance from Martin), was to create a wrapper for the sockaddr object, which one can create in advance, and which, when _socket.connect() is called (actually when getsockaddrarg() is called by connect), results in an immediate connection without any DNS activity. This solution solves part of the problem, but may not be the right final one. After writing this patch and verifying its functionality, I tried it in the real world. Then I realized that for sun-rpc work it wasn't quite what I needed, since the socket number may be changing each time the rpc request is made, resulting in a new address wrapper being needed, and thus DNS activity again. 
After thinking about what I have done with this patch, I would also like to suggest another change (for which I am also willing to submit the patch, which is quite simple): Consistent with some of the already extant glue in _socket to handle addresses like <broadcast>, would there be any reason not to modify setipaddr() and getaddrinfo() so that if an address is prefixed with <numeric> (e.g. <numeric>127.0.0.1) the PASSIVE and NUMERIC flags are always set, so that these routines reject any non-numeric address but handle numeric ones very efficiently? I have already implemented a predecessor to this which I am experimentally running at home in python 2.2.2, in which I made it so that prefixing the address with an exclamation point provided this functionality. Given the somewhat more legible approach the team has already chosen for special addresses, I see no reason why using a <numeric> (or some such) prefix isn't reasonable. Do any members of the development team have commentary on this? Would such a change be likely to be accepted into the system? Any reasons why it might break something? The actual patch would be only about 10 lines of code (plus some documentation), a few in each of the routines mentioned above. Thanks for any suggestions. Marcus Mendenhall

Are you sure that it tries to make a DNS call even when the address is pure numeric? That seems a mistake, and if that's really happening, I think that is the part that should be fixed. Maybe in the _socket module, maybe in getaddrinfo().
I don't see why we would have to add the <numeric> flag to the address when the form of the address itself is already a perfect clue that the address is purely numeric. I'd be happy to see a patch that intercepts addresses of the form \d+\.\d+\.\d+\.\d+ and parses those without calling getaddrinfo(). --Guido van Rossum (home page: http://www.python.org/~guido/)

Thanks for your prompt reply! On Tuesday, April 8, 2003, at 09:50 AM, Guido van Rossum wrote:
Yes, it seems to do this. It sets the PASSIVE flags, but that doesn't seem to be quite enough to prevent DNS activity, although the NUMERIC flag does the job. This is true, at least, in 2.3.x on MacOSX, and since the socket stuff is all the same, I suspect it is true on many Unixes. Note that this doesn't happen on the MacOS9 version, which provides its own socket interface through GUSI, which apparently is smart enough to handle it.
Do we want this? The parser would then also have to be modified to handle numeric INET6 addresses when they become popular. I actually did implement one of my trial versions this way, and it worked fine. There is one minor issue, too. In urllib, there are some calls to getaddrinfo to get (for maybe no good reason) CNAMEs of addresses. I would like some way to tag an address with a very strong comment that it is what it is, and I would like all further processing disabled. Also, a 'trial' parsing of an address for matching an a.b.c.d pattern each time is a lot more processor intensive than checking for <numeric> at the beginning. I am perfectly happy to implement it either way.

On Tue, Apr 08, 2003 at 10:50:50AM -0400, Guido van Rossum wrote:
Are you sure that it tries make a DNS call even when the address is pure numeric? That seems a mistake, and if that's really happening, I
My first thought is that there should be a local DNS cache on the machine that is running these apps. My second thought is that Python could benefit from caching some lookup information...
It's not quite that easy. Beyond the IPV6 issues mentioned elsewhere, you'd also want to check "\d+\.\d+" and "\d+\.\d+\.\d+". IP addresses will fill in missing ".0"s, which is particularly handy for accessing "127.1", which gets expanded to "127.0.0.1". Sean -- Rocky: "Do you know what an A-Bomb is?" Bullwinkle: "Of course. ``A Bomb'' is what some people call our show." Sean Reifschneider, Inimitably Superfluous <jafo@tummy.com> tummy.com, ltd. - Linux Consulting since 1995. Qmail, Python, SysAdmin

I don't want to build a cache into Python, it should already be part of libresolv.
The IPv6 folks can add their own cache.
I didn't even know this, and I think it's bad style to use something that obscure (most people would probably guess that 127.1 means 0.0.127.1 or 127.1.0.0). But since you seem to know about this stuff, perhaps you can submit a patch? --Guido van Rossum (home page: http://www.python.org/~guido/)

OK, I'll chime back in on the thread I started... I mostly have a question for Sean, since he seems to know the networking stuff well. Do you know of any reason why my original proposal (which is to allow IP addresses prefixed with <numeric>, e.g. <numeric>127.0.0.1, to cause both the AI_PASSIVE _and_ AI_NUMERIC flags to get set when resolution is attempted, which basically causes parsing with no real resolution at all) would break any known or plausible networking standards? The current Python socket module basically hides this part of the BSD socket API, and I find it quite useful to be able to suppress DNS activity absolutely for some addresses. And for Guido: since this type of tag has already been used in Python (as <broadcast>), is there any reason why this solution is inelegant? Thanks. Marcus On Wednesday, April 9, 2003, at 08:51 AM, Guido van Rossum wrote:

OK, I'll chime back in on the thread I started... I mostly have a question for Sean, since he seems to know the networking stuff well.
I'll chime in nevertheless.
What are those flags? Which API uses them? I still don't understand why intercepting the all-numeric syntax isn't good enough, and why you want a <numeric> prefix.
The reason I'm reluctant to add a new notation is that AFAIK it would be unique to Python. It's better to stick to standard notations IMO. <broadcast> was probably a mistake, since it seems to mean the same as 0.0.0.0 (for IPv4). --Guido van Rossum (home page: http://www.python.org/~guido/)

On Wednesday, April 9, 2003, at 09:37 AM, Guido van Rossum wrote:
The getaddrinfo call uses them (actually the correct name for one of the flags is AI_NUMERICHOST, not AI_NUMERIC as I originally stated), and it's part of the BSD sockets library, which is basically what the python socketmodule wraps.
I still don't understand why intercepting the all-numeric syntax isn't good enough, and why you want a <numeric> prefix.
I guess intercepting all numeric is OK, it is just less efficient (since it requires a trial parsing of an address, which is wasted if it is not all numeric), and because it is so easy to implement <numeric>. However, all my operational goals are achieved if the old check for pure numeric is reinstated at the lowest level (probably in getsockaddrarg in socketmodule.c), so it is used everywhere.
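To illustrate what the flag discussed above does, here is a minimal sketch using the `socket.getaddrinfo` wrapper as it exists in present-day Python, which postdates this thread. It is not Marcus's patch; the helper name `numeric_only` and the example host and port are assumptions made for illustration.

```python
import socket


def numeric_only(host, port):
    """Parse host as a literal IPv4 address without consulting a resolver.

    Hypothetical helper: AI_NUMERICHOST asks getaddrinfo() to reject
    anything that is not already a numeric address instead of looking it up.
    """
    return socket.getaddrinfo(host, port, socket.AF_INET,
                              socket.SOCK_STREAM, 0, socket.AI_NUMERICHOST)


# A literal address is parsed locally.
print(numeric_only("127.0.0.1", 111)[0][4])

# A name is refused with socket.gaierror rather than being resolved.
try:
    numeric_only("not-an-ip.example", 111)
except socket.gaierror as exc:
    print("rejected:", exc)
```

The caller has to decide in advance that the address is numeric, which is the same trade-off the `<numeric>` prefix would have put into the address string.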

The performance loss will be unmeasurable (parsing a string of at most 11 bytes against a very simple pattern). Compare that to the true cost of adding <numeric>: documentation has to be added (and dozens of books updated), and code that wants to use numeric addresses has to be changed.
Right.
You're right, this functionality should be made available. IMO the right solution is to make it a separate API in the socket module, not to add more syntax to the existing address parsing code. --Guido van Rossum (home page: http://www.python.org/~guido/)

Marcus Mendenhall wrote:
More importantly, it is part of RFC 2553, which Python uses; it is also part of Winsock2.
But isn't the same trial parsing needed to determine presence of the "<numeric>" flag? The trial parsing Guido proposes usually stops with the first letter in a non-numeric address, and accesses up to 16 letters for a numeric address. Regards, Martin

On Wednesday, April 9, 2003, at 01:49 PM, Martin v. Löwis wrote: the first compare avoids even a subroutine call in the most likely case (string does not begin with <numeric>) but then checks extremely quickly if it is right after that. Even though cpu time is cheap, we should save it for useful work. Marcus

Marcus Mendenhall wrote:
Even though cpu time is cheap, we should save it for useful work.
Saving a few cycles while having to complicate the interface is not the Python way. +1 on restoring the old sscanf code (or something similar to it). ObTrivia: IP addresses can be written as a single number (at least for many IP implementations). Try "ping 2130706433". Neil

Neil Schemenauer <nas@python.ca> writes:
For what it's worth, whenever I had network code that I wanted to accept names or addresses, I always distinguished them through an attempt using the platform inet_addr() system call. If that returns an error (-1), then I go ahead and process it as a name, otherwise I use the address it returns. inet_addr() will itself take care of validating that the address is legal (e.g., no octet over 255 and only up to 4 octets), padding values as necessary (e.g., x.y.z is processed as if z was a 16-bit value, x.z as if z was a 24-bit value, x as a 32-bit value), and permits decimal, octal or hexadecimal forms of the individual octets. I believe this behavior is portable and well defined. If you wanted the same code to work for IPv4 and IPv6, you'd probably want to use inet_pton() instead since inet_addr() only does IPv4, although that would lose the hex/octal options. You'd probably have to conditionalize that anyway since it might not be available on IPv4 only configurations, so I could see using inet_addr() for IPv4 and inet_pton() for IPv6.
ObTrivia: IP addresses can be written as a single number (at least for many IP implementations). Try "ping 2130706433".
That's part of the inet_addr() definition. When a single value is given as the string, it is assumed to be the complete 32-bit address value, and is stored directly without any byte rearrangement. So, 2130706433 is (127*2^24) + 1, or "127.0.0.1" - but then obviously you knew that :-) -- David
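The "try to parse it as an address first, otherwise treat it as a name" approach David describes can be sketched in Python roughly as follows. This version uses `socket.inet_pton` for both families, so it accepts only strict literals and not the shorthand, octal or hexadecimal forms that inet_addr() allows; the function name `literal_family` is an assumption, not an existing API.

```python
import socket


def literal_family(host):
    """Return AF_INET or AF_INET6 if host is a literal address, else None."""
    for family in (socket.AF_INET, socket.AF_INET6):
        try:
            socket.inet_pton(family, host)
        except (OSError, ValueError):
            # Not a valid literal for this family (or family unsupported).
            continue
        return family
    return None


for candidate in ("127.0.0.1", "::1", "python.org"):
    print(candidate, literal_family(candidate))
```

A caller would skip name resolution whenever the result is not None and fall back to the normal lookup path otherwise.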

Even though cpu time is cheap, we should save it for useful work.
With that attitude, I'm surprised you're using Python at all. :-) --Guido van Rossum (home page: http://www.python.org/~guido/)

Marcus Mendenhall <marcus.h.mendenhall@vanderbilt.edu>:
Just: if (string[0]=='<' && !strncmp(string,"<numeric>",9)) {whatever}
By the same token, checking whether the first char is a digit ought to weed out about 99.999% of all non-numeric domain name addresses. If this is even a problem, which I doubt. We're talking about something called from Python, for goodness sake... Greg Ewing, Computer Science Dept, University of Canterbury, Christchurch, New Zealand, greg@cosc.canterbury.ac.nz | A citizen of NewZealandCorp, a wholly-owned subsidiary of USA Inc.

this is a fragment from RFC 1034 (DOMAIN NAMES - CONCEPTS AND FACILITIES) http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1034.html i'm not 100% sure that this is the "normative" definition, but if it is then it clearly requires a non-numeric initial character for each label. (sorry if someone has already mentioned this!) andrew
3.5 Preferred name syntax
The DNS specifications attempt to be as general as possible in the rules for constructing domain names. The idea is that the name of any existing object can be expressed as a domain name with minimal changes. However, when assigning a domain name for an object, the prudent user will select a name which satisfies both the rules of the domain system and any existing rules for the object, whether these rules are published or implied by existing programs. For example, when naming a mail domain, the user should satisfy both the rules of this memo and those in RFC-822. When creating a new host name, the old rules for HOSTS.TXT should be followed. This avoids problems when old software is converted to use domain names.
The following syntax will result in fewer problems with many applications that use domain names (e.g., mail, TELNET).
<domain> ::= <subdomain> | " "
<subdomain> ::= <label> | <subdomain> "." <label>
<label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
<ldh-str> ::= <let-dig-hyp> | <let-dig-hyp> <ldh-str>
<let-dig-hyp> ::= <let-dig> | "-"
<let-dig> ::= <letter> | <digit>
<letter> ::= any one of the 52 alphabetic characters A through Z in upper case and a through z in lower case
<digit> ::= any one of the ten digits 0 through 9
Note that while upper and lower case letters are allowed in domain names, no significance is attached to the case. That is, two names with the same spelling but different case are to be treated as if identical.
The labels must follow the rules for ARPANET host names. They must start with a letter, end with a letter or digit, and have as interior characters only letters, digits, and hyphen. There are also some restrictions on the length. Labels must be 63 characters or less.
-- http://www.acooke.org/andrew
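The label rule quoted above (start with a letter, end with a letter or digit, interior letters, digits and hyphens, at most 63 characters) can be written as a regular expression. The sketch below covers only this "preferred name syntax", not every name the DNS will actually accept; the function names are illustrative assumptions.

```python
import re

# <label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
_LABEL = re.compile(r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_preferred_label(label):
    """True if label follows the RFC 1034 section 3.5 preferred syntax."""
    return len(label) <= 63 and _LABEL.fullmatch(label) is not None


def is_preferred_name(name):
    """True if every dot-separated label of name follows that syntax."""
    return all(is_preferred_label(part) for part in name.split("."))


for name in ("python.org", "3com.com", "127.0.0.1"):
    print(name, is_preferred_name(name))
```

Under this rule a dotted quad can never be a preferred host name, but names with a leading digit such as 3com.com also fail the check, so the syntax alone does not settle what a resolver may be asked to look up.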

As is 3com.com, and, for a more python-related example, 4suite.org. The latter also has an A record. 411.com and 911.com are both valid domains, as is 123.com. With the appropriate resolv.conf search path (ie including '.com'), you could enter '123' and expect to get back the address 64.186.10.158. Isn't the DNS fun. Anthony -- Anthony Baxter <anthony@interlink.com.au> It's never too late to have a happy childhood.

On Wed, Apr 09, 2003 at 09:51:26AM -0400, Guido van Rossum wrote:
I didn't even know this, and I think it's bad style to use something that obscure
Perhaps... It's also bad style to break the obscure cases that are defined by the specifications... ;-)
(most people would probably guess that 127.1 means 0.0.127.1 or 127.1.0.0).
Yeah, unfortunately it's one of those cases that it doesn't really make sense until you actually know the padding happens, and then think about it... It really only makes sense to pad within the address because you are rarely going to have leading or trailing 0s in a network address. So, it pads before the trailing specified octet: 10.1 => 10.0.0.1 10.9.1 => 10.9.0.1
But since you seem to know about this stuff, perhaps you can submit a patch?
I've updated my local CVS repository, I'll see if I can get a change done on the airplane today. Sean -- The structure of a system reflects the structure of the organization that built it. -- Richard E. Fairley Sean Reifschneider, Inimitably Superfluous <jafo@tummy.com> tummy.com, ltd. - Linux Consulting since 1995. Qmail, Python, SysAdmin
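The padding rule described above (the last part fills whatever bytes remain, so 10.1 becomes 10.0.0.1 and 10.9.1 becomes 10.9.0.1) can be sketched in pure Python. This is an illustration of the classic inet_addr() convention under the assumption of decimal parts only; octal and hexadecimal parts are rejected rather than interpreted, and the function names are hypothetical.

```python
def _is_plain_decimal(part):
    # Reject empty parts, non-ASCII digits, and leading zeros
    # (a leading zero would mean octal to inet_addr()).
    return part.isascii() and part.isdigit() and (part == "0" or part[0] != "0")


def expand_ipv4_shorthand(text):
    """Expand 'a', 'a.b', 'a.b.c' or 'a.b.c.d' to a dotted quad, else None."""
    parts = text.split(".")
    if not 1 <= len(parts) <= 4 or not all(_is_plain_decimal(p) for p in parts):
        return None
    *leading, last = [int(p) for p in parts]
    remaining_bytes = 4 - len(leading)
    # Leading parts are single octets; the last part fills the remaining bytes.
    if any(v > 255 for v in leading) or last >= 256 ** remaining_bytes:
        return None
    address = 0
    for value in leading:
        address = (address << 8) | value
    address = (address << (8 * remaining_bytes)) | last
    return ".".join(str((address >> shift) & 0xFF) for shift in (24, 16, 8, 0))


for text in ("127.1", "10.9.1", "2130706433"):
    print(text, "=>", expand_ipv4_shorthand(text))
```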

Sure. I propose to special-case only what we *absolutely* *know* we can handle, and if on closer inspection we can't (e.g. someone writes 999.999.999.999) we pass it on to the official code. Here's the 2.1 code, which takes that approach:
    if (sscanf(name, "%d.%d.%d.%d%c", &d1, &d2, &d3, &d4, &ch) == 4 &&
        0 <= d1 && d1 <= 255 && 0 <= d2 && d2 <= 255 &&
        0 <= d3 && d3 <= 255 && 0 <= d4 && d4 <= 255) {
        addr_ret->sin_addr.s_addr = htonl(
            ((long) d1 << 24) | ((long) d2 << 16) |
            ((long) d3 << 8) | ((long) d4 << 0));
        return 4;
    }
Great! --Guido van Rossum (home page: http://www.python.org/~guido/)
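For readers who prefer Python to C, the shortcut in the 2.1 code can be approximated as below. This is a sketch of the same idea (accept only a full a.b.c.d literal with every octet in 0..255, otherwise hand the string on), not a translation of socketmodule.c; the regular expression is somewhat stricter than sscanf's %d, which also tolerates signs and leading whitespace, and the name `pack_dotted_quad` is an assumption.

```python
import re
import struct

# Four runs of one to three ASCII digits separated by dots, nothing else.
_DOTTED_QUAD = re.compile(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})", re.ASCII)


def pack_dotted_quad(name):
    """Return 4 bytes in network order for a strict a.b.c.d literal, else None."""
    match = _DOTTED_QUAD.fullmatch(name)
    if match is None:
        return None
    octets = [int(group) for group in match.groups()]
    if any(octet > 255 for octet in octets):
        # e.g. 999.999.999.999: not ours to handle, defer to the resolver.
        return None
    return struct.pack("!4B", *octets)


print(pack_dotted_quad("127.0.0.1"))
print(pack_dotted_quad("999.999.999.999"))
```

Returning None for anything unusual keeps the special case conservative: a string is never classified as numeric unless it certainly is.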

It should be automatically recognized. Python has always done this (until 2.1 at least). I don't think there is any ambiguity; AFAIK it's not possible to put something in the DNS so that an all-numeric address gets remapped (that would be a nasty security problem waiting to happen). --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum <guido@python.org>:
AFAIK it's not possible to put something in the DNS so that an all-numeric address gets remapped
In that case, there's no problem at all, and I withdraw my suggestion about using tuples for numeric addresses. Greg Ewing, Computer Science Dept, University of Canterbury, Christchurch, New Zealand, greg@cosc.canterbury.ac.nz | A citizen of NewZealandCorp, a wholly-owned subsidiary of USA Inc.

Ick ick. This is putting a bunch of code for a stub resolver into python. This stuff is hard to get right - I implemented this on top of pydns, and it was a lot of work to get (what I think is) correct, for not very much gain. The idea of either suppressing DNS lookups for all-numeric addresses, or some sort of extended API for suppressing DNS lookups might be better, but really, isn't this the job of the stub resolver? Anthony -- Anthony Baxter <anthony@interlink.com.au> It's never too late to have a happy childhood.

On Thu, Apr 10, 2003 at 12:24:45AM +1000, Anthony Baxter wrote:
Well, ideally you'd cache the data for as long as the SOA says to cache it. However, it sounds like in the situation that started this thread, even caching that data for some small but configurable number of seconds might help out.
Definitely, on both counts... I like the idea of the "<numeric>127.0.0.1" or otherwise somehow specifying that the address shouldn't be resolved. I wouldn't think that it'd be good to do lookups of purely IP addresses, but there is probably some obscure part of some spec that says it should happen. Contrary to popular belief, just because I know that IP addresses get padded with 0s, I'm not a networking lawyer. ;-) I learned that trick because it can help make dealing with IPV6 addresses much easier, but I've found it most useful with 127.1. Sean -- This message is REALLY offensive, so I ROT-13d it TWICE. -- Sean Reifschneider being silly on #python, 2000 Sean Reifschneider, Inimitably Superfluous <jafo@tummy.com> tummy.com, ltd. - Linux Consulting since 1995. Qmail, Python, SysAdmin

On woensdag, apr 9, 2003, at 16:40 Europe/Amsterdam, Sean Reifschneider wrote:
I wouldn't touch caching with a ten foot pole here: Python cannot know what happens under the hood of the network. For example, if I move my WiFi-equipped laptop from one location to another I don't want to be forced to restart my Python applications just to clear some silly cache, knowing that the OS and libc layers have handled the switch fine. (And, yes, Windoze-users are probably required to reboot anyway, but my Mac handles changing IP addresses just nicely:-) -- - Jack Jansen <Jack.Jansen@oratrix.com> http://www.cwi.nl/~jack - - If I can't dance I don't want to be part of your revolution -- Emma Goldman -

What I said.
Hey, I just figured it out. The old socket module (Python 2.1 and before) *did* special-case \d+\.\d+\.\d+\.\d+! This code was somehow lost when the IPv6 support was added. I propose to put it back in, at least for IPv4 (AF_INET). Patch anyone? --Guido van Rossum (home page: http://www.python.org/~guido/)

https://sourceforge.net/tracker/index.php?func=detail&aid=731209&group_id=5470&atid=305470 Unfortunately the code still goes through the idna encoding module - this is some overhead that it would be nice to avoid for all-numeric addresses. Anthony

Anthony Baxter wrote:
Ah. That could be the case - I think I'm loading the address from an XML file in the test case I used... will fix that.
If you mean "I'll fix the test case to not use XML anymore" - that might be reasonable. If you mean "I'll fix the test case to convert the Unicode arguments to byte strings before passing them to the socket module", I suggest that this should not be needed: the IDNA codec should complete quickly if the Unicode string is ASCII only (perhaps not as fast as converting the string to ASCII beforehand, but not significantly slower). Regards, Martin
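Martin's point that an ASCII-only string passes through the IDNA codec essentially untouched can be checked with Python's built-in "idna" codec. The sketch below uses present-day Python; the helper name `to_wire_name` and the example names are assumptions for illustration only.

```python
def to_wire_name(host):
    """Encode a text host name with the built-in 'idna' codec."""
    return host.encode("idna")


# ASCII-only input, including a dotted quad, comes back unchanged as bytes.
print(to_wire_name("127.0.0.1"))

# Only labels containing non-ASCII characters are rewritten to an xn-- form.
print(to_wire_name("b\u00fccher.example"))
```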

Anthony Baxter <anthony@interlink.com.au>:
Seems to me the basic problem is that we're representing two completely different things -- a DNS name and a raw IP address -- the same way, i.e. as a string. A raw IP address should (at least optionally) be represented by something different, such as a tuple of ints. Greg Ewing, Computer Science Dept, University of Canterbury, Christchurch, New Zealand, greg@cosc.canterbury.ac.nz | A citizen of NewZealandCorp, a wholly-owned subsidiary of USA Inc.

Why? There's never any ambiguity about which kind is intended. --Guido van Rossum (home page: http://www.python.org/~guido/)

Sean Reifschneider wrote:
I disagree. Python should expose the resolver library, and leave caching to it; many such libraries do caching already, in some form. The issue is different: In some cases the application just *knows* that an address is numeric, and that DNS lookup will fail. In these cases, lookup should be avoided - whether by explicit request from the application or by Python implicitly just knowing is a different issue. It turns out that Python doesn't need to 100% detect numeric addresses, as long as it would not classify addresses as numeric which aren't. Perhaps it is even possible to leave the "is numeric" test to the implementation of getaddrinfo, i.e. calling it twice (try numeric first, then try resolving the name)? Regards, Martin
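The "call it twice" idea suggested above (try a numeric-only parse first, then resolve the name) might look like this in present-day Python. It is a sketch of the control flow only, not the change that went into the socket module; the function name `resolve` and the injectable `getaddrinfo` parameter are assumptions added so the fallback can be exercised without a network.

```python
import socket


def resolve(host, port, getaddrinfo=socket.getaddrinfo):
    """Try a numeric-only parse first; fall back to a normal lookup."""
    try:
        # AI_NUMERICHOST: succeed only if host is already a literal address.
        return getaddrinfo(host, port, 0, socket.SOCK_STREAM, 0,
                           socket.AI_NUMERICHOST)
    except socket.gaierror:
        # Not a literal address, so let the resolver do its usual work.
        return getaddrinfo(host, port, 0, socket.SOCK_STREAM, 0, 0)


# A literal address is handled by the first call alone.
print(resolve("127.0.0.1", 111)[0][4])
```

The cost is one extra failed call for every real host name, and the first error is discarded, so this only works if that error can be safely ignored.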

Martin> It turns out that Python doesn't need to 100% detect numeric Martin> addresses, as long as it would not classify addresses as numeric Martin> which aren't. Perhaps it is even possible to leave the "is Martin> numeric" test to the implementation of getaddrinfo, i.e. calling Martin> it twice (try numeric first, then try resolving the name)? Can a top-level domain be all digits? If not, why not assume numeric if re.search(r"\.\d+$", addr) is not None? Skip
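Skip's heuristic is easy to try out. The sketch below wraps the regular expression from his message in a function (the name `looks_numeric` is an assumption) and runs it on a few strings that come up in this thread.

```python
import re


def looks_numeric(addr):
    """Skip's heuristic: the last dot-separated component is all digits."""
    return re.search(r"\.\d+$", addr) is not None


for addr in ("127.0.0.1", "127.1", "911.com", "2130706433", "host.123"):
    print(addr, looks_numeric(addr))
```

The test requires a dot, so a bare number such as 2130706433 is not flagged, while any name whose final label is all digits is, which is why the answer to the top-level-domain question matters.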

Skip Montanaro wrote:
Can a top-level domain be all digits?
It appears nobody here can answer this question with certainty. If the answer is "no", it is surprising that getaddrinfo implementations still make resolver calls in this case even if they are sure that those resolver calls fail. One would hope that people writing socket libraries should know the answer. Regards, Martin

On Wed, Apr 09, 2003 at 01:44:51PM -0500, Skip Montanaro wrote:
Can a top-level domain be all digits? If not, why not assume numeric if re.search(r"\.\d+$", addr) is not None?
I don't think anyone sane would create a top-level that's digits, particularly in the range of 0 to 255. That probably means that somebody is going to do it... ;-/ I think checking for 2 to 4 dotted octets in the range of 0 to 255 would be safest... Yes, you can probably get away with using the regex above, but I wouldn't want to. Sean -- Sucking all the marrow out of life doesn't mean choking on the bone. -- _Dead_Poet's_Society_ Sean Reifschneider, Inimitably Superfluous <jafo@tummy.com> tummy.com, ltd. - Linux Consulting since 1995. Qmail, Python, SysAdmin

Indeed, Anthony brought the example of 911.com, which has been registered despite being illegal.
At least 911 is greater than 255, which unfortunately isn't the case for 123. But all these would be caught by requiring a full 4-number address before deciding it's numeric. (I don't think it's worth allowing for 0-padding if there are less than 4 numbers.) Greg Ewing, Computer Science Dept, University of Canterbury, Christchurch, New Zealand, greg@cosc.canterbury.ac.nz | A citizen of NewZealandCorp, a wholly-owned subsidiary of USA Inc.

[MvL]
I disagree. Python should expose the resolver library, and leave caching to it; many such libraries do caching already, in some form.
Right.
The issue is different: In some cases the application just *knows* that an address is numeric, and that DNS lookup will fail.
In fact, I've often written code that passes a numeric address, and I've always assumed that in that case the code would take a shortcut because there's nothing to look up (only to parse).
Perhaps, as long as we can safely ignore the first error. This would probably be a little slower, but probably not slow enough to matter, and it sounds like a very general solution. --Guido van Rossum (home page: http://www.python.org/~guido/)

On Wed, Apr 09, 2003 at 08:36:17PM +0200, "Martin v. Löwis" wrote:
I disagree. Python should expose the resolver library, and leave caching to it; many such libraries do caching already, in some form.
Why don't we carry it to the logical conclusion and say that the resolver should also avoid doing a forward lookup on an already numeric IP? I've noticed that before the Red Hat 8.0 release, doing a "telnet <ip>" would usually be very fast on the initial connection, and since 8.0 it's been slow as if doing a lookup... To me that indicates that the resolver used to do this and has been changed to not, which makes me wonder why that was... Perhaps we're being too clever and it's going to come back to bite us? The "<numeric>" syntax would allow us to leave resolution as it is and let the user override it when they deem necessary. If we try to auto-detect (which I'm usually all for), we should probably implement a "<forcedns>" or similar? Sean -- Geek English Rule #7: To reduce redundancy, the word "scary" can be left out of any statement containing the phrase "scary java applet". Sean Reifschneider, Inimitably Superfluous <jafo@tummy.com> tummy.com, ltd. - Linux Consulting since 1995. Qmail, Python, SysAdmin

I think it's the other way around. The resolver lost some perfectly good caching in the upgrade to support IPv6. The designers probably didn't notice the difference because in their own setup, DNS is fast. I expect the caching will come back eventually.
YAGNI. --Guido van Rossum (home page: http://www.python.org/~guido/)

Are you sure that it tries make a DNS call even when the address is pure numeric? That seems a mistake, and if that's really happening, I think that is the part that should be fixed. Maybe in the _socket module, maybe in getaddrinfo().
I don't see why we would have to add the <numeric> flag to the address when the form of the address itself is already a perfect clue that the address is purely numeric. I'd be happy to see a patch that intercepts addresses of the form \d+\.\d+\.\d+\.\d+ and parses those without calling getaddrinfo(). --Guido van Rossum (home page: http://www.python.org/~guido/)

Thanks for your prompt reply! On Tuesday, April 8, 2003, at 09:50 AM, Guido van Rossum wrote:
Yes, it seems to do this. It sets the PASSIVE flags, but that doesn't seem to be quite enough to prevent DNS activity, although the NUMERIC flag does the job. This is true, at least, in 2.3.x on MacOSX, and since the socket stuff is all the same, I suspect it is true on many Unixes. Note that this doesn't happen on the MacOS9 version, which provides its own socket interface through GUSI, which apparently is smart enough to handle it.
Do we want this? The parser also then have to be modified when to handle numeric INET6 addresses, when they become popular. I actually did implement one of my trial versions this way, and it worked fine. There is one minor issue, too. In urllib, there are some calls to getaddrinfo to get (for maybe no good reason), CNAMEs of addresses. I would like some way to tag an address with a very strong comment that it is what it is, and I would like all further processing disabled. Also, a 'trial' parsing of an address for matching a a.b.c.d pattern each time is a lot more processor inensive than checking for <numeric> at the beginning. I am perfectly happy to implement it either way.
--Guido van Rossum (home page: http://www.python.org/~guido/)

On Tue, Apr 08, 2003 at 10:50:50AM -0400, Guido van Rossum wrote:
Are you sure that it tries make a DNS call even when the address is pure numeric? That seems a mistake, and if that's really happening, I
My first thought is that there should be a local DNS cache on the machine that is running these apps. My second thought is that Python could benefit from caching some lookup information...
It's not quite that easy. Beyond the IPV6 issues mentioned elsewhere, you'd also want to check "\d+.\d+" and "\d+\.\d+\.\d+". IP addresses will fill in missing ".0"s, which is particularly handy for accessing "127.1", which gets expanded to "127.0.0.1". Sean -- Rocky: "Do you know what an A-Bomb is?" Bullwinkle: "Of course. ``A Bomb'' is what some people call our show." Sean Reifschneider, Inimitably Superfluous <jafo@tummy.com> tummy.com, ltd. - Linux Consulting since 1995. Qmail, Python, SysAdmin

I don't want to build a cache into Python, it should already be part of libresolv.
The IPv6 folks can add their own cache.
I didn't even know this, and I think it's bad style to use something that obscure (most people would probably guess that 127.1 means 0.0.127.1 or 127.1.0.0). But since you seem to know about this stuff, perhaps you can submit a patch? --Guido van Rossum (home page: http://www.python.org/~guido/)

OK, I'll chime back in on the thread I started... I mostly have a question for Sean, since he seems to know the networking stuff well. Do you know of any reason why my original proposal (which is to allows IP addresses prefixed with <numeric> e.g. <numeric>127.0.0.1 to cause both the AI_PASSIVE _and_ AI_NUMERIC flags to get set when resolution is attempted, which basically causes parsing with not real resolution at all) would break any known or plausible networking standards? The current Python socket module basically hides this part of the BSD socket API, and I find it quite useful to be able to suppress DNS activity absolutely for some addresses. And for Guido: since this type of tag has already been used in Python (as <broadcast>), is there any reason why this solution is inelegant? Thanks. Marcus On Wednesday, April 9, 2003, at 08:51 AM, Guido van Rossum wrote:

OK, I'll chime back in on the thread I started... I mostly have a question for Sean, since he seems to know the networking stuff well.
I'll chime in nevertheless.
What are those flags? Which API uses them? I still don't understand why intercepting the all-numeric syntax isn't good enough, and why you want a <numeric> prefix.
The reason I'm reluctant to add a new notation is that AFAIK it would be unique to Python. It's better to stick to standard notations IMO. <broadcast> was probably a mistake, since it seems to mean the same as 0.0.0.0 (for IPv4). --Guido van Rossum (home page: http://www.python.org/~guido/)

On Wednesday, April 9, 2003, at 09:37 AM, Guido van Rossum wrote:
The getsockaddr call uses them (actually the correct name for one of the flags is AI_NUMERICHOST, not AI_NUMERIC as I originally stated), and its part of the BSD sockets library, which is basically what the python socketmodule wraps.
I still don't understand why intercepting the all-numeric syntax isn't good enough, and why you want a <numeric> prefix.
I guess intercepting all numeric is OK, it is just less efficient (since it requires a trial parsing of an address, which is wasted if it is not all numeric), and because it is so easy to implement <numeric>. However, all my operational goals are achieved if the old check for pure numeric is reinstated at the lowest level (probably in getsockaddrarg in socketmodule.c), so it is used everywhere.

The performance loss will be unmeasurable (parsing a string of at most 11 bytes against a very simple pattern). Compare that to the true cost of adding <numeric>: documentation has to be added (and dozens of books updated), and code that wants to use numeric addresses has to be changed.
Right.
You're right, this functionality should be made available. IMO the right solution is to make it a separate API in the socket module, not to add more syntax to the existing address parsing code. --Guido van Rossum (home page: http://www.python.org/~guido/)

Marcus Mendenhall wrote:
More importantly, it is part of RFC 2553, which Python uses; it is also part of Winsock2.
But isn't the same trial parsing needed to determine presence of the "<numeric>" flag? The trial parsing Guido proposes usually stops with the first letter in a non-numeric address, and accesses up to 16 letters for a numeric address. Regards, Martin

On Wednesday, April 9, 2003, at 01:49 PM, Martin v. Löwis wrote: the first compare avoids even a subroutine call in the most likely case (string does not begin with <numeric>) but then checks extremely quickly if it is right after that. Even though cpu time is cheap, we should save it for useful work. Marcus

Marcus Mendenhall wrote:
Even though cpu time is cheap, we should save it for useful work.
Saving a few cycles while having to complicate the interface is not the Python way. +1 on restoring the old sscanf code (or something similar to it). ObTrivia: IP addresses can be written as a single number (at least for many IP implementations). Try "ping 2130706433". Neil

Neil Schemenauer <nas@python.ca> writes:
For what it's worth, whenever I had network code that I wanted to accept names or addresses, I always distinguished them through an attempt using the platform inet_addr() system call. If that returns an error (-1), then I go ahead and process it as a name, otherwise I use the address it returns. inet_addr() will itself take care of validating that the address is legal (e.g., no octet over 255 and only up to 4 octets), padding values as necessary (e.g., x.y.z is processed as if z was a 16-bit value, x.z as if z was a 24-bit value, x as a 32-bit value), and permits decimal, octal or hexadecimal forms of the individual octets. I believe this behavior is portable and well defined. If you wanted the same code to work for IPv4 and IPv6, you'd probably want to use inet_pton() instead since inet_addr() only does IPv4, although that would lose the hex/octal options. You'd probably have to conditionalize that anyway since it might not be available on IPv4 only configurations, so I could see using inet_addr() for IPv4 and inet_pton() for IPv6.
ObTrivia: IP addresses can be written as a single number (at least for many IP implementations). Try "ping 2130706433".
That's part of the inet_addr() definition. When a single value is given as the string, it is assumed to be the complete 32-bit address value, and is stored directly without any byte rearrangement. So, 2130706433 is (127*2^24) + 1, or "127.0.0.1" - but then obviously you knew that :-) -- David
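The "try a numeric parse first, otherwise treat it as a name" approach David describes can be sketched in Python. This is only an illustration, not code from the thread: `classify_host` is a hypothetical helper name, and it uses `socket.inet_pton()` for both address families. As David notes, `inet_pton()` is stricter than `inet_addr()`, so shorthand forms such as 127.1 or a single 32-bit number would be treated as names by this sketch.

```python
import socket


def classify_host(host):
    """Return the address family if `host` is a numeric address, else None.

    Hypothetical helper: a strict numeric parse is attempted first, so the
    caller only needs to resolve the string as a name when this returns None.
    """
    for family in (socket.AF_INET, socket.AF_INET6):
        try:
            socket.inet_pton(family, host)
        except (OSError, ValueError):
            # Not a valid literal for this family (ValueError covers
            # strings with embedded NUL characters).
            continue
        return family
    return None


if __name__ == "__main__":
    for candidate in ("127.0.0.1", "::1", "www.python.org", "999.1.1.1"):
        print(candidate, classify_host(candidate))
```

A caller would pass the string to the resolver only when `classify_host()` returns None.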

Even though cpu time is cheap, we should save it for useful work.
With that attitude, I'm surprised you're using Python at all. :-) --Guido van Rossum (home page: http://www.python.org/~guido/)

Marcus Mendenhall <marcus.h.mendenhall@vanderbilt.edu>:
Just: if (string[0]=='<' && !strncmp(string,"<numeric>",9)) {whatever}
By the same token, checking whether the first char is a digit ought to weed out about 99.999% of all non-numeric domain name addresses. If this is even a problem, which I doubt. We're talking about something called from Python, for goodness sake... Greg Ewing, Computer Science Dept, University of Canterbury, Christchurch, New Zealand greg@cosc.canterbury.ac.nz | A citizen of NewZealandCorp, a wholly-owned subsidiary of USA Inc.

this is a fragment from RFC 1034 (DOMAIN NAMES - CONCEPTS AND FACILITIES) http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1034.html i'm not 100% sure that this is the "normative" definition, but if it is then it clearly requires a non-numeric initial character for each label. (sorry if someone has already mentioned this!) andrew
3.5 Preferred name syntax
The DNS specifications attempt to be as general as possible in the rules for constructing domain names. The idea is that the name of any existing object can be expressed as a domain name with minimal changes. However, when assigning a domain name for an object, the prudent user will select a name which satisfies both the rules of the domain system and any existing rules for the object, whether these rules are published or implied by existing programs.
For example, when naming a mail domain, the user should satisfy both the rules of this memo and those in RFC-822. When creating a new host name, the old rules for HOSTS.TXT should be followed. This avoids problems when old software is converted to use domain names.
The following syntax will result in fewer problems with many applications that use domain names (e.g., mail, TELNET).
    <domain> ::= <subdomain> | " "
    <subdomain> ::= <label> | <subdomain> "." <label>
    <label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
    <ldh-str> ::= <let-dig-hyp> | <let-dig-hyp> <ldh-str>
    <let-dig-hyp> ::= <let-dig> | "-"
    <let-dig> ::= <letter> | <digit>
    <letter> ::= any one of the 52 alphabetic characters A through Z in upper case and a through z in lower case
    <digit> ::= any one of the ten digits 0 through 9
Note that while upper and lower case letters are allowed in domain names, no significance is attached to the case. That is, two names with the same spelling but different case are to be treated as if identical.
The labels must follow the rules for ARPANET host names. They must start with a letter, end with a letter or digit, and have as interior characters only letters, digits, and hyphen.
There are also some restrictions on the length. Labels must be 63 characters or less. -- http://www.acooke.org/andrew
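The quoted grammar can be turned into a small checker. The following is a rough sketch, not anything proposed in the thread: `follows_preferred_syntax` is a made-up name, the regular expression is a direct transcription of the `<label>` rule, and the single-space root form of `<domain>` is ignored. It tests only the "preferred" syntax, and the replies that follow show that registered names such as 3com.com do not obey it.

```python
import re

# <label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
LABEL = re.compile(r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def follows_preferred_syntax(name):
    """True if every label of `name` matches the RFC 1034 section 3.5 grammar."""
    labels = name.split(".")
    # Each label must be at most 63 characters and match the <label> rule.
    return all(len(label) <= 63 and LABEL.fullmatch(label) for label in labels)


if __name__ == "__main__":
    for name in ("www.python.org", "3com.com", "4suite.org", "127.0.0.1"):
        print(name, follows_preferred_syntax(name))
```

Under this grammar an all-numeric dotted quad is never a valid name, which is the point andrew is making; whether real resolvers enforce it is a separate question.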

As is 3com.com, and, for a more python-related example, 4suite.org. The latter also has an A record. 411.com and 911.com are both valid domains, as is 123.com. With the appropriate resolv.conf search path (ie including '.com'), you could enter '123' and expect to get back the address 64.186.10.158. Isn't the DNS fun. Anthony -- Anthony Baxter <anthony@interlink.com.au> It's never too late to have a happy childhood.

On Wed, Apr 09, 2003 at 09:51:26AM -0400, Guido van Rossum wrote:
I didn't even know this, and I think it's bad style to use something that obscure
Perhaps... It's also bad style to break the obscure cases that are defined by the specifications... ;-)
(most people would probably guess that 127.1 means 0.0.127.1 or 127.1.0.0).
Yeah, unfortunately it's one of those cases that it doesn't really make sense until you actually know the padding happens, and then think about it... It really only makes sense to pad within the address because you are rarely going to have leading or trailing 0s in a network address. So, it pads before the trailing specified octet:
    10.1 => 10.0.0.1
    10.9.1 => 10.9.0.1
But since you seem to know about this stuff, perhaps you can submit a patch?
I've updated my local CVS repository, I'll see if I can get a change done on the airplane today. Sean -- The structure of a system reflects the structure of the organization that built it. -- Richard E. Fairley Sean Reifschneider, Inimitably Superfluous <jafo@tummy.com> tummy.com, ltd. - Linux Consulting since 1995. Qmail, Python, SysAdmin
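The padding rule discussed here (and the single-number form mentioned earlier in the thread) can be illustrated in pure Python. This is a sketch under stated assumptions, not the platform's `inet_addr()`: `expand_ipv4` is a hypothetical name, only decimal parts are handled, and parts with a leading zero are rejected because `inet_addr()` would read them as octal.

```python
def expand_ipv4(text):
    """Expand inet_addr()-style decimal shorthand into a full dotted quad.

    Sketch of the padding rule only: one to four decimal parts, where the
    last part fills all remaining low-order bytes. Returns None when the
    string does not fit the rule.
    """
    parts = text.split(".")
    if not 1 <= len(parts) <= 4:
        return None
    if not all(p.isascii() and p.isdigit() for p in parts):
        return None
    # Leading zeros mean octal to inet_addr(); this sketch refuses them.
    if any(len(p) > 1 and p[0] == "0" for p in parts):
        return None
    *leading, last = [int(p) for p in parts]
    # Each leading part is one byte; the last part covers what is left.
    remaining_bytes = 4 - len(leading)
    if any(n > 255 for n in leading) or last >= 256 ** remaining_bytes:
        return None
    value = 0
    for n in leading:
        value = (value << 8) | n
    value = (value << (8 * remaining_bytes)) | last
    return ".".join(str(b) for b in value.to_bytes(4, "big"))


if __name__ == "__main__":
    for text in ("10.1", "10.9.1", "127.1", "2130706433"):
        print(text, "=>", expand_ipv4(text))
```

With this rule 10.1 expands to 10.0.0.1 and 10.9.1 to 10.9.0.1, matching Sean's examples, and 2130706433 expands to 127.0.0.1.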

Sure. I propose to special-case only what we *absolutely* *know* we can handle, and if on closer inspection we can't (e.g. someone writes 999.999.999.999) we pass it on to the official code. Here's the 2.1 code, which takes that approach:
    if (sscanf(name, "%d.%d.%d.%d%c", &d1, &d2, &d3, &d4, &ch) == 4 &&
        0 <= d1 && d1 <= 255 && 0 <= d2 && d2 <= 255 &&
        0 <= d3 && d3 <= 255 && 0 <= d4 && d4 <= 255) {
            addr_ret->sin_addr.s_addr = htonl(
                    ((long) d1 << 24) | ((long) d2 << 16) |
                    ((long) d3 << 8) | ((long) d4 << 0));
            return 4;
    }
Great! --Guido van Rossum (home page: http://www.python.org/~guido/)
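For readers more at home in Python than C, the same special case can be approximated as follows. This is a loose port for illustration only: `parse_dotted_quad` is a hypothetical name, and the regular expression is slightly stricter than `sscanf` with `%d` (no leading whitespace or sign, at most three digits per part).

```python
import re

# Exactly four groups of ASCII digits separated by dots, nothing else.
_DOTTED_QUAD = re.compile(r"([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})")


def parse_dotted_quad(name):
    """Return the 4 packed address bytes, or None to fall through to the resolver."""
    match = _DOTTED_QUAD.fullmatch(name)
    if match is None:
        return None
    octets = [int(group) for group in match.groups()]
    if any(octet > 255 for octet in octets):
        # e.g. 999.999.999.999: let the official code decide what to do.
        return None
    return bytes(octets)


if __name__ == "__main__":
    for name in ("127.0.0.1", "999.999.999.999", "127.1", "123.com"):
        print(name, parse_dotted_quad(name))
```

Anything that is not an unambiguous four-part address returns None, so shorthand such as 127.1 and names such as 123.com would still go through the normal lookup path.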

It should be automatically recognized. Python has always done this (until 2.1 at least). I don't think there is any ambiguity; AFAIK it's not possible to put something in the DNS so that an all-numeric address gets remapped (that would be a nasty security problem waiting to happen). --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum <guido@python.org>:
AFAIK it's not possible to put something in the DNS so that an all-numeric address gets remapped
In that case, there's no problem at all, and I withdraw my suggestion about using tuples for numeric addresses. Greg Ewing, Computer Science Dept, University of Canterbury, Christchurch, New Zealand greg@cosc.canterbury.ac.nz | A citizen of NewZealandCorp, a wholly-owned subsidiary of USA Inc.

Ick ick. This is putting a bunch of code for a stub resolver into python. This stuff is hard to get right - I implemented this on top of pydns, and it was a lot of work to get (what I think is) correct, for not very much gain. The idea of either suppressing DNS lookups for all-numeric addresses, or some sort of extended API for suppressing DNS lookups might be better, but really, isn't this the job of the stub resolver? Anthony -- Anthony Baxter <anthony@interlink.com.au> It's never too late to have a happy childhood.

On Thu, Apr 10, 2003 at 12:24:45AM +1000, Anthony Baxter wrote:
Well, ideally you'd cache the data for as long as the SOA says to cache it. However, it sounds like in the situation that started this thread, even caching that data for some small but configurable number of seconds might help out.
Definitely, on both counts... I like the idea of the "<numeric>127.0.0.1" or otherwise somehow specifying that the address shouldn't be resolved. I wouldn't think that it'd be good to do lookups of purely IP addresses, but there is probably some obscure part of some spec that says it should happen. Contrary to popular belief, just because I know that IP addresses get padded with 0s, I'm not a networking lawyer. ;-) I learned that trick because it can help make dealing with IPV6 addresses much easier, but I've found it most useful with 127.1. Sean -- This message is REALLY offensive, so I ROT-13d it TWICE. -- Sean Reifschneider being silly on #python, 2000 Sean Reifschneider, Inimitably Superfluous <jafo@tummy.com> tummy.com, ltd. - Linux Consulting since 1995. Qmail, Python, SysAdmin

On woensdag, apr 9, 2003, at 16:40 Europe/Amsterdam, Sean Reifschneider wrote:
I wouldn't touch caching with a ten foot pole here: Python cannot know what happens under the hood of the network. For example, if I move my WiFi-equipped laptop from one location to another I don't want to be forced to restart my Python applications just to clear some silly cache, knowing that the OS and libc layers have handled the switch fine. (And, yes, Windoze-users are probably required to reboot anyway, but my Mac handles changing IP addresses just nicely:-) -- - Jack Jansen <Jack.Jansen@oratrix.com> http://www.cwi.nl/~jack - - If I can't dance I don't want to be part of your revolution -- Emma Goldman -

What I said.
Hey, I just figured it out. The old socket module (Python 2.1 and before) *did* special-case \d+\.\d+\.\d+\.\d+! This code was somehow lost when the IPv6 support was added. I propose to put it back in, at least for IPv4 (AF_INET). Patch anyone? --Guido van Rossum (home page: http://www.python.org/~guido/)

https://sourceforge.net/tracker/index.php?func=detail&aid=731209&group_id=5470&atid=305470 Unfortunately the code still goes through the idna encoding module - this is some overhead that it would be nice to avoid for all-numeric addresses. Anthony

Anthony Baxter wrote:
Ah. That could be the case - I think I'm loading the address from an XML file in the test case I used... will fix that.
If you mean "I'll fix the test case to not use XML anymore" - that might be reasonable. If you mean "I'll fix the test case to convert the Unicode arguments to byte strings before passing them to the socket module", I suggest that this should not be needed: the IDNA codec should complete quickly if the Unicode string is ASCII only (perhaps not as fast as converting the string to ASCII beforehand, but not significantly slower). Regards, Martin
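Martin's point about ASCII-only input can be checked with the standard "idna" codec. The snippet below is a small illustration in present-day Python 3, where every `str` is Unicode. It shows only that ASCII host strings come out unchanged; it does not measure the overhead Anthony is concerned about, and the example host names are arbitrary.

```python
# ASCII-only host strings pass through the "idna" codec unchanged.
numeric = "127.0.0.1".encode("idna")
plain = "www.python.org".encode("idna")

# A label with non-ASCII characters is converted to its Punycode form.
international = "bücher.example".encode("idna")

if __name__ == "__main__":
    print(numeric, plain, international)
```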

Anthony Baxter <anthony@interlink.com.au>:
Seems to me the basic problem is that we're representing two completely different things -- a DNS name and a raw IP address -- the same way, i.e. as a string. A raw IP address should (at least optionally) be represented by something different, such as a tuple of ints. Greg Ewing, Computer Science Dept, University of Canterbury, Christchurch, New Zealand greg@cosc.canterbury.ac.nz | A citizen of NewZealandCorp, a wholly-owned subsidiary of USA Inc.

Why? There's never any ambiguity about which kind is intended. --Guido van Rossum (home page: http://www.python.org/~guido/)

Sean Reifschneider wrote:
I disagree. Python should expose the resolver library, and leave caching to it; many such libraries do caching already, in some form. The issue is different: In some cases the application just *knows* that an address is numeric, and that DNS lookup will fail. In these cases, lookup should be avoided - whether by explicit request from the application or by Python implicitly just knowing is a different issue. It turns out that Python doesn't need to 100% detect numeric addresses, as long as it would not classify addresses as numeric which aren't. Perhaps it is even possible to leave the "is numeric" test to the implementation of getaddrinfo, i.e. calling it twice (try numeric first, then try resolving the name)? Regards, Martin
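Martin's "call it twice" idea can be sketched with today's `socket.getaddrinfo()`, which exposes the `AI_NUMERICHOST` flag. The helper names `numeric_lookup` and `lookup` are hypothetical, and restricting results to `SOCK_STREAM` is an assumption made only to keep the result list short.

```python
import socket


def numeric_lookup(host, port):
    """Parse `host` as a numeric address only; return None if it is not one."""
    try:
        return socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM, flags=socket.AI_NUMERICHOST
        )
    except socket.gaierror:
        # With AI_NUMERICHOST set, a non-numeric host is rejected
        # instead of being handed to the resolver.
        return None


def lookup(host, port):
    """Try the numeric-only parse first, then fall back to normal resolution."""
    result = numeric_lookup(host, port)
    if result is None:
        # Only this second call may involve the resolver (and thus DNS).
        result = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return result


if __name__ == "__main__":
    print(lookup("127.0.0.1", 80))
```

This leaves the "is it numeric?" decision to the platform's `getaddrinfo` implementation, at the cost of one extra call for real host names.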

Martin> It turns out that Python doesn't need to 100% detect numeric Martin> addresses, as long as it would not classify addresses as numeric Martin> which aren't. Perhaps it is even possible to leave the "is Martin> numeric" test to the implementation of getaddrinfo, i.e. calling Martin> it twice (try numeric first, then try resolving the name)? Can a top-level domain be all digits? If not, why not assume numeric if re.search(r"\.\d+$", addr) is not None? Skip

Skip Montanaro wrote:
Can a top-level domain be all digits?
It appears nobody here can answer this question with certainty. If the answer is "no", it is surprising that getaddrinfo implementations still make resolver calls in this case even if they are sure that those resolver calls fail. One would hope that people writing socket libraries should know the answer. Regards, Martin

On Wed, Apr 09, 2003 at 01:44:51PM -0500, Skip Montanaro wrote:
Can a top-level domain be all digits? If not, why not assume numeric if re.search(r"\.\d+$", addr) is not None?
I don't think anyone sane would create a top-level that's digits, particularly in the range of 0 to 255. That probably means that somebody is going to do it... ;-/ I think checking for 2 to 4 dotted octets in the range of 0 to 255 would be safest... Yes, you can probably get away with using the regex above, but I wouldn't want to. Sean -- Sucking all the marrow out of life doesn't mean choking on the bone. -- _Dead_Poet's_Society_ Sean Reifschneider, Inimitably Superfluous <jafo@tummy.com> tummy.com, ltd. - Linux Consulting since 1995. Qmail, Python, SysAdmin
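To see what Skip's heuristic accepts and rejects, it helps to run it over a few of the addresses mentioned in the thread. The wrapper name `looks_numeric` and the sample list are illustrative choices; `host.123` is an invented name standing in for the all-digit top-level domain Sean worries about.

```python
import re


def looks_numeric(addr):
    """Skip's proposed test: does the last dotted component consist of digits?"""
    return re.search(r"\.\d+$", addr) is not None


samples = ["127.0.0.1", "10.1", "911.com", "www.python.org", "123", "host.123"]
results = {addr: looks_numeric(addr) for addr in samples}

if __name__ == "__main__":
    for addr, verdict in results.items():
        print(addr, verdict)
```

The heuristic classifies 911.com as a name, but it would also treat something like host.123 as numeric, and it says nothing about a bare 123, which is why Sean prefers checking for dotted octets in the range 0 to 255.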

Indeed, Anthony brought the example of 911.com, which has been registered despite being illegal.
At least 911 is greater than 255, which unfortunately isn't the case for 123. But all these would be caught by requiring a full 4-number address before deciding it's numeric. (I don't think it's worth allowing for 0-padding if there are less than 4 numbers.) Greg Ewing, Computer Science Dept, University of Canterbury, Christchurch, New Zealand greg@cosc.canterbury.ac.nz | A citizen of NewZealandCorp, a wholly-owned subsidiary of USA Inc.

[MvL]
I disagree. Python should expose the resolver library, and leave caching to it; many such libraries do caching already, in some form.
Right.
The issue is different: In some cases the application just *knows* that an address is numeric, and that DNS lookup will fail.
In fact, I've often written code that passes a numeric address, and I've always assumed that in that case the code would take a shortcut because there's nothing to look up (only to parse).
Perhaps, as long as we can safely ignore the first error. This would probably be a little slower, but probably not slow enough to matter, and it sounds like a very general solution. --Guido van Rossum (home page: http://www.python.org/~guido/)

On Wed, Apr 09, 2003 at 08:36:17PM +0200, "Martin v. Löwis" wrote:
I disagree. Python should expose the resolver library, and leave caching to it; many such libraries do caching already, in some form.
Why don't we carry it to the logical conclusion and say that the resolver should also avoid doing a forward lookup on an already numeric IP? I've noticed that before the Red Hat 8.0 release, doing a "telnet <ip>" would usually be very fast on the initial connection, and since 8.0 it's been slow as if doing a lookup... To me that indicates that the resolver used to do this and has been changed to not, which makes me wonder why that was... Perhaps we're being too clever and it's going to come back to bite us? The "<numeric>" syntax would allow us to leave resolution as it is and let the user override it when they deem necessary. If we try to auto-detect (which I'm usually all for), we should probably implement a "<forcedns>" or similar? Sean -- Geek English Rule #7: To reduce redundancy, the word "scary" can be left out of any statement containing the phrase "scary java applet". Sean Reifschneider, Inimitably Superfluous <jafo@tummy.com> tummy.com, ltd. - Linux Consulting since 1995. Qmail, Python, SysAdmin

I think it's the other way around. The resolver lost some perfectly good caching in the upgrade to support IPv6. The designers probably didn't notice the difference because in their own setup, DNS is fast. I expect the caching will come back eventually.
YAGNI. --Guido van Rossum (home page: http://www.python.org/~guido/)
participants (14)
- "Martin v. Löwis"
- andrew cooke
- Anthony Baxter
- David Bolen
- Gisle Aas
- Greg Ewing
- Guido van Rossum
- Jack Jansen
- Marcus Mendenhall
- martin@v.loewis.de
- Neil Schemenauer
- Paul Svensson
- Sean Reifschneider
- Skip Montanaro