<span class="Apple-style-span" style="font-family: Times; font-size: 16px; "><div style="margin-top: 8px; margin-right: 8px; margin-bottom: 8px; margin-left: 8px; font: normal normal normal small/normal arial; ">I'm just getting ready to start the semester using my new book (Python Programming in Context) and noticed that I somehow missed all the changes to urllib in Python 3.0. ARGH, to say the least. I like using urllib in the intro class because it lets us get data from places that are more interesting, motivating, and relevant to the students.<div>
<br><div>Here are some of my observations on trying to do very basic stuff with urllib:</div><div><br></div><div>1. urllib.urlopen is now urllib.request.urlopen.</div><div>2. The object returned by urlopen is no longer iterable! No more "for line in url".</div>
<div>3. read, readline, and readlines now return bytes objects (or lists of bytes objects) instead of a str or a list of str.</div><div>4. Taking the naive approach to converting a bytes object to a str does not work as you would expect.<br clear="all">
<br></div><div><div>>>> import urllib.request</div><div>>>> page = urllib.request.urlopen('<a href="http://knuth.luther.edu/test.html">http://knuth.luther.edu/test.html</a>')</div><div>>>> page</div>
<div>&lt;addinfourl at 16419792 whose fp = &lt;socket.SocketIO object at 0xfa8570&gt;&gt;</div><div>>>> line = page.readline()</div><div>>>> line</div><div>b'&lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\n'</div>
<div>>>> str(line)</div><div>'b\'&lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\\n\''</div><div>>>> </div><div><br></div><div>As you can see from the example, the 'b' prefix becomes part of the string! It seems like this should be a bug. Is it?</div>
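<div><br></div><div>For what it's worth, here is a minimal sketch of what seems to be going on with the 'b' showing up in the string. It assumes the data is UTF-8 text, and the sample bytes are made up rather than taken from the real page. Calling str() on a bytes object without an encoding appears to return its repr, while decoding with an explicit encoding returns the text:</div>

```python
# Made-up sample bytes standing in for a line read from a web page.
raw_line = b"Hello, world\n"

# str() with no encoding gives the repr of the bytes object, 'b' prefix included.
as_repr = str(raw_line)

# Decoding with an explicit encoding gives the text itself (UTF-8 is assumed).
as_text = raw_line.decode("utf-8")

# Passing the encoding to str() is equivalent to calling decode().
also_text = str(raw_line, "utf-8")

print(as_repr)
print(as_text, end="")
```

<div>If that reading is right, the behaviour is by design rather than a bug, and the encoding has to be chosen explicitly for each page.</div>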
<div><br></div><div><br></div><div>Here's the iteration problem:</div><div><div>>>> for line in page:</div>
<div><span class="Apple-tab-span" style="white-space: pre; ">        </span>print(line)</div><div><br></div><div>Traceback (most recent call last):</div><div> File "&lt;pyshell#10&gt;", line 1, in &lt;module&gt;</div>
<div> for line in page:</div><div>TypeError: 'addinfourl' object is not iterable</div><div><br></div><div>Why is this not iterable anymore? Is this a bug too? What the heck is an addinfourl object?</div><div>
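<div>One possible workaround for the iteration problem, sketched under the assumption that the response object still offers a readline() method that returns bytes (as in the session above). io.BytesIO stands in for the page so the snippet runs without a network connection, and iter_lines is a hypothetical helper name, not part of urllib:</div>

```python
import io


def iter_lines(response):
    """Return an iterator over the lines of any object with a bytes readline()."""
    # iter(callable, sentinel) keeps calling readline() until it returns b"".
    return iter(response.readline, b"")


# Stand-in for the object returned by urlopen (an assumption for this sketch).
fake_page = io.BytesIO(b"first line\nsecond line\n")

# Decode each raw line (UTF-8 assumed) and strip the trailing newline.
lines = [raw.decode("utf-8").rstrip("\n") for raw in iter_lines(fake_page)]
print(lines)
```

<div>Calling readlines() and looping over the resulting list should work as well, at the cost of reading the whole page into memory first.</div>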
<br></div><div><br></div><div>5. Finally, I see that a bytes object has some of the same methods as strings. But the error messages are confusing.</div><div><br></div><div><div>>>> line</div><div>b' "<a href="http://www.w3.org/TR/html4/loose.dtd">http://www.w3.org/TR/html4/loose.dtd</a>">\n'</div>
<div>>>> line.find('www')</div><div>Traceback (most recent call last):</div><div> File "&lt;pyshell#18&gt;", line 1, in &lt;module&gt;</div><div> line.find('www')</div><div>TypeError: expected an object with the buffer interface</div>
<div>>>> line.find(b'www')</div><div>11</div><div><br></div><div>Why can't find take a str as a parameter?</div><div><br></div><div>If folks have advice on which, if any, of these are bugs, please let me know and I'll file them and, if possible, work on fixes for them too.</div>
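<div><br></div><div>Regarding the find question in item 5, here is a small sketch of the two consistent options: search the bytes with a bytes pattern, or decode first and search with a str. The sample line is invented for illustration, and ASCII is assumed as the encoding:</div>

```python
# Invented sample line, similar in spirit to the one in the session above.
raw_line = b"see http://www.w3.org/TR/html4/loose.dtd\n"

# Option 1: stay in bytes and search with a bytes pattern.
bytes_index = raw_line.find(b"www")

# Option 2: decode to str first (ASCII assumed), then search with a str pattern.
text_line = raw_line.decode("ascii")
text_index = text_line.find("www")

# Mixing the two types is rejected with a TypeError rather than converted silently.
try:
    raw_line.find("www")
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(bytes_index, text_index, mixed_ok)
```

<div>For an intro class, decoding once right after reading and then working only with str is probably the simpler habit to teach.</div>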
<div><br></div><div>If you have advice on how I could better teach this new urllib, that would be great to hear as well.<br></div><div><br></div><div><br></div><div>Thanks,</div><div><br></div><div>Brad</div><div><br>
</div></div></div>-- <br>Brad Miller<br>Assistant Professor, Computer Science<br>Luther College</div></div></div></span>