Update of /cvsroot/python/python/dist/src/Doc/howto In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv32499 Added Files: Makefile advocacy.tex curses.tex doanddont.tex regex.tex rexec.tex sockets.tex sorting.tex unicode.rst Log Message: Commit the howto source to the main Python repository, with Fred's approval --- NEW FILE: Makefile --- MKHOWTO=../tools/mkhowto WEBDIR=. RSTARGS = --input-encoding=utf-8 VPATH=.:dvi:pdf:ps:txt # List of HOWTOs that aren't to be processed REMOVE_HOWTO = # Determine list of files to be built HOWTO=$(filter-out $(REMOVE_HOWTO),$(wildcard *.tex)) RST_SOURCES = $(shell echo *.rst) DVI =$(patsubst %.tex,%.dvi,$(HOWTO)) PDF =$(patsubst %.tex,%.pdf,$(HOWTO)) PS =$(patsubst %.tex,%.ps,$(HOWTO)) TXT =$(patsubst %.tex,%.txt,$(HOWTO)) HTML =$(patsubst %.tex,%,$(HOWTO)) # Rules for building various formats %.dvi : %.tex $(MKHOWTO) --dvi $< mv $@ dvi %.pdf : %.tex $(MKHOWTO) --pdf $< mv $@ pdf %.ps : %.tex $(MKHOWTO) --ps $< mv $@ ps %.txt : %.tex $(MKHOWTO) --text $< mv $@ txt % : %.tex $(MKHOWTO) --html --iconserver="." $< tar -zcvf html/$*.tgz $* #zip -r html/$*.zip $* default: @echo "'all' -- build all files" @echo "'dvi', 'pdf', 'ps', 'txt', 'html' -- build one format" all: $(HTML) .PHONY : dvi pdf ps txt html rst dvi: $(DVI) pdf: $(PDF) ps: $(PS) txt: $(TXT) html:$(HTML) # Rule to build collected tar files dist: #all for i in dvi pdf ps txt ; do \ cd $$i ; \ tar -zcf All.tgz *.$$i ;\ cd .. 
;\ done # Rule to copy files to the Web tree on AMK's machine web: dist cp dvi/* $(WEBDIR)/dvi cp ps/* $(WEBDIR)/ps cp pdf/* $(WEBDIR)/pdf cp txt/* $(WEBDIR)/txt for dir in $(HTML) ; do cp -rp $$dir $(WEBDIR) ; done for ltx in $(HOWTO) ; do cp -p $$ltx $(WEBDIR)/latex ; done rst: unicode.html %.html: %.rst rst2html $(RSTARGS) $< >$@ clean: rm -f *~ *.log *.ind *.l2h *.aux *.toc *.how rm -f *.dvi *.ps *.pdf *.bkm rm -f unicode.html clobber: rm dvi/* ps/* pdf/* txt/* html/* --- NEW FILE: advocacy.tex --- \documentclass{howto} \title{Python Advocacy HOWTO} \release{0.03} \author{A.M. Kuchling} \authoraddress{\email{amk@amk.ca}} \begin{document} \maketitle \begin{abstract} \noindent It's usually difficult to get your management to accept open source software, and Python is no exception to this rule. This document discusses reasons to use Python, strategies for winning acceptance, facts and arguments you can use, and cases where you \emph{shouldn't} try to use Python. This document is available from the Python HOWTO page at \url{http://www.python.org/doc/howto}. \end{abstract} \tableofcontents \section{Reasons to Use Python} There are several reasons to incorporate a scripting language into your development process, and this section will discuss them, and why Python has some properties that make it a particularly good choice. \subsection{Programmability} Programs are often organized in a modular fashion. Lower-level operations are grouped together, and called by higher-level functions, which may in turn be used as basic operations by still further upper levels. For example, the lowest level might define a very low-level set of functions for accessing a hash table. The next level might use hash tables to store the headers of a mail message, mapping a header name like \samp{Date} to a value such as \samp{Tue, 13 May 1997 20:00:54 -0400}. 
A yet higher level may operate on message objects, without knowing or caring that message headers are stored in a hash table, and so forth.

Often, the lowest levels do very simple things; they implement a data structure such as a binary tree or hash table, or they perform some simple computation, such as converting a date string to a number. The higher levels then contain logic connecting these primitive operations. Using this approach, the primitives can be seen as basic building blocks which are then glued together to produce the complete product.

Why is this design approach relevant to Python? Because Python is well suited to functioning as such a glue language. A common approach is to write a Python module that implements the lower level operations; for the sake of speed, the implementation might be in C, Java, or even Fortran. Once the primitives are available to Python programs, the logic underlying higher level operations is written in the form of Python code. The high-level logic is then more understandable, and easier to modify.

John Ousterhout wrote a paper that explains this idea at greater length, entitled ``Scripting: Higher Level Programming for the 21st Century''. I recommend that you read this paper; see the references for the URL. Ousterhout is the inventor of the Tcl language, and therefore argues that Tcl should be used for this purpose; he only briefly refers to other languages such as Python, Perl, and Lisp/Scheme, but in reality, Ousterhout's argument applies to scripting languages in general, since you could equally write extensions for any of the languages mentioned above.
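The layering described earlier in this subsection (a hash table, then mail headers stored in it, then message objects on top) might be sketched in Python roughly as follows. This is only an illustration: \code{parse_headers} and \code{Message} are names invented for this sketch, and the built-in \code{dict} stands in for the low-level hash table that a real project might implement in C.

```python
# Lowest level: the hash table. Python's built-in dict plays this role
# here; in a real project this layer might be a C extension instead.


# Middle level: mail headers stored in a hash table.
# (parse_headers is a hypothetical helper, named for this sketch only.)
def parse_headers(lines):
    """Map header names such as 'Date' to their values."""
    headers = {}
    for line in lines:
        # Split on the first colon only, so values may contain colons.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


# Top level: message objects. Callers never see the hash table.
# (Message is likewise a hypothetical class.)
class Message:
    def __init__(self, header_lines, body):
        self._headers = parse_headers(header_lines)
        self.body = body

    def header(self, name, default=None):
        """Return the value of one header, or default if it is absent."""
        return self._headers.get(name, default)


msg = Message(
    ["Date: Tue, 13 May 1997 20:00:54 -0400", "Subject: Glue languages"],
    "Python connects the primitives.",
)
print(msg.header("Date"))
```

Code that uses \code{Message} only calls \code{header()}, so the storage underneath could be swapped for a faster implementation without touching the high-level logic.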
\subsection{Prototyping}

In \emph{The Mythical Man-Month}, Frederick Brooks suggests the following rule when planning software projects: ``Plan to throw one away; you will anyway.'' Brooks is saying that the first attempt at a software design often turns out to be wrong; unless the problem is very simple or you're an extremely good designer, you'll find that new requirements and features become apparent once development has actually started. If these new requirements can't be cleanly incorporated into the program's structure, you're presented with two unpleasant choices: hammer the new features into the program somehow, or scrap everything and write a new version of the program, taking the new features into account from the beginning.

Python provides you with a good environment for quickly developing an initial prototype. That lets you get the overall program structure and logic right, and you can fine-tune small details in the fast development cycle that Python provides. Once you're satisfied with the GUI interface or program output, you can translate the Python code into C++, Fortran, Java, or some other compiled language.

Prototyping means you have to be careful not to use too many Python features that are hard to implement in your other language. Using \code{eval()}, or regular expressions, or the \module{pickle} module, means that you're going to need C or Java libraries for formula evaluation, regular expressions, and serialization, for example. But it's not hard to avoid such tricky code, and in the end the translation usually isn't very difficult. The resulting code can be rapidly debugged, because any serious logical errors will have been removed from the prototype, leaving only more minor slip-ups in the translation to track down.

This strategy builds on the earlier discussion of programmability. Using Python as glue to connect lower-level components has obvious relevance for constructing prototype systems.
In this way Python can help you with development, even if end users never come in contact with Python code at all. If the performance of the Python version is adequate and corporate politics allow it, you may not need to do a translation into C or Java, but it can still be faster to develop a prototype and then translate it, instead of attempting to produce the final version immediately. One example of this development strategy is Microsoft Merchant Server. Version 1.0 was written in pure Python, by a company that subsequently was purchased by Microsoft. Version 2.0 began to translate the code into \Cpp, shipping with some \Cpp code and some Python code. Version 3.0 didn't contain any Python at all; all the code had been translated into \Cpp. Even though the product doesn't contain a Python interpreter, the Python language has still served a useful purpose by speeding up development. This is a very common use for Python. Past conference papers have also described this approach for developing high-level numerical algorithms; see David M. Beazley and Peter S. Lomdahl's paper ``Feeding a Large-scale Physics Application to Python'' in the references for a good example. If an algorithm's basic operations are things like "Take the inverse of this 4000x4000 matrix", and are implemented in some lower-level language, then Python has almost no additional performance cost; the extra time required for Python to evaluate an expression like \code{m.invert()} is dwarfed by the cost of the actual computation. It's particularly good for applications where seemingly endless tweaking is required to get things right. GUI interfaces and Web sites are prime examples. 
The Python code is also shorter and faster to write (once you're familiar with Python), so it's easier to throw it away if you decide your approach was wrong; if you'd spent two weeks working on it instead of just two hours, you might waste time trying to patch up what you've got out of a natural reluctance to admit that those two weeks were wasted. Truthfully, those two weeks haven't been wasted, since you've learnt something about the problem and the technology you're using to solve it, but it's human nature to view this as a failure of some sort. \subsection{Simplicity and Ease of Understanding} Python is definitely \emph{not} a toy language that's only usable for small tasks. The language features are general and powerful enough to enable it to be used for many different purposes. It's useful at the small end, for 10- or 20-line scripts, but it also scales up to larger systems that contain thousands of lines of code. However, this expressiveness doesn't come at the cost of an obscure or tricky syntax. While Python has some dark corners that can lead to obscure code, there are relatively few such corners, and proper design can isolate their use to only a few classes or modules. It's certainly possible to write confusing code by using too many features with too little concern for clarity, but most Python code can look a lot like a slightly-formalized version of human-understandable pseudocode. In \emph{The New Hacker's Dictionary}, Eric S. Raymond gives the following definition for "compact": \begin{quotation} Compact \emph{adj.} Of a design, describes the valuable property that it can all be apprehended at once in one's head. This generally means the thing created from the design can be used with greater facility and fewer errors than an equivalent tool that is not compact. Compactness does not imply triviality or lack of power; for example, C is compact and FORTRAN is not, but C is more powerful than FORTRAN. 
Designs become non-compact through accreting features and cruft that don't merge cleanly into the overall design scheme (thus, some fans of Classic C maintain that ANSI C is no longer compact). \end{quotation} (From \url{http://sagan.earthspace.net/jargon/jargon_18.html\#SEC25}) In this sense of the word, Python is quite compact, because the language has just a few ideas, which are used in lots of places. Take namespaces, for example. Import a module with \code{import math}, and you create a new namespace called \samp{math}. Classes are also namespaces that share many of the properties of modules, and have a few of their own; for example, you can create instances of a class. Instances? They're yet another namespace. Namespaces are currently implemented as Python dictionaries, so they have the same methods as the standard dictionary data type: .keys() returns all the keys, and so forth. This simplicity arises from Python's development history. The language syntax derives from different sources; ABC, a relatively obscure teaching language, is one primary influence, and Modula-3 is another. (For more information about ABC and Modula-3, consult their respective Web sites at \url{http://www.cwi.nl/~steven/abc/} and \url{http://www.m3.org}.) Other features have come from C, Icon, Algol-68, and even Perl. Python hasn't really innovated very much, but instead has tried to keep the language small and easy to learn, building on ideas that have been tried in other languages and found useful. Simplicity is a virtue that should not be underestimated. It lets you learn the language more quickly, and then rapidly write code, code that often works the first time you run it. \subsection{Java Integration} If you're working with Java, Jython (\url{http://www.jython.org/}) is definitely worth your attention. Jython is a re-implementation of Python in Java that compiles Python code into Java bytecodes. The resulting environment has very tight, almost seamless, integration with Java. 
It's trivial to access Java classes from Python, and you can write Python classes that subclass Java classes. Jython can be used for prototyping Java applications in much the same way CPython is used, and it can also be used for test suites for Java code, or embedded in a Java application to add scripting capabilities. \section{Arguments and Rebuttals} Let's say that you've decided upon Python as the best choice for your application. How can you convince your management, or your fellow developers, to use Python? This section lists some common arguments against using Python, and provides some possible rebuttals. \emph{Python is freely available software that doesn't cost anything. How good can it be?} Very good, indeed. These days Linux and Apache, two other pieces of open source software, are becoming more respected as alternatives to commercial software, but Python hasn't had all the publicity. Python has been around for several years, with many users and developers. Accordingly, the interpreter has been used by many people, and has gotten most of the bugs shaken out of it. While bugs are still discovered at intervals, they're usually either quite obscure (they'd have to be, for no one to have run into them before) or they involve interfaces to external libraries. The internals of the language itself are quite stable. Having the source code should be viewed as making the software available for peer review; people can examine the code, suggest (and implement) improvements, and track down bugs. To find out more about the idea of open source code, along with arguments and case studies supporting it, go to \url{http://www.opensource.org}. \emph{Who's going to support it?} Python has a sizable community of developers, and the number is still growing. The Internet community surrounding the language is an active one, and is worth being considered another one of Python's advantages. Most questions posted to the comp.lang.python newsgroup are quickly answered by someone. 
Should you need to dig into the source code, you'll find it's clear and well-organized, so it's not very difficult to write extensions and track down bugs yourself. If you'd prefer to pay for support, there are companies and individuals who offer commercial support for Python. \emph{Who uses Python for serious work?} Lots of people; one interesting thing about Python is the surprising diversity of applications that it's been used for. People are using Python to: \begin{itemize} \item Run Web sites \item Write GUI interfaces \item Control number-crunching code on supercomputers \item Make a commercial application scriptable by embedding the Python interpreter inside it \item Process large XML data sets \item Build test suites for C or Java code \end{itemize} Whatever your application domain is, there's probably someone who's used Python for something similar. Yet, despite being useable for such high-end applications, Python's still simple enough to use for little jobs. See \url{http://www.python.org/psa/Users.html} for a list of some of the organizations that use Python. \emph{What are the restrictions on Python's use?} They're practically nonexistent. Consult the \file{Misc/COPYRIGHT} file in the source distribution, or \url{http://www.python.org/doc/Copyright.html} for the full language, but it boils down to three conditions. \begin{itemize} \item You have to leave the copyright notice on the software; if you don't include the source code in a product, you have to put the copyright notice in the supporting documentation. \item Don't claim that the institutions that have developed Python endorse your product in any way. \item If something goes wrong, you can't sue for damages. Practically all software licences contain this condition. \end{itemize} Notice that you don't have to provide source code for anything that contains Python or is built with it. 
Also, the Python interpreter and accompanying documentation can be modified and redistributed in any way you like, and you don't have to pay anyone any licensing fees at all. \emph{Why should we use an obscure language like Python instead of well-known language X?} I hope this HOWTO, and the documents listed in the final section, will help convince you that Python isn't obscure, and has a healthily growing user base. One word of advice: always present Python's positive advantages, instead of concentrating on language X's failings. People want to know why a solution is good, rather than why all the other solutions are bad. So instead of attacking a competing solution on various grounds, simply show how Python's virtues can help. \section{Useful Resources} \begin{definitions} \term{\url{http://www.fsbassociates.com/books/pythonchpt1.htm}} The first chapter of \emph{Internet Programming with Python} also examines some of the reasons for using Python. The book is well worth buying, but the publishers have made the first chapter available on the Web. \term{\url{http://home.pacbell.net/ouster/scripting.html}} John Ousterhout's white paper on scripting is a good argument for the utility of scripting languages, though naturally enough, he emphasizes Tcl, the language he developed. Most of the arguments would apply to any scripting language. \term{\url{http://www.python.org/workshops/1997-10/proceedings/beazley.html}} The authors, David M. Beazley and Peter S. Lomdahl, describe their use of Python at Los Alamos National Laboratory. It's another good example of how Python can help get real work done. This quotation from the paper has been echoed by many people: \begin{quotation} Originally developed as a large monolithic application for massively parallel processing systems, we have used Python to transform our application into a flexible, highly modular, and extremely powerful system for performing simulation, data analysis, and visualization. 
In addition, we describe how Python has solved a number of important problems related to the development, debugging, deployment, and maintenance of scientific software. \end{quotation} %\term{\url{http://www.pythonjournal.com/volume1/art-interview/}} %This interview with Andy Feit, discussing Infoseek's use of Python, can be %used to show that choosing Python didn't introduce any difficulties %into a company's development process, and provided some substantial benefits. \term{\url{http://www.python.org/psa/Commercial.html}} Robin Friedrich wrote this document on how to support Python's use in commercial projects. \term{\url{http://www.python.org/workshops/1997-10/proceedings/stein.ps}} For the 6th Python conference, Greg Stein presented a paper that traced Python's adoption and usage at a startup called eShop, and later at Microsoft. \term{\url{http://www.opensource.org}} Management may be doubtful of the reliability and usefulness of software that wasn't written commercially. This site presents arguments that show how open source software can have considerable advantages over closed-source software. \term{\url{http://sunsite.unc.edu/LDP/HOWTO/mini/Advocacy.html}} The Linux Advocacy mini-HOWTO was the inspiration for this document, and is also well worth reading for general suggestions on winning acceptance for a new technology, such as Linux or Python. In general, you won't make much progress by simply attacking existing systems and complaining about their inadequacies; this often ends up looking like unfocused whining. It's much better to point out some of the many areas where Python is an improvement over other systems. \end{definitions} \end{document} --- NEW FILE: curses.tex --- \documentclass{howto} \title{Curses Programming with Python} \release{2.01} \author{A.M. Kuchling, Eric S. 
Raymond}
\authoraddress{\email{amk@amk.ca}, \email{esr@thyrsus.com}}

\begin{document}
\maketitle

\begin{abstract}
\noindent
This document describes how to write text-mode programs with Python 2.x, using the \module{curses} extension module to control the display.

This document is available from the Python HOWTO page at \url{http://www.python.org/doc/howto}.
\end{abstract}

\tableofcontents

\section{What is curses?}

The curses library supplies a terminal-independent screen-painting and keyboard-handling facility for text-based terminals; such terminals include VT100s, the Linux console, and the simulated terminal provided by X11 programs such as xterm and rxvt. Display terminals support various control codes to perform common operations such as moving the cursor, scrolling the screen, and erasing areas. Different terminals use widely differing codes, and often have their own minor quirks.

In a world of X displays, one might ask ``why bother''? It's true that character-cell display terminals are an obsolete technology, but there are niches in which being able to do fancy things with them is still valuable. One is on small-footprint or embedded Unixes that don't carry an X server. Another is for tools like OS installers and kernel configurators that may have to run before X is available.

The curses library hides all the details of different terminals, and provides the programmer with an abstraction of a display, containing multiple non-overlapping windows. The contents of a window can be changed in various ways--adding text, erasing it, changing its appearance--and the curses library will automagically figure out what control codes need to be sent to the terminal to produce the right output.

The curses library was originally written for BSD Unix; the later System V versions of Unix from AT\&T added many enhancements and new functions. BSD curses is no longer maintained, having been replaced by ncurses, which is an open-source implementation of the AT\&T interface.
If you're using an open-source Unix such as Linux or FreeBSD, your system almost certainly uses ncurses. Since most current commercial Unix versions are based on System V code, all the functions described here will probably be available. The older versions of curses carried by some proprietary Unixes may not support everything, though.

No one has made a Windows port of the curses module. On a Windows platform, try the Console module written by Fredrik Lundh. The Console module provides cursor-addressable text output, plus full support for mouse and keyboard input, and is available from \url{http://effbot.org/efflib/console}.

\subsection{The Python curses module}

The Python module is a fairly simple wrapper over the C functions provided by curses; if you're already familiar with curses programming in C, it's really easy to transfer that knowledge to Python. The biggest difference is that the Python interface makes things simpler, by merging different C functions such as \function{addstr}, \function{mvaddstr}, \function{mvwaddstr}, into a single \method{addstr()} method. You'll see this covered in more detail later.

This HOWTO is simply an introduction to writing text-mode programs with curses and Python. It doesn't attempt to be a complete guide to the curses API; for that, see the Python library guide's section on ncurses, and the C manual pages for ncurses. It will, however, give you the basic ideas.

\section{Starting and ending a curses application}

Before doing anything, curses must be initialized. This is done by calling the \function{initscr()} function, which will determine the terminal type, send any required setup codes to the terminal, and create various internal data structures. If successful, \function{initscr()} returns a window object representing the entire screen; this is usually called \code{stdscr}, after the name of the corresponding C variable.
\begin{verbatim}
import curses
stdscr = curses.initscr()
\end{verbatim}

Usually curses applications turn off automatic echoing of keys to the screen, in order to be able to read keys and only display them under certain circumstances. This requires calling the \function{noecho()} function.

\begin{verbatim}
curses.noecho()
\end{verbatim}

Applications will also commonly need to react to keys instantly, without requiring the Enter key to be pressed; this is called cbreak mode, as opposed to the usual buffered input mode.

\begin{verbatim}
curses.cbreak()
\end{verbatim}

Terminals usually return special keys, such as the cursor keys or navigation keys such as Page Up and Home, as a multibyte escape sequence. While you could write your application to expect such sequences and process them accordingly, curses can do it for you, returning a special value such as \constant{curses.KEY_LEFT}. To get curses to do the job, you'll have to enable keypad mode.

\begin{verbatim}
stdscr.keypad(1)
\end{verbatim}

Terminating a curses application is much easier than starting one. You'll need to call

\begin{verbatim}
curses.nocbreak(); stdscr.keypad(0); curses.echo()
\end{verbatim}

to reverse the curses-friendly terminal settings. Then call the \function{endwin()} function to restore the terminal to its original operating mode.

\begin{verbatim}
curses.endwin()
\end{verbatim}

A common problem when debugging a curses application is to get your terminal messed up when the application dies without restoring the terminal to its previous state. In Python this commonly happens when your code is buggy and raises an uncaught exception. Keys are no longer echoed to the screen when you type them, for example, which makes using the shell difficult.

In Python you can avoid these complications and make debugging much easier by importing the module \module{curses.wrapper}. It supplies a function \function{wrapper} that takes a hook argument.
It does the initializations described above, and also initializes colors if color support is present. It then runs your hook, and then finally deinitializes appropriately. The hook is called inside a \code{try}...\code{except} clause which catches exceptions, performs curses deinitialization, and then passes the exception upwards. Thus, your terminal won't be left in a funny state on exception.

\section{Windows and Pads}

Windows are the basic abstraction in curses. A window object represents a rectangular area of the screen, and supports various methods to display text, erase it, allow the user to input strings, and so forth.

The \code{stdscr} object returned by the \function{initscr()} function is a window object that covers the entire screen. Many programs may need only this single window, but you might wish to divide the screen into smaller windows, in order to redraw or clear them separately. The \function{newwin()} function creates a new window of a given size, returning the new window object.

\begin{verbatim}
begin_x = 20 ; begin_y = 7
height = 5 ; width = 40
win = curses.newwin(height, width, begin_y, begin_x)
\end{verbatim}

A word about the coordinate system used in curses: coordinates are always passed in the order \emph{y,x}, and the top-left corner of a window is coordinate (0,0). This breaks a common convention for handling coordinates, where the \emph{x} coordinate usually comes first. This is an unfortunate difference from most other computer applications, but it's been part of curses since it was first written, and it's too late to change things now.

When you call a method to display or erase text, the effect doesn't immediately show up on the display. This is because curses was originally written with slow 300-baud terminal connections in mind; with these terminals, minimizing the time required to redraw the screen is very important. Deferring the update lets curses accumulate changes to the screen, and display them in the most efficient manner.
For example, if your program displays some characters in a window, and then clears the window, there's no need to send the original characters because they'd never be visible.

Accordingly, curses requires that you explicitly tell it to redraw windows, using the \function{refresh()} method of window objects. In practice, this doesn't really complicate programming with curses much. Most programs go into a flurry of activity, and then pause waiting for a keypress or some other action on the part of the user. All you have to do is to be sure that the screen has been redrawn before pausing to wait for user input, by simply calling \code{stdscr.refresh()} or the \function{refresh()} method of some other relevant window.

A pad is a special case of a window; it can be larger than the actual display screen, and only a portion of it is displayed at a time. Creating a pad simply requires the pad's height and width, while refreshing a pad requires giving the coordinates of the on-screen area where a subsection of the pad will be displayed.

\begin{verbatim}
pad = curses.newpad(100, 100)
# These loops fill the pad with letters; this is
# explained in the next section
for y in range(0, 100):
    for x in range(0, 100):
        try:
            pad.addch(y, x, ord('a') + (x*x + y*y) % 26)
        except curses.error:
            # Writing to the last cell of the pad raises an error; ignore it
            pass

# Displays a section of the pad in the middle of the screen
pad.refresh(0,0, 5,5, 20,75)
\end{verbatim}

The \function{refresh()} call displays a section of the pad in the rectangle extending from coordinate (5,5) to coordinate (20,75) on the screen; the upper left corner of the displayed section is coordinate (0,0) on the pad. Beyond that difference, pads are exactly like ordinary windows and support the same methods.

If you have multiple windows and pads on screen there is a more efficient way to go, which will prevent annoying screen flicker at refresh time.
Use the \method{noutrefresh()} method of each window to update the data structure representing the desired state of the screen; then change the physical screen to match the desired state in one go with the function \function{doupdate()}. The normal \method{refresh()} method calls \function{doupdate()} as its last act.

\section{Displaying Text}

{}From a C programmer's point of view, curses may sometimes look like a twisty maze of functions, all subtly different. For example, \function{addstr()} displays a string at the current cursor location in the \code{stdscr} window, while \function{mvaddstr()} moves to a given y,x coordinate first before displaying the string. \function{waddstr()} is just like \function{addstr()}, but allows specifying a window to use, instead of using \code{stdscr} by default. \function{mvwaddstr()} follows similarly.

Fortunately the Python interface hides all these details; \code{stdscr} is a window object like any other, and methods like \function{addstr()} accept multiple argument forms. Usually there are four different forms.

\begin{tableii}{|c|l|}{textrm}{Form}{Description}
\lineii{\var{str} or \var{ch}}{Display the string \var{str} or character \var{ch}}
\lineii{\var{str} or \var{ch}, \var{attr}}{Display the string \var{str} or character \var{ch}, using attribute \var{attr}}
\lineii{\var{y}, \var{x}, \var{str} or \var{ch}}
{Move to position \var{y,x} within the window, and display \var{str} or \var{ch}}
\lineii{\var{y}, \var{x}, \var{str} or \var{ch}, \var{attr}}
{Move to position \var{y,x} within the window, and display \var{str} or \var{ch}, using attribute \var{attr}}
\end{tableii}

Attributes allow displaying text in highlighted forms, such as in boldface, underline, reverse video, or in color. They'll be explained in more detail in the next subsection.
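The four argument forms in the table above can be tried out with a short sketch like the following. It assumes a Python 3 interpreter, where \code{curses.wrapper} is a function rather than the Python 2.x module mentioned earlier; \code{show\_forms} and \code{main} are names made up for this example, and the terminal check before starting curses is just a precaution of mine, not something curses requires.

```python
import curses
import sys


def show_forms(win):
    """Call addstr() once in each of its four argument forms."""
    # Form 1: string only, drawn at the current cursor position.
    win.addstr("plain text at the cursor")
    # Form 2: string plus attribute.
    win.addstr(" bold at the cursor", curses.A_BOLD)
    # Form 3: y, x, string. Note that the row (y) comes first.
    win.addstr(2, 4, "moved to row 2, column 4")
    # Form 4: y, x, string, attribute.
    win.addstr(3, 4, "moved and underlined", curses.A_UNDERLINE)
    # Nothing appears on the terminal until the window is refreshed.
    win.refresh()


def main(stdscr):
    show_forms(stdscr)
    stdscr.getch()  # wait for a key so the output stays visible


# Only start curses when attached to a real terminal.
if __name__ == "__main__" and sys.stdout.isatty():
    curses.wrapper(main)
```

Because \code{show\_forms} accepts any window object, the same function works unchanged on a window returned by \function{newwin()}.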
The \function{addstr()} function takes a Python string as the value to be displayed, while the \function{addch()} functions take a character, which can be either a Python string of length 1, or an integer. If it's a string, you're limited to displaying characters between 0 and 255. SVr4 curses provides constants for extension characters; these constants are integers greater than 255. For example, \constant{ACS_PLMINUS} is a +/- symbol, and \constant{ACS_ULCORNER} is the upper left corner of a box (handy for drawing borders).

Windows remember where the cursor was left after the last operation, so if you leave out the \var{y,x} coordinates, the string or character will be displayed wherever the last operation left off. You can also move the cursor with the \function{move(\var{y,x})} method. Because some terminals always display a flashing cursor, you may want to ensure that the cursor is positioned in some location where it won't be distracting; it can be confusing to have the cursor blinking at some apparently random location.

If your application doesn't need a blinking cursor at all, you can call \function{curs_set(0)} to make it invisible. Equivalently, and for compatibility with older curses versions, there's a \function{leaveok(\var{bool})} function. When \var{bool} is true, the curses library will attempt to suppress the flashing cursor, and you won't need to worry about leaving it in odd locations.

\subsection{Attributes and Color}

Characters can be displayed in different ways. Status lines in a text-based application are commonly shown in reverse video; a text viewer may need to highlight certain words. curses supports this by allowing you to specify an attribute for each cell on the screen.

An attribute is an integer, each bit representing a different attribute. You can try to display text with multiple attribute bits set, but curses doesn't guarantee that all the possible combinations are available, or that they're all visually distinct.
That depends on the ability of the terminal being used, so it's
safest to stick to the most commonly available attributes, listed
here.

\begin{tableii}{|c|l|}{constant}{Attribute}{Description}
\lineii{A_BLINK}{Blinking text}
\lineii{A_BOLD}{Extra bright or bold text}
\lineii{A_DIM}{Half bright text}
\lineii{A_REVERSE}{Reverse-video text}
\lineii{A_STANDOUT}{The best highlighting mode available}
\lineii{A_UNDERLINE}{Underlined text}
\end{tableii}

So, to display a reverse-video status line on the top line of the
screen, you could code:

\begin{verbatim}
stdscr.addstr(0, 0, "Current mode: Typing mode",
              curses.A_REVERSE)
stdscr.refresh()
\end{verbatim}

The curses library also supports color on those terminals that
provide it.  The most common such terminal is probably the Linux
console, followed by color xterms.

To use color, you must call the \function{start_color()} function
soon after calling \function{initscr()}, to initialize the default
color set (the \function{curses.wrapper.wrapper()} function does this
automatically).  Once that's done, the \function{has_colors()}
function returns TRUE if the terminal in use can actually display
color.  (Note from AMK:  curses uses the American spelling 'color',
instead of the Canadian/British spelling 'colour'.  If you're like
me, you'll have to resign yourself to misspelling it for the sake of
these functions.)

The curses library maintains a finite number of color pairs,
containing a foreground (or text) color and a background color.  You
can get the attribute value corresponding to a color pair with the
\function{color_pair()} function; this can be bitwise-OR'ed with
other attributes such as \constant{A_REVERSE}, but again, such
combinations are not guaranteed to work on all terminals.

An example, which displays a line of text using color pair 1:

\begin{verbatim}
stdscr.addstr( "Pretty text", curses.color_pair(1) )
stdscr.refresh()
\end{verbatim}

As I said before, a color pair consists of a foreground and
background color.
\function{start_color()} initializes 8 basic colors when it activates
color mode.  They are: 0:black, 1:red, 2:green, 3:yellow, 4:blue,
5:magenta, 6:cyan, and 7:white.  The curses module defines named
constants for each of these colors: \constant{curses.COLOR_BLACK},
\constant{curses.COLOR_RED}, and so forth.

The \function{init_pair(\var{n, f, b})} function changes the
definition of color pair \var{n}, to foreground color \var{f} and
background color \var{b}.  Color pair 0 is hard-wired to white on
black, and cannot be changed.

Let's put all this together.  To change color pair 1 to red text on a
white background, you would call:

\begin{verbatim}
curses.init_pair(1, curses.COLOR_RED, curses.COLOR_WHITE)
\end{verbatim}

When you change a color pair, any text already displayed using that
color pair will change to the new colors.  You can also display new
text in this color with:

\begin{verbatim}
stdscr.addstr(0,0, "RED ALERT!", curses.color_pair(1) )
\end{verbatim}

Very fancy terminals can change the definitions of the actual colors
to a given RGB value.  This lets you change color 1, which is usually
red, to purple or blue or any other color you like.  Unfortunately,
the Linux console doesn't support this, so I'm unable to try it out,
and can't provide any examples.  You can check if your terminal can
do this by calling \function{can_change_color()}, which returns TRUE
if the capability is there.  If you're lucky enough to have such a
talented terminal, consult your system's man pages for more
information.

\section{User Input}

The curses library itself offers only very simple input mechanisms.
Python's support adds a text-input widget that makes up for some of
the lack.

The most common way to get input to a window is to use its
\method{getch()} method, which pauses and waits for the user to hit a
key, displaying it if \function{echo()} has been called earlier.  You
can optionally specify a coordinate to which the cursor should be
moved before pausing.
It's possible to change this behavior with the method
\method{nodelay()}.  After \method{nodelay(1)}, \method{getch()} for
the window becomes non-blocking and returns ERR (-1) when no input is
ready.  There's also a \function{halfdelay()} function, which can be
used to (in effect) set a timer on each \method{getch()}; if no input
becomes available within the delay given as the argument to
\function{halfdelay()} (measured in tenths of a second), curses
raises an exception.

The \method{getch()} method returns an integer; if it's between 0 and
255, it represents the ASCII code of the key pressed.  Values greater
than 255 are special keys such as Page Up, Home, or the cursor keys.
You can compare the value returned to constants such as
\constant{curses.KEY_PPAGE}, \constant{curses.KEY_HOME}, or
\constant{curses.KEY_LEFT}.  Usually the main loop of your program
will look something like this:

\begin{verbatim}
while 1:
    c = stdscr.getch()
    if c == ord('p'): PrintDocument()
    elif c == ord('q'): break  # Exit the while()
    elif c == curses.KEY_HOME: x = y = 0
\end{verbatim}

The \module{curses.ascii} module supplies ASCII class membership
functions that take either integer or 1-character-string arguments;
these may be useful in writing more readable tests for your command
interpreters.  It also supplies conversion functions that take either
integer or 1-character-string arguments and return the same type.
For example, \function{curses.ascii.ctrl()} returns the control
character corresponding to its argument.

There's also a method to retrieve an entire string,
\method{getstr()}.  It isn't used very often, because its
functionality is quite limited; the only editing keys available are
the backspace key and the Enter key, which terminates the string.  It
can optionally be limited to a fixed number of characters.
\begin{verbatim}
curses.echo()            # Enable echoing of characters

# Get a 15-character string, with the cursor on the top line
s = stdscr.getstr(0,0, 15)
\end{verbatim}

The Python \module{curses.textpad} module supplies something better.
With it, you can turn a window into a text box that supports an
Emacs-like set of keybindings.  Various methods of the
\class{Textbox} class support editing with input validation and
gathering the edit results either with or without trailing spaces.
See the library documentation on \module{curses.textpad} for the
details.

\section{For More Information}

This HOWTO didn't cover some advanced topics, such as screen-scraping
or capturing mouse events from an xterm instance.  But the Python
library page for the curses modules is now pretty complete.  You
should browse it next.

If you're in doubt about the detailed behavior of any of the ncurses
entry points, consult the manual pages for your curses
implementation, whether it's ncurses or a proprietary Unix vendor's.
The manual pages will document any quirks, and provide complete lists
of all the functions, attributes, and \constant{ACS_*} characters
available to you.

Because the curses API is so large, some functions aren't supported
in the Python interface, not because they're difficult to implement,
but because no one has needed them yet.  Feel free to add them and
then submit a patch.  Also, we don't yet have support for the menus
or panels libraries associated with ncurses; feel free to add that.

If you write an interesting little program, feel free to contribute
it as another demo.  We can always use more of them!

The ncurses FAQ: \url{http://dickey.his.com/ncurses/ncurses.faq.html}

\end{document}

--- NEW FILE: doanddont.tex ---
\documentclass{howto}

\title{Idioms and Anti-Idioms in Python}

\release{0.00}

\author{Moshe Zadka}
\authoraddress{howto@zadka.site.co.il}

\begin{document}
\maketitle

This document is placed in the public domain.
\begin{abstract}
\noindent
This document can be considered a companion to the tutorial.  It
shows how to use Python, and even more importantly, how {\em not} to
use Python.
\end{abstract}

\tableofcontents

\section{Language Constructs You Should Not Use}

While Python has relatively few gotchas compared to other languages,
it still has some constructs which are only useful in corner cases,
or are plain dangerous.

\subsection{from module import *}

\subsubsection{Inside Function Definitions}

\code{from module import *} is {\em invalid} inside function
definitions.  While many versions of Python do not check for the
invalidity, that does not make it more valid, no more than having a
smart lawyer makes a man innocent.  Do not use it like that ever.
Even in versions where it was accepted, it made the function
execution slower, because the compiler could not be certain which
names are local and which are global.  In Python 2.1 this construct
causes warnings, and sometimes even errors.

\subsubsection{At Module Level}

While it is valid to use \code{from module import *} at module level
it is usually a bad idea.  For one, this loses an important property
Python otherwise has --- you can know where each toplevel name is
defined by a simple "search" function in your favourite editor.  You
also open yourself to trouble in the future, if some module grows
additional functions or classes.

One of the most awful questions asked on the newsgroup is why this
code:

\begin{verbatim}
f = open("www")
f.read()
\end{verbatim}

does not work.  Of course, it works just fine (assuming you have a
file called "www".)  But it does not work if somewhere in the module,
the statement \code{from os import *} is present.  The \module{os}
module has a function called \function{open()} which returns an
integer.  While it is very useful, shadowing builtins is one of its
least useful properties.
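The shadowing described above can be observed directly.  The sketch
below is written for Python 3 (unlike the Python 2 snippets in this
document) and runs the star-import inside a scratch dictionary, so
that the script's own names stay untouched; the variable names are
invented for the illustration.

```python
import builtins
import os

# Run the star-import in a scratch namespace rather than in this
# module's own globals.
namespace = {}
exec("from os import *", namespace)

# After the import, the name "open" refers to os.open (which returns
# an integer file descriptor), not to the builtin open().
shadowed = namespace["open"] is os.open
still_builtin = namespace["open"] is builtins.open

print("open is os.open:", shadowed)
print("open is the builtin:", still_builtin)
```

Any code in that namespace that then calls \code{open("www")} would
reach \function{os.open()}, which has a different signature and
return type from the builtin.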
Remember, you can never know for sure what names a module exports, so either take what you need --- \code{from module import name1, name2}, or keep them in the module and access on a per-need basis --- \code{import module;print module.name}. \subsubsection{When It Is Just Fine} There are situations in which \code{from module import *} is just fine: \begin{itemize} \item The interactive prompt. For example, \code{from math import *} makes Python an amazing scientific calculator. \item When extending a module in C with a module in Python. \item When the module advertises itself as \code{from import *} safe. \end{itemize} \subsection{Unadorned \keyword{exec}, \function{execfile} and friends} The word ``unadorned'' refers to the use without an explicit dictionary, in which case those constructs evaluate code in the {\em current} environment. This is dangerous for the same reasons \code{from import *} is dangerous --- it might step over variables you are counting on and mess up things for the rest of your code. Simply do not do that. Bad examples: \begin{verbatim}
for name in sys.argv[1:]:
    exec "%s=1" % name
def func(s, **kw):
    for var, val in kw.items():
        exec "s.%s=val" % var  # invalid!
execfile("handler.py")
handle()
\end{verbatim}
Good examples: \begin{verbatim}
d = {}
for name in sys.argv[1:]:
    d[name] = 1
def func(s, **kw):
    for var, val in kw.items():
        setattr(s, var, val)
d={}
execfile("handler.py", d, d)
handle = d['handle']
handle()
\end{verbatim}
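For readers on Python 3, where \keyword{exec} is a function and
\function{execfile} no longer exists, the same two ideas might be
sketched as follows.  The names \code{Settings},
\code{set_attributes}, \code{load_handler} and the sample source
string are hypothetical, chosen only for this illustration.

```python
class Settings:
    """Empty container used only for this illustration."""

def set_attributes(obj, **kw):
    # setattr() replaces the 'exec "s.%s=val"' anti-idiom.
    for name, value in kw.items():
        setattr(obj, name, value)

def load_handler(source):
    # Execute the code in its own dictionary instead of the caller's
    # namespace, then pick out the one name that is needed.
    namespace = {}
    exec(source, namespace)
    return namespace["handle"]

settings = Settings()
set_attributes(settings, colour="red", width=80)

# Stand-in for the contents of a handler file.
handler_source = "def handle():\n    return 'handled'\n"
handle = load_handler(handler_source)
```

Because the executed code only ever sees the scratch dictionary, it
cannot overwrite variables the rest of the program depends on.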
\subsection{from module import name1, name2}

This is a ``don't'' which is much weaker than the previous ``don't''s
but is still something you should not do if you don't have good
reasons to do that.  The reason it is usually a bad idea is because
you suddenly have an object which lives in two separate namespaces.
When the binding in one namespace changes, the binding in the other
will not, so there will be a discrepancy between them.  This happens
when, for example, one module is reloaded, or changes the definition
of a function at runtime.

Bad example:

\begin{verbatim}
# foo.py
a = 1

# bar.py
from foo import a
if something():
    a = 2 # danger: foo.a != a
\end{verbatim}

Good example:

\begin{verbatim}
# foo.py
a = 1

# bar.py
import foo
if something():
    foo.a = 2
\end{verbatim}

\subsection{except:}

Python has the \code{except:} clause, which catches all exceptions.
Since {\em every} error in Python raises an exception, this makes
many programming errors look like runtime problems, and hinders the
debugging process.

The following code shows a great example:

\begin{verbatim}
try:
    foo = opne("file") # misspelled "open"
except:
    sys.exit("could not open file!")
\end{verbatim}

The second line triggers a \exception{NameError} which is caught by
the except clause.  The program will exit, and you will have no idea
that this has nothing to do with the readability of \code{"file"}.

The example above is better written

\begin{verbatim}
try:
    foo = opne("file") # will be changed to "open" as soon as we run it
except IOError:
    sys.exit("could not open file")
\end{verbatim}

There are some situations in which the \code{except:} clause is
useful: for example, in a framework when running callbacks, it is
good not to let any callback disturb the framework.

\section{Exceptions}

Exceptions are a useful feature of Python.  You should learn to raise
them whenever something unexpected occurs, and catch them only where
you can do something about them.
The following is a very popular anti-idiom

\begin{verbatim}
def get_status(file):
    if not os.path.exists(file):
        print "file not found"
        sys.exit(1)
    return open(file).readline()
\end{verbatim}

Consider the case where the file gets deleted between the time the
call to \function{os.path.exists} is made and the time
\function{open} is called.  That means the last line will throw an
\exception{IOError}.  The same would happen if \var{file} exists but
has no read permission.  Since testing this on a normal machine on
existing and non-existing files makes it seem bugless, the results
will seem fine in testing, and the code will get shipped.  Then an
unhandled \exception{IOError} escapes to the user, who has to watch
the ugly traceback.

Here is a better way to do it.

\begin{verbatim}
def get_status(file):
    try:
        return open(file).readline()
    except (IOError, OSError):
        print "file not found"
        sys.exit(1)
\end{verbatim}

In this version, {\em either} the file gets opened and the line is
read (so it works even on flaky NFS or SMB connections), or the
message is printed and the application aborted.

Still, \function{get_status} makes too many assumptions --- that it
will only be used in a short running script, and not, say, in a long
running server.  Sure, the caller could do something like

\begin{verbatim}
try:
    status = get_status(log)
except SystemExit:
    status = None
\end{verbatim}

So, try to make as few \code{except} clauses as possible in your code
--- those will usually be a catch-all in the \function{main}, or
inside calls which should always succeed.

So, the best version is probably

\begin{verbatim}
def get_status(file):
    return open(file).readline()
\end{verbatim}

The caller can deal with the exception if it wants (for example, if
it tries several files in a loop), or just let the exception filter
upwards to {\em its} caller.
The last version is not very good either --- due to implementation
details, the file would not be closed when an exception is raised
until the handler finishes, and perhaps not at all in non-C
implementations (e.g., Jython).

\begin{verbatim}
def get_status(file):
    fp = open(file)
    try:
        return fp.readline()
    finally:
        fp.close()
\end{verbatim}

\section{Using the Batteries}

Every so often, people seem to be writing stuff in the Python library
again, usually poorly.  While the occasional module has a poor
interface, it is usually much better to use the rich standard library
and data types that come with Python than to invent your own.

A useful module very few people know about is \module{os.path}.  It
always has the correct path arithmetic for your operating system, and
will usually be much better than whatever you come up with yourself.

Compare:

\begin{verbatim}
# ugh!
return dir+"/"+file
# better
return os.path.join(dir, file)
\end{verbatim}

More useful functions in \module{os.path}: \function{basename},
\function{dirname} and \function{splitext}.

There are also many useful builtin functions people seem not to be
aware of for some reason: \function{min()} and \function{max()} can
find the minimum/maximum of any sequence with comparable semantics,
for example, yet many people write their own max/min.  Another highly
useful function is \function{reduce()}.  Classical use of
\function{reduce()} is something like

\begin{verbatim}
import sys, operator
nums = map(float, sys.argv[1:])
print reduce(operator.add, nums)/len(nums)
\end{verbatim}

This cute little script prints the average of all numbers given on
the command line.  The \function{reduce()} adds up all the numbers,
and the rest is just some pre- and postprocessing.

On the same note, note that \function{float()}, \function{int()} and
\function{long()} all accept arguments of type string, and so are
suited to parsing --- assuming you are ready to deal with the
\exception{ValueError} they raise.
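To tie the points of this section together, here is a small sketch
in Python 3 syntax that leans on \module{os.path}, \function{float()}
and the builtin \function{sum()} instead of hand-written
replacements.  The helper names \code{parse_numbers} and
\code{average}, and the sample path, are assumptions made for the
example.

```python
import os.path

def parse_numbers(args):
    """Convert strings to floats, skipping the ones that do not parse."""
    numbers = []
    for arg in args:
        try:
            numbers.append(float(arg))
        except ValueError:
            # Not a number; a real program might report it instead.
            continue
    return numbers

def average(args):
    """Average the numeric strings in args using the builtin sum()."""
    numbers = parse_numbers(args)
    if not numbers:
        raise ValueError("no numeric arguments given")
    return sum(numbers) / len(numbers)

# Path arithmetic is left to os.path rather than string concatenation.
path = os.path.join("logs", "report.txt")
name = os.path.basename(path)
stem, extension = os.path.splitext(name)
```

Calling \code{average(sys.argv[1:])} would give the same result as
the \function{reduce()} script above for well-formed input, while
quietly ignoring arguments that are not numbers.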
\section{Using Backslash to Continue Statements}

Since Python treats a newline as a statement terminator, and since
statements are often longer than is comfortable to put in one line,
many people do:

\begin{verbatim}
if foo.bar()['first'][0] == baz.quux(1, 2)[5:9] and \
   calculate_number(10, 20) != forbulate(500, 360):
      pass
\end{verbatim}

You should realize that this is dangerous: a stray space after the
\code{\\} would make this line wrong, and stray spaces are
notoriously hard to see in editors.  In this case, at least it would
be a syntax error, but if the code was:

\begin{verbatim}
value = foo.bar()['first'][0]*baz.quux(1, 2)[5:9] \
        + calculate_number(10, 20)*forbulate(500, 360)
\end{verbatim}

then it would just be subtly wrong.

It is usually much better to use the implicit continuation inside
parentheses.  This version is bulletproof:

\begin{verbatim}
value = (foo.bar()['first'][0]*baz.quux(1, 2)[5:9]
        + calculate_number(10, 20)*forbulate(500, 360))
\end{verbatim}

\end{document}

--- NEW FILE: regex.tex ---
\documentclass{howto}

% TODO:
% Document lookbehind assertions
% Better way of displaying a RE, a string, and what it matches
% Mention optional argument to match.groups()
% Unicode (at least a reference)

\title{Regular Expression HOWTO}

\release{0.05}

\author{A.M. Kuchling}
\authoraddress{\email{amk@amk.ca}}

\begin{document}
\maketitle

\begin{abstract}
[...1427 lines suppressed...]
% $

\section{Feedback}

Regular expressions are a complicated topic.  Did this document help
you understand them?  Were there parts that were unclear, or problems
you encountered that weren't covered here?  If so, please send
suggestions for improvements to the author.

The most complete book on regular expressions is almost certainly
Jeffrey Friedl's \citetitle{Mastering Regular Expressions}, published
by O'Reilly.
Unfortunately, it exclusively concentrates on Perl and Java's
flavours of regular expressions, and doesn't contain any Python
material at all, so it won't be useful as a reference for programming
in Python.  (The first edition covered Python's now-obsolete
\module{regex} module, which won't help you much.)  Consider checking
it out from your library.

\end{document}

--- NEW FILE: rexec.tex ---
\documentclass{howto}

\title{Restricted Execution HOWTO}

\release{2.1}

\author{A.M. Kuchling}
\authoraddress{\email{amk@amk.ca}}

\begin{document}
\maketitle

\begin{abstract}
\noindent
Python 2.2.2 and earlier provided a \module{rexec} module for running
untrusted code.  However, it's never been exhaustively audited for
security and it hasn't been updated to take into account recent
changes to Python such as new-style classes.  Therefore, the
\module{rexec} module should not be trusted.  To discourage use of
\module{rexec}, this HOWTO has been withdrawn.

The \module{rexec} and \module{Bastion} modules have been disabled in
the Python CVS tree, both on the trunk (which will eventually become
Python 2.3alpha2 and later 2.3final) and on the release22-maint
branch (which will become Python 2.2.3, if someone ever volunteers to
issue 2.2.3).

For discussion of the problems with \module{rexec}, see the
python-dev threads starting at the following URLs:
\url{http://mail.python.org/pipermail/python-dev/2002-December/031160.html},
and
\url{http://mail.python.org/pipermail/python-dev/2003-January/031848.html}.
\end{abstract}

\section{Version History}

Sep. 12, 1998: Minor revisions and added the reference to the Janus
project.

Feb. 26, 1998: First version.  Suggestions are welcome.

Mar. 16, 1998: Made some revisions suggested by Jeff Rush.  Some
minor changes and clarifications, and a sizable section on exceptions
added.

Oct. 4, 2000: Checked with Python 2.0.  Minor rewrites and fixes
made.  Version number increased to 2.0.

Dec. 17, 2002: Withdrawn.

Jan.
8, 2003: Mention that \module{rexec} will be disabled in Python 2.3, and added links to relevant python-dev threads. \end{document} --- NEW FILE: sockets.tex --- \documentclass{howto} \title{Socket Programming HOWTO} \release{0.00} \author{Gordon McMillan} \authoraddress{\email{gmcm@hypernet.com}} \begin{document} \maketitle \begin{abstract} \noindent Sockets are used nearly everywhere, but are one of the most severely misunderstood technologies around. This is a 10,000 foot overview of sockets. It's not really a tutorial - you'll still have work to do in getting things operational. It doesn't cover the fine points (and there are a lot of them), but I hope it will give you enough background to begin using them decently. This document is available from the Python HOWTO page at \url{http://www.python.org/doc/howto}. \end{abstract} \tableofcontents \section{Sockets} Sockets are used nearly everywhere, but are one of the most severely misunderstood technologies around. This is a 10,000 foot overview of sockets. It's not really a tutorial - you'll still have work to do in getting things working. It doesn't cover the fine points (and there are a lot of them), but I hope it will give you enough background to begin using them decently. I'm only going to talk about INET sockets, but they account for at least 99\% of the sockets in use. And I'll only talk about STREAM sockets - unless you really know what you're doing (in which case this HOWTO isn't for you!), you'll get better behavior and performance from a STREAM socket than anything else. I will try to clear up the mystery of what a socket is, as well as some hints on how to work with blocking and non-blocking sockets. But I'll start by talking about blocking sockets. You'll need to know how they work before dealing with non-blocking sockets. Part of the trouble with understanding these things is that "socket" can mean a number of subtly different things, depending on context. 
So first, let's make a distinction between a "client" socket - an
endpoint of a conversation, and a "server" socket, which is more like
a switchboard operator.  The client application (your browser, for
example) uses "client" sockets exclusively; the web server it's
talking to uses both "server" sockets and "client" sockets.

\subsection{History}

Of the various forms of IPC (\emph{Inter Process Communication}),
sockets are by far the most popular.  On any given platform, there
are likely to be other forms of IPC that are faster, but for
cross-platform communication, sockets are about the only game in
town.

They were invented in Berkeley as part of the BSD flavor of Unix.
They spread like wildfire with the Internet.  With good reason ---
the combination of sockets with INET makes talking to arbitrary
machines around the world unbelievably easy (at least compared to
other schemes).

\section{Creating a Socket}

Roughly speaking, when you clicked on the link that brought you to
this page, your browser did something like the following:

\begin{verbatim}
#create an INET, STREAMing socket
s = socket.socket(
    socket.AF_INET, socket.SOCK_STREAM)
#now connect to the web server on port 80
# - the normal http port
s.connect(("www.mcmillan-inc.com", 80))
\end{verbatim}

When the \code{connect} completes, the socket \code{s} can now be
used to send in a request for the text of this page.  The same socket
will read the reply, and then be destroyed.  That's right -
destroyed.  Client sockets are normally only used for one exchange
(or a small set of sequential exchanges).

What happens in the web server is a bit more complex.  First, the web
server creates a "server socket".
\begin{verbatim}
#create an INET, STREAMing socket
serversocket = socket.socket(
    socket.AF_INET, socket.SOCK_STREAM)
#bind the socket to a public host,
# and a well-known port
serversocket.bind((socket.gethostname(), 80))
#become a server socket
serversocket.listen(5)
\end{verbatim}

A couple things to notice: we used \code{socket.gethostname()} so
that the socket would be visible to the outside world.  If we had
used \code{s.bind(('localhost', 80))} or \code{s.bind(('127.0.0.1',
80))} we would still have a "server" socket, but one that was only
visible within the same machine.  \code{s.bind(('', 80))}, by
contrast, makes the socket reachable through any address the machine
happens to have.

A second thing to note: low number ports are usually reserved for
"well known" services (HTTP, SNMP etc).  If you're playing around,
use a nice high number (4 digits).

Finally, the argument to \code{listen} tells the socket library that
we want it to queue up as many as 5 connect requests (the normal max)
before refusing outside connections.  If the rest of the code is
written properly, that should be plenty.

OK, now we have a "server" socket, listening on port 80.  Now we
enter the mainloop of the web server:

\begin{verbatim}
while 1:
    #accept connections from outside
    (clientsocket, address) = serversocket.accept()
    #now do something with the clientsocket
    #in this case, we'll pretend this is a threaded server
    ct = client_thread(clientsocket)
    ct.run()
\end{verbatim}

There are actually 3 general ways in which this loop could work -
dispatching a thread to handle \code{clientsocket}, creating a new
process to handle \code{clientsocket}, or restructuring this app to
use non-blocking sockets, and multiplexing between our "server"
socket and any active \code{clientsocket}s using \code{select}.  More
about that later.  The important thing to understand now is this:
this is \emph{all} a "server" socket does.  It doesn't send any data.
It doesn't receive any data.  It just produces "client" sockets.
Each \code{clientsocket} is created in response to some \emph{other} "client" socket doing a \code{connect()} to the host and port we're bound to. As soon as we've created that \code{clientsocket}, we go back to listening for more connections. The two "clients" are free to chat it up - they are using some dynamically allocated port which will be recycled when the conversation ends. \subsection{IPC} If you need fast IPC between two processes on one machine, you should look into whatever form of shared memory the platform offers. A simple protocol based around shared memory and locks or semaphores is by far the fastest technique. If you do decide to use sockets, bind the "server" socket to \code{'localhost'}. On most platforms, this will take a shortcut around a couple of layers of network code and be quite a bit faster. \section{Using a Socket} The first thing to note, is that the web browser's "client" socket and the web server's "client" socket are identical beasts. That is, this is a "peer to peer" conversation. Or to put it another way, \emph{as the designer, you will have to decide what the rules of etiquette are for a conversation}. Normally, the \code{connect}ing socket starts the conversation, by sending in a request, or perhaps a signon. But that's a design decision - it's not a rule of sockets. Now there are two sets of verbs to use for communication. You can use \code{send} and \code{recv}, or you can transform your client socket into a file-like beast and use \code{read} and \code{write}. The latter is the way Java presents their sockets. I'm not going to talk about it here, except to warn you that you need to use \code{flush} on sockets. These are buffered "files", and a common mistake is to \code{write} something, and then \code{read} for a reply. Without a \code{flush} in there, you may wait forever for the reply, because the request may still be in your output buffer. 
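The \code{flush} pitfall just described can be reproduced without a
network.  The sketch below, in Python 3 syntax, assumes
\code{socket.socketpair()} is available on the platform and uses it
as a stand-in for a real client/server connection; the variable names
and the \code{HELLO} message are invented for the example.

```python
import socket

# socketpair() returns two already-connected sockets, standing in for
# the two ends of a real client/server conversation.
client, server = socket.socketpair()

# Wrap each end in a buffered, file-like object.
client_out = client.makefile("w")
server_in = server.makefile("r")

client_out.write("HELLO\n")
# Without this flush() the request could sit in client_out's buffer,
# and the readline() below would wait forever.
client_out.flush()

request = server_in.readline()

# Close the file wrappers first, then the sockets themselves.
client_out.close()
server_in.close()
client.close()
server.close()
```

Commenting out the \code{flush()} call is an easy way to see the hang
for yourself; be ready to interrupt the script.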
Now we come to the major stumbling block of sockets - \code{send} and
\code{recv} operate on the network buffers.  They do not necessarily
handle all the bytes you hand them (or expect from them), because
their major focus is handling the network buffers.  In general, they
return when the associated network buffers have been filled
(\code{send}) or emptied (\code{recv}).  They then tell you how many
bytes they handled.  It is \emph{your} responsibility to call them
again until your message has been completely dealt with.

When a \code{recv} returns 0 bytes, it means the other side has
closed (or is in the process of closing) the connection.  You will
not receive any more data on this connection.  Ever.  You may be able
to send data successfully; I'll talk about that some more later.

A protocol like HTTP uses a socket for only one transfer.  The client
sends a request, then reads a reply.  That's it.  The socket is
discarded.  This means that a client can detect the end of the reply
by receiving 0 bytes.

But if you plan to reuse your socket for further transfers, you need
to realize that \emph{there is no "EOT" (End of Transfer) on a
socket.}  I repeat: if a socket \code{send} or \code{recv} returns
after handling 0 bytes, the connection has been broken.  If the
connection has \emph{not} been broken, you may wait on a \code{recv}
forever, because the socket will \emph{not} tell you that there's
nothing more to read (for now).  Now if you think about that a bit,
you'll come to realize a fundamental truth of sockets: \emph{messages
must either be fixed length} (yuck), \emph{or be delimited} (shrug),
\emph{or indicate how long they are} (much better), \emph{or end by
shutting down the connection}.  The choice is entirely yours (but
some ways are righter than others).
Assuming you don't want to end the connection, the simplest solution
is a fixed length message:

\begin{verbatim}
import socket

# MSGLEN is the fixed message length both ends have agreed on.

class mysocket:
    '''demonstration class only
      - coded for clarity, not efficiency'''
    def __init__(self, sock=None):
        if sock is None:
            self.sock = socket.socket(
                socket.AF_INET, socket.SOCK_STREAM)
        else:
            self.sock = sock
    def connect(self, host, port):
        self.sock.connect((host, port))
    def mysend(self, msg):
        totalsent = 0
        while totalsent < MSGLEN:
            sent = self.sock.send(msg[totalsent:])
            if sent == 0:
                raise RuntimeError("socket connection broken")
            totalsent = totalsent + sent
    def myreceive(self):
        msg = ''
        while len(msg) < MSGLEN:
            chunk = self.sock.recv(MSGLEN-len(msg))
            if chunk == '':
                raise RuntimeError("socket connection broken")
            msg = msg + chunk
        return msg
\end{verbatim}

The sending code here is usable for almost any messaging scheme - in
Python you send strings, and you can use \code{len()} to determine
their length (even if they have embedded \code{\e 0} characters).
It's mostly the receiving code that gets more complex.  (And in C,
it's not much worse, except you can't use \code{strlen} if the
message has embedded \code{\e 0}s.)

The easiest enhancement is to make the first character of the message
an indicator of message type, and have the type determine the length.
Now you have two \code{recv}s - the first to get (at least) that
first character so you can look up the length, and the second in a
loop to get the rest.  If you decide to go the delimited route,
you'll be receiving in some arbitrary chunk size, (4096 or 8192 is
frequently a good match for network buffer sizes), and scanning what
you've received for a delimiter.

One complication to be aware of: if your conversational protocol
allows multiple messages to be sent back to back (without some kind
of reply), and you pass \code{recv} an arbitrary chunk size, you may
end up reading the start of a following message.  You'll need to put
that aside and hold onto it, until it's needed.
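One possible shape for the delimited route, including the "put it
aside" bookkeeping, is sketched below in Python 3 syntax (where
sockets carry bytes rather than strings).  The class name
\code{DelimitedReader}, the newline delimiter and the 4096-byte chunk
size are assumptions made for the illustration, not part of any
library.

```python
import socket

class DelimitedReader:
    """Read delimiter-terminated messages from a stream socket."""

    def __init__(self, sock, delimiter=b"\n", chunk_size=4096):
        self.sock = sock
        self.delimiter = delimiter
        self.chunk_size = chunk_size
        # Bytes received but not yet handed out; this may hold the
        # start of a following message.
        self.buffer = b""

    def read_message(self):
        # Keep receiving until at least one full message is buffered.
        while self.delimiter not in self.buffer:
            chunk = self.sock.recv(self.chunk_size)
            if chunk == b"":
                raise RuntimeError("socket connection broken")
            self.buffer += chunk
        # Split off the first message; keep the remainder for later.
        message, _, self.buffer = self.buffer.partition(self.delimiter)
        return message
```

Each call returns exactly one message without its delimiter, however
the bytes happened to be split across \code{recv} calls, and whatever
arrived after the delimiter stays in \code{buffer} for the next call.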
Prefixing the message with its length (say, as 5 numeric characters)
gets more complex, because (believe it or not), you may not get all 5
characters in one \code{recv}.  In playing around, you'll get away
with it; but in high network loads, your code will very quickly break
unless you use two \code{recv} loops - the first to determine the
length, the second to get the data part of the message.  Nasty.  This
is also when you'll discover that \code{send} does not always manage
to get rid of everything in one pass.  And despite having read this,
you will eventually get bit by it!

In the interests of space, building your character, (and preserving
my competitive position), these enhancements are left as an exercise
for the reader.  Let's move on to cleaning up.

\subsection{Binary Data}

It is perfectly possible to send binary data over a socket.  The
major problem is that not all machines use the same formats for
binary data.  For example, a Motorola chip will represent a 16 bit
integer with the value 1 as the two hex bytes 00 01.  Intel and DEC,
however, are byte-reversed - that same 1 is 01 00.  Socket libraries
have calls for converting 16 and 32 bit integers - \code{ntohl,
htonl, ntohs, htons} where "n" means \emph{network} and "h" means
\emph{host}, "s" means \emph{short} and "l" means \emph{long}.  Where
network order is host order, these do nothing, but where the machine
is byte-reversed, these swap the bytes around appropriately.

In these days of 32 bit machines, the ascii representation of binary
data is frequently smaller than the binary representation.  That's
because a surprising amount of the time, all those longs have the
value 0, or maybe 1.  The string "0" would be two bytes, while binary
is four.  Of course, this doesn't fit well with fixed-length
messages.  Decisions, decisions.

\section{Disconnecting}

Strictly speaking, you're supposed to use \code{shutdown} on a socket
before you \code{close} it.  The \code{shutdown} is an advisory to
the socket at the other end.
Depending on the argument you pass it, it can mean "I'm not going to send anymore, but I'll still listen", or "I'm not listening, good riddance!". Most socket libraries, however, are so used to programmers neglecting to use this piece of etiquette that normally a \code{close} is the same as \code{shutdown(); close()}. So in most situations, an explicit \code{shutdown} is not needed. One way to use \code{shutdown} effectively is in an HTTP-like exchange. The client sends a request and then does a \code{shutdown(1)}. This tells the server "This client is done sending, but can still receive." The server can detect "EOF" by a receive of 0 bytes. It can assume it has the complete request. The server sends a reply. If the \code{send} completes successfully then, indeed, the client was still receiving. Python takes the automatic shutdown a step further, and says that when a socket is garbage collected, it will automatically do a \code{close} if it's needed. But relying on this is a very bad habit. If your socket just disappears without doing a \code{close}, the socket at the other end may hang indefinitely, thinking you're just being slow. \emph{Please} \code{close} your sockets when you're done. \subsection{When Sockets Die} Probably the worst thing about using blocking sockets is what happens when the other side comes down hard (without doing a \code{close}). Your socket is likely to hang. SOCKSTREAM is a reliable protocol, and it will wait a long, long time before giving up on a connection. If you're using threads, the entire thread is essentially dead. There's not much you can do about it. As long as you aren't doing something dumb, like holding a lock while doing a blocking read, the thread isn't really consuming much in the way of resources. Do \emph{not} try to kill the thread - part of the reason that threads are more efficient than processes is that they avoid the overhead associated with the automatic recycling of resources. 
In other words, if you do manage to kill the thread, your whole process is likely to be screwed up.

\section{Non-blocking Sockets}

If you've understood the preceding, you already know most of what you need to know about the mechanics of using sockets. You'll still use the same calls, in much the same ways. It's just that, if you do it right, your app will be almost inside-out.

In Python, you use \code{socket.setblocking(0)} to make a socket non-blocking. In C, it's more complex (for one thing, you'll need to choose between the Posix flavor \code{O_NONBLOCK} and the almost indistinguishable BSD flavor \code{O_NDELAY}, which is completely different from \code{TCP_NODELAY}), but it's the exact same idea. You do this after creating the socket, but before using it. (Actually, if you're nuts, you can switch back and forth.)

The major mechanical difference is that \code{send}, \code{recv}, \code{connect} and \code{accept} can return without having done anything. You have (of course) a number of choices. You can check return codes and error codes and generally drive yourself crazy. If you don't believe me, try it sometime. Your app will grow large, buggy and suck CPU. So let's skip the brain-dead solutions and do it right.

Use \code{select}.

In C, coding \code{select} is fairly complex. In Python, it's a piece of cake, but it's close enough to the C version that if you understand \code{select} in Python, you'll have little trouble with it in C.

\begin{verbatim}
ready_to_read, ready_to_write, in_error = \
               select.select(
                  potential_readers,
                  potential_writers,
                  potential_errs,
                  timeout)
\end{verbatim}

You pass \code{select} three lists: the first contains all sockets that you might want to try reading; the second all the sockets you might want to try writing to, and the last (normally left empty) those that you want to check for errors. You should note that a socket can go into more than one list. The \code{select} call is blocking, but you can give it a timeout.
This is generally a sensible thing to do - give it a nice long timeout (say a minute) unless you have good reason to do otherwise.

In return, you will get three lists. They have the sockets that are actually readable, writable and in error. Each of these lists is a subset (possibly empty) of the corresponding list you passed in. A socket that you put in more than one input list can also come back in more than one output list.

If a socket is in the output readable list, you can be as-close-to-certain-as-we-ever-get-in-this-business that a \code{recv} on that socket will return \emph{something}. Same idea for the writable list. You'll be able to send \emph{something}. Maybe not all you want to, but \emph{something} is better than nothing. (Actually, any reasonably healthy socket will return as writable - it just means outbound network buffer space is available.)

If you have a "server" socket, put it in the \code{potential_readers} list. If it comes out in the readable list, your \code{accept} will (almost certainly) work. If you have created a new socket to \code{connect} to someone else, put it in the \code{potential_writers} list. If it shows up in the writable list, you have a decent chance that it has connected.

One very nasty problem with \code{select}: if somewhere in those input lists of sockets is one which has died a nasty death, the \code{select} will fail. You then need to loop through every single damn socket in all those lists and do a \code{select([sock],[],[],0)} until you find the bad one. That timeout of 0 means it won't take long, but it's ugly.

Actually, \code{select} can be handy even with blocking sockets. It's one way of determining whether you will block - the socket returns as readable when there's something in the buffers. However, this still doesn't help with the problem of determining whether the other end is done, or just busy with something else.

\textbf{Portability alert}: On Unix, \code{select} works with both sockets and files.
Don't try this on Windows. On Windows, \code{select} works with sockets only. Also note that in C, many of the more advanced socket options are done differently on Windows. In fact, on Windows I usually use threads (which work very, very well) with my sockets. Face it, if you want any kind of performance, your code will look very different on Windows than on Unix. (I haven't the foggiest how you do this stuff on a Mac.) \subsection{Performance} There's no question that the fastest sockets code uses non-blocking sockets and select to multiplex them. You can put together something that will saturate a LAN connection without putting any strain on the CPU. The trouble is that an app written this way can't do much of anything else - it needs to be ready to shuffle bytes around at all times. Assuming that your app is actually supposed to do something more than that, threading is the optimal solution, (and using non-blocking sockets will be faster than using blocking sockets). Unfortunately, threading support in Unixes varies both in API and quality. So the normal Unix solution is to fork a subprocess to deal with each connection. The overhead for this is significant (and don't do this on Windows - the overhead of process creation is enormous there). It also means that unless each subprocess is completely independent, you'll need to use another form of IPC, say a pipe, or shared memory and semaphores, to communicate between the parent and child processes. Finally, remember that even though blocking sockets are somewhat slower than non-blocking, in many cases they are the "right" solution. After all, if your app is driven by the data it receives over a socket, there's not much sense in complicating the logic just so your app can wait on \code{select} instead of \code{recv}. \end{document} --- NEW FILE: sorting.tex --- \documentclass{howto} \title{Sorting Mini-HOWTO} % Increment the release number whenever significant changes are made. 
% The author and/or editor can define 'significant' however they like. \release{0.01} \author{Andrew Dalke} \authoraddress{\email{dalke@bioreason.com}} \begin{document} \maketitle \begin{abstract} \noindent This document is a little tutorial showing a half dozen ways to sort a list with the built-in \method{sort()} method. This document is available from the Python HOWTO page at \url{http://www.python.org/doc/howto}. \end{abstract} \tableofcontents Python lists have a built-in \method{sort()} method. There are many ways to use it to sort a list and there doesn't appear to be a single, central place in the various manuals describing them, so I'll do so here. \section{Sorting basic data types} A simple ascending sort is easy; just call the \method{sort()} method of a list. \begin{verbatim}
>>> a = [5, 2, 3, 1, 4]
>>> a.sort()
>>> print a
[1, 2, 3, 4, 5]
\end{verbatim}
Sort takes an optional function which can be called for doing the comparisons. The default sort routine is equivalent to \begin{verbatim}
>>> a = [5, 2, 3, 1, 4]
>>> a.sort(cmp)
>>> print a
[1, 2, 3, 4, 5]
\end{verbatim}
where \function{cmp} is the built-in function which compares two objects, \code{x} and \code{y}, and returns -1, 0 or 1 depending on whether $x<y$, $x==y$, or $x>y$. During the course of the sort the relationships must stay the same for the final list to make sense. If you want, you can define your own function for the comparison. For integers (and numbers in general) we can do: \begin{verbatim}
>>> def numeric_compare(x, y):
...     return x-y
...
>>> a = [5, 2, 3, 1, 4]
>>> a.sort(numeric_compare)
>>> print a
[1, 2, 3, 4, 5]
\end{verbatim}
By the way, this function won't work if the result of the subtraction is out of range, as in \code{sys.maxint - (-1)}.

Or, if you don't want to define a new named function, you can create an anonymous one using \keyword{lambda}, as in:

\begin{verbatim}
>>> a = [5, 2, 3, 1, 4]
>>> a.sort(lambda x, y: x-y)
>>> print a
[1, 2, 3, 4, 5]
\end{verbatim}
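A comparison function only has to return a negative number, zero, or a positive number, so explicit tests work just as well as subtraction and cannot go out of range. The following is a minimal sketch and not one of the original examples: the name \code{safe_numeric_compare} is made up for illustration, and \code{functools.cmp_to_key} is an assumption that only holds on newer Python versions, where \method{sort()} no longer takes a comparison function directly. With the Python described in this document you would simply write \code{a.sort(safe_numeric_compare)}.

```python
import functools

def safe_numeric_compare(x, y):
    # Hypothetical helper: returns -1, 0 or 1 using comparisons only,
    # so no subtraction is performed and nothing can go out of range.
    if x < y:
        return -1
    if x > y:
        return 1
    return 0

a = [5, 2, 3, 1, 4]
# Assumption: functools.cmp_to_key exists (newer Pythons only); it adapts
# a two-argument comparison function so that sort() can use it as a key.
a.sort(key=functools.cmp_to_key(safe_numeric_compare))
print(a)
```

The function body is the part that matters; the adapter is only there so the sketch can be run unchanged on a current interpreter.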
If you want the numbers sorted in reverse you can do \begin{verbatim}
>>> a = [5, 2, 3, 1, 4]
>>> def reverse_numeric(x, y):
...     return y-x
...
>>> a.sort(reverse_numeric)
>>> print a
[5, 4, 3, 2, 1]
\end{verbatim}
(a more general implementation could return \code{cmp(y,x)} or \code{-cmp(x,y)}). However, it's faster if Python doesn't have to call a function for every comparison, so if you want a reverse-sorted list of basic data types, do the forward sort first, then use the \method{reverse()} method. \begin{verbatim}
>>> a = [5, 2, 3, 1, 4]
>>> a.sort()
>>> a.reverse()
>>> print a
[5, 4, 3, 2, 1]
\end{verbatim}
Here's a case-insensitive string comparison using a \keyword{lambda} function: \begin{verbatim}
>>> import string
>>> a = string.split("This is a test string from Andrew.")
>>> a.sort(lambda x, y: cmp(string.lower(x), string.lower(y)))
>>> print a
['a', 'Andrew.', 'from', 'is', 'string', 'test', 'This']
\end{verbatim}
This goes through the overhead of converting a word to lower case every time it must be compared. At times it may be faster to compute these once and use those values, and the following example shows how. \begin{verbatim}
>>> words = string.split("This is a test string from Andrew.")
>>> offsets = []
>>> for i in range(len(words)):
...     offsets.append( (string.lower(words[i]), i) )
...
>>> offsets.sort()
>>> new_words = []
>>> for dontcare, i in offsets:
...     new_words.append(words[i])
...
>>> print new_words
\end{verbatim}
Each entry of the \code{offsets} list is a tuple of the lower-case string and its position in the \code{words} list. The list is then sorted. Python's sort method sorts tuples by comparing terms; given \code{x} and \code{y}, compare \code{x[0]} to \code{y[0]}, then \code{x[1]} to \code{y[1]}, etc. until there is a difference.

The result is that the \code{offsets} list is ordered by its first term, and the second term can be used to figure out where the original data was stored. (The \code{for} loop assigns \code{dontcare} and \code{i} to the two fields of each term in the list, but we only need the index value.)

Another way to implement this is to store the original data as the second term in the \code{offsets} list, as in:

\begin{verbatim}
>>> words = string.split("This is a test string from Andrew.")
>>> offsets = []
>>> for word in words:
...     offsets.append( (string.lower(word), word) )
...
>>> offsets.sort()
>>> new_words = []
>>> for word in offsets:
...     new_words.append(word[1])
...
>>> print new_words
\end{verbatim}
This isn't always appropriate because the second terms in the list (the word, in this example) will be compared when the first terms are the same. If this happens many times, then there will be the unneeded performance hit of comparing the two objects. This can be a large cost if most terms are the same and the objects define their own \method{__cmp__} method, but there will still be some overhead to determine if \method{__cmp__} is defined. Still, for large lists, or for lists where the comparison information is expensive to calculate, the last two examples are likely to be the fastest way to sort a list. It will not work on weakly sorted data, like complex numbers, but if you don't know what that means, you probably don't need to worry about it. \section{Comparing classes} The comparison for two basic data types, like ints to ints or string to string, is built into Python and makes sense. There is a default way to compare class instances, but the default manner isn't usually very useful. You can define your own comparison with the \method{__cmp__} method, as in: \begin{verbatim}
>>> class Spam:
...     def __init__(self, spam, eggs):
...         self.spam = spam
...         self.eggs = eggs
...     def __cmp__(self, other):
...         return cmp(self.spam+self.eggs, other.spam+other.eggs)
...     def __str__(self):
...         return str(self.spam + self.eggs)
...
>>> a = [Spam(1, 4), Spam(9, 3), Spam(4,6)]
>>> a.sort()
>>> for spam in a:
...     print str(spam)
...
5
10
12
\end{verbatim}
Sometimes you may want to sort by a specific attribute of a class. If appropriate you should just define the \method{__cmp__} method to compare those values, but you cannot do this if you want to compare between different attributes at different times. Instead, you'll need to go back to passing a comparison function to sort, as in: \begin{verbatim}
>>> a = [Spam(1, 4), Spam(9, 3), Spam(4,6)]
>>> a.sort(lambda x, y: cmp(x.eggs, y.eggs))
>>> for spam in a:
...     print spam.eggs, str(spam)
...
3 12
4 5
6 10
\end{verbatim}
If you want to compare two arbitrary attributes (and aren't overly concerned about performance) you can even define your own comparison function object. This uses the ability of a class instance to emulate a function by defining the \method{__call__} method, as in:

\begin{verbatim}
>>> class CmpAttr:
...     def __init__(self, attr):
...         self.attr = attr
...     def __call__(self, x, y):
...         return cmp(getattr(x, self.attr), getattr(y, self.attr))
...
>>> a = [Spam(1, 4), Spam(9, 3), Spam(4,6)]
>>> a.sort(CmpAttr("spam"))  # sort by the "spam" attribute
>>> for spam in a:
...     print spam.spam, spam.eggs, str(spam)
...
1 4 5
4 6 10
9 3 12
>>> a.sort(CmpAttr("eggs"))  # re-sort by the "eggs" attribute
>>> for spam in a:
...     print spam.spam, spam.eggs, str(spam)
...
9 3 12
1 4 5
4 6 10
\end{verbatim}
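The technique used for the \code{offsets} list earlier carries over to class instances. What follows is a rough sketch, not code from the examples above: \code{sort_by_attr} is a hypothetical helper name, and the \code{Spam} class is repeated without its comparison methods so that the block stands alone. Each tuple holds the attribute value, the original index, and the object; because the index is unique, two objects are never compared with each other, even when their attribute values are equal.

```python
class Spam:
    # Same shape as the Spam class above, minus the comparison methods.
    def __init__(self, spam, eggs):
        self.spam = spam
        self.eggs = eggs

def sort_by_attr(items, attr):
    # Hypothetical helper: build an intermediate list, sort it, unpack it.
    decorated = []
    for i in range(len(items)):
        # The index i breaks ties before the objects themselves are reached.
        decorated.append((getattr(items[i], attr), i, items[i]))
    decorated.sort()
    result = []
    for entry in decorated:
        result.append(entry[2])
    return result

a = [Spam(1, 4), Spam(9, 3), Spam(4, 6)]
by_eggs = sort_by_attr(a, "eggs")
print([s.eggs for s in by_eggs])
```

The original list is left untouched; the helper returns a new list in sorted order.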
Of course, if you want a faster sort you can extract the attributes into an intermediate list and sort that list. So, there you have it; about a half-dozen different ways to define how to sort a list: \begin{itemize} \item sort using the default method \item sort using a comparison function \item reverse sort not using a comparison function \item sort on an intermediate list (two forms) \item sort using class defined __cmp__ method \item sort using a sort function object \end{itemize} \end{document} % LocalWords: maxint --- NEW FILE: unicode.rst --- Unicode HOWTO ================ **Version 1.02** This HOWTO discusses Python's support for Unicode, and explains various problems that people commonly encounter when trying to work with Unicode. Introduction to Unicode ------------------------------ History of Character Codes '''''''''''''''''''''''''''''' In 1968, the American Standard Code for Information Interchange, better known by its acronym ASCII, was standardized. ASCII defined numeric codes for various characters, with the numeric values running from 0 to 127. For example, the lowercase letter 'a' is assigned 97 as its code value. ASCII was an American-developed standard, so it only defined unaccented characters. There was an 'e', but no 'é' or 'Í'. This meant that languages which required accented characters couldn't be faithfully represented in ASCII. (Actually the missing accents matter for English, too, which contains words such as 'naïve' and 'café', and some publications have house styles which require spellings such as 'coöperate'.) For a while people just wrote programs that didn't display accents. I remember looking at Apple ][ BASIC programs, published in French-language publications in the mid-1980s, that had lines like these:: PRINT "FICHER EST COMPLETE." PRINT "CARACTERE NON ACCEPTE." Those messages should contain accents, and they just look wrong to someone who can read French. 
In the 1980s, almost all personal computers were 8-bit, meaning that bytes could hold values ranging from 0 to 255. ASCII codes only went up to 127, so some machines assigned values between 128 and 255 to accented characters. Different machines had different codes, however, which led to problems exchanging files. Eventually various commonly used sets of values for the 128-255 range emerged. Some were true standards, defined by the International Standards Organization, and some were **de facto** conventions that were invented by one company or another and managed to catch on. 255 characters aren't very many. For example, you can't fit both the accented characters used in Western Europe and the Cyrillic alphabet used for Russian into the 128-255 range because there are more than 127 such characters. You could write files using different codes (all your Russian files in a coding system called KOI8, all your French files in a different coding system called Latin1), but what if you wanted to write a French document that quotes some Russian text? In the 1980s people began to want to solve this problem, and the Unicode standardization effort began. Unicode started out using 16-bit characters instead of 8-bit characters. 16 bits means you have 2^16 = 65,536 distinct values available, making it possible to represent many different characters from many different alphabets; an initial goal was to have Unicode contain the alphabets for every single human language. It turns out that even 16 bits isn't enough to meet that goal, and the modern Unicode specification uses a wider range of codes, 0-1,114,111 (0x10ffff in base-16). There's a related ISO standard, ISO 10646. Unicode and ISO 10646 were originally separate efforts, but the specifications were merged with the 1.1 revision of Unicode. (This discussion of Unicode's history is highly simplified. 
I don't think the average Python programmer needs to worry about the historical details; consult the Unicode consortium site listed in the References for more information.) Definitions '''''''''''''''''''''''' A **character** is the smallest possible component of a text. 'A', 'B', 'C', etc., are all different characters. So are 'È' and 'Í'. Characters are abstractions, and vary depending on the language or context you're talking about. For example, the symbol for ohms (Ω) is usually drawn much like the capital letter omega (Ω) in the Greek alphabet (they may even be the same in some fonts), but these are two different characters that have different meanings. The Unicode standard describes how characters are represented by **code points**. A code point is an integer value, usually denoted in base 16. In the standard, a code point is written using the notation U+12ca to mean the character with value 0x12ca (4810 decimal). The Unicode standard contains a lot of tables listing characters and their corresponding code points:: 0061 'a'; LATIN SMALL LETTER A 0062 'b'; LATIN SMALL LETTER B 0063 'c'; LATIN SMALL LETTER C ... 007B '{'; LEFT CURLY BRACKET Strictly, these definitions imply that it's meaningless to say 'this is character U+12ca'. U+12ca is a code point, which represents some particular character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'. In informal contexts, this distinction between code points and characters will sometimes be forgotten. A character is represented on a screen or on paper by a set of graphical elements that's called a **glyph**. The glyph for an uppercase A, for example, is two diagonal strokes and a horizontal stroke, though the exact details will depend on the font being used. Most Python code doesn't need to worry about glyphs; figuring out the correct glyph to display is generally the job of a GUI toolkit or a terminal's font renderer. 
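
The relationship between a character, its code point and its name can be checked from Python. This is only a small sketch using the ``unicodedata`` module; the code points U+2126 and U+03A9 for the ohm sign and the capital omega are not given in the text above and are supplied here as an illustration::

```python
import unicodedata

ch = u'\u12ca'                       # the code point written U+12ca above
code_point = ord(ch)                 # its integer value
notation = 'U+%04x' % code_point     # the standard's U+xxxx notation
name = unicodedata.name(ch)          # the character the code point represents

# Assumed code points for the two look-alike characters mentioned above.
ohm = u'\u2126'
omega = u'\u03a9'

print('%s = %d: %s' % (notation, code_point, name))
print('%s / %s' % (unicodedata.name(ohm), unicodedata.name(omega)))
```

The two look-alike characters compare as unequal strings, which is the practical consequence of their being different characters.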
Encodings
'''''''''

To summarize the previous section: a Unicode string is a sequence of code points, which are numbers from 0 to 0x10ffff. This sequence needs to be represented as a sequence of bytes (meaning, values from 0-255) in memory. The rules for translating a Unicode string into a sequence of bytes are called an **encoding**.

The first encoding you might think of is an array of 32-bit integers. In this representation, the string "Python" would look like this::

       P           y           t           h           o           n
    0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

This representation is straightforward but using it presents a number of problems.

1. It's not portable; different processors order the bytes differently.

2. It's very wasteful of space. In most texts, the majority of the code points are less than 127, or less than 255, so a lot of space is occupied by zero bytes. The above string takes 24 bytes compared to the 6 bytes needed for an ASCII representation. Increased RAM usage doesn't matter too much (desktop computers have megabytes of RAM, and strings aren't usually that large), but expanding our usage of disk and network bandwidth by a factor of 4 is intolerable.

3. It's not compatible with existing C functions such as ``strlen()``, so a new family of wide string functions would need to be used.

4. Many Internet standards are defined in terms of textual data, and can't handle content with embedded zero bytes.

Generally people don't use this encoding, choosing other encodings that are more efficient and convenient.

Encodings don't have to handle every possible Unicode character, and most encodings don't. For example, Python's default encoding is the 'ascii' encoding. The rules for converting a Unicode string into the ASCII encoding are simple; for each code point:

1. If the code point is <128, each byte is the same as the value of the code point.

2.
If the code point is 128 or greater, the Unicode string can't be represented in this encoding. (Python raises a ``UnicodeEncodeError`` exception in this case.) Latin-1, also known as ISO-8859-1, is a similar encoding. Unicode code points 0-255 are identical to the Latin-1 values, so converting to this encoding simply requires converting code points to byte values; if a code point larger than 255 is encountered, the string can't be encoded into Latin-1. Encodings don't have to be simple one-to-one mappings like Latin-1. Consider IBM's EBCDIC, which was used on IBM mainframes. Letter values weren't in one block: 'a' through 'i' had values from 129 to 137, but 'j' through 'r' were 145 through 153. If you wanted to use EBCDIC as an encoding, you'd probably use some sort of lookup table to perform the conversion, but this is largely an internal detail. UTF-8 is one of the most commonly used encodings. UTF stands for "Unicode Transformation Format", and the '8' means that 8-bit numbers are used in the encoding. (There's also a UTF-16 encoding, but it's less frequently used than UTF-8.) UTF-8 uses the following rules: 1. If the code point is <128, it's represented by the corresponding byte value. 2. If the code point is between 128 and 0x7ff, it's turned into two byte values between 128 and 255. 3. Code points >0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255. UTF-8 has several convenient properties: 1. It can handle any Unicode code point. 2. A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as ``strcpy()`` and sent through protocols that can't handle zero bytes. 3. A string of ASCII text is also valid UTF-8 text. 4. UTF-8 is fairly compact; the majority of code points are turned into two bytes, and values less than 128 occupy only a single byte. 5. 
If bytes are corrupted or lost, it's possible to determine the start of the next UTF-8-encoded code point and resynchronize. It's also unlikely that random 8-bit data will look like valid UTF-8. References '''''''''''''' The Unicode Consortium site at <http://www.unicode.org> has character charts, a glossary, and PDF versions of the Unicode specification. Be prepared for some difficult reading. <http://www.unicode.org/history/> is a chronology of the origin and development of Unicode. To help understand the standard, Jukka Korpela has written an introductory guide to reading the Unicode character tables, available at <http://www.cs.tut.fi/~jkorpela/unicode/guide.html>. Roman Czyborra wrote another explanation of Unicode's basic principles; it's at <http://czyborra.com/unicode/characters.html>. Czyborra has written a number of other Unicode-related documentation, available from <http://www.cyzborra.com>. Two other good introductory articles were written by Joel Spolsky <http://www.joelonsoftware.com/articles/Unicode.html> and Jason Orendorff <http://www.jorendorff.com/articles/unicode/>. If this introduction didn't make things clear to you, you should try reading one of these alternate articles before continuing. Wikipedia entries are often helpful; see the entries for "character encoding" <http://en.wikipedia.org/wiki/Character_encoding> and UTF-8 <http://en.wikipedia.org/wiki/UTF-8>, for example. Python's Unicode Support ------------------------ Now that you've learned the rudiments of Unicode, we can look at Python's Unicode features. The Unicode Type ''''''''''''''''''' Unicode strings are expressed as instances of the ``unicode`` type, one of Python's repertoire of built-in types. It derives from an abstract type called ``basestring``, which is also an ancestor of the ``str`` type; you can therefore check if a value is a string type with ``isinstance(value, basestring)``. 
Under the hood, Python represents Unicode strings as either 16- or 32-bit integers, depending on how the Python interpreter was compiled.

The ``unicode()`` constructor has the signature ``unicode(string[, encoding, errors])``. All of its arguments should be 8-bit strings. The first argument is converted to Unicode using the specified encoding; if you leave off the ``encoding`` argument, the ASCII encoding is used for the conversion, so characters greater than 127 will be treated as errors::

    >>> unicode('abcdef')
    u'abcdef'
    >>> s = unicode('abcdef')
    >>> type(s)
    <type 'unicode'>
    >>> unicode('abcdef' + chr(255))
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6: ordinal not in range(128)

The ``errors`` argument specifies the response when the input string can't be converted according to the encoding's rules. Legal values for this argument are 'strict' (raise a ``UnicodeDecodeError`` exception), 'replace' (add U+FFFD, 'REPLACEMENT CHARACTER'), or 'ignore' (just leave the character out of the Unicode result). The following examples show the differences::

    >>> unicode('\x80abc', errors='strict')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
    >>> unicode('\x80abc', errors='replace')
    u'\ufffdabc'
    >>> unicode('\x80abc', errors='ignore')
    u'abc'

Encodings are specified as strings containing the encoding's name. Python 2.4 comes with roughly 100 different encodings; see the Python Library Reference at <http://docs.python.org/lib/standard-encodings.html> for a list. Some encodings have multiple names; for example, 'latin-1', 'iso_8859_1' and '8859' are all synonyms for the same encoding.
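
The three ``errors`` values and the encoding synonyms can be tried side by side. The sketch below makes two assumptions so that it also runs on newer Python versions: 8-bit strings are written with a ``b''`` prefix, and the conversion goes through the 8-bit string's ``.decode()`` method, which accepts the same ``encoding`` and ``errors`` values as ``unicode()``::

```python
raw = b'\x80abc'   # an 8-bit string whose first byte is outside ASCII

replaced = raw.decode('ascii', 'replace')   # bad byte becomes U+FFFD
ignored = raw.decode('ascii', 'ignore')     # bad byte is left out

try:
    raw.decode('ascii', 'strict')           # 'strict' raises an exception
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# The three synonyms named above select the same encoding.
byte = b'\xe9'
decoded = [byte.decode(name) for name in ('latin-1', 'iso_8859_1', '8859')]

print(repr(replaced))
print(repr(ignored))
print(strict_failed)
```

All three entries of ``decoded`` are the same one-character Unicode string, U+00E9.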
One-character Unicode strings can also be created with the ``unichr()`` built-in function, which takes integers and returns a Unicode string of length 1 that contains the corresponding code point. The reverse operation is the built-in ``ord()`` function that takes a one-character Unicode string and returns the code point value::

    >>> unichr(40960)
    u'\ua000'
    >>> ord(u'\ua000')
    40960

Instances of the ``unicode`` type have many of the same methods as the 8-bit string type for operations such as searching and formatting::

    >>> s = u'Was ever feather so lightly blown to and fro as this multitude?'
    >>> s.count('e')
    5
    >>> s.find('feather')
    9
    >>> s.find('bird')
    -1
    >>> s.replace('feather', 'sand')
    u'Was ever sand so lightly blown to and fro as this multitude?'
    >>> s.upper()
    u'WAS EVER FEATHER SO LIGHTLY BLOWN TO AND FRO AS THIS MULTITUDE?'

Note that the arguments to these methods can be Unicode strings or 8-bit strings. 8-bit strings will be converted to Unicode before carrying out the operation; Python's default ASCII encoding will be used, so characters greater than 127 will cause an exception::

    >>> s.find('Was\x9f')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 3: ordinal not in range(128)
    >>> s.find(u'Was\x9f')
    -1

Much Python code that operates on strings will therefore work with Unicode strings without requiring any changes to the code. (Input and output code needs more updating for Unicode; more on this later.)

Another important method is ``.encode([encoding], [errors='strict'])``, which returns an 8-bit string version of the Unicode string, encoded in the requested encoding. The ``errors`` parameter is the same as the parameter of the ``unicode()`` constructor, with one additional possibility; as well as 'strict', 'ignore', and 'replace', you can also pass 'xmlcharrefreplace' which uses XML's character references.
The following example shows the different results::

    >>> u = unichr(40960) + u'abcd' + unichr(1972)
    >>> u.encode('utf-8')
    '\xea\x80\x80abcd\xde\xb4'
    >>> u.encode('ascii')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
    >>> u.encode('ascii', 'ignore')
    'abcd'
    >>> u.encode('ascii', 'replace')
    '?abcd?'
    >>> u.encode('ascii', 'xmlcharrefreplace')
    '&#40960;abcd&#1972;'

Python's 8-bit strings have a ``.decode([encoding], [errors])`` method that interprets the string using the given encoding::

    >>> u = unichr(40960) + u'abcd' + unichr(1972)   # Assemble a string
    >>> utf8_version = u.encode('utf-8')             # Encode as UTF-8
    >>> type(utf8_version), utf8_version
    (<type 'str'>, '\xea\x80\x80abcd\xde\xb4')
    >>> u2 = utf8_version.decode('utf-8')            # Decode using UTF-8
    >>> u == u2                                      # The two strings match
    True

The low-level routines for registering and accessing the available encodings are found in the ``codecs`` module. However, the encoding and decoding functions returned by this module are usually more low-level than is comfortable, so I'm not going to describe the ``codecs`` module here. If you need to implement a completely new encoding, you'll need to learn about the ``codecs`` module interfaces, but implementing encodings is a specialized task that also won't be covered here. Consult the Python documentation to learn more about this module.

The most commonly used part of the ``codecs`` module is the ``codecs.open()`` function which will be discussed in the section on input and output.

Unicode Literals in Python Source Code
''''''''''''''''''''''''''''''''''''''''''

In Python source code, Unicode literals are written as strings prefixed with the 'u' or 'U' character: ``u'abcdefghijk'``. Specific code points can be written using the ``\u`` escape sequence, which is followed by four hex digits giving the code point.
The ``\U`` escape sequence is similar, but expects 8 hex digits, not 4.

Unicode literals can also use the same escape sequences as 8-bit
strings, including ``\x``, but ``\x`` only takes two hex digits so it
can't express an arbitrary code point. Octal escapes can go up to
U+01ff, which is octal 777.

::

    >>> s = u"a\xac\u1234\u20ac\U00008000"
               ^^^^ two-digit hex escape
                   ^^^^^^ four-digit Unicode escape
                               ^^^^^^^^^^ eight-digit Unicode escape
    >>> for c in s:  print ord(c),
    ...
    97 172 4660 8364 32768

Using escape sequences for code points greater than 127 is fine in
small doses, but becomes an annoyance if you're using many accented
characters, as you would in a program with messages in French or some
other accent-using language. You can also assemble strings using the
``unichr()`` built-in function, but this is even more tedious.

Ideally, you'd want to be able to write literals in your language's
natural encoding. You could then edit Python source code with your
favorite editor which would display the accented characters naturally,
and have the right characters used at runtime.

Python supports writing Unicode literals in any encoding, but you have
to declare the encoding being used. This is done by including a special
comment as either the first or second line of the source file::

    #!/usr/bin/env python
    # -*- coding: latin-1 -*-

    u = u'abcdé'
    print ord(u[-1])

The syntax is inspired by Emacs's notation for specifying variables
local to a file. Emacs supports many different variables, but Python
only supports 'coding'. The ``-*-`` symbols indicate that the comment
is special; within them, you must supply the name ``coding`` and the
name of your chosen encoding, separated by ``':'``.

If you don't include such a comment, the default encoding used will be
ASCII. Versions of Python before 2.4 were Euro-centric and assumed
Latin-1 as a default encoding for string literals; in Python 2.4,
characters greater than 127 still work but result in a warning.
For example, the following program has no encoding declaration::

    #!/usr/bin/env python
    u = u'abcdé'
    print ord(u[-1])

When you run it with Python 2.4, it will output the following warning::

    amk:~$ python p263.py
    sys:1: DeprecationWarning: Non-ASCII character '\xe9'
         in file p263.py on line 2, but no encoding declared;
         see http://www.python.org/peps/pep-0263.html for details

Unicode Properties
'''''''''''''''''''

The Unicode specification includes a database of information about
code points. For each code point that's defined, the information
includes the character's name, its category, and the numeric value if
applicable (Unicode has characters representing the Roman numerals and
fractions such as one-third and four-fifths). There are also
properties related to the code point's use in bidirectional text and
other display-related properties.

The following program displays some information about several
characters, and prints the numeric value of one particular character::

    import unicodedata

    u = unichr(233) + unichr(0x0bf2) + unichr(3972) + unichr(6000) + unichr(13231)

    for i, c in enumerate(u):
        print i, '%04x' % ord(c), unicodedata.category(c),
        print unicodedata.name(c)

    # Get numeric value of second character
    print unicodedata.numeric(u[1])

When run, this prints::

    0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
    1 0bf2 No TAMIL NUMBER ONE THOUSAND
    2 0f84 Mn TIBETAN MARK HALANTA
    3 1770 Lo TAGBANWA LETTER SA
    4 33af So SQUARE RAD OVER S SQUARED
    1000.0

The category codes are abbreviations describing the nature of the
character. These are grouped into categories such as "Letter",
"Number", "Punctuation", or "Symbol", which in turn are broken up into
subcategories. To take the codes from the above output, ``'Ll'`` means
"Letter, lowercase", ``'No'`` means "Number, other", ``'Mn'`` is
"Mark, nonspacing", ``'Lo'`` is "Letter, other", and ``'So'`` is
"Symbol, other". See
<http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values>
for a list of category codes.
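Because the first letter of each two-letter code names the major
category, you can group characters without memorizing every
subcategory. The following is only a sketch: the ``MAJOR`` table and
the ``describe()`` helper are names I've made up for illustration, not
part of ``unicodedata``, and the code is written so that it should also
run on later Python versions that lack ``unichr()``.

```python
import unicodedata

# Hypothetical lookup table: first letter of a category code -> major category.
MAJOR = {
    'L': 'Letter', 'M': 'Mark', 'N': 'Number', 'P': 'Punctuation',
    'S': 'Symbol', 'Z': 'Separator', 'C': 'Other',
}

def describe(ch):
    """Return (category code, major category name) for one character."""
    code = unicodedata.category(ch)
    return code, MAJOR[code[0]]

# The same five characters as the program above, written as escapes.
sample = u'\xe9\u0bf2\u0f84\u1770\u33af'
summary = [describe(ch) for ch in sample]
# The codes match the listing above: Ll, No, Mn, Lo, So.
```

Only ``unicodedata.category()`` comes from the library; everything
else is ordinary dictionary lookup.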
References
''''''''''''''

The Unicode and 8-bit string types are described in the Python library
reference at <http://docs.python.org/lib/typesseq.html>.

The documentation for the ``unicodedata`` module is at
<http://docs.python.org/lib/module-unicodedata.html>.

The documentation for the ``codecs`` module is at
<http://docs.python.org/lib/module-codecs.html>.

Marc-André Lemburg gave a presentation at EuroPython 2002 titled
"Python and Unicode". A PDF version of his slides is available at
<http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf>, and is
an excellent overview of the design of Python's Unicode features.

Reading and Writing Unicode Data
----------------------------------------

Once you've written some code that works with Unicode data, the next
problem is input/output. How do you get Unicode strings into your
program, and how do you convert Unicode into a form suitable for
storage or transmission?

Depending on your input sources and output destinations, you may not
need to do anything; you should check whether the libraries used in
your application support Unicode natively. XML parsers often return
Unicode data, for example. Many relational databases also support
Unicode-valued columns and can return Unicode values from an SQL query.

Unicode data is usually converted to a particular encoding before it
gets written to disk or sent over a socket. It's possible to do all
the work yourself: open a file, read an 8-bit string from it, and
convert the string with ``unicode(str, encoding)``. However, the manual
approach is not recommended.

One problem is the multi-byte nature of encodings; one Unicode
character can be represented by several bytes. If you want to read the
file in arbitrary-sized chunks (say, 1K or 4K), you need to write
error-handling code to catch the case where only some of the bytes
encoding a single Unicode character have been read at the end of a
chunk.
One solution would be to read the entire file into memory and then
perform the decoding, but that prevents you from working with files
that are extremely large; if you need to read a 2 GB file, you need
2 GB of RAM. (More, really, since for at least a moment you'd need to
have both the encoded string and its Unicode version in memory.)

The solution would be to use the low-level decoding interface to catch
the case of partial coding sequences. The work of implementing this has
already been done for you: the ``codecs`` module includes a version of
the ``open()`` function that returns a file-like object that assumes
the file's contents are in a specified encoding. Its ``.read()`` method
returns Unicode strings, and its ``.write()`` method accepts them.

The function's parameters are ``open(filename, mode='rb',
encoding=None, errors='strict', buffering=1)``. ``mode`` can be
``'r'``, ``'w'``, or ``'a'``, just like the corresponding parameter to
the regular built-in ``open()`` function; add a ``'+'`` to update the
file. ``buffering`` is similarly parallel to the standard function's
parameter. ``encoding`` is a string giving the encoding to use; if it's
left as ``None``, a regular Python file object that accepts 8-bit
strings is returned. Otherwise, a wrapper object is returned, and data
written to or read from the wrapper object will be converted as needed.
``errors`` specifies the action for encoding errors and can be one of
the usual values of 'strict', 'ignore', and 'replace'.
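To make the chunking problem described earlier concrete, here is a
small sketch. It cuts a UTF-8 byte string in the middle of a
multi-byte character, shows that decoding the fragment on its own
fails, and then reads the same bytes two at a time through a stream
reader obtained from ``codecs.getreader()``. The in-memory
``io.BytesIO`` object is my stand-in for a real file, and the variable
names are made up for the example; it assumes a Python version recent
enough to have the ``io`` module and ``b''`` literals.

```python
import codecs
import io

text = u'\ua000abcd\u07b4'
data = text.encode('utf-8')      # 9 bytes: 3 + 4 + 2

# A chunk boundary after byte 2 splits the three-byte sequence for U+A000.
first_chunk, rest = data[:2], data[2:]
try:
    first_chunk.decode('utf-8')
    split_decodes = True
except UnicodeDecodeError:
    split_decodes = False        # the fragment is not valid UTF-8 by itself

# A stream reader keeps undecoded trailing bytes until more data arrives.
reader = codecs.getreader('utf-8')(io.BytesIO(data))
pieces = []
while True:
    piece = reader.read(2)       # small reads that cross character boundaries
    if not piece:
        break
    pieces.append(piece)

round_trip = u''.join(pieces)    # equal to the original text
```

The wrapper object returned by ``codecs.open()`` performs the same
kind of buffering when it reads from a real file, so you don't have to
write this loop yourself.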
Reading Unicode from a file is therefore simple::

    import codecs
    f = codecs.open('unicode.rst', encoding='utf-8')
    for line in f:
        print repr(line)

It's also possible to open files in update mode, allowing both reading
and writing::

    f = codecs.open('test', encoding='utf-8', mode='w+')
    f.write(u'\u4500 blah blah blah\n')
    f.seek(0)
    print repr(f.readline()[:1])
    f.close()

Unicode character U+FEFF is used as a byte-order mark (BOM), and is
often written as the first character of a file in order to assist with
autodetection of the file's byte ordering. Some encodings, such as
UTF-16, expect a BOM to be present at the start of a file; when such an
encoding is used, the BOM will be automatically written as the first
character and will be silently dropped when the file is read. There are
variants of these encodings, such as 'utf-16-le' and 'utf-16-be' for
little-endian and big-endian encodings, that specify one particular
byte ordering and don't skip the BOM.

Unicode filenames
'''''''''''''''''''''''''

Most of the operating systems in common use today support filenames
that contain arbitrary Unicode characters. Usually this is implemented
by converting the Unicode string into some encoding that varies
depending on the system. For example, Mac OS X uses UTF-8 while Windows
uses a configurable encoding; on Windows, Python uses the name "mbcs"
to refer to whatever the currently configured encoding is. On Unix
systems, there will only be a filesystem encoding if you've set the
``LANG`` or ``LC_CTYPE`` environment variables; if you haven't, the
default encoding is ASCII.

The ``sys.getfilesystemencoding()`` function returns the encoding to
use on your current system, in case you want to do the encoding
manually, but there's not much reason to bother.
When opening a file for reading or writing, you can usually just
provide the Unicode string as the filename, and it will be
automatically converted to the right encoding for you::

    filename = u'filename\u4500abc'
    f = open(filename, 'w')
    f.write('blah\n')
    f.close()

Functions in the ``os`` module such as ``os.stat()`` will also accept
Unicode filenames.

``os.listdir()``, which returns filenames, raises an issue: should it
return the Unicode version of filenames, or should it return 8-bit
strings containing the encoded versions? ``os.listdir()`` will do both,
depending on whether you provided the directory path as an 8-bit string
or a Unicode string. If you pass a Unicode string as the path,
filenames will be decoded using the filesystem's encoding and a list of
Unicode strings will be returned, while passing an 8-bit path will
return the 8-bit versions of the filenames. For example, assuming the
default filesystem encoding is UTF-8, running the following program::

    fn = u'filename\u4500abc'
    f = open(fn, 'w')
    f.close()

    import os
    print os.listdir('.')
    print os.listdir(u'.')

will produce the following output::

    amk:~$ python t.py
    ['.svn', 'filename\xe4\x94\x80abc', ...]
    [u'.svn', u'filename\u4500abc', ...]

The first list contains UTF-8-encoded filenames, and the second list
contains the Unicode versions.

Tips for Writing Unicode-aware Programs
''''''''''''''''''''''''''''''''''''''''''''

This section provides some suggestions on writing software that deals
with Unicode.

The most important tip is:

    Software should only work with Unicode strings internally,
    converting to a particular encoding on output.

If you attempt to write processing functions that accept both Unicode
and 8-bit strings, you will find your program vulnerable to bugs
wherever you combine the two different kinds of strings.
Python's default encoding is ASCII, so whenever a character with a
value >127 is in the input data, you'll get a ``UnicodeDecodeError``
because that character can't be handled by the ASCII encoding.

It's easy to miss such problems if you only test your software with
data that doesn't contain any accents; everything will seem to work,
but there's actually a bug in your program waiting for the first user
who attempts to use characters >127. A second tip, therefore, is:

    Include characters >127 and, even better, characters >255 in your
    test data.

When using data coming from a web browser or some other untrusted
source, a common technique is to check for illegal characters in a
string before using the string in a generated command line or storing
it in a database. If you're doing this, be careful to check the string
once it's in the form that will be used or stored; it's possible for
encodings to be used to disguise characters. This is especially true if
the input data also specifies the encoding; many encodings leave the
commonly checked-for characters alone, but Python includes some
encodings such as ``'base64'`` that modify every single character.

For example, let's say you have a content management system that takes
an encoded filename together with the name of its encoding, and you
want to disallow paths with a '/' character. You might write this
code::

    def read_file(filename, encoding):
        if '/' in filename:
            raise ValueError("'/' not allowed in filenames")
        unicode_name = filename.decode(encoding)
        f = open(unicode_name, 'r')
        # ... return contents of file ...

However, if an attacker could specify the ``'base64'`` encoding, they
could pass ``'L2V0Yy9wYXNzd2Q='``, which is the base-64 encoded form of
the string ``'/etc/passwd'``, to read a system file. The above code
looks for ``'/'`` characters in the encoded form and misses the
dangerous character in the resulting decoded form.
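One way to avoid the problem is to decode first and check the decoded
text. The sketch below is only an illustration: ``decode_name()``,
``from_base64()`` and ``from_utf8()`` are hypothetical helpers, and the
standard ``base64`` module stands in for the ``'base64'`` encoding
named above so that the example also runs on Python versions where
8-bit strings have no such codec.

```python
import base64

def decode_name(raw_name, decoder):
    """Decode first, then validate the text that will actually be used."""
    name = decoder(raw_name)
    if u'/' in name:
        raise ValueError("'/' not allowed in filenames")
    return name

def from_base64(raw_name):
    # Stand-in for an attacker-selected 'base64' encoding.
    return base64.b64decode(raw_name).decode('ascii')

def from_utf8(raw_name):
    return raw_name.decode('utf-8')

attack = b'L2V0Yy9wYXNzd2Q='          # encoded form of '/etc/passwd'
naive_check_passes = b'/' not in attack   # the pre-decode check sees nothing

try:
    decode_name(attack, from_base64)
    rejected = False
except ValueError:
    rejected = True                   # the '/' is caught after decoding

harmless = decode_name(u'notes\xe9.txt'.encode('utf-8'), from_utf8)
```

In a real application you would probably also restrict which encodings
a caller may name at all, rather than accepting any codec.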
References
''''''''''''''

The PDF slides for Marc-André Lemburg's presentation "Writing
Unicode-aware Applications in Python" are available at
<http://www.egenix.com/files/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>
and discuss questions of character encodings as well as how to
internationalize and localize an application.

Revision History and Acknowledgements
------------------------------------------

Thanks to the following people who have noted errors or offered
suggestions on this article: Nicholas Bastin, Marius Gedminas, Kent
Johnson, Ken Krugler, Marc-André Lemburg, Martin von Löwis.

Version 1.0: posted August 5 2005.

Version 1.01: posted August 7 2005. Corrects factual and markup
errors; adds several links.

Version 1.02: posted August 16 2005. Corrects factual errors.

.. comment Additional topic: building Python w/ UCS2 or UCS4 support
.. comment Describe obscure -U switch somewhere?

.. comment
   Original outline:

   - [ ] Unicode introduction
       - [ ] ASCII
       - [ ] Terms
           - [ ] Character
           - [ ] Code point
       - [ ] Encodings
           - [ ] Common encodings: ASCII, Latin-1, UTF-8
   - [ ] Unicode Python type
       - [ ] Writing unicode literals
           - [ ] Obscurity: -U switch
       - [ ] Built-ins
           - [ ] unichr()
           - [ ] ord()
           - [ ] unicode() constructor
       - [ ] Unicode type
           - [ ] encode(), decode() methods
   - [ ] Unicodedata module for character properties
   - [ ] I/O
       - [ ] Reading/writing Unicode data into files
           - [ ] Byte-order marks
       - [ ] Unicode filenames
   - [ ] Writing Unicode programs
       - [ ] Do everything in Unicode
       - [ ] Declaring source code encodings (PEP 263)
   - [ ] Other issues
       - [ ] Building Python (UCS2, UCS4)