[Python-checkins] python/dist/src/Doc/howto Makefile, NONE, 1.1 advocacy.tex, NONE, 1.1 curses.tex, NONE, 1.1 doanddont.tex, NONE, 1.1 regex.tex, NONE, 1.1 rexec.tex, NONE, 1.1 sockets.tex, NONE, 1.1 sorting.tex, NONE, 1.1 unicode.rst, NONE, 1.1

akuchling@users.sourceforge.net akuchling at users.sourceforge.net
Tue Aug 30 03:25:18 CEST 2005


Update of /cvsroot/python/python/dist/src/Doc/howto
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv32499

Added Files:
	Makefile advocacy.tex curses.tex doanddont.tex regex.tex 
	rexec.tex sockets.tex sorting.tex unicode.rst 
Log Message:
Commit the howto source to the main Python repository, with Fred's approval

--- NEW FILE: Makefile ---

MKHOWTO=../tools/mkhowto
WEBDIR=.
RSTARGS = --input-encoding=utf-8
VPATH=.:dvi:pdf:ps:txt

# List of HOWTOs that aren't to be processed

REMOVE_HOWTO =

# Determine list of files to be built

HOWTO=$(filter-out $(REMOVE_HOWTO),$(wildcard *.tex))
RST_SOURCES =	$(shell echo *.rst)
DVI  =$(patsubst %.tex,%.dvi,$(HOWTO))
PDF  =$(patsubst %.tex,%.pdf,$(HOWTO))
PS   =$(patsubst %.tex,%.ps,$(HOWTO))
TXT  =$(patsubst %.tex,%.txt,$(HOWTO))
HTML =$(patsubst %.tex,%,$(HOWTO))

# Rules for building various formats
%.dvi : %.tex
	$(MKHOWTO) --dvi $<
	mv $@ dvi

%.pdf : %.tex
	$(MKHOWTO) --pdf $<
	mv $@ pdf

%.ps : %.tex
	$(MKHOWTO) --ps $<
	mv $@ ps

%.txt : %.tex
	$(MKHOWTO) --text $<
	mv $@ txt

% : %.tex
	$(MKHOWTO) --html --iconserver="." $<
	tar -zcvf html/$*.tgz $*
	#zip -r html/$*.zip $*

default:
	@echo "'all'    -- build all files"
	@echo "'dvi', 'pdf', 'ps', 'txt', 'html' -- build one format"

all: $(HTML)

.PHONY : dvi pdf ps txt html rst
dvi: $(DVI)

pdf: $(PDF)
ps:  $(PS)
txt: $(TXT)
html:$(HTML)

# Rule to build collected tar files
dist: #all
	for i in dvi pdf ps txt ; do \
	    cd $$i ; \
	    tar -zcf All.tgz *.$$i ;\
	    cd .. ;\
	done

# Rule to copy files to the Web tree on AMK's machine
web: dist
	cp dvi/* $(WEBDIR)/dvi
	cp ps/* $(WEBDIR)/ps
	cp pdf/* $(WEBDIR)/pdf
	cp txt/* $(WEBDIR)/txt
	for dir in $(HTML) ; do cp -rp $$dir $(WEBDIR) ; done
	for ltx in $(HOWTO) ; do cp -p $$ltx $(WEBDIR)/latex ; done

rst: unicode.html

%.html: %.rst
	rst2html $(RSTARGS) $< >$@

clean:
	rm -f *~ *.log *.ind *.l2h *.aux *.toc *.how
	rm -f *.dvi *.ps *.pdf *.bkm
	rm -f unicode.html

clobber:
	rm dvi/* ps/* pdf/* txt/* html/*




--- NEW FILE: advocacy.tex ---

\documentclass{howto}

\title{Python Advocacy HOWTO}

\release{0.03}

\author{A.M. Kuchling}
\authoraddress{\email{amk at amk.ca}}

\begin{document}
\maketitle

\begin{abstract}
\noindent
It's usually difficult to get your management to accept open source
software, and Python is no exception to this rule.  This document
discusses reasons to use Python, strategies for winning acceptance,
facts and arguments you can use, and cases where you \emph{shouldn't}
try to use Python.

This document is available from the Python HOWTO page at
\url{http://www.python.org/doc/howto}.

\end{abstract}

\tableofcontents

\section{Reasons to Use Python}

There are several reasons to incorporate a scripting language into
your development process, and this section will discuss them, and why
Python has some properties that make it a particularly good choice.

 \subsection{Programmability}

Programs are often organized in a modular fashion.  Lower-level
operations are grouped together, and called by higher-level functions,
which may in turn be used as basic operations by still further upper
levels.  

For example, the lowest level might define a very low-level
set of functions for accessing a hash table.  The next level might use
hash tables to store the headers of a mail message, mapping a header
name like \samp{Date} to a value such as \samp{Tue, 13 May 1997
20:00:54 -0400}.  A yet higher level may operate on message objects,
without knowing or caring that message headers are stored in a hash
table, and so forth.  
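
The layering described above might be sketched in Python as follows.
This is only an illustration: \code{parse_headers()} and the
\class{Message} class are hypothetical names invented for this
example, not part of any library.

```python
# Hypothetical illustration of layered design; none of these names
# come from a real library.

def parse_headers(text):
    """Middle level: store 'Name: value' lines in a hash table (a dict)."""
    headers = {}
    for line in text.splitlines():
        if ':' in line:
            # Split on the first colon only, so values may contain colons.
            name, value = line.split(':', 1)
            headers[name.strip()] = value.strip()
    return headers


class Message:
    """Higher level: callers ask for headers without knowing how
    they are stored."""

    def __init__(self, text):
        self._headers = parse_headers(text)

    def get_header(self, name, default=None):
        return self._headers.get(name, default)


msg = Message("Date: Tue, 13 May 1997 20:00:54 -0400\nSubject: Hello")
print(msg.get_header("Date"))
```

Code that uses \class{Message} never touches the dictionary directly,
so the storage could later be replaced without changing the callers.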

Often, the lowest levels do very simple things; they implement a data
structure such as a binary tree or hash table, or they perform some
simple computation, such as converting a date string to a number.  The
higher levels then contain logic connecting these primitive
operations.  Using this approach, the primitives can be seen as basic
building blocks which are then glued together to produce the complete
product.  

Why is this design approach relevant to Python?  Because Python is
well suited to functioning as such a glue language.  A common approach
is to write a Python module that implements the lower level
operations; for the sake of speed, the implementation might be in C,
Java, or even Fortran.  Once the primitives are available to Python
programs, the logic underlying higher level operations is written in
the form of Python code.  The high-level logic is then more
understandable, and easier to modify.

John Ousterhout wrote a paper that explains this idea at greater
length, entitled ``Scripting: Higher Level Programming for the 21st
Century''.  I recommend that you read this paper; see the references
for the URL.  Ousterhout is the inventor of the Tcl language, and
therefore argues that Tcl should be used for this purpose; he only
briefly refers to other languages such as Python, Perl, and
Lisp/Scheme, but in reality, Ousterhout's argument applies to
scripting languages in general, since you could equally write
extensions for any of the languages mentioned above.

 \subsection{Prototyping}

In \emph{The Mythical Man-Month}, Frederick Brooks suggests the
following rule when planning software projects: ``Plan to throw one
away; you will, anyhow.''  Brooks is saying that the first attempt at a
software design often turns out to be wrong; unless the problem is
very simple or you're an extremely good designer, you'll find that new
requirements and features become apparent once development has
actually started.  If these new requirements can't be cleanly
incorporated into the program's structure, you're presented with two
unpleasant choices: hammer the new features into the program somehow,
or scrap everything and write a new version of the program, taking the
new features into account from the beginning.

Python provides you with a good environment for quickly developing an
initial prototype.  That lets you get the overall program structure
and logic right, and you can fine-tune small details in the fast
development cycle that Python provides.  Once you're satisfied with
the user interface or program output, you can translate the Python code
into C++, Fortran, Java, or some other compiled language.

Prototyping means you have to be careful not to use too many Python
features that are hard to implement in your other language.  Using
\code{eval()}, or regular expressions, or the \module{pickle} module,
means that you're going to need C or Java libraries for formula
evaluation, regular expressions, and serialization, for example.  But
it's not hard to avoid such tricky code, and in the end the
translation usually isn't very difficult.  The resulting code can be
rapidly debugged, because any serious logical errors will have been
removed from the prototype, leaving only more minor slip-ups in the
translation to track down.  

This strategy builds on the earlier discussion of programmability.
Using Python as glue to connect lower-level components has obvious
relevance for constructing prototype systems.  In this way Python can
help you with development, even if end users never come in contact
with Python code at all.  If the performance of the Python version is
adequate and corporate politics allow it, you may not need to do a
translation into C or Java, but it can still be faster to develop a
prototype and then translate it, instead of attempting to produce the
final version immediately.

One example of this development strategy is Microsoft Merchant Server.
Version 1.0 was written in pure Python, by a company that subsequently
was purchased by Microsoft.  Version 2.0 began to translate the code
into \Cpp, shipping with some \Cpp code and some Python code.  Version
3.0 didn't contain any Python at all; all the code had been translated
into \Cpp.  Even though the product doesn't contain a Python
interpreter, the Python language has still served a useful purpose by
speeding up development.  

This is a very common use for Python.  Past conference papers have
also described this approach for developing high-level numerical
algorithms; see David M. Beazley and Peter S. Lomdahl's paper
``Feeding a Large-scale Physics Application to Python'' in the
references for a good example.  If an algorithm's basic operations are
things like ``Take the inverse of this 4000x4000 matrix'', and are
implemented in some lower-level language, then Python has almost no
additional performance cost; the extra time required for Python to
evaluate an expression like \code{m.invert()} is dwarfed by the cost
of the actual computation.  Prototyping in Python is particularly good
for applications where seemingly endless tweaking is required to get
things right; GUIs and Web sites are prime examples.

The Python code is also shorter and faster to write (once you're
familiar with Python), so it's easier to throw it away if you decide
your approach was wrong; if you'd spent two weeks working on it
instead of just two hours, you might waste time trying to patch up
what you've got out of a natural reluctance to admit that those two
weeks were wasted.  Truthfully, those two weeks haven't been wasted,
since you've learnt something about the problem and the technology
you're using to solve it, but it's human nature to view this as a
failure of some sort.

 \subsection{Simplicity and Ease of Understanding}

Python is definitely \emph{not} a toy language that's only usable for
small tasks.  The language features are general and powerful enough to
enable it to be used for many different purposes.  It's useful at the
small end, for 10- or 20-line scripts, but it also scales up to larger
systems that contain thousands of lines of code.

However, this expressiveness doesn't come at the cost of an obscure or
tricky syntax.  While Python has some dark corners that can lead to
obscure code, there are relatively few such corners, and proper design
can isolate their use to only a few classes or modules.  It's
certainly possible to write confusing code by using too many features
with too little concern for clarity, but most Python code can look a
lot like a slightly-formalized version of human-understandable
pseudocode.

In \emph{The New Hacker's Dictionary}, Eric S. Raymond gives the following
definition for ``compact'':

\begin{quotation}
	Compact \emph{adj.}  Of a design, describes the valuable property
	that it can all be apprehended at once in one's head. This
	generally means the thing created from the design can be used
	with greater facility and fewer errors than an equivalent tool
	that is not compact. Compactness does not imply triviality or
	lack of power; for example, C is compact and FORTRAN is not,
	but C is more powerful than FORTRAN. Designs become
	non-compact through accreting features and cruft that don't
	merge cleanly into the overall design scheme (thus, some fans
	of Classic C maintain that ANSI C is no longer compact).
\end{quotation}

(From \url{http://sagan.earthspace.net/jargon/jargon_18.html\#SEC25})

In this sense of the word, Python is quite compact, because the
language has just a few ideas, which are used in lots of places.  Take
namespaces, for example.  Import a module with \code{import math}, and
you create a new namespace called \samp{math}.  Classes are also
namespaces that share many of the properties of modules, and have a
few of their own; for example, you can create instances of a class.
Instances?  They're yet another namespace.  Namespaces are currently
implemented as Python dictionaries, so they have the same methods as
the standard dictionary data type: \method{keys()} returns all the keys, and
so forth.
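
A short sketch can make this concrete.  The \class{Point} class below
is a hypothetical example, and the built-in \function{vars()} is used
to look at the dictionary behind each namespace.  In recent Python
versions a class exposes a read-only view of its dictionary rather
than the dictionary itself, but the view supports the same lookups.

```python
import math


class Point:
    """A tiny class used only to illustrate namespaces."""
    dimensions = 2

    def __init__(self, x, y):
        self.x = x
        self.y = y


p = Point(3, 4)

# vars() returns the dictionary behind a module, class, or instance.
module_names = vars(math)     # the math module's namespace
class_names = vars(Point)     # the class's namespace
instance_names = vars(p)      # the instance's namespace

print('pi' in module_names)
print('dimensions' in class_names)
print(sorted(instance_names.keys()))
```

The same dictionary operations work on all three objects, which is the
sense in which one idea is reused in several places.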

This simplicity arises from Python's development history.  The
language syntax derives from different sources; ABC, a relatively
obscure teaching language, is one primary influence, and Modula-3 is
another.  (For more information about ABC and Modula-3, consult their
respective Web sites at \url{http://www.cwi.nl/~steven/abc/} and
\url{http://www.m3.org}.)  Other features have come from C, Icon,
Algol-68, and even Perl.  Python hasn't really innovated very much,
but instead has tried to keep the language small and easy to learn,
building on ideas that have been tried in other languages and found
useful.

Simplicity is a virtue that should not be underestimated.  It lets you
learn the language more quickly, and then rapidly write code, code
that often works the first time you run it.

 \subsection{Java Integration}

If you're working with Java, Jython
(\url{http://www.jython.org/}) is definitely worth your
attention.  Jython is a re-implementation of Python in Java that
compiles Python code into Java bytecodes.  The resulting environment
has very tight, almost seamless, integration with Java.  It's trivial
to access Java classes from Python, and you can write Python classes
that subclass Java classes.  Jython can be used for prototyping Java
applications in much the same way CPython is used, and it can also be
used for test suites for Java code, or embedded in a Java application
to add scripting capabilities.  

\section{Arguments and Rebuttals}

Let's say that you've decided upon Python as the best choice for your
application.  How can you convince your management, or your fellow
developers, to use Python?  This section lists some common arguments
against using Python, and provides some possible rebuttals.

\emph{Python is freely available software that doesn't cost anything.
How good can it be?}

Very good, indeed.  These days Linux and Apache, two other pieces of
open source software, are becoming more respected as alternatives to
commercial software, but Python hasn't had all the publicity.

Python has been around for several years, with many users and
developers.  Accordingly, the interpreter has been used by many
people, and has gotten most of the bugs shaken out of it.  While bugs
are still discovered at intervals, they're usually either quite
obscure (they'd have to be, for no one to have run into them before)
or they involve interfaces to external libraries.  The internals of
the language itself are quite stable.

Having the source code should be viewed as making the software
available for peer review; people can examine the code, suggest (and
implement) improvements, and track down bugs.  To find out more about
the idea of open source code, along with arguments and case studies
supporting it, go to \url{http://www.opensource.org}.

\emph{Who's going to support it?}

Python has a sizable community of developers, and the number is still
growing.  The Internet community surrounding the language is an active
one, and is worth being considered another one of Python's advantages.
Most questions posted to the comp.lang.python newsgroup are quickly
answered by someone.

Should you need to dig into the source code, you'll find it's clear
and well-organized, so it's not very difficult to write extensions and
track down bugs yourself.  If you'd prefer to pay for support, there
are companies and individuals who offer commercial support for Python.

\emph{Who uses Python for serious work?}

Lots of people; one interesting thing about Python is the surprising
diversity of applications that it's been used for.  People are using
Python to:

\begin{itemize}
\item Run Web sites
\item Write GUI interfaces
\item Control number-crunching code on supercomputers
\item Make a commercial application scriptable by embedding the Python
interpreter inside it
\item Process large XML data sets
\item Build test suites for C or Java code
\end{itemize}

Whatever your application domain is, there's probably someone who's
used Python for something similar.  Yet, despite being usable for
such high-end applications, Python's still simple enough to use for
little jobs.

See \url{http://www.python.org/psa/Users.html} for a list of some of the 
organizations that use Python.

\emph{What are the restrictions on Python's use?}

They're practically nonexistent.  Consult the \file{Misc/COPYRIGHT}
file in the source distribution, or
\url{http://www.python.org/doc/Copyright.html} for the full language,
but it boils down to three conditions.

\begin{itemize}

\item You have to leave the copyright notice on the software; if you
don't include the source code in a product, you have to put the
copyright notice in the supporting documentation.  

\item Don't claim that the institutions that have developed Python
endorse your product in any way.

\item If something goes wrong, you can't sue for damages.  Practically
all software licences contain this condition.

\end{itemize}

Notice that you don't have to provide source code for anything that
contains Python or is built with it.  Also, the Python interpreter and
accompanying documentation can be modified and redistributed in any
way you like, and you don't have to pay anyone any licensing fees at
all.

\emph{Why should we use an obscure language like Python instead of
well-known language X?}

I hope this HOWTO, and the documents listed in the final section, will
help convince you that Python isn't obscure, and has a healthily
growing user base.  One word of advice: always present Python's
positive advantages, instead of concentrating on language X's
failings.  People want to know why a solution is good, rather than why
all the other solutions are bad.  So instead of attacking a competing
solution on various grounds, simply show how Python's virtues can
help.


\section{Useful Resources}

\begin{definitions}

\term{\url{http://www.fsbassociates.com/books/pythonchpt1.htm}}

The first chapter of \emph{Internet Programming with Python} also
examines some of the reasons for using Python.  The book is well worth
buying, but the publishers have made the first chapter available on
the Web.

\term{\url{http://home.pacbell.net/ouster/scripting.html}}
 
John Ousterhout's white paper on scripting is a good argument for the
utility of scripting languages, though naturally enough, he emphasizes
Tcl, the language he developed.  Most of the arguments would apply to
any scripting language.

\term{\url{http://www.python.org/workshops/1997-10/proceedings/beazley.html}}

The authors, David M. Beazley and Peter S. Lomdahl, 
describe their use of Python at Los Alamos National Laboratory.
It's another good example of how Python can help get real work done.
This quotation from the paper has been echoed by many people:

\begin{quotation}
       Originally developed as a large monolithic application for
       massively parallel processing systems, we have used Python to
       transform our application into a flexible, highly modular, and
       extremely powerful system for performing simulation, data
       analysis, and visualization. In addition, we describe how Python
       has solved a number of important problems related to the
       development, debugging, deployment, and maintenance of scientific
       software.
\end{quotation}

%\term{\url{http://www.pythonjournal.com/volume1/art-interview/}}
 
%This interview with Andy Feit, discussing Infoseek's use of Python, can be
%used to show that choosing Python didn't introduce any difficulties
%into a company's development process, and provided some substantial benefits.

\term{\url{http://www.python.org/psa/Commercial.html}} 

Robin Friedrich wrote this document on how to support Python's use in
commercial projects.

\term{\url{http://www.python.org/workshops/1997-10/proceedings/stein.ps}}

For the 6th Python conference, Greg Stein presented a paper that
traced Python's adoption and usage at a startup called eShop, and
later at Microsoft.

\term{\url{http://www.opensource.org}} 

Management may be doubtful of the reliability and usefulness of
software that wasn't written commercially.  This site presents
arguments that show how open source software can have considerable
advantages over closed-source software.

\term{\url{http://sunsite.unc.edu/LDP/HOWTO/mini/Advocacy.html}}

The Linux Advocacy mini-HOWTO was the inspiration for this document,
and is also well worth reading for general suggestions on winning
acceptance for a new technology, such as Linux or Python.  In general,
you won't make much progress by simply attacking existing systems and
complaining about their inadequacies; this often ends up looking like
unfocused whining.  It's much better to point out some of the many
areas where Python is an improvement over other systems.  

\end{definitions}

\end{document}



--- NEW FILE: curses.tex ---
\documentclass{howto}

\title{Curses Programming with Python}

\release{2.01}

\author{A.M. Kuchling, Eric S. Raymond}
\authoraddress{\email{amk at amk.ca}, \email{esr at thyrsus.com}}

\begin{document}
\maketitle

\begin{abstract}
\noindent
This document describes how to write text-mode programs with Python 2.x,
using the \module{curses} extension module to control the display.   

This document is available from the Python HOWTO page at
\url{http://www.python.org/doc/howto}.
\end{abstract}

\tableofcontents

\section{What is curses?}

The curses library supplies a terminal-independent screen-painting and
keyboard-handling facility for text-based terminals; such terminals
include VT100s, the Linux console, and the simulated terminal provided
by X11 programs such as xterm and rxvt.  Display terminals support
various control codes to perform common operations such as moving the
cursor, scrolling the screen, and erasing areas.  Different terminals
use widely differing codes, and often have their own minor quirks.

In a world of X displays, one might ask ``why bother''?  It's true
that character-cell display terminals are an obsolete technology, but
there are niches in which being able to do fancy things with them is
still valuable.  One is on small-footprint or embedded Unixes that
don't carry an X server.  Another is for tools like OS installers
and kernel configurators that may have to run before X is available.

The curses library hides all the details of different terminals, and
provides the programmer with an abstraction of a display, containing
multiple non-overlapping windows.  The contents of a window can be
changed in various ways--adding text, erasing it, changing its
appearance--and the curses library will automagically figure out what
control codes need to be sent to the terminal to produce the right
output.

The curses library was originally written for BSD Unix; the later System V
versions of Unix from AT\&T added many enhancements and new functions.
BSD curses is no longer maintained, having been replaced by ncurses,
which is an open-source implementation of the AT\&T interface.  If you're
using an open-source Unix such as Linux or FreeBSD, your system almost
certainly uses ncurses.  Since most current commercial Unix versions
are based on System V code, all the functions described here will
probably be available.  The older versions of curses carried by some
proprietary Unixes may not support everything, though.

No one has made a Windows port of the curses module.  On a Windows
platform, try the Console module written by Fredrik Lundh.  The
Console module provides cursor-addressable text output, plus full
support for mouse and keyboard input, and is available from
\url{http://effbot.org/efflib/console}.

\subsection{The Python curses module}

The Python module is a fairly simple wrapper over the C functions
provided by curses; if you're already familiar with curses programming
in C, it's really easy to transfer that knowledge to Python.  The
biggest difference is that the Python interface makes things simpler,
by merging different C functions such as \function{addstr},
\function{mvaddstr}, and \function{mvwaddstr} into a single
\method{addstr()} method.  You'll see this covered in more detail
later.

This HOWTO is simply an introduction to writing text-mode programs
with curses and Python. It doesn't attempt to be a complete guide to
the curses API; for that, see the Python library guide's section on
ncurses, and the C manual pages for ncurses.  It will, however, give
you the basic ideas.

\section{Starting and ending a curses application}

Before doing anything, curses must be initialized.  This is done by
calling the \function{initscr()} function, which will determine the
terminal type, send any required setup codes to the terminal, and
create various internal data structures.  If successful,
\function{initscr()} returns a window object representing the entire
screen; this is usually called \code{stdscr}, after the name of the
corresponding C variable.

\begin{verbatim}
import curses
stdscr = curses.initscr()
\end{verbatim}

Usually curses applications turn off automatic echoing of keys to the
screen, in order to be able to read keys and only display them under
certain circumstances.  This requires calling the \function{noecho()}
function.

\begin{verbatim}
curses.noecho()
\end{verbatim}

Applications will also commonly need to react to keys instantly,
without requiring the Enter key to be pressed; this is called cbreak
mode, as opposed to the usual buffered input mode.

\begin{verbatim}
curses.cbreak()
\end{verbatim}

Terminals usually return special keys, such as the cursor keys or
navigation keys such as Page Up and Home, as a multibyte escape
sequence.  While you could write your application to expect such
sequences and process them accordingly, curses can do it for you,
returning a special value such as \constant{curses.KEY_LEFT}.  To get
curses to do the job, you'll have to enable keypad mode.

\begin{verbatim}
stdscr.keypad(1)
\end{verbatim}

Terminating a curses application is much easier than starting one.
You'll need to call 

\begin{verbatim}
curses.nocbreak(); stdscr.keypad(0); curses.echo()
\end{verbatim}

to reverse the curses-friendly terminal settings. Then call the
\function{endwin()} function to restore the terminal to its original
operating mode.

\begin{verbatim}
curses.endwin()
\end{verbatim}

A common problem when debugging a curses application is to get your
terminal messed up when the application dies without restoring the
terminal to its previous state.  In Python this commonly happens when
your code is buggy and raises an uncaught exception.  Keys are no
longer echoed to the screen when you type them, for example, which
makes using the shell difficult.

In Python you can avoid these complications and make debugging much
easier by importing the module \module{curses.wrapper}.  It supplies a
function \function{wrapper} that takes a hook argument.  It does the
initializations described above, and also initializes colors if color
support is present.  It then runs your hook, and then finally
deinitializes appropriately.  The hook is called inside a
\code{try}...\code{except} clause which catches exceptions, performs
curses deinitialization, and
then passes the exception upwards.  Thus, your terminal won't be left
in a funny state on exception.
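
A minimal sketch of this pattern follows.  It assumes that the
function is reachable as \code{curses.wrapper}, as it is in current
Python releases; \function{main()} and \function{run()} are
hypothetical names chosen for this example.

```python
import curses


def main(stdscr):
    # stdscr is the full-screen window that wrapper() creates for us.
    stdscr.addstr(0, 0, "Press any key to exit")
    stdscr.refresh()
    stdscr.getch()          # wait for a keypress


def run():
    # wrapper() sets up curses, calls main(stdscr), and restores the
    # terminal afterwards, even if main() raises an exception.
    curses.wrapper(main)
```

Calling \code{run()} from a script started in a real terminal shows
the message and waits for a key; the terminal settings are restored
whether \function{main()} returns normally or raises an exception.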

\section{Windows and Pads}

Windows are the basic abstraction in curses.  A window object
represents a rectangular area of the screen, and supports various
 methods to display text, erase it, allow the user to input strings,
and so forth.

The \code{stdscr} object returned by the \function{initscr()} function
is a window object that covers the entire screen.  Many programs may
need only this single window, but you might wish to divide the screen
into smaller windows, in order to redraw or clear them separately.
The \function{newwin()} function creates a new window of a given size,
returning the new window object.

\begin{verbatim}
begin_x = 20 ; begin_y = 7
height = 5 ; width = 40
win = curses.newwin(height, width, begin_y, begin_x)
\end{verbatim}

A word about the coordinate system used in curses: coordinates are
always passed in the order \emph{y,x}, and the top-left corner of a
window is coordinate (0,0).  This breaks a common convention for
handling coordinates, where the \emph{x} coordinate usually comes
first.  This is an unfortunate difference from most other computer
applications, but it's been part of curses since it was first written,
and it's too late to change things now.
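
Because of this ordering, it can help to keep the \emph{y} value first
in your own helper functions too.  The sketch below uses a
hypothetical helper, \function{centered_origin()}, that computes where
a window of a given size should begin so that it is centered on the
screen.

```python
def centered_origin(screen_height, screen_width, height, width):
    """Return (begin_y, begin_x) that centers a height x width window."""
    begin_y = max((screen_height - height) // 2, 0)
    begin_x = max((screen_width - width) // 2, 0)
    return begin_y, begin_x


# On a 24-row by 80-column screen, a 5 x 40 window starts at row 9,
# column 20.
begin_y, begin_x = centered_origin(24, 80, 5, 40)
print((begin_y, begin_x))
```

The result can be passed straight to
\code{curses.newwin(height, width, begin_y, begin_x)}.  The screen
size itself is available from \code{stdscr.getmaxyx()}, which also
returns the \emph{y} value first.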

When you call a method to display or erase text, the effect doesn't
immediately show up on the display.  This is because curses was
originally written with slow 300-baud terminal connections in mind;
with these terminals, minimizing the time required to redraw the
screen is very important.  Deferring the update lets curses accumulate
changes to the screen, and display them in the most efficient manner.  For example,
if your program displays some characters in a window, and then clears
the window, there's no need to send the original characters because
they'd never be visible.  

Accordingly, curses requires that you explicitly tell it to redraw
windows, using the \function{refresh()} method of window objects.  In
practice, this doesn't really complicate programming with curses much.
Most programs go into a flurry of activity, and then pause waiting for
a keypress or some other action on the part of the user.  All you have
to do is to be sure that the screen has been redrawn before pausing to
wait for user input, by simply calling \code{stdscr.refresh()} or the
\function{refresh()} method of some other relevant window.
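
The usual sequence might look like the following sketch, in which
\function{show_status()} is a hypothetical helper and \var{win} is
assumed to be any curses window object.

```python
def show_status(win, message):
    """Queue some changes on a window, then make them visible."""
    win.erase()                 # nothing appears on the terminal yet
    win.addstr(0, 0, message)   # still only recorded internally
    win.refresh()               # now the terminal is updated
    return win.getch()          # safe to wait: the screen is current
```

If the \method{refresh()} call were left out, the program would sit
waiting for a key while the user was still looking at the old screen
contents.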

A pad is a special case of a window; it can be larger than the actual
display screen, and only a portion of it is displayed at a time.
Creating a pad simply requires the pad's height and width, while
refreshing a pad requires giving the coordinates of the on-screen
area where a subsection of the pad will be displayed.  

\begin{verbatim}
pad = curses.newpad(100, 100)
#  These loops fill the pad with letters; this is
# explained in the next section
for y in range(0, 100):
    for x in range(0, 100):
        try:
            pad.addch(y, x, ord('a') + (x*x + y*y) % 26)
        except curses.error:
            # Writing to the bottom-right cell raises an error because
            # the cursor cannot advance past the end of the pad.
            pass

#  Displays a section of the pad in the middle of the screen
pad.refresh(0, 0, 5, 5, 20, 75)
\end{verbatim}

The \function{refresh()} call displays a section of the pad in the
rectangle extending from coordinate (5,5) to coordinate (20,75) on the
screen; the upper left corner of the displayed section is coordinate
(0,0) on the pad.  Beyond that difference, pads are exactly like
ordinary windows and support the same methods.

If you have multiple windows and pads on screen there is a more
efficient way to go, which will prevent annoying screen flicker at
refresh time.  Use the \method{noutrefresh()} method of each window
to update the data structure
representing the desired state of the screen; then change the physical
screen to match the desired state in one go with the function
\function{doupdate()}.  The normal \method{refresh()} method calls
\function{doupdate()} as its last act.
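
As a rough sketch, a helper along the following lines could update
several windows at once.  \function{refresh_all()} is a hypothetical
name, and its \var{doupdate} parameter exists only so that the example
can be exercised without a terminal; ordinary code would simply call
\function{curses.doupdate()}.  The helper handles plain windows only,
since a pad's \method{noutrefresh()} needs the same coordinate
arguments as its \method{refresh()}.

```python
import curses


def refresh_all(windows, doupdate=curses.doupdate):
    """Update several windows with a single write to the terminal."""
    for win in windows:
        win.noutrefresh()   # mark the window's contents as desired state
    doupdate()              # one physical screen update for everything
```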

\section{Displaying Text}

{}From a C programmer's point of view, curses may sometimes look like
a twisty maze of functions, all subtly different.  For example,
\function{addstr()} displays a string at the current cursor location
in the \code{stdscr} window, while \function{mvaddstr()} moves to a
given y,x coordinate first before displaying the string.
\function{waddstr()} is just like \function{addstr()}, but allows
specifying a window to use, instead of using \code{stdscr} by default.
\function{mvwaddstr()} follows similarly.

Fortunately the Python interface hides all these details;
\code{stdscr} is a window object like any other, and methods like
\function{addstr()} accept multiple argument forms.  Usually there are
four different forms.

\begin{tableii}{|c|l|}{textrm}{Form}{Description}
\lineii{\var{str} or \var{ch}}{Display the string \var{str} or
character \var{ch}}
\lineii{\var{str} or \var{ch}, \var{attr}}{Display the string \var{str} or
character \var{ch}, using attribute \var{attr}}
\lineii{\var{y}, \var{x}, \var{str} or \var{ch}}
{Move to position \var{y,x} within the window, and display \var{str}
or \var{ch}}
\lineii{\var{y}, \var{x}, \var{str} or \var{ch}, \var{attr}}
{Move to position \var{y,x} within the window, and display \var{str}
or \var{ch}, using attribute \var{attr}}
\end{tableii}

Attributes allow displaying text in highlighted forms, such as in
boldface, underline, reverse video, or in color.  They'll be explained
in more detail in the next subsection.

The \function{addstr()} function takes a Python string as the value to
be displayed, while the \function{addch()} functions take a character,
which can be either a Python string of length 1, or an integer.  If
it's a string, you're limited to displaying characters between 0 and
255.  SVr4 curses provides constants for extension characters; these
constants are integers greater than 255.  For example,
\constant{ACS_PLMINUS} is a +/- symbol, and \constant{ACS_ULCORNER} is
the upper left corner of a box (handy for drawing borders).

Windows remember where the cursor was left after the last operation,
so if you leave out the \var{y,x} coordinates, the string or character
will be displayed wherever the last operation left off.  You can also
move the cursor with the \function{move(\var{y,x})} method.  Because
some terminals always display a flashing cursor, you may want to
ensure that the cursor is positioned in some location where it won't
be distracting; it can be confusing to have the cursor blinking at
some apparently random location.  

If your application doesn't need a blinking cursor at all, you can
call \function{curs_set(0)} to make it invisible.  Equivalently, and
for compatibility with older curses versions, there's a
\function{leaveok(\var{bool})} function.  When \var{bool} is true, the
curses library will attempt to suppress the flashing cursor, and you
won't need to worry about leaving it in odd locations.

\subsection{Attributes and Color}

Characters can be displayed in different ways.  Status lines in a
text-based application are commonly shown in reverse video; a text
viewer may need to highlight certain words.  curses supports this by
allowing you to specify an attribute for each cell on the screen.

An attribute is an integer, each bit representing a different
attribute.  You can try to display text with multiple attribute bits
set, but curses doesn't guarantee that all the possible combinations
are available, or that they're all visually distinct.  That depends on
the ability of the terminal being used, so it's safest to stick to the
most commonly available attributes, listed here.

\begin{tableii}{|c|l|}{constant}{Attribute}{Description}
\lineii{A_BLINK}{Blinking text}
\lineii{A_BOLD}{Extra bright or bold text}
\lineii{A_DIM}{Half bright text}
\lineii{A_REVERSE}{Reverse-video text}
\lineii{A_STANDOUT}{The best highlighting mode available}
\lineii{A_UNDERLINE}{Underlined text}
\end{tableii}
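
Because each attribute is a bit in an integer, attributes are combined
with bitwise OR and tested with bitwise AND.  The following is only a
small sketch; the variable names are made up for illustration, and it
touches nothing but the module-level constants, so it needs no terminal.

```python
import curses

# Combine two attributes into one integer with bitwise OR.
status_attr = curses.A_REVERSE | curses.A_BOLD

# Test for individual attribute bits with bitwise AND.
has_bold = bool(status_attr & curses.A_BOLD)
has_blink = bool(status_attr & curses.A_BLINK)
```

Whether the terminal actually renders the combination is a separate
question, as noted above.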

So, to display a reverse-video status line on the top line of the
screen,
you could code:

\begin{verbatim}
stdscr.addstr(0, 0, "Current mode: Typing mode",
              curses.A_REVERSE)
stdscr.refresh()
\end{verbatim}

The curses library also supports color on those terminals that
provide it.  The most common such terminal is probably the Linux
console, followed by color xterms.

To use color, you must call the \function{start_color()} function
soon after calling \function{initscr()}, to initialize the default
color set (the \function{curses.wrapper.wrapper()} function does this
automatically).  Once that's done, the \function{has_colors()}
function returns TRUE if the terminal in use can actually display
color.  (Note from AMK:  curses uses the American spelling
'color', instead of the Canadian/British spelling 'colour'.  If you're
like me, you'll have to resign yourself to misspelling it for the sake
of these functions.)

The curses library maintains a finite number of color pairs,
containing a foreground (or text) color and a background color.  You
can get the attribute value corresponding to a color pair with the
\function{color_pair()} function; this can be bitwise-OR'ed with other
attributes such as \constant{A_REVERSE}, but again, such combinations
are not guaranteed to work on all terminals.

An example, which displays a line of text using color pair 1:

\begin{verbatim}
stdscr.addstr( "Pretty text", curses.color_pair(1) )
stdscr.refresh()
\end{verbatim}

As I said before, a color pair consists of a foreground and
background color.  \function{start_color()} initializes 8 basic
colors when it activates color mode.  They are: 0:black, 1:red,
2:green, 3:yellow, 4:blue, 5:magenta, 6:cyan, and 7:white.  The curses
module defines named constants for each of these colors:
\constant{curses.COLOR_BLACK}, \constant{curses.COLOR_RED}, and so
forth.

The \function{init_pair(\var{n, f, b})} function changes the
definition of color pair \var{n}, to foreground color \var{f} and
background color \var{b}.  Color pair 0 is hard-wired to white on black,
and cannot be changed.

Let's put all this together. To change color pair 1 to red
text on a white background, you would call:

\begin{verbatim}
curses.init_pair(1, curses.COLOR_RED, curses.COLOR_WHITE)
\end{verbatim}

When you change a color pair, any text already displayed using that
color pair will change to the new colors.  You can also display new
text in this color with:

\begin{verbatim}
stdscr.addstr(0,0, "RED ALERT!", curses.color_pair(1) )
\end{verbatim}

Very fancy terminals can change the definitions of the actual colors
to a given RGB value.  This lets you change color 1, which is usually
red, to purple or blue or any other color you like.  Unfortunately,
the Linux console doesn't support this, so I'm unable to try it out,
and can't provide any examples.  You can check if your terminal can do
this by calling \function{can_change_color()}, which returns TRUE if
the capability is there.  If you're lucky enough to have such a
talented terminal, consult your system's man pages for more
information.

\section{User Input}

The curses library itself offers only very simple input mechanisms.
Python's support adds a text-input widget that makes up some of the
lack.

The most common way to get input to a window is to use its
\method{getch()} method, which pauses and waits for the user to hit
a key, displaying it if \function{echo()} has been called earlier.
You can optionally specify a coordinate to which the cursor should be
moved before pausing.

It's possible to change this behavior with the method
\method{nodelay()}. After \method{nodelay(1)}, \method{getch()} for
the window becomes non-blocking and returns ERR (-1) when no input is
ready.  There's also a \function{halfdelay()} function, which can be
used to (in effect) set a timer on each \method{getch()}; if no input
becomes available within the delay, given in tenths of a second as the
argument to \function{halfdelay()}, curses raises an exception.

The \method{getch()} method returns an integer; if it's between 0 and
255, it represents the ASCII code of the key pressed.  Values greater
than 255 are special keys such as Page Up, Home, or the cursor keys.
You can compare the value returned to constants such as
\constant{curses.KEY_PPAGE}, \constant{curses.KEY_HOME}, or
\constant{curses.KEY_LEFT}.  Usually the main loop of your program
will look something like this:

\begin{verbatim}
while 1:
    c = stdscr.getch()
    if c == ord('p'): PrintDocument()
    elif c == ord('q'): break  # Exit the while()
    elif c == curses.KEY_HOME: x = y = 0
\end{verbatim}
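
Since \method{getch()} returns a plain integer, the decision about
what a key means can be factored into an ordinary function and tried
without a terminal.  This is only a sketch; \code{describe_key} is a
hypothetical helper, not part of the \module{curses} module.

```python
import curses

def describe_key(c):
    """Classify an integer as returned by getch() (hypothetical helper)."""
    if c == -1:
        return "no input"              # what a non-blocking getch() reports
    if c == curses.KEY_HOME:
        return "home"
    if c == curses.KEY_PPAGE:
        return "page up"
    if 0 <= c <= 255:
        return "character %r" % chr(c)
    return "special key %d" % c        # some other value above 255

quit_key = describe_key(ord('q'))
home_key = describe_key(curses.KEY_HOME)
```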

The \module{curses.ascii} module supplies ASCII class membership
functions that take either integer or 1-character-string
arguments; these may be useful in writing more readable tests for
your command interpreters.  It also supplies conversion functions 
that take either integer or 1-character-string arguments and return
the same type.  For example, \function{curses.ascii.ctrl()} returns
the control character corresponding to its argument.
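
A few of these functions in action, as a brief sketch (the variable
names are only for illustration).  Note how \function{ctrl()} returns
a string when given a string and an integer when given an integer.

```python
import curses.ascii

# Conversion functions return the same type they were given.
ctrl_a_str = curses.ascii.ctrl("a")        # 1-character string in, string out
ctrl_a_int = curses.ascii.ctrl(ord("a"))   # integer in, integer out

# Class membership tests accept either form as well.
seven_is_digit = curses.ascii.isdigit("7")
x_is_digit = curses.ascii.isdigit(ord("x"))
ctrl_a_is_control = curses.ascii.isctrl(ctrl_a_int)
```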

There's also a method to retrieve an entire string,
\method{getstr()}.  It isn't used very often, because its
functionality is quite limited; the only editing keys available are
the backspace key and the Enter key, which terminates the string.  It
can optionally be limited to a fixed number of characters.

\begin{verbatim}
curses.echo()            # Enable echoing of characters

# Get a 15-character string, with the cursor on the top line 
s = stdscr.getstr(0,0, 15)  
\end{verbatim}

The Python \module{curses.textpad} module supplies something better.
With it, you can turn a window into a text box that supports an
Emacs-like set of keybindings.  Various methods of the \class{Textbox}
class support editing with input validation and gathering the edit
results either with or without trailing spaces.   See the library
documentation on \module{curses.textpad} for the details.

\section{For More Information}

This HOWTO didn't cover some advanced topics, such as screen-scraping
or capturing mouse events from an xterm instance.  But the Python
library page for the curses modules is now pretty complete.  You
should browse it next.

If you're in doubt about the detailed behavior of any of the ncurses
entry points, consult the manual pages for your curses implementation,
whether it's ncurses or a proprietary Unix vendor's.  The manual pages
will document any quirks, and provide complete lists of all the
functions, attributes, and \constant{ACS_*} characters available to
you.

Because the curses API is so large, some functions aren't supported in
the Python interface, not because they're difficult to implement, but
because no one has needed them yet.  Feel free to add them and then
submit a patch.  Also, we don't yet have support for the menus or
panels libraries associated with ncurses; feel free to add that.

If you write an interesting little program, feel free to contribute it
as another demo.  We can always use more of them!

The ncurses FAQ: \url{http://dickey.his.com/ncurses/ncurses.faq.html}

\end{document}

--- NEW FILE: doanddont.tex ---
\documentclass{howto}

\title{Idioms and Anti-Idioms in Python}

\release{0.00}

\author{Moshe Zadka}
\authoraddress{howto at zadka.site.co.il}

\begin{document}
\maketitle

This document is placed in the public domain.

\begin{abstract}
\noindent
This document can be considered a companion to the tutorial. It
shows how to use Python, and even more importantly, how {\em not}
to use Python. 
\end{abstract}

\tableofcontents

\section{Language Constructs You Should Not Use}

While Python has relatively few gotchas compared to other languages, it
still has some constructs which are only useful in corner cases, or are
plain dangerous. 

\subsection{from module import *}

\subsubsection{Inside Function Definitions}

\code{from module import *} is {\em invalid} inside function definitions.
While many versions of Python do not check for the invalidity, that does not
make it more valid, any more than having a smart lawyer makes a man innocent.
Do not use it like that ever. Even in versions where it was accepted, it made
the function execution slower, because the compiler could not be certain
which names are local and which are global. In Python 2.1 this construct
causes warnings, and sometimes even errors.

\subsubsection{At Module Level}

While it is valid to use \code{from module import *} at module level it
is usually a bad idea. For one, this loses an important property Python
otherwise has --- you can know where each toplevel name is defined by
a simple "search" function in your favourite editor. You also open yourself
to trouble in the future, if some module grows additional functions or
classes. 

One of the most awful questions asked on the newsgroup is why this code:

\begin{verbatim}
f = open("www")
f.read()
\end{verbatim}

does not work. Of course, it works just fine (assuming you have a file
called "www".) But it does not work if somewhere in the module, the
statement \code{from os import *} is present. The \module{os} module
has a function called \function{open()} which returns an integer. While
it is very useful, shadowing builtins is one of its least useful properties.

Remember, you can never know for sure what names a module exports, so either
take what you need --- \code{from module import name1, name2}, or keep them in
the module and access on a per-need basis --- 
\code{import module;print module.name}.
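
The shadowing described above can be observed safely by running the
star import inside a scratch dictionary instead of the current module.
This is a sketch under assumptions: \code{namespace} is an illustrative
name, and \function{exec} is called as a function, as in Python 3.

```python
import os

# Run the star import in a throwaway namespace, not in this module.
namespace = {}
exec("from os import *", namespace)

# The name "open" now refers to os.open (which returns an integer file
# descriptor), not to the builtin open that returns a file object.
shadowed = namespace["open"] is os.open
still_builtin = namespace["open"] is open
```

Here \code{shadowed} comes out true and \code{still_builtin} false,
which is exactly the surprise the newsgroup question runs into.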

\subsubsection{When It Is Just Fine}

There are situations in which \code{from module import *} is just fine:

\begin{itemize}

\item The interactive prompt. For example, \code{from math import *} makes
      Python an amazing scientific calculator.

\item When extending a module in C with a module in Python.

\item When the module advertises itself as \code{from import *} safe.

\end{itemize}

\subsection{Unadorned \keyword{exec}, \function{execfile} and friends}

The word ``unadorned'' refers to the use without an explicit dictionary,
in which case those constructs evaluate code in the {\em current} environment.
This is dangerous for the same reasons \code{from import *} is dangerous ---
it might step over variables you are counting on and mess up things for
the rest of your code. Simply do not do that.

Bad examples:

\begin{verbatim}
>>> for name in sys.argv[1:]:
>>>     exec "%s=1" % name
>>> def func(s, **kw):
>>>     for var, val in kw.items():
>>>         exec "s.%s=val" % var  # invalid!
>>> execfile("handler.py")
>>> handle()
\end{verbatim}

Good examples:

\begin{verbatim}
>>> d = {}
>>> for name in sys.argv[1:]:
>>>     d[name] = 1
>>> def func(s, **kw):
>>>     for var, val in kw.items():
>>>         setattr(s, var, val)
>>> d={}
>>> execfile("handle.py", d, d)
>>> handle = d['handle']
>>> handle()
\end{verbatim}

\subsection{from module import name1, name2}

This is a ``don't'' which is much weaker than the previous ``don't''s
but is still something you should not do if you don't have good reasons
to do that. The reason it is usually a bad idea is because you suddenly
have an object which lives in two separate namespaces. When the binding
in one namespace changes, the binding in the other will not, so there
will be a discrepancy between them. This happens when, for example,
one module is reloaded, or changes the definition of a function at runtime. 

Bad example:

\begin{verbatim}
# foo.py
a = 1

# bar.py
from foo import a
if something():
    a = 2 # danger: foo.a != a 
\end{verbatim}

Good example:

\begin{verbatim}
# foo.py
a = 1

# bar.py
import foo
if something():
    foo.a = 2
\end{verbatim}
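
The discrepancy can be reproduced in a single file.  In this sketch a
module object built with \code{types.ModuleType} stands in for a real
\file{foo.py}, and a plain assignment stands in for
\code{from foo import a}; both are simplifications for illustration.

```python
import types

# Stand-in for a real foo.py containing "a = 1".
foo = types.ModuleType("foo")
foo.a = 1

# "from foo import a" copies the current binding into this namespace.
a = foo.a

# Later, something rebinds the name inside the module ...
foo.a = 2

# ... and the two namespaces now disagree.
out_of_sync = (a != foo.a)
```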

\subsection{except:}

Python has the \code{except:} clause, which catches all exceptions.
Since {\em every} error in Python raises an exception, this makes many
programming errors look like runtime problems, and hinders
the debugging process.

The following code shows a great example:

\begin{verbatim}
try:
    foo = opne("file") # misspelled "open"
except:
    sys.exit("could not open file!")
\end{verbatim}

The second line triggers a \exception{NameError} which is caught by the
except clause. The program will exit, and you will have no idea that
this has nothing to do with the readability of \code{"file"}.

The example above is better written

\begin{verbatim}
try:
    foo = opne("file") # will be changed to "open" as soon as we run it
except IOError:
    sys.exit("could not open file")
\end{verbatim}

There are some situations in which the \code{except:} clause is useful:
for example, in a framework when running callbacks, it is good not to
let any callback disturb the framework.
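
A minimal sketch of that framework case follows.  The names
\code{run_callbacks}, \code{good} and \code{bad} are hypothetical, and
the sketch catches \exception{Exception} rather than using a bare
\code{except:}, so that requests to exit the interpreter still get
through; treat that choice as an assumption, not as part of the
original advice.

```python
import traceback

def run_callbacks(callbacks):
    """Call every callback; report failures without letting them escape."""
    failures = 0
    for callback in callbacks:
        try:
            callback()
        except Exception:
            # Log the traceback so the bug stays visible, then carry on.
            traceback.print_exc()
            failures += 1
    return failures

seen = []

def good():
    seen.append("good")

def bad():
    raise ValueError("boom")

failed = run_callbacks([good, bad, good])
```

The traceback is still printed, so the programming error is not
silently disguised as a runtime problem.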

\section{Exceptions}

Exceptions are a useful feature of Python. You should learn to raise
them whenever something unexpected occurs, and catch them only where
you can do something about them.

The following is a very popular anti-idiom

\begin{verbatim}
def get_status(file):
    if not os.path.exists(file):
        print "file not found"
        sys.exit(1)
    return open(file).readline()
\end{verbatim}

Consider the case where the file gets deleted between the time the call to 
\function{os.path.exists} is made and the time \function{open} is called.
That means the last line will throw an \exception{IOError}. The same would
happen if \var{file} exists but has no read permission. Since testing this
on a normal machine with existing and non-existing files makes it seem
bugless, the test results will look fine, and the code will get
shipped. Then an unhandled \exception{IOError} escapes to the user, who
has to watch the ugly traceback.

Here is a better way to do it.

\begin{verbatim}
def get_status(file):
    try:
        return open(file).readline()
    except (IOError, OSError):
        print "file not found"
        sys.exit(1)
\end{verbatim}

In this version, {\em either} the file gets opened and the line is read
(so it works even on flaky NFS or SMB connections), or the message
is printed and the application aborted.

Still, \function{get_status} makes too many assumptions --- that it
will only be used in a short running script, and not, say, in a long
running server. Sure, the caller could do something like

\begin{verbatim}
try:
    status = get_status(log)
except SystemExit:
    status = None
\end{verbatim}

So, try to have as few \code{except} clauses in your code as possible ---
those will usually be a catch-all in the \function{main}, or inside calls
which should always succeed.

So, the best version is probably

\begin{verbatim}
def get_status(file):
    return open(file).readline()
\end{verbatim}

The caller can deal with the exception if it wants (for example, if it 
tries several files in a loop), or just let the exception filter upwards
to {\em its} caller.

The last version is not very good either --- due to implementation details,
the file would not be closed when an exception is raised until the handler
finishes, and perhaps not at all in non-C implementations (e.g., Jython).

\begin{verbatim}
def get_status(file):
    fp = open(file)
    try:
        return fp.readline()
    finally:
        fp.close()
\end{verbatim}
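
With \function{get_status} reduced to this form, the caller decides
what a failure means.  The sketch below shows the ``tries several
files in a loop'' case mentioned earlier; \code{first_status} and the
file names are hypothetical, and a temporary directory is used only to
make the example self-contained.

```python
import os
import shutil
import tempfile

def get_status(path):
    fp = open(path)
    try:
        return fp.readline()
    finally:
        fp.close()

def first_status(paths):
    """Return the first line of the first readable file, or None."""
    for path in paths:
        try:
            return get_status(path)
        except (IOError, OSError):
            continue            # this candidate failed; try the next one
    return None

tmpdir = tempfile.mkdtemp()
good = os.path.join(tmpdir, "status.log")
missing = os.path.join(tmpdir, "missing.log")
fp = open(good, "w")
fp.write("all systems go\n")
fp.close()

result = first_status([missing, good])
nothing = first_status([missing])
shutil.rmtree(tmpdir)
```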

\section{Using the Batteries}

Every so often, people seem to be writing stuff in the Python library
again, usually poorly. While the occasional module has a poor interface,
it is usually much better to use the rich standard library and data
types that come with Python than to invent your own.

A useful module very few people know about is \module{os.path}. It 
always has the correct path arithmetic for your operating system, and
will usually be much better than whatever you come up with yourself.

Compare:

\begin{verbatim}
# ugh!
return dir+"/"+file
# better
return os.path.join(dir, file)
\end{verbatim}

More useful functions in \module{os.path}: \function{basename}, 
\function{dirname} and \function{splitext}.
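
A short sketch of those three functions, using a made-up path built
with \function{os.path.join} so that it is valid on whatever platform
runs it:

```python
import os.path

# Build the path portably instead of gluing strings together with "/".
path = os.path.join("logs", "2005", "report.txt")

directory = os.path.dirname(path)             # everything before the last part
filename = os.path.basename(path)             # just the last part
stem, extension = os.path.splitext(filename)  # split off the extension
```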

There are also many useful builtin functions people seem not to be
aware of for some reason: \function{min()} and \function{max()} can
find the minimum/maximum of any sequence with comparable semantics,
for example, yet many people write their own max/min. Another highly
useful function is \function{reduce()}. Classical use of \function{reduce()}
is something like

\begin{verbatim}
import sys, operator
nums = map(float, sys.argv[1:])
print reduce(operator.add, nums)/len(nums)
\end{verbatim}

This cute little script prints the average of all numbers given on the
command line. The \function{reduce()} adds up all the numbers, and
the rest is just some pre- and postprocessing.

Along the same lines, note that \function{float()}, \function{int()} and
\function{long()} all accept arguments of type string, and so are
suited to parsing --- assuming you are ready to deal with the
\exception{ValueError} they raise.
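
For instance, a parser that keeps going past bad input might look like
this sketch (\code{parse_numbers} is a hypothetical helper):

```python
def parse_numbers(words):
    """Split words into parsed floats and the strings that failed to parse."""
    numbers = []
    rejected = []
    for word in words:
        try:
            numbers.append(float(word))
        except ValueError:
            # float() refuses anything that is not a valid number.
            rejected.append(word)
    return numbers, rejected

numbers, rejected = parse_numbers(["3", "4.5", "abc", "1e2"])
```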

\section{Using Backslash to Continue Statements}

Since Python treats a newline as a statement terminator,
and since statements are often more than is comfortable to put
in one line, many people do:

\begin{verbatim}
if foo.bar()['first'][0] == baz.quux(1, 2)[5:9] and \
   calculate_number(10, 20) != forbulate(500, 360):
      pass
\end{verbatim}

You should realize that this is dangerous: a stray space after the
\code{\\} would make this line wrong, and stray spaces are notoriously
hard to see in editors. In this case, at least it would be a syntax
error, but if the code was:

\begin{verbatim}
value = foo.bar()['first'][0]*baz.quux(1, 2)[5:9] \
        + calculate_number(10, 20)*forbulate(500, 360)
\end{verbatim}

then it would just be subtly wrong.

It is usually much better to use the implicit continuation inside
parentheses.  This version is bulletproof:

\begin{verbatim}
value = (foo.bar()['first'][0]*baz.quux(1, 2)[5:9] 
        + calculate_number(10, 20)*forbulate(500, 360))
\end{verbatim}

\end{document}

--- NEW FILE: regex.tex ---
\documentclass{howto}

% TODO:
% Document lookbehind assertions
% Better way of displaying a RE, a string, and what it matches
% Mention optional argument to match.groups()
% Unicode (at least a reference)

\title{Regular Expression HOWTO}

\release{0.05}

\author{A.M. Kuchling}
\authoraddress{\email{amk at amk.ca}}

\begin{document}
\maketitle

\begin{abstract}
[...1427 lines suppressed...]
% $

\section{Feedback}

Regular expressions are a complicated topic.  Did this document help
you understand them?  Were there parts that were unclear, or problems
you encountered that weren't covered here?  If so, please send
suggestions for improvements to the author.

The most complete book on regular expressions is almost certainly
Jeffrey Friedl's \citetitle{Mastering Regular Expressions}, published
by O'Reilly.  Unfortunately, it exclusively concentrates on Perl and
Java's flavours of regular expressions, and doesn't contain any Python
material at all, so it won't be useful as a reference for programming
in Python.  (The first edition covered Python's now-obsolete
\module{regex} module, which won't help you much.)  Consider checking
it out from your library.

\end{document}


--- NEW FILE: rexec.tex ---
\documentclass{howto}

\title{Restricted Execution HOWTO}

\release{2.1}

\author{A.M. Kuchling}
\authoraddress{\email{amk at amk.ca}}

\begin{document}

\maketitle

\begin{abstract}
\noindent

Python 2.2.2 and earlier provided a \module{rexec} module for running
untrusted code.  However, it's never been exhaustively audited for
security and it hasn't been updated to take into account recent
changes to Python such as new-style classes. Therefore, the
\module{rexec} module should not be trusted.  To discourage use of 
\module{rexec}, this HOWTO has been withdrawn.

The \module{rexec} and \module{Bastion} modules have been disabled in
the Python CVS tree, both on the trunk (which will eventually become
Python 2.3alpha2 and later 2.3final) and on the release22-maint branch
(which will become Python 2.2.3, if someone ever volunteers to issue
2.2.3).

For discussion of the problems with \module{rexec}, see the python-dev
threads starting at the following URLs:
\url{http://mail.python.org/pipermail/python-dev/2002-December/031160.html},
and
\url{http://mail.python.org/pipermail/python-dev/2003-January/031848.html}.

\end{abstract}


\section{Version History}

Sep. 12, 1998: Minor revisions and added the reference to the Janus
project.

Feb. 26, 1998: First version.  Suggestions are welcome.

Mar. 16, 1998: Made some revisions suggested by Jeff Rush.  Some minor
changes and clarifications, and a sizable section on exceptions added.

Oct. 4, 2000: Checked with Python 2.0.  Minor rewrites and fixes made.
Version number increased to 2.0.

Dec. 17, 2002: Withdrawn.

Jan. 8, 2003: Mention that \module{rexec} will be disabled in Python 2.3,
and added links to relevant python-dev threads.

\end{document}





--- NEW FILE: sockets.tex ---
\documentclass{howto}

\title{Socket Programming HOWTO}

\release{0.00}

\author{Gordon McMillan}
\authoraddress{\email{gmcm at hypernet.com}}

\begin{document}
\maketitle

\begin{abstract}
\noindent
Sockets are used nearly everywhere, but are one of the most severely
misunderstood technologies around. This is a 10,000 foot overview of
sockets. It's not really a tutorial - you'll still have work to do in
getting things operational. It doesn't cover the fine points (and there
are a lot of them), but I hope it will give you enough background to
begin using them decently.

This document is available from the Python HOWTO page at
\url{http://www.python.org/doc/howto}.

\end{abstract}

\tableofcontents

\section{Sockets}

Sockets are used nearly everywhere, but are one of the most severely
misunderstood technologies around. This is a 10,000 foot overview of
sockets. It's not really a tutorial - you'll still have work to do in
getting things working. It doesn't cover the fine points (and there
are a lot of them), but I hope it will give you enough background to
begin using them decently.

I'm only going to talk about INET sockets, but they account for at
least 99\% of the sockets in use. And I'll only talk about STREAM
sockets - unless you really know what you're doing (in which case this
HOWTO isn't for you!), you'll get better behavior and performance from
a STREAM socket than anything else. I will try to clear up the mystery
of what a socket is, as well as some hints on how to work with
blocking and non-blocking sockets. But I'll start by talking about
blocking sockets. You'll need to know how they work before dealing
with non-blocking sockets.

Part of the trouble with understanding these things is that "socket"
can mean a number of subtly different things, depending on context. So
first, let's make a distinction between a "client" socket - an
endpoint of a conversation, and a "server" socket, which is more like
a switchboard operator. The client application (your browser, for
example) uses "client" sockets exclusively; the web server it's
talking to uses both "server" sockets and "client" sockets.


\subsection{History}

Of the various forms of IPC (\emph{Inter Process Communication}),
sockets are by far the most popular.  On any given platform, there are
likely to be other forms of IPC that are faster, but for
cross-platform communication, sockets are about the only game in town.

They were invented in Berkeley as part of the BSD flavor of Unix. They
spread like wildfire with the Internet. With good reason --- the
combination of sockets with INET makes talking to arbitrary machines
around the world unbelievably easy (at least compared to other
schemes).  

\section{Creating a Socket}

Roughly speaking, when you clicked on the link that brought you to
this page, your browser did something like the following:

\begin{verbatim}
    #create an INET, STREAMing socket
    s = socket.socket(
        socket.AF_INET, socket.SOCK_STREAM)
    #now connect to the web server on port 80 
    # - the normal http port
    s.connect(("www.mcmillan-inc.com", 80))
\end{verbatim}

When the \code{connect} completes, the socket \code{s} can
now be used to send in a request for the text of this page. The same
socket will read the reply, and then be destroyed. That's right -
destroyed. Client sockets are normally only used for one exchange (or
a small set of sequential exchanges).

What happens in the web server is a bit more complex. First, the web
server creates a "server socket".

\begin{verbatim}
    #create an INET, STREAMing socket
    serversocket = socket.socket(
        socket.AF_INET, socket.SOCK_STREAM)
    #bind the socket to a public host, 
    # and a well-known port
    serversocket.bind((socket.gethostname(), 80))
    #become a server socket
    serversocket.listen(5)
\end{verbatim}

A couple things to notice: we used \code{socket.gethostname()}
so that the socket would be visible to the outside world. If we had
used \code{s.bind(('localhost', 80))} or \code{s.bind(('127.0.0.1',
80))} we would still have a "server" socket, but one that was only
visible within the same machine.  \code{s.bind(('', 80))}, by contrast,
makes the socket reachable on any address the machine happens to have.

A second thing to note: low number ports are usually reserved for
"well known" services (HTTP, SNMP etc). If you're playing around, use
a nice high number (4 digits).

Finally, the argument to \code{listen} tells the socket library that
we want it to queue up as many as 5 connect requests (the normal max)
before refusing outside connections. If the rest of the code is
written properly, that should be plenty.

OK, now we have a "server" socket, listening on port 80. Now we enter
the mainloop of the web server:

\begin{verbatim}
    while 1:
        #accept connections from outside
        (clientsocket, address) = serversocket.accept()
        #now do something with the clientsocket
        #in this case, we'll pretend this is a threaded server
        ct = client_thread(clientsocket)
        ct.run()
\end{verbatim}

There are actually 3 general ways in which this loop could work -
dispatching a thread to handle \code{clientsocket}, creating a new
process to handle \code{clientsocket}, or restructuring this app
to use non-blocking sockets, and multiplexing between our "server" socket
and any active \code{clientsocket}s using
\code{select}. More about that later. The important thing to
understand now is this: this is \emph{all} a "server" socket
does. It doesn't send any data. It doesn't receive any data. It just
produces "client" sockets. Each \code{clientsocket} is created
in response to some \emph{other} "client" socket doing a
\code{connect()} to the host and port we're bound to. As soon as
we've created that \code{clientsocket}, we go back to listening
for more connections. The two "clients" are free to chat it up - they
are using some dynamically allocated port which will be recycled when
the conversation ends.
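
The thread-per-connection variant might be sketched as follows.
Everything here is an assumption made for illustration:
\code{handle_client} and \code{serve} are hypothetical names, the
server binds to \code{127.0.0.1} on a port chosen by the operating
system so the sketch can run unprivileged, it stops after one client
instead of looping forever, and it is written for Python 3.

```python
import socket
import threading

def handle_client(clientsocket):
    """Hypothetical handler: read one request, reply in upper case, close."""
    try:
        data = clientsocket.recv(1024)
        if data:
            clientsocket.sendall(data.upper())
    finally:
        clientsocket.close()

def serve(serversocket, max_clients):
    """Accept max_clients connections, handing each to its own thread."""
    threads = []
    for _ in range(max_clients):
        clientsocket, address = serversocket.accept()
        t = threading.Thread(target=handle_client, args=(clientsocket,))
        t.start()       # start(), not run(): run() would stay in this thread
        threads.append(t)
    for t in threads:
        t.join()

serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
serversocket.bind(("127.0.0.1", 0))     # port 0: let the OS pick a free port
serversocket.listen(5)
port = serversocket.getsockname()[1]

server_thread = threading.Thread(target=serve, args=(serversocket, 1))
server_thread.start()

# The "client" side of the conversation.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))
client.sendall(b"hello")
chunks = []
while True:
    chunk = client.recv(1024)
    if not chunk:       # the handler closed its end of the connection
        break
    chunks.append(chunk)
reply = b"".join(chunks)

client.close()
server_thread.join()
serversocket.close()
```

The server socket itself never sends or receives data; only the
socket returned by \code{accept()} does.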

\subsection{IPC} If you need fast IPC between two processes
on one machine, you should look into whatever form of shared memory
the platform offers. A simple protocol based around shared memory and
locks or semaphores is by far the fastest technique.

If you do decide to use sockets, bind the "server" socket to
\code{'localhost'}. On most platforms, this will take a shortcut
around a couple of layers of network code and be quite a bit faster.


\section{Using a Socket}

The first thing to note, is that the web browser's "client" socket and
the web server's "client" socket are identical beasts. That is, this
is a "peer to peer" conversation. Or to put it another way, \emph{as the
designer, you will have to decide what the rules of etiquette are for
a conversation}. Normally, the \code{connect}ing socket
starts the conversation, by sending in a request, or perhaps a
signon. But that's a design decision - it's not a rule of sockets.

Now there are two sets of verbs to use for communication. You can use
\code{send} and \code{recv}, or you can transform your
client socket into a file-like beast and use \code{read} and
\code{write}. The latter is the way Java presents their
sockets. I'm not going to talk about it here, except to warn you that
you need to use \code{flush} on sockets. These are buffered
"files", and a common mistake is to \code{write} something, and
then \code{read} for a reply. Without a \code{flush} in
there, you may wait forever for the reply, because the request may
still be in your output buffer.
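
The effect of \code{flush} can be seen with a connected pair of
sockets in one process.  This is a sketch for Python 3 that assumes
\code{socket.socketpair()} is available on the platform; the variable
names are illustrative.

```python
import socket

# Two already-connected sockets, standing in for client and server.
left, right = socket.socketpair()

writer = left.makefile("wb")    # buffered file-like view for writing
reader = right.makefile("rb")   # buffered file-like view for reading

writer.write(b"PING\n")         # so far this only sits in writer's buffer
writer.flush()                  # now the bytes are handed to the socket
request = reader.readline()

writer.close()
reader.close()
left.close()
right.close()
```

Leaving out the \code{flush} call would make \code{readline} wait for
bytes that never left the writer's buffer.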

Now we come to the major stumbling block of sockets - \code{send}
and \code{recv} operate on the network buffers. They do not
necessarily handle all the bytes you hand them (or expect from them),
because their major focus is handling the network buffers. In general,
they return when the associated network buffers have been filled
(\code{send}) or emptied (\code{recv}). They then tell you
how many bytes they handled. It is \emph{your} responsibility to call
them again until your message has been completely dealt with.

When a \code{recv} returns 0 bytes, it means the other side has
closed (or is in the process of closing) the connection.  You will not
receive any more data on this connection. Ever.  You may be able to
send data successfully; I'll talk about that some on the next page.

A protocol like HTTP uses a socket for only one transfer. The client
sends a request, then reads a reply.  That's it. The socket is
discarded. This means that a client can detect the end of the reply by
receiving 0 bytes.
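
As a rough illustration of detecting the end of a reply this way, the following sketch reads until \code{recv} returns nothing. It assumes Python 3 and a local \code{socket.socketpair()}; the helper name \code{read_until_closed} is made up for this example.

```python
import socket

def read_until_closed(sock, bufsize=4096):
    """Collect data until recv() returns an empty result (peer closed)."""
    chunks = []
    while True:
        chunk = sock.recv(bufsize)
        if not chunk:          # 0 bytes: the other side is done
            break
        chunks.append(chunk)
    return b''.join(chunks)

a, b = socket.socketpair()
a.sendall(b'the whole reply')
a.close()                      # closing marks the end of the transfer
reply = read_until_closed(b)
b.close()
```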

But if you plan to reuse your socket for further transfers, you need
to realize that \emph{there is no "EOT" (End of Transfer) on a
socket.} I repeat: if a socket \code{send} or
\code{recv} returns after handling 0 bytes, the connection has
been broken.  If the connection has \emph{not} been broken, you may
wait on a \code{recv} forever, because the socket will
\emph{not} tell you that there's nothing more to read (for now).  Now
if you think about that a bit, you'll come to realize a fundamental
truth of sockets: \emph{messages must either be fixed length} (yuck),
\emph{or be delimited} (shrug), \emph{or indicate how long they are}
(much better), \emph{or end by shutting down the connection}. The
choice is entirely yours, (but some ways are righter than others).

Assuming you don't want to end the connection, the simplest solution
is a fixed length message:

\begin{verbatim}
    import socket

    # MSGLEN is the fixed message length both ends agree on
    # (the value used here is arbitrary)
    MSGLEN = 1024

    class mysocket:
        '''demonstration class only 
          - coded for clarity, not efficiency'''
        def __init__(self, sock=None):
            if sock is None:
                self.sock = socket.socket(
                    socket.AF_INET, socket.SOCK_STREAM)
            else:
                self.sock = sock
        def connect(self, host, port):
            self.sock.connect((host, port))
        def mysend(self, msg):
            totalsent = 0
            while totalsent < MSGLEN:
                # send may accept only part of what is left
                sent = self.sock.send(msg[totalsent:])
                if sent == 0:
                    raise RuntimeError(
                        "socket connection broken")
                totalsent = totalsent + sent
        def myreceive(self):
            msg = ''
            while len(msg) < MSGLEN:
                # never ask for more than the rest of this message
                chunk = self.sock.recv(MSGLEN-len(msg))
                if chunk == '':
                    raise RuntimeError(
                        "socket connection broken")
                msg = msg + chunk
            return msg
\end{verbatim}

The sending code here is usable for almost any messaging scheme - in
Python you send strings, and you can use \code{len()} to
determine a string's length (even if it has embedded \code{\e 0}
characters). It's mostly the receiving code that gets more
complex. (And in C, it's not much worse, except you can't use
\code{strlen} if the message has embedded \code{\e 0}s.)

The easiest enhancement is to make the first character of the message
an indicator of message type, and have the type determine the
length. Now you have two \code{recv}s - the first to get (at
least) that first character so you can look up the length, and the
second in a loop to get the rest. If you decide to go the delimited
route, you'll be receiving in some arbitrary chunk size, (4096 or 8192
is frequently a good match for network buffer sizes), and scanning
what you've received for a delimiter.

One complication to be aware of: if your conversational protocol
allows multiple messages to be sent back to back (without some kind of
reply), and you pass \code{recv} an arbitrary chunk size, you
may end up reading the start of a following message. You'll need to
put that aside and hold onto it, until it's needed.
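
One possible shape for the delimited route, including the held-over bytes, is sketched below. It assumes Python 3, a newline as the delimiter and a local \code{socket.socketpair()} for the demonstration; the class name \code{DelimitedReceiver} is hypothetical.

```python
import socket

class DelimitedReceiver:
    """Hypothetical helper: returns one delimited message per call."""
    def __init__(self, sock, delimiter=b'\n', bufsize=4096):
        self.sock = sock
        self.delimiter = delimiter
        self.bufsize = bufsize
        self.pending = b''       # bytes read but not yet handed out

    def receive(self):
        # Keep reading until the buffer holds one complete message.
        while self.delimiter not in self.pending:
            chunk = self.sock.recv(self.bufsize)
            if not chunk:
                raise RuntimeError("socket connection broken")
            self.pending += chunk
        # Hand out the first message, keep the rest for the next call.
        msg, _, self.pending = self.pending.partition(self.delimiter)
        return msg

a, b = socket.socketpair()
a.sendall(b'first\nsecond\n')    # two messages sent back to back
receiver = DelimitedReceiver(b)
first = receiver.receive()
second = receiver.receive()      # may be served from the held-over bytes
a.close()
b.close()
```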

Prefixing the message with its length (say, as 5 numeric characters)
gets more complex, because (believe it or not), you may not get all 5
characters in one \code{recv}. In playing around, you'll get
away with it; but in high network loads, your code will very quickly
break unless you use two \code{recv} loops - the first to
determine the length, the second to get the data part of the
message. Nasty. This is also when you'll discover that
\code{send} does not always manage to get rid of everything in
one pass. And despite having read this, you will eventually get bit by
it!
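
The two \code{recv} loops might look roughly like this. It is a sketch, not a finished protocol: it assumes Python 3, a prefix of five ASCII digits as in the text, and helper names (\code{recv_exactly}, \code{send_msg}, \code{recv_msg}) invented for the example.

```python
import socket

PREFIX_LEN = 5   # five ASCII digits, so payloads up to 99999 bytes

def recv_exactly(sock, count):
    """Loop on recv() until exactly count bytes have arrived."""
    data = b''
    while len(data) < count:
        chunk = sock.recv(count - len(data))
        if not chunk:
            raise RuntimeError("socket connection broken")
        data += chunk
    return data

def send_msg(sock, payload):
    # Zero-padded decimal length, e.g. b'00011' for an 11-byte payload.
    header = ('%0*d' % (PREFIX_LEN, len(payload))).encode('ascii')
    sock.sendall(header + payload)

def recv_msg(sock):
    # First loop: the length. Second loop: the data part.
    length = int(recv_exactly(sock, PREFIX_LEN).decode('ascii'))
    return recv_exactly(sock, length)

a, b = socket.socketpair()
send_msg(a, b'hello world')
msg = recv_msg(b)
a.close()
b.close()
```

Here \code{sendall} stands in for the sending loop shown earlier; it keeps calling \code{send} until everything has been handed to the network buffers.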

In the interests of space, building your character, (and preserving my
competitive position), these enhancements are left as an exercise for
the reader. Let's move on to cleaning up.

\subsection{Binary Data}

It is perfectly possible to send binary data over a socket. The major
problem is that not all machines use the same formats for binary
data. For example, a Motorola chip will represent a 16 bit integer
with the value 1 as the two hex bytes 00 01. Intel and DEC, however,
are byte-reversed - that same 1 is 01 00. Socket libraries have calls
for converting 16 and 32 bit integers - \code{ntohl, htonl, ntohs,
htons} where "n" means \emph{network} and "h" means \emph{host},
"s" means \emph{short} and "l" means \emph{long}. Where network order
is host order, these do nothing, but where the machine is
byte-reversed, these swap the bytes around appropriately.
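
A quick way to see the two byte orders from Python is sketched here. It assumes Python 3 and uses the standard \code{struct} module, which is not mentioned above, to build the bytes explicitly.

```python
import socket
import struct

# htons/ntohs convert a 16-bit value between host and network order;
# applying one after the other gives back the original value.
round_trip = socket.ntohs(socket.htons(1))

# struct can build the network-order (big-endian) bytes directly:
# '!' means network order, 'H' is a 16-bit and 'L' a 32-bit unsigned integer.
short_bytes = struct.pack('!H', 1)
long_bytes = struct.pack('!L', 1)

# '<' forces the byte-reversed (little-endian) layout for comparison.
reversed_bytes = struct.pack('<H', 1)
```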

In these days of 32 bit machines, the ascii representation of binary
data is frequently smaller than the binary representation. That's
because a surprising amount of the time, all those longs have the
value 0, or maybe 1. The string "0" would be two bytes, while binary
is four. Of course, this doesn't fit well with fixed-length
messages. Decisions, decisions.

\section{Disconnecting}

Strictly speaking, you're supposed to use \code{shutdown} on a
socket before you \code{close} it.  The \code{shutdown} is
an advisory to the socket at the other end.  Depending on the argument
you pass it, it can mean "I'm not going to send anymore, but I'll
still listen", or "I'm not listening, good riddance!".  Most socket
libraries, however, are so used to programmers neglecting to use this
piece of etiquette that normally a \code{close} is the same as
\code{shutdown(); close()}.  So in most situations, an explicit
\code{shutdown} is not needed.

One way to use \code{shutdown} effectively is in an HTTP-like
exchange. The client sends a request and then does a
\code{shutdown(1)}. This tells the server "This client is done
sending, but can still receive."  The server can detect "EOF" by a
receive of 0 bytes. It can assume it has the complete request.  The
server sends a reply. If the \code{send} completes successfully
then, indeed, the client was still receiving.
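
The exchange can be tried out locally along these lines. The sketch assumes Python 3 and \code{socket.socketpair()}, with both ends living in one process, which only works because the messages are small enough to sit in the network buffers; \code{socket.SHUT_WR} is the named constant for \code{shutdown(1)}.

```python
import socket

def read_all(sock):
    """Read until the peer stops sending (recv returns 0 bytes)."""
    data = b''
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            return data
        data += chunk

client, server = socket.socketpair()

client.sendall(b'GET /something')
client.shutdown(socket.SHUT_WR)      # done sending, still able to receive

request = read_all(server)           # the server sees "EOF" after the request
server.sendall(b'reply to ' + request)
server.close()

reply = read_all(client)             # the client was indeed still receiving
client.close()
```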

Python takes the automatic shutdown a step further, and says that when a socket is garbage collected, it will automatically do a \code{close} if it's needed. But relying on this is a very bad habit. If your socket just disappears without doing a \code{close}, the socket at the other end may hang indefinitely, thinking you're just being slow. \emph{Please} \code{close} your sockets when you're done.


\subsection{When Sockets Die}

Probably the worst thing about using blocking sockets is what happens
when the other side comes down hard (without doing a
\code{close}). Your socket is likely to hang. \code{SOCK_STREAM} is a
reliable protocol, and it will wait a long, long time before giving up
on a connection. If you're using threads, the entire thread is
essentially dead. There's not much you can do about it. As long as you
aren't doing something dumb, like holding a lock while doing a
blocking read, the thread isn't really consuming much in the way of
resources. Do \emph{not} try to kill the thread - part of the reason
that threads are more efficient than processes is that they avoid the
overhead associated with the automatic recycling of resources. In
other words, if you do manage to kill the thread, your whole process
is likely to be screwed up.  

\section{Non-blocking Sockets}

If you've understood the preceding, you already know most of what you
need to know about the mechanics of using sockets. You'll still use
the same calls, in much the same ways. It's just that, if you do it
right, your app will be almost inside-out.

In Python, you use \code{socket.setblocking(0)} to make it
non-blocking. In C, it's more complex, (for one thing, you'll need to
choose between the Posix flavor \code{O_NONBLOCK} and the almost
indistinguishable older flavor \code{O_NDELAY}, which is
completely different from \code{TCP_NODELAY}), but it's the
exact same idea. You do this after creating the socket, but before
using it. (Actually, if you're nuts, you can switch back and forth.)

The major mechanical difference is that \code{send},
\code{recv}, \code{connect} and \code{accept} can
return without having done anything. You have (of course) a number of
choices. You can check return code and error codes and generally drive
yourself crazy. If you don't believe me, try it sometime. Your app
will grow large, buggy and suck CPU. So let's skip the brain-dead
solutions and do it right.

Use \code{select}.

In C, coding \code{select} is fairly complex. In Python, it's a
piece of cake, but it's close enough to the C version that if you
understand \code{select} in Python, you'll have little trouble
with it in C.

\begin{verbatim}
    ready_to_read, ready_to_write, in_error = \
                   select.select(
                      potential_readers, 
                      potential_writers, 
                      potential_errs, 
                      timeout)
\end{verbatim}

You pass \code{select} three lists: the first contains all
sockets that you might want to try reading; the second all the sockets
you might want to try writing to, and the last (normally left empty)
those that you want to check for errors.  You should note that a
socket can go into more than one list. The \code{select} call is
blocking, but you can give it a timeout. This is generally a sensible
thing to do - give it a nice long timeout (say a minute) unless you
have good reason to do otherwise.

In return, you will get three lists. They have the sockets that are
actually readable, writable and in error. Each of these lists is a
subset (possibly empty) of the corresponding list you passed in. A
socket that you put in more than one input list can come back in more
than one output list, for instance as both readable and writable.

If a socket is in the output readable list, you can be
as-close-to-certain-as-we-ever-get-in-this-business that a
\code{recv} on that socket will return \emph{something}. Same
idea for the writable list. You'll be able to send
\emph{something}. Maybe not all you want to, but \emph{something} is
better than nothing. (Actually, any reasonably healthy socket will
return as writable - it just means outbound network buffer space is
available.)
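
A small experiment along these lines shows what the returned lists look like. It is a sketch assuming Python 3 and a local \code{socket.socketpair()}; a timeout of 0 is used for the first call so that it merely polls.

```python
import select
import socket

a, b = socket.socketpair()
a.setblocking(False)
b.setblocking(False)

# Nothing has been sent yet: no socket is readable, both are writable.
readable, writable, in_error = select.select([a, b], [a, b], [], 0)
idle_readable = list(readable)
idle_writable = list(writable)

a.send(b'ping')
# Only ask about readers now, and wait up to a second for the data.
readable, _, _ = select.select([a, b], [], [], 1.0)
data = b.recv(4096) if b in readable else None

a.close()
b.close()
```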

If you have a "server" socket, put it in the \code{potential_readers}
list. If it comes out in the readable list, your \code{accept}
will (almost certainly) work. If you have created a new socket to
\code{connect} to someone else, put it in the \code{potential_writers}
list. If it shows up in the writable list, you have a decent chance
that it has connected.

One very nasty problem with \code{select}: if somewhere in those
input lists of sockets is one which has died a nasty death, the
\code{select} will fail. You then need to loop through every
single damn socket in all those lists and do a
\code{select([sock],[],[],0)} until you find the bad one. That
timeout of 0 means it won't take long, but it's ugly.
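
The hunt for the bad socket could be written roughly as follows. This is a sketch assuming Python 3, where a socket that has already been closed makes \code{select} raise \code{ValueError} and other failures show up as \code{OSError}; the function name \code{find_bad_sockets} is invented for the example.

```python
import select
import socket

def find_bad_sockets(socks):
    """Return the sockets that select() refuses to handle."""
    bad = []
    for sock in socks:
        try:
            select.select([sock], [], [], 0)   # timeout 0: just a poll
        except (OSError, ValueError):
            # A closed or otherwise invalid descriptor ends up here.
            bad.append(sock)
    return bad

a, b = socket.socketpair()
b.close()                      # simulate a socket that has gone away
bad = find_bad_sockets([a, b])
a.close()
```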

Actually, \code{select} can be handy even with blocking sockets.
It's one way of determining whether you will block - the socket
returns as readable when there's something in the buffers.  However,
this still doesn't help with the problem of determining whether the
other end is done, or just busy with something else.

\textbf{Portability alert}: On Unix, \code{select} works with both
sockets and files. Don't try this on Windows. On Windows,
\code{select} works with sockets only. Also note that in C, many
of the more advanced socket options are done differently on
Windows. In fact, on Windows I usually use threads (which work very,
very well) with my sockets. Face it, if you want any kind of
performance, your code will look very different on Windows than on
Unix. (I haven't the foggiest how you do this stuff on a Mac.)

\subsection{Performance}

There's no question that the fastest sockets code uses non-blocking
sockets and select to multiplex them. You can put together something
that will saturate a LAN connection without putting any strain on the
CPU. The trouble is that an app written this way can't do much of
anything else - it needs to be ready to shuffle bytes around at all
times.

Assuming that your app is actually supposed to do something more than
that, threading is the optimal solution, (and using non-blocking
sockets will be faster than using blocking sockets). Unfortunately,
threading support in Unixes varies both in API and quality. So the
normal Unix solution is to fork a subprocess to deal with each
connection. The overhead for this is significant (and don't do this on
Windows - the overhead of process creation is enormous there). It also
means that unless each subprocess is completely independent, you'll
need to use another form of IPC, say a pipe, or shared memory and
semaphores, to communicate between the parent and child processes.

Finally, remember that even though blocking sockets are somewhat
slower than non-blocking, in many cases they are the "right"
solution. After all, if your app is driven by the data it receives
over a socket, there's not much sense in complicating the logic just
so your app can wait on \code{select} instead of
\code{recv}.

\end{document}

--- NEW FILE: sorting.tex ---
\documentclass{howto}

\title{Sorting Mini-HOWTO}

% Increment the release number whenever significant changes are made.
% The author and/or editor can define 'significant' however they like.
\release{0.01}

\author{Andrew Dalke}
\authoraddress{\email{dalke at bioreason.com}}

\begin{document}
\maketitle

\begin{abstract}
\noindent
This document is a little tutorial
showing a half dozen ways to sort a list with the built-in
\method{sort()} method.  

This document is available from the Python HOWTO page at
\url{http://www.python.org/doc/howto}.
\end{abstract}

\tableofcontents

Python lists have a built-in \method{sort()} method.  There are many
ways to use it to sort a list and there doesn't appear to be a single,
central place in the various manuals describing them, so I'll do so
here.

\section{Sorting basic data types}

A simple ascending sort is easy; just call the \method{sort()} method of a list.

\begin{verbatim}
>>> a = [5, 2, 3, 1, 4]
>>> a.sort()
>>> print a
[1, 2, 3, 4, 5]
\end{verbatim}

Sort takes an optional function which can be called for doing the
comparisons.  The default sort routine is equivalent to

\begin{verbatim}
>>> a = [5, 2, 3, 1, 4]
>>> a.sort(cmp)
>>> print a
[1, 2, 3, 4, 5]
\end{verbatim}

where \function{cmp} is the built-in function which compares two objects, \code{x} and
\code{y}, and returns -1, 0 or 1 depending on whether $x<y$, $x==y$, or $x>y$.  During
the course of the sort the relationships must stay the same for the
final list to make sense.

If you want, you can define your own function for the comparison.  For 
integers (and numbers in general) we can do:

\begin{verbatim}
>>> def numeric_compare(x, y):
...     return x-y
...
>>> a = [5, 2, 3, 1, 4]
>>> a.sort(numeric_compare)
>>> print a
[1, 2, 3, 4, 5]
\end{verbatim}

By the way, this function won't work if the result of the subtraction
is out of range, as in \code{sys.maxint - (-1)}.

Or, if you don't want to define a new named function you can create an
anonymous one using \keyword{lambda}, as in:

\begin{verbatim}
>>> a = [5, 2, 3, 1, 4]
>>> a.sort(lambda x, y: x-y)
>>> print a
[1, 2, 3, 4, 5]
\end{verbatim}

If you want the numbers sorted in reverse you can do

\begin{verbatim}
>>> a = [5, 2, 3, 1, 4]
>>> def reverse_numeric(x, y):
...     return y-x
...
>>> a.sort(reverse_numeric)
>>> print a
[5, 4, 3, 2, 1]
\end{verbatim}

(a more general implementation could return \code{cmp(y,x)} or \code{-cmp(x,y)}).
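
For readers on a current Python 3 interpreter, where the built-in \function{cmp} is gone and \method{sort()} no longer takes a comparison function directly, the same idea can be sketched as below. The stand-in \code{cmp} and the use of \code{functools.cmp_to_key} are assumptions of this example, not part of the original text.

```python
from functools import cmp_to_key

def cmp(x, y):
    """Stand-in for the Python 2 built-in: returns -1, 0 or 1."""
    return (x > y) - (x < y)

def reverse_cmp(x, y):
    # Swapping the arguments reverses the ordering for any comparable type.
    return cmp(y, x)

a = [5, 2, 3, 1, 4]
# cmp_to_key wraps an old-style comparison function into a key function.
a.sort(key=cmp_to_key(reverse_cmp))

words = ['pear', 'apple', 'fig']
words.sort(key=cmp_to_key(reverse_cmp))
```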

However, it's faster if Python doesn't have to call a function for
every comparison, so if you want a reverse-sorted list of basic data
types, do the forward sort first, then use the \method{reverse()} method.

\begin{verbatim}
>>> a = [5, 2, 3, 1, 4]
>>> a.sort()
>>> a.reverse()
>>> print a
[5, 4, 3, 2, 1]
\end{verbatim}

Here's a case-insensitive string comparison using a \keyword{lambda} function:

\begin{verbatim}
>>> import string
>>> a = string.split("This is a test string from Andrew.")
>>> a.sort(lambda x, y: cmp(string.lower(x), string.lower(y)))
>>> print a
['a', 'Andrew.', 'from', 'is', 'string', 'test', 'This']
\end{verbatim}

This goes through the overhead of converting a word to lower case
every time it must be compared.  At times it may be faster to compute
these once and use those values, and the following example shows how.

\begin{verbatim}
>>> words = string.split("This is a test string from Andrew.")
>>> offsets = []
>>> for i in range(len(words)):
...     offsets.append( (string.lower(words[i]), i) )
...
>>> offsets.sort()
>>> new_words = []
>>> for dontcare, i in offsets:
...     new_words.append(words[i])
...
>>> print new_words
['a', 'Andrew.', 'from', 'is', 'string', 'test', 'This']
\end{verbatim}

The \code{offsets} list is filled with tuples, each holding a lower-case string
and its position in the \code{words} list.  It is then sorted.  Python's
sort method sorts tuples by comparing terms; given \code{x} and \code{y}, compare
\code{x[0]} to \code{y[0]}, then \code{x[1]} to \code{y[1]}, etc. until there is a difference.

The result is that the \code{offsets} list is ordered by its first
term, and the second term can be used to figure out where the original
data was stored.  (The \code{for} loop assigns \code{dontcare} and
\code{i} to the two fields of each term in the list, but we only need the
index value.)

Another way to implement this is to store the original data as the
second term in the \code{offsets} list, as in:

\begin{verbatim}
>>> words = string.split("This is a test string from Andrew.")
>>> offsets = []
>>> for word in words:
...     offsets.append( (string.lower(word), word) )
...
>>> offsets.sort()
>>> new_words = []
>>> for word in offsets:
...     new_words.append(word[1])
...
>>> print new_words
['a', 'Andrew.', 'from', 'is', 'string', 'test', 'This']
\end{verbatim}

This isn't always appropriate because the second terms in the list
(the word, in this example) will be compared when the first terms are
the same.  If this happens many times, then there will be the unneeded
performance hit of comparing the two objects.  This can be a large
cost if most terms are the same and the objects define their own
\method{__cmp__} method, but there will still be some overhead to determine if
\method{__cmp__} is defined.

Still, for large lists, or for lists where the comparison information
is expensive to calculate, the last two examples are likely to be the
fastest way to sort a list.  It will not work on weakly sorted data,
like complex numbers, but if you don't know what that means, you
probably don't need to worry about it.
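
The intermediate-list technique can be packaged as a small function, sketched here for a current Python 3 interpreter (string methods instead of the \module{string} module). The function name is made up for the example. Since Python 2.4 the \code{key} argument of \method{sort()} and \function{sorted()} performs the same decoration internally, which the last line shows for comparison.

```python
def sort_case_insensitive(words):
    """Decorate with (lowercase, index), sort, then undecorate."""
    decorated = [(word.lower(), i) for i, word in enumerate(words)]
    decorated.sort()
    # The index says where the original word was stored.
    return [words[i] for _, i in decorated]

words = "This is a test string from Andrew.".split()
by_hand = sort_case_insensitive(words)

# The key function is called once per item, not once per comparison.
by_key = sorted(words, key=str.lower)
```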

\section{Comparing classes}

The comparison for two basic data types, like ints to ints or string to
string, is built into Python and makes sense.  There is a default way
to compare class instances, but the default manner isn't usually very
useful.  You can define your own comparison with the \method{__cmp__} method,
as in:

\begin{verbatim}
>>> class Spam:
...     def __init__(self, spam, eggs):
...         self.spam = spam
...         self.eggs = eggs
...     def __cmp__(self, other):
...         return cmp(self.spam+self.eggs, other.spam+other.eggs)
...     def __str__(self):
...         return str(self.spam + self.eggs)
...
>>> a = [Spam(1, 4), Spam(9, 3), Spam(4,6)]
>>> a.sort()
>>> for spam in a:
...     print str(spam)
...
5
10
12
\end{verbatim}

Sometimes you may want to sort by a specific attribute of a class.  If
appropriate you should just define the \method{__cmp__} method to compare
those values, but you cannot do this if you want to compare between
different attributes at different times.  Instead, you'll need to go
back to passing a comparison function to sort, as in:

\begin{verbatim}
>>> a = [Spam(1, 4), Spam(9, 3), Spam(4,6)]
>>> a.sort(lambda x, y: cmp(x.eggs, y.eggs))
>>> for spam in a:
...     print spam.eggs, str(spam)
...
3 12
4 5
6 10
\end{verbatim}

If you want to compare two arbitrary attributes (and aren't overly
concerned about performance) you can even define your own comparison
function object.  This uses the ability of a class instance to emulate
a function by defining the \method{__call__} method, as in:

\begin{verbatim}
>>> class CmpAttr:
...     def __init__(self, attr):
...         self.attr = attr
...     def __call__(self, x, y):
...         return cmp(getattr(x, self.attr), getattr(y, self.attr))
...
>>> a = [Spam(1, 4), Spam(9, 3), Spam(4,6)]
>>> a.sort(CmpAttr("spam"))  # sort by the "spam" attribute
>>> for spam in a:
...     print spam.spam, spam.eggs, str(spam)
...
1 4 5
4 6 10
9 3 12

>>> a.sort(CmpAttr("eggs"))   # re-sort by the "eggs" attribute
>>> for spam in a:
...     print spam.spam, spam.eggs, str(spam)
...
9 3 12
1 4 5
4 6 10
\end{verbatim}

Of course, if you want a faster sort you can extract the attributes
into an intermediate list and sort that list.
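
One way this could look is sketched below, written for a current Python 3 interpreter. The helper name \code{sort_by_attr} is hypothetical, and the \code{Spam} class is repeated (without its comparison method) so that the block stands on its own.

```python
class Spam:
    def __init__(self, spam, eggs):
        self.spam = spam
        self.eggs = eggs

def sort_by_attr(seq, attr):
    """Sort objects by one attribute via an intermediate list."""
    # The index breaks ties, so the objects themselves are never compared.
    intermediate = [(getattr(obj, attr), i, obj) for i, obj in enumerate(seq)]
    intermediate.sort()
    return [obj for _, _, obj in intermediate]

a = [Spam(1, 4), Spam(9, 3), Spam(4, 6)]
by_spam = [(s.spam, s.eggs) for s in sort_by_attr(a, "spam")]
by_eggs = [(s.spam, s.eggs) for s in sort_by_attr(a, "eggs")]
```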


So, there you have it; about a half-dozen different ways to define how
to sort a list:
\begin{itemize}
 \item sort using the default method
 \item sort using a comparison function
 \item reverse sort not using a comparison function
 \item sort on an intermediate list (two forms)
 \item sort using a class-defined \method{__cmp__} method
 \item sort using a sort function object
\end{itemize}

\end{document}
% LocalWords:  maxint

--- NEW FILE: unicode.rst ---
Unicode HOWTO
================

**Version 1.02**

This HOWTO discusses Python's support for Unicode, and explains various 
problems that people commonly encounter when trying to work with Unicode.

Introduction to Unicode
------------------------------

History of Character Codes
''''''''''''''''''''''''''''''

In 1968, the American Standard Code for Information Interchange,
better known by its acronym ASCII, was standardized.  ASCII defined
numeric codes for various characters, with the numeric values running from 0 to
127.  For example, the lowercase letter 'a' is assigned 97 as its code
value.

ASCII was an American-developed standard, so it only defined
unaccented characters.  There was an 'e', but no 'é' or 'Í'.  This
meant that languages which required accented characters couldn't be
faithfully represented in ASCII.  (Actually the missing accents matter
for English, too, which contains words such as 'naïve' and 'café', and some
publications have house styles which require spellings such as
'coöperate'.)

For a while people just wrote programs that didn't display accents.  I
remember looking at Apple ][ BASIC programs, published in French-language
publications in the mid-1980s, that had lines like these::

	PRINT "FICHIER EST COMPLETE."
	PRINT "CARACTERE NON ACCEPTE."

Those messages should contain accents, and they just look wrong to
someone who can read French.  

In the 1980s, almost all personal computers were 8-bit, meaning that
bytes could hold values ranging from 0 to 255.  ASCII codes only went
up to 127, so some machines assigned values between 128 and 255 to
accented characters.  Different machines had different codes, however,
which led to problems exchanging files.  Eventually various commonly
used sets of values for the 128-255 range emerged.  Some were true
standards, defined by the International Organization for Standardization, and
some were **de facto** conventions that were invented by one company
or another and managed to catch on.

255 characters aren't very many.  For example, you can't fit
both the accented characters used in Western Europe and the Cyrillic
alphabet used for Russian into the 128-255 range because there are more than
128 such characters.

You could write files using different codes (all your Russian
files in a coding system called KOI8, all your French files in 
a different coding system called Latin1), but what if you wanted
to write a French document that quotes some Russian text?  In the
1980s people began to want to solve this problem, and the Unicode
standardization effort began.

Unicode started out using 16-bit characters instead of 8-bit characters.  16
bits means you have 2^16 = 65,536 distinct values available, making it
possible to represent many different characters from many different
alphabets; an initial goal was to have Unicode contain the alphabets for
every single human language.  It turns out that even 16 bits isn't enough to
meet that goal, and the modern Unicode specification uses a wider range of
codes, 0-1,114,111 (0x10ffff in base-16).

There's a related ISO standard, ISO 10646.  Unicode and ISO 10646 were
originally separate efforts, but the specifications were merged with
the 1.1 revision of Unicode.  

(This discussion of Unicode's history is highly simplified.  I don't
think the average Python programmer needs to worry about the
historical details; consult the Unicode consortium site listed in the
References for more information.)


Definitions
''''''''''''''''''''''''

A **character** is the smallest possible component of a text.  'A',
'B', 'C', etc., are all different characters.  So are 'È' and
'Í'.  Characters are abstractions, and vary depending on the
language or context you're talking about.  For example, the symbol for
ohms (Ω) is usually drawn much like the capital letter
omega (Ω) in the Greek alphabet (they may even be the same in
some fonts), but these are two different characters that have
different meanings.

The Unicode standard describes how characters are represented by
**code points**.  A code point is an integer value, usually denoted in
base 16.  In the standard, a code point is written using the notation
U+12ca to mean the character with value 0x12ca (4810 decimal).  The
Unicode standard contains a lot of tables listing characters and their
corresponding code points::

	0061    'a'; LATIN SMALL LETTER A
	0062    'b'; LATIN SMALL LETTER B
	0063    'c'; LATIN SMALL LETTER C
        ...
	007B	'{'; LEFT CURLY BRACKET

Strictly, these definitions imply that it's meaningless to say 'this is
character U+12ca'.  U+12ca is a code point, which represents some particular
character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'.
In informal contexts, this distinction between code points and characters will
sometimes be forgotten.
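
The relationship between characters, code points and names can be explored interactively. The following is a small sketch for a current Python 3 interpreter, where ``chr()`` accepts any code point (in the Python 2 of this document the equivalent call would be ``unichr()``); it relies on the standard ``unicodedata`` module.

```python
import unicodedata

code_point = ord('a')                 # character -> code point (an integer)
char = chr(0x12ca)                    # code point -> character
name = unicodedata.name(char)         # official name from the Unicode tables
notation = 'U+%04x' % ord(char)       # the U+xxxx notation used above
```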

A character is represented on a screen or on paper by a set of graphical
elements that's called a **glyph**.  The glyph for an uppercase A, for
example, is two diagonal strokes and a horizontal stroke, though the exact
details will depend on the font being used.  Most Python code doesn't need
to worry about glyphs; figuring out the correct glyph to display is
generally the job of a GUI toolkit or a terminal's font renderer.


Encodings
'''''''''

To summarize the previous section: 
a Unicode string is a sequence of code points, which are
numbers from 0 to 0x10ffff.  This sequence needs to be represented as
a set of bytes (meaning, values from 0-255) in memory.  The rules for
translating a Unicode string into a sequence of bytes are called an 
**encoding**.

The first encoding you might think of is an array of 32-bit integers.  
In this representation, the string "Python" would look like this::

       P           y           t           h           o           n
    0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00 
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 

This representation is straightforward but using
it presents a number of problems.

1. It's not portable; different processors order the bytes 
   differently. 

2. It's very wasteful of space.  In most texts, the majority of the code 
   points are less than 127, or less than 255, so a lot of space is occupied
   by zero bytes.  The above string takes 24 bytes compared to the 6
   bytes needed for an ASCII representation.  Increased RAM usage doesn't
   matter too much (desktop computers have megabytes of RAM, and strings
   aren't usually that large), but expanding our usage of disk and
   network bandwidth by a factor of 4 is intolerable.

3. It's not compatible with existing C functions such as ``strlen()``,
   so a new family of wide string functions would need to be used.

4. Many Internet standards are defined in terms of textual data, and 
   can't handle content with embedded zero bytes.

Generally people don't use this encoding, choosing other encodings
that are more efficient and convenient.
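
The 32-bit representation shown above corresponds to what Python calls the ``utf-32-le`` codec, so the space and portability problems can be checked directly. This is a sketch for a current Python 3 interpreter.

```python
data = 'Python'.encode('utf-32-le')   # little-endian, no byte-order mark
size = len(data)                      # four bytes per code point
first_char = data[:4]                 # the bytes for 'P'
zero_count = data.count(0)            # how many of the bytes are zero

# The same string with the other byte order starts differently,
# which is the portability problem from point 1.
other_order = 'Python'.encode('utf-32-be')[:4]
```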

Encodings don't have to handle every possible Unicode character, and
most encodings don't.  For example, Python's default encoding is the
'ascii' encoding.  The rules for converting a Unicode string into the
ASCII encoding are simple; for each code point:

1. If the code point is <128, each byte is the same as the value of the 
   code point.

2. If the code point is 128 or greater, the Unicode string can't 
   be represented in this encoding.  (Python raises  a 
   ``UnicodeEncodeError`` exception in this case.)

Latin-1, also known as ISO-8859-1, is a similar encoding.  Unicode
code points 0-255 are identical to the Latin-1 values, so converting
to this encoding simply requires converting code points to byte
values; if a code point larger than 255 is encountered, the string
can't be encoded into Latin-1.
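
The limits of these two encodings are easy to probe. The sketch below is written for a current Python 3 interpreter, where every ``str`` is a Unicode string; the helper ``try_encode`` is made up for the example.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the text doesn't fit."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

plain = try_encode('abc', 'ascii')                  # all code points < 128
accent_ascii = try_encode('caf\u00e9', 'ascii')     # U+00E9 is too large
accent_latin1 = try_encode('caf\u00e9', 'latin-1')  # fits: 0xE9 <= 255
cyrillic_latin1 = try_encode('\u0416', 'latin-1')   # U+0416 is > 255
```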

Encodings don't have to be simple one-to-one mappings like Latin-1.
Consider IBM's EBCDIC, which was used on IBM mainframes.  Letter
values weren't in one block: 'a' through 'i' had values from 129 to
137, but 'j' through 'r' were 145 through 153.  If you wanted to use
EBCDIC as an encoding, you'd probably use some sort of lookup table to
perform the conversion, but this is largely an internal detail.
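
Python ships codecs for several EBCDIC code pages, so the gaps between the letter blocks can be observed. The sketch assumes a current Python 3 interpreter and picks ``cp037``, which is only one EBCDIC variant among several.

```python
# Encode the first and last letter of each block mentioned above.
# Iterating over bytes in Python 3 yields integer byte values.
values = list('aijr'.encode('cp037'))
```

The list comes out as 129, 137, 145 and 153, matching the ranges quoted in the text.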

UTF-8 is one of the most commonly used encodings.  UTF stands for
"Unicode Transformation Format", and the '8' means that 8-bit numbers
are used in the encoding.  (There's also a UTF-16 encoding, but it's
less frequently used than UTF-8.)  UTF-8 uses the following rules:

1. If the code point is <128, it's represented by the corresponding byte value.
2. If the code point is between 128 and 0x7ff, it's turned into two byte values
   between 128 and 255.
3. Code points >0x7ff are turned into three- or four-byte sequences, where
   each byte of the sequence is between 128 and 255.
    
UTF-8 has several convenient properties:

1. It can handle any Unicode code point.
2. A Unicode string is turned into a string of bytes containing no embedded zero bytes.  This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as ``strcpy()`` and sent through protocols that can't handle zero bytes.
3. A string of ASCII text is also valid UTF-8 text. 
4. UTF-8 is fairly compact; the majority of code points are turned into two bytes, and values less than 128 occupy only a single byte.
5. If bytes are corrupted or lost, it's possible to determine the start of the next UTF-8-encoded code point and resynchronize.  It's also unlikely that random 8-bit data will look like valid UTF-8.
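
The rules and properties above can be checked with a few sample characters. This is a sketch for a current Python 3 interpreter; the sample code points are arbitrary choices, one from each size class.

```python
# One code point from each range: <128, <=0x7ff, <=0xffff, above 0xffff.
samples = ['a', '\u00e9', '\u20ac', '\U00010000']
lengths = [len(ch.encode('utf-8')) for ch in samples]

encoded = 'na\u00efve caf\u00e9'.encode('utf-8')
has_zero_byte = 0 in encoded          # property 2: no embedded zero bytes

# Property 3: ASCII text encodes to the same bytes under both codecs.
ascii_is_utf8 = 'Python'.encode('ascii') == 'Python'.encode('utf-8')

# Every byte of a multi-byte sequence is 128 or greater.
multi = list('\u00e9'.encode('utf-8'))
```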



References
''''''''''''''

The Unicode Consortium site at <http://www.unicode.org> has character
charts, a glossary, and PDF versions of the Unicode specification.  Be
prepared for some difficult reading.
<http://www.unicode.org/history/> is a chronology of the origin and
development of Unicode.

To help understand the standard, Jukka Korpela has written an
introductory guide to reading the Unicode character tables, 
available at <http://www.cs.tut.fi/~jkorpela/unicode/guide.html>.

Roman Czyborra wrote another explanation of Unicode's basic principles; 
it's at <http://czyborra.com/unicode/characters.html>.
Czyborra has written a number of other Unicode-related documents, 
available from <http://www.cyzborra.com>.

Two other good introductory articles were written by Joel Spolsky
<http://www.joelonsoftware.com/articles/Unicode.html> and Jason
Orendorff <http://www.jorendorff.com/articles/unicode/>.  If this
introduction didn't make things clear to you, you should try reading
one of these alternate articles before continuing.

Wikipedia entries are often helpful; see the entries for "character
encoding" <http://en.wikipedia.org/wiki/Character_encoding> and UTF-8
<http://en.wikipedia.org/wiki/UTF-8>, for example.


Python's Unicode Support
------------------------

Now that you've learned the rudiments of Unicode, we can look at
Python's Unicode features.


The Unicode Type
'''''''''''''''''''

Unicode strings are expressed as instances of the ``unicode`` type,
one of Python's repertoire of built-in types.  It derives from an
abstract type called ``basestring``, which is also an ancestor of the
``str`` type; you can therefore check if a value is a string type with
``isinstance(value, basestring)``.  Under the hood, Python represents
Unicode strings as either 16- or 32-bit integers, depending on how the
Python interpreter was compiled.

The ``unicode()`` constructor has the signature ``unicode(string[, encoding, errors])``.
All of its arguments should be 8-bit strings.  The first argument is converted 
to Unicode using the specified encoding; if you leave off the ``encoding`` argument, 
the ASCII encoding is used for the conversion, so characters greater than 127 will 
be treated as errors::

    >>> unicode('abcdef')
    u'abcdef'
    >>> s = unicode('abcdef')
    >>> type(s)
    <type 'unicode'>
    >>> unicode('abcdef' + chr(255))
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6: 
                        ordinal not in range(128)

The ``errors`` argument specifies the response when the input string can't be converted according to the encoding's rules.  Legal values for this argument 
are 'strict' (raise a ``UnicodeDecodeError`` exception), 
'replace' (add U+FFFD, 'REPLACEMENT CHARACTER'), 
or 'ignore' (just leave the character out of the Unicode result).  
The following examples show the differences::

    >>> unicode('\x80abc', errors='strict')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: 
                        ordinal not in range(128)
    >>> unicode('\x80abc', errors='replace')
    u'\ufffdabc'
    >>> unicode('\x80abc', errors='ignore')
    u'abc'

Encodings are specified as strings containing the encoding's name.
Python 2.4 comes with roughly 100 different encodings; see the Python
Library Reference at
<http://docs.python.org/lib/standard-encodings.html> for a list.  Some
encodings have multiple names; for example, 'latin-1', 'iso_8859_1'
and '8859' are all synonyms for the same encoding.
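
One quick way to convince yourself that several names refer to the
same encoding is to decode the same byte with each of them.  This is
just an illustrative sketch (the variable names are arbitrary), written
so that it also runs on later Python versions:

```python
raw = b'\xe9'      # LATIN SMALL LETTER E WITH ACUTE in Latin-1

# Decode the same byte under each of the three names mentioned above.
decoded = [raw.decode(name) for name in ('latin-1', 'iso_8859_1', '8859')]

all_same = decoded == [u'\xe9'] * 3
print(all_same)
```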

One-character Unicode strings can also be created with the
``unichr()`` built-in function, which takes integers and returns a
Unicode string of length 1 that contains the corresponding code point.
The reverse operation is the built-in ``ord()`` function that takes a
one-character Unicode string and returns the code point value::

    >>> unichr(40960)
    u'\ua000'
    >>> ord(u'\ua000')
    40960

Instances of the ``unicode`` type have many of the same methods as 
the 8-bit string type for operations such as searching and formatting::

    >>> s = u'Was ever feather so lightly blown to and fro as this multitude?'
    >>> s.count('e')
    5
    >>> s.find('feather')
    9
    >>> s.find('bird')
    -1
    >>> s.replace('feather', 'sand')
    u'Was ever sand so lightly blown to and fro as this multitude?'
    >>> s.upper()
    u'WAS EVER FEATHER SO LIGHTLY BLOWN TO AND FRO AS THIS MULTITUDE?'

Note that the arguments to these methods can be Unicode strings or 8-bit strings.  
8-bit strings will be converted to Unicode before carrying out the operation;
Python's default ASCII encoding will be used, so characters greater than 127 will cause an exception::

    >>> s.find('Was\x9f')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 3: ordinal not in range(128)
    >>> s.find(u'Was\x9f')
    -1

Much Python code that operates on strings will therefore work with
Unicode strings without requiring any changes to the code.  (Input and
output code needs more updating for Unicode; more on this later.)

Another important method is ``.encode([encoding], [errors='strict'])``, 
which returns an 8-bit string version of the
Unicode string, encoded in the requested encoding.  The ``errors``
parameter is the same as the parameter of the ``unicode()``
constructor, with one additional possibility; as well as 'strict',
'ignore', and 'replace', you can also pass 'xmlcharrefreplace' which
uses XML's character references.  The following example shows the
different results::

    >>> u = unichr(40960) + u'abcd' + unichr(1972)
    >>> u.encode('utf-8')
    '\xea\x80\x80abcd\xde\xb4'
    >>> u.encode('ascii')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeEncodeError: 'ascii' codec can't encode character u'\ua000' in position 0: ordinal not in range(128)
    >>> u.encode('ascii', 'ignore')
    'abcd'
    >>> u.encode('ascii', 'replace')
    '?abcd?'
    >>> u.encode('ascii', 'xmlcharrefreplace')
    '&#40960;abcd&#1972;'

Python's 8-bit strings have a ``.decode([encoding], [errors])`` method 
that interprets the string using the given encoding::

    >>> u = unichr(40960) + u'abcd' + unichr(1972)   # Assemble a string
    >>> utf8_version = u.encode('utf-8')             # Encode as UTF-8
    >>> type(utf8_version), utf8_version
    (<type 'str'>, '\xea\x80\x80abcd\xde\xb4')
    >>> u2 = utf8_version.decode('utf-8')            # Decode using UTF-8
    >>> u == u2                                      # The two strings match
    True
 
The low-level routines for registering and accessing the available
encodings are found in the ``codecs`` module.  However, the encoding
and decoding functions returned by this module are usually more
low-level than is comfortable, so I'm not going to describe the
``codecs`` module here.  If you need to implement a completely new
encoding, you'll need to learn about the ``codecs`` module interfaces,
but implementing encodings is a specialized task that also won't be
covered here.  Consult the Python documentation to learn more about
this module.

The most commonly used part of the ``codecs`` module is the 
``codecs.open()`` function, which will be discussed in the section
on input and output.
            
            
Unicode Literals in Python Source Code
''''''''''''''''''''''''''''''''''''''''''

In Python source code, Unicode literals are written as strings
prefixed with the 'u' or 'U' character: ``u'abcdefghijk'``.  Specific
code points can be written using the ``\u`` escape sequence, which is
followed by four hex digits giving the code point.  The ``\U`` escape
sequence is similar, but expects 8 hex digits, not 4.  

Unicode literals can also use the same escape sequences as 8-bit
strings, including ``\x``, but ``\x`` only takes two hex digits so it
can't express an arbitrary code point.  Octal escapes can go up to
U+01ff, which is octal 777.

::

    >>> s = u"a\xac\u1234\u20ac\U00008000"
               ^^^^ two-digit hex escape
                   ^^^^^^ four-digit Unicode escape 
                               ^^^^^^^^^^ eight-digit Unicode escape
    >>> for c in s:  print ord(c),
    ... 
    97 172 4660 8364 32768

Using escape sequences for code points greater than 127 is fine in
small doses, but becomes an annoyance if you're using many accented
characters, as you would in a program with messages in French or some
other accent-using language.  You can also assemble strings using the
``unichr()`` built-in function, but this is even more tedious.

Ideally, you'd want to be able to write literals in your language's
natural encoding.  You could then edit Python source code with your
favorite editor which would display the accented characters naturally,
and have the right characters used at runtime.

Python supports writing Unicode literals in any encoding, but you have
to declare the encoding being used.  This is done by including a
special comment as either the first or second line of the source
file::

    #!/usr/bin/env python
    # -*- coding: latin-1 -*-
    
    u = u'abcdé'
    print ord(u[-1])
    
The syntax is inspired by Emacs's notation for specifying variables local to a file.
Emacs supports many different variables, but Python only supports 'coding'.  
The ``-*-`` symbols indicate that the comment is special; within them,
you must supply the name ``coding`` and the name of your chosen encoding, 
separated by ``':'``.  

If you don't include such a comment, the default encoding used will be
ASCII.  Versions of Python before 2.4 were Euro-centric and assumed
Latin-1 as a default encoding for string literals; in Python 2.4,
characters greater than 127 still work but result in a warning.  For
example, the following program has no encoding declaration::

    #!/usr/bin/env python
    u = u'abcdé'
    print ord(u[-1])

When you run it with Python 2.4, it will output the following warning::

    amk:~$ python p263.py
    sys:1: DeprecationWarning: Non-ASCII character '\xe9' 
         in file p263.py on line 2, but no encoding declared; 
         see http://www.python.org/peps/pep-0263.html for details
  

Unicode Properties
'''''''''''''''''''

The Unicode specification includes a database of information about
code points.  For each code point that's defined, the information
includes the character's name, its category, and the numeric value if
applicable (Unicode has characters representing the Roman numerals and
fractions such as one-third and four-fifths).  There are also
properties related to the code point's use in bidirectional text and
other display-related properties.

The following program displays some information about several
characters, and prints the numeric value of one particular character::

    import unicodedata
    
    u = unichr(233) + unichr(0x0bf2) + unichr(3972) + unichr(6000) + unichr(13231)
    
    for i, c in enumerate(u):
        print i, '%04x' % ord(c), unicodedata.category(c),
        print unicodedata.name(c)
    
    # Get numeric value of second character
    print unicodedata.numeric(u[1])

When run, this prints::

    0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
    1 0bf2 No TAMIL NUMBER ONE THOUSAND
    2 0f84 Mn TIBETAN MARK HALANTA
    3 1770 Lo TAGBANWA LETTER SA
    4 33af So SQUARE RAD OVER S SQUARED
    1000.0

The category codes are abbreviations describing the nature of the
character.  These are grouped into categories such as "Letter",
"Number", "Punctuation", or "Symbol", which in turn are broken up into
subcategories.  To take the codes from the above output, ``'Ll'``
means "Letter, lowercase", ``'No'`` means "Number, other", ``'Mn'`` is
"Mark, nonspacing", ``'Lo'`` is "Letter, other", and ``'So'`` is
"Symbol, other".  See
<http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values>
for a list of category codes.
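
Because the first letter of each code names the major category, a
program that only cares about the broad class can look at that letter
alone.  The following is a small sketch; the ``MAJOR_CLASSES`` table
and the ``major_class()`` helper are invented for this example and
cover only the classes named above:

```python
import unicodedata

# Hypothetical lookup table: first letter of the category code -> class.
MAJOR_CLASSES = {'L': 'Letter', 'M': 'Mark', 'N': 'Number',
                 'P': 'Punctuation', 'S': 'Symbol'}

def major_class(ch):
    """Return the broad class of a one-character Unicode string."""
    code = unicodedata.category(ch)
    return MAJOR_CLASSES.get(code[0], '(not listed)')

chars = [u'\xe9', u'\u0bf2', u'\u0f84', u'\u33af', u'?']
names = [major_class(ch) for ch in chars]
print(names)
```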

References
''''''''''''''

The Unicode and 8-bit string types are described in the Python library
reference at <http://docs.python.org/lib/typesseq.html>.

The documentation for the ``unicodedata`` module is at 
<http://docs.python.org/lib/module-unicodedata.html>.

The documentation for the ``codecs`` module is at
<http://docs.python.org/lib/module-codecs.html>.

Marc-André Lemburg gave a presentation at EuroPython 2002
titled "Python and Unicode".  A PDF version of his slides
is available at <http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf>,
and is an excellent overview of the design of Python's Unicode features.


Reading and Writing Unicode Data
----------------------------------------

Once you've written some code that works with Unicode data, the next
problem is input/output.  How do you get Unicode strings into your
program, and how do you convert Unicode into a form suitable for
storage or transmission?  

It's possible that you may not need to do anything depending on your
input sources and output destinations; you should check whether the
libraries used in your application support Unicode natively.  XML
parsers often return Unicode data, for example.  Many relational
databases also support Unicode-valued columns and can return Unicode
values from an SQL query.

Unicode data is usually converted to a particular encoding before it
gets written to disk or sent over a socket.  It's possible to do all
the work yourself: open a file, read an 8-bit string from it, and
convert the string with ``unicode(str, encoding)``.  However, the
manual approach is not recommended.

One problem is the multi-byte nature of encodings; one Unicode
character can be represented by several bytes.  If you want to read
the file in arbitrary-sized chunks (say, 1K or 4K), you need to write
error-handling code to catch the case where only part of the bytes
encoding a single Unicode character are read at the end of a chunk.
One solution would be to read the entire file into memory and then
perform the decoding, but that prevents you from working with files
that are extremely large; if you need to read a 2GB file, you need 2GB
of RAM.  (More, really, since for at least a moment you'd need to have 
both the encoded string and its Unicode version in memory.)

The solution would be to use the low-level decoding interface to catch
the case of partial coding sequences.   The work of implementing this
has already been done for you: the ``codecs`` module includes a
version of the ``open()`` function that returns a file-like object
that assumes the file's contents are in a specified encoding and
accepts Unicode parameters for methods such as ``.read()`` and
``.write()``.
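
To see the partial-sequence problem concretely, the sketch below
splits a UTF-8 string in the middle of a two-byte character.  It
assumes ``codecs.getincrementaldecoder()``, which was added in Python
2.5 and so is not available in the 2.4 release; the variable names are
arbitrary:

```python
import codecs

data = u'caf\xe9'.encode('utf-8')       # five bytes: 'caf' plus 0xc3 0xa9
first, second = data[:4], data[4:]      # the split falls inside the e-acute

# Decoding the first chunk on its own fails on the dangling 0xc3 byte.
try:
    first.decode('utf-8')
    chunk_failed = False
except UnicodeDecodeError:
    chunk_failed = True
print(chunk_failed)

# An incremental decoder remembers the incomplete sequence between calls.
decoder = codecs.getincrementaldecoder('utf-8')()
pieces = [decoder.decode(first), decoder.decode(second)]
pieces.append(decoder.decode(b'', final=True))   # flush; errors if data is left over
text = u''.join(pieces)
print(text == u'caf\xe9')
```

The wrapper object returned by ``codecs.open()`` takes care of this
kind of bookkeeping so you don't have to.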

The function's parameters are 
``open(filename, mode='rb', encoding=None, errors='strict', buffering=1)``.  ``mode`` can be
``'r'``, ``'w'``, or ``'a'``, just like the corresponding parameter to the
regular built-in ``open()`` function; add a ``'+'`` to 
update the file.  ``buffering`` is similarly
parallel to the standard function's parameter.  
``encoding`` is a string giving 
the encoding to use; if it's left as ``None``, a regular Python file
object that accepts 8-bit strings is returned.  Otherwise, a wrapper
object is returned, and data written to or read from the wrapper
object will be converted as needed.  ``errors`` specifies the action
for encoding errors and can be one of the usual values of 'strict',
'ignore', and 'replace'.

Reading Unicode from a file is therefore simple::

    import codecs
    f = codecs.open('unicode.rst', encoding='utf-8')
    for line in f:
        print repr(line)

It's also possible to open files in update mode, 
allowing both reading and writing::

    f = codecs.open('test', encoding='utf-8', mode='w+')
    f.write(u'\u4500 blah blah blah\n')
    f.seek(0)
    print repr(f.readline()[:1])
    f.close()

Unicode character U+FEFF is used as a byte-order mark (BOM), 
and is often written as the first character of a file in order
to assist with autodetection of the file's byte ordering.
Some encodings, such as UTF-16, expect a BOM to be present at 
the start of a file; when such an encoding is used,
the BOM will be automatically written as the first character 
and will be silently dropped when the file is read.  There are 
variants of these encodings, such as 'utf-16-le' and 'utf-16-be'
for little-endian and big-endian encodings, that specify 
one particular byte ordering and don't
skip the BOM.
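
The difference between the variants is easy to see by encoding and
decoding a short string in memory.  This is only a sketch using the
``BOM_UTF16`` and ``BOM_UTF16_LE`` constants from the ``codecs``
module; the variable names are arbitrary:

```python
import codecs

text = u'hi'
with_bom = text.encode('utf-16')        # native byte order, BOM prepended
little = text.encode('utf-16-le')       # fixed byte order, no BOM

starts_with_bom = with_bom[:2] == codecs.BOM_UTF16
no_bom = little == b'h\x00i\x00'
print(starts_with_bom)
print(no_bom)

# Decoding with 'utf-16' consumes the BOM; 'utf-16-le' leaves it in the
# result as the character U+FEFF.
raw = codecs.BOM_UTF16_LE + little
bom_dropped = raw.decode('utf-16') == text
bom_kept = raw.decode('utf-16-le') == u'\ufeff' + text
print(bom_dropped)
print(bom_kept)
```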


Unicode Filenames
'''''''''''''''''''''''''

Most of the operating systems in common use today support filenames
that contain arbitrary Unicode characters.  Usually this is
implemented by converting the Unicode string into some encoding that
varies depending on the system.  For example, Mac OS X uses UTF-8 while
Windows uses a configurable encoding; on Windows, Python uses the name
"mbcs" to refer to whatever the currently configured encoding is.  On
Unix systems, there will only be a filesystem encoding if you've set
the ``LANG`` or ``LC_CTYPE`` environment variables; if you haven't,
the default encoding is ASCII.

The ``sys.getfilesystemencoding()`` function returns the encoding to
use on your current system, in case you want to do the encoding
manually, but there's not much reason to bother.  When opening a file
for reading or writing, you can usually just provide the Unicode
string as the filename, and it will be automatically converted to the
right encoding for you::

    filename = u'filename\u4500abc'
    f = open(filename, 'w')
    f.write('blah\n')
    f.close()

Functions in the ``os`` module such as ``os.stat()`` will also accept
Unicode filenames.

``os.listdir()``, which returns filenames, raises an issue: should it
return the Unicode version of filenames, or should it return 8-bit
strings containing the encoded versions?  ``os.listdir()`` will do
both, depending on whether you provided the directory path as an 8-bit
string or a Unicode string.  If you pass a Unicode string as the path,
filenames will be decoded using the filesystem's encoding and a list
of Unicode strings will be returned, while passing an 8-bit path will
return the 8-bit versions of the filenames.  For example, assuming the
default filesystem encoding is UTF-8, running the following program::

    fn = u'filename\u4500abc'
    f = open(fn, 'w')
    f.close()

    import os
    print os.listdir('.')
    print os.listdir(u'.')

will produce the following output::

    amk:~$ python t.py
    ['.svn', 'filename\xe4\x94\x80abc', ...]
    [u'.svn', u'filename\u4500abc', ...]

The first list contains UTF-8-encoded filenames, and the second list
contains the Unicode versions.


	
Tips for Writing Unicode-aware Programs
''''''''''''''''''''''''''''''''''''''''''''

This section provides some suggestions on writing software that 
deals with Unicode.

The most important tip is: 

    Software should only work with Unicode strings internally, 
    converting to a particular encoding on output.  

If you attempt to write processing functions that accept both 
Unicode and 8-bit strings, you will find your program vulnerable to 
bugs wherever you combine the two different kinds of strings.  Python's 
default encoding is ASCII, so whenever an 8-bit string in the input
data contains a byte with a value greater than 127, you'll get a
``UnicodeDecodeError`` because that byte can't be handled by the
ASCII encoding.  

It's easy to miss such problems if you only test your software 
with data that doesn't contain any 
accents; everything will seem to work, but there's actually a bug in your
program waiting for the first user who attempts to use characters >127.
A second tip, therefore, is:

    Include characters >127 and, even better, characters >255 in your
    test data.
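
As a rough illustration of why both kinds of test data matter, the
sketch below checks which sample strings survive a round trip through
three encodings.  The ``survives()`` helper and the sample strings are
invented for this example:

```python
def survives(text, encoding):
    """Return True if text can be encoded and decoded back unchanged."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        return False

# Pure ASCII, a character above 127, and a character above 255.
samples = [u'plain', u'caf\xe9', u'\u4500 blah']

results = {}
for encoding in ('ascii', 'latin-1', 'utf-8'):
    results[encoding] = [survives(text, encoding) for text in samples]

print(results['ascii'])      # only the plain string survives
print(results['latin-1'])    # the character above 255 still fails
print(results['utf-8'])      # everything survives
```

A test suite containing only the first two samples would never notice
code that quietly assumes Latin-1.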

When using data coming from a web browser or some other untrusted source,
a common technique is to check for illegal characters in a string
before using the string in a generated command line or storing it in a 
database.  If you're doing this, be careful to check 
the string once it's in the form that will be used or stored; it's 
possible for encodings to be used to disguise characters.  This is especially
true if the input data also specifies the encoding; 
many encodings leave the commonly checked-for characters alone, 
but Python includes some encodings such as ``'base64'``
that modify every single character.

For example, let's say you have a content management system that takes a 
Unicode filename, and you want to disallow paths with a '/' character.
You might write this code::

    def read_file(filename, encoding):
        if '/' in filename:
            raise ValueError("'/' not allowed in filenames")
        unicode_name = filename.decode(encoding)
        f = open(unicode_name, 'r')
        # ... return contents of file ...
        
However, if an attacker could specify the ``'base64'`` encoding,
they could pass ``'L2V0Yy9wYXNzd2Q='``, which is the base-64
encoded form of the string ``'/etc/passwd'``, to read a 
system file.   The above code looks for ``'/'`` characters 
in the encoded form and misses the dangerous character 
in the resulting decoded form.
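
One way to avoid the problem is to decode first and check the decoded
text afterwards.  The sketch below uses a made-up ``decode_filename()``
helper, and it demonstrates the attack with UTF-7 rather than
``'base64'`` (UTF-7 can also spell ``'/'`` without using that byte) so
that the example runs on later Python versions as well:

```python
def decode_filename(raw_name, encoding):
    """Decode first, then validate the decoded text (hypothetical helper)."""
    unicode_name = raw_name.decode(encoding)
    if u'/' in unicode_name:
        raise ValueError("'/' not allowed in filenames")
    return unicode_name

disguised = b'+AC8-etc+AC8-passwd'      # UTF-7 spelling of '/etc/passwd'

# A check on the raw bytes finds nothing suspicious...
slash_visible = b'/' in disguised
print(slash_visible)

# ...but checking after decoding catches the disguised path.
try:
    decode_filename(disguised, 'utf-7')
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

A real application would probably also restrict which encodings a
caller is allowed to name at all.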

References
''''''''''''''

The PDF slides for Marc-André Lemburg's presentation "Writing
Unicode-aware Applications in Python" are available at
<http://www.egenix.com/files/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>
and discuss questions of character encodings as well as how to
internationalize and localize an application.


Revision History and Acknowledgements
------------------------------------------

Thanks to the following people who have noted errors or offered
suggestions on this article: Nicholas Bastin, 
Marius Gedminas, Kent Johnson, Ken Krugler,
Marc-André Lemburg, Martin von Löwis.

Version 1.0: posted August 5 2005.

Version 1.01: posted August 7 2005.  Corrects factual and markup
errors; adds several links.

Version 1.02: posted August 16 2005.  Corrects factual errors.


.. comment Additional topic: building Python w/ UCS2 or UCS4 support
.. comment Describe obscure -U switch somewhere?

.. comment 
   Original outline:

   - [ ] Unicode introduction
       - [ ] ASCII
       - [ ] Terms
	   - [ ] Character
	   - [ ] Code point
	 - [ ] Encodings
	    - [ ] Common encodings: ASCII, Latin-1, UTF-8
       - [ ] Unicode Python type
	   - [ ] Writing unicode literals
	       - [ ] Obscurity: -U switch
	   - [ ] Built-ins
	       - [ ] unichr()
	       - [ ] ord()
	       - [ ] unicode() constructor
	   - [ ] Unicode type
	       - [ ] encode(), decode() methods
       - [ ] Unicodedata module for character properties
       - [ ] I/O
	   - [ ] Reading/writing Unicode data into files
	       - [ ] Byte-order marks
	   - [ ] Unicode filenames
       - [ ] Writing Unicode programs
	   - [ ] Do everything in Unicode
	   - [ ] Declaring source code encodings (PEP 263)
       - [ ] Other issues
	   - [ ] Building Python (UCS2, UCS4)


