syntax philosophy

Tue Nov 18 00:20:45 EST 2003

"Dave Brueck" <dave at pythonapocrypha.com> wrote in message news:<mailman.814.1069106836.702.python-list at python.org>...
> Tuang wrote:

> > But I'm surprised at what you apparently have to go through to do
> > something as common as counting the frequency of elements in a
> > collection. For example, counting word frequency in a file in Perl
> > means looping over all the words with the following line:
> >
> > $histogram{$word}++;
> 
> Hi Tuang,
> 
> Here's my _opinion_: Perl is especially geared towards text processing, where
> maybe counting word frequency is fairly common. Python is more of a general
> purpose programming language, in which counting the word frequency is a pretty
> rare operation (I can't remember needing to do that more than once or twice in
> the past several years). As such, it probably doesn't make sense to support
> that feature at the language level - it would burden a lot of people with
> knowing syntax they'd rarely use.

Oops. I appear to have given the impression that word frequency was
what I was after. I was just using that as an easy-to-explain example
of a very common task: subtotaling.

Imagine that you have a list of records -- lines in a text file will
do fine. Let's say each record is a person and you're interested in
favorite colors.

You iterate thru the lines, regexing the "favorite color" field out of
each and put it in the variable $color. Then you just use the line:

$dict{$color}++

to count up what you find. The first time that line runs, it
creates the dictionary, then creates a key for $color, initializes its
value to zero, then increments it to 1.

As you continue iterating, each new color it encounters creates a new
key, initializing it to zero and incrementing. When it finds a color
that already has a key, it just increments the count.

When you get to the end, your dictionary has keys for all the favorite
colors listed by at least one person, along with a count for how many
people listed that as their favorite color.

You can then sort, either on keys or more likely on values, and list
the favorite colors in order of popularity.
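
For comparison, here is a rough Python sketch of the same counting
loop. The sample records and the "color=" field layout are made up for
illustration, and count_colors is a hypothetical helper name; the only
piece doing the real work is dict.get with a default of 0, which stands
in for Perl's create-on-first-use behavior.

```python
import re

# Hypothetical sample records; the "color=" field layout is an assumption.
records = [
    "name=Ann color=blue",
    "name=Bob color=red",
    "name=Cy color=blue",
    "name=Di color=green",
    "name=Ed color=blue",
    "name=Flo color=green",
]

def count_colors(lines):
    """Count how many records list each favorite color."""
    counts = {}
    pattern = re.compile(r"color=(\w+)")
    for line in lines:
        match = pattern.search(line)
        if match is None:
            continue  # skip records without a color field
        color = match.group(1)
        # get() returns 0 for a color not seen before, so no setup is needed
        counts[color] = counts.get(color, 0) + 1
    return counts

counts = count_colors(records)

# List (color, count) pairs from most to least popular.
by_popularity = sorted(counts.items(), key=lambda item: item[1], reverse=True)
```

The counting itself is still one line per record; the difference from
the Perl version is that the default value has to be stated explicitly.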

The word frequency in a list (e.g. a file) of words is just another
similar operation. Create a counter for each new word you encounter and
increment it each time you see that word again.

And the actual application that brought this up is that I'm going thru
the Python online tutorial where it shows an algorithm for finding
primes. I just got curious about the distribution of gaps between
primes. It's the same problem: Find a prime. Then find the next higher
prime. Subtract the smaller from the larger to find the gap, then
increment the dictionary using that gap as the key.
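
A minimal sketch of that prime-gap count in Python might look like the
following. The helper names primes_below and gap_histogram are
hypothetical, and the prime finder is plain trial division rather than
the tutorial's exact code.

```python
def primes_below(limit):
    """Return all primes less than limit, by simple trial division."""
    primes = []
    for n in range(2, limit):
        # n is prime if no smaller prime divides it evenly
        if all(n % p != 0 for p in primes):
            primes.append(n)
    return primes

def gap_histogram(primes):
    """Count how often each gap between consecutive primes occurs."""
    gaps = {}
    # Pair each prime with the next higher one.
    for smaller, larger in zip(primes, primes[1:]):
        gap = larger - smaller
        gaps[gap] = gaps.get(gap, 0) + 1
    return gaps

small_primes = primes_below(30)
gap_counts = gap_histogram(small_primes)
```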

This is a very common data analysis problem. It's the SQL database
operation of GROUP BY and then returning COUNT, but applied to any
sequence.

You can also subtotal nearly the same way. If you have sales records
for a bunch of salespeople, you just iterate thru the sales records,
plucking out a name for $salesperson and a sale amount ($amount), then
call:

$dict{$salesperson} += $amount

Instead of adding one, which "++" does, this increments by the amount
of the sale, resulting in subtotals for each salesperson.
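
The subtotal version changes only what gets added. In this sketch the
sales records are invented (salesperson, amount) pairs and subtotal is
a hypothetical helper name:

```python
# Hypothetical sales records as (salesperson, amount) pairs.
sales = [
    ("alice", 120.0),
    ("bob", 75.5),
    ("alice", 30.0),
    ("carol", 200.0),
    ("bob", 24.5),
]

def subtotal(records):
    """Sum the sale amounts for each salesperson."""
    totals = {}
    for salesperson, amount in records:
        # Same idiom as counting, but add the amount instead of 1.
        totals[salesperson] = totals.get(salesperson, 0) + amount
    return totals

totals = subtotal(sales)
```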

This is such a common operation for Perl users that I'm surprised that
it's not easier to express in Python.

But I may be misunderstanding Python's philosophy a bit. I'm surprised
that value++ has to be spelled out as value = value + 1 (or at best
value += 1), too, so I'm not quite sure that I understand the
philosophy.

> 
> [snip]
> > But I guess I'm making assumptions about what Python's philosophy
> > really is. I would expect that a language with something as nice as
> >
> > [x**3 for x in my_list]
> 
> Building a list out of another list, however, is far more common, hence (in my
> view at least) the appropriateness of syntax-level support.
> 
> > Is this just something that hasn't been done yet but is on the way, or
> > is it a violation of Python's philosophy in some way?
> 
> Python can automatically import custom modules and functions on startup (search
> for information on the site module), so if I were you I'd write a
> WordHistogram function in my custom site module just once and never look back.
> The added benefit is that
> 
> histogram = WordHistogram(text)
> 
> is much more readable to me as well as others than
> 
> $histogram{$word}++;

I agree for word frequency, but not for something as general as GROUP
BY combined with an aggregate operation such as COUNT or SUM. Maybe
using some of the functional programming constructs of Python (before
they're removed in Python 3) would be the way to build my own.
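
One way such a build-your-own helper could look, as a sketch only:
group_by and all of its parameter names are hypothetical, not anything
from the standard library. It takes functions for picking the key,
picking the value, and combining values, so COUNT, SUM and similar
aggregates are all the same loop.

```python
import operator

def group_by(items, key, value, combine, initial):
    """Fold value(item) into one running result per key(item)."""
    result = {}
    for item in items:
        k = key(item)
        # Start from `initial` the first time a key is seen.
        result[k] = combine(result.get(k, initial), value(item))
    return result

# COUNT: every item contributes 1 to its group.
words = ["red", "blue", "red", "green", "red"]
word_counts = group_by(words, key=lambda w: w, value=lambda w: 1,
                       combine=operator.add, initial=0)

# SUM and MAX over hypothetical (salesperson, amount) pairs.
sales = [("ann", 10), ("bob", 5), ("ann", 7)]
sales_totals = group_by(sales, key=lambda s: s[0], value=lambda s: s[1],
                        combine=operator.add, initial=0)
biggest_sale = group_by(sales, key=lambda s: s[0], value=lambda s: s[1],
                        combine=max, initial=0)
```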

And thanks for the tip on the "site module"! No matter what, that
sounds like something useful.

> 
> My impression is that features generally get added if (1) there is a good
> enough case for their broad usefulness and (2) they don't overly compromise the
> relatively clean syntax of the language. In this specific example, the
> histogram-builder function fails both tests, 

As I said, it shouldn't fail (1) if people understand it, unless
Python programmers are significantly different from Perl programmers.
They may actually be (which is why I'm asking), but it may just be
that its broad usefulness wasn't clear from my explanation.