Using dictionary key as a regular expression class

Sat Jan 23 04:22:39 EST 2010

On Sat, Jan 23, 2010 at 02:45:41AM EST, Terry Reedy wrote:
> On 1/22/2010 9:58 PM, Chris Jones wrote:

>> Do you mean I should just read the file one character at a time?
>
> Whoops, my misdirection (you can .read(1), but this is s l o w).
> I meant to suggest processing it a char at a time.

Right.. that's how I understood it - i.e. asking python for the next
character, and not worrying about how much is retrieved from the disk in
one pass.

> 1. If not too big,
>
> for c in open(x, 'rb').read():  # left .read() off
> # 'b' will get bytes, though ord(c) same for ascii chars for byte or
> # unicode
>
> 2. If too big for that,
>
> for line in open():
>   for c in line:    # or left off this part

Well the script is not going to process anything larger than a few
kilobytes, but all the same that's something I want to understand
better.

Isn't there any way I can tell python to retrieve a fairly large chunk
of the disk file, like 4-8K maybe, and increment a pointer behind the
scenes while I iterate, so that I have access to characters one at a
time? I mean, that should be pretty fast, since disk access would be
minimal, and no data would actually be copied.. I would have thought
that 1. above would cause python to do something like that behind the
scenes.
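
For what it's worth, here is a minimal sketch of that chunked approach,
assuming Python 3 and a text stream. The name iter_chars and the 4096
chunk size are made up for illustration. As far as I know, open()
already buffers reads by default, so this mostly makes the buffering
explicit rather than adding speed:

```python
import io


def iter_chars(stream, chunk_size=4096):
    """Yield characters one at a time, fetching chunk_size at a go.

    'stream' is any text-mode file object; iter_chars is a hypothetical
    helper name, not something from the standard library.
    """
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:          # empty string means end of file
            break
        for c in chunk:        # hand out the chunk one character at a time
            yield c


# In-memory stand-in for a file opened with open(path, encoding='utf-8')
sample = io.StringIO("hello, world")
chars = list(iter_chars(sample, chunk_size=4))
```

With a real file, the same loop would be wrapped in something like
"with open(path, encoding='utf-8') as f:", where path is whatever file
is being scanned.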

[..]

>> Thanks much for the snippet, let me play with it and see if I can
>> come up with a Unicode/utf-8 version.. since while I'm at it I might
>> as well write something a bit more general than C code.
>>
>> Since utf-8 is backward-compatible with 7bit ASCII, this shouldn't be
>> a problem.

> For any extended ascii, 

You mean 8-bit encodings, like latin1 right?

> use larger array without decoding (until print, if need be). For
> unicode, add encoding to open and 'c in line' will return unicode
> chars. Then use *one* dict or defaultdict. I think something like

> from collections import defaultdict
> d = defaultdict(int)
> ...
>     d[c] += 1 # if c is new, d[c] defaults to int() == 0

I don't know python, so I basically googled for the building blocks of
my little script.. and I remember seeing defaultdict(int) mentioned some
place or other, but somehow I didn't understand what it did. 

Cool feature.
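
To convince myself of what defaultdict(int) does, a small sketch
(Python 3 assumed; count_chars is just a name picked for illustration).
A missing key is created on first access with the value int(), i.e. 0,
so the "+= 1" never raises KeyError:

```python
from collections import defaultdict


def count_chars(text):
    """Return a defaultdict mapping each character to its count."""
    counts = defaultdict(int)
    for c in text:
        counts[c] += 1     # a new key starts at int() == 0, then becomes 1
    return counts


counts = count_chars("banana")
```

One thing to keep in mind: merely looking up a missing key, as in
counts['z'], also inserts it with the value 0.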

Even if it's a bit wasteful, with unicode/utf-8, it looks like working
with the code points, either as dictionary keys or as index values into
an array might make the logic simpler - i.e. for each char, obtain its
code point 'cp' and add one to dict[cp] or array[cp] - and then loop and
print all non-zero values when the end-of-file condition is reached.
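
A rough sketch of that idea, assuming Python 3 where ord() on a str
character gives its code point. The function names and the output
format are invented for illustration, and a plain dict is used, so only
code points that actually occur are stored:

```python
def count_code_points(text):
    """Map each code point (an int) to the number of times it occurs."""
    counts = {}
    for c in text:
        cp = ord(c)                         # code point of this character
        counts[cp] = counts.get(cp, 0) + 1
    return counts


def format_counts(counts):
    """Return one line per code point seen, in code point order."""
    return ["U+%04X %r: %d" % (cp, chr(cp), n)
            for cp, n in sorted(counts.items())]


# Stand-in for the decoded contents of a file; \u00e9 is e-acute
report = format_counts(count_code_points("a\u00e9a"))
for line in report:
    print(line)
```

With a dict there are no zero entries to skip at the end; the array
variant would need the "non-zero" filter when printing.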

Food for thought in any case.

CJ