[Baypiggies] quick question: regex to stop naughty control characters
Daniel Yoo
dyoo at cs.wpi.edu
Wed Apr 25 22:22:51 CEST 2007
Hi JJ,
The question is slightly underdefined still; do you mind if I ask a few
more questions?
> I accept UTF-8 strings and decode them to unicode objects.
Ok, so what we really have are bytes whose intended interpretation is
utf-8, yes? Is the input a unicode string? Or is it rather a sequence of
bytes (which Python often uses a regular string for)?
> I would like to check that the strings are no longer than 128 characters
Unfortunately, "characters" is ambiguous and has at least two meanings
these days. Do you mean 128 bytes, or 128 unicode characters? That needs
to be cleared up before this problem can be attacked.
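To make the ambiguity concrete, here is a small sketch of how the two
counts can differ for the same text. The sample string is made up, and I'm
assuming Python 3 here, where bytes and str are distinct types:

```python
# Illustrative only: the sample text is invented for this example.
text = "na\u00efve caf\u00e9"    # "naive cafe" with two accented letters
raw = text.encode("utf-8")       # the same text as UTF-8 encoded bytes

char_count = len(text)           # counts unicode code points
byte_count = len(raw)            # counts bytes; each accented letter takes two

print(char_count, byte_count)
```

A limit of "128" can therefore accept or reject the same input depending
on which of the two lengths is measured.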
Also, what part of this really requires regular expressions? What you've
shown so far restricts a string by length, but that can already be
expressed as a simple conditional:
    len(some_string) <= 128
I have to assume the need for a regex has something to do with the
definition of reasonableness.
Does the check for reasonableness have to happen at the same time as the
test for length? Must the check for reasonableness happen before decoding
bytes assuming a utf-8 interpretation? Or can something like:
    return (len(some_string) <= 128 and
            is_reasonable(some_string.decode('utf-8')))
suffice?
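For what it's worth, here is a minimal sketch of the decode-then-check
arrangement. It assumes Python 3 and that the limit is measured in decoded
characters rather than bytes. The names is_valid and MAX_LENGTH are my own
inventions, and is_reasonable is only a placeholder, since its definition
is exactly what's still open:

```python
MAX_LENGTH = 128  # assumed limit, read from "no longer than 128 characters"

def is_reasonable(text):
    # Placeholder only: the real rule has not been defined yet.
    return True

def is_valid(raw):
    """Return True if raw (bytes) is valid UTF-8, short enough, and reasonable."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Input that is not valid UTF-8 is rejected outright.
        return False
    # Length is counted in code points here, after decoding.
    return len(text) <= MAX_LENGTH and is_reasonable(text)
```

Decoding first means malformed input fails early, and the length test and
the reasonableness test both see the same unicode string.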
> By "reasonable", I think the only thing I want to prevent are control
> characters.
What do you mean by a "control character"? Can you be more specific about
the context that you're trying to guard?
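To show why the definition matters, here are two candidate readings of
"control character" sketched in Python 3. Neither is necessarily the one
you want, and the function names are hypothetical:

```python
import re
import unicodedata

# Candidate 1: ASCII control characters only (U+0000 to U+001F, plus DEL).
ASCII_CONTROL = re.compile(r"[\x00-\x1f\x7f]")

def has_ascii_control(text):
    return ASCII_CONTROL.search(text) is not None

# Candidate 2: anything Unicode puts in general category "Cc", which also
# covers the C1 range U+0080 to U+009F.
def has_unicode_control(text):
    return any(unicodedata.category(ch) == "Cc" for ch in text)
```

Note that both candidates flag tab and newline, which may or may not be
acceptable depending on whether the field allows multi-line input. Only
the second flags something like U+0085.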
I apologize for being pedantic, but form validation needs to be handled
methodically to be valuable.
Best of wishes!