[Baypiggies] quick question: regex to stop naughty control characters

Daniel Yoo dyoo at cs.wpi.edu
Wed Apr 25 22:22:51 CEST 2007


Hi JJ,

The question is still slightly underspecified; do you mind if I ask a few 
more questions?


> I accept UTF-8 strings and decode them to unicode objects.

Ok, so what we really have are bytes whose intended interpretation is 
utf-8, yes?  Is the input a unicode string?  Or is it rather a sequence of 
bytes (which Python often uses a regular string for)?


> I would like to check that the strings are no longer than 128 characters

Unfortunately, "characters" has at least two meanings these days.  Do you 
mean 128 bytes, or 128 unicode characters?  The two counts differ as soon 
as the input contains non-ASCII text, so this needs to be cleared up before 
the problem can be attacked.
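
To make the distinction concrete, here is a small sketch. The sample text is made up for illustration, and it assumes Python 3, where bytes and str are separate types:

```python
# Illustrative only: the sample text is invented, not taken from the thread.
text = "na\u00efve caf\u00e9"            # "naïve café", 10 Unicode code points
raw = text.encode("utf-8")               # the UTF-8 bytes as they would arrive

byte_length = len(raw)                   # counts bytes
char_length = len(raw.decode("utf-8"))   # counts Unicode code points
```

The two accented letters each take two bytes in UTF-8, so the byte count comes out larger than the code point count.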


Also, what part of this really requires regular expressions?  What 
you've shown so far restricts a string by length, but that's already a 
simpler conditional:

     len(some_string) <= 128

I have to assume the regular expression has something to do with your 
definition of "reasonable".


Does the check for reasonableness have to happen at the same time as the 
test for length?  Must the check for reasonableness happen before decoding 
bytes assuming a utf-8 interpretation?  Or can something like:

     return (len(some_string) <= 128 and
             is_reasonable(some_string.decode('utf-8')))

suffice?
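
If the two checks can be separated like that, one possible shape is sketched below. It assumes Python 3 and assumes the limit is 128 code points rather than 128 bytes, which is exactly the open question above. The names is_acceptable and max_chars are hypothetical, and is_reasonable is passed in because its definition is still undecided:

```python
def is_acceptable(raw, is_reasonable, max_chars=128):
    """Return True if raw is valid UTF-8, has at most max_chars code points,
    and passes the caller-supplied is_reasonable check."""
    # A code point needs at most 4 bytes in UTF-8, so longer input can be
    # rejected cheaply before any decoding happens.
    if len(raw) > 4 * max_chars:
        return False
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes that are not valid UTF-8 are rejected outright.
        return False
    return len(text) <= max_chars and is_reasonable(text)
```

Doing the length test on the decoded text keeps the limit in the same units as the requirement, while the byte-level test in front only serves as a cheap early exit.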


> By "reasonable", I think the only thing I want to prevent are control 
> characters.

What do you mean by a "control character"?  Can you be more specific about 
the context that you're trying to guard?
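
For reference, one common definition is the Unicode general category "Cc", which covers U+0000 to U+001F and U+007F to U+009F. The sketch below assumes that definition and Python 3; it may not be the definition you want, since tab and newline are also in that category. The function names are hypothetical:

```python
import re
import unicodedata

def has_control_chars(text):
    """True if any character in text has Unicode general category Cc."""
    return any(unicodedata.category(ch) == "Cc" for ch in text)

# The same set written as a regular expression: the C0 controls plus DEL,
# and the C1 controls.
CONTROL_RE = re.compile(r"[\x00-\x1f\x7f-\x9f]")

def has_control_chars_re(text):
    """Regex variant of has_control_chars, assuming the same Cc definition."""
    return CONTROL_RE.search(text) is not None
```

Both functions expect already-decoded text, not raw bytes, so they would run after the decoding step.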

I apologize for being pedantic, but form validation needs to be handled 
methodically to be valuable.



Best of wishes!

