[Baypiggies] quick question: regex to stop naughty control characters

Thu Apr 26 22:18:21 CEST 2007

Shannon -jj Behrens wrote:
> On 4/25/07, Chris Clark <Chris.Clark at ingres.com> wrote:
>> Shannon -jj Behrens wrote:
>>> I'm doing some form validation.  I accept UTF-8 strings and decode
>>> them to unicode objects.  I would like to check that the strings are
>>> no longer than 128 characters, and that they are "reasonable".  I'm
>>> using FormEncode with a regex that looks like r".{1,128}$".  By
>>> "reasonable", I think the only thing I want to prevent are control
>>> characters.  Now, I'm sure some Unicode whiz out there knows how to do
>>> this with some funky Unicode regex magic, but I don't know how.
>>> Anyone know the right way to do this?  Should I be worried about more
>>> than just control characters?  I'm already taking care of HTML
>>> escaping, SQL injection, etc.
>>>
>> Mailing you privately in case I completely misunderstood and made this
>> more complex than it needs to be :-)
> 
> Nope, I think you understand the problem completely, so I'm going to
> CC the list.  Your comments below are very helpful!
> 
>> I don't have an answer but a couple of things to research/consider:
>>
>>     * I'm assuming you mean validating unicode strings (not validating
>>       the utf-8 encoded bytes).
> 
> Yep.
> 
>>     * What is a control character, I think the Unicode standard says
>>       that codepoints in the range U+2400 to U+2421 are control
>>       characters BUT this doesn't include things like U+000D which is a
>>       carriage return.
> 
> Now we're on the same page! :)

   According to the Unicode glossary 
(http://unicode.org/glossary/index.html#control_codes), control 
characters are "U+0000..U+001F and U+007F..U+009F".  I believe the regex 
I posted yesterday covered that range exactly.
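
   A quick way to sanity-check those ranges, sketched in Python 3 (an 
assumption; this thread is about Python 2.5): the standard unicodedata 
module reports the general category "Cc" for control characters, and the 
code points in that category should be the same 65 that the glossary 
lists. The names GLOSSARY_CONTROLS and CC_CONTROLS are made up for this 
illustration.

```python
import sys
import unicodedata

# Code points from the glossary definition: U+0000..U+001F and U+007F..U+009F.
GLOSSARY_CONTROLS = set(range(0x00, 0x20)) | set(range(0x7F, 0xA0))

# Every code point whose Unicode general category is Cc (control).
CC_CONTROLS = {
    cp for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Cc"
}

print(len(GLOSSARY_CONTROLS))             # 65
print(GLOSSARY_CONTROLS == CC_CONTROLS)   # True
# U+2400..U+2421 are visible "control pictures" (category So), not controls.
print(unicodedata.category("\u2400"))     # So
```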

> 
>>     * What is a white space character; there are lots of white space
>>       characters, e.g. 0000, 200C..200F,202A..202E, 206A..206F, FEFF but
>>       if you check what some of these are they are defined as things
>>       that control behavior (e.g. U+200F) and they don't include CR/LF!.
> 
> :)
> 
>> You may find that what is more important is what the field should
>> contain and what it is intended to be used for, rather than what is not allowed.
> 
> I want to restrict the user to "reasonable things that a name might
> contain".  Clearly, a tab is invalid.  However, I don't know the regex
> / unicode syntax to express that I want "normal characters".
> 

   There is a well-defined regular expression syntax for matching 
Unicode characters by category that would let you do what you want, 
except that Python's re module does not implement it yet (as of 2.5):
	http://unicode.org/unicode/reports/tr18/#Categories
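
   Outside of a regex, a category-based check can be approximated with 
the standard unicodedata module. The following is only a sketch in 
Python 3 (an assumption; not the syntax from the report above), and the 
function name has_no_other_category_chars is hypothetical. It rejects 
every character in a general category starting with "C", which is a 
broader rule than "control characters only" and is itself an assumption 
about what "reasonable" means.

```python
import unicodedata

def has_no_other_category_chars(text):
    """Return True if no character is in a Unicode "Other" category (C*).

    That covers Cc (control), Cf (format), Cs (surrogate), Co (private use)
    and Cn (unassigned). Rejecting all of them is an assumption made here.
    """
    return all(not unicodedata.category(ch).startswith("C") for ch in text)

print(has_no_other_category_chars("Zoë Müller"))      # True
print(has_no_other_category_chars("tab\there"))       # False: tab is Cc
print(has_no_other_category_chars("rtl\u200fmark"))   # False: U+200F is Cf
```

   Note that Cf also includes the zero-width joiner and non-joiner 
(U+200D, U+200C), which some scripts rely on, so rejecting all of Cf may 
be too strict for names in those scripts.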

   Until it is implemented, I'm afraid you don't have any choice but to 
list the ranges of characters you want to accept or reject explicitly. 
For example, the regex I posted yesterday matches strings of 1 to 128 
characters in which none of the characters are Unicode control characters 
(see the definition of a control character above):

	ur"(?u)^[^\u0000-\u001f\u007f-\u009f]{1,128}$"
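
   For anyone trying this on Python 3 (an assumption; the regex above 
is written for Python 2), a rough equivalent might look like the sketch 
below. NAME_RE and is_valid_name are hypothetical names, and the 
FormEncode wiring is left out.

```python
import re

# Same character class as the regex above. Python 3 string literals are
# already Unicode, so the Python 2 "ur" prefix and the (?u) flag are dropped.
NAME_RE = re.compile(r"[^\u0000-\u001f\u007f-\u009f]{1,128}")

def is_valid_name(value):
    """True if value is 1 to 128 characters with no control characters."""
    # fullmatch anchors both ends, so ^ and $ are not needed.
    return NAME_RE.fullmatch(value) is not None

print(is_valid_name("Shannon -jj Behrens"))   # True
print(is_valid_name("bad\tname"))             # False: contains a tab
print(is_valid_name("x" * 129))               # False: too long
print(is_valid_name(""))                      # False: too short
print(is_valid_name("name\n"))                # False: trailing newline
```

   One design note: with re.match or re.search, "$" also matches just 
before a newline at the very end of the string, so a value like "name\n" 
gets through a pattern anchored with "^...$". Using fullmatch (or "\Z" 
in place of "$") closes that gap.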

   Kelly

Kelly Yancey
-- 
http://kbyanc.blogspot.com/