[Baypiggies] quick question: regex to stop naughty control characters

Shannon -jj Behrens jjinux at gmail.com
Thu Apr 26 20:49:19 CEST 2007

On 4/25/07, Chris Clark <Chris.Clark at ingres.com> wrote:
> Shannon -jj Behrens wrote:
> > I'm doing some form validation.  I accept UTF-8 strings and decode
> > them to unicode objects.  I would like to check that the strings are
> > no longer than 128 characters, and that they are "reasonable".  I'm
> > using FormEncode with a regex that looks like r".{1,128}$".  By
> > "reasonable", I think the only things I want to prevent are control
> > characters.  Now, I'm sure some Unicode whiz out there knows how to do
> > this with some funky Unicode regex magic, but I don't know how.
> > Anyone know the right way to do this?  Should I be worried about more
> > than just control characters?  I'm already taking care of HTML
> > escaping, SQL injection, etc.
> >
> Mailing you privately in case I completely misunderstood and made this
> more complex than it needs to be :-)

Nope, I think you understand the problem completely, so I'm going to
CC the list.  Your comments below are very helpful!

> I don't have an answer but a couple of things to research/consider:
>     * I'm assuming you mean validating unicode strings (not validating
>       the utf-8 encoded bytes).


>     * What is a control character? I think the Unicode standard says
>       that codepoints in the range U+2400 to U+2421 are control
>       characters, BUT this doesn't include things like U+000D, which is
>       a carriage return.

Now we're on the same page! :)
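
One correction, as far as I can tell: U+2400..U+2421 sits in the "Control
Pictures" block, which holds visible symbols that depict control
characters. The real control characters are the ones with Unicode general
category Cc, i.e. U+0000..U+001F and U+007F..U+009F. A minimal sketch for
checking how a code point is classified, using Python's standard
unicodedata module (Python 3 syntax; the helper name describe is made up
for illustration):

```python
import unicodedata

def describe(ch):
    """Return (code point, general category, name) for a single character."""
    return (
        "U+%04X" % ord(ch),
        unicodedata.category(ch),
        # Control characters have no formal name, so supply a default.
        unicodedata.name(ch, "<unnamed>"),
    )

# Categories: Cc = control, Cf = format, So = other symbol, Lu = uppercase letter
for ch in ("\r", "\t", "\u240d", "\u200f", "\ufeff", "A"):
    print(describe(ch))
```

Here the carriage return and tab report Cc, U+240D reports So (it is only a
picture of a carriage return), and U+200F and U+FEFF report Cf (format
characters).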

>     * What is a white space character? There are lots of white space
>       characters, e.g. 0000, 200C..200F, 202A..202E, 206A..206F, FEFF,
>       but if you check what some of these are, they are defined as
>       things that control behavior (e.g. U+200F), and they don't
>       include CR/LF!


> You may find that what matters more is what the field should contain
> and what it is intended to be used for, rather than what is not allowed.

I want to restrict the user to "reasonable things that a name might
contain".  Clearly, a tab is invalid.  However, I don't know the regex
/ unicode syntax to express that I want "normal characters".
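
One possible approach, sketched below under some assumptions: treat "normal"
as "contains no character whose Unicode general category starts with C"
(Cc control, Cf format, Cs surrogate, Co private use, Cn unassigned), and
check the length separately. The standard re module has no \p{...} property
classes, so the regex variant spells out the Cc ranges by hand. The function
names and the choice to reject all C* categories are illustrative, not
anything FormEncode provides. Note also that r".{1,128}$" used with
re.match accepts a string ending in a newline, because $ matches just
before a trailing "\n".

```python
import re
import unicodedata

MAX_LEN = 128  # limit taken from the original question

# General category Cc is exactly U+0000..U+001F plus U+007F..U+009F.
CONTROL_RE = re.compile(r"[\x00-\x1f\x7f-\x9f]")

def has_control_chars(text):
    """Regex-only check: True if text contains any Cc character."""
    return CONTROL_RE.search(text) is not None

def is_reasonable_name(text, max_len=MAX_LEN):
    """Hypothetical validator: 1..max_len characters, none in a C* category."""
    if not 1 <= len(text) <= max_len:
        return False
    # Stricter than the regex: also rejects format characters such as U+200F.
    return not any(unicodedata.category(ch).startswith("C") for ch in text)

print(is_reasonable_name("Jos\u00e9"))   # True
print(is_reasonable_name("a\tb"))        # False: tab is Cc
print(is_reasonable_name("abc\n"))       # False: trailing newline is Cc
```

Rejecting every Cf character is a judgment call: U+200C and U+200D (zero
width non-joiner and joiner) are legitimately used in some scripts, so a
name field meant for an international audience might want to allow those
two and reject only the rest.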


