[Baypiggies] quick question: regex to stop naughty control characters
Shannon -jj Behrens
jjinux at gmail.com
Thu Apr 26 20:49:19 CEST 2007
On 4/25/07, Chris Clark <Chris.Clark at ingres.com> wrote:
> Shannon -jj Behrens wrote:
> > I'm doing some form validation. I accept UTF-8 strings and decode
> > them to unicode objects. I would like to check that the strings are
> > no longer than 128 characters, and that they are "reasonable". I'm
> > using FormEncode with a regex that looks like r".{1,128}$". By
> > "reasonable", I think the only things I want to prevent are control
> > characters. Now, I'm sure some Unicode whiz out there knows how to do
> > this with some funky Unicode regex magic, but I don't know how.
> > Anyone know the right way to do this? Should I be worried about more
> > than just control characters? I'm already taking care of HTML
> > escaping, SQL injection, etc.
> >
>
> Mailing you privately in case I completely misunderstood and made this
> more complex than it needs to be :-)
Nope, I think you understand the problem completely, so I'm going to
CC the list. Your comments below are very helpful!
> I don't have an answer but a couple of things to research/consider:
>
> * I'm assuming you mean validating unicode strings (not validating
> the utf-8 encoded bytes).
Yep.
> * What is a control character? I think the Unicode standard says
> that codepoints in the range U+2400 to U+2421 are control
> characters, BUT this doesn't include things like U+000D, which is a
> carriage return.
Now we're on the same page! :)
> * What is a white space character? There are lots of white space
> characters, e.g. 0000, 200C..200F, 202A..202E, 206A..206F, FEFF, but
> if you check what some of these are, they are defined as things
> that control behavior (e.g. U+200F), and they don't include CR/LF!
:)
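For reference, Python's standard unicodedata module can report the
Unicode general category of each code point mentioned above. This is
only a small sketch in Python 3 syntax; the helper name `describe` is
made up for illustration:

```python
import unicodedata

def describe(ch):
    """Return (code point label, general category, name) for one character."""
    label = "U+%04X" % ord(ch)
    # Control characters have no formal name, so supply a default.
    name = unicodedata.name(ch, "<unnamed>")
    return (label, unicodedata.category(ch), name)

# CR, TAB, a "control picture", RIGHT-TO-LEFT MARK, and the BOM.
for ch in ["\r", "\t", "\u2400", "\u200f", "\ufeff"]:
    print(describe(ch))
```

CR and TAB come back as category Cc (the actual control characters,
U+0000..U+001F and U+007F..U+009F), U+200F and U+FEFF as Cf (format
characters), and U+2400 as So, since the U+2400 block holds visible
symbols that picture the controls rather than the controls themselves.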
> You may find that what is more important is what the field should
> contain and what it is intended to be used for, rather than what is
> not allowed.
I want to restrict the user to "reasonable things that a name might
contain". Clearly, a tab is invalid. However, I don't know the regex
/ unicode syntax to express that I want "normal characters".
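One possible way to sidestep the regex entirely is to check each
character's Unicode general category. The following is a minimal
sketch in Python 3 syntax, not a FormEncode validator; the function
name `is_reasonable_name` and the set of rejected categories are
assumptions chosen for illustration:

```python
import unicodedata

# Assumption: these categories are never "reasonable" in a name.
# Cc = control, Cf = format, Cs = surrogate, Co = private use,
# Cn = unassigned, Zl/Zp = line/paragraph separators.
REJECTED_CATEGORIES = {"Cc", "Cf", "Cs", "Co", "Cn", "Zl", "Zp"}

def is_reasonable_name(text, max_len=128):
    """Return True if text is 1..max_len characters long and contains
    no character from a rejected general category."""
    if not 1 <= len(text) <= max_len:
        return False
    return not any(
        unicodedata.category(ch) in REJECTED_CATEGORIES for ch in text
    )

print(is_reasonable_name("Shannon -jj Behrens"))
print(is_reasonable_name("tab\tinside"))
```

The first call should pass and the second should fail because of the
tab. Rejecting Cf may be too strict for scripts that rely on
zero-width joiners, so that set probably needs tuning for real names.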
Thanks,
-jj
--
http://jjinux.blogspot.com/