Template language for random string generation
Paul Wolf
paulwolf333 at gmail.com
Mon Aug 11 01:06:39 EDT 2014
On Sunday, 10 August 2014 17:31:01 UTC+1, Steven D'Aprano wrote:
> Devin Jeanpierre wrote:
>
> > On Fri, Aug 8, 2014 at 2:01 AM, Paul Wolf <paulwolf333 at gmail.com> wrote:
> >> This is a proposal with a working implementation for a random string
> >> generation template syntax for Python. `strgen` is a module for
> >> generating random strings in Python using a regex-like template language.
> >> Example:
> >>
> >> >>> from strgen import StringGenerator as SG
> >> >>> SG("[\l\d]{8:15}&[\d]&[\p]").render()
> >> u'F0vghTjKalf4^mGLk'
> >
> > Why aren't you using regular expressions? I am all for conciseness,
> > but using an existing format is so helpful...
>
> You've just answered your own question:
>
> > Unfortunately, the equivalent regexp probably looks like
> > r'(?=.*[0-9])(?=.*[A-Z])(?=.*[a-z])[a-zA-Z0-9]{8:15}'
>
> Apart from being needlessly verbose, regex syntax is not appropriate because
> it specifies too much, specifies too little, and specifies the wrong
> things. It specifies too much: regexes like ^ and $ are meaningless in this
> case. It specifies too little: there's no regex for the "shuffle operator".
> And it specifies the wrong things: regexes like (?= ...) as used in your
> example are for matching, not generating strings, and it isn't clear
> what "match any character but don't consume any of the string" means when
> generating strings.
>
> Personally, I think even the OP's specified language is too complex. For
> example, it supports literal text, but given the use-case (password
> generators) do we really want to support templates like "password[\d]"? I
> don't think so, and if somebody did, they can trivially say "password" +
> SG('[\d]').render().
>
> Larry Wall (the creator of Perl) has stated that one of the mistakes with
> Perl's regular expression mini-language is that the Huffman coding is
> wrong. Common things should be short, uncommon things can afford to be
> longer. Since the most common thing for password generation is to specify
> character classes, they should be short, e.g. d rather than [\d] (one
> character versus four).
>
> The template given could potentially be simplified to:
>
> "(LD){8:15}&D&P"
>
> where the round brackets () are purely used for grouping. Character codes
> are specified by a single letter. (I use uppercase to avoid the problem
> that l & 1 look very similar. YMMV.) The model here is custom format codes
> from spreadsheets, which should be comfortable to anyone who is familiar
> with Excel or OpenOffice. If you insist on having the facility to including
> literal text in your templates, might I suggest:
>
> "'password'd" # Literal string "password", followed by a single digit.
>
> but personally I believe that for the use-case given, that's a mistake.
>
> Alternatively, date/time templates use two-character codes like %Y %m etc,
> which is better than
>
> > (I've been working on this kind of thing with regexps, but it's still
> > incomplete.)
> >
> >> * Uses SystemRandom class (if available, or falls back to Random)
> >
> > This sounds cryptographically weak. Isn't the normal thing to do to
> > use a cryptographic hash function to generate a pseudorandom sequence?
>
> I don't think that using a good, but not cryptographically-strong, random
> number generator to generate passwords is a serious vulnerability. What's
> your threat model? Attacks on passwords tend to be one of a very few:
>
> - dictionary attacks (including tables of common passwords and
>   simple transformations of words, e.g. 'pas5w0d');
>
> - brute force against short and weak passwords;
>
> - attacking the hash function used to store passwords (not the password
>   itself), e.g. rainbow tables;
>
> - keyloggers or some other way of stealing the password (including
>   phishing sites and the ever-popular "beat them with a lead pipe
>   until they give up the password");
>
> - other social attacks, e.g. guessing that the person's password is their
>   date of birth in reverse.
>
> But unless the random number generator is *ridiculously* weak ("9, 9, 9, 9,
> 9, 9, ...") I can't see any way to realistically attack the password
> generator based on the weakness of the random number generator. Perhaps I'm
> missing something?
>
> > Someone should write a cryptographically secure pseudorandom number
> > generator library for Python. :(
>
> Here, let me google that for you :-)
>
> https://duckduckgo.com/html/?q=python+crypto
>
> --
> Steven
I should clarify that password generation is only one of several use cases that strgen is intended to support. It is also meant for:
Test data generation:
[\l]{1:20}&[._]{0:1}@[\l]{15}.(com|net|org)
email addresses that use word characters and might have a period or an underscore in the first part. Or
((john|robert|harry)|(mary|agnes|shelly)) (smith|jones|taylor)
produces names with a roughly equal distribution of female and male first names. I contemplated, but did not implement, a feature where you can give strgen named functions that generate the required string (using whatever selection process each function chooses):
($malefirstname|$femalefirstname) $lastname
where
def malefirstname():
    # get a name from the database at random
    ...
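To make the idea concrete, here is a rough standard-library sketch of how such named functions could be resolved. It is not part of strgen: the `render` helper, the `sources` mapping, and the hard-coded name lists (standing in for database lookups) are all hypothetical.

```python
import random
import re

# Stand-ins for database lookups; the names and lists are illustrative only.
def malefirstname():
    return random.choice(["john", "robert", "harry"])

def femalefirstname():
    return random.choice(["mary", "agnes", "shelly"])

def lastname():
    return random.choice(["smith", "jones", "taylor"])

def render(template, sources):
    """Hypothetical renderer: resolve (a|b) groups, then call $name functions."""
    # Pick one alternative from each non-nested parenthesised group.
    chosen = re.sub(r"\(([^()]*)\)",
                    lambda m: random.choice(m.group(1).split("|")),
                    template)
    # Replace each remaining $name with the return value of its function.
    return re.sub(r"\$(\w+)", lambda m: sources[m.group(1)](), chosen)

sources = {f.__name__: f for f in (malefirstname, femalefirstname, lastname)}
print(render("($malefirstname|$femalefirstname) $lastname", sources))
```

Because the alternation is resolved first, only the function for the chosen alternative is called, so an expensive lookup is not performed for the branch that was discarded.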
Voucher generation:
[\d]{10}
10-digit voucher numbers.
Note that security is not a concern in any of the foregoing.
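For readers without strgen installed, the email and voucher templates above can be approximated with the standard library alone. This is only a sketch of what the templates are meant to produce, not strgen's implementation. It assumes that `\l` means ASCII letters, that `{m:n}` is an inclusive count range, and that `&` shuffles together the characters produced on either side of it. The function names `fake_email` and `voucher` are made up for illustration.

```python
import random
import string

def fake_email(rng=random):
    # [\l]{1:20} : between 1 and 20 letters (assumed inclusive range)
    local = [rng.choice(string.ascii_letters) for _ in range(rng.randint(1, 20))]
    # [._]{0:1} : zero or one character from '.' and '_'
    local += [rng.choice("._") for _ in range(rng.randint(0, 1))]
    # & : shuffle the combined characters (assumed meaning of the operator)
    rng.shuffle(local)
    # [\l]{15} : exactly 15 letters for the domain part
    domain = "".join(rng.choice(string.ascii_letters) for _ in range(15))
    # (com|net|org) : one of the listed alternatives
    tld = rng.choice(["com", "net", "org"])
    return "".join(local) + "@" + domain + "." + tld

def voucher(length=10, rng=random):
    # [\d]{10} : a fixed number of decimal digits
    return "".join(rng.choice(string.digits) for _ in range(length))

print(fake_email())
print(voucher())
```

The plain `random` module is used here on purpose, since these are test-data use cases rather than secrets.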
> Since the most common thing for password generation is to specify
> character classes, they should be short, e.g. d rather than [\d] (one
> character versus four).
But that assumes only standard character classes and not custom ones like "[aeiuy]", not to mention Unicode ranges beyond those used for English.
> If you insist on having the facility to including
> literal text in your templates,
I do :-), as per above.
> might I suggest:
> "'password'd" # Literal string "password", followed by a single digit.
As per above, I think the more verbose notation for character classes is necessary, although your suggestion is not a bad one. I could have taken a route where you define aliases for the character classes and then construct a very lean template. That is effectively what the (unimplemented) function expressions in the example above do.
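As a rough illustration of that alias route, a lean template could be expanded into the verbose bracket notation before rendering. Everything here is hypothetical and not part of strgen: the `expand` helper, the `ALIASES` table, and the reading of round brackets as "merge these classes into one".

```python
# Hypothetical alias table: single letters mapped to character-class bodies.
ALIASES = {"L": r"\l", "D": r"\d", "P": r"\p"}

def expand(lean, aliases=ALIASES):
    """Expand a lean template such as "(LD){8:15}&D&P" into bracket notation."""
    out = []
    i = 0
    while i < len(lean):
        ch = lean[i]
        if ch == "(":
            # A group of aliases becomes one combined character class.
            end = lean.index(")", i)
            out.append("[" + "".join(aliases[c] for c in lean[i + 1:end]) + "]")
            i = end + 1
        elif ch == "{":
            # Quantifiers are copied through unchanged.
            end = lean.index("}", i)
            out.append(lean[i:end + 1])
            i = end + 1
        elif ch in aliases:
            # A lone alias becomes its own character class.
            out.append("[" + aliases[ch] + "]")
            i += 1
        else:
            # Operators and anything else pass through as-is.
            out.append(ch)
            i += 1
    return "".join(out)

print(expand("(LD){8:15}&D&P"))
# A custom class is just another alias:
print(expand("V{3}", {"V": "aeiuy"}))
```

With user-defined aliases, custom classes like the vowel set stay available while the template itself remains short.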
I chose not to have the strgen module prevent templates that produce weak passwords (such as '[abc]{3}'). The module should be (mostly) agnostic about what constitutes good security, and it needs to support the broader set of use cases described above.