Home > 1 > These aren’t your father’s regular expressions

These aren’t your father’s regular expressions

September 17th 2009 | Morgan Kleene

These aren’t your father’s regular expressions

Regular expressions, as they’re traditionally defined, make it impossible to match balanced patterns like “ab”,” aabb”,”aaabbb”, “aaaabbbb”, etc. The Microsoft.NET implementation has augmented traditional regular expressions with new features that allow us to match arbitrary balanced constructs.

I’ll demonstrate a few patterns and build up to a Regular expression that matches a small xml-like language.

There are a few Regexp language ingredients that allow us to match balanced patterns. The first is the named ‘capture group’, and its interesting implementation in .NET. Given the Regex “^a(?abc)(?123)b$” consider what happens when we execute it on the string “aabc123b”: Both “abc” and “123” are captured by the group “g”. The following code demonstrates this by printing out “abc” and then “123”:

Regex r = new Regex(@"^a(?<g>abc)(?<g>123)b$");
      foreach (var s in r.Match("aabc123b").Groups["g"].Captures)
      {
          Console.WriteLine(s);
      }

NOTE: the ‘^’ at the beginning of the Regex means to only match at the beginning of line and the ‘$’ means to match only at the end of the line.

Imagine a stack associated with each group, in our case g, onto which the captured elements are pushed. Most stacks can be popped as well as pushed, right? These stacks can as well. By labeling the second group –g instead of g, matching the text in the parentheses pops off the top element of the g stack. So the following code prints absolutely nothing. Matching the “123” pops “abc” off of the g stack.

Regex r = new Regex(@"^a(?<g>abc)(?<-g>123)b$");
      foreach (var s in r.Match("aabc123b").Groups["g"].Captures)
      {
          Console.WriteLine(s);
      }

An important aspect of the -g construct is that if the g stack is empty then the elements enclosed cannot be matched. The expression “^(<g>a)(<-g>a)(<-g>a)$” will never be matched.

Now I’ll demonstrate how to match two different kinds of balanced expressions. The first will be strings of the form “()” “(())” “((()))” “(((())))”. When we match this kind of of regular expression there are essentially two phases: pushing elements onto the stack as we match the opening parentheses, and popping them off the stack as we match the closing parentheses. The Regexp that matches this expression is:

@"^(?<p>\()*(?<-p>\))*(?(p)(?!))$"

The only new part is the (?(p)(?!)) construct at the end of the line. This construct means that if the p group is nonempty then we should match the pattern “?!”. The pattern “?!” is a negative lookahead, in this case with no argument. This means that the entire pattern matches only in the case that it is not followed by the space between words, which is impossible. What this pattern does is force failure when the p group is not empty to prevent matching things like “((()”, where all the opening parentheses are not consumed by the closing parentheses.

The previous regex does not deal with things like “()((()))(())” because we have restricted all opening parentheses to precede all closing parentheses. It is a simple matter to change this, and allow the parentheses to occur in any order. The following regex will match all strings of balanced parentheses:

@“^(?:
(?<p>\()
|
(?<-p>\))
    )*
(?(p)(?!))$”

NOTE: this is compiled using the RegexOptions.IgnorePatternWhitespace option, to allow spacing for clarity.

We can think of this expression as “Match an opening parenthesis and push it on the p stack. Match a closing parenthesis and pop the p stack only if the p stack contains an opening parenthesis. If we reach the end of the expression assert that the p stack is empty.”

Think about what happens in these examples: Consider the string “(((”. This fails because the group “p” contains “(((“ when we reach the end of the string. Consider the string “)(“. This fails because the only time we match a closing parenthesis is when there is an opening parenthesis on the “p” stack; the first character is never matched.

We must make sure that we only match things that we’ve seen before. In .NET that’s a breeze. The “\k<g>” construct gives us access to the top of the stack associated with the g group.

Regex r = new Regex(@"^(?<g>abc)(?<-g>\k<g>)$");
      foreach (var s in r.Match("abcabc").Groups["g"].Captures)
      {
          Console.WriteLine(s);
      }

In the above code, nothing is printed. We match “abc”, push “abc” on the top of the g stack and then pop everything off the g stack when we reach the -g group and the “\k<g>” construct matches what’s on top of the g stack.

Finally we’re ready to see the expression that matches an arbitrary html-like language. The regular expression matches a language with balanced opening and closing tags. There are no attributes or any inner content, but extending such an expression to deal with things like that shouldn’t be too much of a problem.

@"^(?:
                            (?:<(?<tagName>[^>]+)>)
                                    |
                              (?<-tagName></\k<tagName>>)
                            )*
                           (?(tagName)(?!))$"

We first match the opening tag and capture the name of the tag to the

“tagName” group. When we encounter something that looks like a closing tag we only match (and pop the opening tag off of the “tagName” stack) if the same opening tag is already on the “tagName” stack. When we get to the end of the string we make sure that all opening tags have been matched by asserting that the “tagName” stack is empty.

.NET Regular expressions are a bit more powerful than the Regexes most of us are used to due to the addition of a stack that can push and pop arbitrary elements. If you have a simple, balanced pattern to match they can be extremely helpful.

References

[1] Friedl, J. 2006 Mastering Regular Expressions. O’Reilly Media, Inc.

Morgan Kleene is a Senior .NET consultant at Thycotic Software, an agile software services and product development company based in Washington DC. Secret Server is our flagship password management software product.

  1. Stephen Kleene
    February 3, 2010 at 10:50 pm

    NERD!!!!!!!

  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: