Regular Expressions

March 4th, 2009

Thinking in Regex – A C# Regex Tutorial with Examples

Regular expressions (regex) can be a difficult language to learn. The terse syntax is one factor (regexes are notoriously difficult to read), but another factor is the problem space. Pattern-matching problems require a different mindset than software development in general. However, if you can phrase the question in precise terms, translation to regex becomes easier, even trivial in many cases.

Given this, there are a few things to keep in mind when tackling a problem with regex.

  1. Be specific in your requirements. “I want to prevent bad characters” is not useful. “I want to exclude asterisks” is useful.
  2. Know the text your regex will run on. A regex to match a URL can be very simple, or as complex as this monster. The goal is the simplest regex that will always work with your input text.
  3. Be sure you’re not matching too much. Once your regex is matching what you want it to match, test it against various not-quite-right text samples to avoid embarrassing mistakes.

Let’s say that you have some HTML that looks roughly like this:

<body>
    <form id="form1" runat="server">
    <div>
        He &amp; I are best buddies.
        <a href="http://www.mywebsite.com/page.aspx?param1=1&param2=2&param3=3">
http://www.mywebsite.com/page.aspx?param1=1&param2=2&param3=3</a>
        <a href="http://www.mywebsite.com/page2.aspx?param1=1&param2=2&param3=3">
http://www.mywebsite.com/page.aspx?param1=1&param2=2&param3=3</a>
        <a href="http://www.mywebsite.com/page3.aspx?param1=1&param2=2&param3=3">
http://www.mywebsite.com/page.aspx?param1=1&param2=2&param3=3</a>
        <a href="http://www.mywebsite.com/page4.aspx?param1=1&param2=2&param3=3">
http://www.mywebsite.com/page.aspx?param1=1&param2=2&param3=3</a>
    </div>
    </form>
</body>

There are many pages like this, some with many more links. You need to change the ampersands displayed in the text of each link to &amp;, but you do not want to replace ampersands within the href attribute or in the rest of the body of the page. This sounds a little tricky, so let’s refine our requirements.

We want to replace & with &amp; within the text part of an anchor tag only. We don’t want to replace & when it is followed by amp;

Better. But, what exactly does ‘text part’ or ‘anchor tag’ mean in terms of characters?

Specifically, we want to replace & with &amp; between the <a> and </a> character blocks, unless immediately followed by amp;

Now that we have phrased our task this way, it is much easier to form a solution. Let’s build a .NET regex to solve this problem. .NET supports variable-length lookbehind, which many other regex engines do not, so we can build our regex in neat groups that apply conditions surrounding the text we want to replace. We start with the unconditional replacement:

Regex.Replace(inputText, "&", "&amp;");

This will replace every & with &amp;, including the ones we want to leave alone. Now we want to apply our conditions.
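To see what the unconditional replacement does, here is a minimal sketch. The inputText value is an abbreviated stand-in for the sample page, chosen for illustration rather than taken verbatim from it, and the program assumes top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Abbreviated stand-in for the sample page (an assumption for illustration).
string inputText =
    "He &amp; I are best buddies. " +
    "<a href=\"page.aspx?param1=1&param2=2\">page.aspx?param1=1&param2=2</a>";

// Replace every ampersand, with no conditions applied yet.
string naive = Regex.Replace(inputText, "&", "&amp;");

Console.WriteLine(naive);
// He &amp;amp; I are best buddies. <a href="page.aspx?param1=1&amp;param2=2">page.aspx?param1=1&amp;param2=2</a>
```

The already-escaped &amp; becomes &amp;amp; and the href attribute is rewritten as well, which is exactly what the three conditions are meant to prevent.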

1) Must not be followed by amp;

2) Must come after (be preceded by) <a>

3) Must be followed by </a>
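As a rough sketch of how the conditions narrow the match one at a time, here is condition 1 applied on its own. The inputText value is again an abbreviated, hypothetical stand-in for the sample page, and the program assumes top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Abbreviated stand-in for the sample page (an assumption for illustration).
string inputText =
    "He &amp; I are best buddies. " +
    "<a href=\"page.aspx?param1=1&param2=2\">page.aspx?param1=1&param2=2</a>";

// Condition 1 only: an ampersand that is not already the start of "&amp;".
string step1 = Regex.Replace(inputText, "&(?!amp;)", "&amp;");

Console.WriteLine(step1);
// He &amp; I are best buddies. <a href="page.aspx?param1=1&amp;param2=2">page.aspx?param1=1&amp;param2=2</a>
```

The existing &amp; is now left alone, but the ampersand inside the href attribute is still replaced. Conditions 2 and 3 are what keep the replacement inside the link text.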

So the resulting regex looks like this:

Regex.Replace(inputText, @"(?<=\<a[^<>]*>[^<>]*)&(?!amp;)(?=[^<>]*</a>)", "&amp;");

Condition                        Regex
Must not be followed by amp;     (?!amp;)
Must be preceded by <a> tag      (?<=\<a[^<>]*>[^<>]*)
Must be followed by </a>         (?=[^<>]*</a>)
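Putting the three conditions together, a minimal runnable sketch might look like the following. The pattern is the one from the article; the inputText value is an abbreviated, hypothetical stand-in for the sample page, and the program assumes top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Abbreviated stand-in for the sample page (an assumption for illustration).
string inputText =
    "He &amp; I are best buddies. " +
    "<a href=\"page.aspx?param1=1&param2=2\">page.aspx?param1=1&param2=2</a>";

// Lookbehind, negative lookahead, and lookahead, as listed in the table.
string pattern = @"(?<=\<a[^<>]*>[^<>]*)&(?!amp;)(?=[^<>]*</a>)";

string result = Regex.Replace(inputText, pattern, "&amp;");

Console.WriteLine(result);
// He &amp; I are best buddies. <a href="page.aspx?param1=1&param2=2">page.aspx?param1=1&amp;param2=2</a>
```

Only the ampersand in the link text changes; the href attribute and the existing &amp; in the body are untouched.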

(?!, (?=, (?<=, and (?<! are lookarounds: they test the text after or before the current position without making it part of the match. &(?!amp;) can be read as “Match &, then check the next four characters. If they are amp; fail the match.” The other two conditions follow a similar pattern, requiring a preceding <a ...> tag and a following </a> tag before any < or > is encountered. Because the text between a tag and the & is not allowed to contain < or >, the regex cannot span multiple anchor tags.
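One way to check that the pattern is not matching too much is to run it against a few near-miss samples. The sketch below does that with the same pattern written out using RegexOptions.IgnorePatternWhitespace so each lookaround can carry a comment. The sample strings are hypothetical, made up for illustration, and the program assumes top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// The article's pattern, spread over several lines with comments.
var regex = new Regex(@"
    (?<= \<a[^<>]*> [^<>]* )  # preceded by an opening a tag, with no angle brackets after it
    &                         # the ampersand itself
    (?! amp; )                # not already the start of an escaped ampersand
    (?= [^<>]* </a> )         # followed by the closing a tag, with no angle brackets before it
    ", RegexOptions.IgnorePatternWhitespace);

// Hypothetical samples and whether a match is expected.
var samples = new (string Text, bool Expected)[]
{
    ("<a href=\"p.aspx\">a=1&b=2</a>", true),                          // bare & in link text
    ("<a href=\"p.aspx?a=1&b=2\">link</a>", false),                    // & only inside href
    ("<a href=\"p.aspx\">a=1&amp;b=2</a>", false),                     // already escaped
    ("<p>Q&A</p>", false),                                             // not inside an anchor
    ("<a href=\"1.aspx\">one</a> & <a href=\"2.aspx\">two</a>", false) // between two anchors
};

foreach (var (text, expected) in samples)
{
    bool actual = regex.IsMatch(text);
    string status = actual == expected ? "ok" : "FAIL";
    Console.WriteLine($"{status}: {text}");
}

// Lookarounds are zero-width: only the ampersand is part of the match.
Match m = regex.Match(samples[0].Text);
Console.WriteLine($"matched '{m.Value}' (length {m.Length})");
```

The last sample is the case the [^<>]* restriction guards against: the ampersand sits between two anchors, and neither lookaround is allowed to cross a tag boundary to reach one.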

Two invaluable aids to learning and using regex in .NET are Expresso (a regex tool written in .NET) and the helpful forum community at RegexAdvice. If regex interests you, I highly recommend playing with the tool and browsing the forum. And since you’re still reading, this probably means you!

David Cooksey is a Senior .NET Developer at Thycotic Software, an agile software services and product development company based in Washington DC. Secret Server is our flagship password management software product.

Categories: C#
  1. March 5, 2010 at 6:52 am

    Excellent read. 😉

    Follow this link on similar spirit.

    http://martinfowler.com/bliki/ComposedRegex.html

  2. March 8, 2010 at 9:47 am

    Interesting. Regex are something of a hobby for me, so usually I prefer to look at the regex directly unless code is required to form it. This is because I read it as “beginning of string followed by….”, which is difficult to do when reading variable names designed to hide exactly what the regex does. Of course, if the intention is to hide the regex details while expressing the general intention via variable or method names, splitting a regex out into composable blocks works perfectly.
