Digging into the Parallel Framework

April 22nd | 2010

A new feature I really like in the .NET Framework 4 is the parallel framework. Virtually every home computer has at least two physical cores, but client applications don’t use multiple cores as effectively as they could, probably because parallel programming is difficult.

As I’ve never been required to make heavy use of parallel programming, I haven’t invested time in existing frameworks. Typically, I’ve used a messy combination of a semaphore and threads. This led to some complex design patterns and the occasional bug, and it was never as effective as the parallel framework.

To get started, most of the parallel-related APIs live in the System.Threading.Tasks namespace. The class we will discuss here is fittingly named Parallel.

Example task, and how to solve it.

Take 20 popular websites, download their HTML and save it to disk.

We’ll use Alexa as our source for the websites, and keep it simple by hard-coding the URLs for now.

Using no parallel tasks at all, the code looks like this:

' Requires Imports System.Diagnostics, System.IO and System.Net
Sub Main()
    Dim topUrls() As Uri = {
        New Uri("http://www.google.com")
    } 'Remainder omitted for clarity
    Dim watch As Stopwatch = Stopwatch.StartNew()
    For Each url As Uri In topUrls
        Dim wc As New WebClient()
        Dim data() As Byte = wc.DownloadData(url)
        File.WriteAllBytes(".\" &
            url.GetComponents(UriComponents.SchemeAndServer Xor
            UriComponents.Scheme, UriFormat.SafeUnescaped) &
            ".html", data
        )
    Next
    watch.Stop()
    Console.WriteLine("Elapsed Time: " & watch.Elapsed.ToString())
    Console.ReadKey(True)
End Sub

Network performance varies, so your mileage will too. In my runs there was negligible latency except for sina.com.cn. All in all, the elapsed time averaged 00:01:11.068 over 10 runs. The performance of sina.com.cn was pretty bad, but that’s expected given that it’s based in China, and their page was a whopping 466 KB at the time of writing.

This kind of work is perfect for parallel programming. It’s not CPU intensive, but the sequential version doesn’t use my network as well as it could. The Parallel class can help us, and it isn’t a big leap to get the basics going. There’s a static method on the Parallel class called ForEach, and it will act as our iterator in place of an actual For Each loop. We then pass a lambda specifying what the work is. This is the simplest use of the Parallel class.

Here is the code I wrote using Parallel.

' Requires Imports System.Diagnostics, System.IO, System.Net and System.Threading.Tasks
Sub Main()
    Dim topUrls() As Uri = {
        New Uri("http://www.google.com")
    } 'Remainder omitted for clarity
    Dim watch As Stopwatch = Stopwatch.StartNew()
    Parallel.ForEach(topUrls, Sub(url)
                Dim wc As New WebClient()
                Dim data() As Byte = wc.DownloadData(url)
                File.WriteAllBytes(".\" &
                url.GetComponents(UriComponents.SchemeAndServer Xor
                UriComponents.Scheme, UriFormat.SafeUnescaped) &
                ".html", data
                )
            End Sub
        )
    watch.Stop()
    Console.WriteLine("Elapsed Time: " & watch.Elapsed.ToString())
    Console.ReadKey(True)
End Sub

It doesn’t look much different, does it? Behind the scenes, though, there is a lot going on: a complex bit of multithreading and work distribution that depends on your environment. The number of threads used depends on how many cores you have, among many other factors.
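
If you would rather not leave the thread count entirely to the framework, the ParallelOptions class lets you put an upper limit on it. The following is only a rough sketch: the items array is a placeholder standing in for topUrls, and the cap of 4 is an arbitrary number picked for illustration, not a recommendation.

```vb
Imports System
Imports System.Threading.Tasks

Module DegreeOfParallelismSketch
    Sub Main()
        ' Placeholder work items (assumption); the real loop would iterate topUrls.
        Dim items() As String = {"a", "b", "c", "d"}

        ' Assumption: 4 is an arbitrary cap chosen only for this sketch.
        Dim options As New ParallelOptions() With {.MaxDegreeOfParallelism = 4}

        ' Same ForEach as before, with the options passed as the second argument.
        Parallel.ForEach(items, options,
                         Sub(item As String)
                             Console.WriteLine(item)
                         End Sub)
    End Sub
End Module
```

The order of the printed lines can differ from run to run, since the items are processed concurrently.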

Using Parallel.ForEach shaved an average of 10 seconds off every run.

Parallel.For is very similar in behavior, but rather than iterating a collection it takes a start integer and an end integer. The start is inclusive and the end is exclusive.
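
To make the Parallel.For shape concrete, here is a minimal sketch. The squares array and the squaring work are made up for illustration; the point is only that the body receives an index running from the start value up to, but not including, the end value.

```vb
Imports System
Imports System.Threading.Tasks

Module ParallelForSketch
    Sub Main()
        ' Hypothetical workload: ten slots, indexes 0 through 9.
        Dim squares(9) As Integer

        ' The body runs for i = 0 through 9; the end value (10) is exclusive.
        Parallel.For(0, 10,
                     Sub(i As Integer)
                         ' Each iteration writes only its own slot, so no locking is needed.
                         squares(i) = i * i
                     End Sub)

        For Each value As Integer In squares
            Console.WriteLine(value)
        Next
    End Sub
End Module
```

Because every iteration touches a different array element, the iterations don’t interfere with each other even though they run in no particular order.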

Now let’s say we need to accumulate results. For example, let’s extract the title from the HTML. Again, the single-threaded example looks like this:

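' Requires Imports System.Collections.Generic, System.Diagnostics, System.IO, System.Net and System.Text.RegularExpressions
Sub Main()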
    Dim topUrls() As Uri = {
        New Uri("http://www.google.com")
    } 'Remainder omitted for clarity
    Dim watch As Stopwatch = Stopwatch.StartNew()
    Dim pageTitles = New List(Of String)()
    For Each url As Uri In topUrls
        Dim wc As New WebClient()
        Dim data As String = wc.DownloadString(url)
        Dim title As String = Regex.Match(data,
    "<title[^>]*>(?<title>[\W\w\r\n]*)</title>").Groups("title").Value
        title = Regex.Replace(title, "[/\\:\?\*""<>|.\r\n\W]", "")
        pageTitles.Add(title)
        File.WriteAllText(".\" & title & ".html", data)
    Next
    watch.Stop()
    Console.WriteLine("Elapsed Time: " & watch.Elapsed.ToString())
    For Each title As String In pageTitles
        Console.WriteLine(title)
    Next
    Console.ReadKey(True)
End Sub

My regular expression handiwork isn’t the best, but it gets the job done. To put this in Parallel, we might be tempted to wrap the whole thing with a Parallel.ForEach again, but we can be more efficient than that.

The Parallel framework does not execute one item per thread and then ditch the thread. That would be inefficient. Instead, each thread processes many items, so we can give each thread its own local storage and supply an additional lambda that is invoked with the data that thread accumulated. In addition to our body, we will now have an initializer that returns the initial state of an object that we define. This gets passed into the body, and the body accumulates into that thread’s local storage. Finally, when the thread is finished, we merge its data into a master collection.

It sounds complex, but really, it isn’t. Let’s look at the code:

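' Requires Imports System.Collections.Concurrent, System.Collections.Generic, System.Diagnostics, System.IO, System.Net, System.Text.RegularExpressions and System.Threading.Tasks
Sub Main()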
    Dim topUrls() As Uri = {
        New Uri("http://www.google.com")
    } 'Remainder omitted for clarity
    Dim watch As Stopwatch = Stopwatch.StartNew()
    'Use a thread safe collection!
    Dim pageTitles = New ConcurrentBag(Of String)()
    Parallel.ForEach(topUrls, Function() New List(Of String)(),
      Function(url, loopState, threadData)
          Dim wc As New WebClient()
          Dim data As String = wc.DownloadString(url)
          Dim title As String = Regex.Match(data,
    "<title[^>]*>(?<title>[\W\w\r\n]*)</title>").Groups("title").Value
          title = Regex.Replace(title, "[/\\:\?\*""<>|.\r\n\W]", "")
          threadData.Add(title)
          File.WriteAllText(".\" & title & ".html", data)
          Return threadData
      End Function,
     Sub(threadData)
         For Each title In threadData
             pageTitles.Add(title)
         Next
     End Sub)
    watch.Stop()
    Console.WriteLine("Elapsed Time: " & watch.Elapsed.ToString())
    For Each title As String In pageTitles
        Console.WriteLine(title)
    Next
    Console.ReadKey(True)
End Sub

This isn’t entirely different from the previous version; there’s just a little extra. Looking at the call to ForEach, we are still handing in topUrls as before, but now we have an init function before the body. It runs once for each worker before that worker processes its first item (strictly speaking, once per task the loop creates), and the object it returns belongs to that worker alone. In this case, it’s a plain List. Since the list is never shared, we don’t have to use a ConcurrentBag for it or worry about thread safety.

Next: the body. Previously it was a Sub (void) that took in each URL as the collection iterated. Now it is a Function that also receives the loopState, which provides information about the current loop and lets you stop it early. We aren’t going to use that for now. It also receives the object that we initialized in the beginning. We add our title to this collection and return the collection.

Lastly, we have the finally piece. Throughout the body we were adding the titles to different Lists, which on their own won’t do us any good. The finally is a lambda that says: OK, this thread is done. Here is all the data this thread accumulated. Do with it as you see fit. In this case, we copy it into the shared ConcurrentBag declared in Main. Note that whatever you call in the finally piece must be thread safe, because several threads can finish and run it at the same time.
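
The three pieces (init, body, finally) are easier to see without the networking code in the way. Here is a stripped-down sketch of the same pattern. The words array and the running total are hypothetical stand-ins for the downloaded pages and the title list, and I use Interlocked.Add for the merge instead of a ConcurrentBag simply because the accumulated value is a single integer.

```vb
Imports System
Imports System.Threading
Imports System.Threading.Tasks

Module LocalStateSketch
    Sub Main()
        ' Hypothetical input standing in for the downloaded pages.
        Dim words() As String = {"alpha", "beta", "gamma", "delta"}
        Dim totalLength As Integer = 0

        Parallel.ForEach(Of String, Integer)(
            words,
            Function() 0,
            Function(word, loopState, subtotal)
                ' Body: accumulate into this worker's private subtotal.
                Return subtotal + word.Length
            End Function,
            Sub(subtotal)
                ' Finally: merge the subtotal into shared state in a thread-safe way.
                Interlocked.Add(totalLength, subtotal)
            End Sub)

        Console.WriteLine(totalLength) ' 5 + 4 + 5 + 5 = 19
    End Sub
End Module
```

The init lambda (Function() 0) hands each worker a starting subtotal of zero, the body never touches shared state, and only the finally lambda needs to be thread safe.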

Those are the basics of the Parallel class. There’s still a lot to explore in the Parallel Framework, including Tasks, cancelling, and progress updates.

Kevin Jones is a Team Lead at Thycotic Software, an agile software services and product development company based in Washington DC. Secret Server is our flagship password management software product.