Friday, June 15, 2012

How to Parallelize your Algorithm with an ExecutorService. Part 1, the "Happy Path"

Let's say you have a long running algorithm to calculate stuff, for example, the median and 99th 
percentile (and other percentiles) of IRS income data.  This is slow because calculating medians and percentiles typically requires sorting data, and there's a lot of IRS data.  Your method looks like:


  public static PercentileResults calcPercentileResults (IRSData irsData, int year, Options options) {
     BigObjectLotsOfData = irsData.getData(year, options);
     // big calculation here
     return percentileResults;
  }
You always need a year.  Options represents optional complex things such as "only Married filing
jointly", "only ages 55-65", etc.  PercentileResults is a Business Object with the results.  Currently, the code calls this in serial and puts results into a Map:

  for (int year = 2000; year <= 2011; year++) {
     PercentileResults result = calcMedianAnd99(irsData, year, options);
     resultsMap.put(year, result);
  }
Obviously calculations for different years (and Options) can be done totally independently in parallel. An ExecutorService helps manage this for you. Most of the ExecutorService methods want a Callable, so the first step is to convert your class into a callable. Here's a first pass:

 public class CallableCalculation implements Callable<PercentileResults> {

   public final IRSData irsData;
   public final int year;
   public final Options options;
   
   public CallableCalculation(IRSData irsData, int year, Options options) {
      this.irsData = irsData;
      this.year = year;
      this.options = options;
   }

   @Override
   public PercentileResults call() throws Exception {
      
      BigObjectLotsOfData bolod = irsData.getData(year, options);
      PercentileResults result = new PercentileResults();
      // big calculation here that sets stuff in PercentileResults
      bolod = null; // important to free this memory
      // placeholder for Option 1 and 2 (see below)
      return result;
   }
   
 }

For simplicity, I made all the values public final so that they could be accessed later. If you don't like this style, make them private and add accessors as desired. More on this later... The simple way to call this in parallel would be:


 public Map<Integer, PercentileResults> calculateInParallel(Options options) throws Exception {
   IRSData irsData = IRSData.getInstance();
   HashMap<Integer, PercentileResults> resultMap = new HashMap<Integer, PercentileResults>();
      
   ArrayList<CallableCalculation> tasks = new ArrayList<CallableCalculation>();
      
   for (int year = 2000; year <= 2011; year++) {
      CallableCalculation cc = new CallableCalculation(irsData, year, options);
      tasks.add(cc);
   }
   int processors = Runtime.getRuntime().availableProcessors();
    //might want to adjust that number some...
   ExecutorService myService = Executors.newFixedThreadPool(processors);
      
   // oops - there's a problem coming up
   List<Future<PercentileResults>> results = myService.invokeAll(tasks);
   for (Future<PercentileResults> future : results) {
      PercentileResults pr = future.get();
      
      resultMap.put(year, pr);  // oops, what's the year for that Future???
   }      
      
   return resultMap;      
 }


Now, there's one "gotcha" so far. The input parameter year has gotten separated from the results.
There's a few options.

1) If the final location is really really clear and obvious, the algorithm itself could put the results
there. In this example, before returning the result (see comment "placeholder"), just put the results
into the Map. This is a simple solution but not very robust to changes in requirements.

2) Add the year as a new field to PercentileResults, and set it (at placeholder spot). This is robust.
But tedious and violates DRY. What if there are lots of settings you want to remember? Like all the
Options? And maybe you don't want to clutter your XXXResults with the input settings.  Or you can't - it's taken from some third party library.

3) The option I like best it to return the CallableCalculation! It already holds all the settings. You just need to add a field for the results, and relevant accessors. So you aren't doing much extra work. Your class would look like this (changes noted by "NEW")


public class CallableCalculation implements Callable<CallableCalculation> {

   public final IRSData irsData;
   public final int year;
   public final Options options;
   
   PercentileResults results;  // NEW
   
   public CallableCalculation(IRSData irsData, int year, Options options) {
      this.irsData = irsData;
      this.year = year;
      this.options = options;
   }

   public PercentileResults getPercentileResults () { return results; }  // NEW

   @Override
   public CallableCalculation call() throws Exception {
      
      BigObjectLotsOfData bolod = irsData.getData(year, options);
      results = new PercentileResults();
      // big calculation here that sets stuff in PercentileResults
      bolod = null; // important to free this memory
      return this; // NEW
   }
   
}
and the calling method, in place of the "oops - there's a problem coming up" section has:


List<Future<CallableCalculation>> results = myService.invokeAll(tasks);
   for (Future<CallableCalculation> future : results) {
      CallableCalculation cc = future.get();       
      resultMap.put(cc.year, cc.getPercentileResults());
   } 
In a future post we will consider the unhappy path with errors and Exceptions.

1 comment:

  1. Hi MORGAN CONRAD,

    I have a question with respect to multithreading. I have a method lets say "readfile(String filename)". I want to read different files with same method in parallel and write them to different files.

    example readfile("abc.txt") and readfile("def.txt") these both method should run in different threads and write to different files.

    ReplyDelete