Text Analysis Benchmarks

Benchmark comparisons of programming languages are notoriously thorny because so many variables are in play (system configuration, compiler, algorithm, etc.); in fact, some people make a game of skewing the tests and results in favour of their favourite language. In any case, benchmarks tend to compare languages on mathematical or scientific problems, such as fractals.

Although one might assume that the same relative speed differences hold for text analysis, it's worth testing. And, of course, speed isn't everything: there's also the time it takes to write code in each language, as well as how maintainable and scalable that code is.

These experiments are deliberately simple and modular. For each language, there are multiple ways of doing things, and it would be good to document preferred approaches here.

Tokenization and Counting

Task:

Retrieve a relatively large text (the King James Bible as plain text), tokenize it (keeping all punctuation and spaces), and count the occurrences of each type (each distinct token).
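
To make the task concrete, here is a minimal sketch of the same tokenize-and-count step on a single sentence. The sample sentence and variable names are only illustrative; the pattern uses the same alternatives as the benchmark code, written without a capture group.

```ruby
# Hypothetical sample sentence; the benchmarks use the full King James Bible.
sample = "In the beginning God created the heaven and the earth."

# Same alternatives as the benchmark pattern: a word with an inner
# apostrophe or hyphen, a plain word, a run of whitespace, or any
# other single character.
token_pattern = /\w+['-]\w+|\w+|\s+|./

tokens = sample.scan(token_pattern)

# Count how often each type (distinct token) occurs.
counts = Hash.new(0)
tokens.each { |token| counts[token] += 1 }

puts "#{tokens.size} tokens, #{counts.size} types"   # 20 tokens, 10 types
puts "the: #{counts['the']}, space: #{counts[' ']}"  # the: 3, space: 9
```

Spaces and the full stop count as tokens here, which is why the regex-based tests report more than just word types.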

Summary:

language               time (seconds)
Ruby                   13
PHP (regex)            9
PHP (built-in words)   1
Java                   1

These times were measured on a MacBook Pro (2.16 GHz Intel Core Duo, 2 GB RAM), using Ruby 1.8.6, PHP 5.2.4 and Java 1.5.
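
All of the timings on this page are in whole seconds, so two runs that both report "1" cannot be told apart. As a sketch only, a finer-grained measurement could look like the following. It assumes a current Ruby (Process.clock_gettime is not available in the Ruby 1.8.6 used for the results here), and the helper name and stand-in text are hypothetical.

```ruby
# Hypothetical helper: run a block and return [result, elapsed seconds as a Float].
def timed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - start]
end

# Stand-in text; the benchmarks use the King James Bible instead.
text = "the quick brown fox jumps over the lazy dog " * 10_000

counts, elapsed = timed do
  tally = Hash.new(0)
  text.scan(/\w+['-]\w+|\w+|\s+|./) { |token| tally[token] += 1 }
  tally
end

puts format("tokenizing and counting: %.3f seconds (%d types)", elapsed, counts.size)
```

A monotonic clock is used rather than wall-clock time so that the measurement is not affected if the system clock is adjusted during the run.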

Ruby

code


  # original code by Stéfan Sinclair, November 2007

  require 'net/http'
  
  # document to use – King James Bible from Gutenberg
  url = ['localhost', '/~sgs/Temp/kjv10.txt']
  
  # tokenize to find words, spaces, or other single characters
  # (same alternatives as the PHP and Java patterns; no capture group,
  # so that scan yields each token as a String)
  regex = /\w+['-]\w+|\w+|\s+|./
  
  # retrieve text
  last = Time.new.to_i
  contents = Net::HTTP.get url[0], url[1]
  diff = (Time.new.to_i-last)
  puts "retrieval: " + diff.to_s + " (" + contents.length.to_s + " characters)\n"
  
  # tokenize using a scan
  last = Time.new.to_i
  words = {}
  contents.scan(regex) do |token|
    if !words.has_key?(token)
      words[token] = 0
    end
    words[token] += 1
  end
  diff = (Time.new.to_i-last)
  puts "tokenizing and counting: " + diff.to_s + " (" + words.size.to_s + " types)\n"

results

  retrieval: 1 (4445260 characters)
  tokenizing and counting: 12 (14212 types)

specs

MacBook Pro, 2.16 GHz Intel Core Duo, 2 GB RAM
$ ruby -v
ruby 1.8.6 (2007-06-07 patchlevel 36) [universal-darwin9.0]


PHP

Note that the PHP code has two versions: one that uses regular expressions to count all words, punctuation and spaces (like the other tests), and one that counts only words using built-in functions. The difference in performance is huge, most likely because the built-in functions run entirely in the interpreter's C code, whereas the regular-expression version crosses back and forth between PHP code and the interpreter on every match (and, of course, uses regular expressions).
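
The two versions also count different things, so their type counts are not directly comparable. The following sketch illustrates the difference on a short sample. The words-only pattern is a rough, ASCII-only stand-in for what PHP's str_word_count treats as a word; that approximation is an assumption, not its exact definition.

```ruby
# Hypothetical sample text.
sample = "Thou shalt not kill. Thou shalt not steal."

# Full tokenization, as in the regex-based tests: words, whitespace runs,
# and any other single character.
all_types = sample.scan(/\w+['-]\w+|\w+|\s+|./).uniq

# Words only: letters, optionally followed by letters, apostrophes or
# hyphens (an approximation of str_word_count, labelled as such).
word_types = sample.scan(/[A-Za-z][A-Za-z'-]*/).uniq

puts "all tokens: #{all_types.size} types"                # 7
puts "words only: #{word_types.size} types"               # 5
puts "extra types: #{(all_types - word_types).inspect}"   # [" ", "."]
```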

code


   // original code by Stéfan Sinclair, November 2007

   // document to use -- King James Bible from Gutenberg
   $url = "http://localhost/~sgs/Temp/kjv10.txt";
   
   // tokenize to find words, spaces, or other single characters
   $regex = "(\w+['-]\w+|\w+|\s+|.)";

   // retrieve text
   $last = time();
   $contents = file_get_contents($url);
   $diff = time()-$last;
   echo "retrieval: $diff seconds (",  strlen($contents),  " characters)\n";

   // tokenize using matching offsets
   $last = time();
   $offset = 0;
   $words = array(); // token => count
   /* doing a looping match on the full string is relatively inefficient
    * since the full string is used each time, but I wasn't able to use
    * preg_match_all because of memory constraints; besides this is closer
    * to how the other tests (Java, Ruby) are running
    */
   while(preg_match("/$regex/", $contents, $matches, PREG_OFFSET_CAPTURE, $offset)) {
      $token = $matches[0][0];
      // initialize on first sight to avoid an undefined-index notice
      if (!isset($words[$token])) {
         $words[$token] = 0;
      }
      $words[$token]++;
      // continue from the end of the current match
      $offset = $matches[0][1] + strlen($token);
   }
   $diff = time()-$last;
   echo "tokenizing and counting with regex: $diff seconds (", count($words), " types)\n";

   // tokenize using built-in word and counting functions
   $last = time();
   // note that the built-in functions don't allow us to keep punctuation and spaces like the others
   $words = array_count_values(str_word_count($contents, 1));
   $diff = time()-$last;
   echo "tokenizing and counting with built-in methods: $diff seconds (", count($words), " types)\n";

results

retrieval: 0 seconds (4445260 characters)
tokenizing and counting with regex: 9 seconds (14453 types)
tokenizing and counting with built-in methods: 1 seconds (14291 types)
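
The regex version of the PHP test walks through the text with an explicit offset rather than collecting all matches in one call. As an illustration only, here is the same offset-driven loop next to a single-pass scan in Ruby, on a hypothetical sample, to show that the two strategies produce the same counts. It assumes a Ruby version in which Regexp#match accepts a start position.

```ruby
# Hypothetical sample text.
text = "And God said, Let there be light: and there was light."
pattern = /\w+['-]\w+|\w+|\s+|./

# Offset-driven loop: find the next match at or after `offset`, count it,
# then continue from the end of that match.
by_offset = Hash.new(0)
offset = 0
while (m = pattern.match(text, offset))
  by_offset[m[0]] += 1
  offset = m.end(0)
end

# Single-pass scan over the same text.
by_scan = Hash.new(0)
text.scan(pattern) { |token| by_scan[token] += 1 }

puts by_offset == by_scan   # true
puts by_offset["light"]     # 2
```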

specs

MacBook Pro, 2.16 GHz Intel Core Duo, 2 GB RAM
$ php -v
PHP 5.2.4 (cli) (built: Sep 23 2007 22:34:35) 
Copyright (c) 1997-2007 The PHP Group
Zend Engine v2.2.0, Copyright (c) 1998-2007 Zend Technologies


Java

code

   /* original code by Stéfan Sinclair, November 2007 */
   import java.io.BufferedReader;
   import java.io.IOException;
   import java.io.InputStreamReader;
   import java.net.MalformedURLException;
   import java.net.URL;
   import java.util.Date;
   import java.util.HashMap;
   import java.util.Map;
   import java.util.regex.Matcher;
   import java.util.regex.Pattern;
   
   public class Bench {
   
      // document to use -- King James Bible from Gutenberg
      static String urlString = "http://localhost/~sgs/Temp/kjv10.txt";
   
      // tokenize to find words, spaces, or other single characters
      static Pattern tokensPattern = Pattern
      .compile("(\\w+['-]\\w+|\\w+|\\s+|.)");
   
      static long last; // keep track of last time stamp
      static long diff; // keep track of time difference
      
      /**
       * @param args
       * @throws IOException
       */
      public static void main(String[] args) throws IOException {
   
         // retrieve text
         last = new Date().getTime();
         StringBuilder sb = new StringBuilder();
         try {
            // Create a URL for the desired page
            URL url = new URL(urlString);
   
            // Read all the text returned by the server
            BufferedReader in = new BufferedReader(new InputStreamReader(url
                  .openStream()));
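            // readLine() strips the line terminator, so each line is re-joined
            // with a single "\n" (CRLF input loses one character per line)
            String str;
            while ((str = in.readLine()) != null) {
               sb.append(str).append("\n");
            }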
            in.close();
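         } catch (MalformedURLException e) {
            // report the problem instead of silently timing an empty text
            System.err.println("bad URL: " + e.getMessage());
         } catch (IOException e) {
            System.err.println("retrieval failed: " + e.getMessage());
         }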
         diff = (new Date().getTime() - last);
         System.err.println("retrieval: " + (diff / 1000)
               + " seconds (" + sb.length() + " characters)");
   
         // tokenize
         last = new Date().getTime();
         Matcher tokensMatcher = tokensPattern.matcher(sb.toString());
         Map<String, Integer> words = new HashMap<String, Integer>();
         String word;
         while (tokensMatcher.find()) {
            word = tokensMatcher.group(1);
            if (!words.containsKey(word)) {
               words.put(word, 0);
            }
            words.put(word, words.get(word) + 1);
         }
         diff = (new Date().getTime() - last);
         System.err.println("tokenizing and counting: " + (diff / 1000)
               + " seconds (" + words.size() + " types)");
   
      }
   }

results

retrieval: 0 seconds (4345143 characters)
tokenizing and counting: 1 seconds (14453 types)
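
The Java run reports 4345143 characters where Ruby and PHP report 4445260, a gap of 100117. One plausible explanation, offered here as an assumption rather than something verified on this page, is that the file uses CRLF line endings: the Java code reads line by line and re-joins the lines with a single "\n", which would drop one character per line. The sketch below shows that effect on a hypothetical three-line sample.

```ruby
# Hypothetical three-line sample with CRLF line endings (an assumption
# about the source file, not something the benchmark output confirms).
raw = "In the beginning\r\nGod created\r\nthe heaven\r\n"

# Mimic the Java loop: take each line without its terminator, then
# append a single "\n".
rebuilt = raw.each_line.map { |line| line.chomp + "\n" }.join

puts "raw: #{raw.length} characters"          # 43
puts "rebuilt: #{rebuilt.length} characters"  # 40
puts "difference: #{raw.length - rebuilt.length} (one per line)"
```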

specs

MacBook Pro, 2.16 GHz Intel Core Duo, 2 GB RAM
$ java -version
java version "1.5.0_13"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_13-b05-237)
Java HotSpot(TM) Client VM (build 1.5.0_13-119, mixed mode, sharing)

-- StefanSinclair - 16 Nov 2007

