Text Analysis Benchmarks
Benchmark comparisons of different programming languages are notoriously thorny, with so many variables in play (system configuration, compiler, algorithm, etc.); in fact, some make a game of skewing the tests and results in favour of their favourite language. In any case, benchmarks often seem to compare the performance of languages on mathematical or scientific problems, such as fractals.
Although one might assume that the relative speed differences hold for text analysis, it's worth testing. And, of course, it's worth reminding ourselves that speed isn't everything: there's also the relative time it takes to write code in each language, as well as the maintainability and scalability of each language.
These experiments are deliberately simple and modular. For each language, there are multiple ways of doing things, and it would be good to document preferable approaches here.
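One approach that may be worth documenting is finer-grained timing: the tests on this page measure each step in whole seconds, which is coarse for steps that finish in about a second. The following is a minimal sketch using Ruby's standard `Benchmark` module. The `timed` helper and the sample workload are hypothetical and are not part of the original tests.

```ruby
require 'benchmark'

# Hypothetical helper: run a block, print its label and elapsed wall-clock
# time in seconds (as a float), and return the block's result.
def timed(label)
  result = nil
  elapsed = Benchmark.realtime { result = yield }
  puts format("%s: %.3f seconds", label, elapsed)
  result
end

# Illustrative workload standing in for a real tokenization step.
text = "the quick brown fox " * 10_000
tokens = timed("tokenizing") { text.scan(/\w+|\s+|./) }
puts "#{tokens.size} tokens"
```

Timings from a helper like this are not directly comparable with the whole-second figures reported below, since those were produced by the original scripts.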
Tokenization and Counting
Task:
Retrieve a relatively large text (the King James Bible as plain text), tokenize it (keeping all punctuation and spaces), and count the occurrences of each type (distinct token).
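To make the task concrete, here is a minimal sketch that runs on a short sample string instead of the full text. The `tokenize_and_count` helper and the sample sentence are illustrative assumptions, not part of the benchmarks; the pattern is the one the tests on this page are built around.

```ruby
# Hypothetical helper (not part of the benchmark code): split a string into
# words, runs of whitespace, and single other characters, then count how
# often each distinct token (type) occurs.
TOKEN_REGEX = /\w+['-]\w+|\w+|\s+|./

def tokenize_and_count(text)
  counts = Hash.new(0)
  text.scan(TOKEN_REGEX) { |token| counts[token] += 1 }
  counts
end

sample = "In the beginning God created the heaven and the earth."
counts = tokenize_and_count(sample)
total = counts.values.inject(0) { |sum, n| sum + n }
puts "#{total} tokens, #{counts.size} types"
# prints "20 tokens, 10 types"
```

The single space counts as one type occurring nine times, and "the" as one type occurring three times, which is why the number of types is much smaller than the number of tokens.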
Summary:
| language | total time (seconds) |
| Ruby | 13 |
| PHP (regex) | 9 |
| PHP (built-in words) | 1 |
| Java | 1 |
These were run on a MacBook Pro (2.16 GHz Intel Core Duo, 2 GB RAM), using Ruby 1.8.6, PHP 5.2.4, and Java 1.5.
Ruby
code
# original code by Stéfan Sinclair, November 2007
require 'net/http'
# document to use -- King James Bible from Gutenberg
host = 'localhost'
path = '/~sgs/Temp/kjv10.txt'
# tokenize to find words, spaces, or other single characters
regex = /\w+['-]\w+|\w+|\s+|./
# retrieve text
last = Time.now.to_i
contents = Net::HTTP.get(host, path)
diff = Time.now.to_i - last
puts "retrieval: #{diff} (#{contents.length} characters)"
# tokenize using a scan; with no capture group, scan yields each match as a String
last = Time.now.to_i
words = {}
contents.scan(regex) do |token|
  if !words.has_key?(token)
    words[token] = 0
  end
  words[token] += 1
end
diff = Time.now.to_i - last
puts "tokenizing and counting: #{diff} (#{words.size} types)"
results
retrieval: 1 (4445260 characters)
tokenizing and counting: 12 (14212 types)
specs
MacBook Pro, 2.16 GHz Intel Core Duo, 2 GB RAM
$ ruby -v
ruby 1.8.6 (2007-06-07 patchlevel 36) [universal-darwin9.0]
PHP
Note that the PHP code has two versions: one that uses regular expressions to count all words, punctuation and spaces (like the other tests), and one that counts only words using built-in functions. The difference in performance is huge, most likely because the built-in functions run entirely within PHP's C implementation, whereas the regular expression version crosses back and forth between PHP code and the regular expression engine for every token.
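The two PHP versions also do not count the same things, so their type counts are not expected to match. A rough sketch of the distinction, written in Ruby for brevity: the word-only pattern below is an assumption that only approximates PHP's `str_word_count` (alphabetic strings that may contain, but not start with, apostrophes and hyphens), and the sample string is illustrative.

```ruby
sample = "Thou shalt not kill. Thou shalt not steal."

# Full tokenization, as in the regex-based tests: words, whitespace, punctuation.
all_tokens = sample.scan(/\w+['-]\w+|\w+|\s+|./)

# Word-only tokenization: an approximation of a built-in word splitter,
# which discards spaces and punctuation entirely.
words_only = sample.scan(/[A-Za-z][A-Za-z'-]*/)

puts "all tokens: #{all_tokens.size} (#{all_tokens.uniq.size} types)"
puts "words only: #{words_only.size} (#{words_only.uniq.size} types)"
# prints "all tokens: 17 (7 types)" then "words only: 8 (5 types)"
```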
code
<?php
// original code by Stéfan Sinclair, November 2007
// document to use -- King James Bible from Gutenberg
$url = "http://localhost/~sgs/Temp/kjv10.txt";
// tokenize to find words, spaces, or other single characters
$regex = "(\w+['-]\w+|\w+|\s+|.)";
// retrieve text
$last = time();
$contents = file_get_contents($url);
$diff = time() - $last;
echo "retrieval: $diff seconds (", strlen($contents), " characters)\n";
// tokenize using matching offsets
$last = time();
$offset = 0;
$words = array(); // token => number of occurrences
/* doing a looping match on the full string is relatively inefficient
 * since the full string is used each time, but I wasn't able to use
 * preg_match_all because of memory constraints; besides this is closer
 * to how the other tests (Java, Ruby) are running
 */
while (preg_match("/$regex/", $contents, $matches, PREG_OFFSET_CAPTURE, $offset)) {
    $token = $matches[0][0];
    // start the count at zero the first time a token is seen
    if (!isset($words[$token])) {
        $words[$token] = 0;
    }
    $words[$token]++;
    // continue matching just past the end of this token
    $offset = $matches[0][1] + strlen($token);
}
$diff = time() - $last;
echo "tokenizing and counting with regex: $diff seconds (", count($words), " types)\n";
// tokenize using built-in word and counting functions
$last = time();
// note that the built-in functions don't allow us to keep punctuation and spaces like the others
$words = array_count_values(str_word_count($contents, 1));
$diff = time() - $last;
echo "tokenizing and counting with built-in methods: $diff seconds (", count($words), " types)\n";
results
retrieval: 0 seconds (4445260 characters)
tokenizing and counting with regex: 9 seconds (14453 types)
tokenizing and counting with built-in methods: 1 seconds (14291 types)
specs
MacBook Pro, 2.16 GHz Intel Core Duo, 2 GB RAM
$ php -v
PHP 5.2.4 (cli) (built: Sep 23 2007 22:34:35)
Copyright (c) 1997-2007 The PHP Group
Zend Engine v2.2.0, Copyright (c) 1998-2007 Zend Technologies
Java
code
/* original code by Stéfan Sinclair, November 2007 */
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Bench {
    // document to use -- King James Bible from Gutenberg
    static String urlString = "http://localhost/~sgs/Temp/kjv10.txt";
    // tokenize to find words, spaces, or other single characters
    static Pattern tokensPattern = Pattern
            .compile("(\\w+['-]\\w+|\\w+|\\s+|.)");
    static long last; // keep track of last time stamp
    static long diff; // keep track of time difference

    /**
     * @param args unused
     * @throws IOException if the URL is malformed or the text cannot be read
     */
    public static void main(String[] args) throws IOException {
        // retrieve text
        last = System.currentTimeMillis();
        StringBuilder sb = new StringBuilder();
        // Create a URL for the desired page
        URL url = new URL(urlString);
        // Read all the text returned by the server; failures propagate as IOException
        BufferedReader in = new BufferedReader(new InputStreamReader(url
                .openStream()));
        try {
            String str;
            // readLine() strips the line terminator, so every line ending
            // (\n, \r, or \r\n) is stored as a single "\n"
            while ((str = in.readLine()) != null) {
                sb.append(str).append("\n");
            }
        } finally {
            in.close();
        }
        diff = System.currentTimeMillis() - last;
        System.err.println("retrieval: " + (diff / 1000)
                + " seconds (" + sb.length() + " characters)");
        // tokenize
        last = System.currentTimeMillis();
        Matcher tokensMatcher = tokensPattern.matcher(sb.toString());
        Map<String, Integer> words = new HashMap<String, Integer>();
        String word;
        while (tokensMatcher.find()) {
            word = tokensMatcher.group(1);
            if (!words.containsKey(word)) {
                words.put(word, 0);
            }
            words.put(word, words.get(word) + 1);
        }
        diff = System.currentTimeMillis() - last;
        System.err.println("tokenizing and counting: " + (diff / 1000)
                + " seconds (" + words.size() + " types)");
    }
}
results
retrieval: 0 seconds (4345143 characters)
tokenizing and counting: 1 seconds (14453 types)
specs
MacBook Pro, 2.16 GHz Intel Core Duo, 2 GB RAM
$ java -version
java version "1.5.0_13"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_13-b05-237)
Java HotSpot(TM) Client VM (build 1.5.0_13-119, mixed mode, sharing)
--
StefanSinclair - 16 Nov 2007