Postabon Blog

Tales from the Early Days 
« Back to blog

Make Lisp 15x faster than Python or 4x faster than Java

Hi Guys - welcome back to the Postabon blog (after our not-so-brief hiatus)!

I've been insanely busy for the last few weeks (I don't think I ever appreciated just how much actual work launching a startup is, until I actually did one). In the last month or so we've rewritten most of the front end of our website (it's much prettier!), made several trips to NYC and the west coast (most recently for Jason Calacanis's Open Angel Forum (among other things) where I got to meet lots of amazing angels and entrepreneurs), and so much more I can't talk about publicly just yet ;-)

I've had this post in mind for a few weeks now and, since I couldn't sleep last night, I finally got around to banging it out.

One of the three most common criticisims I hear when I tell people I'm using Lisp for my startup is that 'Lisp is so slow!' (the other two being 'How can you code with all those ugly parentheses?!', and the dirth of libraries - which I've addressed in other places).

This is quite funny to me, because Lisp is one of the fastest high level languages I know of (with a few obvious exceptions, in very strongly typed languages like Haskell or OCaml ;-)). One of the featuers I like most about Common Lisp is that you can optionally provide the compiler type hints, and you'll come close to the speed of raw C. Most languages provide some sort foreign function interface - but it's usually ugly, hard to use, and non-portable. Being able to take any existing Lisp code, and drop in a few '(declare (type ...))' declarations and getting the speed of a low level language is pretty great!

I thought I'd demonstrate this functionality, with a bit of real world code. Postabon recommends deals to users based on a variety of factors such as the age of the deal, other people's votes, your preferences, and (relevant for this example) your distance from the deal. Most of the other factors can be computed asyncronously and just cached, but your distance from deals is computed a lot, and can't really be pre-computed since I have no way of knowing your location a priori (well, that's not strictly true, and we do do some memoization, but it has a relatively low hit rate). The most common way to compute the distance between two latitude/longitude pairs is to assume the earth is a perfect sphere, and then do some basic trig that's known as the 'Haversine Formula' (Postabon actually does something a little different internally, but this is a reasonable simplification).

Some profiling suggested my program was spending a fair amount of time in a 'distance' function - and, since I'd already tackled a lot of the other low hanging fruit, I decided to take a few minutes to speed it up.

Without further ado, I'll present the results and the code, and talk about how I ran these tests.

 

None of the code is 'optimal', but they are all just literal line-by-line translations of each other (and of the formula in the Wikipedia article :-D). Wherever one uses a temporary variable, so do the others, etc. All the benchmarks were run 5x each on the same Debian machine, using the latest runtime/compiler available from apt. Also of note is that virtual machine start up and end time was disregarded since all timings were done in code. I did experiment with some changes to make things more idiomatic (e.g., using doubles in Java, or list comprehensions in Python), but I ended up using whatever was fastest.

To transform the first piece of Lisp code into the second, I told the compiler I wanted to optimize for speed by typing (declare (optimize (speed 3) (safety 0) (debug 0))) at the top of the salient functions. Then, when I tried to compile my code, it would then tell me whenever it couldn't infer the type of something (or was being forced to coerce types of something) with a warning like:

; in: DEFUN DISTANCE
;     (* (THE FIXNUM *EARTH-RADIUS*) 2
;        (THE SINGLE-FLOAT (ASIN (THE SINGLE-FLOAT (SQRT #)))))
; ==>
;   (* (* (THE FIXNUM *EARTH-RADIUS*) 2)
;      (THE SINGLE-FLOAT (ASIN (THE SINGLE-FLOAT (SQRT #)))))
;
; note: doing signed word to integer coercion (cost 20), for:
;       the first argument of GENERIC-*

Translated, that just means that when SBCL is multiplying the Earth's radius by two, it doesn't know for certain that the result will fit in a 'fixnum' (analagous to an 'int' in most other languages) so it has to assume the result is an integer (which is an infinite precision integer, similar to a 'BigInt' in most other languages). This conversion is relatively expensive but, since I know the Earth's radius will always be under 7,000 km, - I can just give the compiler a 'hint' and guarantee that the result will always be small enough to fit in a 'fixnum.'

I'll admit these warnings are a little cryptic. When I first started programming with SBCL it used to me almost an hour to figure them all out properly optimize a small piece of code. But, after doing it a half dozen times, I can usually anticipate where the type conversions/ambiguities are going to be before the compiler even tells me - and I can usually speed up most small functions three or four fold with just five minutes of work.

Of course, I still do this sort of optimization fairly rarely, since the code loses some flexibility (and I'd rather work on adding new features until profiling tells me I have to go do this). I also very rarely use (safety 0) in production (if I do, I'm careful to have enough pre-assertions to prove that my hints will never be wrong). But, it's certainly nice to have the option to make the code go so fast so easily.

If I'm making some terrible mistake in my Java or Python code (which is possible, I'm a bit rusty with both), please let me know or submit a revised (but still equivalent) solution, and I'll be happy to rerun the benchmarks.

Posted by Shaneal Manek 

Comments (37)

Jan 28, 2010
Ben said...
Try running the java code using the server jvm.
Jan 28, 2010
Shaneal Manek said...
Thanks for the advice Ben.

Running DistancePrimitives with -server took about 113 seconds (instead of the ~115 it takes with -client). A little faster, but not really a significant difference.

Jan 28, 2010
andrew cooke said...
interesting; i'm surprised by this. what jre are you using? i can't imagine it would make any difference, but you could "import static" the Math functions in the java code.
Jan 28, 2010
andrew cooke said...
also, you could try running the entire java solution several times to see if it gets faster (again, just a stab in the dark).
Jan 28, 2010
Nit Picker said...
(defconstant +earth-diameter+ 12742)
Jan 28, 2010
onne said...
for the hotspot to kick in, I guess running 5 times might not enough try 100 times and see if there is a large discrepancy in the timing

also might I recommend using System.nanoTime();

Jan 28, 2010
andrew cooke said...
sorry for posting so many times. just to clarify - what both i and onne are suggesting is that you need to run the routine more *within the same process*. there is a known issue that benchmarks can be misleading because something in the jit/memory management takes time to become efficient (although i have no idea what the threshold is, and what you are doing does seem like a fair chunk of work already).
Jan 28, 2010
Shaneal Manek said...
andrew: Thanks for the hints. Rerunning the benchmark repeatedly in the same JVM yielded about a 10% improvement. Still not what as good as I think it should be.

$ java -version
java version "1.6.0_0"
OpenJDK Runtime Environment (build 1.6.0_0-b11)
OpenJDK 64-Bit Server VM (build 1.6.0_0-b11, mixed mode)

It's the latest java installable by apt in Debian ...

Jan 28, 2010
David Beazley said...
For the Python code, I would strongly suggest using a statement such as "from math import *" and using the functions sin(), cos(), sqrt(), etc. without the associated 'math.' prefix. You will find that the code runs about 25% faster (result I obtained when I tested it).
Jan 28, 2010
Shaneal Manek said...
David: Thanks - good catch. I'm rerunning the benchmarks now. As soon as it finishes I'll update the code and the chart.
Jan 28, 2010
andrew cooke said...
continuing to extend the bounds of my lack of knowledge... i had no idea what openjdk was (i just had a look at the faq - http://www.sun.com/software/opensource/java/faq.jsp - and it says what it contains, but doesn't seem to say what it doesn't contain that the proprietary jdk might).

so it might be worth downloading the "non-open" jre from java.sun.com and trying that (i develop commercially with java on linux and have always installed from java.sun.com rather than using what is provided in distros; but this openjdk looks like it is "the real thing" (at last) so again it's not clear there will be any speedup).

and while reading around i found this - http://java.sun.com/performance/reference/whitepapers/tuning.html#section3.1 - which explains better the earlier comment. see also section 4 for other suggestions on how to tune java code.

ps i am not really surprised lisp is fast - more surprised that java is not comparable.

Jan 28, 2010
bebraw said...
It probably doesn't make much of a difference but you could try replacing the while constructs of the Python code with something along:

for latA in xrange(-90, 90 + increment, increment):
for lngA in xrange(-180, 180 + increment, increment):
for latB in xrange(-90, 90 + increment, increment):
for lngB in xrange(-180, 180 + increment, increment):

In my tests that was a bit faster.

If you are willing to benchmark Python further, please take a look at Psyco (http://psyco.sourceforge.net/), PyPy (http://codespeak.net/pypy/dist/pypy/doc/) and Cython (http://cython.org/). It would be somewhat interesting to see how the results gained using those would measure up against the rest.

You could also try to memoize (cache) the math func results to see if that makes any measurable difference. See http://code.activestate.com/recipes/52201/ for further information. That might be considering cheating, however. :)

Jan 28, 2010
Mike said...
Lua version:
------------------------------------------------------------------
local radius = 6371

local function distance(latA, lngA, latB, lngB)
local latAr = math.rad(latA)
local lngAr = math.rad(lngA)
local latBr = math.rad(latB)
local lngBr = math.rad(lngB)

local deltaLat = latBr - latAr
local deltaLng = lngBr - lngAr

return radius * 2 * math.asin(math.sqrt(
math.sin(deltaLat/2)^2 + math.cos(latAr)
* math.cos(latBr) * (math.sin(deltaLng/2)^2)))
end

local start = os.clock()

for latA=-90,90,2.5 do
for lngA=-90,90,2.5 do
for latB=-90,90,2.5 do
for lngB=-90,90,2.5 do
distance(latA, lngA, latB, lngB)
end
end
end
end

local stop = os.clock()
print(stop-start)
------------------------------------------------------------------

Runtime on a 3 GHz Core2 E8400:

270.55s Python 2.6.2 (for comparison)
33.49s Lua 5.1.4
0.04s LuaJIT 2.0.0-beta2

The distance() function is pure and its result is unused, so the compiler just optimizes it away. Even if you sum up the results to prevent this, it only takes 3.06 seconds with LuaJIT 2.0. No type declarations needed.

Jan 28, 2010
Ryan said...
Change your time in java to use: System.currentTimeMillis();

You can test the overhead:

final long start = System.nanoTime();

//final long time = System.currentTimeMillis();
final long time = java.util.Calendar.getInstance().getTimeInMillis();

final long exec = System.nanoTime() - start;

System.out.println("exec: " + exec);

Jan 28, 2010
Daniel said...
Java 1.6.0_0 was released December 2006! The current version is 1.6.0_18.
Jan 28, 2010
Shaneal Manek said...
bebraw and andrew: very good points. I have to go do real work now, but I'll try to come back and rerun the java benchmark (with the official sun JRE) and the Python benchmark (with xrange). I think memoization is cheating (I'm not using it in any of the other solutions), and for the sake of simplicity I'm going to stick default Python interpreter/compiler.

Mike: Lua seems very impressive - I need to learn some more of it one of these days!

Ryan: You're right, that's probably the better way to do so. But it doesn't appear to make a measurable difference.

Jan 28, 2010
Ryan said...
My mistake... I briefly glanced at this and thought the times were in milliseconds (not seconds).
Jan 28, 2010
Isaac Gouy said...
>> If I'm making some terrible mistake in my Java <<

It's more that you are highlighting the danger of generalizing from a single small code snippet - maybe your code snippet does nothing more than remind us that Java sin and cos might not be using the machine hardware.

Notice the difference between

http://shootout.alioth.debian.org/gp4/program.php?test=partialsums&lang=java&id=4

and the plain vanilla program

http://shootout.alioth.debian.org/gp4/program.php?test=partialsums&lang=java&id=3

Jan 28, 2010
Isaac Gouy said...
>> Rerunning the benchmark repeatedly in the same JVM <<

Here are some examples of what the differences can be, binary-trees has the noticeable improvement -

http://shootout.alioth.debian.org/help.php#java

Use -XX:+PrintCompilation -XX:-PrintGC to see what's happening with your code snippet

http://shootout.alioth.debian.org/u32q/program.php?test=binarytrees&lang=javasteady&id=2#about

Jan 28, 2010
LanguageNarc said...
Use a spell-checker! It's "dearth," not "dirth." And the ONLY valid meaning of "it's" is "it is" -- the possessive form is spelled "its" (no apostrophe). Stop contributing to the decline of the English language!
Jan 28, 2010
Brian Harring said...
@bebraw:

your xrange isn't the same- xrange converts floats to ints for py2.6...

Jan 28, 2010
bebraw said...
Brian Harring:
Oops. Good point! Coming up with a simple frange should not be too hard but that's probably beyond the point of this benchmark. :)
Jan 28, 2010
Brian Harring said...
Few notes from boredom and an i7 @ ~2.8GHZ

1) the beazley version you've posted is bugged- needs to be sqrt( instead of math.sqrt( since you've imported sqrt into the globals() of that module
2) unladen-swallows (u-s) does a nice number on this- 144s. Still crappy admittedly, but down from 258s.
3) about 28s of u-s runtime is from continually recalculating floats + radius/radians scope lookups... calculate once (meaning down to 116s now)- that same change knocks 7s off of py2.6. Specifically making a list of the lat/longs it will inspect w/in each loop, and just iterating over that instead.
4) python function calls are stupidly expensive- via #3, down to 251s. Inlining distance (and doing the usual "localize the function" trick for the math bits you invoke) brings it down to 211s. As said... pricey ;)
5) pythons math module internally converts the items to doubles rather then floats- I'd assume that can be a hit using sqrt instead of sqrtf although I've no clue what lisp which lisp uses (I presume sqrtf going by your type modifications).

Aside from that random optimization run... yeah, python is getting it's ass handed to it here- I do truly wonder what the sqrt/sqrtf hit is though.

Jan 28, 2010
domor said...
Obligatory Haskell version: http://domor.pastebin.com/f11063fdd
Runs in 13s on my laptop (-02, on Windows so no -fvia-c), but over half the time is spent in GC so it could probably be improved a lot.

It does have some nice type safety between radians and degrees :)

Jan 28, 2010
Noel Grandin said...
Java's transcendental math functions have a well-known problem in that their API contract means that they don't use the hardware sin/cos instructions. Something to do with the fact that java produces good results no matter what the input while the intel hardware instructions will do some strange stuff outside of the [-PI/2, PI/2] range. Can't find the link right now.
So yeah, in this case, you're going to get better results with a language-runtime-JIT that has enough information to safely compile down to the machine instructions.
Jan 28, 2010
jtarchie said...
You are doing optimizations in your querying, but there are better ways of handling 'nearest neighbor' scenario. You can go from seconds to milliseconds. One data structure that comes to mind if the kd-tree, which is often used for when you want to find nearby locations to a coordinates.

http://en.wikipedia.org/wiki/Kd-tree

Jan 28, 2010
Shaneal Manek said...
@jtarchie: We're using a modified R*-tree with some really cool bulk loading stuff for spatial/nearest-neigbhor queries. This is an order of magnitude faster than a kd-tree - and I may write a post at some point explaining it. The distance function is used in our recommendation engine - where we have to have a precise numerical distance as an argument in the 'weighting' function.
Jan 28, 2010
framiere said...
In this current configuration, hotspot is not working.
In order for Hotspot to trigger its magic, you need to extract your code into another method.
Warm-up hotspot by executing a few times this method.
Then you should experience the real speed of java.
Jan 28, 2010
andrew cooke said...
and someone has tried it with the standard jdk - http://news.ycombinator.com/item?id=1084139 - which makes the times comparable (lisp is still slightly faster).
Jan 28, 2010
drhodes said...
Python with psyco clocked at 150 seconds. Without: 398.
Jan 28, 2010
TheSchemer said...
I love it how everyone's jumping in to defend the other languages.

All I have to say is... Go Lisp!

Jan 29, 2010
Artur said...
@framiere - not true, hotspot server can do on-stack replacement. With such long running test, difference is not huge.

Noel gave a real hint - sin/cos etc are very slow in java due to different expectations for quality of results. What you are testing here is speed of those functions and here java loses, no question about it. In fact, most of Math.* functions are way slower than expected - I remember getting 5x speedup in some microbenchmark when using (int)f instead of Math.floor(f) after noticing that f > 0 in particular place.

Jan 29, 2010
ilowry said...
Hi, I so often saw the code like (the float (... (the float ...))), so decided to write small library that would add the type declarations recursively. It's very basic and just provides the only macro THE*, so if you write

(THE* SIGNLE-FLOAT (SIN (/ 1 4))

, it expands it into

(THE SINGLE-FLOAT (SIN (THE SINGLE-FLOAT (/ (THE SINGLE-FLOAT 1.0) (THE SINGLE-FLOAT 4.0)))))

But I have never used in in practice and can not tell how it is usable yet. So, I'd be glad if you could try it, and tell me if it is usable or not for a real project.

It is asdf-installable the name of the package is THE. Page on cliki: http://www.cliki.net/the

Jan 29, 2010
snorbi said...
For the Java benchmark, please see the "Performance Options" part of the HotSpot VM options: http://java.sun.com/javase/technologies/hotspot/vmoptions.jsp#PerformanceTuning

For your test it may be more appropriate to apply aggressive optimization, like AggressiveOpts, or setting the CompileThreshold to a very low value (like to 1 :).

Best regards:
--
Norbert Sándor
http://jvminside.blogspot.com/

Jan 30, 2010
Goner said...
The frivolous denial of the Java crowd is classic.
Jan 30, 2010
snorbi said...
> The frivolous denial of the Java crowd is classic.

That's not true :)
I'm just curious, if the test is run with more proper "real-life"-like configuration, how the HotSpot performs.

Feb 01, 2010
evardsson said...
I am surprised at how poorly Python did in this case. I know from previous experience that in terms of string handling / parsing large log files etc it outperforms Java by a wide margin. I wonder what the bottleneck here is.

Leave a comment...

 
Got an account with one of these? Login here, or just enter your comment below.
Posterous-login    twitter