A simplistic look at the performance of several embedded Java databases

_update_: I’ve update some of the results to include HSQLDB’s CACHED tables.

h2. Introduction

As part of my PhD research I am developing a fairly complex simulation of pedestrian movement. Well, it’s at least moderately complex, particularly when scaled up to tens of thousands of people! I’ve have developed the simulation with MySQL as a backend, serving both to provide the input data (overall configuration, street maps, routes, agent preferences, etc) and collect the output data (basically an event is generated whenever two pedestrians walk past each other). Further overviews on my research can be found at in the “research section”:http://jamie.ideasasylum.com/research/phd.html

MySQL was initially chosen because I had some experience with it via PHP, version 4.1 supports OpenGIS definitions (which, for example, allows me to query for all streets within a particular area) and it has plenty of formal and informal support (partiularly documentation and management tools). In order to explore the parameter space of my simulation I need to deploy it on the University’s research cluster. However, accessing a remote database, particularly for such frequent input/output, was not a plausible or efficient situation. The easiest solution was to replace MySQL with a pure-Java, embedded database. Embedded databases eschew the client-server architecture of a mainstream database (such as Oracle, Postgres, MySQL) and instead execute within the JVM and store their data in local files. Most embedded databases still use the standard JDBC interface through which traditional client-server databases are accessed. Thus, you should be able to change the database drivers (and possibly make minors modifications to the SQL) and everything would work just like with MySQL — but faster and with no network requirements. In reality, it isn’t quite that simple.

It seemed like a wise step to spend a short time comparing the available solutions which is what this article is all about. Please note that, although I’m a Java programmer for the past 8 years, I am not a database expert. I make no claims to the accuracy of this article. Please don’t simply scan the graphs and read the conclusions without understanding what I have benchmarked! Also, I wouldn’t try to generalise the results too much — it’s always best to benchmark your own candidates with a representative usecase from your application. I supply the benchmark source code for you to peruse and adapt as you require.

h2. Candidate Databases

So, based on Google, “Java-source”:http://www.java-source.net/ and general knowledge, I went in search of a pure Java embedded database which would fulfil my criteria:

* Free for non-commercial/educational use and/or open source (well, I am a student!)
* Save to local files (which should be documented so I can retrieve them from the remote machines)
* Support JDBC and act as a relational database
* Support auto-incremented integer columns
* Be fast and small with minimal external configuration
* INSERT operations are probably more important than SELECTs so optimisation there would be appreciated
* Good documentation
* Recently updated and under active development

Probably the most widely used embedded database is “Sleepycat’s”:http://www.sleepycat.com/ “BerkeleyDB”:http://www.sleepycat.com/products/db.shtml, of which they now offer “a pure Java version”:http://www.sleepycat.com/products/je.shtml. Personally I’m a bit of a fan of this database but it uses a record structure, not a relational one, and therefore it doesn’t support JDBC. I’ve found three other possible candidates: HSQLDB, Derby (previously IBM’s Cloudscape) and Daffodil’s One$DB. I give a very brief overvew of each database below but the primary focus of this article is performance benchmarking not feature comparison.

h3. MySQL

For the sake of completeness, I include “MySQL”:http://www.mysql.com/ here as it is the database I’m migrating from.
MySQL is a popular cross-platform open-source database with extensive documentation, books and tools. MySQL is a native application, which is accessed in a client-server fashion, and is widely used in combination with PHP and Apache for web applications.


“HSQLDB”:http://hsqldb.sourceforge.net/, previously known as HypersonicDB, is a mature Java embedded database which has recently found favour with the “OpenOffice.org”:http://www.openoffice.org team and it will be integrated with their forthcoming database office application, Base. HSQLDB is often used in combination with Hibernate. The website for HSQLDB is fairly plain, but easy to navigate, and the documentation appears to be quite comprehensive. There are a few tools supplied with HSQLDB, including a database browser, but several other tools also support this database (including the useful and attractive “DbVisualiser”:http://www.minq.se/products/dbvis/)

h3. Derby

“Derby”:http://incubator.apache.org/derby/ (previously known as Cloudscape) was recently open-sourced by IBM and contributed to the Apache project. As such, it is a mature product but a relatively unknown quantity to most developers. The website is clean and there is a good level of useful documentation available. There are a few tools provided with the distribution including a rudimentary viewer and a command -line interface.

h3. Daffodi One$DB

“One$DB”:http://www.daffodildb.com/one-dollar-db.html is the open-source version of DaffodilDB but retains most of it’s features. One$DB was open sourced in December 2004 and can be embedded into your application or in a typical client-server database. One$DB is supplied with a database browser (although you can also use DbVisualiser) and a good selection of well-presented documentation (although their SQL reference could use some examples). Daffodil can also supply an ODBC driver and database replicator which work with One$DB.

h2. Method

My primary goals were to perform the same tests on each database using the default setup for each system. To this end I’ve written a class, unimaginatively called Benchmark, which creates a table, INSERTs some rows and then reads them back using a SELECT statement. Originally, I had intended to use the same SQL for all four databases but there were enough differences that each database now has its own SQL statements. I was concerned that many of the embedded database would be doing some heavy caching. To exclude this possibility, the database connection is closed after each operation:

start timer 1
open database
DROP the table if it exists and (re-)CREATE the table
close the database
stop timer 1
start timer 2
open the database
INSERT n rows using a PreparedStatement
close the database
stop timer 2
start timer 3
open the database
SELECT all rows using a PreparedStatement
close the database
stop timer 3

This is an example of the table used in the benchmark. It is a simple combination of popular column types which I shall require in my work:

name VARCHAR( 254 ) NOT NULL,
longnumber BIGINT NOT NULL,
floatnumber FLOAT NOT NULL,

h3. Implementation

I won’t write too much on the actual implementation of the benchmarks since its pretty simple and boring. The whole suite (such as it is) is “available for download”:/research/dbBenchmark.zip [~6MB]. It contains the source code, all the required libraries and a Netbeans 4.0 project to build it with. Of course, since Netbeans actually uses ANT as a build environment you can just type ant jar it the main dbBenchmark directory (sorry for the blatant publisising but I’m a bit of a Netbeans 4 fan). There are a bunch of .bat files which may need to be modified for your environments but they should provide you with an understanding of the various command-line options.

Feel free to modify the benchmarks to suit your own requirements. I won’t dignify it by slapping an open-source licence on it but treat it as public domain code — although if you make any significant changes I’d be interested to hear about it in the comments below.

h3. Usage

The benchmark is invoked from the command line using the following options:

usage: java com.ideasasylum.dbbenchmark.Benchmark
|-I,–increment|Increment the select and inserts for load testing|
|-b,–benchmark|The class name of the benchmark to execute|
|-d,–database|The JDBC url of the database|
|-j,–driver|JDBC Driver|
|-n,–runs|The number of benchmark run to perform|
|-o,–output|The output file name (CSV format)|
|-p,–password|The database password|
|-r,–rows|The number of selects to perform (selects should be >= inserts)|
|-u,–username|The database username|

For example, the following command will benchmark a Derby database using the Benchmark class com.ideasasylum.dbbenchmark.DerbyBenchmark, the driver org.apache.derby.jdbc.EmbeddedDriver and the database jdbc:derby:derbytest;create=true. It will perform 10 runs, starting with 5000 inserts and incrementing this by 5000 each time (so the last run will be inserting and retrieving 50000 rows). The output is sent to a comma-seperated file, derbyload.csv.

java -classpath dist/dbBenchmark.jar;lib/derby.jar;lib/commons-cli-1.0.jar com.ideasasylum.dbbenchmark.Benchmark -b com.ideasasylum.dbbenchmark.DerbyBenchmark -j org.apache.derby.jdbc.EmbeddedDriver -d jdbc:derby:derbytest;create=true -n 10 -r 5000 -o derbyload.csv --increment=5000

h3. Configuration

|MySQL|4.1.7-nt|com.mysql.jdbc.Driver (version 3.1.7)|

All tests were performed using Java 5 on a 2.6GHz P4, 1GB RAM and Windows XP. The machine was lightly loaded (e.g. email client etc). The results from each run are output into a CSV file. An Excel spreadsheet links in these files and plots a few graphs, calculates some averages etc. The data supplied here is not definitive but it is enough to provide a quick impression of the speed of each database.

During the course of the experiments it became clear that HSQLDB was unbelievably fast. _Too fast_. Upon checking the documentation, I discovered that HSQLDB should be shutdown by sending it the SQL command, SHUTDOWN. It doesn’t shutdown properly if connection.close() is used although, since it is a reliable database, no data loss appears to occur. I’ve included a seperate series called “HSQLDB Shutdown” which benchmarks HSQLDB when it is shutdown properly. Update: As someone pointed out, by default HSQLDB creates in-memory tables unless you use CREATE CACHED TABLE. I’ve added two more columns which show the results when HSQLDB is run using CACHED tables and when SHUTDOWN properly.

h2. Results

h3. Drop/Create Performance

|Rows|MySQL|HSQLDB|Derby|Daffodil|HSQLDB Shutdown|HSQLDB Cached|HSQLDB Cached Shutdown|

h3. Insertion Performance

|Rows|MySQL|HSQLDB|Derby|Daffodil|HSQLDB Shutdown|HSQLDB Cached|HSQLDB Cached Shutdown|

h3. Selection Performance

|Rows|MySQL|HSQLDB|Derby|Daffodil|HSQLDB Shutdown|HSQLDB Cached|HSQLDB Cached Shutdown|

h3. Other Data

* MySQL displays very low CPU utilisation (~2% for the benchmark application and 5% for the mysql server) which probably indicates that the bottleneck is the client-server I/O (which is to be expected).
* One$DB and HSQLDB had a very high CPU utilisation (>95% measured using the Windows XP Task Manager). Again, this was expected since the databases are integrated into the application and the only bottleneck is how fast it can process the data.
* Derby had a lower CPU usage than the other embedded databases (~40-70%)

h3. Graphs

_Click on the graphs for a sharper image_

!/notebook/images/20t.jpg (Average Times for Embedded Databases)!:/notebook/images/20.png
The average times (in milliseconds) for each embedded database to perform an operation involving 2000 rows

!/notebook/images/17.png (INSERT times for 2000 rows)!
The time to complete an INSERT of 2000 rows, performed in 10 consecutive runs. Notice how the first run is significantly slower.

!/notebook/images/18t.jpg (INSERT times for all databases)!:/notebook/images/18.png
The INSERT time for an increasing number of rows. MySQL is disproportionately slow in this test and obscures the performance of the embedded databases.

!/notebook/images/16t.jpg (INSERT times for the Embedded Database)!:/notebook/images/16.png
The INSERT times for an increasing number of rows. HSQLDB, whether SHUTDOWN or not, is clearly the fastest database here.

!/notebook/images/19t.jpg (The SELECT times for all databases)!:/notebook/images/19.png
The SELECT times (in milliseconds) for an increasing number of rows. HSQLDB, when SHUTDOWN properly after each SELECT operation, is cleary the slowest database. Although, it can still retrieve 50,000 rows in less than 4 seconds.

!/notebook/images/15t.jpg (Drop/Create Times)!:/notebook/images/15.png
The time (in milliseconds) to DROP and re-CREATE a table. The number of rows is actually the number of rows which are going to be INSERTed and SELECTed. The number of rows in the table that is DROPped is row-5000 (the supplied increment variable).

h2. Analysis

* All of the databases appear to scale linearly although it is not possible to be sure with the experiments performed here.
* All of the databases could probably be improved with specific optimisations
* Closing the database connection between each operation is obviously a pretty unrealistic scenario. However, the only database which seems to be adversely affected by this is HSQLDB when SHUTDOWN. In all the other cases,
* The first benchmark run is always a little slow which is almost certainly due to the JVM’s Just-In-Time compilation.
* MySQL has awful INSERT times compared to the embedded databases
* MySQL is fairly efficient for row retrieval, keeping up with the embedded databases.
* HSQLDB is ridiculously fast for everything (INSERTs, SELECTs, DROP/CREATEs) and it appears to scale better than any of the others.
* The DROP/CREATE times are not a terribly important benchmark but it does show that HSQLDB, when SHUTDOWN properly, is the only database whose times do not remain constant regardless of the number of rows existing in the table.
* HSQLDB, when it is not SHUTDOWN properly, has almost no table DROP/CREATE overhead.
* HSQLDB is generally slower when it is actually shutdown properly (using SHUTDOWN) and it would appear that the main bottleneck with HSQLDB is the SHUTDOWN process
* HSQLBD doesn’t seem to exhibit any adverse effects if it is not shutdown properly, which is as you’d expect from a good database. Of course, you are left to wonder, if HSQLDB does suffer adverse effect when not SHUTDOWN properly, why incur the speed penalty?
* Derby is slower than One$DB for INSERTs but has better SELECT performance
* Most of the databases are optimised for SELECT with the exception of HSQLDB which has extremely fast INSERTs and only mediocre SELECTs.
* Whilst I wouldn’t put too much stock in the exact figures, it is interesting that *HSQLDB takes less than 4 seconds to INSERT 50,000 rows, whilst MySQL takes over 20mins!*
* I am suspicious that MySQL’s INSERT performance is so poor compared to its SELECT performance. I believe this could possibly be a configuration problem on my part, although I have no idea what that might be.

h2. Conclusion

Choosing a database is not just about performance — but that is usually one of the criteria. It seems fairly clear that any of the embedded databases will have a significant performance advantage over MySQL.

For me, the winner is HSQLDB, particularly due to its wonderfully fast INSERTs performance, which will be the bulk of the work required from the database. The SELECT performance would also appear to be excellant when the SHUTDOWN times are excluded. Since MySQL possesses useful GIS support, and better integration with external reporting tools, I won’t be completely leaving it behind. My intention is to use HSQLDB during the simulation runs for input and output data, but then load the data into MySQL when performing the analysis of the experiments.

I’d urge anyone choosing between databases to run their own benchmarks and don’t rely on my results and interpretations. Feel free to download my “Benchmarking suite”:/research/dbBenchmark.zip and modify it to your own requirements. I won’t dignify it by slapping an open-source licence on it but if you make any significant changes I’d be interested to hear about it in the comments below.

*Update*: Despite my initial mistake in using HSQLDB in-memory tables, the CACHED version is still faster than the other contenders and the SELECT performance doesn’t suffer from the final write-to-disk bottleneck of in-memory tables. Other people have noted that MySQL would be much faster is the INSERTs were performed inside a single transaction. This is quite true but I have to migrate away from MySQL and it made sense to benchmark MySQL is the way that I currently use it. There’s “a thread running on JavaLobby about this comparision”:http://www.javalobby.org/java/forums/m91827199.

Subscribe to future posts

No yukky spam • No more than one email a month