BioBenchmark Toyama 2012

=> Bigdata1.1

=> Bigdata1.2

=> OwlimSe4.3

=> OwlimSe5.1

  • Summary

=> SummarizeOldVersion

=> Summarize

Overview

We present an evaluation of triple stores on biological data. Compared with data in other fields, biological data sets are typically huge, so the performance of bulk loading and querying is essential in deciding whether a triple store can be used in the biological field. Our goal is to verify whether current triple stores can deal efficiently with such large-scale biological data. We tested five native triple stores: Virtuoso, OWLIM-SE, Mulgara, 4store, and Bigdata. We chose five real biological data sets instead of synthetic ones, ranging from roughly ten million to eight billion triples. We report their load times and query costs; we did not test inference in this study.

For each database we provide several results obtained by adjusting its parameters, which can significantly affect performance. However, these parameters may behave differently on different hardware and software platforms, and even with different data sets. It is impractical to test every combination of parameters for every data set, because importing the larger data sets, such as UniProt and DDBJ, can take several days. Therefore, although we tried to find the best configuration for each triple store, we cannot guarantee that the results reported here represent its best possible performance.

4store

4store is an RDF/SPARQL store written in C and designed to run on UNIX-like systems, either on single machines or on networked clusters. Please refer to http://4store.org/ for details.

Bigdata

Bigdata is designed as a distributed database architecture running over clusters of hundreds to thousands of commodity machines, but it can also run in a high-performance single-server mode. It supports RDFS and limited OWL inference. Bigdata is written in Java and is open source. Please refer to http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=GettingStarted.

OWLIM-SE

OWLIM-SE is a member of the OWLIM family of native RDF engines implemented in Java; it delivers full performance through both the Sesame and Jena APIs. Starting with version 4.3 it supports SPARQL 1.1 Federation. It supports the semantics of RDFS, OWL 2 RL, and OWL 2 QL. OWLIM-SE is available only under a commercial license. Please refer to http://www.ontotext.com/owlim.

Mulgara

Mulgara is written entirely in Java and is open source. Mulgara provides iTQL (Interactive Tucana Query Language), a SQL-like language and shell for querying and updating Mulgara databases, and it supports RDFS and OWL inferencing. It also provides a SPARQL query parser and query engine. Please refer to http://www.mulgara.org/.

Virtuoso

Virtuoso is a multi-purpose data server for RDBMS, RDF, XML, and other data, and provides triple storage for RDF on top of its RDBMS platform. It offers stored procedures to load RDF/XML, N-Triples, and compressed triple files, and it supports SPARQL. Virtuoso supports limited RDFS and OWL inferencing and can run in both standalone and cluster modes. The standalone triple store server is available under both open source and commercial licenses. Please refer to http://virtuoso.openlinksw.com/.
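
As an illustration of this loading workflow, the following Python sketch drives Virtuoso's bulk loader through the isql command-line client, assuming the bulk-loader procedures (ld_dir, rdf_loader_run) are installed; the port, credentials, data directory, and graph URI are placeholders, not the settings used in this benchmark.

    import subprocess

    # Hypothetical values: adjust the port, credentials, directory, and graph URI.
    ISQL = ["isql", "1111", "dba", "dba"]       # isql <port> <user> <password>
    DATA_DIR = "/data/rdf"                      # directory containing *.rdf.gz files
    GRAPH = "http://example.org/graph"          # target named graph

    def run_sql(statement):
        """Send one SQL statement to Virtuoso through isql."""
        subprocess.run(ISQL + ["exec=" + statement], check=True)

    # Register the files with the bulk loader, run the loader, then checkpoint.
    run_sql("ld_dir('%s', '*.rdf.gz', '%s');" % (DATA_DIR, GRAPH))
    run_sql("rdf_loader_run();")
    run_sql("checkpoint;")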

The following table summarizes some basic information.

Triple Store | Open Source | Cluster | Inference                    | Federated Query
4store       | Yes         | Yes     | No                           | No
Bigdata      | Yes         | Yes     | RDFS and limited OWL         | Yes
Mulgara      | Yes         | Yes     | RDFS and OWL (full?)         | No (?)
OWLIM-SE     | No          | No      | RDFS, OWL 2 RL and OWL 2 QL  | Yes
Virtuoso     | Partly      | Yes     | Limited RDFS and OWL         | Yes

Platform

* Machine:

  • OS: GNU/Linux
  • CPU: Intel(R) Xeon(R) CPU E5649 @ 2.53GHz (GenuineIntel, family 6); 12 cores, 24 threads (hyper-threading)
  • Memory: 65996128 kB (approx. 64 GB)
  • Hard disk: SCSI RAID 0 (three 2 TB hard disks; two of them are used to store data)

* Software:

  • JDK: 1.6.0_26
  • 4store: 1.1.4
  • Bigdata: RWSTORE_1_2_0
  • Mulgara: 2.1.13
  • OWLIM-SE: 5.1.5269
  • Virtuoso: 6.4 commercial

The versions we chose ("latest version" below refers to the newest version available as of Oct. 10, 2012):

4store: The version we tested is V1.1.4, released on Sep. 20, 2011. The latest version is 1.1.5, released on Jul. 10, 2012.

Bigdata: The version we tested is V1.1.0; the latest version is 1.2.2.

Mulgara: The version we tested is V2.1.13, released on Jan. 10, 2012, which is the latest version.

OWLIM-SE: The version we tested is V4.3.4238, released in November 2011; the latest version is V5.2.

Virtuoso: The version we tested is V6.4, released in May 2012; the latest version is V06.04.3132. We tried the open source V6.1.4 but gave up because of a memory failure when loading more than 3 billion triples.

Data

We chose five real biological data sets instead of synthetic data; their sizes range from about 10 million to 8 billion triples. We summarize the query characteristics in => QueryCharacteristics .

Cell Cycle Ontology: .rdf (RDF/XML) format, 11,315,866 triples, from http://www.semantic-systems-biology.org/. SPARQL queries: attachment:cell.txt .

Allie: .n3 format, 94,420,989 triples, from ftp://ftp.dbcls.jp/allie/allie_rdf/. SPARQL queries: attachment:allie.txt .

PDBj: .rdf.gz format, 589,987,335 triples in 77,878 files, from ftp://ftp.pdbj.org/RDF/. SPARQL queries: attachment:pdbj.txt .

The PDBj queries are point queries that retrieve the properties of a given entry ID, such as 107L; their result sets are therefore small, but the number of joins per query is large.
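
To make this query shape concrete, the hypothetical sketch below (Python with an embedded SPARQL string) anchors every pattern on a single entry ID; the ex: prefix and predicates are invented placeholders and do not reflect the actual PDBj vocabulary or the benchmark queries in attachment:pdbj.txt.

    # Hypothetical point-query shape: one entry ID, small result, several joins.
    # The prefix and predicates are placeholders, not the real PDBj vocabulary.
    ENTRY_ID = "107L"

    point_query = """
    PREFIX ex: <http://example.org/pdb/>
    SELECT ?title ?method ?resolution
    WHERE {
      ?entry ex:id "%s" .                 # anchor on one entry ID
      ?entry ex:title ?title .            # join 1
      ?entry ex:experiment ?exp .         # join 2
      ?exp   ex:method ?method .          # join 3
      ?exp   ex:resolution ?resolution .  # join 4
    }
    """ % ENTRY_ID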

UniProt: .rdf.gz format, 4,025,881,829 triples; the three largest files are uniprot.rdf.gz, uniparc.rdf.gz, and uniref.rdf.gz, from ftp://ftp.uniprot.org/pub/databases/uniprot/ (the experiment used the November 2011 version). SPARQL queries: attachment:uniprot.txt or http://beta.sparql.uniprot.org/.

DDBJ: .rdf.gz format, 7,902,743,055 triples in 330 files, from ftp://ftp.ddbj.nig.ac.jp/ddbj_database/ddbj/. SPARQL queries: attachment:ddbj.txt .

Approach

We imported the data with the default parameters and with several empirically improved settings. We then loaded the data twice into each triple store using its best setting and report the average as the import cost.
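
A minimal sketch of how such an import measurement could be scripted is shown below; the loader command is only an example (4store's 4s-import here), and the knowledge-base name and input file are placeholders rather than the actual benchmark setup.

    import subprocess
    import time

    def time_import(load_command):
        """Run one bulk load and return its wall-clock time in seconds."""
        start = time.time()
        subprocess.run(load_command, check=True)
        return time.time() - start

    # Placeholder command; each store has its own loader (4store shown here).
    LOAD_COMMAND = ["4s-import", "benchmark_kb", "--format", "ntriples", "allie.n3"]

    # Load twice with the best-performing setting and report the average.
    runs = [time_import(LOAD_COMMAND) for _ in range(2)]
    print("import cost (average of 2 runs): %.1f s" % (sum(runs) / len(runs)))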

For query evaluation, we executed the whole query mix (composed of the query sequence) five times in each triple store, removed the slowest run, and took the average time of the remaining four runs. We present the five detailed timings in each database's section and the average cost in the summary section.
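
The sketch below illustrates this procedure against a generic SPARQL HTTP endpoint; the endpoint URL, the query file, and the assumption that queries are separated by blank lines are placeholders, not the actual benchmark harness.

    import time
    import urllib.parse
    import urllib.request

    ENDPOINT = "http://localhost:8080/sparql"            # hypothetical SPARQL endpoint
    QUERY_MIX = open("allie.txt").read().split("\n\n")   # assumes blank-line-separated queries

    def run_query_mix():
        """Execute the whole query mix once and return the total time in seconds."""
        start = time.time()
        for query in QUERY_MIX:
            params = urllib.parse.urlencode({"query": query})
            with urllib.request.urlopen(ENDPOINT + "?" + params) as response:
                response.read()                           # drain the result set
        return time.time() - start

    # Run the mix five times, drop the slowest run, and average the other four.
    times = sorted(run_query_mix() for _ in range(5))
    average = sum(times[:-1]) / 4.0
    print("query mix runs:", times, "average of fastest 4: %.2f s" % average)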

Attachments