survey – TogoRDF

Version 40 (updated by wu, 14 years ago)

Triple Store Survey for Life Science Data

Overview
Platform
Data
Approach
Database
- 4store => 4store
- Bigdata => Bigdata
- Mulgara => Mulgara
- Owlim-SE => OwlimSe
- Virtuoso => Virtuoso

Summary => Summarize

Overview

We present an evaluation of native triple stores on biological data. Compared with data in other areas, biological data is typically huge, so bulk-loading and query performance are essential in deciding whether a triple store can be applied in the biological field. Here we test five native triple stores (Virtuoso, OwlimSE, Mulgara, 4store, and Bigdata) with five biological data sets ranging from tens of millions to eight billion triples. We present their load times and query costs.

For each database we provide several results obtained by adjusting parameters that can significantly influence performance. However, these parameters may behave differently on different hardware and software platforms, and even with different data sets. It is difficult to test every case by adjusting and combining all the parameters for every data set, because importing our larger data sets, such as Uniprot and DDBJ, may take from over two days to several weeks. Therefore, although we try to find the best performance for each triple store, we do not guarantee that the results we provide represent it.

4store

4store is an RDF/SPARQL store written in C and designed to run on UNIX-like systems, either on single machines or on networked clusters. Please refer to http://4store.org/ for details.

Bigdata

Bigdata is designed as a distributed database architecture running over clusters of hundreds to thousands of commodity machines, but it can also run in a high-performance single-server mode. It supports RDFS and limited OWL inference. Bigdata is written in Java and is open source. Please refer to http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=GettingStarted.

OwlimSE

OwlimSE is a member of the OWLIM family, which provides native RDF engines implemented in Java and delivers full performance through both Sesame and Jena. Since version 4.3, OwlimSE supports SPARQL 1.1 Federation. It supports the semantics of RDFS, OWL 2 RL and OWL 2 QL. OwlimSE is available only under a commercial license. Please refer to http://www.ontotext.com/owlim.

Mulgara

Mulgara is written entirely in Java and is open source. It provides a shell for iTQL (Interactive Tucana Query Language), an SQL-like language for querying and updating Mulgara databases, which also support RDFS and OWL inferencing. It also provides a SPARQL query parser and query engine. Please refer to http://www.mulgara.org/.

Virtuoso

Virtuoso provides a triple storage solution for RDF on an RDBMS platform. It is a multi-protocol RDBMS that handles relational data, RDF, XML and so on. It offers stored procedures to load RDF/XML, N-Triples, and compressed triples, and it supports SPARQL. Virtuoso supports limited RDFS and OWL inferencing and can run in both standalone and cluster mode. The standalone triple store server is available under both open source and commercial licenses. Please refer to http://virtuoso.openlinksw.com/.

Triple store	Open source	Clustering	Inference	Federated query
4store	Yes	Yes	No	No
Bigdata	Yes	Yes	RDFS and limited OWL inference	Yes
Mulgara	Yes	Yes	RDFS and OWL (full ?)	No ?
OwlimSE	No	No	RDFS, OWL 2 RL and OWL 2 QL	Yes
Virtuoso	Partly	Yes	Limited RDFS and OWL	Yes

Platform

* Machine:

OS: GNU/Linux
CPU: GenuineIntel 6; model name: Intel(R) Xeon(R) CPU E5649 @ 2.53GHz; 12 cores, 24 threads with hyper-threading
Mem: 65996128 kB
Hard disk: SCSI RAID 0 (three 2 TB hard disks, two of which are used to store data)

* Software:

JDK: 1.6.0_26
4store: 1.1.4
Bigdata: RWSTORE_1_1_0
Mulgara: 2.1.13
OwlimSE: 4.3.4238
Virtuoso: 6.4 commercial

Data

We selected five real, typical biological data sets rather than synthetic data; their sizes range from 10 million to 8 billion triples. We summarize the query characteristics in => QueryCharacteristics.

Cell cycle: .rdf (RDF/XML) format, 11,315,866 triples, from http://www.semantic-systems-biology.org/. SPARQL queries: attachment:cell.txt.

Allie: .n3 format, 94,420,989 triples. SPARQL queries: attachment:allie.txt.

PDBJ: .rdf.gz format, 589,987,335 triples, 77,878 files, from ftp://ftp.pdbj.org/XML/rdf/. SPARQL queries: attachment:pdbj.txt.

The PDBJ queries are point queries that retrieve the related characteristics of a given EntryID, such as 107L. Their result sets are therefore small, but the number of joins per query is large.

Uniprot: .rdf.gz format, 4,025,881,829 triples; the three larger files are uniprot.rdf.gz, uniparc.rdf.gz and uniref.rdf.gz, from ftp://ftp.uniprot.org/pub/databases/uniprot/ (the experiment used the November 2011 version). SPARQL queries: attachment:uniprot.txt or http://beta.sparql.uniprot.org/.

DDBJ: .rdf.gz format, 7,902,743,055 triples, 330 files, from ftp://ftp.ddbj.nig.ac.jp/ddbj_database/ddbj/. SPARQL queries: attachment:ddbj.txt.

Approach

We ran each evaluation on every SPARQL endpoint at least twice to make sure that the two measured values did not differ much: |2nd-1st|/max(2nd,1st) < 0.1. (The summary currently reports the first value, because some loads are still being tested.)
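
The repeatability criterion above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample timings are hypothetical and are not taken from the survey's own scripts.

```python
def relative_difference(first: float, second: float) -> float:
    """Return |second - first| / max(second, first) for two positive timings."""
    largest = max(first, second)
    if largest <= 0:
        raise ValueError("timings must be positive")
    return abs(second - first) / largest


def runs_agree(first: float, second: float, threshold: float = 0.1) -> bool:
    """True when the two runs differ by less than the threshold (0.1 here)."""
    return relative_difference(first, second) < threshold


if __name__ == "__main__":
    # Hypothetical timings in seconds, used only to show the calculation.
    print(relative_difference(100.0, 95.0))
    print(runs_agree(100.0, 95.0))
    print(runs_agree(100.0, 80.0))
```

Dividing by the larger of the two values keeps the measure symmetric, so it does not matter which run was faster.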

We evaluated queries by executing the whole query mix (the full query sequence) five times on every SPARQL endpoint, discarding the highest time and averaging the remaining four. We report the five individual times in each database section and the average in the summary section.
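
As a minimal sketch of this averaging rule, assuming the five query-mix times are already collected in a list (the function name and the sample values are hypothetical):

```python
def mix_average(times):
    """Drop the single highest of five query-mix times and average the rest."""
    if len(times) != 5:
        raise ValueError("expected exactly five query-mix times")
    kept = sorted(times)[:-1]  # discard the slowest run
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Hypothetical times in seconds; the outlier 50 is discarded.
    print(mix_average([10, 12, 11, 50, 13]))
```

Discarding the slowest run is one simple way to keep a single outlier, such as a run that starts with a cold cache, from dominating the reported average.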

Attachments

allie.txt (2.5 KB) - added by wu 14 years ago.
pdbj.txt (2.6 KB) - added by wu 14 years ago.
cell.txt (9.2 KB) - added by wu 14 years ago.
ddbj.txt (3.0 KB) - added by wu 14 years ago.
uniprot.txt (3.5 KB) - added by wu 14 years ago.
