== Triple Store Survey for Life Science Data ==

 * [#overview Overview]
 * [#platform Platform]
 * [#data Data]
 * [#approach Approach]
 * Database
   * 4store [wiki:4store => 4store ]
   * Bigdata [wiki:bigdata => Bigdata ]
   * Mulgara [wiki:Mulgara => Mulgara ]
   * Owlim-se [wiki:OwlimSe => OwlimSe ]
   * Virtuoso [wiki:Virtuoso => Virtuoso ]
 * Summary [wiki:summarize =>Summarize]

=== Overview === #overview

We present an evaluation of native triple stores on biological data. Compared with data in other fields, biological data is typically very large, so bulk-loading and query performance are essential in deciding whether a triple store can be applied in the biological domain. Here we test five native triple stores (Virtuoso, OwlimSE, Mulgara, 4store, and Bigdata) with five biological datasets ranging from tens of millions to about eight billion triples. We present their load times and query costs. For each database we provide several results obtained by adjusting its parameters, which can influence performance significantly but behave differently on different hardware and software platforms. We do not guarantee that the results we report represent the best achievable performance of each database.

'''4store''' 4store is an RDF/SPARQL store written in C and designed to run on UNIX-like systems, either on single machines or on networked clusters.

'''OwlimSE''' OwlimSE is a member of the OWLIM family, which provides native RDF engines implemented in Java and delivers full performance through both Sesame and Jena. Since version 4.3, OwlimSE supports SPARQL 1.1 Federation. OwlimSE is available only under a commercial license.

'''Virtuoso''' Virtuoso provides a triple storage solution for RDF on an RDBMS platform. It is a multi-protocol RDBMS that handles relational data, RDF, XML, and so on. It offers stored procedures to load RDF/XML, N-Triples, and compressed triples, and it supports SPARQL. The standalone triple store server is available under both open source and commercial licenses.
=== Platform === #platform

 * Machine:
   * OS: GNU/Linux
   * CPU: GenuineIntel 6; model name: Intel(R) Xeon(R) CPU E5649 @ 2.53GHz; 12 cores, 24 threads with hyper-threading
   * Mem: 65996128 kB
   * Harddisk: SCSI RAID 0 (three 2 TB hard disks; two of them are used to store data)
 * Software:
   * JDK: 1.6.0_26
   * 4store: 1.1.4
   * Bigdata: RWSTORE_1_1_0
   * Mulgara: 2.1.12
   * OwlimSE: 4.3.4238
   * Virtuoso: 6.4 commercial

=== Data === #data

'''Cell cycle''': .rdf (RDF/XML) format, 11,315,866 triples, from [http://www.semantic-systems-biology.org/]. SPARQL queries: attachment:cell.txt .

'''Allie''': .n3 format, 94,420,989 triples. SPARQL queries: attachment:allie.txt .

'''PDBJ''': .rdf.gz format, 589,987,335 triples, 77878 files, from [ftp://ftp.pdbj.org/XML/rdf/]. SPARQL queries: attachment:pdbj.txt . The PDBJ queries are point queries that retrieve the characteristics related to a given EntryID, such as 107L. Their result sets are therefore small, but the number of joins per query is large.

'''Uniprot''': .rdf.gz format, 4,025,881,829 triples; the three largest files are uniprot.rdf.gz, uniparc.rdf.gz, and uniref.rdf.gz, from [ftp://ftp.uniprot.org/pub/databases/uniprot/] (the experiments used the November 2011 release). SPARQL queries: attachment:uniprot.txt or [http://beta.sparql.uniprot.org/].

'''DDBJ''': .rdf.gz format, 7,902,743,055 triples, 330 files, from [ftp://ftp.ddbj.nig.ac.jp/ddbj_database/ddbj/]. SPARQL queries: attachment:ddbj.txt .

=== Approach === #approach

We evaluated each dataset on every SPARQL endpoint at least twice to make sure that the two measured values did not differ much: |2nd-1st|/max(2nd,1st) < 0.1.

For query evaluation, we executed the whole query mix (the full query sequence) five times against every SPARQL endpoint, discarded the highest time, and averaged the remaining four. We report all five individual times in each database section and the average in the summary section.
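
The two rules in the approach (the 10% repeatability check and the average over five runs with the slowest one dropped) can be sketched as follows. This is a minimal illustration, not the script used for the survey: the function names `relative_difference`, `is_repeatable`, and `average_without_slowest` are hypothetical, and the timings in the usage example are made-up numbers, not measured results.

```python
def relative_difference(first, second):
    """Return |second - first| / max(second, first) for two positive timings."""
    largest = max(first, second)
    if largest <= 0:
        raise ValueError("timings must be positive")
    return abs(second - first) / largest


def is_repeatable(first, second, threshold=0.1):
    """True when two measurements differ by less than the threshold (10% by default)."""
    return relative_difference(first, second) < threshold


def average_without_slowest(times):
    """Drop the single highest timing and average the remaining ones."""
    if len(times) < 2:
        raise ValueError("need at least two timings")
    kept = sorted(times)[:-1]  # discard only the largest value
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Hypothetical timings in seconds, for illustration only.
    load_runs = (100.0, 95.0)
    query_mix_runs = [10.0, 12.0, 11.0, 30.0, 11.0]
    print("repeatable:", is_repeatable(*load_runs))
    print("average cost:", average_without_slowest(query_mix_runs))
```

Dropping only the slowest run, rather than both extremes, follows the description above; it reduces the influence of a single cold-cache or otherwise disturbed run while still using four of the five measurements.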