bigdata1.1.0 – TogoRDF

Bigdata Configuration

Journal modes in Bigdata (see http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=StandaloneGuide for details):

The WORM (Write-Once, Read-Many) store is the traditional log-structured, append-only journal. It was designed for very fast write rates and is used to buffer writes for scale-out. It is a good choice for immortal databases where people want access to ALL history. It scales to several billion triples.

The RW (Read-Write) store supports recycling of allocation slots on the backing file. It may be used as a time-bounded version of an immortal database, where history is aged off the database over time. It is a good choice for standalone workloads where updates arrive continuously and older database states may be released. The RW store is also less sensitive to data skew because it can reuse B+Tree node and leaf revisions within a commit group on large data set loads. Scaling should be better than the WORM for standalone use and could reach 10B+ triples. The default property file is attachment:RWStore.properties.

Load Performance

Approach 1:

Upload data through the Bigdata SPARQL endpoint (NanoSparqlServer), posting the data in batches of 10000 lines. See attachment:upload.pl for details.
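The batching in Approach 1 can be sketched in Python as follows. This is not the attached upload.pl, only a minimal illustration. It assumes the input is line-oriented N-Triples, so any group of lines is itself a valid document. It also assumes the endpoint accepts an HTTP POST of RDF data with the content type text/plain. The endpoint URL and the function names are hypothetical.

```python
import itertools
import urllib.request


def batches(lines, size=10000):
    """Yield lists of at most `size` lines from an iterable of lines."""
    it = iter(lines)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk


def post_batch(endpoint, chunk, content_type="text/plain"):
    """POST one batch of lines to the endpoint; return the HTTP status code.

    The content type is an assumption; adjust it to whatever the server
    expects for the RDF syntax being sent.
    """
    body = "".join(chunk).encode("utf-8")
    req = urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


def upload(path, endpoint, size=10000):
    """Read a line-oriented RDF file and POST it in batches of `size` lines."""
    with open(path, encoding="utf-8") as f:
        for chunk in batches(f, size):
            post_batch(endpoint, chunk)
```

A call might look like upload("allie.nt", "http://localhost:8080/bigdata/sparql"), where both the file name and the URL are placeholders. Each batch costs one HTTP round trip, which may partly explain why this approach is much slower than the bulk loader.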

Approach 2:

Upload with the com.bigdata.rdf.store.DataLoader tool and the default RW store parameters.

We also tested the effect of adding GC options to the JVM:

-Xmx55G -Xms30G -XX:+UseG1GC -XX:+TieredCompilation -XX:+HeapDumpOnOutOfMemoryError

Without the explicit GC option (that is, with automatic garbage collection), uploading Allie took 35 minutes.

Approach 3:

We modified the following two important parameters (the remaining tests use this configuration by default):

com.bigdata.btree.writeRetentionQueue.capacity=500000
com.bigdata.rdf.sail.BigdataSail.bufferCapacity=1000000
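One way to apply these two overrides to a copy of the default property file is sketched below. The helper apply_overrides is hypothetical and handles only plain key=value lines. It does not implement the full Java properties syntax (no line continuations, no ":" separators).

```python
def apply_overrides(properties_text, overrides):
    """Return properties text with the given keys set to new values.

    Existing `key=value` lines are rewritten in place; keys that are not
    present are appended at the end. Comments and blank lines are kept.
    """
    remaining = dict(overrides)
    out = []
    for line in properties_text.splitlines():
        stripped = line.strip()
        is_comment = stripped.startswith(("#", "!"))
        if stripped and not is_comment and "=" in stripped:
            key = stripped.split("=", 1)[0].strip()
            if key in remaining:
                line = f"{key}={remaining.pop(key)}"
        out.append(line)
    # Append any override whose key was not found in the original text.
    out.extend(f"{k}={v}" for k, v in remaining.items())
    return "\n".join(out) + "\n"


# The two values used in Approach 3.
OVERRIDES = {
    "com.bigdata.btree.writeRetentionQueue.capacity": "500000",
    "com.bigdata.rdf.sail.BigdataSail.bufferCapacity": "1000000",
}
```

Reading RWStore.properties, passing its text through apply_overrides(text, OVERRIDES), and writing the result to a new file leaves the default file untouched for comparison runs.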

Approach 4:

Split the input file into 12 smaller files.
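The page does not say how the file was split. A minimal sketch, assuming a line-oriented format such as N-Triples (RDF/XML cannot be split by line), distributes the lines round-robin over 12 output files. The function name and the output file naming are hypothetical.

```python
import os


def split_lines(path, parts=12, out_dir="."):
    """Split a line-oriented file into `parts` files, round-robin by line.

    Returns the list of output file names.
    """
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.basename(path)
    names = [os.path.join(out_dir, f"{base}.part{i:02d}") for i in range(parts)]
    outs = [open(name, "w", encoding="utf-8") for name in names]
    try:
        with open(path, encoding="utf-8") as src:
            for i, line in enumerate(src):
                outs[i % parts].write(line)
    finally:
        for out in outs:
            out.close()
    return names
```

Blank node labels are scoped to a single document, so data that relies on shared blank nodes should not be split this way without further care.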

Allie upload

Approach 1: 26 hours

Approach 2: 5.89 hours; 6.75 hours with the JVM GC options set

Approach 3: 2.61 hours with UseG1GC; 35 minutes with automatic garbage collection (vm.swappiness=10); 38 minutes with automatic garbage collection (vm.swappiness=60)

Approach 4: 1.03 hours with UseG1GC; 35 minutes with automatic garbage collection

PDBJ upload

Result: 8.95 hours with UseG1GC; 7.15 hours (429 minutes) with automatic garbage collection (vm.swappiness=10)

Uniprot upload

uniprot.rdf.gz

We first uploaded the file uniprot.rdf.gz (3.16 billion triples).

With UseG1GC: over one week (7.48 days, 646336127 ms)

INFO : 646335942 main com.bigdata.rdf.store.DataLoader.logCounters(DataLoader.java:1185): extent=249818775552, stmts=3161144450, bytes/stat=79 Wrote: 241474404352 bytes. Total elapsed=646336127ms

With automatic garbage collection: 74.3 hours (4458 minutes)

INFO : 267385254 main com.bigdata.rdf.store.DataLoader.logCounters(DataLoader.java:1185): extent=249818775552, stmts=3161144450, bytes/stat=79 Wrote: 241489608704 bytes. Total elapsed=267385843ms

uniref.rdf.gz

When adding uniref.rdf.gz, it took over 11 days to import 411800000 statements, so we stopped the load because of the poor performance.

INFO : 1032251699 main com.bigdata.rdf.store.DataLoader$2.processingNotification(DataLoader.java:1018): 411800000 stmts buffered in 1032247.664 secs, rate= 398, baseURL= http://purl.uniprot.org, totalStatementsSoFar=411800000

uniprot.rdf.gz (2nd load)

INFO : 349235249 main com.bigdata.rdf.store.DataLoader.logCounters(DataLoader.java:1185): extent=249818775552, stmts=3161144450, bytes/stat=79 Wrote: 241707974656 bytes. Total elapsed=349235466ms
INFO : 349235324 main com.bigdata.rdf.store.DataLoader.main(DataLoader.java:1545): Total elapsed=349235466ms
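The DataLoader summary lines above carry enough information to derive a load rate. The following sketch extracts the statement count and the elapsed time from such a line and computes statements per second. The regular expression is written against the log lines quoted on this page and is an assumption about their format, not a documented interface.

```python
import re

# Matches "stmts=<n>" followed later by "Total elapsed=<n>ms".
LOG_PATTERN = re.compile(r"stmts=(\d+).*?Total elapsed=(\d+)ms", re.DOTALL)


def load_rate(log_text):
    """Return (statements, elapsed_seconds, statements_per_second)."""
    m = LOG_PATTERN.search(log_text)
    if m is None:
        raise ValueError("no stmts/elapsed pair found")
    stmts = int(m.group(1))
    seconds = int(m.group(2)) / 1000.0
    return stmts, seconds, stmts / seconds


# Summary line of the first uniprot.rdf.gz load with automatic GC.
line = (
    "extent=249818775552, stmts=3161144450, bytes/stat=79 "
    "Wrote: 241489608704 bytes. Total elapsed=267385843ms"
)
stmts, seconds, rate = load_rate(line)
print(f"{stmts} statements in {seconds / 3600:.1f} h, {rate:.0f} statements/s")
```

For this run the rate comes out at roughly 11,800 statements per second, compared with the rate of 398 reported in the uniref.rdf.gz log line.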

Load performance

The results below were obtained with vm.swappiness=60 and automatic garbage collection.

Load time	Cell Cycle Ontology	Allie	PDBj	UniProt*
1st time	3 min	35 min	429 min	4458 min
2nd time	3 min	38 min	412 min	5820 min
average	3 min	37 min	421 min	5139 min

UniProt*: only uniprot.rdf.gz (3.16 billion triples) was uploaded.
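The averages in the table can be reproduced with a few lines of Python. The sketch assumes the averages were rounded half up to whole minutes, which matches the table (36.5 becomes 37, 420.5 becomes 421). Python's built-in round() rounds halves to the nearest even number and would give 36 and 420 instead.

```python
import math

# Load times in minutes (1st run, 2nd run), taken from the table above.
runs = {
    "Cell Cycle Ontology": (3, 3),
    "Allie": (35, 38),
    "PDBj": (429, 412),
    "UniProt": (4458, 5820),
}


def round_half_up(x):
    """Round a non-negative number to the nearest integer, halves upward."""
    return math.floor(x + 0.5)


averages = {}
for name, times in runs.items():
    mean = sum(times) / len(times)
    averages[name] = round_half_up(mean)
    print(f"{name}: {averages[name]} min ({mean / 60:.1f} h)")
```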

SPARQL query performance

Cell cycle query

Query\time(ms)	time 1	time 2	time 3	time 4	time 5
case1	341	353	327	328	327
case2	46	43	41	45	39
case3	3361	3039	2855	3284	3416
case4	32	21	9	22	10
case5	416	574	404	401	433
case6	1216	1295	1135	1277	1134
case7	21	21	19	23	21
case8	105	113	83	108	93
case9	44	43	44	45	40
case10	14	14	14	14	14
case11	25	29	24	29	14
case12	44	49	45	50	32
case13	7	21	19	9	18
case14	3	17	15	18	15
case15	19456	19229	18670	19016	19583
case16	X	X	X	X	X
case17	X	X	X	X	X
case18	X	X	X	X	X
case19	44	36	29	46	37

Note: X marks a query that could not be run. The count queries in cases 16, 17 and 18 are not supported.

Allie query

Query\time(ms)	time 1	time 2	time 3	time 4	time 5
case1	423	424	424	443	436
case2	4160	4200	4263	4201	4264
case3	3352	3230	2329	2329	2308
case4	568	592	92	92	97
case5	1830742	661710	39296	39296	39784

PDBJ query

Query\time(ms)	time 1	time 2	time 3	time 4	time 5
case1	751	213	213	212	213
case2	27	14	15	13	26
case3	188	56	45	53	66
case4	337	58	57	53	59
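The page does not describe how the five timings per query were collected. One possible setup, sketched here under that caveat, sends each query over the standard SPARQL protocol (HTTP GET with a query parameter) and records the wall-clock time of five consecutive runs. The function names are hypothetical, and the endpoint URL is assumed to contain no query string of its own.

```python
import time
import urllib.parse
import urllib.request


def run_sparql(endpoint, query):
    """Send one SPARQL query via HTTP GET and read the full response body."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    req = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+xml"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def time_runs(action, runs=5):
    """Call `action()` `runs` times; return elapsed times in whole ms."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        action()
        timings.append(round((time.perf_counter() - start) * 1000))
    return timings
```

A call such as time_runs(lambda: run_sparql(endpoint, query)) yields one row of a table like those above. Running the repetitions back to back means later runs may benefit from warm caches, which would be consistent with the large drop between time 1 and time 3 in Allie case5.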

Uniprot query

DDBJ query
