Version 20 (modified by wu, 12 years ago)

--

* OwlimSE configuration

* Load performance

* SPARQL query performance

OwlimSE configuration

Building indices speeds up queries but slows down updates. In OwlimSE we set cache-memory equal to tuple-index-memory; that is, we enable the POS/PSO indices but disable the PCSOT, PTSOC, SP and PO indices and full-text search. The disabled features can be switched on by setting build-pcsot, build-ptsoc and enablePredicateList to true, and ftsIndexPolicy to an appropriate value ("onCommit", "onStartup" or "onShutdown").

For more information, please refer to http://owlim.ontotext.com/display/OWLIMv43/OWLIM-SE+Configuration

Load Performance

Approach 1: the 'load' command in the Sesame console application, for files containing fewer than one billion triples.

According to the OWLIM documentation, a single large file containing a billion statements cannot be loaded with the 'load' command.

Steps (using Allie as an example):

-- create allie.ttl template:
[togordf@ts01 ~]$ ls ~/.aduna/openrdf-sesame-console/templates/
allie.ttl
---in openrdf-console directory
[togordf@ts01 ~]$ ./console.sh
18:12:24.166 [main] DEBUG info.aduna.platform.PlatformFactory - os.name = linux
18:12:24.171 [main] DEBUG info.aduna.platform.PlatformFactory - Detected Posix platform
Connected to default data directory
Commands end with '.' at the end of a line
Type 'help.' for help
> connect "http://localhost:8080/openrdf-sesame".
Disconnecting from default data directory
Connected to http://localhost:8080/openrdf-sesame
> help create.
Usage:
create <template-name>
  <template-name>   The name of a repository configuration template
> create allie.
> open allie.
Opened repository 'allie'
allie> load $PathOfData

The OWLIM documentation ( http://owlim.ontotext.com/display/OWLIMv40/OWLIM-SE+Administrative+Tasks ) states: "In general RDF data can be loaded into a given Sesame repository using the 'load' command in the Sesame console application or directly through the workbench web application. However, neither of these approaches will work when using a very large number of triples, e.g. a billion statements. A common solution would be to convert the RDF data into a line-based RDF format (e.g. N-triples) and then split it into many smaller files (e.g. using the linux command 'split'). This would allow each file to be uploaded separately using either the console or workbench applications."
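The splitting step described above can also be done without the 'split' command. The following is a minimal sketch, assuming the data is already in a line-based format such as N-Triples (one statement per line). The class name NTriplesSplitter and the part-NNNNN.nt file naming are our own illustration and not part of OWLIM or Sesame.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of OWLIM or Sesame.
class NTriplesSplitter {

    /** Copies the input into files of at most linesPerPart lines each and returns them in order. */
    static List<Path> split(Path input, Path outDir, long linesPerPart) throws IOException {
        if (linesPerPart <= 0) {
            throw new IllegalArgumentException("linesPerPart must be positive");
        }
        Files.createDirectories(outDir);
        List<Path> parts = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            BufferedWriter out = null;
            long linesInPart = 0;
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    // Start a new part for the first line and whenever the current part is full.
                    if (out == null || linesInPart == linesPerPart) {
                        if (out != null) {
                            out.close();
                        }
                        Path part = outDir.resolve(String.format("part-%05d.nt", parts.size()));
                        parts.add(part);
                        out = Files.newBufferedWriter(part, StandardCharsets.UTF_8);
                        linesInPart = 0;
                    }
                    out.write(line);
                    out.newLine();
                    linesInPart++;
                }
            } finally {
                if (out != null) {
                    out.close();
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        // Arguments: <input.nt> <output directory> <lines per part>
        List<Path> parts = split(Paths.get(args[0]), Paths.get(args[1]), Long.parseLong(args[2]));
        System.out.println("Wrote " + parts.size() + " files");
    }
}
```

Each resulting file can then be loaded separately with the console 'load' command. Splitting on line boundaries is only safe for line-based formats; a Turtle or RDF/XML file would have to be converted first.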

Approach 2:

The idea comes from UniProt, which uses OWLIM as a library, as follows:

They have a dedicated loader program in which one Java thread reads the triples into a blocking queue. A number of other threads then take triples from that queue and insert them into OWLIM-SE (or any other triplestore compatible with the Sesame API), normally one inserting thread per OWLIM file-repository fragment. The inserter threads use transactions that commit every half a million statements. The basic principle is to add statements, not files.

// Take one statement from the blocking queue filled by the other thread
final org.openrdf.model.Statement sesameStatement = getSesameStatement(object);
connection.add(sesameStatement, graph);
// and at every millionth statement:
connection.commit();

(Please refer to  https://github.com/JervenBolleman/sesame-loader/ for details)
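The reader/inserter pattern described above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the sesame-loader code: StatementSink is a hypothetical stand-in for a Sesame connection (its add and commit mirror the calls quoted above), statements are plain strings, and the queue capacity of 10,000 is arbitrary.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

class QueueLoaderSketch {

    /** Hypothetical stand-in for a repository connection; not a Sesame type. */
    interface StatementSink {
        void add(String statement);
        void commit();
    }

    // End-of-input marker, compared by identity so it cannot clash with real data.
    private static final String END = new String("END");

    /**
     * The calling thread acts as the reader and fills the queue; one inserter
     * thread per sink drains it and commits every commitEvery statements.
     */
    static void load(Iterable<String> statements, List<StatementSink> sinks, int commitEvery)
            throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);
        Thread[] inserters = new Thread[sinks.size()];
        for (int i = 0; i < inserters.length; i++) {
            StatementSink sink = sinks.get(i);
            inserters[i] = new Thread(() -> {
                long pending = 0;
                try {
                    for (String s = queue.take(); s != END; s = queue.take()) {
                        sink.add(s);
                        if (++pending == commitEvery) {
                            sink.commit();
                            pending = 0;
                        }
                    }
                    if (pending > 0) {
                        sink.commit(); // flush the last partial batch
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            inserters[i].start();
        }
        for (String s : statements) {
            queue.put(s); // blocks while the queue is full
        }
        for (int i = 0; i < inserters.length; i++) {
            queue.put(END); // one marker per inserter thread
        }
        for (Thread t : inserters) {
            t.join();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicLong added = new AtomicLong();
        StatementSink counting = new StatementSink() {
            public void add(String statement) { added.incrementAndGet(); }
            public void commit() { }
        };
        List<String> data = Collections.nCopies(2_500, "<s> <p> <o> .");
        load(data, Arrays.asList(counting, counting, counting), 1_000);
        System.out.println("added " + added.get() + " statements");
    }
}
```

The bounded queue keeps the reader from running far ahead of the inserters, so memory use stays roughly constant however large the input is. A real loader would also need to handle an inserter that fails, which this sketch leaves out.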

Allie upload

Approach 1: 38 minutes

Approach 2: 28 minutes

PDBj upload

Approach 2: 197 minutes

UniProt upload

With the following JVM options and 12 import threads, the load took 68 hours and 29 minutes:

-Xmx60G -Xms30G -Druleset=empty -Dentity-index-size=675000000 -Dcache-memory=20633m -DenablePredicateList=false -Dtuple-index-memory=20633m -DftsIndexPolicy=never -Dbuild-pcsot=false -Dbuild-ptsoc=false -Djournaling=true -Drepository-type=file-repository -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -Dentity-id-size=32

With the following JVM options and 3 import threads, the load took 59 hours and 15 minutes:

-Xmx60G -Xms30G -Druleset=empty -Dentity-index-size=800000000 -Dcache-memory=20000m -DenablePredicateList=false -Dtuple-index-memory=20000m -DftsIndexPolicy=never -Dbuild-pcsot=false -Dbuild-ptsoc=false -Djournaling=true -Drepository-type=file-repository -XX:+HeapDumpOnOutOfMemoryError -Dentity-id-size=32

DDBJ upload

With the following JVM options and 12 import threads, the load ran for 128 hours and 15 minutes and hit an out-of-memory error at the end:

-Xmx60G -Xms30G -Druleset=empty -Dentity-index-size=675000000 -Dcache-memory=20633m -DenablePredicateList=false -Dtuple-index-memory=20633m -DftsIndexPolicy=never -Dbuild-pcsot=false -Dbuild-ptsoc=false -Djournaling=true -Drepository-type=file-repository -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -Dentity-id-size=32

Restoring the data so that the database could be used took another 54 hours and 53 minutes.

With the following JVM options and 3 import threads, DDBJ was imported successfully in 82 hours and 2 minutes:

-Xmx60G -Xms30G -Druleset=empty -Dentity-index-size=800000000 -Dcache-memory=20000m -DenablePredicateList=false -Dtuple-index-memory=20000m -DftsIndexPolicy=never -Dbuild-pcsot=false -Dbuild-ptsoc=false -Djournaling=true -Drepository-type=file-repository -XX:+HeapDumpOnOutOfMemoryError -Dentity-id-size=32

Load time   Cell Cycle Ontology   Allie     PDBj       UniProt     DDBJ
1st run     3 min                 28 min    197 min    3555 min    4922 min
2nd run     3 min                 30 min    219 min
average     3 min                 29 min    208 min

Before the failure, Owlim had loaded 7,883,140,000 triples in 70.5 hours.
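To compare the figures above, it helps to put them in the same units. The sketch below converts the reported hour/minute durations to minutes and derives an average load rate from the triple count and elapsed time stated for the failed run. The class and method names are ours; the input numbers are the ones reported on this page.

```java
// Hypothetical helper for illustration; the inputs are the figures reported above.
class LoadRate {

    /** Converts a duration given as hours and minutes to whole minutes. */
    static long toMinutes(long hours, long minutes) {
        return hours * 60 + minutes;
    }

    /** Average number of triples loaded per second over the given number of hours. */
    static double triplesPerSecond(long triples, double hours) {
        return triples / (hours * 3600.0);
    }

    public static void main(String[] args) {
        System.out.println(toMinutes(59, 15)); // UniProt, 3 threads: 3555
        System.out.println(toMinutes(82, 2));  // DDBJ, 3 threads: 4922
        // 7,883,140,000 triples in 70.5 hours: about 31060 triples per second
        System.out.println(Math.round(triplesPerSecond(7_883_140_000L, 70.5)));
    }
}
```

The two converted durations match the UniProt and DDBJ entries in the load time table, and the rate works out to roughly 31,000 triples per second averaged over the 70.5 hours.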

SPARQL query performance

Cell cycle query

Query\time (ms)  time 1   time 2   time 3   time 4   time 5
case1            111      116      109      109      112
case2            6        6        6        6        6
case3            2        2        2        2        2
case4            156      148      151      148      149
case5            416      574      404      401      433
case6            2182     2120     1940     2245     2040
case7            2        2        3        2        6
case8            33       33       33       32       33
case9            23       23       20       22       22
case10           0        0        0        0        0
case11           6        6        6        6        6
case12           6        7        6        6        7
case13           2        2        2        2        2
case14           0        0        0        0        0
case15           46043    46334    45843    46294    47640
case16           X        X        X        X        X
case17           X        X        X        X        X
case18           X        X        X        X        X
case19           13       14       13       14       14

Note: the count queries in cases 16, 17 and 18 are not supported (marked X).

Allie query

Query\time (ms)  time 1   time 2   time 3   time 4   time 5
case1            149      138      147      152      144
case2            2036     1954     2049     1959     1971
case3            1520     1484     1464     1467     1490
case4            36       37       40       38       41
case5            380858   67225    69009    68948    68296

PDBj query

Query\time (ms)  time 1   time 2   time 3   time 4   time 5
case1            52       61       55       53       50
case2            1        1        1        1        1
case3            188      191      204      203      182
case4            4        4        4        4        4

UniProt query

Query\time (ms)  time 1   time 2   time 3   time 4   time 5
case1            305      295      405      864      711
case2            349      400      312      470      898
case3            440      460      674      500      1049
case4            15       200      170      201      172
case5            20       22       20       22       77
case6            850266   605532   650282   645702   612007
case7            1138731  446141   584173   223218   482121
case8            13449    13617    502      482      13262
case9            3430     3166     673      639      3214
case10           127019   113550   958      1085     119581
case11           6669     6287     179      142      6455
case12           266      205      39       10       213
case13           32       29       6        6        45
case14           42       41       45       45       40
case15           29112    38094    38291    34950    67722
case16           378191   372805   375879   274524   265025
case17           6163     5948     5828     5916     5808
case18           83955    8942     8842     9025     8792

DDBJ query

Query\time (ms)  time 1   time 2   time 3   time 4   time 5
case1            26500    25588    17118    16823    15064
case2            3400     3437     3136     3203     3365
case3            3874     3923     3556     3643     3765
case4            237      104      53       52       118
case5            247      83       61       86       110
case6            109      129      144      112      104
case7            7871     7646     3990     5923     4577
case8            16278    14020    6991     11214    9645
case9            3640     2824     1605     2314     1656
case10           1        1        1        1        1
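Some rows in the tables above vary a lot between the five runs (for example UniProt case8), so a single number per query is easier to compare when both the mean and the median are given. The following sketch computes both for one row. It is our own illustration and was not part of the original measurements; the class name TimingSummary is hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of the original benchmark.
class TimingSummary {

    /** Arithmetic mean of the run times in milliseconds. */
    static double mean(long[] ms) {
        long sum = 0;
        for (long v : ms) {
            sum += v;
        }
        return (double) sum / ms.length;
    }

    /** Median of the run times; for an even count, the mean of the two middle values. */
    static double median(long[] ms) {
        long[] sorted = ms.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int n = sorted.length;
        return (n % 2 == 1) ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    public static void main(String[] args) {
        long[] uniprotCase8 = {13449, 13617, 502, 482, 13262};
        System.out.println(mean(uniprotCase8));   // 8262.4
        System.out.println(median(uniprotCase8)); // 13262.0
    }
}
```

For this row the mean (8262.4 ms) falls between the two groups of timings and matches none of the runs, while the median (13262 ms) reports the more common slow case.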