OwlimSE Configuration
It is well known that building indices speeds up queries but slows down updates. In OwlimSE, we set cache-memory equal to tuple-index-memory; that is, we enable the POS/PSO indices but disable the PCSOT, PTSOC, SP and PO indices as well as full-text search. These can be enabled by setting build-pcsot, build-ptsoc, ftsIndexPolicy and enablePredicateList to true or to an appropriate value ("onCommit", "onStartup" or "onShutdown" for ftsIndexPolicy).
For more information, please refer to http://owlim.ontotext.com/display/OWLIMv43/OWLIM-SE+Configuration
Load Performance
Approach 1: the 'load' command in the Sesame console application, for files containing fewer than one billion triples.
According to the OWLIM documentation, a single large file with a billion statements cannot be loaded with the 'load' command.
Steps (using Allie as an example):

-- create the allie.ttl template:
[togordf@ts01 ~]$ ls ~/.aduna/openrdf-sesame-console/templates/
allie.ttl

-- in the openrdf-console directory:
[togordf@ts01 ~]$ ./console.sh
18:12:24.166 [main] DEBUG info.aduna.platform.PlatformFactory - os.name = linux
18:12:24.171 [main] DEBUG info.aduna.platform.PlatformFactory - Detected Posix platform
Connected to default data directory
Commands end with '.' at the end of a line
Type 'help.' for help
> connect "http://localhost:8080/openrdf-sesame".
Disconnecting from default data directory
Connected to http://localhost:8080/openrdf-sesame
> help create.
Usage:
create <template-name>
  <template-name>   The name of a repository configuration template
> create allie.
> open allie.
Opened repository 'allie'
uniprot> load $PathOfData
The OWLIM-SE documentation (http://owlim.ontotext.com/display/OWLIMv40/OWLIM-SE+Administrative+Tasks) states:
"In general RDF data can be loaded into a given Sesame repository using the 'load' command in the Sesame console application or directly through the workbench web application. However, neither of these approaches will work when using a very large number of triples, e.g. a billion statements. A common solution would be to convert the RDF data into a line-based RDF format (e.g. N-triples) and then split it into many smaller files (e.g. using the linux command 'split'). This would allow each file to be uploaded separately using either the console or workbench applications."
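The splitting step described above can also be done without the 'split' command. The following is a minimal sketch, assuming the data is already in a line-based format such as N-Triples, where every line is a complete statement. The class name NTriplesSplitter, the chunk file naming and the command-line arguments are hypothetical and not part of OWLIM or Sesame.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a line-based RDF file (e.g. N-Triples) into smaller files. */
final class NTriplesSplitter {

    /**
     * Copies the input into chunk files of at most linesPerChunk lines each
     * and returns the chunk paths in order. Cutting at line boundaries is only
     * safe for line-based formats, where each line is one complete statement.
     */
    static List<Path> split(Path input, Path outputDir, long linesPerChunk) {
        if (linesPerChunk <= 0) {
            throw new IllegalArgumentException("linesPerChunk must be positive");
        }
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            Files.createDirectories(outputDir);
            BufferedWriter writer = null;
            long linesInChunk = 0;
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Start a new chunk file at the beginning and whenever the current one is full.
                    if (writer == null || linesInChunk == linesPerChunk) {
                        if (writer != null) {
                            writer.close();
                        }
                        Path chunk = outputDir.resolve(String.format("chunk-%05d.nt", chunks.size()));
                        chunks.add(chunk);
                        writer = Files.newBufferedWriter(chunk, StandardCharsets.UTF_8);
                        linesInChunk = 0;
                    }
                    writer.write(line);
                    writer.newLine();
                    linesInChunk++;
                }
            } finally {
                if (writer != null) {
                    writer.close();
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return chunks;
    }

    public static void main(String[] args) {
        if (args.length != 3) {
            System.err.println("usage: NTriplesSplitter <input.nt> <output-dir> <lines-per-chunk>");
            return;
        }
        split(Paths.get(args[0]), Paths.get(args[1]), Long.parseLong(args[2]))
                .forEach(System.out::println);
    }
}
```

Each resulting chunk can then be loaded separately with the console 'load' command or through the workbench.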
Approach 2: a custom multi-threaded loader that adds statements through the Sesame API.
The idea comes from UniProt, which uses OWLIM as a library, as follows:
They have a dedicated loader program in which one Java thread reads the triples into a blocking queue. A number of other threads then take triples from that queue and insert them into OWLIM-SE (or any other triplestore compatible with the Sesame API), normally one inserter thread per OWLIM file-repository fragment. The inserter threads use transactions that commit every half a million statements. The basic idea is to add statements, not files.
// Take one statement from the blocking queue filled by the reader thread
final org.openrdf.model.Statement sesameStatement = getSesameStatement(object);
// Add it to the target graph within the current transaction
connection.add(sesameStatement, graph);
// On every millionth statement, commit the transaction
connection.commit();
(Please refer to https://github.com/JervenBolleman/sesame-loader/ for details)
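The queue-based design described above can be sketched with the Java standard library alone. This is a simplified illustration under stated assumptions, not the code of the sesame-loader project: StatementSink is a hypothetical stand-in for a Sesame connection, statements are plain strings, the calling thread plays the role of the reader thread, and error handling for a failing sink is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Sketch of a reader feeding several inserter threads through a blocking queue. */
final class QueueLoader {

    /** Hypothetical stand-in for a repository connection (add + commit). */
    interface StatementSink {
        void add(String statement);
        void commit();
    }

    // Unique end-of-input marker, compared by identity, one per inserter thread.
    private static final String POISON = new String("end-of-input");

    /** Loads all statements using one inserter thread per sink; returns the number added. */
    static long load(Iterable<String> statements, List<? extends StatementSink> sinks, int commitEvery) {
        if (commitEvery <= 0) {
            throw new IllegalArgumentException("commitEvery must be positive");
        }
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);
        AtomicLong loaded = new AtomicLong();
        List<Thread> inserters = new ArrayList<>();
        for (StatementSink sink : sinks) {
            Thread inserter = new Thread(() -> {
                long pending = 0;
                try {
                    for (String s = queue.take(); s != POISON; s = queue.take()) {
                        sink.add(s);
                        loaded.incrementAndGet();
                        // Commit in batches instead of once per statement.
                        if (++pending == commitEvery) {
                            sink.commit();
                            pending = 0;
                        }
                    }
                    if (pending > 0) {
                        sink.commit(); // flush the last partial batch
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            inserter.start();
            inserters.add(inserter);
        }
        try {
            for (String s : statements) {
                queue.put(s); // blocks while the queue is full
            }
            for (int i = 0; i < inserters.size(); i++) {
                queue.put(POISON);
            }
            for (Thread inserter : inserters) {
                inserter.join();
            }
        } catch (InterruptedException e) {
            for (Thread inserter : inserters) {
                inserter.interrupt();
            }
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while loading", e);
        }
        return loaded.get();
    }
}
```

The bounded queue keeps the reader from running far ahead of the inserters, and batching the commits is what keeps the transaction overhead low. In a real loader each sink would wrap its own repository connection.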
Allie upload
Approach 1: 38 minutes
Approach 2: 28 minutes
PDBJ upload
Approach 2: 197 minutes
Uniprot upload
With vm.swappiness=60 and the following JVM options:
-Xmx60G -Xms30G -Druleset=empty -Dentity-index-size=675000000 -Dcache-memory=20633m -DenablePredicateList=false -Dtuple-index-memory=20633m -DftsIndexPolicy=never -Dbuild-pcsot=false -Dbuild-ptsoc=false -Djournaling=true -Drepository-type=file-repository -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -Dentity-id-size=32
We used 12 threads for the import, which took 68 hours and 29 minutes.
With vm.swappiness=60 and the following JVM options:
-Xmx60G -Xms30G -Druleset=empty -Dentity-index-size=800000000 -Dcache-memory=20000m -DenablePredicateList=false -Dtuple-index-memory=20000m -DftsIndexPolicy=never -Dbuild-pcsot=false -Dbuild-ptsoc=false -Djournaling=true -Drepository-type=file-repository -XX:+HeapDumpOnOutOfMemoryError -Dentity-id-size=32
We used 3 threads for the import, which took 59 hours and 15 minutes.
DDBJ upload
With the following JVM options:
-Xmx60G -Xms30G -Druleset=empty -Dentity-index-size=675000000 -Dcache-memory=20633m -DenablePredicateList=false -Dtuple-index-memory=20633m -DftsIndexPolicy=never -Dbuild-pcsot=false -Dbuild-ptsoc=false -Djournaling=true -Drepository-type=file-repository -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -Dentity-id-size=32
With 12 import threads, the import ran for 128 hours and 15 minutes and ended with an OutOfMemory error. Restoring the data so that the database could be used took another 54 hours and 53 minutes.
With the following JVM options:
-Xmx60G -Xms30G -Druleset=empty -Dentity-index-size=800000000 -Dcache-memory=20000m -DenablePredicateList=false -Dtuple-index-memory=20000m -DftsIndexPolicy=never -Dbuild-pcsot=false -Dbuild-ptsoc=false -Djournaling=true -Drepository-type=file-repository -XX:+HeapDumpOnOutOfMemoryError -Dentity-id-size=32
With 3 import threads, DDBJ was imported successfully in 82 hours and 2 minutes.
With vm.swappiness=10, it took 49 hours and 12 minutes.
Load performance
Load time | Cell Cycle Ontology | Allie | PDBj | UniProt | DDBJ |
1st time | 3 min | 28 min | 197 min | 59 h 15 min | 49 h 12 min |
2nd time | 3 min | 30 min | 219 min | 50 h 26 min | |
Average | 3 min | 29 min | 208 min | 49 h 49 min | |
Up to the point of failure, OWLIM had loaded 7,883,140,000 triples in 70.5 hours.
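As a rough guide, these two figures imply the following average load rate. This is a back-of-the-envelope estimate that assumes the rate was constant over the whole run, which the source does not state.

```latex
\frac{7\,883\,140\,000\ \text{triples}}{70.5\ \text{h} \times 3600\ \text{s/h}}
  = \frac{7\,883\,140\,000}{253\,800}\ \text{triples/s}
  \approx 31\,060\ \text{triples/s}
```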
SPARQL query performance
Cell cycle query
Query\time(ms) | time 1 | time 2 | time 3 | time 4 | time 5 |
case1 | 111 | 116 | 109 | 109 | 112 |
case2 | 6 | 6 | 6 | 6 | 6 |
case3 | 2 | 2 | 2 | 2 | 2 |
case4 | 156 | 148 | 151 | 148 | 149 |
case5 | 416 | 574 | 404 | 401 | 433 |
case6 | 2182 | 2120 | 1940 | 2245 | 2040 |
case7 | 2 | 2 | 3 | 2 | 6 |
case8 | 33 | 33 | 33 | 32 | 33 |
case9 | 23 | 23 | 20 | 22 | 22 |
case10 | 0 | 0 | 0 | 0 | 0 |
case11 | 6 | 6 | 6 | 6 | 6 |
case12 | 6 | 7 | 6 | 6 | 7 |
case13 | 2 | 2 | 2 | 2 | 2 |
case14 | 0 | 0 | 0 | 0 | 0 |
case15 | 46043 | 46334 | 45843 | 46294 | 47640 |
case16 | X | X | X | X | X |
case17 | X | X | X | X | X |
case18 | X | X | X | X | X |
case19 | 13 | 14 | 13 | 14 | 14 |
Note: the COUNT queries in cases 16, 17 and 18 are not supported (marked X).
Allie query
Query\time(ms) | time 1 | time 2 | time 3 | time 4 | time 5 |
case1 | 149 | 138 | 147 | 152 | 144 |
case2 | 2036 | 1954 | 2049 | 1959 | 1971 |
case3 | 1520 | 1484 | 1464 | 1467 | 1490 |
case4 | 36 | 37 | 40 | 38 | 41 |
case5 | 380858 | 67225 | 69009 | 68948 | 68296 |
PDBJ query
Query\time(ms) | time 1 | time 2 | time 3 | time 4 | time 5 |
case1 | 52 | 61 | 55 | 53 | 50 |
case2 | 1 | 1 | 1 | 1 | 1 |
case3 | 188 | 191 | 204 | 203 | 182 |
case4 | 4 | 4 | 4 | 4 | 4 |
Uniprot query
Query\time(ms) | time 1 | time 2 | time 3 | time 4 | time 5 |
case1 | 305 | 295 | 405 | 864 | 711 |
case2 | 349 | 400 | 312 | 470 | 898 |
case3 | 440 | 460 | 674 | 500 | 1049 |
case4 | 15 | 200 | 170 | 201 | 172 |
case5 | 20 | 22 | 20 | 22 | 77 |
case6 | 850266 | 605532 | 650282 | 645702 | 612007 |
case7 | 1138731 | 446141 | 584173 | 223218 | 482121 |
case8 | 13449 | 13617 | 502 | 482 | 13262 |
case9 | 3430 | 3166 | 673 | 639 | 3214 |
case10 | 127019 | 113550 | 958 | 1085 | 119581 |
case11 | 6669 | 6287 | 179 | 142 | 6455 |
case12 | 266 | 205 | 39 | 10 | 213 |
case13 | 32 | 29 | 6 | 6 | 45 |
case14 | 42 | 41 | 45 | 45 | 40 |
case15 | 29112 | 38094 | 38291 | 34950 | 67722 |
case16 | 378191 | 372805 | 375879 | 274524 | 265025 |
case17 | 6163 | 5948 | 5828 | 5916 | 5808 |
case18 | 83955 | 8942 | 8842 | 9025 | 8792 |
DDBJ query
Query\time(ms) | time 1 | time 2 | time 3 | time 4 | time 5 |
case1 | 26500 | 25588 | 17118 | 16823 | 15064 |
case2 | 3400 | 3437 | 3136 | 3203 | 3365 |
case3 | 3874 | 3923 | 3556 | 3643 | 3765 |
case4 | 237 | 104 | 53 | 52 | 118 |
case5 | 247 | 83 | 61 | 86 | 110 |
case6 | 109 | 129 | 144 | 112 | 104 |
case7 | 7871 | 7646 | 3990 | 5923 | 4577 |
case8 | 16278 | 14020 | 6991 | 11214 | 9645 |
case9 | 3640 | 2824 | 1605 | 2314 | 1656 |
case10 | 1 | 1 | 1 | 1 | 1 |