* Virtuoso configuration

* Load performance

* SPARQL query performance

Virtuoso configuration

About the Virtuoso index scheme:

The index scheme consists of the following indices:

  • PSOG - primary key.
  • POGS - bitmap index for lookups on object value.
  • SP - partial index for cases where only S is specified.
  • OP - partial index for cases where only O is specified.
  • GS - partial index for cases where only G is specified.

* NumberOfBuffers: the number of buffers Virtuoso uses to cache database pages in RAM. This has a critical performance impact, so the value should be fairly high for large databases. Setting it beyond what physical memory can hold will have a significant negative impact. For a database-only server, about 65% of available RAM can be allocated to database buffers. Each buffer caches one 8K page of data and occupies approximately 8700 bytes of memory.

* MaxCheckpointRemap: to avoid out-of-memory errors, make sure that the parameters NumberOfBuffers and MaxCheckpointRemap are not set to the same value.

* AsyncQueueMaxThreads: the size of a pool of extra threads that can be used for query parallelization. This should be set to either 1.5 * the number of cores or 1.5 * the number of core threads; test which works better.

* ThreadsPerQuery: the maximum number of threads a single query will take. This should be set to either the number of cores or the number of core threads; test which works better.

* IndexTreeMaps: the number of mutexes over which control for buffering an index tree is split. This can generally be left at the default (256 in normal operation; valid settings are powers of 2 from 2 to 1024), but setting it to 64, 128, or 512 may be beneficial. A low number leads to frequent contention; values of 64 and above show little contention.

* ThreadCleanupInterval and ResourcesCleanupInterval: set both to 1 to reduce memory leakage.

NumberOfBuffers          = 6500000
MaxDirtyBuffers          = 5000000
MaxCheckpointRemap       = 1000000
AsyncQueueMaxThreads     = 18
ThreadsPerQuery          = 18
IndexTreeMaps            = 512
ThreadCleanupInterval    = 1
ResourcesCleanupInterval = 1
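
The sizing rules described above can be turned into a small calculation. The following is only a rough sketch in Python: the function name `suggest_settings` and the example machine (64 GiB of RAM, 12 cores) are assumptions made for illustration, and the results are starting points to tune, not the values used in this test.

```python
def suggest_settings(ram_bytes: int, cores: int,
                     ram_fraction: float = 0.65,
                     bytes_per_buffer: int = 8700) -> dict:
    """Derive rough starting values from the rules of thumb above.

    ram_fraction:     share of RAM given to database buffers (about 65%).
    bytes_per_buffer: approximate memory cost of one 8K buffer (about 8700 bytes).
    """
    number_of_buffers = int(ram_bytes * ram_fraction) // bytes_per_buffer
    return {
        "NumberOfBuffers": number_of_buffers,
        # 1.5 * cores; 1.5 * core threads is the alternative to try.
        "AsyncQueueMaxThreads": int(1.5 * cores),
        # Number of cores; the number of core threads is the alternative to try.
        "ThreadsPerQuery": cores,
    }


if __name__ == "__main__":
    # Hypothetical machine: 64 GiB of RAM and 12 cores.
    for name, value in suggest_settings(64 * 1024**3, 12).items():
        print(f"{name:<24} = {value}")
```

The output uses the same `name = value` layout as the settings listed above, so it can be compared with them line by line.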

Please refer to attachment:virtuoso.ini.2 for the detailed parameters used in the test.

For more information, please refer to http://docs.openlinksw.com/virtuoso/databaseadmsrv.html and http://www.openlinksw.com/weblog/oerling/?id=1665

As a general recommendation for the Virtuoso 6.x release, 16 GB of memory is required per billion triples.
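
As a worked example of the 16 GB per billion triples guideline, the sketch below scales it linearly to a given triple count. The linear scaling and the function name `estimated_memory_gb` are assumptions for illustration; the triple count is the Allie data set described in the load section.

```python
GB_PER_BILLION_TRIPLES = 16  # guideline quoted for Virtuoso 6.x


def estimated_memory_gb(triples: int) -> float:
    """Scale the per-billion guideline linearly to `triples` (an assumption)."""
    return triples / 1_000_000_000 * GB_PER_BILLION_TRIPLES


# Allie data set: 94,420,989 triples -> roughly 1.51 GB
print(round(estimated_memory_gb(94_420_989), 2))
```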

Load Performance

Allie upload

Data: 94,420,989 triples, N3 format.

* Approach 1:

Load the big file in one stream.

Result:

2 hours.

Step:

$ nohup $VIRTUOSO_HOME/bin/isql 1111 dba dba <$VIRTUOSO_HOME/scripts/load.list.isql &

load.list.isql script:

log_enable (2);
DB.DBA.TTLP (file_to_string_output('file path'),' ','http://mydbcls.jp/', 0);
checkpoint;

* Approach 2:

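Use one stream per core (not per core thread). Split the big file into 12 smaller files with #linesPerFile = #totalLines / 12. Precisely, this yields 13 (12 + 1) files, because the lines left over after the integer division go into one extra file.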
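
The split can be sketched as follows. This is a minimal Python illustration, not the tool used in the test: the function name `split_by_lines` and the `part-NN.rdf.nt` file names are hypothetical, and splitting on line boundaries is only safe when every statement sits on its own line (as in N-Triples) and the file has no prefix declarations.

```python
from pathlib import Path
from typing import List


def split_by_lines(source: Path, out_dir: Path, parts: int = 12) -> List[Path]:
    """Split `source` into files of (total_lines // parts) lines each.

    If the line count is not a multiple of `parts`, the leftover lines
    land in one extra file, which is how 12 parts can become 13 files.
    """
    with source.open("r", encoding="utf-8") as f:
        total_lines = sum(1 for _ in f)
    lines_per_file = max(1, total_lines // parts)

    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    out = None
    with source.open("r", encoding="utf-8") as f:
        for index, line in enumerate(f):
            # Start a new output file every `lines_per_file` lines.
            if index % lines_per_file == 0:
                if out is not None:
                    out.close()
                path = out_dir / f"part-{len(written):02d}.rdf.nt"
                written.append(path)
                out = path.open("w", encoding="utf-8")
            out.write(line)
    if out is not None:
        out.close()
    return written
```

For 94,420,989 lines and 12 parts, this gives 7,868,415 lines per file and a 13th file with the remaining 9 lines, assuming one triple per line.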

Result: 46 min 22 s.

Step 1. Register the files in the load list with ld_dir:

$ nohup $VIRTUOSO_HOME/bin/isql 1111 dba dba <$VIRTUOSO_HOME/scripts/load.list.isql &

load.list.isql script:

 delete from load_list;
 ld_dir('data directory','*.rdf.nt','http://allie.dbcls.jp/');
 select * from load_list;

Step 2. Load the files into Virtuoso:

$ nohup $VIRTUOSO_HOME/bin/isql 1111 dba dba <$VIRTUOSO_HOME/scripts/load.data.isql &

load.data.isql script:

-- Record CPU time
select getrusage ()[0] + getrusage ()[1];

rdf_loader_run () &
-- ... (10 more identical rdf_loader_run () & lines omitted)
rdf_loader_run () &
checkpoint;

-- Record CPU time
select getrusage ()[0] + getrusage ()[1];

The following uploads all use approach 2, with 12 streams loading the data.
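
To put the two Allie results in perspective, the sketch below converts them to load throughput. It is an illustrative calculation only: the function name `triples_per_second` is an assumption, and the approach 1 time is taken as exactly 2 hours even though it was reported as a rounded figure.

```python
ALLIE_TRIPLES = 94_420_989


def triples_per_second(triples: int, hours: int = 0,
                       minutes: int = 0, seconds: int = 0) -> float:
    """Average load rate over the whole run."""
    elapsed = hours * 3600 + minutes * 60 + seconds
    return triples / elapsed


single = triples_per_second(ALLIE_TRIPLES, hours=2)                  # approach 1
parallel = triples_per_second(ALLIE_TRIPLES, minutes=46, seconds=22)  # approach 2

print(f"approach 1: {single:,.0f} triples/s")
print(f"approach 2: {parallel:,.0f} triples/s")
print(f"speed-up:   {parallel / single:.2f}x")
```

Under these assumptions, the 12-stream load runs at roughly 2.6 times the rate of the single-stream load.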

Cell Cycle upload

Result: 4 min.

PDBj upload

Result: 103 min 31 s.

UniProt upload

vm.swappiness = 60: 71 h 58 min

vm.swappiness = 10: 42 h 43 min

DDBJ upload

vm.swappiness = 10: 78 h 8 min

Load performance

We loaded all the data twice with the lowest-cost (fastest) configuration.

Load time   Cell Cycle Ontology   Allie    PDBj      UniProt       DDBJ
1st run     4 min                 46 min   103 min   42 h 43 min   78 h 8 min
2nd run     4 min                 48 min   81 min    40 h 12 min   80 h 30 min
Average     4 min                 47 min   92 min    41 h 28 min   79 h 19 min

SPARQL query performance

Cell cycle query

Query\time (ms) time 1 time 2 time 3 time 4 time 5
case1 23 23 24 24 24
case2 2 2 2 2 2
case3 22368 23440 22961 23655 23655
case4 2 3 10 4 4
case5 43265 42911 42172 42459 42459
case6 13062 13057 13069 13102 13102
case7 3 3 3 12 12
case8 7683 7479 7455 7656 7656
case9 40 38 36 51 51
case10 1 8 1 3 3
case11 120 119 118 123 123
case12 521 18 17 20 20
case13 24 4 2 7 7
case14 1 1 1 1 1
case15 55065 57530 56760 56203 56203
case16 36 34 46 65 65
case17 14 23 18 13 13
case18 23 17 16 16 16
case19 16980 17064 16643 16631 16631

Allie query

Time cost for the five use cases (see http://kiban.dbcls.jp/togordf/wiki/survey#data):

Query\time (ms) time 1 time 2 time 3 time 4 time 5
case1 269 27 21 22 21
case2 1350 1273 1729 1300 1381
case3 395 145 138 155 172
case4 171 81 71 101 127
case5 26934 28107 27204 28276 26781
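
Because each query was timed five times, one way to read a table like this is to compare the first run with the later ones. The sketch below does that for the Allie timings. Reading a slow first run as a cold-cache effect is an assumption, not something measured here, and the function name `summarize` is hypothetical.

```python
from statistics import median

# Allie query times in ms, copied from the table above.
ALLIE_MS = {
    "case1": [269, 27, 21, 22, 21],
    "case2": [1350, 1273, 1729, 1300, 1381],
    "case3": [395, 145, 138, 155, 172],
    "case4": [171, 81, 71, 101, 127],
    "case5": [26934, 28107, 27204, 28276, 26781],
}


def summarize(runs):
    """Return the first-run time, the median of later runs, and the overall median."""
    first, rest = runs[0], runs[1:]
    return {
        "first": first,
        "median_rest": median(rest),
        "median_all": median(runs),
    }


for case, runs in ALLIE_MS.items():
    s = summarize(runs)
    print(f"{case}: first={s['first']} ms, "
          f"later median={s['median_rest']} ms, overall median={s['median_all']} ms")
```

The same function can be applied to the rows of the other query tables on this page.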

PDBj query

Query\time (ms) time 1 time 2 time 3 time 4 time 5
case1 184 156 150 152 131
case2 2 1 2 1 2
case3 6 3 2 1 1
case4 114 157 121 164 161

UniProt query

Query\time (ms) time 1 time 2 time 3 time 4 time 5
case1 49 42 56 157 58
case2 105 127 94 90 90
case3 114 108 116 123 116
case4 2 2 2 2 2
case5 8 1 66 16 1
case6 2252 2217 2177 2237 2192
case7 58027 13139 42780 42017 41729
case8 421 402 410 417 487
case9 589 597 614 644 619
case10 702 5862 622 642 643
case11 70 43 50 57 61
case12 6 2 21 3 3
case13 278 317 285 288 276
case14 268 270 274 271 264
case15 10635 10453 10785 10650 10684
case16 9075 9008 9049 9074 9260
case17 78 2 1 1 5
case18 180 70 45 98 89

DDBJ query

Query\time (ms) time 1 time 2 time 3 time 4 time 5
case1 248 238 245 208 213
case2 270 225 212 214 222
case3 8672 401 431 430 411
case4 61 59 57 55 54
case5 24 11 6 6 5
case6 110 95 92 94 149
case7 14 12 3 2 4
case8 3 3 6 3 5
case9 13 4 23 4 6
case10 0 1 1 1 1

Attachments