= bigdata1.1.0 =

 * [#configure Bigdata Configuration]
 * [#load Load performance]
   * [#allieload Allie upload]
   * [#pdbjload PDBJ upload]
   * [#uniprotload Uniprot upload]
   * [#ddbjload DDBJ upload]
 * [#Sparql Sparql query performance]
   * [#alliequery Allie query]
   * [#pdbjquery PDBJ query]
   * [#uniprotquery Uniprot query]
   * [#ddbjquery DDBJ query]

=== Bigdata Configuration === #configure

Bigdata stores its data in a journal; two journal modes are available (please refer to [http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=StandaloneGuide] for details).

The WORM (Write-Once, Read-Many) store is the traditional log-structured, append-only journal. It was designed for very fast write rates and is used to buffer writes for scale-out. This is a good choice for immortal databases where people want access to ALL history. It scales to several billion triples.

The RW (Read-Write) store supports recycling of allocation slots on the backing file. It may be used as a time-bounded version of an immortal database where history is aged off of the database over time. This is a good choice for standalone workloads where updates are continuously arriving and older database states may be released. The RW store is also less sensitive to data skew because it can reuse B+Tree node and leaf revisions within a commit group on large data set loads. Scaling should be better than the WORM for standalone use and could reach 10B+ triples. The default property file is attachment:RWStore.properties.

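The journal mode itself is selected in the property file. A minimal illustrative excerpt, assuming the standard Bigdata option names (these lines are not quoted from the attached file):

{{{
# Illustrative excerpt (assumed values, not copied from attachment:RWStore.properties)
# DiskRW selects the RW store; DiskWORM selects the WORM journal.
com.bigdata.journal.AbstractJournal.bufferMode=DiskRW
com.bigdata.journal.AbstractJournal.file=bigdata.jnl
}}}
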
=== Load Performance === #load

Approach 1:

 Upload the data through the Bigdata SPARQL endpoint (NanoSparqlServer), posting the data every 10,000 lines. Please refer to attachment:upload.pl for details; a simplified sketch of the same idea is shown below.

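A minimal sketch of this kind of chunked upload, assuming a line-oriented N-Triples input and a NanoSparqlServer endpoint at http://localhost:9999/bigdata/sparql (the URL, file name, chunk handling, and content type are assumptions; the actual script used is attachment:upload.pl):

{{{#!python
# Minimal sketch: POST an N-Triples file to a Bigdata NanoSparqlServer
# endpoint in chunks of 10,000 lines. Endpoint URL and content type are
# assumptions about the test setup, not taken from upload.pl.
import urllib.request

ENDPOINT = "http://localhost:9999/bigdata/sparql"  # assumed endpoint URL
CHUNK_LINES = 10000

def post_chunk(lines):
    data = "".join(lines).encode("utf-8")
    req = urllib.request.Request(
        ENDPOINT, data=data,
        headers={"Content-Type": "text/plain"},  # N-Triples MIME type
        method="POST")
    with urllib.request.urlopen(req) as resp:
        resp.read()

def upload(path):
    buf = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            buf.append(line)
            if len(buf) >= CHUNK_LINES:
                post_chunk(buf)
                buf = []
    if buf:
        post_chunk(buf)

if __name__ == "__main__":
    upload("allie.nt")  # hypothetical file name
}}}
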
Approach 2:

 Upload with the com.bigdata.rdf.store.DataLoader tool and the default RW store parameters.

We also tested the effect of the following JVM GC options:

{{{
-Xmx55G -Xms30G -XX:+UseG1GC -XX:+TieredCompilation -XX:+HeapDumpOnOutOfMemoryError
}}}

Without the explicit GC options it took 35 minutes to upload Allie.

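For reference, a full loader invocation combining these JVM options with the DataLoader class might look roughly as follows. This is a sketch, not the exact command used: the classpath, jar name, and input file name are assumptions; only the JVM flags and the class name come from this page (DataLoader is invoked with a property file followed by the files to load):

{{{
java -cp bigdata.jar -Xmx55G -Xms30G -XX:+UseG1GC -XX:+TieredCompilation \
     -XX:+HeapDumpOnOutOfMemoryError \
     com.bigdata.rdf.store.DataLoader RWStore.properties allie.rdf
}}}
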
Approach 3:

We modified the following two important parameters (the remaining tests use this configuration by default):
{{{
com.bigdata.btree.writeRetentionQueue.capacity=500000
com.bigdata.rdf.sail.BigdataSail.bufferCapacity=1000000
}}}

Approach 4: Split the input file into 12 smaller files (a sketch of one way to do the split is shown below).

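A minimal sketch of such a split, assuming a line-oriented N-Triples dump (the input format, file names, and the line-based splitting strategy are assumptions; an RDF/XML file could not be split this way):

{{{#!python
# Minimal sketch: split a line-oriented N-Triples file into 12 roughly
# equal parts. Assumes one triple per line; not applicable to RDF/XML.
def split_ntriples(path, parts=12):
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    per_part = -(-len(lines) // parts)  # ceiling division
    for i in range(parts):
        chunk = lines[i * per_part:(i + 1) * per_part]
        if not chunk:
            break
        with open(f"{path}.part{i:02d}", "w", encoding="utf-8") as out:
            out.writelines(chunk)

if __name__ == "__main__":
    split_ntriples("allie.nt")  # hypothetical file name
}}}
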
=== Allie upload === #allieload

Approach 1: 26 hours

Approach 2: 5.89 hours; 6.75 hours with the JVM GC options set

Approach 3: 2.61 hours with UseG1GC;
            35 minutes with automatic garbage collection (vm.swappiness = 10);
            38 minutes with automatic garbage collection (vm.swappiness = 60)

Approach 4: 1.03 hours with UseG1GC;
            35 minutes with automatic garbage collection

=== PDBJ upload === #pdbjload

'''Result:''' 8.95 hours with UseG1GC;
              7.15 hours (429 minutes) with automatic garbage collection (vm.swappiness = 10)

=== Uniprot upload === #uniprotload

'''uniprot.rdf.gz'''

We first uploaded the file uniprot.rdf.gz (3.16 billion triples).

With UseG1GC it took over one week (7.48 days, 646336127 ms):

INFO : 646335942 main com.bigdata.rdf.store.DataLoader.logCounters(DataLoader.java:1185): extent=249818775552, stmts=3161144450, bytes/stat=79
Wrote: 241474404352 bytes.
Total elapsed=646336127ms

With automatic garbage collection it took 74.3 hours (4458 minutes):

INFO : 267385254      main com.bigdata.rdf.store.DataLoader.logCounters(DataLoader.java:1185): extent=249818775552, stmts=3161144450, bytes/stat=79
Wrote: 241489608704 bytes.
Total elapsed=267385843ms

'''uniref.rdf.gz'''

When adding uniref.rdf.gz, it took over 11 days to import 411800000 statements, so we stopped the procedure because of the poor performance.

INFO : 1032251699      main com.bigdata.rdf.store.DataLoader$2.processingNotification(DataLoader.java:1018): 411800000 stmts buffered in 1032247.664 secs, rate= 398, baseURL=http://purl.uniprot.org, totalStatementsSoFar=411800000

'''uniprot.rdf.gz''' (2nd run)

The second upload of uniprot.rdf.gz took 349235466 ms (about 97 hours):

INFO : 349235249      main com.bigdata.rdf.store.DataLoader.logCounters(DataLoader.java:1185): extent=249818775552, stmts=3161144450, bytes/stat=79
Wrote: 241707974656 bytes.
Total elapsed=349235466ms
INFO : 349235324      main com.bigdata.rdf.store.DataLoader.main(DataLoader.java:1545): Total elapsed=349235466ms

=== Load performance summary ===

For the results below we configured vm.swappiness = 60 and used automatic garbage collection.

|| Load time || Cell Cycle Ontology || Allie || PDBj || UniProt* ||
|| 1st time || 3 mins || 35 mins || 429 mins || 4458 mins ||
|| 2nd time || 3 mins || 38 mins || 412 mins || 5820 mins ||
|| average || 3 mins || 37 mins || 421 mins || 5139 mins ||

UniProt*: We only uploaded uniprot.rdf.gz (3.16 billion triples).

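For reference, vm.swappiness is a Linux kernel parameter; it would typically have been changed with sysctl (this is an assumption about the test setup, the exact procedure is not documented on this page):

{{{
# temporarily set the value (assumed procedure)
sysctl -w vm.swappiness=60
# or, equivalently
echo 60 > /proc/sys/vm/swappiness
}}}
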
=== Sparql query performance === #Sparql

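Each query was run five times against the SPARQL endpoint; the tables below report the elapsed time of each run in milliseconds. A minimal sketch of how such timings can be collected, assuming a NanoSparqlServer endpoint at http://localhost:9999/bigdata/sparql (the URL and the example query are assumptions; the actual benchmark queries are the Allie/PDBJ/Cell Cycle cases listed below):

{{{#!python
# Minimal sketch: time a SPARQL SELECT query against a NanoSparqlServer
# endpoint five times. The endpoint URL and the query are placeholders.
import time
import urllib.parse
import urllib.request

ENDPOINT = "http://localhost:9999/bigdata/sparql"  # assumed endpoint URL
QUERY = "SELECT * WHERE { ?s ?p ?o } LIMIT 10"     # placeholder query

def run_once(query):
    params = urllib.parse.urlencode({"query": query})
    req = urllib.request.Request(
        ENDPOINT + "?" + params,
        headers={"Accept": "application/sparql-results+xml"})
    start = time.time()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return (time.time() - start) * 1000  # elapsed time in ms

if __name__ == "__main__":
    for i in range(5):
        print(f"time {i + 1}: {run_once(QUERY):.0f} ms")
}}}
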
=== Cell cycle query === #cellquery

||Query\time(ms) ||time 1 ||time 2 ||time 3 ||time 4 ||time 5 ||
||case1 ||341 ||353 ||327 ||328 ||327 ||
||case2 ||46 ||43 ||41 ||45 ||39 ||
||case3 ||3361 ||3039 ||2855 ||3284 ||3416 ||
||case4 ||32 ||21 ||9 ||22 ||10 ||
||case5 ||416 ||574 ||404 ||401 ||433 ||
||case6 ||1216 ||1295 ||1135 ||1277 ||1134 ||
||case7 ||21 ||21 ||19 ||23 ||21 ||
||case8 ||105 ||113 ||83 ||108 ||93 ||
||case9 ||44 ||43 ||44 ||45 ||40 ||
||case10 ||14 ||14 ||14 ||14 ||14 ||
||case11 ||25 ||29 ||24 ||29 ||14 ||
||case12 ||44 ||49 ||45 ||50 ||32 ||
||case13 ||7 ||21 ||19 ||9 ||18 ||
||case14 ||3 ||17 ||15 ||18 ||15 ||
||case15 ||19456 ||19229 ||18670 ||19016 ||19583 ||
||case16 ||X ||X ||X ||X ||X ||
||case17 ||X ||X ||X ||X ||X ||
||case18 ||X ||X ||X ||X ||X ||
||case19 ||44 ||36 ||29 ||46 ||37 ||

Note: cases 16, 17, and 18 use '''count''' queries, which are not supported; hence the X entries.

=== Allie query === #alliequery

||Query\time(ms) ||time 1 ||time 2 ||time 3 ||time 4 ||time 5 ||
||case1 ||423 ||424 ||424 ||443 ||436 ||
||case2 ||4160 ||4200 ||4263 ||4201 ||4264 ||
||case3 ||3352 ||3230 ||2329 ||2329 ||2308 ||
||case4 ||568 ||592 ||92 ||92 ||97 ||
||case5 ||1830742 ||661710 ||39296 ||39296 ||39784 ||

=== PDBJ query === #pdbjquery

||Query\time(ms) ||time 1 ||time 2 ||time 3 ||time 4 ||time 5 ||
||case1 ||751 ||213 ||213 ||212 ||213 ||
||case2 ||27 ||14 ||15 ||13 ||26 ||
||case3 ||188 ||56 ||45 ||53 ||66 ||
||case4 ||337 ||58 ||57 ||53 ||59 ||

=== Uniprot query === #uniprotquery

=== DDBJ query === #ddbjquery
     172