Federated SPARQL query system benchmark for life sciences data
Overview
A federated query, which retrieves RDF data across multiple sources, is indispensable for understanding biological processes and diseases, for medicine development, and for biological data integration. Querying RDF data via SPARQL endpoints allows queries to be processed against the original data sources. We evaluate whether existing federated SPARQL endpoint query systems can satisfy the requirements (mainly response performance) of big life sciences data.
Our evaluation differs from FedBench, SP2Bench, and the fine-grained evaluation of SPARQL endpoint federation systems, all of which use a simulated federated environment and either synthetic data or a subset of real data. Their result sets are small, so they cannot show performance when the retrieved data are large. In addition, whether SPARQL 1.1 queries can be executed directly, without a federated query engine, has not been reported. In this report, we use real life sciences data and real endpoints. We test FedX, SPLENDID, ADERIS, and ANAPSID. We also report the performance of SPARQL 1.1 queries with and without these engines.
Federated query engine | License | Platform | Cache | Source code | Pre-computing | Source selection | SPARQL 1.1 |
FedX | GNU Affero General Public License | Java | Yes | available | No | "ask" | Yes |
SPLENDID | GNU Lesser GPL | Java | No | available | Yes | "ask"+statistics | No |
ADERIS | Apache License 2.0 | Java | No | available | No | predicate | No |
ANAPSID | GNU Lesser GPL | Python | No | available | Yes | predicate | Yes |
Method
We use five real biological SPARQL endpoints and designed five basic queries that vary in the number of endpoints actually queried, the number of triple patterns (from 4 to 9), and the number of results (from 5 to 11,000). We also rewrote queries 3 and 5 with a "LIMIT 100" clause.
To keep the server and network environment stable, we execute each query sequentially on all engines and repeat this five times. We then discard the largest value and average the remaining four.
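The timing procedure above can be sketched in Python as follows. This is only an illustration under stated assumptions: `run_query` is a hypothetical callable that executes one query on one engine, and the function names are not part of the benchmark's own tooling.

```python
import time
from statistics import mean


def average_without_slowest(timings):
    """Discard the single largest timing and average the remaining values."""
    if len(timings) < 2:
        raise ValueError("need at least two timings to discard the largest")
    kept = sorted(timings)[:-1]  # drop the largest value only
    return mean(kept)


def trimmed_mean_runtime(run_query, repeats=5):
    """Time `run_query` `repeats` times and return the average without the slowest run.

    `run_query` is a hypothetical zero-argument callable that runs one query.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return average_without_slowest(timings)
```

Dropping only the largest value is a simple guard against a single run slowed by a transient server or network hiccup; the four remaining runs are averaged as described.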
To test performance when users run federated SPARQL 1.1 queries directly on an endpoint instead of through a federated query engine, we rewrite all five queries with SERVICE keywords, change the order of the two SERVICE clauses, and execute the query on one of the five endpoints.
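As a rough illustration of this rewriting step, the sketch below assembles a SPARQL 1.1 query from two SERVICE clauses and produces both clause orders. The endpoint URLs, predicates, and variable names are placeholders invented for the example, not the endpoints or queries used in the benchmark.

```python
# Hypothetical endpoints and graph patterns, for illustration only.
ENDPOINT_A = "http://example.org/endpointA/sparql"
ENDPOINT_B = "http://example.org/endpointB/sparql"
PATTERN_A = "?protein <http://example.org/vocab/encodedBy> ?gene ."
PATTERN_B = "?gene <http://example.org/vocab/associatedWith> ?disease ."


def build_service_query(select_vars, service_blocks, limit=None):
    """Build a SELECT query whose WHERE clause is a sequence of SERVICE blocks.

    `service_blocks` is a list of (endpoint_url, graph_pattern) pairs; the
    SERVICE clauses appear in the query in exactly the order given.
    """
    lines = ["SELECT " + " ".join(select_vars) + " WHERE {"]
    for endpoint, pattern in service_blocks:
        lines.append(f"  SERVICE <{endpoint}> {{ {pattern} }}")
    lines.append("}")
    if limit is not None:
        lines.append(f"LIMIT {limit}")
    return "\n".join(lines)


blocks = [(ENDPOINT_A, PATTERN_A), (ENDPOINT_B, PATTERN_B)]
original_order = build_service_query(["?protein", "?disease"], blocks)
swapped_order = build_service_query(["?protein", "?disease"], list(reversed(blocks)))
```

Both strings describe the same graph pattern, so comparing their run times on a single endpoint isolates the effect of the SERVICE clause order.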
Conclusion
1. Although many SPARQL endpoints now support SPARQL 1.1 queries, they cannot take the place of federated query engines.
2. FedX performs well in both ease of use and response time.
3. All of these systems can complete a light query.
4. Neither FedX nor ADERIS needs pre-computed statistics, which makes them easy to use. However, ADERIS queries the predicate information of all endpoints on the fly, which incurs a large cost.
5. SPLENDID performs better than every engine except FedX, but it needs pre-computed predicates and other statistical information.
6. ANAPSID is the only engine that does not use the Java platform. It uses heuristics to choose data sources, which makes queries faster, although some of the actual results may be lost.