Scaling up Graph Databases for Network-centric applications (Picasso and COLOR Projects)
![]()
French-Spanish Picasso and COLOR Projects
The project aims at joining the experience of INRIA Atlas team (Montpellier, France) in the areas of distributed and parallel data management with the experience of DAMA-UPC (Technical University of Catalonia, Barcelona, Spain) in massive graph data management systems.
Project Description
The project aims at joining the experience of INRIA Atlas team (Montpellier, France) in the areas of distributed and parallel data management with the experience of DAMA-UPC (Technical University of Catalonia, Barcelona, Spain) in massive graph data management systems. The level of synergy between those groups leads to the objective of setting up a set of activities to strengthen their collaboration and to expand these links to place those groups in the lead of an emerging field, Large Graph Databases (LGD) for network-centric applications. The technical objectives for the proposal are: to focus the joint research on the different aspects that define the area of LGD, including but not limited to data access, query placement and optimization, algorithms for the performance of specific operations on large datasets, and infrastructure for caching in large graph management. The collaboration and networking objectives are: to obtain a long lasting relation between the two groups, organizing joint internal workshops, setting up a Master/PhD student exchange routine that allows the groups to have a continued collaboration, setting up international activities such as joint open workshops or EC project proposals on the topic so that the two groups are recognized and valued as leaders in this emerging area.
The Teams
The ATLAS team at INRIA Sophia-Antipolis Méditerranée, Montpellier headed by P. Valduriez brings strong expertise in data management in distributed systems, in particular, P2P systems. The key team members involved in the project will be P. Valduriez, R. Akbarinia (query processing, P2P) and Esther Pacitti (replication, query processing, P2P). The team has good experience in technology transfer (e.g. transfer of the Disco information mediator to the Kelkoo startup by P. Valduriez and his team at INRIA in 2000). ATLAS has started to build the Atlas P2P Architecture (http://www.sciences.univ-nantes.fr/lina/atlas/appa), a P2P data management system that provides scalability, availability and performance for applications which deal with semantically rich data (XML, relational, etc.) [AMPV06]. APPA is a P2P data management system in Open Source Software that provides scalability, availability and performance for applications which deal with semantically rich data (XML, relational, etc.). APPA provides advanced services such as queries, replication and load balancing. It is being implemented on top on various P2P networks such as JXTA, OpenChord and Pastry and tested on GRID5000 and PlanetLab. The current services of APPA are : key-based timestamping (KTS), Satisfaction-based query allocation (SbQA), P2P logging and timestamping for reconciliation (P2P-LTR) and PeerUnit, a P2P testing framework. The APPA services are used in several projects: Strep Grid4All, ANR RNTL Xwiki Concerto and ANR VERSO DataRing. APPA also addresses data replication [APV07a, MPEJ08] and query processing [APV06, APV07b, APV08c]. The ScalingGDB project will enable the ATLAS team to gain experience on very large graph data applications and produce new graph data management services within APPA.
The DAMA-UPC group at Universitat Politècnica de Catalunya, Barcelona (16 researchers and Ph.D./Masters/Degree students and 10 developers) headed by J. Ll. Larriba Pey brings its expertise both in relational and graph database management systems. In particular, the expertise of the group in relational databases focuses on sorting [DLN99,DNL01], query optimization for very large join queries [MPAZL06, MAZML06, MLAL07], parallel join and stream processing [AMZL07,ATML06,ATML08], distributed caching [DLS07] and compilation for Large Databases [RLV05, RBGC01]. One of the main objectives of DAMA-UPC is technology transfer. The projects of the group started in 2001 in collaboration with Institut Català d'Oncologia (ICO). The effort in this project focuses in the generation of a Record Linkage software, DAURUM. The main objective started in order to help the Catalan hospitals to link all the databases related to cancer. This software has been acquired by the Death Registry of Catalonia, the Cancer Registries of Aragón, Baleares, Islas Canarias y Navarra. In 2003, DAMA-UPC started another project with IBM. The effort is to collaborate in the integration of new software features in DB2. The IBM project started with 2 full time developers and one part time developer, and now sizes to 5 full time developers. In 2004 we started a project related to the codification of death causes for the Death Registry of Catalonia. The objective of this project is to obtain an automated tool that helps the specialized coder to obtain a faithful transcription of written data. DAMA-UPC has been working on a full-fledged Graph query engine from 2005, DEX, which has been reported in [MMGNSL07]. DEX is a generic graph query engine that has been implemented based on the research in our group, and can be used in the different applications mentioned in Section 1. In particular, there is a demo of the use of DEX in the web, called BIBEX, at www.dama.upc.edu/bibex, where it is possible to launch different graph queries to an integrated instance of DBLP and Citeseer. DEX allows working on large graphs with hundreds of millions of nodes and edges, giving answers within seconds. The DEX technology has been used and validated by the registrars of Spain for fraud detection in property transactions, and it is being used to develop a full-fledged research analysis application for the Ministry of Education and Science of Spain.
Motivation
The emergence of network-centric applications with the requirement to manage large networked or graph information has created the need to set up graph management systems that allow dealing with millions of nodes and edges. Such applications include social networking, where the nodes can be people and their hobbies or activities, and the edges are the relations between them (LinkedIn, Facebook, etc); bibliographic databases, with complex on-line queries where the nodes are the authors or the papers written by them, and the edges are authorships or references to the papers (www.dama.upc.edu/bibex); fraud detection applications in different areas like police investigation, where the nodes are the entities investigated and the actions taken by those entities, and the edges are the relations between those entities and the actions; the Wikipedia or similar wiki-like sources of information, where the nodes are the different keywords (including names, locations, urls, etc) and the edges are the relations between the places where those keywords are used or the urls that they point to; etc. The need to manage the knowledge generated by those applications and to launch ad-hoc queries to those large databases makes them very attractive to the research community and to the IT companies to create new solutions for this already consolidated market.
One important aspect of network/graph information is its size. In some cases, the graphs are large because they represent huge amounts of information like in web logs, or in large bibliographic data sets, like Scopus (www.scopus.com) or JCR by Thomson (scientific.thomson.com). In those cases it may be necessary to implement parallel graph database systems to allow for reasonable performance in the execution of queries. In some other cases, they are distributed in different sites like in the case of the different language instances of Wikipedia. In all these cases, if we want to launch queries to those large graphs (either parallel or distributed), it is necessary to know the nature of the data, the way it is distributed or replicated and the representation for each of the data sets. This valuable information can be used in many different aspects of the research proposed jointly in this document, making it important to understand and investigate different aspects of future parallel/distributed network/graph databases. The important aspects to take into account range from how to distribute the large amounts of data managed by those applications, the best query placement and optimization strategies and the techniques and data structures that allow caching data efficiently in such environments, among others.
Technical Background
It is necessary to build graph databases with open querying capabilities for supporting the large spectrum of applications that involve graph/networked data. However, efficient query processing on such graph databases is challenging due to the high complexity of processing large volumes of graph data.
Graph Databases. There are many graph database models that have been proposed in the literature. GOOD [GPTB93] provided the theoretical basis for a system in which the manipulation as well as the representation of data are transparently graph-based. Among subsequent developments based on GOOD are GMOD [AGPTA92], which proposed a number of concepts for graph-oriented database user interfaces, and G-Log [PPA95] which proposed a declarative query language for graphs.
Other proposals that deal with graph data models are GraphDB [Güt94], GRAS [KSW95], GRACE [SMM05] and RDF [AG05, KC04]. Among these, GRAS and GRACE propose an implementation of a graph database management system for small and medium sized databases. RDF has become a de-facto standard for describing resources in the form of a graph. Currently there is research work on storing information expressed in RDF but none of this work defines formally a graph database model. Querying RDF stores from a graph database perspective is discussed in [AG05].
According to query processing, existing research has been conducted mainly on two types of graph databases. The first type is a database that consists of a large set of small graphs such as chemical compounds. This type of database is especially popular in scientific domains such as chemistry and bioinformatics. Typical queries for this type of database include subgraph queries and similarity queries. They consist in retrieving all graphs in that database that are supergraphs of, or structurally similar to, a given query graph. Much work has been done to improve this type of queries through proposing structural indexes [GS02] and graph mining techniques [YH02].
The second type of graph databases, which is addressed by this proposal, consists of very large graphs, such as the Web graphs and social networks. Typical querying tasks for such graph databases include finding the best connections between a set of query nodes [FMT04], finding subgraphs that match a given query pattern [TFGE07] or query keywords.
Keyword search on graph structured and semi-structured data has attracted recently a lot of attention. With the growth of the World Wide Web (WWW), users/applications need to be able to access online databases, without having a detailed knowledge of schema or query language. This issue has been addressed in several approaches, where special-purpose tools, like BANKS [BHN02] or Précis [KSI06], were built to map a relational database into a graph to be returned as a response for a keyword search query. These approaches assume either implicitly or explicitly that the graph is in memory, and would incur high random-IO overheads if the graph does not fit in memory. Some interesting works have started recently to discuss this issue [DKS08].
In any case, and to the best of our knowledge, there is not any background on the proposal and analysis of Large Graph Database management systems, except for DEX in [MMGNSL07]. DEX is a high performance graph database querying system that allows for the integration of multiple data sources. However, a limitation of DEX and other existing works on graph databases is their assumption that the data is collected and graph is processed in a centralized manner.
Regarding data structures and algorithms, several papers propose efficient ways for implementing the most typical operations on graphs such as reachability [WaVLDB08, JiSIGMOD08, YaICDE08, Cheng08a], frequent subgraph search and mining [YaSIGMOD08, GoSIGMOD08, YoICDE08], isomorphism and pattern search [Cheng08b, TiICDE08, ZoEDBT08, Agrawal94] or shortest path [DiEDBT08], to give some examples. Efficient structures for representing graph-like data have been proposed in the literature for the last two decades [Amann92, KiIS20, RoICPR02, SrCOMAD05, MMGNSL07]. The difference between this work and the former referenced work is that, while in previous papers the focus was high performance on very simple graph structures with no attributes, these papers consider a graph to be data-oriented, keeping huge amounts of information, as in a classic DBMS.
Databases in P2P Systems. Scaling up data management in order to support very large databases or high numbers of concurrent users requires parallel/distributed techniques which exploit a highly parallel computer architecture such as multiprocessor, cluster or peer-to-peer (P2P) networks [VP07,ÖV99,Val08]. Parallel data management refers to the management of data in a parallel/distributed system. It can be done in different ways, e.g. a parallel database system implemented in a multiprocessor, or a search engine with the index implemented in a cluster. The basic principle of parallel data management is to partition the data across different nodes, in order to increase performance through parallelism and availability through replication. This enables supporting very large databases with very high query or transaction loads. Parallel database systems can exploit distributed database techniques. In particular, data partitioning is somewhat similar to database fragmentation. Essentially, the solutions for transaction management, i.e. distributed concurrency control, reliability, atomicity, and replication, can be reused. However, the critical issues for parallel data systems are data placement, parallel query processing (including optimization and execution) and load balancing. These issues are much more difficult than in distributed DBMS because the number of nodes may be much higher. Furthermore, the interconnection network of a multiprocessor system provides better opportunities for improving performance than a general-purpose network. Parallel data management in P2P or grid environments also raises new issues due to the dynamic behavior of nodes [PV04, PVM07].
Most of the work in handling graph-like data in parallel or distributed data management has been done in the context of the relational model, where flat tables make it easy for partitioning. In the late 1980, there have been interesting attempts to deal with complex objects (mostly trees) in the context of object-oriented databases [VKC86, KV88]. This work can be useful to deal with XML documents (trees) in parallel/distributed systems but requires major rethinking to deal with more general graphs. Some work has been oriented to the parallel aspects of algorithms to solve some important algorithms over graphs such as the shortest path problem [MaALE07, CrMTAAP07] or finding frequent patterns [ReHIPS07]. Other graph algorithms like the shortest path or graph isomorphism have been used to assess their scalability on new parallel architectures, without paying much attention to new algorithm designs [Bader05b, UnWMAA07]. Parallelism in graph algorithms has also been largely exploited for other typical graph operations such as BFS, graph partitioning, spanning tree search or graph coloring [Bader05c, JoSIAM93, KaTechRep03, Boman05, Bozda05, DeIPDPS06, YoSUPER05].
Scientific Description
The need for the parallel or distributed management of graph databases is obvious from the discussion above. The project aims at investigating in detail the different aspects that arise from this analysis. In the following, we detail the purpose of the technical activities involved in this project.
Data distribution
The current DEX graph database assumes that all the data is collected in a central server. Nevertheless, the decentralized approach requires changes into the data collection and/or into the query engine because in this scenario data is split in as many graph databases as different peers.
There are different approaches for data distribution that can be applied to a graph database:
- Horizontal partitioning, where data units are split based in hash keys, clusters of object identifiers or other map function between an object and its location (peer). This is the simplest approach but makes query solution more complicated.
- Vertical partitioning, where each data units is stored in a single peer, with optional replicas for safety and performance. This approach simplifies query solving but arises some complexity problems to decide the better solutions.
- Path partitioning, where all the objects that form a composite unit are clustered into a partition, for example based in connectivity or distance between nodes.
In this project, we will study all these aspects and others that might arise from our work in order to understand the most efficient data distribution for a distributed graph database system.
Query placement and optimization
There are two main potential strategies for distributed query processing: information propagation and query propagation. Information propagation, also known as data shipping, means that the querying peer or a central super-peer collects all data from the available peers, and then unifies the information to return the final result. In iterative or recursive queries it will mean multiple calls to the peers and a huge volume of calls and data into the inter-process communication channel, but the unification of results becomes easier and no big changes are required to the query processor.
Query propagation or query shipping is a more advanced strategy based on the replica of all or parts of the query at each peer. The query plan, the querying peer or the super-peer decides which part of the query must be executed at each peer, and when and how the partial results will be unified to return the final result. This approach is more difficult to implement because it requires an advanced analysis of queries and a transformation of the query plan. Query solving is also more complex on iterative or recursive queries, but the number of calls and amount of data sent should be significantly smaller.
Finally, there exist other possibilities, as for instance hybrid shipping, which combines query and data shipping. However, we will initially focus on the two firstly mentioned main strategies.
As a starting point for the three approaches, we will model distributed homogeneous graphs with a shared schema with optional replicated objects (nodes, edges and attributes) and extra local objects. The second step will be to perform a detailed analysis of the behavior and performance of all the low-levels graph operations in the three parts of the distributed schema (shared, replicated and local). With this information, query plans can be modified based in operation affinity, data locality, communication costs or other performance counters available.
We will also investigate optimization techniques in order to generate efficient query execution plans that allow improving the performance of the system by proposing an efficient organization of the operations executed to return the results of the query.
Caching
The access to secondary storage is one of the main bottlenecks in systems that access to big data repositories. Disk accesses are extremely slow compared to the frequencies achieved by processors.
Therefore, a selection of the data that the system keeps in memory is necessary and influences the system performance. The addition of local caches provides a big performance gain in most programs. However, it has been long observed that the access to the memory of a remote computer through the network is faster than the access to the local disk storage. Thus, effective cooperative caching can be added to a distributed system to increase its performance.
In graph databases, several computing nodes share the same data set. In order to achieve the best performance, it is necessary that the nodes have the data available in local memory or that they can locate the information in the memory of other computer, using the network. Thus, it is necessary to create algorithms that: (a) send the data to the nodes which use the data frequently; and that (b) keep frequently accessed data in some node of the network for as long as possible. Along with these algorithms, our objective is to design data structures which capture which subsets of data are used in each node and that can be transmitted efficiently.
In this project we plan to explore all these aspects. Also, we will explore the reuse partial computations from previous queries to solve future queries. For example, in a graph database that contains researchers, a query that find researchers who have published in SIGMOD and CIKM will generate with high probability similar results to a query which find researchers who have published in VLDB and CIKM. If the system is able to find which parts of the computation they had been previously computed, it is possible to reduce significantly the response time of the system.
Current Work
Currently we have been working on graph partitioning techniques in order to reduce the inter-node communications. The first results of this work have been published in the 1st International Workshop on Graph Databases (IWGD) (http://www.icst.pku.edu.cn/IWGD2010/index.html):- Graph Partitioning Strategies for Efficient BFS in Shared-Nothing Parallel Systems, Muntés-Mulero Victor (DAMA-UPC), Norbert Martínez-Bazán (DAMA-UPC), Josep-L Larriba-Pey (DAMA-UPC), Esther Pacitti (LIRMM), Patrick Valduriez (INRIA)
Bibliography
[AG05] R. Angles and C. Gutiérrez. Querying rdf data from a graph database perspective. ESWC Conf., 346–360, 2005.
[AGPTA92] M. Andries, M.Gemis, J.Paredaens, I. Thyssens, J.V.Den Bussche. Concepts for graph-oriented object manipulation. In Proceedings of the 3rd International Conference on Extending Database Technology (EDBT). LNCS, vol. 580. Springer, 21–38, 1992.
[AMZL07] J. Aguilar Saborit, V. Muntés Mulero, C. Zuzarte, J. L. Larriba Pey. Data & Knowledge Engineering, 63(3):995-1013, 2007.
[AMPV06] R. Akbarinia, V. Martins, E. Pacitti, P. Valduriez. Design and Implementation of APPA. Global Data Management, R. Baldoni, G. Cortese, F. Davide (eds.), IOS Press, 2006.
[APV06] R. Akbarinia, E. Pacitti, P. Valduriez. Reducing Network Traffic in Unstructured P2P Systems Using Top-K Queries. Distributed and Parallel Databases, 19(2), 67-86, 2006.
[APV07a] R. Akbarinia, E. Pacitti, P. Valduriez. Data Currency in Replicated DHTs. ACM SIGMOD Int. Conf. on Management of Data, 211-222, 2007.
[APV07b] R. Akbarinia, E. Pacitti, P. Valduriez. Best Position Algorithms for Top-k Queries. Int. Conf. on Very Large Databases (VLDB), 495-506, 2007.
[APV07c] R. Akbarinia, E. Pacitti, P. Valduriez. Processing Top-k Queries in Distributed Hash Tables. European Conf. on Parallel Computing (Euro-Par), 489-502, 2007.
[ATML06] J. Aguilar-Saborit, P. Trancoso, V. Muntés-Mulero, J. L. Larriba-Pey. Dynamic Count Filters. ACM SIGMOD Record 35 (1):26-32, 2006.
[ATML08] J. Aguilar-Saborit, P. Trancoso, V. Muntes-Mulero, J.L. Larriba-Pey Dynamic adaptive data structures for monitoring data streams. Data & Knowledge Engineering, In Press, 2008.
[Bader05b] D.A. Bader, G. Cong, and J. Feo. On the Architectural Requirements for Efficient Execution of Graph Algorithms. ICPP’05, pp. 547-556.
[Bader05c] D. A. Bader and G. Cong. A fast, parallel spanning tree algorithm for symmetric multiprocessors (smps). J. Parallel Distrib. Comput., 65(9):994–1006, 2005
[BHN02] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, S. Sudarshan. Keyword searching and browsing in databases using BANKS. Int. Conf. on Data Engineering (ICDE), 431–440, 2002.
[Boman05] E. G. Boman, D. Bozda , U. Catalyurek, A. H. Gebremedhin, and F. Manne. A scalable parallel graph coloring algorithm for distributed memory computers. EuroPar 2005.
[Bozda05] D. Bozda, U. Catalyurek, A. H. Gebremedhin, F. Manne,E. G. Boman, and F. Ozgner. A parallel distance-2 graph coloring algorithm for distributed memory computers. HPCC’05.
[CrMTAAP07] J.R. Crobak, J. Berry, K. Madduri, and D.A. Bader. Advanced Shortest Paths Algorithms on a Massively-Multithreaded Architecture. MTAAP 2007.
[DKS08] B.Dalvi, M.Kshirsagar, S.Sudarshan. Keyword Search on External Memory Data Graphs. In the Proceedings of VLDB, 2008.
[DeIPDPS06] K. Devine, et al. Parallel hypergraph partitioning for scientific computing.In IPDPS. 2006
[DLN99] D. Jiménez-González, J. Larriba-Pey y J. J. Navarro, Communication Conscious Radix Sort. ACM Int. Conf. on Supercomputing (ICS), 76-82, 1999.
[DNL01] D. Jiménez-González, J. Navarro y Josep-L. Larriba-Pey. Fast Parallel In-memory 64 bit sorting. Int. Conf. on Supercomputing (ICS), 114-122, 2001.
[DLS07] D. Dominguez-Sal, J. L. Larriba-Pey, M. Surdeanu. Multi-layer Collaborative Cache for Question Answering. European Conf. on Parallel Computing (Euro-Par), 287-298, 2007.
[FMT04] C.Faloutsos, K.S. McCurley, A.Tomkins. Fast Discovery of connection sub-graphs. In Proceedings of the International SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2004.
[GPTB93] M. Gemis, J. Paredaens, I. Thyssens, J. V. den Bussche. Good: A graph-oriented object database system. ACM SIGMOD Conf., 505–510, 1993.
[GS02] R. Giugno, D. Shasha. Graphgrep: A fast and universal method for querying graphs. Int. Conf. on Pattern Recognition, 112-115, 2002.
[Güt94] R. H. Güting. Graphdb: Modeling and querying graphs in databases. Int. Conf. on VLDB, 297–308, 1994.
[JoSIAM93] M. T. Jones and P. Plassmann. A parallel graph coloring heuristic. SIAM J. Sci. Comput., 14(3):654–669, 1993.
[KaTechRep03] G. Karypis, et al: Parallel graph partitioning and sparse matrix ordering library, version 3.1. TR, D. of Comp. Sci. U. Minnesota, 2003. http://www-users.cs.umn.edu/karypis/metis/parmetis/download.html.
[KV88] S. Khoshafian, P. Valduriez, G. Copeland. Parallel Query Processing of Complex Objects. IEEE Int. Conf. on Data Engineering (ICDE), 202-209, 1988.
[KSI06] G. Koutrika, A. Simitsis, Y. Ioannidis. Précis: The essence of a query answer. IEEE Int. Conf. on Data Engineering (ICDE), 2006.
[KSW95] N. Kiesel, A. Schuerr, and B. Westfechtel. Gras, a graph-oriented engineering database system. Information Systems, 20(1):21–51, 1995.
[KC04] G. Klyne and J. Carroll. Resource Description Framework (RDF): Concepts and Abstract Syntax. W3C Recommendation, 10 February 2004. http://www.w3.org/TR/2004/REC-rdf-concepts-2004.
[MaALE07] K. Madduri, et al., An Experimental Study of A Parallel Shortest Path Algorithm for Solving Large-Scale Graph Instances, ALENEX’07.
[MMGNSL07] N. Martínez-Bazan, V. Muntés-Mulero, S. Gómez-Villamor, J. Nin, M. A. Sánchez-Martínez, J. L. Larriba-Pey: Dex: high-performance exploration on large graphs for information retrieval. CIKM Conf., 573-582, 2007.
[MPEJ08] V. Martins, E. Pacitti, M. El-Dick, R. Jiménez-Peris. Scalable and Topology-Aware Reconciliation on P2P Networks, Distributed and Parallel Databases, to appear, 2008.
[MPAZL06] V. Muntés-Mulero, M. Pérez-Casany, J. Aguilar-Saborit, C. Zuzarte, J. L. Larriba-Pey. Parameterizing a genetic optimizer. DEXA Conf., 707-717, 2006.
[MAZML06] V. Muntés-Mulero, J. Aguilar-Saborit, C. Zuzarte, V. Markl, J.L. Larriba-Pey. An inside analysis of a genetic-programming optimizer. IDEAS Conf., 249-255, 2006.
[MLAL07] Victor Muntés-Mulero, Néstor Lafón-Gracia, Josep Aguilar-Saborit and Josep-Ll. Larriba- Pey. Improving Quality and Convergence of Genetic Query Optimizers. DASFAA Conf., 6-17, 2007.
[ÖV99] T. Özsu, P. Valduriez. Principles of Distributed Database Systems. Prentice Hall, Englewood Cliffs, New Jersey, 1st edition, 562 pages, 1991; 2nd edition, 666 pages, 1999; 3rd edition (forthcoming).
[PPA95] J. Paredaens, P. Peelmaa, L. And Tanca. G-Log: A graph-based query language. IEEE Trans. Knowl. Data Eng. 7, 3, 436–453, 1995.
[PVM07] E. Pacitti, P. Valduriez, M. Mattoso. Grid Data Management: open problems and new issues. Journal of Grid Computing, 5(3), 273-281, 2007.
[RBGC01] A. Ramirez, L. Barroso, K. Garachorloo, R. Cohn, Josep-L. Larriba-Pey, G. Lowney y Mateo Valero. Code Layout optimizations for Transaction Processing Workloads. Int. Symp. on Computer Architecture (ISCA), 155-165, 2001.
[RLV05] A. Ramírez, J.L. Larriba-Pey, M. Valero. Software Trace Cache. IEEE Transactions on Computers, 54(1):22-35, 2005.
[ReHIPS07] S. Reinhardt and G. Karypis. A Multi-Level Parallel Implementation of a Program for Finding Frequent Patterns in a Large Sparse Graph. HIPS'07, 2007.
[SMM05] S. Srinivasa, M. Maier, M. R. Mutalikdesai, G. K. A., G. P. S. Lwi. Safari: A new index structure and query model for graph databases. COMAD Conf., 138–147, 2005.
[TFGE07] H. Tong, C. Faloutsos, B. Gallagher, T. Eliassi-Rad. Fast best-effort pattern matching in large attributed graphs. In Proceedings of the International SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2007.
[UnWMAA07] Keith Underwood, Megan Vance, Jonathan Berry and Bruce Hendrickson. Analyzing the Scalability of Graph Algorithms on Eldorado. MTAAP’07
[Val08] P. Valduriez (ed.). Parallel Databases. Encyclopedia of Database Systems. L. Liu and T. Özsu (eds), Springer, 2008 (forthcoming).
[VKC86] P. Valduriez, S. Khoshafian, G. Copeland. Implementation Techniques of Complex Objects. Int. Conf. on VLDB, 101-110, 1986.
[VP04] P. Valduriez, E. Pacitti. Data Management in Large-scale P2P Systems. Int. Conf. on High Performance Computing for Computational Science (VecPar -2004), Invited paper, Valencia, Spain, 2004. LNCS 3402, Springer, 109-122, 2005.
[VP07] P. Valduriez, E. Pacitti. Parallel Database Systems. Handbook of Database Technology, J. Hammer and M. Scheider (eds.), CRC Press, 2007.
[YH02] X. Yan, J. Han. Gspan: Graph-based substructure pattern mining. In Proceedings of the IEEE International Conference on Data Mining (ICDM), 2002.
[YoSUPER05] Yoo, A., Chow, E., Henderson, K., McLendon, W., Hendrickson, B., and Catalyurek, U. 2005. A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L. Supercomputing 2005.