Distributed Graph Databases
The emergence of network-centric applications that must manage large networked (graph) data has created the need for graph management systems capable of handling millions of nodes and edges. Such applications include social networking, where the nodes can be people and their hobbies or activities, and the edges are the relations between them (LinkedIn, Facebook, etc.); bibliographic databases with complex on-line queries, where the nodes are authors or the papers they have written, and the edges are authorships or references to the papers (www.dama.upc.edu/bibex); fraud detection applications in areas such as police investigation, where the nodes are the entities investigated and the actions taken by those entities, and the edges are the relations between those entities and actions; and Wikipedia or similar wiki-like sources of information, where the nodes are keywords (including names, locations, URLs, etc.) and the edges are the relations between the places where those keywords are used or the URLs they point to. The need to manage the knowledge generated by these applications and to launch ad-hoc queries against such large databases makes them very attractive to the research community and to IT companies seeking to create new solutions for this already consolidated market.
One important aspect of network/graph information is its size. In some cases, graphs are large because they represent huge amounts of information, as in web logs or in large bibliographic data sets such as Scopus (www.scopus.com) or JCR by Thomson (scientific.thomson.com). In those cases it may be necessary to implement parallel graph database systems to achieve reasonable query performance. In other cases, the graphs are distributed across different sites, as with the different language instances of Wikipedia. In all these cases, launching queries against such large graphs (whether parallel or distributed) requires knowing the nature of the data, the way it is distributed or replicated, and the representation of each data set. This information can be used in many aspects of the research proposed jointly in this document, which makes it important to understand and investigate future parallel/distributed network/graph databases. The aspects to take into account include how to distribute the large amounts of data managed by those applications, the best query placement and optimization strategies, and the techniques and data structures that allow data to be cached efficiently in such environments, among others.
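To make the data distribution question concrete, the following is a minimal sketch of one very simple strategy, hash partitioning of nodes across sites. It is an illustration only and not the approach taken by the project: the function names (site_of, partition_edges), the sample edges and the choice of CRC32 as the hash are all assumptions introduced here. Each edge is stored at the site that owns its source node, and edges whose endpoints live on different sites are reported separately, since those are the ones a distributed query has to follow across the network.

```python
import zlib
from collections import defaultdict


def site_of(node: str, num_sites: int) -> int:
    """Map a node identifier to a site using a deterministic hash (CRC32)."""
    return zlib.crc32(node.encode("utf-8")) % num_sites


def partition_edges(edges, num_sites):
    """Assign each edge to the site owning its source node.

    Returns (local, cut): local maps a site number to the edges stored
    there, and cut lists the edges whose endpoints are on different sites.
    """
    local = defaultdict(list)
    cut = []
    for src, dst in edges:
        src_site = site_of(src, num_sites)
        dst_site = site_of(dst, num_sites)
        local[src_site].append((src, dst))
        if src_site != dst_site:
            cut.append((src, dst))
    return dict(local), cut


# Hypothetical bibliographic graph: authors linked to the papers they wrote,
# and one paper referencing another.
edges = [
    ("author:A", "paper:1"),
    ("author:B", "paper:1"),
    ("author:B", "paper:2"),
    ("paper:2", "paper:1"),
]

local, cut = partition_edges(edges, num_sites=2)
for site, stored in sorted(local.items()):
    print(f"site {site}: {stored}")
print(f"edges crossing sites: {len(cut)} of {len(edges)}")
```

Hash partitioning spreads nodes evenly but ignores the structure of the graph, so many edges may end up crossing sites. The number of crossing edges is one rough measure of how much inter-site communication a query traversal would need, which is why data placement, query placement and caching are treated together above.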
People involved in the project
PhD student: Norbert Martínez-Bazán
PhD advisor: Victor Muntés-Mulero