# Second Graph-TA Program

## 9:00 - 9:30 Registration

## 9:30 - 10:45 Presentation session I (Chair: Josep Lluis Larriba-Pey)

### Hassan Chafi. "Graph Analytics Research at Oracle Labs"

**Abstract::** The problem of efficiently analyzing graphs of various shapes and sizes has been recently enjoying an increased level of attention both in the academia and in the industry. This trend prompted creation of specialized graph databases that have been rapidly gaining popularity of late. In this presentation we discuss the various concerns being addressed by existing and emerging graph analysis systems. We will also present some of our own efforts in the area of graph analytics which are focused on high-performance, parallel, in-memory graph analysis.

### Fernando Orejas and Elvira Pino. "Incremental Model Synchronization with Triple Graph Grammars"

**Abstract:** In model-driven software development, typically we may have several models describing the same system or artifact, by providing different views on it. In this case, we say that these models are consistently integrated. In this context, model synchronization is the operation to propagate updates in one model to another model, restoring their consistency. However, this kind of operation is not only related to the area of model-driven development, but it has been also studied in the context of programming language implementation or in databases (view update problem).

Triple Graph Grammars (TGGs) are a general and powerful tool to describe (bidirectional) model transformations and, in particular, to specify classes of consistent models. There are several approaches to describe model synchronization in terms of TGGs, but most of them have a computational cost that depends on the size of the given models. In general this may be very costly since these models may be quite large. In this presentation we present a new approach that is incremental, in the sense that its cost depends only on the size of the update, and we shoe that the new procedure is correct.

### Peng Wang, Veronique Eglin, Chistophe Garcia, Christine Largeron, Josep Lladós and Alicia Fornés. "Handwritten Word Spotting Based on Graph Representation"

**Abstract:** Effective information retrieval on handwritten document images has always been a challenging task. Most existing solutions are based on using the appearance-based model to represent the handwriting in the format of one-dimensional sequential scalar feature vector. In this paper, we propose a novel handwritten word spotting approach by using graph representation rather than the appearance-based one in view of the fact that the structure is more stable and relevant to the handwriting than the pixels and the two-dimensional nature of handwriting structure is difficult to be described in the one-dimensional space. In order to preserve the essential information of the handwriting, the presented model comprises both topological and morphological signatures. Three types of structural points (starting/ending, branch and high-curved) detected on the skeleton of the handwriting are used as the vertexes of the graph, and the strokes are taken as the edges. For each connected component, a skeleton-based graph with the Shape Context labelled vertexes is established. In the end, each word image is represented as a sequence of graphs. In order to be robust to the handwriting variations, an exhaustive merging process based on DTW alignment result is introduced in the similarity measure between word images. With respect to the computation complexity, an approximate graph edit distance approach using bipartite matching is employed for graph matching. The experiments on the George Washington dataset and the marriage records from the Barcelona Cathedral dataset demonstrate that the proposed approach outperforms the state-of-the-art structural methods.

### A. Godoy, M. Sales-Pardo and R. Guimerà. "Long-Term Evolution of Email Communication Networks"

**Abstract: **The study of data from technological platforms as email, mobile or public trasportation networks, can provide us with a better understanding of human communication patterns and hold the possibility to predict the future changes the network may undergo. In this work we focus on the long-term evolution of the anonymous email network provided by Rovira I Virgili University. Our findings show that, behind the aparent arbitrariness, the annual changes in communication growth between pairs of users resemble those featured in other human activities such as growth of companies or air transportation networks (M. Stanley et al., Nature 379, 804 (1996), A. Gautreau et al., PNAS 22, 8847 (2009)).

We show that the emails' growth between users of the network is exponentially distributed. We checked that the shape of the growth distribution between a couple depends on the amount of emails exchanged in the past. Even of greater importance is that distributions are not symmetric with respect to the center. This means that the processes governing the increase or decrease of the communication follow different mechanisms. Finally, we observe that the growths of different pairs are not independent.

The correlations patterns we found in the system will be used to build a model in order to predict evolution dynamics in arbitray emall networks. Development of such a work could be of great interest in the desing of secure and solid network of email servers.

### Ricard Gavaldà and Alberto Lumbreras. "Trust Metrics and User Recommendation in Social Networks"

**Abstract: **Recommender systems have been strongly researched within the last decade. With the arising and popularization of digital social networks a new field has been opened for social recommendations. Considering the network topology, user interactions, or estimating trust between users are some of the new strategies that recommender systems can take into account in order to adapt their techniques to these new scenarios. We introduce MarkovTrust, a way to infer trust from Twitter interactions and to compute trust between distant users. MarkovTrust is based on Markov chains, which makes it simple to be implemented and computationally efficient. We study the properties of this trust metric and study its application in a recommender system of tweets.

### Xavier Muñoz, Elisabet Burjons and Juraj Hromkovic. "Bounds for the required advise for 1-competitiveness in the k-server problem in a k-star"

**Abstract:** The k-server problem is the problem in which an online algorithm must control the movement of a set of k servers, placed in the vertices of a graph, and handle requests from vertices. As each request arrives, the algorithm must determine which server to move to the requested vertex. The goal of the algorithm minimize the competitive ratio, that is, to keep the total distance all servers move small, relative to the total distance the servers could have moved by an optimal adversary who knows in advance the entire sequence of requests. In this work we deal with the k-server problem on a k-star with advise. In particular we find upper and lower bounds for the number of advise bits required to achieve 1-competitiveness.

### Roar Audun Myrset and Sebastian Verheughe. "Using a graph database for resource authorization"

**Abstract:** Telenor Norway is a telecommunications company. Growth in our customer portfolios along with the customer continuously expectation of better response times, triggered in 2011 a project that aimed at delivering improved customer experience.

Telenor has the need of controlling access to customer information. Users of our IT systems will only gain access to information that the user has been granted. A user might have access to large corporate portfolios detailing several hundred thousand resources. A resource can be a customer, account, subscription or an agreement. Portfolio structures are modelled as trees and access is given with inheritance, making traversal of these trees necessary to deduce the granted resources.

The previous solution for calculation of resources was based on using a relational database on the master data tables. These tables were also used for other queries and not optimized solely for this use case. The SQL algorithm did not perform well at scale, and was very complicated to understand due to how the SQL recursive code was written.

Since the problem to be solved required traversal of a graph (tree) and NOSQL solutions such as graph databases was becoming better known, we did an evaluation of using a graph database to solve the problem. Early prototyping showed how the graph code language was simple to write and understand, and resulted in much faster calculations. We did not prototype an optimized SQL solution due to project limitations and our assumption that the graph database was the better fit.

The new solution, using a Neo4j graph database, was in some cases 600 times faster. It has been running in production since spring 2013, and the measured performance matched our expectations.

### Arnau Prat-Perez. "High Quality, Scalable and Parallel Community Detection for Large Real Graphs"

**Abstract:** Community detection has arisen as one of the most relevant topics in the field of graph mining, principally for its applications in domains such as Social or Biological networks analysis. Different community detection algorithms have been proposed during the last decade, approaching the problem from different perspectives. However, existing algorithms are, in general, based on complex and expensive computations, making them unsuitable for large graphs with millions of vertices and edges such as those usually found in the real world. In this paper, we propose a novel disjoint community detection algorithm called Scalable Community Detection (SCD). By combining different strategies, SCD partitions the graph by maximizing the {Weighted Community Clustering (WCC), a recently proposed community detection metric based on triangle analysis. Using real graphs with ground truth overlapped communities, we show that SCD outperforms the current state of the art proposals (even those aimed at finding overlapping communities) in terms of quality and performance. SCD provides the speed of the fastest algorithms and the quality in terms of NMI and F1Score of the most accurate state of the art proposals. We show that SCD is able to run up to two orders of magnitude faster than practical existing solutions by exploiting the parallelism of current multi-core processors, enabling us to process graphs of unprecedented size.

## 10:45 - 11:45 Poster session I (in conjunction with coffee break)

## 11:45 - 13:15 Presentation session II (Chair: David Dominguez-Sal)

### Peter Boncz. "LDBC: Benchmarking Graph Data Management Systems"

**Abstract: **The Linked Data Benchmark Council (LDBC) is an industry initiative, spawned out of a EU project that aims to create a industry-backed organization for graph data management benchmarks. In this talk I will focus on the work of its Social Network Benchmark Taskforce, which has created a state-of-the-art graph data generator that can create huge correlated social network datasets in parallel fashion. Further, I will discuss the three diverse benchmark workloads: (1) the Interactive workload that tests short-running navigational queries and updates, (2) the Business Intelligence workload, that tests performance on large-scale grouping and aggregation including graph navigation and (3) the Algorithms workload that will test a suite of graph analysis algorithms that cannot typically be formulated in a database query language. As such, the social network benchmarking effort of LDBC spans a range of systems, including graph database systems, relational systems, NoSQl systems, graph programming frameworks like Giraph as well as MapReduce.

### Francesc Serratosa. "Learning Weights for Graph Matching Edit Costs"

**Abstract: **Throughout the last 30 years, several methods have been presented to perform error tolerant graph matching. All of these methods assume there are some given weights that gauge the importance of each of the attributes on nodes or edges. These weights are supposed to be manually validated and few research have been done to automatically learn the best combination of weights such that the resulting graph matching problem best matches the expected solution that an expert (human or artificial) would provide. We present a Loss function and Regularisation term to automatically find these weights without the need of a validation process and we do not emphasise on the optimisation algorithm. Our practical evaluation reveals that our method properly learns these weights.

### Anjan Dutta. "Product Graph based Subgraph Matching"

**Abstract:** In this work we develop a subgraph matching algorithm based on product graph, by product graph we mean tensor product graph. Product graph is a graph generated from a binary operation called graph product of two operand graphs. The nodes of a product graph establish a correspondence between each pair of nodes of the operand graphs and its edges locate the common edges of the operand graphs. Having the common edges, one can think of joining them to form longer common walks. These walks can be modeled by random or backtrackless walks as in graph kernel. Now an existence of a pair of matching nodes should create numerous common walks of different lengths originating or terminating at that node in the product graph. This can be obtained in two steps, first, by exponentiating the adjacency matrix A (this is a very nice property of the adjacency matrix of a graph) iteratively and adding them up, which produces a matrix As. Total number of common walks of different lengths between two nodes i and j can be obtained as As(i,j). Second, column wise (or row wise assuming the graph is undirected) addition of As can give the total number of walks originating or terminating at respective nodes. Here a true pair of matched nodes (a node in the product graph) should obtain a higher value than outliers. The subgraph matching routine then groups/detects these nodes as a true occurrence of a model graph.

### Arnau Padrol, Guillem Perarnau, Julian Pfeifle and Victor Muntés-Mulero. "Overlapping communities"

**Abstract: **Efficient graph clustering (or partitioning) has become crucial for many different purposes, ranging from social network and web analysis to data graph mining or graph summarization. Although communities usually overlap, most of the literature related to community search on graphs/networks has focused on finding non-intersecting groupings of nodes. In addition, previous proposals typically rely on prohibitively expensive computations, which makes them unsuitable considering the size of modern data sets.

We present the Overlapping Community Algorithm (OCA), a novel approach which is effective at discovering the latent groups or communities defined by the link structure of a graph and can also detect overlapped communities. The new fitness function we propose to this end evaluates the quality of a community without penalizing nodes that are significantly linked to nodes in another community, as it happens in some previous proposals, and thus allows overlapping communities to come to the surface naturally. In addition, our algorithm does not need to preassume a specific size or number of communities, since this information is extracted from the graph structure itself. The expected running time of our algorithm (in a simplified scenario) on a graph on n nodes with maximum degree is O(n).

We validate this theoretical result by reporting on a comprehensive set of experiments on both synthetic and real-world datasets. These show that OCA outperforms previous approaches in terms of execution time (even by several orders of magnitude in the best cases), while achieves to successfully unveil the community structure. OCA is the most suitable algorithm for accurately retrieving the overlapped communities of large graphs. Finally, we show that our proposal is able to efficiently handle the graph extracted from Wikipedia containing almost half a billion nodes and edges.

### Emili Sapena, Lluis Padró and Jordi Turmo. "A Global Relaxation Labeling Approach to Coreference Resolution"

**Abstract**: This paper presents a constraint-based graph partitioning approach to corefer- ence resolution solved by relaxation label- ing. The approach combines the strengths of groupwise classifiers and chain forma- tion methods in one global method. Ex- periments show that our approach signifi- cantly outperforms systems based on sep- arate classification and chain formation steps, and that it achieves the best results in the state of the art for the same dataset and metrics.

### Glyn Morrill and Oriol Valentín. "Towards Logical Syntactic Structures as Graphs"

**Abstract: **In sequent calculus for the most basic system of categorial logical syntax and semantics, the Lambek calculus (Lambek 1958), a sequent comprises an antecedent configuration, which is a list of types, and a succedent type. A proof is a tree of sequents. But even in the space of Cut-free proofs there is a combinatorial explosion of equivalent proofs, differing only in “inessential rule reordering”. Such spurious ambiguity is the challenge to categorial parsing-as-deduction.

In the deeper format of proof nets (Girard 1987) a (Lambek) proof structure is a planar graph comprising a list of polar type trees with connections forming a 1-1 relation on the leaves. The idea is that local trees of proof structures represent the active types of sequent calculus with i- (unary sequent) or ii- (binary sequent) mother labels. There is no spurious ambiguity, but for a proof structure to really represent a proof it must in addition satisfy a global correctness condition corresponding to the partitioning of subproofs: in the Lambek case that every cycle crosses both edges of some i-local tree. Such proof nets are graphs effecting a perfect representation of the “essence même” (Girard) of proofs. But providing correctness criteria for the full range of connectives of categorial logic (64 at the last count) is a task beyond present technical reach. We can provide sequent calculus, but as we have commented this suffers from spurious ambiguity.

We can look at sequent calculus as a metatheory of proof nets, reading a sequent proof as instructions for building a proof net. When we do this we see that spurious ambiguity corresponds to non-systematicity of such building: the sequent proof search constructing a proof net is jumping from place to place unnecessarily in a “gymkhana” (Girard). Systematizing the synthesis of proof nets corresponds to what is called focalization (Andreoli 1992) in sequent calculus. The aim of this paper is to advocate focalization for categorial logic as a solution to spurious ambiguity in parsing-as-deduction, and as a state-of-the-art bridge to graphical categorial proof nets.

### Toni Valles. "Extracting information from a network by grouping nodes into different partitions"

**Abstract: **Complex networks enable an accurate representation of a large amount of in- formation, with which a wide variety of problems can be solved. For instance, the spreading of a disease can be studied with the worldwide air transportation network, and metabolic networks provide methods to uncover novel drug targets within the every-increasing amount of biological data available. By studying the structure of a network we can extract all the information that can be used in the process of decision making. A good approach is to identify the stochastic block models that capture the patterns of connections between nodes. However, a single stochastic block model might not suffice to capture all the different patterns of connection observable in the structure of a network, because connections between nodes might have arisen due to more than one independent mechanism. Our goal is to develop a method that identifies different stochastic block models, one for each mechanism responsible for the pattern of connection we observe. We apply the method to a karate class network, which nodes are the class members and they are connected depending on their friendship. Due to a conflict among the class members the club split into two different academies, our method allows the identification of the two sides

### David F. Nettleton. "User and adversary driven anonymization of online social networks represented as graphs"

**Abstract:** In this brief presentation we describe an approach for anonymizing online social networks represented as graphs. On the one hand, the end user of the data is able to specify the utility requirements and on the other hand we are able to define potential adversary queries on the data. These two aspects condition the way in which we anonymize the graph, and from which we derive measures for information loss, risk and privacy levels. We also consider sub-graph matching in this context.

### Norbert Martínez. "Towards query algebras for graph operations"

**Abstract: **In the last decade many different graph databases (GDBs) have been proposed to manage property graphs. DEX, a key-value GDB based on the combination of collections of oids represented as bitmaps, has proved to be one of the most efficient solutions when the graph size is larger than the memory available, as many public benchmarks have verified. But graph queries in DEX, as in other graph databases, are built using imperative languages that invoke the proprietary graph APIs of the GDB. Thus, the need of declarative graph query languages, still not standardized, will require specific query algebras with generic graph operations that will be combined in optimized query plans. In this presentation, we introduce a proposal for a graph query algebra for key-value based GDBs that allows the construction and execution of different types of graph queries such as interactive, traversals, pattern matching or graph analytics.

## 13:15 - 14:45 Poster session II (in conjunction with lunch)

## 14:45 - 16:10 Presentation session III (Chair: Peter Boncz)

### Toyotaro Suzumura. "ScaleGraph: A Billion-Scale Graph Analytics Library"

**Abstract: **Recently large-scale graphs with billions of vertices and edges have emerged in a variety of domains and disciplines especially in the forms of social networks, web link graphs, internet topology graphs, etc. Mining these graphs to discover hidden knowledge requires particular middleware and software libraries that can harness the full potential of large-scale computing infrastructures such as super computers. ScaleGraph is an open-source graph library based on the highly productive X10 programming language. The goal of ScaleGraph is to provide large-scale graph analysis algorithms and efficient distributed computing framework for graph analysts and for algorithm developers, respectively.

### Venelin Kotsev. "LDBC - Semantic Publishing Benchmark"

**Abstract: **The Semantic Publishing Benchmark (SPB) is an LDBC benchmark for testing the performance of RDF engines inspired by the Media/Publishing industry and particularly by BBC’s Dynamic Semantic Publishing approach. As of December 2013 SPB has reached the state of draft publication with its “base” version and an “advanced” version in progress. The application scenario behind SPB considers a media or publishing organization that deals with a large volume of streaming content: articles, news, events and other media assets. The content is enriched with metadata (creative works) that describes it and links it to other reference knowledge taxonomies and databases that include relevant concepts, entities and factual information. Metadata allows publishers to efficiently retrieve relevant content to their various business models thus providing a richer enduser experience that is more dynamic, adaptable and can deliver new content almost immediately. The benchmark offers a data generator that uses real reference data to produce datasets of various sizes and test the scalability aspect of the RDF systems. The benchmark workload consists of (a) editorial operations that add new metadata, alter or delete existing (b) aggregation operations that retrieve content according to various criteria. The aggregation queries define different “choke points”, that is technical challenges that a standard RDF store must overcome thus giving opportunities for improvement regarding query optimization. In addition, aggregation workload considers queries that include RDF and OWL reasoning, such as traversal of subclass hierarchies, use of property chain axioms, disjointness of classes among others. Results of the benchmark provide an overview of the performance of a RDF database in terms of update operations and aggregate operations per second. Semantic publishing is an attractive paradigm for several media sectors including: news, finance and scientific publications. For organizations that intend to adopt semantic publishing to drive their business, the LDBC publishing benchmark provides simple means to evaluate RDF databases and find a suitable one for integration into their publishing workflow.

### Joan Guisado. "Massive Query Expansion by Exploiting Graph Knowledge Bases"

**Abstract: **Keyword based search engines have problems with term am- biguity and vocabulary mismatch. In this paper, we propose a query expansion technique that enriches queries expressed as keywords and short natural language descriptions. We present a new massive query expansion strategy that en- riches queries using a knowledge base by identifying the query concepts, and adding relevant synonyms and semanti- cally related terms. We propose two approaches: (i) lexical expansion that locates the relevant concepts in the knowl- edge base; and, (ii) topological expansion that analyzes the network of relations among the concepts, and suggests se- mantically related terms by path and community analysis of the knowledge graph. We perform our expansions by using two versions of the Wikipedia as knowledge base, concluding that the combination of both lexical and topological expan- sion provides improvements of the system’s precision up to more than 27%.

### Christophe Scholliers, Coen De Roover, Elisa Gonzalez Boix. "A GraphDatabase Storage Model for Change-Oriented Software Development"

**Abstract:**** **Given the fast pace with which society evolves, the software supporting organizations, businesses and enterprises is driven by a constant need for change. This constant need for change drives the manner in which modern software systems are conceived, as can be witnessed in the movement towards iterative and agile development processes in industry. However, most development tools still assume that they act upon a single complete release of the system. The CHAQ (Change-centric Quality Assurance) project aims to strike a balance between agility and reliability through change-centric quality assurance tools. These tools are to share a first-class representation of changes to software artifacts that comprise a software system, as well as the complete history of all individual changes to these artifacts (e.g. source code, bugs, bug comments, project e-mails, etc.). In this talk, we present our graph representation for changes in large artifacts. Our representation consists of a set of specialized nodes and edges for representing the state, evolution and changes to source code, bugs and versioning information. Central to our approach is the concept of a snapshot node which represents the collective state of all of a system’s artifacts as seen by a developer at a particular point in time. The complete repre- sentation of an artifact can thus be viewed as a linked list of snapshots. Each snapshot has a set of edges to keep track of all changed, created and deleted entities. However, copying all entities into a new snapshot each time one of the entities has changed is prohibitively expensive for large software artifacts. In order to provide an efficient storage model, our graph representation keeps back pointers to previous snapshots so that only the changed entities need to be stored. On top of this graph representation we are building tools that allow us to extract first class changes and repeat these changes over similar artifacts. The novelty of our graph representation lies within the efficient storage and memory model to track the collective state of large artifacts. Our system is the first approach to represent changes as first-class entities for artifacts other than a system’s source code.

### Sergio Gomez "Community structure in complex networks at different resolution levels"

**Abstract: **In almost all real complex networks it is possible to find a community structure, in which groups of nodes are clearly more connected between them than with the rest of the nodes of the network. The analysis and identification of these communities has received a lot of attention in the last years, but it cannot be considered as a solved problem. One of the remaining goals is the determination of the resolution levels at which the communities really exist. At certain scales of resolution you may find a small number of large communities, while at other scales these communities could break into many small communities, probably forming a hierarchical structure. However, many of the community detection methodologies neglect this problem and just provide a partition of the network at a certain scale, which is implicitly defined in the algorithm, and which could result in unaware users analyzing the wrong communities.

** **

### J. Barceló, M.P. Linares and O.Serch. "On some Graph Related Problems in Transportation Analysis"

**Abstract: **Transportation analysis is usually based on mathematical models with underlying graph structures, two particular cases of special interest, prompted by the Smart Mobility applications, concern the optimal location of sensors in a urban network, when the objectives of the data collection are the generation of mobility information; and the resort to dynamic flow models to describe the propagation of traffic flows along time-dependent paths in the network. This presentation analyzes the formulation of detection layout models for the first problem, as a special case of node covering problem with topological and technological constraints especially suited to the location of ICT traffic sensors; and the use of time-dependent shortest paths, as the Dynamic Network Loading component of Dynamic User Equilibrium assignment. That is the case of shortest paths when the cost of using a link, expressed in terms of travel times, depend on the congestion in the link and the arrival time to the link origin. The numerical results for some real cases will be presented and discussed.

### Ali Naderi, Jordi Turmo and Horacio Rodríguez. "Graph-Based Entity Linking"

**Abstract: **Recently, the needs of world knowledge for Artificial Intelligence applications are highly increasing. Knowledge Bases (KB) are involved to keep and categorize entities and their relations as a part of world knowledge, but the high cost of manual elicitation to create KBs forces toward automatic acquisition from text. This requires two main abilities: 1) extracting relevant information of mentioned entities (attributes and relations between them), and 2) linking these entities with entries in the KB. Our system focuses on the latter. Entity Linking (EL) is the task of linking an entity mention occurring in a document to a unique entry within a reference KB.

### Javier de San Pedro, Jordi Cortadella and Toni Roca. "Hierarchical Floorplanning of Chip Multiprocessors using Subgraph Discovery"

**Abstract:** Chip multiprocessors (CMPs) integrate more than one computing core in a single chip. For the past few years, industry trend has been towards CMPs with hundreds or thousands of cores. However, the design complexity of CMPs grows dramatically as the number of cores increases. An effective way to manage design complexity is through the use of hierarchical, reusable designs. In tiled CMPs, reusability is achieved by splitting a design into homogeneous tiles that are designed only once and then replicated. Yet, the rise of heterogeneous CMPs, which include multiple types of accelerator cores, make it no longer possible to assume that CMPs will have only a single type of tile.

Floorplanning is the design stage that creates the physical layout of the functional components in a CMP. Regular floorplans, in which the same layouts are reused as many times as possible, allow for reusable designs that minimize design time. In this work, we introduce a new hierarchical floorplanning scheme where hierarchy is automatically discovered. Frequent subgraph discovery is used to identify candidate repeating patterns in the netlist. Then, a linear programming model finds the graph cover that covers the entire input netlist using the least number of patterns.

In order to construct a hierarchy, this process is iteratively applied. The graph is compressed using the best cover found and new patterns are discovered in the resulting graph.

Every pattern in the hierarchy is then treated as a separate floorplanning subproblem. Regularity is exploited by reusing the same floorplan for multiple instances of a pattern. To efficiently represent the set of floorplans obtained during a floorplanning subproblem, bounding curves are used. Patterns located higher in the hierarchy tree also use these bounding curves to select the best shapes for contained patterns. Our current results show the scalability of the approach for large CMPs and the competitive results in area and wire length.

### A. Valdes-Jimenez, J.A. Reyes, D. Dominguez-Sal, J.L. Larriba-Pey, M.A. Arenas-Salinas. "A graph oriented database implemented in DEX for protein structural information"

**Abstract: **Proteins are essential macromolecules of great interest for the study of diseases and the development of novel drug-targets. The analysis and understanding of the tri-dimensional (3D) structure of these proteins could help us to comprehend the molecular mechanisms involved in living organisms, specifically how they are functionally related to other small-molecules called a “ligand or a “drug”, when they bind to each other. Proteins can be simply described as a linear sequence or chain of amino acids of diverse length. There are 20 different types of amino acids, which differ in their atom composition. However, proteins are not linear. Each of them has a particular shape and structure given by how the amino acid chain folds and is arranged in the 3-dimensional space, which is essential to determine the specific function of the protein.

The appropriate study of the protein 3D information is of high complexity. We believe that Graph Database Technologies offer some advantages compared to relational databases for this specific problem, particularly related to the efficient integration of big data and to the development of queries for pattern recognition. This presentation describes the implementation of diverse types of protein information following a graph representation employing the DEX graph database management system. We utilized the PDB (Protein Data Bank) database [2], which encompasses information of the 3D coordinates of every atom in the protein for around 95,000 characterized proteins up to date. The information of each of these proteins is contained in a unique file containing multiple sources of additional biological data.

The graph representation of the protein structural information incorporates diverse types of nodes including, PROTEIN (name of the protein), AMINO (type of amino acid), ATOM (3D coordinates of atoms for each amino acid), LIGAND (small molecule that bind the protein) and HETATM (3D coordinates of each atom comprising the ligand). In addition, the graph database includes diverse types of edges representing the relationship among different nodes: the linear sequence information of the amino acid chain, the ligands that bind the protein, the atoms that take part of the amino acid chain, and the distances between atoms of the protein and the ligands. The volume of information ingested up to date is over 73GB, including around 500 million of nodes and a similar number of edges. Examples of how the information stored in this graph-oriented database can be employed to make queries and find structural patterns among different proteins are also given in this presentation.

## 16:10 - 17:00 Poster session III (in conjunction with coffee break)