Research topics

The main research topics of DAMA-UPC are oriented to performance, exploration and quality in data management, focusing particularly on large data volumes. We investigate the creation of new data structures, algorithms, methods and applications in the area of Data Management that make it easier to manipulate large amounts of data.

Smart Mobility

Integrating data from various sources such as mobile apps, sensors, government data, private data and Open Data plays a key role in developing a better understanding of how citizens behave and how cities evolve. Applying this knowledge will be crucial for the next generation of Smart Cities. Graph databases allow the visualisation and analysis of all the data generated in a modern city, making it actionable through associated mobile apps.

Graph Databases

Manage huge data networks!

The volume of data handled by any organization is always increasing. The analysis of this data plays an increasingly important role in the decision-making of large enterprises and in the study of various fields, academic and non-academic, which have an impact on improving life in the society in which we live.

The future of information organization points clearly towards organizing data in a natural way, in the form of large graphs and networks, where the various entities are represented as a set of nodes and their relationships are expressed as a set of edges that connect them.
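The node-and-edge model described above can be illustrated with a minimal in-memory sketch. The `Graph` class, its method names and the sample data are hypothetical and chosen only for illustration; this is not the Sparksee API.

```python
from collections import defaultdict


class Graph:
    """A tiny in-memory graph: entities are nodes, relationships are labelled edges.

    Hypothetical illustration only; not the API of any particular graph database.
    """

    def __init__(self):
        self.nodes = {}                     # node id -> attribute dict
        self.adjacency = defaultdict(list)  # node id -> list of (label, neighbour id)

    def add_node(self, node_id, **attributes):
        self.nodes[node_id] = attributes

    def add_edge(self, source, label, target):
        # Store the relationship in both directions so it can be traversed either way.
        self.adjacency[source].append((label, target))
        self.adjacency[target].append((label, source))

    def neighbours(self, node_id, label=None):
        # Return neighbour ids, optionally restricted to one relationship label.
        return [n for (l, n) in self.adjacency[node_id] if label is None or l == label]


g = Graph()
g.add_node("alice", kind="person")
g.add_node("bob", kind="person")
g.add_node("upc", kind="organization")
g.add_edge("alice", "knows", "bob")
g.add_edge("alice", "works_at", "upc")

print(g.neighbours("alice"))           # ['bob', 'upc']
print(g.neighbours("alice", "knows"))  # ['bob']
```

Storing adjacency lists per node keeps neighbour lookups proportional to the node's degree, which is the access pattern most graph traversals rely on.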

Check out our spin-out company, Sparsity Technologies, which commercializes the graph database developed in DAMA-UPC: Sparksee (formerly known as DEX). http://www.sparsity-technologies.com/

Current Projects

Graph Benchmarking
Graph data transactions
Graph query algebra
Distributed Graph Databases

Social Networks and Graph Analysis

Graphs are everywhere!

Twitter, Facebook and the whole Internet provide billions of related data items, which form huge networks of relationships. Deriving information from such datasets by visual inspection is not feasible, so it is necessary to design algorithms that perform graph mining. We are working on scalable graph data analysis algorithms that can process such huge graphs.

Current Projects

SCD: Scalable Community Detection (code available on GitHub)
Query suggestion using knowledge bases

Past Topics

Relational Database Management Systems

Rapid technological evolution is the Philosopher’s Stone for success in most of today's businesses. The computer era is no longer a novelty: every company, organization or industry has a digital library to store data that is important for its business.
The use of Relational Database Management Systems (RDBMSs) as powerful tools to store, modify and access data in a database is widespread worldwide. RDBMSs range in complexity from the simplest applications, designed for home use or small companies with modest information storage requirements, like Microsoft Access, to very complex and sophisticated systems, such as DB2 UDB, Oracle or Microsoft SQL Server, used in critical situations where the huge amount of data to be manipulated requires advanced techniques to improve performance.
However, the amount of data to be stored and manipulated is growing rapidly and continuously, beyond the capabilities of current hardware and software, jeopardizing the acceptable performance of RDBMSs.

Large Join Query Optimization
DBMS Buffer Pool Analysis and Database Workloads Characterization
Parallel Join Project and Stream Processing
Internal Sorting in DBMSs
Hybrid In-Memory/Out-of-core Database Management Systems
Towards Autonomic Memory Management for Relational DBMS

Distributed Search Engines and Question Answering

Cooperative Caching and cache-aware load balancing

Current Information Retrieval systems deal with giant data repositories: the major search engines now crawl more than a trillion unique URLs, and the number will continue to grow. Locating useful information in these huge repositories requires very efficient architectures and algorithms to achieve good performance.

The architectural details of major search engines have evolved with the technology and algorithms available; however, two fundamental characteristics underlie all of their designs: distributed computing and data caching. A single computer is far from achieving the throughput required by major search engines, so engineers deploy these systems on clusters of computers, often based on commodity hardware. Although this architecture aggregates the processing power of several computing nodes, relying on additional hardware alone is not enough, because the amount of resources needed would become prohibitive. This project targets the improvement of cache-aware techniques for distributed systems in order to improve system performance.
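To make the idea of combining distribution with caching concrete, the following is a minimal sketch of one simple cache-aware technique: routing each query by its hash so that repeated queries reach the node that already caches the answer. It is an illustrative assumption, not the method developed in this project; `CacheNode`, `route` and `slow_backend` are hypothetical names.

```python
import hashlib
from collections import OrderedDict


class CacheNode:
    """One cluster node holding a small LRU cache of query results (hypothetical)."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.cache = OrderedDict()  # query -> cached result, oldest entry first
        self.hits = 0
        self.misses = 0

    def answer(self, query, backend):
        if query in self.cache:
            self.cache.move_to_end(query)   # mark as most recently used
            self.hits += 1
            return self.cache[query]
        self.misses += 1
        result = backend(query)             # expensive computation on a miss
        self.cache[query] = result
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used entry
        return result


def route(query, nodes):
    # Hash the query so the same query is always sent to the same node.
    digest = hashlib.sha256(query.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def slow_backend(query):
    # Stand-in for a full index lookup.
    return f"results for {query!r}"


nodes = [CacheNode() for _ in range(3)]
for q in ["graph databases", "smart cities", "graph databases", "graph databases"]:
    route(q, nodes).answer(q, slow_backend)

total_hits = sum(n.hits for n in nodes)
total_misses = sum(n.misses for n in nodes)
print(total_hits, total_misses)  # 2 2
```

Hash-based routing maximizes cache reuse but can overload the node that owns a very popular query, which is the tension that cache-aware load balancing has to manage.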

Cooperative caching for question answering
Search Environments for Media (Semedia, EU-FP6 project)

Performance Aspects of Data Privacy and Anonymization for Very Large Datasets

When size matters!

With the increase in available public data sources and the interest in analyzing them, privacy issues are becoming a central concern in many applications. The vast amount of data collected on human beings and organizations as a result of cyberinfrastructure advances, or collected by statistical agencies, for instance, has made traditional ways of protecting social science data obsolete. This has given rise to different techniques aimed at tackling this problem and at analyzing the limitations in such environments. The growing accessibility of high-capacity storage devices allows more detailed information to be kept in many areas. While this enriches the information and conclusions extracted from the data, it poses a serious problem for most previous work on privacy, which has focused on quality and paid little attention to performance. In our group we explore data privacy and anonymization requirements in the context of high performance and very large data volume management (e.g. algorithms and structures for efficient data management, and parallel or distributed systems).

Genetic Algorithms for Multivariate Microaggregation
Improving Performance Aspects of Anonymization Methods