The importance of data management today makes it necessary to find new paradigms for extracting information from the huge data repositories maintained by companies. Three aspects are crucial in this regard. First, data volumes are usually very large, with companies storing terabytes of data that need to be processed in ever shorter amounts of time. Second, the quality of the data is essential, since it allows sound and significant conclusions to be drawn from the information obtained from those repositories. Third, the capability to extract information from the huge amounts of processed data allows the different actors to obtain the information required for each business process. There are different solutions and research efforts addressing these issues. Business Intelligence, Master Data Management and Customer Data Integration are three examples of areas in which companies and research institutions invest significant effort.
Our Main Research Topics
For the reasons explained above, the main research topics of DAMA-UPC are oriented toward performance, exploration and quality in data management, focusing particularly on large data volumes. We investigate the creation of new data structures, algorithms, methods and applications in the area of data management that make it easier to manipulate large amounts of data.
We focus on two main research areas in developing our expertise. First, we are working on maximizing performance in Graph Database Management Systems. In this area, we propose a new way to approach data storage and management. Information tends to be organized in large networks where not only the data about certain entities is important, but also the relationships between those entities. Examples include social networks, biochemical research on complex organisms and communication networks. We propose a new and sound system that allows for the efficient manipulation of data in large networks. This type of system poses new research challenges to be explored. Second, we study performance in Relational Database Management Systems. Most real-world data is still organized following the traditional relational model. In this situation, it is essential to be able to return reliable and fast answers to user queries on complex databases.
We also conduct research on other topics, such as Data Cleansing and Integration, Data Privacy and Cooperative Caching. These are important aspects related to the quality of results and to the fast and efficient management of huge repositories. Below, we include a brief description of our main research projects.
Data tends to be organized in huge data networks!
The volume of data handled by any organization today is constantly growing. The analysis of this data plays an increasingly important role in the decision-making of large enterprises and in the study of various fields, academic and non-academic, which have an impact on improving life in the society in which we live.
The future of information organization clearly points to a tendency to organize data naturally in the form of large graphs and networks, where the various entities are represented as a set of nodes and their relationships are expressed as a set of edges that connect them.
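To make the node-and-edge view concrete, here is a minimal sketch in Python of a small in-memory labeled graph. It is only an illustration of the data model described above, not the storage design of any DAMA-UPC system; the class name `Graph`, its methods, and the example people and labels are all hypothetical.

```python
class Graph:
    """Tiny in-memory directed, labeled graph (illustrative sketch only)."""

    def __init__(self):
        self.nodes = {}      # node id -> dict of attributes
        self.out_edges = {}  # node id -> list of (label, target id)

    def add_node(self, node_id, **attributes):
        # Entities are stored as nodes with arbitrary attributes.
        self.nodes[node_id] = attributes
        self.out_edges.setdefault(node_id, [])

    def add_edge(self, source, label, target):
        # Relationships are stored as labeled edges between existing nodes.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.out_edges[source].append((label, target))

    def neighbors(self, node_id, label=None):
        # Follow outgoing edges, optionally restricted to one edge label.
        return [
            target
            for edge_label, target in self.out_edges.get(node_id, [])
            if label is None or edge_label == label
        ]


# Hypothetical social-network example.
social = Graph()
social.add_node("alice", kind="person")
social.add_node("bob", kind="person")
social.add_node("upc", kind="organization")
social.add_edge("alice", "knows", "bob")
social.add_edge("alice", "works_at", "upc")
```

In this sketch, a question such as "whom does alice know?" becomes a direct edge traversal, `social.neighbors("alice", label="knows")`, rather than a join over tables.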
Graph Database Projects
Relational Database Management Systems
Managing large amounts of relational data!
Rapid technological evolution is the philosopher's stone for success in most of today's businesses. The computer era is no longer a novelty. Every company, organization or industry has a digital library to store the data that is important for its business.
The use of Relational Database Management Systems as powerful tools to store, modify and access data in a database is widespread worldwide. RDBMSs range in complexity from the simplest applications, designed for home use or for small companies with modest information storage requirements, such as Microsoft Access, to very complex and sophisticated systems, such as DB2 UDB, Oracle or Microsoft SQL Server, used in critical situations where the huge amount of data to be manipulated requires advanced techniques to improve performance.
However, the amount of data to be stored and manipulated is growing rapidly and continuously, beyond the capabilities of current hardware and software, jeopardizing the acceptable performance of RDBMSs.
Active Relational DBMS Projects
Past Relational DBMS Projects
Distributed cache techniques for search engines
Current Information Retrieval systems deal with giant data repositories: the major search engines now crawl more than a trillion unique URLs, and the number will continue to grow. Locating useful information in these huge repositories requires very efficient architectures and algorithms to achieve good performance.
The architectural details of major search engines have evolved with the new technology and algorithms available. However, some fundamental characteristics remain present in their designs: distributed computing and data caching. A single computer is far from achieving the throughput required by major search engines, so engineers deploy these systems on clusters of computers, often based on commodity hardware. Although this architecture combines the processing power of several computing nodes, simply adding hardware is not enough, because the amount of resources needed would become prohibitive. This project aims to improve cache-aware techniques for distributed systems in order to improve system performance.
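As a rough illustration of why caching reduces the hardware needed, the following Python sketch shows a generic least-recently-used (LRU) cache for query results on a single node. LRU is a standard textbook policy used here as a stand-in; it is not the technique developed in this project, and the class name `LRUResultCache` and the `compute` callback are assumptions made for the example.

```python
from collections import OrderedDict


class LRUResultCache:
    """Generic LRU cache for query results (illustrative sketch only)."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._entries = OrderedDict()  # query -> result, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, query, compute):
        # Cache hit: mark the entry as most recently used and return it.
        if query in self._entries:
            self._entries.move_to_end(query)
            self.hits += 1
            return self._entries[query]
        # Cache miss: run the expensive computation and remember the result.
        self.misses += 1
        result = compute(query)
        self._entries[query] = result
        # Evict the least recently used entry when over capacity.
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)
        return result


def slow_search(query):
    # Hypothetical stand-in for an expensive index lookup.
    return "results for " + query


cache = LRUResultCache(capacity=2)
for q in ["a", "b", "a", "c", "b"]:
    cache.get(q, slow_search)
```

Every hit avoids one call to the expensive backend, which is the effect a cache-aware design tries to maximize. In a distributed setting, the open questions are where such caches live and how nodes cooperate, which this single-node sketch does not address.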
Performance Aspects of Data Privacy and Anonymization for Very Large Datasets
When size matters!
With the increase in available public data sources and the interest in analyzing them, privacy issues are moving to the center of the storm in many applications. The vast amount of data collected on human beings and organizations as a result of cyberinfrastructure advances, or collected by statistical agencies, for instance, has made traditional ways of protecting social science data obsolete. This has given rise to different techniques aimed at tackling this problem and to the analysis of the limitations in such environments. The growing accessibility of high-capacity storage devices makes it possible to keep more detailed information from many areas. While this enriches the information and the conclusions extracted from the data, it poses a serious problem for most previous work on privacy, which has focused on quality and paid little attention to performance aspects. In our group, we explore data privacy and anonymization requirements related to high performance and the management of very large data volumes (e.g., algorithms and structures for efficient data management, and parallel or distributed systems).
Active Privacy and Anonymization for Very Large Datasets Projects