# Large Join Query Optimization

#### Description

As the stored data grow, the relational schemas needed to organize all these data get more complex, increasing the number of relations in the database. The direct consequence of this growth is the need for creating new SQL queries that involve an increasing number of relations in the database. A clear example of this problem are some advanced applications like SAP R/3, where the schema may contain more than 10,000 relations and may join more than 20 of these in a single SQL query; or some important banking companies like the Dexia Bank or the Fortis Bank which work with very large schemas, where the number of relations in the schema ranges from 5000 to 80001. In these scenarios, users often need to combine a large set of tables to reconstruct complex business objects.

Once a SQL query is introduced into the DBMS, several steps have to be performed in order to return the corresponding results. One of the first steps involves executing the query optimizer. It is the task of the query optimizer to transform a declarative SQL statement into a query execution plan (QEP) by constructing alternative QEPs, determining the cost of each QEP and selecting the cheapest one for execution.

State-of-the-art query optimizers, which typically employ dynamic programming techniques, are faced with problems handling such a large number of joins due to the exponential explosion of the search space. In these situations, optimizers either resort to heuristics or fall back to greedy algorithms. Greedy algorithms iteratively construct a QEP by initially selecting the cheapest two-way join, and then determining the cheapest join with a third table and so forth until all the query tables have been joined. However, they do not consider the entire search space and thus may overlook the optimal plan. This may result in bad query performance and cause queries to run for hours instead of seconds. In some cases, since the change from dynamic programming to greedy takes place when dynamic programming runs out of memory, plans may not be reused when changing system configurations internally.

The problem of accessing a large number of relations from a query has been referred as the Very Large Join Query Problem (VLJQP). Several randomized approaches have been presented in order to solve this problem. Randomized search techniques like the Iterative Improvement or Simulated Annealing methods remedy the exponential explosion of dynamic programming techniques by iteratively exploring the search space and converging to a nearly optimal solution. The big advantage of this type of random-walk algorithms is that they have a constant space overhead.

However, on the other hand, they sometimes need a long time to find a near optimal plan, or they can also get trapped in local minima. Alternatively to those algorithms, the use of evolutionary approaches for optimizing queries involving a large number of joins was tested showing that it can be a very competitive approach.However, the application of evolutionary algorithms in these scenarios has not been studied in detail. There exist two main aspects which are still unclear and need further investigation, namely, the quality of the results and the speed to converge to an optimum solution. Although most of the research done in this area occurred 10 to 20 years ago, with the continuous increase in complexity and data volume, in the petabytes nowadays, the VLJQP is still an open problem.

#### The Carquinyoli Genetic Optimizer (CGO)

We have implemented, validated, studied and improved a query optimizer based on evolutionary strategies Carquinyoli Genetic Optimizer (CGO). CGO aims at revealing some answers to the open questions that still arise around the Large Join Query Optimization Problem. With this project, we provide a formal analysis of a genetic programming based query optimizer by means of a set of statistical models that will help us to understand the behavior and performance of our proposals. We present a new class of query optimizer and we show that, together with other techniques, CGO optimizer outperforms the most efficient approaches presented in the literature to date.

This work is the conclusion of the PhD thesis of Victor Muntés-Mulero, which is available here. A public library for using and extending CGO is available on the net for research purposes here.

Current research lines:

- New algorithms for solving the Large Join Query Problem
- Advanced Parallel Genetic Optimizers