features.md

List of all features, ideas, and discussions towards finishing this project

This is a place to write down everything that we talked about or worked on. The overall goal of the project is to have a product which:

  • Given a SQL query, generates a set of tests
  • Each test covers one of the query's paths by inserting data rows, running the query, and checking the output
  • A path of the query tests a single condition
  • These tests can also be combined into one big set of data rows
  • These tests should fail when the developer changes the query and its behaviour

EvoSQL - in progress

The overall workflow of EvoSQL is:

  • Pass a connection to the actual database
  • Pass the SQL that needs to be tested
  • Extract all paths from SQLFpc
  • For each path:
    • Extract the table schemas needed from the actual db
    • Recreate the schema on the instrumented db
    • Run the GA
    • Store the output fixture
  • Generate test(s)
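
The workflow above could be orchestrated roughly as follows. This is only a sketch: the helper steps are passed in as functions because the real EvoSQL API is not shown in this document, so all names here are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of the EvoSQL workflow described above; the helper
// functions stand in for SQLFpc path extraction and the per-path GA run.
public class WorkflowSketch {
    /** Runs the search for every path of the query and collects the resulting fixtures. */
    public static <F> List<F> run(String query,
                                  Function<String, List<String>> extractPaths,  // SQLFpc step
                                  Function<String, F> searchFixture) {          // schema setup + GA per path
        List<F> fixtures = new ArrayList<>();
        for (String path : extractPaths.apply(query)) {
            // For each path: recreate the schema, run the GA, store the fixture.
            fixtures.add(searchFixture.apply(path));
        }
        return fixtures;
    }
}
```

The stored fixtures would then feed the final test-generation step.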

Todo from this workflow:

  • Extract all paths from SQLFpc
  • Generate test(s)

Testing EvoSQL - implemented

To make sure we don't break any functionality during development, we need a testing framework. Its base is the TestBase class, in which the method testExecute tests a single path.

Collecting data from HSQLDB - in progress

We instrument HSQLDB to extract the data that is generated while executing a query. The most important data comes from the Expression classes. When conditions are evaluated, Expression.getValue is called on the top expression. This method returns true or false depending on whether the conditions are satisfied. To work with these conditions and know their importance and order, we also give the Instrumenter a reference to the Expression tree. This is done in QuerySpecification.buildResult.

Query Levels - implemented

For the instrumenter to know which expressions are more important than the others, we follow the same execution pipeline that HSQLDB does. The first part of this pipeline we call the query level. The idea behind query levels is that the database will finish subqueries before starting outer queries. A simple example is SELECT * FROM (SELECT * FROM t1 WHERE t1.a = 10) WHERE t1.b = 20. Here the inner query is evaluated completely first before the outer query is even started.

This means that if there is no output data from the inner query, the database will never execute the expression t1.b = 20. However, if there is output data, the expression is executed and the instrumenter needs to know that it is more important than the inner query's expression. Therefore we store the maximum query level that is reached. When we compare two sets of data, the one that reaches the outer level is preferred.

Max RangeVariable Index - implemented

Within a single query level there is another part of the pipeline, namely RangeVariables. HSQLDB converts every access to a source of data within one subquery into a RangeVariable. A RangeVariable iterates over a table's rows, evaluates the conditions, and accepts a row when the conditions pass. An example is SELECT * FROM t1 INNER JOIN t2 ON t1.id = t2.t1_id WHERE t1.a = 10. Here the first RangeVariable goes through t1 and evaluates t1.a = 10. If no data passes through, the database never goes through t2 and never evaluates t1.id = t2.t1_id. If data does pass through, the database goes through the entirety of t2 for every row that satisfied t1.a = 10.

Thus our instrumenter also needs to know the maximum RangeVariable index that was reached, to decide whether one set of data is better than another.

Linking comparisons to expressions - in progress

To calculate distance we need to link comparisons to the expression tree that HSQLDB extracts from the query. To make sure that a single subquery has one expression tree, we move all conditions to the last RangeVariable in RangeVariableResolver.java, method processConditions. So far all conditions have always been collected in joinConditions[0].nonIndexCondition, hence this works. There is also a variable whereConditions[0].nonIndexCondition, which seems to be filled when there are LEFT or RIGHT JOINs and conditions need to be checked only after it is known whether the JOIN was successful. As we never aim to match the JOIN criteria in the case of a LEFT or RIGHT JOIN, we do not need to look at this.

Every time a new subquery or a HAVING is processed, a new expression tree is sent to the Instrumenter. This way each Comparison can be linked to an Expression.

In the case of a RIGHT JOIN, however, there may also be important conditions in the whereCondition variable. These conditions seem to appear only on LEFT/RIGHT JOINs, as the join conditions happen on a different level than the WHEREs. Because LEFT JOINs are not solved, their whereCondition does not matter, but for a RIGHT JOIN the whereCondition is crucial because the output depends on the right-hand side data.

Queries - in progress

  • SELECT
  • WHERE
  • HAVING
  • SUBQUERY
  • INNER JOIN
  • LEFT JOIN
  • RIGHT JOIN
  • Context-dependent subqueries
  • IN
  • BETWEEN (converts to a >= AND <=)
  • Aggregates
    • SUM
    • COUNT
    • AVG
    • MIN
    • MAX

LEFT/RIGHT/OUTER JOINs - in progress

When talking about joins, there are a couple of types. In the case of INNER JOINs the join condition MUST be met, so there is no problem in combining it with the WHERE condition. In the case of LEFT, RIGHT, or OUTER JOINs, however, this does not hold: a query may well be satisfied without satisfying the JOIN criteria. This is a potential problem for our GA, which generates data that matches the query, with fitness increasing the closer it gets. If it is not known whether the join should be matched, it cannot be known whether the fitness should increase or decrease.

Luckily, we use the paths provided by SQLFpc. When SQLFpc encounters a join, it generates paths: some of these paths match the JOIN criteria and others do not. The way this is translated into a SQL query helps us. Whenever a path has to match the JOIN criteria, it is written out as an INNER JOIN. Whenever a path must not match the JOIN criteria, it is written out as a LEFT/RIGHT JOIN.

This means that as long as we assume that our input queries are paths from SQLFpc, we can safely assume that whenever we see a LEFT or RIGHT join, we need our fitness function to explicitly not match the join criteria. This can simply be done by negating the join expression.

To implement this, we added a boolean variable to Expression.java called invertDistance. This boolean is always false unless set in RangeVariableResolver.processConditions. Here we will set invertDistance to true when the condition comes from a LEFT or RIGHT JOIN.
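
The effect of invertDistance can be illustrated with a standard search-based branch distance for an equality condition. The formulas below are the usual ones from search-based testing, not necessarily the exact ones EvoSQL uses, and the method names are illustrative.

```java
// Sketch of a branch distance for "a = b" and its inversion, as used when
// the fitness must explicitly NOT match the join criteria (invertDistance).
public class Distance {
    /** Distance to satisfying a = b: 0 when equal, |a - b| otherwise. */
    public static double equalDistance(double a, double b) {
        return Math.abs(a - b);
    }

    /** Inverted distance: 0 when a != b (join not matched), a penalty of 1 when equal. */
    public static double invertedEqualDistance(double a, double b) {
        return a != b ? 0.0 : 1.0;
    }
}
```

With the inverted distance, any fixture whose rows fail the join condition gets a distance of 0, which is exactly what a LEFT/RIGHT JOIN path requires.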

Todo:

  • Confirm that outer joins are unnecessary (SQLFpc)

Context-dependent subqueries - implemented

Subqueries that do not depend on the data of the outer queries are processed before starting the query, and this is handled by the query levels. However, when a subquery does depend on outer query data, such as in: SELECT * FROM t1 WHERE t1.a = (SELECT MAX(t2.c) FROM t2 WHERE t2.b = t1.b), it is processed during the condition evaluation of the outer query. This means that within a query level it may be possible for HSQLDB to evaluate these subqueries (if it reaches the corresponding RangeVariable). When it does evaluate the subqueries, they need to have a lower level of importance than the current level.

To implement this functionality, three things were added:

  • ComparisonDataStores have subStores, a list with one ComparisonDataStore per subquery of the current level. Note that these can be full queries on their own, as the definition is recursive.
  • InstrumenterState is introduced. The Instrumenter needs to switch into the sub state with the proper information, and when done it needs to return to the previous state. This state contains, for example, the expression tree that we are in the middle of evaluating when we enter a subquery.
  • The fixture fitness includes a data set of comparisons per sub query level.
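
The state switching described above amounts to a stack of instrumenter states: entering a context-dependent subquery pushes a fresh state, finishing it pops back to the outer one. The sketch below illustrates this; the field names are assumptions, not the actual EvoSQL ones, and a String stands in for the real expression tree.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of InstrumenterState switching for context-dependent subqueries.
public class InstrumenterSketch {
    static class State {
        final String expressionTree; // stand-in for the real Expression tree
        State(String expressionTree) { this.expressionTree = expressionTree; }
    }

    private final Deque<State> states = new ArrayDeque<>();

    public InstrumenterSketch(String topLevelTree) {
        states.push(new State(topLevelTree));
    }

    /** Entering a subquery pushes a new state with that subquery's expression tree. */
    public void enterSubquery(String subTree) { states.push(new State(subTree)); }

    /** Leaving the subquery restores the outer state we were evaluating. */
    public void exitSubquery() { states.pop(); }

    public String currentTree() { return states.peek().expressionTree; }
}
```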

COUNTs - in progress

In the case of counts, the GA needs to duplicate rows to create groups consisting of more than one initial row. This solves most counts, except for one case:

The query SELECT COUNT(*) FROM Products WHERE Price > 50 AND Price < 50, whose WHERE can never be solved. Our algorithm thus tries endlessly until finally giving up, even though in this case any input would always give an output, namely 0.

Extracting paths from SQLFpc - implemented

The paths that are to be run through the GA need to be extracted from the SQLFpc web service.

Solvability of paths - tbd

Not all paths that come from SQLFpc are solvable, for example the query SELECT COUNT(*) FROM Products HAVING COUNT(*) > 0 produces a rule: SELECT COUNT(*) FROM Products HAVING (COUNT(*) = -1).

Genetic Algorithm - in progress

Fixture definition - in progress

A fixture, or individual, in the genetic algorithm represents a database state. It contains a set of FixtureTables, which in turn contain FixtureRows.

Each FixtureRow is a set of columns with data assigned to them. These columns are defined by the parent FixtureTable's TableSchema.
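
The structure above could be sketched as follows. The class names follow the text, but the field layout is an assumption (a list of column names stands in for TableSchema), not the actual EvoSQL implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the fixture structure: Fixture -> FixtureTable -> FixtureRow.
public class FixtureSketch {
    /** One row: column name -> value; columns are defined by the table's schema. */
    static class FixtureRow {
        final Map<String, Object> values = new LinkedHashMap<>();
    }

    /** One table: a schema plus its rows. */
    static class FixtureTable {
        final String tableName;
        final List<String> schema;           // column names, stand-in for TableSchema
        final List<FixtureRow> rows = new ArrayList<>();
        FixtureTable(String tableName, List<String> schema) {
            this.tableName = tableName;
            this.schema = schema;
        }
    }

    /** A fixture (GA individual): one database state, i.e. a set of tables. */
    final List<FixtureTable> tables = new ArrayList<>();
}
```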

Todo:

  • Support schemas
  • Potentially support multiple databases within a single fixture (for queries that span multiple databases, which some systems, like SQL Server, allow)

Calculating fitness

In the GA we need not necessarily calculate a number that represents a fixture's fitness (as we are always searching for a solution); rather, we need to be able to:

  • Detect whether a fixture is the solution
  • Find the better fixture between two fixtures

Detect whether a fixture is the solution - implemented

Whenever a fixture's last query level has a distance of 0, a row has passed through to the result and a solution is found.

Find the better fixture between two fixtures - implemented

A fixture is better than another fixture if:

  • The max QueryLevel is higher, or if equal:
  • From the top level downwards, for each level:
    • The max range variable index is higher, or if equal:
    • The distance is lower, or if equal:
    • Compare the sub-query levels from left to right. As soon as one is better, that fixture is better.
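
This lexicographic comparison could be sketched as below. A fixture is reduced to just the fields the comparison needs, the sub-query-level step is omitted for brevity, and all field names are illustrative rather than the actual EvoSQL ones.

```java
// Sketch of the "better fixture" comparison: max query level first, then per
// level (outermost first) the max range variable index, then the distance.
public class FixtureComparison {
    static class Summary {
        final int maxQueryLevel;
        final int[] maxRangeVarIndex; // per level, outermost first
        final double[] distance;      // per level, outermost first
        Summary(int maxQueryLevel, int[] maxRangeVarIndex, double[] distance) {
            this.maxQueryLevel = maxQueryLevel;
            this.maxRangeVarIndex = maxRangeVarIndex;
            this.distance = distance;
        }
    }

    /** Positive if a is better than b, negative if worse, 0 if tied. */
    public static int compare(Summary a, Summary b) {
        if (a.maxQueryLevel != b.maxQueryLevel)
            return Integer.compare(a.maxQueryLevel, b.maxQueryLevel); // higher level wins
        for (int level = 0; level < a.maxRangeVarIndex.length; level++) {
            if (a.maxRangeVarIndex[level] != b.maxRangeVarIndex[level])
                return Integer.compare(a.maxRangeVarIndex[level], b.maxRangeVarIndex[level]);
            if (a.distance[level] != b.distance[level])
                return Double.compare(b.distance[level], a.distance[level]); // lower distance wins
        }
        return 0; // tied; the sub-query-level tie-break would go here
    }
}
```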

Avoiding local optima - tbd

Seeding - in progress

To aid our search algorithm, we seed data that is likely to be needed.

Seeding literals - implemented

This involves the seeding of literals that we can extract from the query. The SeedExtractor class takes a query and returns a Seeds object with a list of values, separated by type.
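
The idea behind SeedExtractor can be illustrated with a simplified version that pulls literals out of the raw query text. The real SeedExtractor works on the parsed query, so this regex-based sketch is only meant to show the concept; all names here are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of literal seeding: collect numeric and string
// literals from a query, separated by type.
public class SeedSketch {
    private static final Pattern NUMBER = Pattern.compile("\\b\\d+(?:\\.\\d+)?\\b");
    private static final Pattern STRING = Pattern.compile("'([^']*)'");

    public static List<String> numericSeeds(String query) {
        List<String> seeds = new ArrayList<>();
        Matcher m = NUMBER.matcher(query);
        while (m.find()) seeds.add(m.group());
        return seeds;
    }

    public static List<String> stringSeeds(String query) {
        List<String> seeds = new ArrayList<>();
        Matcher m = STRING.matcher(query);
        while (m.find()) seeds.add(m.group(1));
        return seeds;
    }
}
```

Seeding these values into generated rows makes conditions like t1.a = 10 satisfiable without waiting for random mutation to stumble on the value.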

Seeding values from other rows/columns/tables - tbd

In the case of a JOIN, it may be helpful to seed values from the joined table. This increases the chance of a JOIN being successful. This seeding has to happen within a fixture.

  • Todo: no real implementation has been proposed

Mutation

Change

Change the values of columns within a row.

Insert - implemented

Add a new row, or with some probability copy an existing row.
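
A sketch of this mutation is below; duplicating an existing row is also what makes COUNT groups of size greater than one reachable. The parameter names and the generic row type are illustrative, not the actual EvoSQL API.

```java
import java.util.List;
import java.util.Random;

// Sketch of the insert mutation: with some probability copy a random
// existing row, otherwise add a freshly generated one.
public class InsertMutation {
    public static <R> void insert(List<R> rows, R freshRow,
                                  double copyProbability, Random random) {
        if (!rows.isEmpty() && random.nextDouble() < copyProbability) {
            rows.add(rows.get(random.nextInt(rows.size()))); // duplicate an existing row
        } else {
            rows.add(freshRow);
        }
    }
}
```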

Delete - implemented

Delete a row.

Crossover - in progress

Our crossover function takes two fixtures and swaps a table between them. Because this has no effect in the case of a single-table query, we might later want to adapt the crossover to swap multiple rows between two fixtures.

  • Todo: Improve the implementation to not only swap tables, but potentially swap a set of rows.
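
The current table-swap crossover is simple to sketch. Here a fixture is reduced to a map from table name to rows; the types and names are illustrative, not the real EvoSQL ones.

```java
import java.util.List;
import java.util.Map;

// Sketch of the crossover: exchange the rows of one table between the
// two parent fixtures.
public class CrossoverSketch {
    public static <R> void swapTable(Map<String, List<R>> parentA,
                                     Map<String, List<R>> parentB,
                                     String table) {
        List<R> rowsA = parentA.get(table);
        parentA.put(table, parentB.get(table));
        parentB.put(table, rowsA);
    }
}
```

The proposed row-level variant would swap a subset of one table's rows instead of the whole table, which would also do useful work for single-table queries.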

Fixture minimization - implemented

When a fixture that satisfies the query is found, it may contain data that is unwanted and has no effect. To remove this data, we perform a linear process: we go through the rows one by one, remove each row, and check whether the query is still satisfied (it still has output). If it does, the row was redundant and we can remove it. If there is no more output, the row is necessary and it is retained.
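
The minimization loop could be sketched as below. The hasOutput predicate stands in for actually re-running the query against the reduced fixture; the names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Sketch of linear fixture minimization: try removing each row and keep the
// removal only while the query still produces output.
public class MinimizeSketch {
    public static <R> List<R> minimize(List<R> rows, Predicate<List<R>> hasOutput) {
        List<R> result = new ArrayList<>(rows);
        for (int i = result.size() - 1; i >= 0; i--) {
            R removed = result.remove(i);
            if (!hasOutput.test(result)) {
                result.add(i, removed); // the row is necessary; put it back
            }
        }
        return result;
    }
}
```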

Test generation - in progress

For the test generation there is a basic version which @mauricioaniche created for the first version of EvoSQL. This does not work with the current version.

Evaluation

A separate project was created to evaluate EvoSQL. The idea is to take a real system, start its database with schema, and then run EvoSQL on a set of real queries.

Baseline

As a baseline we will use a random search algorithm. The idea is that random search might prove to be as good as or even better than the GA for simple cases, but for more complicated cases the GA will win. We therefore implement random search for data generation and compare EvoSQL against it.

We will execute the random search n times, and take the best execution as the baseline.