Tuesday 10 April 2012

Semi-join in MySQL 5.6


MySQL 5.6.5 Development Milestone Release has a whole new set of algorithms for processing subqueries. It is based on transforming a subquery into a semi-join operation, and then treating semi-join like another join operation throughout the optimizer.

A subquery can be transformed to a semi-join if it matches these criteria:
  • The subquery is part of an IN or =ANY predicate. It cannot be e.g. NOT IN.
  • The subquery consists of a single query block (it must not contain UNION).
  • The subquery does not contain GROUP BY or HAVING.
  • The subquery is not implicitly grouped (it contains no aggregate functions).
  • The subquery predicate is part of a WHERE clause.
  • The subquery predicate must not be part of a disjunctive nor a negated search condition.
  • Neither query block contains the STRAIGHT_JOIN qualifier.
  • The statement must be a SELECT or INSERT statement. Semi-joins are not allowed with UPDATE or DELETE statements.
Example subquery that can be transformed to a semi-join:

  SELECT * FROM nation
  WHERE n_regionkey IN (SELECT r_regionkey FROM region
                        WHERE r_name='AFRICA');

What is a semi-join operation


A semi-join operation is similar to a regular join operation: It takes two (sets of) tables and combines them using a join condition.

Like an outer join, a semi-join is a noncommutative join operator. We denote the tables of the containing query block the outer tables and the tables of the subquery the inner tables.

Two factors distinguish a semi-join from a regular join:
  • In a semi-join, the inner tables do not cause duplicates in the result.
  • No columns from the inner tables are added to the result of the operation.
This means that the result of the semi-join is a subset of the rows from the outer tables. It also means that much of the special handling of semi-join is about efficiently eliminating duplicates from the inner tables.

We can represent the above query using the artificial SEMI JOIN operator as follows:

  SELECT nation.*
  FROM nation SEMI JOIN region
       ON nation.n_regionkey = region.r_regionkey AND
          region.r_name='AFRICA';

Semi-join optimization


If possible, a subquery is transformed into a semi-join during the query resolution stage. The query blocks containing the outer tables and the inner tables are then combined into a larger query block.

The next step of the optimization is to determine inner tables that are candidate for table pullout. If the inner table of a semi-join has an eq-ref relationship to the outer part of the query, there can be no duplicates from the inner side, and the semi-join can be converted to a regular join. We call that a table pullout, because the table is "pulled out" from the semi-join and joined with the outer tables.

The above subquery can be transformed using table pullout, because there is a unique index on the r_regionkey column. The explain result of the query is as follows:


| id | select_type | table  | type | possible_keys | key           | key_len | ref                     | rows | Extra       |
+----+-------------+--------+------+---------------+---------------+---------+-------------------------+------+-------------+
|  1 | PRIMARY     | region | ALL  | PRIMARY       | NULL          | NULL    | NULL                    |    5 | Using where |
|  1 | PRIMARY     | nation | ref  | i_n_regionkey | i_n_regionkey | 5       | dbt3.region.r_regionkey |    2 |             |


As you can see, there is no difference between this explain result and the explain result of a regular join between these two tables.

The cost-based optimization procedure is then applied to the tables of the query block. Tables taking part in semi-joins are combined in various ways, with the appropriate cost added to the total plan cost. The end result is an optimal table order and an optimal set of semi-join execution strategies.

Semi-join execution strategies


As said before, semi-join is a regular join operation, combined with removal of possible duplicates from the semi-join inner tables. MySQL implements four different semi-join execution strategies, which have different ways of removing duplicates:
  1. FirstMatch
  2. DuplicateWeedout
  3. Materialization
  4. LooseScan
I will handle the FirstMatch strategy in more detail below. The other semi-join strategies will be handled in later blogs. How semi-join strategies can be combined with join buffering will also be handled.

Controlling semi-join strategies


Use of semi-join is controlled with the optimizer_switch flag semijoin:

  set optimizer_switch='semijoin=on'

This command enables transformation of subqueries into semi-join operations. This flag is on by default in MySQL 5.6.5.

In addition, individual semi-join strategies can be turned on and off with the following optimizer_switch flags:

  firstmatch, materialization, loosescan

These flags enable or disable the use of the respective semi-join strategies. Notice that there is no way to disable the DuplicateWeedout strategy, as this is the "last resort" strategy selected by the cost-based optimizer. There is also no way to disable the TablePullout strategy, when semi-join is enabled.

The default for all semi-join related optimizer_switch flags are:

  'semijoin=on,firstmatch=on,materialization=on,loosescan=on'

Notice also that the optimizer_switch flag materialization has a second meaning: It controls the use of the subquery materialization feature (which should not be confused with the semi-join materialization strategy).

Semi-join FirstMatch strategy


The semi-join FirstMatch strategy executes a subquery very similar to how the IN-TO-EXISTS strategy familiar from earlier versions of MySQL works: for each matching row in the outer table, check for a match in the inner table. When a match is found, return the row from the outer table, otherwise continue scanning the inner table until reaching the end.

Here is a query that is processed using FirstMatch strategy:

  SELECT * FROM nation
  WHERE n_nationkey IN (SELECT c_nationkey FROM customer);

Notice the FirstMatch(nation) indication in explain output, which means that after a match has been found in table customer, the query executor goes back to scan more rows in table nation:


| id | select_type | table    | type | possible_keys | key           | key_len | ref                     | rows | Extra                           |
+----+-------------+----------+------+---------------+---------------+---------+-------------------------+------+---------------------------------+
|  1 | PRIMARY     | nation   | ALL  | PRIMARY       | NULL          | NULL    | NULL                    |   25 |                                 |
|  1 | PRIMARY     | customer | ref  | i_c_nationkey | i_c_nationkey | 5       | dbt3.nation.n_nationkey | 1115 | Using index; FirstMatch(nation) |

Here is the result from IN-TO-EXISTS transformation, which can still be enabled by setting optimizer_switch flags:

set optimizer_switch='semijoin=off,materialization=off' :


| id | select_type        | table    | type           | possible_keys | key           | key_len | ref  | rows | Extra       |
+----+--------------------+----------+----------------+---------------+---------------+---------+------+------+-------------+
|  1 | PRIMARY            | nation   | ALL            | NULL          | NULL          | NULL    | NULL |   25 | Using where |
|  2 | DEPENDENT SUBQUERY | customer | index_subquery | i_c_nationkey | i_c_nationkey | 5       | func | 1115 | Using index |

As this is basically the same strategy that MySQL 5.5 would choose, there is usually no speedup to gain when FirstMatch is chosen. The advantage may however come when FirstMatch is not the most efficient strategy, or when the cost-based optimizer chooses a more efficient table order. The optimizer can determine a better table order, because it will estimate a realistic cost for a semi-join operation. This contrasts MySQL 5.5 where the placement of a subquery in the query plan was strictly rule-based.

Multiple subqueries


It is possible to combine multiple subqueries with AND and still have them transformed to a semi-join:

  SELECT * FROM nation
  WHERE n_regionkey IN (SELECT r_regionkey FROM region
                        WHERE r_name='AFRICA') AND
        n_nationkey IN (SELECT c_nationkey FROM customer);

The explain output shows one FirstMatch strategy:


| id | select_type | table    | type | possible_keys         | key           | key_len | ref                     | rows | Extra                           |
+----+-------------+----------+------+-----------------------+---------------+---------+-------------------------+------+---------------------------------+
|  1 | PRIMARY     | region   | ALL  | PRIMARY               | NULL          | NULL    | NULL                    |    5 | Using where                     |
|  1 | PRIMARY     | nation   | ref  | PRIMARY,i_n_regionkey | i_n_regionkey | 5       | dbt3.region.r_regionkey |    2 |                                 |
|  1 | PRIMARY     | customer | ref  | i_c_nationkey         | i_c_nationkey | 5       | dbt3.nation.n_nationkey | 1115 | Using index; FirstMatch(nation) |

Nested subqueries


MySQL can process nested subqueries and transform them into one semi-join operation. Here is a slightly artificial example:

  SELECT * FROM nation
  WHERE n_nationkey IN
     (SELECT c_nationkey FROM customer
      WHERE c_acctbal IN
         (SELECT o_totalprice FROM orders
          WHERE o_orderstatus = 'P'));

From the explain output, we see a FirstMatch strategy applied to tables customer and order, jumping back to table nation when a match is found:


| id | select_type | table    | type | possible_keys | key           | key_len | ref                     | rows    | Extra                           |
+----+-------------+----------+------+---------------+---------------+---------+-------------------------+---------+---------------------------------+
|  1 | PRIMARY     | nation   | ALL  | PRIMARY       | NULL          | NULL    | NULL                    |      25 |                                 |
|  1 | PRIMARY     | customer | ref  | i_c_nationkey | i_c_nationkey | 5       | dbt3.nation.n_nationkey |    1115 |                                 |
|  1 | PRIMARY     | orders   | ALL  | NULL          | NULL          | NULL    | NULL                    | 1486350 | Using where; FirstMatch(nation) |

Want to know more?


Check out the MySQL documentation for semi-join optimization here

1 comment:

  1. "The other semi-join strategies will be handled in later blogs. How semi-join strategies can be combined with join buffering will also be handled."

    I know it hass been more than 10 years. Will there be more blogs?

    ReplyDelete