How To Index For Joins With MySQL
时间:2007-03-11 来源:wenzk
From time to time I see people asking "What do I index to join these tables efficiently?" Often, someone just gives them an answer without really explaining the basic theory behind how to index for joins. The purpose of this document is to describe the basic theory behind indexing for joins with MySQL, starting first with a very simple example that demonstrates the basic principle of MySQL joins, and then applying the principle to a more complex 4 table query.
With no indexes, lets join all 3 tables as such:
Now what does this query need? Start with tblA and tblB. Find in the query where these tables are joined: AND tblB.q_num = tblA.q_num
AND tblB.se_num = tblA.se_num
AND tblB.ans = tblA.ans
MySQL needs at least one of q_num, se_num, or ans. I chose to index se_num and q_num because among all the other queries I was looking at, these columns were most often needed. Compromise is part of optimization; most professionals don't have the time to find the absolute best indexes for every single query, instead you find the best for the most common usage patterns. As we'll see later, this still pays off tremendously as, in the example of this query, we turn its performance around as night and day. Indexing (se_num, q_num) on tblB, EXPLAIN now says:
A Simple 3 Table Query
I want to make the test data as painlessly simple as possible so we can focus on the theory instead of trying to remember which table is which. Therefore, consider 3 tables: tblA, tblB, tblC. Each table has 3 columns: col1, col2, col3 (no, it's not normalized). It doesn't matter for now what the column types are, what the tables represent, or what kind of data this poor schema stores.With no indexes, lets join all 3 tables as such:
SELECTAnd EXPLAIN for the query:
*
FROM
tblA,
tblB,
tblC
WHERE
tblA.col1 = tblB.col1
AND tblA.col2 = tblC.col1;
+-------+------+---------------+------+---------+------+------+-------------+And finally, from the MySQL manual (7.2.1): The tables are listed in the output in the order that MySQL would read them while processing the query. MySQL resolves all joins using a single-sweep multi-join method. This means that MySQL reads a row from the first table, then finds a matching row in the second table, then in the third table, and so on. When all tables are processed, MySQL outputs the selected columns and backtracks through the table list until a table is found for which there are more matching rows. The next row is read from this table and the process continues with the next table. As that section of the MySQL manual says, MySQL reads the first table (tblA), then the second (tblB), then the third (tblC), etc., as the tables are output in EXPLAIN. Values from the preceding tables are used to find rows in the current table. In our case, values from tblA are used to find rows in tblB, then values from tblB are used to find values in tblC. Once one full sweep is finished (found rows in tblA, tblB, tblC), MySQL does not go back to tblA, it goes back to tblB to see if there are any more rows that match the current value from tblA. If there are, it gets that row and then finds matching rows in tblC again. The important thing to remember here is the basic principle of MySQL joins: Values from the preceding table (as listed in the output of EXPLAIN) are used to find rows in the current table.
| table | type | possible_keys | key | key_len | ref | rows | Extra |
+-------+------+---------------+------+---------+------+------+-------------+
| tblA | ALL | NULL | NULL | NULL | NULL | 1000 | |
| tblB | ALL | NULL | NULL | NULL | NULL | 1000 | Using where |
| tblC | ALL | NULL | NULL | NULL | NULL | 1000 | Using where |
+-------+------+---------------+------+---------+------+------+-------------+
From Principle To Index
Knowing MySQL wants to use values from tblA to find rows in tblB, how do we add an index to help MySQL? To help MySQL (or someone, or anything for that matter) we need to know what it needs. What MySQL needs in a join is a matter of how it is joined. That is, how you join tables is what MySQL needs (as terribly circular as that sounds). Consider the join of tblA and tblB: The two are joined by "tblA.col1 = tblB.col1" (theta-style join, though if we had done the query ANSI-style, the result would be the same). We already have the value for tblA.col1, so what MySQL needs is a value for tblB.col1 so it can finish the equation. Therefore, if MySQL needs tblB.col1, then we should index tblB.col1. Doing so, here is the new EXPLAIN for the same query:+-------+------+---------------+----------+---------+-----------+------+-------------+As we see, MySQL now uses the ndx_col1 key to join tblB to tblA. That is, when MySQL goes looking for rows in tblB, instead of table scanning like it did before, it uses the value of tbbA.col1 with the ndx_col1 key to directly fetch matching rows. This is why the ref column for tblB says "tblA.col1". tblC is still table scanned, but that can be fixed the same way we fixed the join of tblA and tblB, by looking at what MySQL needs: From the part of the query that joins these tables, "tblA.col2 = tblC.col1," we see it needs tblC.col1 because we already have tblA.col2. Indexing this column, no surprise EXPLAIN now says:
| table | type | possible_keys | key | key_len | ref | rows | Extra |
+-------+------+---------------+----------+---------+-----------+------+-------------+
| tblA | ALL | NULL | NULL | NULL | NULL | 1000 | |
| tblB | ref | ndx_col1 | ndx_col1 | 5 | tblA.col1 | 1 | Using where |
| tblC | ALL | NULL | NULL | NULL | NULL | 1000 | Using where |
+-------+------+---------------+----------+---------+-----------+------+-------------+
+-------+------+---------------+----------+---------+-----------+------+-------------+
| table | type | possible_keys | key | key_len | ref | rows | Extra |
+-------+------+---------------+----------+---------+-----------+------+-------------+
| tblA | ALL | NULL | NULL | NULL | NULL | 1000 | |
| tblB | ref | ndx_col1 | ndx_col1 | 5 | tblA.col1 | 1 | Using where |
| tblC | ref | ndx_col1 | ndx_col1 | 5 | tblA.col2 | 1 | Using where |
+-------+------+---------------+----------+---------+-----------+------+-------------+
The Difficult Part?
You'll probably never see a query like the one we've been using in the real world. You're more likely to see something like:SELECTAt first that seems like a daunting query: 4 tables, an aggregate function, 9 WHERE conditions, and a GROUP BY. The great thing about EXPLAIN is that we can ignore all this for now, and simply approach it two tables at a time as we did before, determining at each step what MySQL needs. This is a real query actually, all tables and columns renamed to protect the identity of its origin. Before I started working it, EXPLAIN said:
COUNT(tblB.a_id) as correct,
tblA.type,
tblA.se_type
FROM
tblA,
tblB,
tblC,
tblD
WHERE
tblA.ex_id = tblC.ex_id
AND tblC.st_ex_id = tblB.st_ex_id
AND tblB.q_num = tblA.q_num
AND tblB.se_num = tblA.se_num
AND tblD.ex_id = tblA.ex_id
AND tblD.exp <> tblB.se_num
AND tblB.ans = tblA.ans
AND tblA.ex_id = 1001
AND tblC.r_id = 542
GROUP BY
tblA.type,
tblA.se_type;
+-------+--------+---------------+---------+---------+---------------+-------+----------------------------------------------+First a word on determining the impact of a join: Result sets. A result set is, obviously, the set of results from a query. For joins, an estimated way to figure this is to multiple the number of rows MySQL predicts it will need to read for each table. As an estimate, this is more towards the worst case end of the scale, since other WHERE conditions will almost always dramatically reduce the real number of rows the query produces. But for this query, the result set is 94 million rows. This is why indexless joins are so dangerous; a few thousand rows times a few thousand more and you're up in millions already.
| table | type | possible_keys | key | key_len | ref | rows | Extra |
+-------+--------+---------------+---------+---------+---------------+-------+----------------------------------------------+
| tblA | ALL | NULL | NULL | NULL | NULL | 1080 | Using where; Using temporary; Using filesort |
| tblB | ALL | NULL | NULL | NULL | NULL | 87189 | Using where |
| tblC | eq_ref | PRIMARY | PRIMARY | 4 | tblB.st_ex_id | 1 | Using where |
| tblD | eq_ref | PRIMARY | PRIMARY | 4 | tblA.ex_id | 1 | Using where |
+-------+--------+---------------+---------+---------+---------------+-------+----------------------------------------------+
Now what does this query need? Start with tblA and tblB. Find in the query where these tables are joined: AND tblB.q_num = tblA.q_num
AND tblB.se_num = tblA.se_num
AND tblB.ans = tblA.ans
MySQL needs at least one of q_num, se_num, or ans. I chose to index se_num and q_num because among all the other queries I was looking at, these columns were most often needed. Compromise is part of optimization; most professionals don't have the time to find the absolute best indexes for every single query, instead you find the best for the most common usage patterns. As we'll see later, this still pays off tremendously as, in the example of this query, we turn its performance around as night and day. Indexing (se_num, q_num) on tblB, EXPLAIN now says:
+-------+--------+---------------+-------------+---------+------------------------+------+----------------------------------------------+The result set of the query has fallen 99.3% to 692,280 rows. But why stop there? We can easily get rid of tblA's table scan. Since it's the first table, we're not really indexing for a join because we just did that for the join of tblB to tblA. More often, indexing for the first table in a join is just like indexing for that table if it were all alone. In which case, you look at the query to see if the table is being restricted in any way. In this case we get lucky and tblA is: "AND tblA.ex_id = 1001". As we learned way back in Optimizing Queries Case 1: Basic Indexes, all we have to do is index column ex_id:
| table | type | possible_keys | key | key_len | ref | rows | Extra |
+-------+--------+---------------+-------------+---------+------------------------+------+----------------------------------------------+
| tblA | ALL | NULL | NULL | NULL | NULL | 1080 | Using where; Using temporary; Using filesort |
| tblB | ref | ndx_secn_qn | ndx_secn_qn | 2 | tblA.se_num,tblA.q_num | 641 | Using where |
| tblC | eq_ref | PRIMARY | PRIMARY | 4 | tblB.st_ex_id | 1 | Using where |
| tblD | eq_ref | PRIMARY | PRIMARY | 4 | tblA.ex_id | 1 | Using where |
+-------+--------+---------------+-------------+---------+------------------------+------+----------------------------------------------+
+-------+--------+---------------+-------------+---------+------------------------+------+----------------------------------------------+Now the query's result set is 641 rows! Down from 94 million, you can practically call that a 100% decrease. Studying the query further, we may be able to get rid of the temp table and filesort, but overall the query is profoundly faster, and has served the purpose of demonstrating how to index for joins well. Despite initially looking challenging, we saw that if you take it two tables at a time, and isolate and index what MySQL needs, indexing for joins isn't so difficult.
| table | type | possible_keys | key | key_len | ref | rows | Extra |
+-------+--------+---------------+-------------+---------+------------------------+------+----------------------------------------------+
| tblA | ref | ndx_ex_id | ndx_ex_id | 4 | const | 1 | Using where; Using temporary; Using filesort |
| tblB | ref | ndx_secn_qn | ndx_secn_qn | 2 | tblA.se_num,tblA.q_num | 641 | Using where |
| tblC | eq_ref | PRIMARY | PRIMARY | 4 | tblB.st_ex_id | 1 | Using where |
| tblD | eq_ref | PRIMARY | PRIMARY | 4 | tblA.ex_id | 1 | Using where |
+-------+--------+---------------+-------------+---------+------------------------+------+----------------------------------------------+
Conclusion
Making simple work of complex joins and knowing where to index for joins is a matter of realizing two things:- No matter how complex the query, you only have to approach the join two tables at a time in the order the tables are listed by EXPLAIN.
- Values from the preceding table are already found; the work lies in helping MySQL use these values with indexes on the current table to find rows.
相关阅读 更多 +