  • In this tutorial, we're going to see why MySQL creates a query plan before each query request.

  • Then, we're going to use a series of examples to show how the EXPLAIN output can be used to identify where MySQL spends time on your query and why, and we'll learn which are the important pieces of information to look for in order to spot performance issues.

  • Once we have a candidate query for optimization, we need to analyze why it is slow or why it impacts the system so much.

  • The main tool to do that is the EXPLAIN statement, which provides information about the query plan chosen by the optimizer.

  • The optimizer has to make a few decisions before executing each query.

  • For example, what is cheaper from a response time perspective?

  • Fetch the data directly from the table; go to an index and stop there, because all the columns required by the client are in the index; or get the location of the records from the index and then go to the table to fetch the actual data.

  • The first method, fetching data directly from the table, is called a full table scan.

  • This is normally the most expensive method, because all rows must be fetched from the table and checked against a condition; still, for small tables it is often the best choice.

  • In the second set of options, we access an index.

  • If the index has all the necessary data and there is no need to access the table, we have what is called a covering index.

  • However, this is less often the case, so the index is used to filter out rows and then access those rows from the table.

  • Usually, this is the cheapest way to access a table.

  • Still, if the client selects a large number of rows, this may no longer be the case.
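
  • To illustrate the covering index mentioned above (a sketch, not from the video): in the world sample database, InnoDB secondary indexes also store the primary key, so the query below can be answered from the CountryCode index alone, without touching the table.

    EXPLAIN SELECT ID, CountryCode FROM city WHERE CountryCode = 'NLD';

  • In the traditional output, a covering index shows up as "Using index" in the Extra column.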

  • Therefore, the optimizer has to make a lot of decisions based on particular database statistics before the query is executed.

  • As a result, our goal will be to observe what the optimizer thinks is the most expensive subtask, so that we can eliminate or improve that part.

  • If you have a slow query, the first thing to try is running it with EXPLAIN.

  • This will show the query plan, in other words, the list of things expected to happen when the query is executed.

  • If you instead use EXPLAIN ANALYZE before the statement, you'll get both the estimation of what the planner expected, along with what actually happened when the query ran.

  • Consider the following statement.

  • EXPLAIN DELETE FROM city;

  • This query is not executed, so it won't delete all the rows.

  • Therefore, it's safe to obtain the query plan.

  • To actually execute the query, we can use EXPLAIN ANALYZE, as we'll see later in more detail.

  • This is not only going to show the query plan for deleting those rows, it is actually going to delete them.

  • Usually, it's more difficult to compare the timings of operations when doing inserts, updates, or deletes using EXPLAIN ANALYZE.

  • This is because the underlying data will change while executing the same queries.
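
  • If you do want to profile a data-modifying statement without losing data, one approach, sketched below, is to wrap it in a transaction and roll it back (note that EXPLAIN ANALYZE support for DML statements depends on your MySQL version):

    START TRANSACTION;
    EXPLAIN ANALYZE DELETE FROM city WHERE CountryCode = 'FRA';  -- really deletes the rows
    ROLLBACK;  -- undo the delete, leaving the data unchanged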

  • Optionally, we can add the FORMAT option to specify whether we want the results returned in traditional table format, JSON, or tree-style format.

  • Keep in mind that each format will show more or less information about the query plan.

  • For instance, the JSON format is the most verbose of all.
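
  • For reference, the FORMAT option goes right after the EXPLAIN keyword, for example:

    EXPLAIN FORMAT=TRADITIONAL SELECT * FROM city;
    EXPLAIN FORMAT=JSON SELECT * FROM city;
    EXPLAIN FORMAT=TREE SELECT * FROM city;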

  • Let's see some examples to dig deeper into the query plan.

  • We will use the sample database world (link in the description), and for the first example we will execute a simple SELECT on the city table with a condition on a non-indexed column called Name.

  • Since there is no index that can be used, it will require a full table scan to evaluate the query.
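
  • The statement looks like this (the search value 'London' is just an example; the standard world schema is assumed, where city.Name has no index):

    EXPLAIN SELECT * FROM city WHERE Name = 'London';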

  • The table access type shows whether a query accesses the table using an index, a scan, and the like.

  • Since the costs associated with the various access types differ greatly, this is also one of the most important values to look for in the EXPLAIN output when determining which part of the query to work on to improve performance.

  • The output has the access type set to ALL, which is the most basic access type, because it scans all rows of the table.

  • It is also the most expensive one, and for this reason, this type is written in ALL uppercase.

  • It's estimated that 4046 rows will be examined, and for each row a WHERE clause will be applied.

  • It's expected that 10% of the rows examined will match the WHERE clause.

  • Here, the optimizer uses default values to estimate the filtering effect of various conditions, so we can't really use this filtering value to estimate whether an index is useful or not.

  • This is the traditional format; however, it doesn't show the relationship between the executed subtasks, so it's more difficult to get an overview of the query plan.

  • Which format is preferred depends on your needs.

  • For example, the traditional format is easier to use for seeing the indexes used and other basic information about the query plan, while the JSON format provides many more details.

  • The TREE style is the newest format, and is the format we're going to use for the next examples.

  • The TREE format focuses on describing how the query is executed, in terms of the relationship between the parts of the query, and also the order in which these parts are executed.

  • In this case, the EXPLAIN output is organized into a series of nodes.

  • At the lowest level, there are nodes that scan tables or search indexes.

  • Higher level nodes take the results from the lower level nodes and operate on them.

  • Here, it can be easier to understand the execution by reading the output from the inside out.

  • The TREE format is also the default format for the EXPLAIN ANALYZE statement, which was introduced in MySQL 8.0.18.

  • This TREE format output gives us a good overview of how the query is executed.

  • First, there is a table scan on the city table, and then a filter is applied for the name column.

  • Here, we also have an estimated cost for an expected 4000 rows.

  • Please note that this cost estimate is expressed in internal MySQL units.

  • For example, reading a row from disk can have an associated cost of 2, while reading the same row from memory will have a cost of 1.

  • Moving to the actual statistics, we see that the first row was read in about 4 milliseconds, and all other rows were read in roughly 5 milliseconds.

  • There was a single loop for this query, because there was no join involved.

  • In this case, the estimate was pretty accurate regarding the number of returned rows.

  • Then, these rows are passed to the second phase for filtering, where we see a slight increase in the actual time of execution.
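
  • Put together, the EXPLAIN ANALYZE output in TREE format looks roughly like this (an illustration built from the numbers in the narration; your exact costs, timings, and row counts will differ):

    -> Filter: (city.`Name` = 'London')  (cost=415 rows=405) (actual time=4.2..5.1 rows=2 loops=1)
        -> Table scan on city  (cost=415 rows=4046) (actual time=3.9..4.8 rows=4079 loops=1)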

  • The second example is similar to the first, except that the filter condition is changed to use the country code column, which has a secondary, non-unique index.

  • This should make it cheaper to access the matching rows.

  • For example, let's retrieve all French cities.

  • SELECT * FROM city WHERE CountryCode = 'FRA';

  • This time, a lookup on the CountryCode index can be used for the query.

  • It's estimated that 40 rows will be accessed, which is exactly what InnoDB reports when asked how many rows will match.

  • This is because an index will also bring some statistics with it.
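
  • In TREE format, this shows up as a single index lookup node, roughly like this (illustrative values):

    -> Index lookup on city using CountryCode (CountryCode='FRA')  (cost=14.0 rows=40) (actual time=0.1..0.3 rows=40 loops=1)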

  • For the next example, we'll use the country language table, which has a primary key with two columns, country code and language.

  • Imagine that we want to find all languages spoken in a single country.

  • In that case, we'll need to filter on the country code, but not on language.

  • A query that can be used to find all languages spoken in China is: SELECT * FROM countrylanguage WHERE CountryCode = 'CHN';

  • The index on the primary key can still be used to perform the filtering.

  • The EXPLAIN output shows this with a PRIMARY tag, and also shows which column of the index was used.

  • As always, only a leftmost prefix of the index can be used for filtering.
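
  • A quick sketch of the leftmost-prefix rule (not from the video):

    -- Can use the (CountryCode, Language) primary key: it filters on the leftmost column
    SELECT * FROM countrylanguage WHERE CountryCode = 'CHN';

    -- Cannot use the primary key for filtering: it skips the leftmost column
    SELECT * FROM countrylanguage WHERE Language = 'Chinese';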

  • For the last example, we'll use a query that mixes various features and contains multiple query blocks.

  • To save time, I'll just copy-paste the query, but you can find it in the description.

  • This query will select the top 10 smallest countries by area, and then for these countries, find the largest cities by population.
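
  • The exact query isn't shown in the subtitles, but based on the plan described next, a plausible reconstruction (the continent value is a guess; the CO and CI aliases come from the narration) is:

    SELECT ci.Name, ci.Population
    FROM (SELECT Code
          FROM country
          WHERE Continent = 'Europe'   -- hypothetical continent filter
          ORDER BY SurfaceArea
          LIMIT 10) AS co
    JOIN city ci ON ci.CountryCode = co.Code
    ORDER BY ci.Population DESC
    LIMIT 5;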

  • The query plan starts out with a subquery that uses the country table to find the 10 smallest countries by area.

  • Here we can see how the CO table, the alias of the derived table, is a materialized subquery created by first doing a table scan on the country table, then applying a filter on the continent, then sorting by surface area, and finally limiting the result to 10 rows.

  • Once the derived table has been constructed, it can be used as the first table for the join with the CI table.

  • The second part of the nested loop is simpler, as it just consists of an index lookup on the CI table using the country code index.

  • Here, the estimated cost was about 4 for an expected 17 rows.

  • There were 10 loops, one for each of the 10 countries, and each fetching an average of 2 rows for a total of 20 rows.

  • So, in this case, the estimate was not very accurate, because the query exclusively picks small countries.

  • It is estimated that the join will return 174 rows, which comes from the 10 rows in the derived table multiplied by the estimated 17.4 rows per index lookup in the CI table.

  • When the nested loop has been resolved using the inner join, the result is streamed, that is, not materialized, for the sorting, and the first 5 rows are returned.

  • The total execution time of the query was around 4 milliseconds.

  • So, what can we do with this information?

  • When looking at a plan, there are two questions that you might want to ask yourself.

  • Is the runtime shown by the EXPLAIN ANALYZE output justified for the given query?

  • If the query is slow, where does the runtime jump?

  • Looking for jumps in the execution time of the query will reveal what is really going on.

  • General advice is not possible here, because there are too many things that can cause issues, but here are some tips.

  • What we should pay attention to is whether the estimates and the actual costs differ from each other.

  • If there is a big difference, the optimizer may make poor decisions.

  • Possible causes for this difference could be that either the optimizer doesn't have up-to-date statistics, or the optimizer estimates are off for some reason.

  • Running an ANALYZE TABLE statement is definitely a good way to start.

  • This will collect statistics for the table.

  • However, under normal circumstances, you don't need to analyze the table, as this happens automatically when certain thresholds are reached.

  • For example, the index statistics take care of themselves in everyday situations, but on high-traffic tables, for example ones with lots of deletes, the statistics can become stale, and the optimizer needs more help to select the right rows.

  • In this case, we might run ANALYZE TABLE when there is no other traffic.
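
  • The statement itself is simple; it collects and stores key distribution statistics for the table:

    ANALYZE TABLE city;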

  • So, it's definitely worth considering other causes of the bad estimates.

  • Let's see such an example.

  • We'll switch to a clean database, and we'll create a simple table.

  • This table has an auto-increment primary key and two integer columns.

  • Then, we add an index on the first integer column using an ALTER TABLE test_estimates ... ADD INDEX statement.

  • Then, we'll load some data using a simple Python script in the MySQL Shell tool.

  • This script will execute the insert statement 100,000 times.

  • The integer value is also an incrementing value from 1 to 100,000.

  • After loading the data, we make sure that optimizer statistics are created.
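
  • Putting the setup together (a sketch: the table and column names are assumptions, and the recursive-CTE data load below replaces the video's Python script):

    CREATE TABLE test_estimates (
      id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      val1 INT NOT NULL,
      val2 INT NOT NULL
    );

    ALTER TABLE test_estimates ADD INDEX (val1);

    -- Load 100,000 rows with an incrementing value; SQL-only alternative to the Python loop
    SET SESSION cte_max_recursion_depth = 100000;
    INSERT INTO test_estimates (val1, val2)
    WITH RECURSIVE seq (n) AS (
      SELECT 1 UNION ALL SELECT n + 1 FROM seq WHERE n < 100000
    )
    SELECT n, n FROM seq;

    ANALYZE TABLE test_estimates;  -- make sure optimizer statistics exist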

  • And now, let's execute a query and check the estimates.

  • EXPLAIN ANALYZE SELECT * FROM test_estimates WHERE 2 * val1 < 3;

  • In many cases, MySQL might not be able to estimate the where clause properly, even though it has an index.

  • This is because it only has statistics on columns, not on expressions.

  • What we see here is a major overestimation of the data returned from the where clause.

  • Because MySQL thinks that it will need to read the whole table anyway, it considers the index to be overhead, so it goes ahead and scans the whole table, ignoring the index.

  • In this case, avoiding the expression at the database level is the best choice.

  • However, if that is not possible, adding an index on the expression will fix the statistics and will also ensure significantly better performance.

  • So, let's add an index on the expression.
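
  • In MySQL 8.0.13 and later, this can be done with a functional index; note the extra parentheses around the expression (the index name idx_expr and column val1 follow the sketch above):

    ALTER TABLE test_estimates ADD INDEX idx_expr ((2 * val1));

    -- Re-running the query, the estimates now come from the expression's statistics
    EXPLAIN ANALYZE SELECT * FROM test_estimates WHERE 2 * val1 < 3;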

  • The key takeaway from this example is to take a look at the row counters for estimates versus actual execution.

  • A large difference, that is, a couple of orders of magnitude or more, between the estimated number of rows and the actual number of rows is a sign that you need to look closer at it.

  • The optimizer chooses its plan based on the estimates, but looking at the actual execution may tell you that another plan would have been better.

  • If you run a query twice, the second will likely be much faster simply because of caching, regardless of whether the plan was better or worse.

  • This represents hot cache behavior, meaning that the data needed for the query was already in the database or the operating system caches.

  • Basically, the data was left in the cache from when it was loaded in the first place.

  • Whether your cache is hot or cold is a thing to be very careful of when analyzing queries.

  • One way to address this is to run the query repeatedly and check whether it takes the same amount of time on each run.

  • If it does, the amount of cached data is staying constant and not skewing the results.

  • In this case, it's 100% cached.

  • EXPLAIN ANALYZE is a profiling tool for your queries that will show you where MySQL spends time on your query and why.

  • It will plan the query, instrument it, and execute it while counting rows and measuring time spent at various points in the execution plan.

  • It is important to understand how the queries are actually executed based on the statistics available.

  • Once you know how to read query plans and understand how each of the underlying node types works in MySQL, you should be confident managing queries on a production database.
