Saying is one thing and doing is another.—Michel Eyquem de Montaigne Essays, Bk. II, Ch. 31
This chapter covers two classes of tuning techniques for
controlling execution plans: universal techniques that work independently of
your choice of database vendor, and techniques that are database-specific.
Database-specific techniques are covered well by your own vendor's
documentation, so you might know them well already. In general, you need both
types of techniques to get precisely the execution plan you want. Each of the
vendor-specific sections stands alone, even repeating material from the other
sections when applicable. Therefore, you can skip the vendor sections that you
don't need.
Much ado has been made over controlling execution plans,
sometimes with elaborate tools. This chapter focuses on the simplest ways to
control plans, with a strong emphasis on getting the types of plans you will
need to optimize real-world SQL. I have found that when you know which execution
plan you want in advance, getting it is easy and requires only simple tools.
4.1 Universal Techniques for Controlling Plans
This section describes a
number of database-independent techniques you can use to control execution
plans. The techniques are good for the following purposes:
-
Enabling use of the index you want
-
Preventing use of the wrong indexes
-
Enabling the join order you want
-
Preventing join orders you do not want
-
Choosing the order to execute outer queries and subqueries
-
Providing the cost-based optimizer with good data
-
Fooling the cost-based optimizer with bad data
These vendor-independent techniques often offer an alternative
route to ends you could also reach with vendor-specific methods. When
you have a choice, the vendor-specific methods are usually cleaner. However,
some problems are solvable only by these universal techniques, which offer
solutions that can sometimes work on SQL that is intended to run on multiple
vendor databases.
4.1.1 Enabling Use of the Index You Want
To enable efficient use of an index, you
need a reasonably selective condition on the leading column (or only column) of
that index. The condition must also be expressed in a way that enables the
database to establish a reasonably narrow index range for the index values. The
ideal form this takes is:
SomeAlias.Leading_Indexed_Column = <Expression>
In less ideal cases, the comparison is with some range of
values, using BETWEEN, LIKE, <, >,
<=, or >=. These range comparisons also potentially
enable use of the index, but the index range is likely to be larger and the
resulting query therefore slower. If the index range is too large, the optimizer
might conclude that the index is not worth using and choose another path to the
data. When you combine equalities and range conditions for multicolumn indexes,
you should prefer indexes that lead with the columns that have equality
conditions and finish with columns that have range conditions. Note that the
left side of the comparison simply names the column, with no function around the
column, and no expression (such as addition) using the column. Use of a function, a type conversion, or an
arithmetic expression on the side with the indexed column will generally disable
use of that index.
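For example, given a hypothetical multicolumn index on Orders(Status_Code, Order_Date), a minimal sketch of an index-friendly filter pairs an equality on the leading column with a range on the trailing column:
-- Assumes a hypothetical index Orders(Status_Code, Order_Date); both
-- indexed columns appear bare, so the equality plus the range define
-- a narrow index range.
SELECT O.Order_ID
FROM Orders O
WHERE O.Status_Code = 'OP'
  AND O.Order_Date >= :1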
Type
conversions are a particularly subtle way that SQL sometimes disables use of an
index. DB2 returns an error if you compare two expressions with incompatible
types. SQL Server prefers to perform the implicit conversion on the side of the
comparison that does not disable the index. Oracle implicitly converts
character-type expressions to the type of the other side, even when this
disables index use. For example, consider this expression:
P.Phone_Number=5551212
If Phone_Number were a character-type column, this
would likely evaluate internally on Oracle and SQL Server as:
On Oracle: TO_NUMBER(P.Phone_Number)=5551212
On SQL Server: P.Phone_Number=CAST(5551212 AS VARCHAR)
SQL Server preserves indexed access to the column. On Oracle,
the implicit use of TO_NUMBER( ) disables use of the index just as
surely as if you made the expression explicit. (The only real difference is that
the problem is harder to find in the implicit form.) The same problem can plague
index use for joins, as well as for single-table conditions. For example,
consider the join:
P.Phone_Number=C.Contact_Number
If Contact_Number were a number type and
Phone_Number were a character type, the implicit conversion on Oracle
would prevent an index-driven nested-loops join from C to P. A
join in the other direction would be unhindered.
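When you cannot change the column types, one workaround (a sketch, not part of the original example) is to make the conversion explicit on the side opposite the indexed column, leaving the indexed column bare:
-- Assumes Phone_Number is a character type and Contact_Number a number
-- type; converting C.Contact_Number explicitly preserves the index on
-- P.Phone_Number for a nested-loops join from C to P.
... AND P.Phone_Number = TO_CHAR(C.Contact_Number) ...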
The expression opposite the indexed column reference can be
arbitrarily complex. However, it must not reference columns in the same alias
with the indexed column. For example, consider the condition:
P.Phone_Number=P.Area_Code||'5551212'
The database cannot drive into the index on
P.Phone_Number with this condition, because the database must reach
alias P before it can evaluate the expression on the right side. This
chicken-and-egg problem prevents identifying (with the index) the subset of the
table that meets this condition until after the database examines the whole
table.
The final way that SQL
often disables index use is with conditions combined with OR. For
example, consider the query:
SELECT ... FROM Order_Details D, ... WHERE ... AND (D.Order_ID=:1 OR :1 IS NULL) AND ...
In this example, the database can reach Order_Details
through an index on Order_ID if the bind variable :1 happens
to be nonnull. But if :1 is bound to a null value, there is no
restriction at all on Order_ID and thus no use for that index. Since
the database cannot tell what :1 will be bound to when it parses the
SQL and prepares the plan, it will find no good opportunity to use the index. In
this case, the solution is to create a two-part plan, with each part optimized
for one of the cases:
SELECT ... FROM Order_Details D, ...
WHERE ...
  AND D.Order_ID=:1
  AND :1 IS NOT NULL
  AND ...
UNION ALL
SELECT ... FROM Order_Details D, ...
WHERE ...
  AND :1 IS NULL
  AND ...
When you view the execution plan for this query, it shows both
indexed access through the index on Order_Details(Order_ID) and
full-table-scan access to Order_Details. This might appear to be the
worst of both worlds, but you are saved by the conditions:
AND :1 IS NOT NULL
...
AND :1 IS NULL
These conditions make no reference at all to any data in the
database, so the database can and does evaluate them before it even begins
reading data for that half of the combined statement. Therefore, it never
actually executes the full table scan when :1 is not null, and it never
actually executes an indexed read (or any other part of the execution plan for
the first half of the query) when :1 is null. This amounts to a method
to branch your execution plan depending on conditions on the bind variables, the
variables that determine which data is available to drive the query. The only
catch is that you must ensure that the conditions on the bind variables are
mutually exclusive, so that exactly one of the branches actually returns data.
For example, if you have another bind variable to provide
Customer_Name, you might put together a query like this:
SELECT ... FROM Order_Details D, Customers C, ...
WHERE ...
  AND D.Order_ID=:1 AND :1 IS NOT NULL
  AND (C.Customer_Name=:2 OR :2 IS NULL)
  AND ...
UNION ALL
SELECT ... FROM Order_Details D, Customers C, ...
WHERE ...
  AND :1 IS NULL AND :2 IS NOT NULL
  AND C.Customer_Name=:2
  AND ...
UNION ALL
SELECT ... FROM Order_Details D, Customers C, ...
WHERE ...
  AND :1 IS NULL AND :2 IS NULL
  AND ...
This could support a three-part plan, in which the database
would:
-
Drive into Order_Details on the index on Order_ID (your first choice), when possible.
-
Otherwise, drive into Customers on the index on Customer_Name (your second choice) when it has no Order_ID specified but has a customer name.
-
Otherwise, just get all the rows, probably beginning with a full table scan, when it has no selective conditions at all.
In any case, the conditions on the bind variables in the three
parts are contrived to be mutually exclusive:
AND :1 IS NOT NULL
...
AND :1 IS NULL AND :2 IS NOT NULL
...
AND :1 IS NULL AND :2 IS NULL
4.1.2 Preventing Use of the Wrong Indexes
Join expressions are usually simple, usually
between consistent types, and usually between numerical IDs. Conditions on the
driving table are usually simple and compatible with index use. A more frequent
problem than enabling use of the right index is preventing use of the wrong
indexes. In many queries, there are multiple single-table conditions that are
capable of reaching multiple indexes, but you want to use only a specific one of
those indexes. Join conditions are usually expressed to allow index-driven joins
in either direction, although only one of the possible join directions turns out
to be optimal. Occasionally, you'll prefer to disable use of an index on a join
altogether, to force a hash or sort-merge join.
To disable use of an
index, create the simplest possible expression around the indexed column
reference. For example, you should prevent use of an index on
Status_Code for the unselective condition on closed orders, as the
number of closed orders will eclipse open orders as you do more and more
business:
O.Status_Code='CL'
Since Status_Code is a character-type column, a simple
expression to disable index use without changing the results would simply
concatenate an empty string to the end of Status_Code:
On Oracle and DB2: O.Status_Code||''='CL'
On SQL Server: O.Status_Code+''='CL'
For number-type columns, you can add 0:
O.Region_ID+0=137
All databases have some sort of function that returns the
first argument when it is not null and otherwise returns the second
argument. On Oracle, the function is NVL( ). On SQL
Server and DB2, it is COALESCE( ).
If both arguments are the same column, the function always returns the
same result as the bare column, regardless of the column type. Therefore, this
makes a handy recipe to deactivate index use regardless of column type:
On Oracle: NVL(O.Order_Date,O.Order_Date)=<Expression>
On DB2 and SQL Server: COALESCE(O.Order_Date,O.Order_Date)=<Expression>
In a join condition, a join that disables an indexed path to
O.Region_ID (but not to R.Region_ID) could look like this:
O.Region_ID+0=R.Region_ID
Using the type-independent approach, this same join would look
like this:
NVL(O.Region_ID,O.Region_ID)=R.Region_ID
4.1.3 Enabling the Join Order You Want
Apart from unintentionally disabled indexes,
there are two issues that sometimes disable desired join orders:
-
Outer joins
-
Missing redundant join conditions
4.1.3.1 Outer joins
Consider the following query, with the outer join expressed in Oracle's older notation:
SELECT ... FROM Employees E, Locations L WHERE E.Location_ID=L.Location_ID(+)
or in the newer, universal notation:
SELECT ... FROM Employees E LEFT OUTER JOIN Locations L ON E.Location_ID=L.Location_ID
This query requests employee records with their matching
locations, when employees have locations; otherwise, null location data is used
for employees that have no matching locations. Based on the request, it is clear
that the query cannot effectively drive from Locations to
Employees, since even employees without locations are needed. Consider
a case in which this query is just a template to which an application adds
conditions that depend on search criteria provided by an end user. If the end
user wants to see employees for a particular location, the application might
create this query:
SELECT ... FROM Employees E
LEFT OUTER JOIN Locations L ON E.Location_ID=L.Location_ID
WHERE L.Description='Headquarters'
In the outer case of the join from Employees to
Locations, L.Description will be assigned a generated value of
null, and the condition on L.Description will be
false. Only the inner case of the join will return rows that might meet
the restriction on L.Description, so now it makes perfect sense to
drive the query in the other join order, from Locations to
Employees. However, the existence of the outer join often prevents
automated optimizers from allowing this reversed order on the outer joins, so
you need to make the join explicitly an inner join to get the reversed join
direction:
SELECT ... FROM Employees E
INNER JOIN Locations L ON E.Location_ID=L.Location_ID
WHERE L.Description='Headquarters'
4.1.3.2 Missing redundant join conditions
Normally, between any number of tables, the
join count is the number of tables minus one. For example, between three tables,
you expect to find two joins. Occasionally, a query permits an extra, redundant
join. For example, if you have an Addresses table that contains all
addresses significant to the company, it might have a one-to-zero or one-to-one
relationship with the earlier Locations table, which contains only
locations owned by the company and which references Addresses through a
matching primary key. In this case, you might find a query like the
following:
SELECT ... FROM Employees E, Locations L, Addresses A
WHERE E.Location_ID=L.Location_ID
  AND E.Location_ID=A.Address_ID
  AND A.ZIP_Code=95628
By transitivity (if a=b and b=c, then a=c), you can deduce
that the condition L.Location_ID=A.Address_ID must be true for
all rows this query would return. However, that condition is not explicit in the
query, and not all databases will deduce it and fill it in if it is left out.
The best plan, in this case, will likely begin with all addresses within that
ZIP Code and immediately join to Locations to discard all addresses
except the one or two that correspond to company locations, before joining to
Employees. Since that join order requires the missing join condition to
support an indexed path from Addresses to Locations, you
should make the missing join condition explicit:
SELECT ... FROM Employees E, Locations L, Addresses A
WHERE E.Location_ID=L.Location_ID
  AND E.Location_ID=A.Address_ID
  AND L.Location_ID=A.Address_ID
  AND A.ZIP_Code=95628
Since you do not want to follow the join from
Addresses to Employees directly, you could also remove, if
necessary, the redundant join condition E.Location_ID=A.Address_ID, to
discourage that unwanted join operation.
4.1.4 Preventing Join Orders You Do Not Want
Forcing joins in the direction you want,
using the earlier techniques for preventing use of the wrong indexes, will
prevent many undesired join orders. What do you do when you want the database to
follow a particular join direction eventually, but not too early in the
execution plan? You cannot afford to disable an index, because you must use that
index eventually, just not too early. Consider the following two joins, in which
you want to start the query with reads of T1 and then join to
T2 before joining to T3:
... AND T1.Key2_ID=T2.Key2_ID AND T1.Key3_ID=T3.Key3_ID ...
Here, you want to follow nested loops into both T2 and
T3, following indexes in the keys mentioned and reaching T2
before reaching T3. To postpone the join you want to happen later, make
it depend (or at least appear to depend) on data from the join that must
happen earlier. Here is a solution:
... AND T1.Key2_ID=T2.Key2_ID AND T1.Key3_ID+0*T2.Key2_ID=T3.Key3_ID ...
You and I know that the second version is logically equivalent
to the first. However, the database just finds an expression on the left side of
the second join that depends on both T1 and T2 (not
recognizing that no value from T2 can change the result), so it won't
try to perform the join to T3 until after T2.
If necessary, you can string together joins like this to
completely constrain a join order. For each join after the first, add to
the join expression a logically irrelevant component that references one
of the columns added in the preceding join. For example, if you want to reach tables
T1 through T5 in numerical order, you can use the following.
Notice that the join condition for the T3 table uses the expression
0*T2.Key2_ID to force the join to T2 to occur first. Likewise,
the join condition for the T4 table uses 0*T3.Key3_ID to force
T3 to be joined first.
... AND T1.Key2_ID=T2.Key2_ID
AND T1.Key3_ID+0*T2.Key2_ID=T3.Key3_ID
AND T1.Key4_ID+0*T3.Key3_ID=T4.Key4_ID
AND T1.Key5_ID+0*T4.Key4_ID=T5.Key5_ID ...
I'll apply this method to a concrete example. Consider the
following SQL, adapted from Chapter 3:
SELECT E.First_Name, E.Last_Name, E.Salary, LE.Description,
       M.First_Name, M.Last_Name, LM.Description
FROM Locations LE, Locations LM, Employees M, Employees E
WHERE E.Last_Name = 'Johnson'
  AND E.Manager_ID=M.Employee_ID
  AND E.Location_ID=LE.Location_ID
  AND M.Location_ID=LM.Location_ID
  AND LE.Description='Dallas'
Assume that you have an execution plan that drives from the
index on the employee's last name, but you find that the join to the employee's
location (alias LE) to discard employees at locations other than
Dallas is unfortunately happening last, after the other joins (to
M and LM). You should join to LE immediately from
E, to minimize the number of rows you need to join to the other two
tables. Starting from E, the join to LM is not immediately
possible, so if you prevent the join to M before LE, you
should get the join order you want. Here's how:
SELECT E.First_Name, E.Last_Name, E.Salary, LE.Description,
       M.First_Name, M.Last_Name, LM.Description
FROM Locations LE, Locations LM, Employees M, Employees E
WHERE E.Last_Name = 'Johnson'
  AND E.Manager_ID+0*LE.Location_ID=M.Employee_ID
  AND E.Location_ID=LE.Location_ID
  AND M.Location_ID=LM.Location_ID
  AND LE.Description='Dallas'
The key here is that I've made the join to M dependent
on the value from LE. The expression 0*LE.Location_ID forces
the optimizer to join to LE before M. Because of the
multiply-by-zero, the added expression has no effect on the results returned by
the query.
4.1.5 Forcing Execution Order for Outer Queries and Subqueries
Most
queries with subqueries can logically drive from either the outer query or the
subquery. Depending on the selectivity of the subquery condition, either choice
can be best. The choice generally arises for queries with EXISTS or IN conditions.
You can always convert between an EXISTS condition on a correlated
subquery and the equivalent IN condition on a noncorrelated subquery.
For example, you can convert this:
SELECT ... FROM Departments D WHERE EXISTS (SELECT NULL FROM Employees E WHERE E.Department_ID=D.Department_ID)
to this:
SELECT ... FROM Departments D WHERE D.Department_ID IN (SELECT E.Department_ID FROM Employees E)
The first form implies that the database drives from the outer
query to the subquery. For each row returned by the outer query, the database
executes the join in the subquery. The second form implies that you begin with
the list of distinct departments that have employees, as found in the
noncorrelated subquery, and drive from that list into the matching list of such
departments in the outer query. Sometimes, the database itself follows this
implied join order, although some databases can make the conversion internally
if their optimizer finds that the alternate order is better. To make your own
SQL more readable and to make it work well regardless of whether your database
can convert the forms internally, use the form that implies the order you want.
To force that order even when the database could make the conversion, use the
same join-direction-forcing technique used in Section 4.1.4. Thus, an EXISTS
condition that forces the outer query to execute first would look like this:
SELECT ... FROM Departments D WHERE EXISTS (SELECT NULL FROM Employees E WHERE E.Department_ID=D.Department_ID+0)
For the contrary order, an IN condition that forces
the implied driving order from the subquery to the outer query would look like
this:
SELECT ... FROM Departments D WHERE D.Department_ID IN (SELECT E.Department_ID+0 FROM Employees E)
You can have several subqueries in which the database either
must drive from the outer query to the subquery (such as NOT EXISTS
subqueries) or should drive in that order. Such a case implies a choice of the
order of execution of the subqueries. You can also have choices about whether to
execute subqueries after completing the outer query, or at the first
opportunity, as soon as the correlation join is possible, or at some point
between these extremes.
The first tactic for controlling the order of subquery
execution is simply to list the subqueries in order in the WHERE clause
(i.e., the top subquery to be executed should be listed first). This is one of
the few times when WHERE-clause order seems to matter.
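For example, if you wanted a hypothetical unshipped-details check to execute before an (equally hypothetical) credit-hold check, you would list it first:
-- Both subqueries are hypothetical; the subquery listed first in the
-- WHERE clause is the one you intend the database to execute first.
SELECT ...
FROM Orders O
WHERE O.Status_Code = 'OP'
  AND EXISTS (SELECT NULL FROM Order_Details OD
              WHERE OD.Order_ID = O.Order_ID
                AND OD.Shipped_Flag = 'N')
  AND NOT EXISTS (SELECT NULL FROM Credit_Holds CH
                  WHERE CH.Customer_ID = O.Customer_ID)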
Rarely, the database will execute a subquery sooner than you
would like. The same tactic for postponing joins (described in Section 4.1.4) works for correlation joins, the joins in subqueries that
correlate the subqueries to the outer queries. For example, consider this
query:
SELECT ...
FROM Orders O, Customers C, Regions R
WHERE O.Status_Code='OP'
  AND O.Customer_ID=C.Customer_ID
  AND C.Customer_Type_Code='GOV'
  AND C.Region_ID=R.Region_ID
  AND EXISTS (SELECT NULL FROM Order_Details OD
              WHERE O.Order_ID=OD.Order_ID
                AND OD.Shipped_Flag='Y')
For this query you might find that the subquery runs as soon as
you reach the driving Orders table, but you might wish to perform the
join to Customers first, to discard nongovernmental orders, before you
take the expense of the subquery execution. In this case, this would be the
transformation to postpone the correlation join:
SELECT ...
FROM Orders O, Customers C, Regions R
WHERE O.Status_Code='OP'
  AND O.Customer_ID=C.Customer_ID
  AND C.Customer_Type_Code='GOV'
  AND C.Region_ID=R.Region_ID
  AND EXISTS (SELECT NULL FROM Order_Details OD
              WHERE O.Order_ID+0*C.Customer_ID=OD.Order_ID
                AND OD.Shipped_Flag='Y')
Notice the addition of +0*C.Customer_ID to the
subquery's WHERE clause. This ensures the join to Customers
occurs first, before the subquery executes.
4.1.6 Providing the Cost-Based Optimizer with Good Data
On any cost-based optimizer (that is, for
any query except one running on the Oracle rule-based optimizer, since only Oracle has
a rule-based optimizer), the second most common source of poor execution plans
(after missing indexes) is missing statistics on the tables, columns, and indexes involved in the
query. In all, cost-based optimizers do a fairly good job of finding the best
plan without help when they have good information to begin with. However, when
they are missing information—for example, because a table or index has been
rebuilt without regenerating statistics for that object—they tend to make
terrible assumptions.
If you are running on any database except Oracle, or if you are
on Oracle's cost-based optimizer (as is most common and as Oracle recommends)
and not forcing the rule-based optimizer, the first thing you should try if you
are not getting the execution plan you want is to regenerate statistics on every
table and index relevant to the query. Standard statistics will usually suffice
to get reasonable execution plans.
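As a sketch of what regenerating standard statistics looks like, here are typical commands on each vendor (the schema and table names are assumed, and exact options vary by version):
-- Oracle: gather table statistics and cascade to its indexes
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS('Appl_Prod', 'Orders', CASCADE => TRUE);
END;
/
-- DB2: collect statistics on the table and all its indexes
RUNSTATS ON TABLE Appl_Prod.Orders AND INDEXES ALL;
-- SQL Server: refresh statistics on the table
UPDATE STATISTICS Orders;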
Cost-based optimizers usually assume that data is uniformly
distributed. For example, if the optimizer statistics show a table of 1,000,000
rows with 50,000 distinct values for some indexed foreign key, the database will
optimize on the assumption that every value of that key will match exactly 20
rows. For most indexed columns, like foreign keys, this assumption of a uniform
data distribution works well. However, some columns have highly skewed
distributions, such as status, code, or type columns, or foreign keys to status
or type tables. For example, consider this query:
SELECT ... FROM Orders WHERE Status_Code = 'OP'
There might only be three or four values of
Status_Code across a 1,000,000-row Orders table, but if
'OP' means this is an open order, not yet fulfilled or cancelled, this
condition is far more selective than the optimizer would expect based solely on
the number of distinct values. If the column had an index, the optimizer might
never use that index if it knew only the small number of distinct indexed
values. However, on some databases, you can generate added statistics that let
the database know not only the number of distinct values but also the
distribution of those values, a necessary step when you have such highly skewed
distributions.
4.1.7 Fooling the Cost-Based Optimizer with Incorrect Data
This last technique is
dangerous, and I recommend it only as a last resort. Sometimes, you want to
simulate a large database on a small, development database. If you can
extrapolate (or, better, measure from an actual database) statistics that apply
to a large database, you can manually modify the data-dictionary tables that
store those statistics for the optimizer, to fool the optimizer into thinking it
is working with a large database. The small database will have statistics that
show large tables with many distinct values on most indexes. This is a handy way
to see execution plans that will apply to production volumes when you have only
a test database with toy data volumes. For such toy-sized databases, there is no
great risk to this approach. On production databases, the optimizer will
occasionally make better choices if it has the wrong data, usually if it has
data that exaggerates the selectivity of desired indexes or that exaggerates the
size of a table when a full table scan is undesirable.
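On Oracle, for instance, a sketch of such a statistics transplant could use the DBMS_STATS setter procedures; the schema, object names, and row counts here are illustrative assumptions, not a recommendation:
-- A sketch only: make a toy test table look like 1,000,000 rows
-- to the optimizer, with 4 distinct values in the named index.
BEGIN
  DBMS_STATS.SET_TABLE_STATS('Appl_Test', 'Orders',
                             numrows => 1000000, numblks => 20000);
  DBMS_STATS.SET_INDEX_STATS('Appl_Test', 'Order_Stts_Code',
                             numrows => 1000000, numdist => 4);
END;
/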
Imagine reversing the logic the optimizer follows: ask "What
would I need to believe about the tables and indexes of this query to find an
alternative plan (the alternative that you, the human optimizer, want) much more
attractive?" It is not hard to fool the optimizer into doing what you want
rather than what it would choose on its own, if you lie to it about the
statistics. However, on production systems, this is dangerous in several
ways:
-
As soon as anyone regenerates statistics for the tables or indexes, the optimizer will revert to the original error, unless the manual statistics-tweak is reapplied. You will have to rigorously control statistics generation to prevent this.
-
As soon as the database optimizer improves—with the next release, perhaps—it is denied the chance to exploit those improvements with correct data.
-
Most importantly, every other query against the tables and indexes with false statistics is at risk and will potentially be harmed, just to help the one query you wanted to tune when you fudged the statistics.
I have never needed to play this card to get an adequately
optimized plan on Oracle, SQL Server, or DB2, and I recommend you avoid it if
possible.
4.2 Controlling Plans on Oracle
Oracle currently offers two completely different optimizers,
the rule-based optimizer (RBO) and the cost-based optimizer (CBO), and the methods
for tuning on each differ.
The RBO is Oracle's original automated optimizer, back from the
days of Oracle Version 6 and earlier. By rule-based, Oracle means that the optimizer uses only
fixed properties of the tables, indexes, and SQL to guess an optimum execution
plan from a set of simple rules of thumb (or heuristics) built into the automated optimizer. The RBO
uses no data about the sizes of the tables or indexes, or about the distribution
of data within those objects. It does use data on the fixed properties of the
indexes: whether they are unique, which columns they cover, in which order, and
how well those match up with the most selective-looking filter conditions and
joins in the SQL. As tables grow and data distributions change, the RBO should
go right on delivering the same plan indefinitely, as long as you don't alter
the indexes (for example, from unique to nonunique) or change the table
structure (for example, from an ordinary table to a partitioned table). However,
at some future time, perhaps even in Oracle Database 10g, Oracle will drop all support for the rule-based
optimizer, and cost-based optimization will become your only choice.
Since Oracle7, the RBO has been even more stable than before,
because Oracle chose to freeze the RBO code beginning with Oracle7, except for
rare, slight changes necessary to deliver functionally correct (as opposed to
necessarily optimum) results. Therefore, an execution plan that is correct on
the RBO today will likely stay unchanged until Oracle drops the RBO altogether.
This is appealing from the perspective of stability, although the dark side of
this stability is that the execution plans never get any better either.
Execution plans on the RBO never change to adapt to changing
data distributions, and this is often cited as an argument to switch to the CBO.
However, in my own experience, data-distribution change is the least of the
reasons for cost-based optimization. In over 10 years, I have yet to find a
single case in which it was important to use different execution plans for
different real-world data distributions with the same SQL.
Another argument cited in favor of the CBO is that it can
deliver parallel execution
plans, plans that can bring multiple processors to bear on the SQL
statement at once. I have not found this to be a compelling argument, since I
have yet to find a real-world case in which the optimum SQL, with the optimum
database design, required parallel execution for adequate performance. I expect
some such cases exist in data-warehousing environments, which are not where most
of my experience lies, I admit, but almost all cases in which parallel execution
plans appear to shine are really covering up some mistake in database design,
indexing, or application design, compensating for design deficiencies with
horsepower. That, by itself, would not be such a bad thing; extra horsepower
might be cheaper than fixing the application. However, parallel plans are
usually in service of large batch processes, competing heavily for resources
with online processes that are more critical to end users. Therefore, parallel
plans often rob needed resources from other processes that are more
critical.
These are the strongest arguments against using the RBO:
-
It will become unavailable in some future release, perhaps during Oracle Database 10g, and you will not be able to use an older release forever.
-
The CBO keeps getting better, while the RBO is stuck with all the old problems it has ever had.
-
The CBO has a huge inherent advantage in the information available to it to calculate the best plan.
-
The RBO cannot take advantage of features created since the CBO appeared in Oracle7, and in most cases the RBO will simply push queries that involve newer object types, such as bit-mapped indexes, off to the CBO. (See the following section, Section 4.2.1, for details about which features the RBO cannot handle.)
That said, the RBO does a surprisingly good job; its heuristics
are well designed to get along with the tiny amount of information that the RBO
uses to guess the best plan. In Chapter 6, I will
describe properties of what I call a robust
execution plan, one that behaves well across a wide range of data distributions.
The RBO almost always delivers a robust plan when the necessary indexes are
available and when the developer has not prevented use of an index with some
index-disabling expression, as discussed earlier in this chapter. Given the
right indexes, you can almost always get the best robust plan on either
optimizer, with manual tuning. With automated tuning, the biggest advantage of
the CBO is that it is more resourceful when dealing with imperfect indexing and
nonoptimally written SQL; more often, it delivers at least an adequate plan in
these cases, without manual tuning. When more than one robust plan is possible,
the CBO is also more likely to find the best robust plan, while the RBO will
pick one without knowing relative costs, unless you manually tune the SQL.
4.2.1 Controlling the Choice of Oracle Optimizer
It is unrealistic to optimize Oracle queries
simultaneously for both the rule-based and the cost-based optimizers. Therefore,
you should understand the factors that lead Oracle to choose which optimizer it
applies, so that you can control those factors and get the optimizer you
choose.
The RBO cannot handle certain object types and object
properties that did not yet exist when Oracle froze the RBO code. However,
rather than simply have its code error out, Oracle modified the RBO code just
enough to let it recognize the cases it cannot handle and to have it pass those
cases on to the CBO. Thus, even if you think you have set up your system for
rule-based optimization, the following circumstances will absolutely force
cost-based optimization:
-
Bit-mapped indexes on any column of a table referenced in the SQL, even if those indexes are on columns the SQL does not touch.
-
Function-based indexes on a table referenced in the SQL, if such an index is on an expression the SQL references.
-
Partitioned tables touched by the SQL.
-
Tables or indexes configured with parallel degree. The optimizer interprets these as a command to find parallel execution plans, which the RBO does not know how to do. As for bit-mapped indexes, indexes configured with parallel degree will disable use of the RBO on a table referenced by your SQL, even if the parallel-degree index is on columns the SQL does not touch.
If the tables and indexes involved in your SQL do not prevent
using the RBO, Oracle chooses between the RBO and the CBO as follows:
-
If any SELECT keyword in the SQL (even in a subquery or a view definition) is followed by any valid hint other than /*+ RULE */ or /*+ CHOOSE */, Oracle will use the CBO.
-
Otherwise, if any SELECT keyword in the SQL (even in a subquery or a view definition) is followed by /*+ CHOOSE */ and there are any statistics on any table or index referenced by the SQL, Oracle will choose the CBO.
-
Otherwise, if any SELECT keyword in the SQL (even in a subquery or a view definition) is followed by /*+ RULE */, Oracle will choose the RBO.
-
Otherwise, if the optimizer_mode parameter is set at the session level (by ALTER SESSION SET OPTIMIZER_MODE=<Choice>;), Oracle will choose according to that session-level parameter.
-
Otherwise, if the optimizer_mode parameter is set for the database instance, in the init.ora file, Oracle will choose according to that instance-level parameter.
-
Otherwise, Oracle will choose according to the ultimate default optimizer_mode parameter, CHOOSE.
In the last three steps of this decision cascade, Oracle
chooses according to an optimizer_mode parameter, which you or your DBA
sets. These are the four possible parameter values and how they affect the
choice:
- RULE
-
Oracle uses rule-based optimization.
- ALL_ROWS
-
Oracle uses cost-based optimization with the goal of minimizing the cost of the whole query. This default version of cost-based optimization sometimes results in nonrobust plans (plans that use join methods other than nested-loops), with risks described in Chapter 6. However, the optimizer chooses these plans only when it calculates that they are faster than the best robust plans.
- FIRST_ROWS
-
Oracle uses cost-based optimization with the goal of minimizing the cost of reaching the first rows from the query. In practice, this tends to favor robust, nested-loops plans similar to those plans the rule-based optimizer favors but built with much more knowledge of the data distributions and probable execution costs. The FIRST_ROWS optimization level creates the same effect as the OPTIMIZE FOR 1 ROW hint on DB2 and the OPTION(FAST 1) hint on SQL Server.
- CHOOSE
-
Oracle uses cost-based optimization, as for the ALL_ROWS goal, unless no table or index involved in the query has optimization statistics available, in which case Oracle uses rule-based optimization.
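For example, to override the instance-level setting for your own session only, you could issue a statement like this (FIRST_ROWS is just an assumed choice):
ALTER SESSION SET OPTIMIZER_MODE = FIRST_ROWS;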
Here's a quick way to check the instance-level parameter for
optimizer_mode:
SELECT VALUE FROM V$PARAMETER WHERE NAME = 'optimizer_mode';
When you have an execution plan in PLAN_TABLE, a quick
way to see whether it is cost-based is to run the following query:
SELECT POSITION FROM PLAN_TABLE WHERE ID=0;
This returns the cost of the entire execution plan, in
arbitrary units, when the plan is cost-based. When cost is not null, you have a
cost-based plan.
4.2.2 Controlling Oracle Rule-Based Execution Plans
Most of the methods for controlling rule-based execution plans
are the universal techniques of controlling plans, covered in the first section
of this chapter. The primary Oracle-specific method of tuning under a rule-based
default optimizer_mode is simply to switch modes to cost-based
optimization, usually with a hint such as /*+ FIRST_ROWS */.
In other words, you can always control a plan via hints, and hints (with the
exception of the /*+ RULE */ hint) in a statement cause Oracle to use
the CBO for that statement.
However, if you prefer not to use cost-based optimization, thus
precluding the use of hints, one
RBO-specific technique remains: in the FROM clause, list tables and
their aliases in exactly the reverse order from the join order you want. This
usually gives enough control of the join order, without using the techniques
described earlier in Section
4.1.4. In particular, eligible, unique joins toward primary keys happen in
the reverse order they are listed in the FROM clause, without changing
the join conditions. For example, consider this query:
SELECT /*+ RULE */ E.First_Name, E.Last_Name, E.Salary, LE.Description,
       M.First_Name, M.Last_Name, LM.Description
FROM Locations LM, Employees M, Locations LE, Employees E
WHERE E.Last_Name = 'Johnson'
  AND E.Manager_ID=M.Employee_ID
  AND E.Location_ID=LE.Location_ID
  AND M.Location_ID=LM.Location_ID
  AND LE.Description='Dallas';
Unlike the earlier version of this query in Chapter 3, which had
the wrong order in the FROM clause, you now get the correct join order.
In this correct execution plan, E joins to LE before joining
to M or LM, as shown by the plan output:
SQL> @ex

PLAN
----------------------------------------------------------------------
SELECT STATEMENT
  NESTED LOOPS
    NESTED LOOPS
      NESTED LOOPS
        TABLE ACCESS BY INDEX ROWID 4*EMPLOYEES
          INDEX RANGE SCAN EMPLOYEE_LAST_NAME
        TABLE ACCESS BY INDEX ROWID 3*LOCATIONS
          INDEX UNIQUE SCAN LOCATION_PKEY
      TABLE ACCESS BY INDEX ROWID 2*EMPLOYEES
        INDEX UNIQUE SCAN EMPLOYEE_PKEY
    TABLE ACCESS BY INDEX ROWID 1*LOCATIONS
      INDEX UNIQUE SCAN LOCATION_PKEY
When the RBO otherwise has no preference based on the
conditions and indexes, the RBO joins tables by working from right to left in
the FROM clause. However, this method offers only limited control by
itself, because the RBO follows its other rules of thumb before considering the
join order in the FROM clause. For example, the RBO always chooses to
perform unique indexed reads and joins before doing indexed range scans, when it
can.
4.2.3 Controlling Oracle Cost-Based Execution Plans
There are two main steps involved in tuning on Oracle's cost-based optimizer:
-
Providing the optimizer with good statistics about the tables and indexes, so it can calculate the costs of alternatives accurately. This is effectively a prerequisite to any manual tuning on a CBO.
-
Adding hints to queries that the CBO fails to optimize well even with complete statistics about the tables and indexes that the queries reference.
4.2.3.1 Oracle cost-based optimizer prerequisites
Proving that a little
knowledge is a dangerous thing, cost-based optimizers often do a terrible job if
they do not have statistics on all the tables and indexes involved in the query.
It is therefore imperative to maintain statistics on tables and indexes
reliably, including regenerating statistics whenever table volumes change much
or tables or indexes are rebuilt. It is safest to regenerate statistics
periodically, during times that load is relatively quiet, such as nightly or at
least weekly. The best way to generate and update statistics is with Oracle's DBMS_STATS package, documented at
length in Oracle8i Supplied PL/SQL Packages
Reference and Oracle9i Supplied PL/SQL Packages
and Types Reference. Here is a simple example of using
DBMS_STATS to generate statistics for a whole schema,
Appl_Prod, sampling 10% of the data in the larger tables and cascading
statistics collection to the indexes:
BEGIN
  DBMS_STATS.GATHER_SCHEMA_STATS ('Appl_Prod', 10, CASCADE => TRUE);
END;
/
Often, queries include conditions on highly skewed
distributions, such as conditions on special types, codes, or flags, when these
columns have only a few values. Normally, the CBO evaluates selectivity of a
condition based on the assumption that all nonnull values of a column are
equally selective. This assumption generally works well for foreign and primary
keys that join business entities, but it is inaccurate when the columns have
permanent special meanings and certain meanings apply much more rarely than
others.
For example, in an Orders table, you might have a
Status_Code column with three possible values: 'CL' for closed (i.e., fulfilled) orders, 'CA' for
cancelled orders, and 'OP' for open orders. Most orders, by far, would be fulfilled,
once the application has been running for a few months. A steady, significant
fraction of orders would end up cancelled, so that value would also eventually
point to a large list of orders. However, as long as the business keeps up with
incoming orders, the number of open orders would remain moderate and steady,
even as data accumulates for years. Quite early, a condition specifying
Status_Code='OP' would be selective enough to justify indexed access,
if you had an index with that leading column, and it is important to enable the
optimizer to realize this fact, preferably without a lot of manual tuning.
Enabling the CBO to recognize when a column is selective requires two
things:
-
The SQL must mention the specific selective value, rather than use a bind variable, prior to Oracle 9i Database. Use of bind variables is commonly attractive, since it makes SQL more general and easier to share between processes. However, this need to hardcode especially selective values is the exception to that rule. If you use Status_Code=:1 instead of Status_Code='OP', prior to Oracle 9i, you will deny the CBO potential knowledge of the selectivity of the condition at parse time, when it does not yet know whether the bind variable :1 will be assigned a common or a rare Status_Code. Fortunately, in these cases, the usual reason to prefer using bind variables does not generally apply; since these special codes have special business meanings, it is unlikely that the SQL ever requires substituting a different value than the single selective value.
-
You need to provide the CBO with special statistics that quantify how rare the uncommon code, type, or status values are, so it can know which values are highly selective.
Oracle stores special statistics on distribution when you
request them, based on sorting the rows for a column and arranging the sorted
list into a specified number of buckets that each contain the same number of
rows. Since Oracle already knows that the range each bucket holds has the same
number of rows, Oracle needs to know only the value-range endpoints in each
bucket. In the current example, with 20 buckets, the first bucket might hold the
range 'CA' to 'CA', and the second bucket might hold the range
'CA' to 'CL'. The next 17 buckets would hold the most common
range, 'CL' to 'CL'. The last bucket would hold the range
'CL' to 'OP', which includes the rarest value. From this,
Oracle can deduce that the selectivity of the column is 5-10% for the value
'CA', 85-95% for the value 'CL', and 0-5% for the value
'OP'. Since you want the optimizer to know more closely how selective
the 'OP' value is, you would choose more buckets than this, perhaps the
maximum of 254. (Oracle compresses the bucket information when so few values
apply, so the large number of buckets should be inexpensive.) To create 254
buckets for the example case, in the schema owned by Appl_Prod, use
this:
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS ('Appl_Prod', 'Orders',
    METHOD_OPT => 'FOR COLUMNS SIZE 254 Status_Code');
END;
/
Generate the histogram statistics after you generate the
general table statistics, because table-statistics generation deletes earlier
histogram statistics.
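Putting the two steps together in the required order, a periodic statistics job for this schema might look like the following sketch, built from the two calls shown above:
BEGIN
  -- General statistics first; this step deletes any earlier histograms.
  DBMS_STATS.GATHER_SCHEMA_STATS ('Appl_Prod', 10, CASCADE => TRUE);
  -- Then rebuild the histogram on the skewed Status_Code column.
  DBMS_STATS.GATHER_TABLE_STATS ('Appl_Prod', 'Orders',
    METHOD_OPT => 'FOR COLUMNS SIZE 254 Status_Code');
END;
/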
4.2.3.2 General hint syntax
Oracle uses hints for manual control of
cost-based optimization. Syntactically, these hints take the form of comments,
like /*+ <Hint> */, immediately
following the SELECT keyword. Oracle recognizes that this syntax
encloses a hint, not a comment, by the + at the beginning and by the
location of the hint, which must immediately follow SELECT. However,
since these are comments from the point of view of standard SQL syntax, they do
not interfere with parsing the SQL if the SQL is also to be executed on
non-Oracle databases.
Each hint directly affects only the SELECT block that
has the comment. Thus, to control the order of joins and index choices within a
subquery, place the hint after the SELECT keyword that begins the
subquery. But to affect the outer-query order of joins and index choices, place
a hint immediately after the outer-query SELECT.
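For instance, in the following sketch (the table aliases and the index name Order_Detail_Order_ID are assumptions), the outer hint governs the outer query's join order, while the hint inside the subquery governs index choice there:
SELECT /*+ ORDERED */ ...
FROM Orders O, Customers C
WHERE O.Customer_ID = C.Customer_ID
  AND EXISTS (SELECT /*+ INDEX(OD Order_Detail_Order_ID) */ NULL
              FROM Order_Details OD
              WHERE OD.Order_ID = O.Order_ID)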
4.2.3.3 Approaches to tuning with hints
There are two basic extremes involved in tuning with hints:
-
Use as little direction as possible to get the execution plan you want, or at least to get close enough to the plan you want for reasonable performance. This approach reasons that the CBO has more information than you have and should be left free to adapt to changing data distributions and to take advantage of improvements in Oracle with future releases. By leaving the CBO the maximum degrees of freedom, you maximize its power to optimize well for you in the future. However, until you try, you won't know how much direction the CBO will need if it did not get the plan right in the first place, so this approach is likely to be iterative, involving the addition of one hint at a time until the CBO delivers a good plan.
-
If you did not get the plan you wanted from the CBO automatically, assume the CBO has made bad assumptions that will propagate to distort all of its calculations. Therefore, leave it with little freedom, specifying essentially the whole plan you want.
If you are confident in your chosen execution plan, as you
should be if you apply the methods I describe later in this book, there is
little reason to hold back from fully specifying that plan. I have yet to find a
case where a well-chosen, robust execution plan needed to evolve to handle new
data distributions or new database features. On the other hand, it is easy for
SQL with a partially restricting set of hints to go wrong, especially if some
table or index loses its statistics. When the CBO chooses incorrectly, the error
that made the CBO choose incorrectly will likely propagate over the entire plan.
For example, consider this query:
SELECT E.First_Name, E.Last_Name, E.Salary, LE.Description, M.First_Name, M.Last_Name, LM.Description FROM Locations LM, Employees M, Locations LE, Employees E WHERE E.Hire_Date > :1 AND E.Manager_ID=M.Employee_ID AND E.Location_ID=LE.Location_ID AND M.Location_ID=LM.Location_ID
At parse time, when the optimizer does its work, it cannot know
that the bind variable :1 will likely be set to a value in the current
week, so it makes a conservative assumption about the selectivity of that
condition on Hire_Date. Having made that assumption, it might not only
forego using an index on Hire_Date (depending on the data
distribution), but it might also further calculate that it will hit most of the
rows of all the joined tables as well, and the CBO might choose full table scans
with hash joins on them. Even if you instruct the CBO to use the index on
Hire_Date, it still retains its initial assumption that the driving
condition is unselective, and will likely retain its poor choices for the other
joins and table-access methods. This is really no flaw in the optimizer; it
cannot know what the application developer knows about the likely values to be
assigned to the bind variable. However, the consequence is that, if you need to
be any more specific than just specifying ALL_ROWS or FIRST_ROWS,
chances are relatively high that the optimizer will need help across the board,
to correct for some incorrect assumption somewhere.
4.2.3.4 Table-access hints
These are the main hints to control table-access methods:
- INDEX(<Alias> <Index_Name>)
-
This directs Oracle, when possible, to access the alias <Alias>
using the index named <Index_Name>. Repeat this hint for each index/alias combination you need to control.
- FULL(<Alias>)
-
This directs Oracle, when possible, to access the alias <Alias>
using a full table scan. Repeat this hint for each full table scan you require.
- INDEX_DESC(<Alias> <Index_Name>)
-
This directs Oracle, when possible, to access the alias <Alias>
using the index named <Index_Name>, reaching the rows in descending order (the reverse of the normal index-sorted order). Repeat this hint for each index/alias combination you need to control, although it is unlikely you will need it more than once in a query.
The INDEX and FULL hints are common and easy
to use. The INDEX_DESC hint is useful only rarely, but it is
occasionally vital to use. For example, if you want to know all about the last
employee hired in April, you might use this query:
SELECT *
FROM Employees E
WHERE Hire_Date>=TO_DATE('2003-04-01','YYYY-MM-DD')
  AND Hire_Date< TO_DATE('2003-05-01','YYYY-MM-DD')
ORDER BY Hire_Date DESC
You'll find the most recently hired employee you want at the
top of the list of rows returned by this query. To avoid reading all the data
for other employees hired in April, you might think to add a condition AND
ROWNUM=1 to the query. However, this sometimes will not yield the desired
result, because (depending on the data) Oracle will sometimes apply that
condition before performing the descending sort. If Oracle uses a full table
scan, it will return the first employee hired in April it finds in the table,
likely the least recently hired. If it uses a simple index range scan on an
index on Hire_Date, it will begin, as range scans generally do by
default, at the low end of the index range, returning the first employee hired
in April. However, the INDEX_DESC hint, with the index
Employee_Hire_Date on the Hire_Date column, neatly solves the
problem, returning the desired row with just a single logical I/O to the
table:
SELECT /*+ INDEX_DESC(E Employee_Hire_Date) */ *
FROM Employees E
WHERE Hire_Date>=TO_DATE('2003-04-01','YYYY-MM-DD')
  AND Hire_Date< TO_DATE('2003-05-01','YYYY-MM-DD')
  AND ROWNUM=1
Note that I removed the explicit ORDER BY clause,
since it gives the false impression that it has effect, given the condition on
ROWNUM.
There are several other table-access hints that I have not
described in this section, but I have never found them necessary.
4.2.3.5 Execution-order hints
These are the main hints to control the order of execution for
joins and subqueries:
- ORDERED
-
This directs Oracle, when possible, to join the tables in the FROM clause in the same order that they are listed.
- LEADING(<Alias>)
-
In the absence of an ORDERED hint, this selects the driving table, the first table in the join order. Although this gives less control over the join order than the ORDERED hint, it does not require modifying the FROM clause. Often, getting just the driving table correct is all you need to get at least close to the performance of the optimal plan. Later choices in the join order tend to matter less and will likely be well chosen by the optimizer, without your help.
- PUSH_SUBQ
-
This hint instructs the optimizer to perform correlated subqueries at the first opportunity, as soon as the outer query reaches the join columns needed to evaluate them. Oracle's CBO normally performs correlated subqueries only after completing all the joins in the outer query.
The ORDERED and LEADING hints are common and
straightforward to use. The PUSH_SUBQ hint is occasionally useful.
When it comes to subqueries, Oracle offers hint-based control
only at the two extremes: executing subqueries as early or as late as possible.
However, you can gain full control of when subqueries execute if you combine the
PUSH_SUBQ hint with the earlier methods of postponing correlated joins.
For example, consider the earlier query:
SELECT ...
FROM Orders O, Customers C, Regions R
WHERE O.Status_Code='OP'
  AND O.Customer_ID=C.Customer_ID
  AND C.Customer_Type_Code='GOV'
  AND C.Region_ID=R.Region_ID
  AND EXISTS (SELECT NULL FROM Order_Details OD
              WHERE O.Order_ID+0*C.Customer_ID=OD.Order_ID
                AND OD.Shipped_Flag='Y')
Without a hint, Oracle would execute the EXISTS check
after joining all three outer-query tables. The point of the expression
O.Order_ID+0*C.Customer_ID was to delay the EXISTS check until
after the join to C, but not after the join to R. However,
without any hint, all EXISTS conditions are automatically delayed until
after all outer-query joins. To force the EXISTS condition to execute
between the joins to C and R, use both the hint and the
correlating-join-postponing expression:
SELECT /*+ PUSH_SUBQ */ ...
FROM Orders O, Customers C, Regions R
WHERE O.Status_Code='OP'
  AND O.Customer_ID=C.Customer_ID
  AND C.Customer_Type_Code='GOV'
  AND C.Region_ID=R.Region_ID
  AND EXISTS (SELECT NULL FROM Order_Details OD
              WHERE O.Order_ID+0*C.Customer_ID=OD.Order_ID
                AND OD.Shipped_Flag='Y')
Now, the PUSH_SUBQ hint causes Oracle to execute the
EXISTS condition as early as possible, and the expression
O.Order_ID+0*C.Customer_ID ensures that "as early as possible" doesn't
come until after the join to C.
4.2.3.6 Join-method hints
These are the main hints to control the join methods:
- USE_NL(<Alias_List>)
-
This directs Oracle, when possible, to join the tables indicated in the alias list by using nested loops. The alias list is without commas—for example, USE_NL(T1 T2 T3).
- USE_HASH(<Alias_List>)
-
This directs Oracle, when possible, to join to the tables indicated in the alias list by using hash joins. The alias list is without commas—for example, USE_HASH(T1 T2 T3).
4.2.3.7 Example
Here's an example to illustrate the most frequently useful
hints to yield complete control of an execution plan. I'll force the join order,
the access method to every table, and the join method to every table. Consider
the earlier example tuned for the RBO, shown at the end of Section 4.2.2. To fully force the same
plan, but substitute a hash join for the first nested-loops join, with the
employee locations read through the index on Description, use this query:
SELECT /*+ ORDERED USE_NL(M LM) USE_HASH(LE)
           INDEX(E Employee_Last_Name)
           INDEX(LE Location_Description)
           INDEX(M Employee_Pkey)
           INDEX(LM Location_Pkey) */
       E.First_Name, E.Last_Name, E.Salary, LE.Description,
       M.First_Name, M.Last_Name, LM.Description
FROM Employees E, Locations LE, Employees M, Locations LM
WHERE E.Last_Name = 'Johnson'
  AND E.Manager_ID=M.Employee_ID
  AND E.Location_ID=LE.Location_ID
  AND M.Location_ID=LM.Location_ID
  AND LE.Description='Dallas'
This results in the execution plan, as shown here:
SQL> @ex

PLAN
----------------------------------------------------------------------
SELECT STATEMENT
  NESTED LOOPS
    NESTED LOOPS
      HASH JOIN
        TABLE ACCESS BY INDEX ROWID 1*EMPLOYEES
          INDEX RANGE SCAN EMPLOYEE_LAST_NAME
        TABLE ACCESS BY INDEX ROWID 2*LOCATIONS
          INDEX RANGE SCAN LOCATION_DESCRIPTION
      TABLE ACCESS BY INDEX ROWID 3*EMPLOYEES
        INDEX UNIQUE SCAN EMPLOYEE_PKEY
    TABLE ACCESS BY INDEX ROWID 4*LOCATIONS
      INDEX UNIQUE SCAN LOCATION_PKEY
4.3 Controlling Plans on DB2
DB2 offers relatively sparse vendor-specific tools to control execution plans, so the methods used to tune on DB2 are comparatively indirect. There are three main steps involved in tuning on DB2:
-
Provide the optimizer with good statistics about the tables and indexes, so it can calculate the costs of alternatives accurately.
-
Choose the optimization level that DB2 applies to your query.
-
Modify the query to prevent execution plans that you do not want, mainly using the methods described earlier in Section 4.1.
4.3.1 DB2 Optimization Prerequisites
Proving that a little knowledge is a dangerous thing, cost-based optimizers often do a terrible job if they do not have statistics on all the tables and indexes involved in a query. It is therefore imperative to maintain statistics on tables and indexes reliably; this includes regenerating statistics anytime table volumes change much or anytime tables or indexes are rebuilt. It is safest to regenerate statistics periodically, during times when load is relatively quiet, nightly or at least weekly.
Edit a file runstats_schema.sql from the Unix prompt and type the following commands, replacing <Schema_Name> with the name of the schema that contains the objects you wish to run statistics on:
-- File called runstats_schema.sql
SELECT 'RUNSTATS ON TABLE <Schema_Name>.' || TABNAME || ' AND INDEXES ALL;'
FROM SYSCAT.TABLES
WHERE TABSCHEMA = '<Schema_Name>';
To use this script, log into db2, escape to the shell prompt with quit;, and run the following two commands from the Unix shell:
db2 +p -t < runstats_schema.sql > tmp_runstats.sql
grep RUNSTATS tmp_runstats.sql | db2 +p -t > tmp_anal.out
These commands can be scheduled to run automatically. Check tmp_anal.out in case any of the analyses fail.
Often, queries include conditions on highly skewed distributions, such as conditions on special types, codes, or flags, when these columns have only a few values. Normally, the CBO evaluates selectivity of a condition based on the assumption that all nonnull values of a column are equally selective. This assumption generally works well for foreign and primary keys that join business entities, but it breaks down when the columns have permanent special meanings and certain meanings apply much more rarely than others.
For example, in an Orders table, you might have a Status_Code column with three possible values: 'CL' for closed (i.e., fulfilled) orders, 'CA' for cancelled orders, and 'OP' for open orders. Most orders, by far, would be fulfilled; so, once the application has been running for a few months, you'd expect 'CL' to point to a large and steadily increasing number of orders. A steady, significant fraction of orders would end up cancelled, so 'CA' would also eventually point to a large list of orders. However, as long as the business keeps up with incoming orders, the number of open orders would remain moderate and steady, even as data accumulates for years. Quite early, a condition that specified Status_Code='OP' would be selective enough to justify indexed access, if you had an index with that leading column, and it is important to enable the optimizer to realize this fact, preferably without a lot of manual tuning. This requires two things:
-
The SQL must mention the specific selective value, rather than use a bind variable. Use of bind variables is commonly attractive, since it makes SQL more general and easier to share between processes. However, this need to hardcode especially selective values is the exception to that rule. If you use Status_Code=? instead of Status_Code='OP', you deny the CBO potential knowledge of the selectivity of the condition at parse time, when it does not yet know whether the bind variable ? will be assigned a common or a rare Status_Code (see the sketch after this list). Fortunately, in these cases, the usual reason to prefer bind variables does not generally apply; since these special codes have special business meanings, it is unlikely that the SQL will ever require substituting a different value than the single selective value.
-
You need to provide the CBO with special statistics that quantify how rare the uncommon code, type, or status values are, so it can know which values are highly selective.
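To make the contrast concrete, here is a minimal sketch against the Orders example above (Order_ID is a column name invented for this illustration):

-- Selectivity visible at parse time: with distribution statistics,
-- the CBO can see that 'OP' is a rare, selective value.
SELECT Order_ID FROM Orders WHERE Status_Code = 'OP';

-- Selectivity hidden: at parse time, the CBO cannot know whether
-- the bind variable will hold a common or a rare Status_Code.
SELECT Order_ID FROM Orders WHERE Status_Code = ?;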
DB2 stores special statistics on distribution, when you request
them. To create distribution statistics for the example case, given an index
named Order_Stts_Code and the schema owned by ApplProd, use
the following command:
RUNSTATS ON TABLE ApplProd.Orders WITH DISTRIBUTION FOR INDEX ApplProd.Order_Stts_Code;
Anytime you have a column with a skewed distribution and an
index that you wish to use when your condition on the column has a high degree
of selectivity, be sure to create distribution statistics in the manner shown
here.
4.3.2 Choosing the Optimization Level
DB2 offers multiple optimization levels. An
optimization level is basically a ceiling on how
clever the optimizer attempts to be when it considers the range of possible
execution plans. At optimization level 0, DB2 chooses the lowest cost
plan within a subset of the plans it considers at level 1; at level
1, it considers just a subset of the plans it considers at level
2; and so on. Nominally, the highest optimization level should always
yield the best plan, because it chooses the lowest cost plan from the widest
possible range of alternatives. However, the plans enabled by the higher
optimization levels tend to be less robust and often prove disappointing. In
spite of the optimizer's calculations to the contrary, these less robust plans
often run longer than the best robust plan that the lower-level optimization
sees. Higher levels of optimization can also take longer to parse, since the
optimizer has additional degrees of freedom to explore. Ideally, you parse every
statement at the lowest level that is capable of finding the best execution plan
for a given query.
DB2 offers seven levels of optimization: 0,
1, 2, 3, 5, 7, and 9.[1]
Level 5 is normally the default, although the database administrator can override this default choice. I have never needed levels of optimization higher
than 5; levels 7 and 9 appear mainly to enable
relatively exotic query transformations that are rarely useful. However, I have
frequently found excellent results with the lowest level of optimization, level
0, when level 5 produced a poor plan. Before executing a query
(or checking an execution plan), set level 0 with the following SQL
statement:
SET CURRENT QUERY OPTIMIZATION 0;

[1] Levels 4, 6, and 8 are not available, presumably for historical reasons, although I have never found these reasons documented.
When you wish to return to level 5 for other queries
that require it, use the same syntax, replacing 0 with 5. If
you find a poor plan at level 5, I recommend trying level 0
after first verifying correct statistics on the tables and indexes involved.
Level 0 frequently yields just the sort of robust plans that usually
work best for real-world applications.
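As a minimal sketch of that workflow (the middle statement is a placeholder for your own problem query):

SET CURRENT QUERY OPTIMIZATION 0;
-- Parse and run the problem query at level 0...
SELECT ... ;
-- ...then restore the default level for subsequent queries.
SET CURRENT QUERY OPTIMIZATION 5;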
4.3.3 Modifying the Query
Most manual tuning on DB2 uses the SQL
changes described earlier in Section
4.1. However, one particular manual technique deserves special mention,
because it proves useful more often on DB2 than on Oracle and SQL Server. DB2 stores index records even for null
values of indexed columns, and it appears to treat null like just another
indexed value.
When DB2 lacks special statistics on distribution (see Section 4.3.1), DB2 estimates
the selectivity of Indexed_Column IS NULL to be just as high as
Indexed_Column = 198487573 or any other nonnull value. Therefore, older
DB2 versions often choose to drive to selective-looking IS NULL
conditions on indexed columns. Occasionally, this works out fine. However, in my
experience, IS NULL conditions are rarely anywhere near as selective as
the average individual nonnull value, and indexed access driven by IS
NULL conditions is almost always a mistake.
Therefore, when you find an IS NULL condition on an
indexed column in a DB2 query, you often should prevent use of the index. The
simplest equivalent condition that prevents index use is
COALESCE(Indexed_Column, Indexed_Column) IS NULL. This version is
perfectly equivalent to the original condition Indexed_Column IS NULL,
but the COALESCE( ) function prevents index use.
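For example, with a hypothetical indexed column Completion_Date on an Orders table (names invented for this sketch):

-- Before: DB2 may drive to the index on Completion_Date, treating
-- IS NULL as though it were as selective as a single nonnull value.
SELECT Order_ID FROM Orders WHERE Completion_Date IS NULL;

-- After: logically identical, but COALESCE() disables the index.
SELECT Order_ID FROM Orders
WHERE COALESCE(Completion_Date, Completion_Date) IS NULL;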
In addition to tuning techniques that can apply to any
database, there are three useful techniques specific to DB2 that I describe in
the following sections.
4.3.3.1 Place inner joins first in your FROM clause
One sometimes useful technique is simply to
list inner joins first in your FROM clause. This appears never to hurt, and on
older versions of DB2 I have seen this simple technique produce greatly improved
execution plans.
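For instance, here is a sketch with hypothetical Orders, Customers, and Order_Notes tables, listing the inner join ahead of the outer join:

SELECT O.Order_ID, C.Customer_Name, N.Note_Text
FROM Orders O
     INNER JOIN Customers C ON O.Customer_ID = C.Customer_ID    -- inner join first
     LEFT OUTER JOIN Order_Notes N ON O.Order_ID = N.Order_ID   -- outer join last
WHERE O.Status_Code = 'OP';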
4.3.3.2 Prevent too many outer joins from parsing at once
Older versions of DB2
can take minutes to parse queries with more than about 12 outer joins, and even
then they might fail with errors. Fortunately, there is a workaround for this
problem, using the following template for the SQL. The workaround uses DB2's
nested-tables syntax, in which an outer query contains another query inside a
FROM clause that is treated like a single table for purposes of the
outer query:
SELECT ...
FROM (SELECT ...
      FROM (SELECT ...
            FROM <the first ten outer joins>
            WHERE <conditions pertinent to this innermost nested table>) T1
           LEFT OUTER JOIN <the 11th through 20th outer join>
      WHERE <conditions pertinent to this outermost nested table>) T2
     LEFT OUTER JOIN <the 21st through 30th outer join>
WHERE <conditions pertinent to the outer query>
This template applies to a query with 21-30 outer-joined
tables. With 11-20 outer-joined tables, you need only a single nested table.
With more than 30 outer-joined tables, you need even deeper levels of nesting.
In this syntax, DB2 effectively creates nested views on the fly, as defined by
the queries inside parentheses in the FROM clauses.
For purposes of handling outer joins, DB2 handles each of these smaller queries
independently, sidestepping the problem of too many outer joins in a single
query.
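A minimal one-level sketch of this pattern, using the hypothetical tables from earlier examples (a real case would pack ten or so outer joins into the nested table T1):

SELECT T1.Order_ID, T1.Customer_Name, N.Note_Text
FROM (SELECT O.Order_ID, C.Customer_Name
      FROM Orders O
           LEFT OUTER JOIN Customers C ON O.Customer_ID = C.Customer_ID
      WHERE O.Status_Code = 'OP') T1                -- parsed as its own unit
     LEFT OUTER JOIN Order_Notes N ON T1.Order_ID = N.Order_ID;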
4.3.3.3 Let DB2 know when to optimize the cost of reading just the first few rows
Normally, DB2 calculates
the cost of executing the entire query and chooses the plan it expects will run
the fastest end to end. However, especially for online queries, you often care
only about the first few rows and prefer to optimize to get the first rows
soonest.
The technique to read the first rows fast, usually following
nested loops, is to add the clause OPTIMIZE FOR <n>
ROWS (or OPTIMIZE FOR 1 ROW), where
<n> is the number of rows you actually need to see fast
out of the larger rowset that the query might theoretically return. This clause
goes at the very end of the query and instructs DB2 to optimize the cost of
returning just those first <n> rows, without regard to the
cost of the rest of the query execution. If you actually know how many rows you
want and trust the optimizer to calculate the best plan, you can choose
<n> on that basis. If you want to force a robust,
nested-loops plan as strongly as possible, just use OPTIMIZE FOR 1
ROW.
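For instance, on the hypothetical Orders example (column names invented for this sketch):

-- Ask DB2 to minimize the time to the first 20 rows, without
-- regard to the cost of returning the rest of the rowset.
SELECT O.Order_ID, O.Customer_ID
FROM Orders O
WHERE O.Status_Code = 'OP'
OPTIMIZE FOR 20 ROWS;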
In practice, this technique tends to dictate nested-loops
joins, because they avoid reading whole rowsets before even beginning a join.
However, it is possible for an explicit ORDER BY clause to defeat any
attempt to reach the first rows fast. The ORDER BY clause
usually requires a sort following the complete query, usually postponing return
of the first row regardless of the execution plan. You can leave out a sort
condition if you want to force nested-loops joins by this technique, performing
the sort in your application if necessary. The OPTIMIZE FOR 1 ROW hint
is the equivalent of the FIRST_ROWS hint on Oracle and the
OPTION(FAST 1) hint on SQL Server.
Techniques to force precisely chosen execution plans on DB2 are
sparse, in contrast to the extraordinary detail that DB2 reveals about the
execution plan you already have and why DB2 chose it. However, in fairness, I
should mention that the available techniques, in combination with DB2's fairly
good optimizer, have proven sufficient in my own experience.
4.4 Controlling Plans on SQL Server
There are three main steps involved in tuning on SQL Server:

1. Provide the optimizer with good statistics about the tables and indexes, so it can calculate the costs of alternatives accurately.

2. Modify the query to prevent execution plans that you do not want, mainly using methods specific to SQL Server.

3. Force a simple execution plan with FORCEPLAN when necessary.

4.4.1 SQL Server Optimization Prerequisites
Proving that a little knowledge is a dangerous thing, cost-based optimizers often do a terrible job if they do not have statistics on all the tables and indexes involved in the query. It is therefore imperative to maintain statistics on tables and indexes reliably; this includes regenerating statistics anytime table volumes change much or anytime tables or indexes are rebuilt. It is safest to regenerate statistics periodically, during times when load is relatively quiet, nightly or at least weekly. Run the following from Query Analyzer, then cut and paste the resulting UPDATE STATISTICS commands into the query window and run them as well:

-- file called updateall.sql
-- update your whole database
SELECT 'UPDATE STATISTICS ', name
FROM sysobjects
WHERE type = 'U'

Often, queries include conditions on highly skewed distributions, such as conditions on special types, codes, or flags, when these columns have only a few values. SQL Server automatically maintains statistics on the distribution of indexed column values, enabling it to estimate selectivities automatically, even when indexed columns have skewed distributions.

Occasionally, it is useful to help SQL Server estimate the selectivity of a condition with a skewed distribution even when the distribution applies to a nonindexed column. In such a case, you need to specially request data on that column. For example, to request a statistics group named EFlag on the nonindexed column Exempt_Flag of the Employees table, run:

CREATE STATISTICS EFlag ON Employees(Exempt_Flag)

As an example of a case in which such skewed distributions might apply, consider an Orders table in which you have a Status_Code column with three possible values: 'CL' for closed (i.e., fulfilled) orders, 'CA' for cancelled orders, and 'OP' for open orders. Most orders, by far, would be fulfilled once the application has been running for a few months, causing a steady rise in 'CL' values. A steady, significant fraction of orders would end up cancelled, so the value 'CA' would also eventually point to a large list of orders. However, as long as the business keeps up with incoming orders, the number of open orders would remain moderate and steady, even as data accumulates for years. Quite early, a condition that specified Status_Code='OP' would be selective enough to prefer joining to the corresponding table early, even if Status_Code is not indexed, and it is important to enable the optimizer to realize this fact, preferably without a lot of manual tuning. This requires that your SQL actually mention the specific value that applies to the condition, rather than use a generic stored procedure that fills in the value of the constant only after the parse, at execution time.

4.4.2 Modifying the Query
You should usually tune SQL Server with hints. Hints generally go either in the FROM clause, when they apply to a specific table access, or in the SQL Server OPTION( ) clause at the very end of a query. These are the most useful hints:
- WITH (INDEX(<Index_Name>))

Immediately following a table alias in a FROM clause, this hint instructs SQL Server to use the specified index to access that table alias. The older alternative syntax INDEX=<Index_Name> is also supported, but it might be dropped in the future, so I don't recommend it. Even more obsolete and dangerous is the still-supported method of naming the internal object ID that corresponds to the desired index. Naming the index you want by ID is horribly unreliable, because the index will get a new ID if anyone ever drops and recreates it, or if the application moves to a new SQL Server database.
- WITH (INDEX(0))
Immediately following a table alias in a FROM clause, this hint instructs SQL Server to use a full table scan to reach that table alias.
- WITH (NOLOCK)
Immediately following a table alias in a FROM clause, this hint instructs SQL Server to read the specified table alias without requiring read locks or otherwise enforcing a consistent read. Read locks on SQL Server can create a bottleneck when combined with heavy update activity on a table. This hint avoids such a bottleneck, potentially at the cost of a consistent view of the data as of a single moment in time.
- LOOP and HASH
Either of these hints can immediately precede the JOIN keyword in the FROM clause, instructing SQL Server to perform that join with the specified method. These hints require the new-style join syntax, with the JOIN keyword in the FROM clause. The presence of even a single hint of this type also forces all joins to take place in the same order the aliases are listed in the FROM clause.
- OPTION(LOOP JOIN)
This hint goes at the end of a query and forces all joins to follow nested loops.
- OPTION(FORCE ORDER)
This hint goes at the end of a query and forces all joins to take place in the same order the aliases are listed in the FROM clause.
- OPTION(FAST 1)
This hint simply instructs SQL Server to attempt to reach the first returned rows as quickly as possible, which generally favors a nested-loops execution plan. Its effect is much like the OPTION(LOOP JOIN) hint, although in theory SQL Server might recognize that no execution plan could reach the first rows quickly in a query with an explicit ORDER BY, nullifying any effect from OPTION(FAST 1). The OPTION(FAST 1) hint is the equivalent of the FIRST_ROWS hint on Oracle and the OPTIMIZE FOR 1 ROW hint on DB2.
These hints can be combined. You can place multiple hints
within a single WITH clause, separating them with commas—for example,
WITH (INDEX(Employee_First_Name), NOLOCK). Multiple hints in a
single OPTION clause also are separated by commas—for example,
OPTION(LOOP JOIN, FORCE ORDER). Together, these hints give full control
of the join order, the join methods, and the table access methods.
4.4.3 Hint Examples
I'll demonstrate tuning with hints on a couple of queries. If
you choose a robust all-nested-loops plan that drives from the employee last
name to the other tables in optimum order, using the primary keys to reach the
other tables, this query's hints force the desired plan:
SELECT E.First_Name, E.Last_Name, E.Salary, LE.Description,
       M.First_Name, M.Last_Name, LM.Description
FROM Employees E WITH (INDEX(Employee_Last_Name))
     INNER JOIN Locations LE WITH (INDEX(Location_PKey))
     ON E.Location_ID=LE.Location_ID
     INNER JOIN Employees M WITH (INDEX(Employee_PKey))
     ON E.Manager_ID=M.Employee_ID
     INNER JOIN Locations LM WITH (INDEX(Location_PKey))
     ON M.Location_ID=LM.Location_ID
WHERE E.Last_Name = 'Johnson'
  AND LE.Description='Dallas'
OPTION(LOOP JOIN, FORCE ORDER)
SET SHOWPLAN_TEXT ON (as described in Chapter 3) generates
the following results when you run this query from SQL Server Query
Analyzer:
StmtText
-----------------------------------------------------------------
|--Bookmark Lookup(...(...[Locations] AS [LM]))
   |--Nested Loops(Inner Join)
      |--Bookmark Lookup(...(...[Employees] AS [M]))
      |  |--Nested Loops(Inner Join)
      |     |--Filter(WHERE:([LE].[Description]='Dallas'))
      |     |  |--Bookmark Lookup(...(...[Locations] AS [LE]))
      |     |     |--Nested Loops(Inner Join)
      |     |        |--Bookmark Lookup(...(...[Employees] AS [E]))
      |     |        |  |--Index Seek(...(...[Employees].[Employee_Last_Name] AS [E]), SEEK:([E].[Last_Name]='Johnson') ORDERED)
      |     |        |--Index Seek(...(...[Locations].[Location_PKey] AS [LE]), SEEK:([LE].[Location_ID]=[E].[Location_ID]) ORDERED)
      |     |--Index Seek(...(...[Employees].[Employee_PKey] AS [M]), SEEK:([M].[Employee_ID]=[E].[Manager_ID]) ORDERED)
      |--Index Seek(...(...[Locations].[Location_PKey] AS [LM]), SEEK:([LM].[Location_ID]=[M].[Location_ID]) ORDERED)

(12 row(s) affected)
If you don't want to specify all nested loops, you might need
the join HASH and LOOP hints, as shown in the following
alternative to the last query:
SELECT E.First_Name, E.Last_Name, E.Salary, LE.Description,
       M.First_Name, M.Last_Name, LM.Description
FROM Employees E WITH (INDEX(Employee_Last_Name))
     INNER HASH JOIN Locations LE WITH (INDEX(Location_Description))
     ON E.Location_ID=LE.Location_ID
     INNER LOOP JOIN Employees M WITH (INDEX(Employee_PKey))
     ON E.Manager_ID=M.Employee_ID
     INNER LOOP JOIN Locations LM WITH (INDEX(Location_PKey))
     ON M.Location_ID=LM.Location_ID
WHERE E.Last_Name = 'Johnson'
  AND LE.Description='Dallas'
The preceding query delivers the following execution plan,
triggered by SET SHOWPLAN_TEXT ON:
StmtText
-------------------------------------------------------------
|--Bookmark Lookup(...(...[Locations] AS [LM]))
   |--Nested Loops(Inner Join)
      |--Bookmark Lookup(...(...[Employees] AS [M]))
      |  |--Nested Loops(Inner Join)
      |     |--Hash Match(Inner Join...([E].[Location_ID])=([LE].[Location_ID])...)
      |     |  |--Bookmark Lookup(...(...[Employees] AS [E]))
      |     |  |  |--Index Seek(...(...[Employees].[Employee_Last_Name] AS [E]), SEEK:([E].[Last_Name]='Johnson') ORDERED)
      |     |  |--Bookmark Lookup(...(...[Locations] AS [LE]))
      |     |     |--Index Seek(...(...[Locations].[Location_Description] AS [LE]), SEEK:([LE].[Description]='Dallas') ORDERED)
      |     |--Index Seek(...(...[Employees].[Employee_PKey] AS [M]), SEEK:([M].[Employee_ID]=[E].[Manager_ID]) ORDERED)
      |--Index Seek(...(...[Locations].[Location_PKey] AS [LM]), SEEK:([LM].[Location_ID]=[M].[Location_ID]) ORDERED)

(11 row(s) affected)
There are two basic extremes involved in tuning with hints such
as those in this example:
-
Use as little direction as possible to get the execution plan you want, or at least to get close enough to the plan you want for reasonable performance. This approach reasons that SQL Server has more information than you have, and it should be left free to adapt to changing data distributions and to take advantage of improvements in SQL Server in future releases. By leaving SQL Server the maximum degrees of freedom, you maximize its power to optimize well for you in the future. However, until you try, you won't know how much direction SQL Server needs before it delivers the plan you want, so this approach is likely to be iterative, adding one hint at a time until SQL Server delivers a good plan.
-
If you did not get the plan you wanted from SQL Server automatically, assume the database has made bad assumptions that will propagate to distort all of its calculations. Therefore, leave it with very little freedom, specifying essentially the whole plan you want.
If you are confident in your chosen execution plan, as you
should be if you apply the methods I describe later in this book, there is
little reason to hold back from fully specifying that plan. I have yet to find a
case in which a well-chosen, robust execution plan needed to evolve to handle
new data distributions or new database features. On the other hand, it is easy
for SQL with a partially restricting set of hints to go wrong, especially if
some table or index loses its statistics. When SQL Server chooses incorrectly,
the mistaken estimate behind that choice is likely to propagate over
the entire plan. However, the OPTION(FAST 1) hint is the sort of
instruction that can be useful even when SQL Server has perfect information,
simply specifying that the time to reach the first row is more important than
the time to reach the last row.
4.4.4 Using FORCEPLAN
An
older method to tune on both Microsoft SQL Server and Sybase is the
FORCEPLAN option. You execute this option with a standalone SQL
statement:
SET FORCEPLAN ON
This option affects all SQL for that connection until you
execute this statement:
SET FORCEPLAN OFF
When FORCEPLAN is ON, the database is
instructed to perform only the simplest optimization on the SQL it sees. It
generally uses nested-loops execution plans that drive through indexes and join
tables in the same order you list them in the FROM clause. When this is
the sort of plan you want, SET FORCEPLAN can be ideal, not only forcing
the plan simply, but even saving parse time that would otherwise be wasted
considering a much wider range of plans, especially for joins of many tables. It
is a blunt-edged sword, so to speak, so use it only when you know the
FROM-clause join order is the correct join order and you want nested
loops.
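As a minimal sketch, reusing the hypothetical Employees and Locations tables from the hint examples, with the intended driving table listed first:

SET FORCEPLAN ON
-- Joins follow FROM-clause order by nested loops, driving
-- through whatever indexes are available.
SELECT E.First_Name, E.Last_Name, LE.Description
FROM Employees E
     INNER JOIN Locations LE ON E.Location_ID = LE.Location_ID
WHERE E.Last_Name = 'Johnson'
SET FORCEPLAN OFF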