5.1 Why a New Method?
Since I am asking for
your patience, I begin with a discussion of why this tool is needed. Why not use
a tool you already know, like SQL, for solving performance problems? The biggest
problem with using SQL for tuning is that it presents both too much and not
enough information to solve the tuning problem. SQL exists to describe,
functionally, which columns and rows an application needs from which tables,
matched on which join conditions, returned in which order. However, most of this
information is wholly irrelevant to tuning a query. On the other hand,
information that is relevant, essential even, to tuning a query—information
about the data distributions in the database—is wholly missing from SQL. SQL is
much like the old word problems so notorious in grade-school math, except that
SQL is more likely to be missing vital information. Which would you find easier
to solve—this:
While camping, Johnny cooked eight flapjacks, three sausages, one strip of bacon, and two eggs for himself and each of his friends, Jim, Mary, and Sue. The girls each gave one-third of their sausages, 25% of their flapjacks, and half their eggs to the boys. Jim dropped a flapjack and two sausages, and they were stolen by a raccoon. Johnny is allergic to maple syrup, and Mary had strawberries on half her flapjacks, but otherwise everyone used maple syrup on their flapjacks. How many flapjacks did the kids eat with maple syrup?
or this:
(8+(0.25 x 8)-1)+(0.75 x 8/2)+(0.75 x 8)=?
The query diagram is the bare-bones synthesis of the tuning
essentials of the SQL word problem and the key distribution data necessary to
find the optimum execution plan. With the bare-bones synthesis, you lose
distracting, irrelevant detail and gain focus on the core of the problem. The
result is a far more compact language to use for both real-world problems and
exercises. Problems that would take pages of SQL to describe (and, in the case
of exercises, days to invent, for realistic problems that did not illegally
expose proprietary code) distill to a simple, abstract, half-page diagram. Your
learning rate accelerates enormously with this tool, partly because the
similarities between tuning problems with functionally different queries become
obvious; you recognize patterns and similarities that you would never notice at
the SQL level and reuse your solutions with little effort.
No tool that I know of creates anything like the query diagram
for you, just as no tool turns math word problems into simple arithmetic.
Therefore, your first step in tuning SQL will be to translate the SQL problem
into a query diagram problem. Just as translating word problems into arithmetic
is usually the hardest step, you will likely find translating SQL tuning
problems into query diagrams the hardest (or at least the most time-consuming)
step in SQL tuning, especially at first. However, it is reassuring to consider
that, although human languages grew up haphazardly to foster communication
between complex human minds, SQL was designed with much more structure to
communicate with computers. SQL tuning word problems occupy a much more
restricted domain than natural-language word problems. With practice,
translating SQL to its query diagram becomes fast and easy, even something you
can do quickly in your head. Once you have the query diagram and even a
novice-level understanding of the query-diagramming method, you will usually
find the tuning problem trivial.
As an entirely unplanned bonus, query diagrams turn out to be a
valuable aid in finding whole classes of subtle application-logic problems that
are hard to uncover in testing because they affect mostly rare corner cases. In
Chapter
7, I discuss in detail how to use these diagrams to help find and fix such
application-logic problems.
In the following sections, I describe two styles of query
diagrams: full and simplified. Full diagrams
include all the data that is ever likely to be relevant to a tuning problem. Simplified diagrams are more qualitative and exclude
data that is not usually necessary. I begin by describing full diagrams, because
it is easier to understand simplified diagrams as full diagrams with details
removed than to understand full diagrams as simplified diagrams with details
added.
5.2 Full Query Diagrams
Example 5-1. A simple query with one join
SELECT D.Department_Name, E.Last_Name, E.First_Name
FROM Employees E, Departments D
WHERE E.Department_ID=D.Department_ID
  AND E.Exempt_Flag='Y'
  AND D.US_Based_Flag='Y';
This query translates to the query diagram in Figure 5-1.
Figure 5-1. A full query diagram for a simple query
First, I'll describe the meaning of each of these elements, and
then I'll explain how to produce the diagram, using the SQL as a starting
point.
5.2.1 Information Included in Query Diagrams
In mathematical terms, what you see in Figure 5-1 is a directed graph: a collection of nodes and links, where
the links often have arrows indicating their directionality. The nodes in this
diagram are represented by the letters E and D. Alongside both
the nodes and the two ends of each link, numbers indicate further properties of
the nodes and links. In terms of the query, you can interpret these diagram
elements as follows.
5.2.1.1 Nodes
Nodes represent tables or table aliases in
the FROM clause—in this case, the aliases E and D.
You can choose to abbreviate table or alias names as convenient, as long as
doing so does not create confusion.
5.2.1.2 Links
Links represent joins between tables, where
an arrow on one end of the link means that the
join is guaranteed to be unique on that end of the join. In this case,
Department_ID is the primary (unique) key to the table
Departments, so the link includes an arrow on the end pointing to the
D node. Since Department_ID is not unique in the
Employees table, the other end of the link does not have an arrow.
Following my convention, always draw links so that any single arrow points
downward. This consistency pays off: when arrows always point downward, it
becomes much easier to refer consistently to master tables
as always being located below detail tables. In Chapter 6 and Chapter
7, I describe rules of thumb for finding optimum execution plans from a
query diagram. Since the rules of thumb treat joins to master tables differently
than joins to detail tables, the rules specifically refer to downward joins and upward joins as a shorthand means of distinguishing
between the two types.
Although you might guess that Department_ID is the
primary key of Departments, the SQL does not explicitly declare which
side of each join is a primary key and which side is a foreign key. You need to
check indexes or declared keys to be certain that Department_ID is
guaranteed to be unique on Departments, so the arrow combines
information about the nature of the database keys and information about which
keys the SQL uses.
5.2.1.3 Underlined numbers
Underlined numbers next to the nodes represent the fraction of
rows in each table that will satisfy the filter conditions on that table, where
filter conditions are nonjoin conditions in the diagrammed SQL that refer solely
to that table. Figure 5-1
indicates that 10% of the rows in Employees have
Exempt_Flag='Y' and 50% of the rows in Departments have
US_Based_Flag='Y'. I call these the filter
ratios.
Frequently, there are no filter conditions at all on one or
more of the tables. In that case, use 1.0 for the filter ratio (R), since 100% of the rows meet the (nonexistent)
filter conditions on that table. Also, in such cases, I normally leave the
filter ratio off the diagram entirely; the number's absence implies R=1.0 for that table. A filter ratio cannot be greater
than 1.0. You can often guess approximate filter ratios from knowledge of what
the tables and columns represent (if you have such knowledge). It is best, when
you have realistic production data distributions available, to find exact filter
ratios by directly querying the data. Treat each filtered table, with just the
filter clauses that apply to that table, as a single-table query, and find the
selectivity of the filter conditions just as I describe in Chapter 2 under Section
2.6.
During the development phase of an application, you do not
always know what filter ratio to expect when the application will run against
production volumes. In that case, make your best estimate based on your
knowledge of the application, not based on tiny, artificial data volumes on
development databases.
5.2.1.4 Nonunderlined numbers
Nonunderlined numbers next to the two ends of the link
represent the average number of rows found in the table on that end of the join
per matching row from the other end of the join. I call these the join ratios.
The join ratio on the nonunique end of the join is the detail join
ratio, while the join ratio on the unique end (with the arrow) is the master join
ratio.
Master join ratios are always less than or equal to 1.0,
because the unique key ensures against finding multiple masters per detail. In
the common case of a detail table with a mandatory foreign key and perfect
referential integrity (guaranteeing a matching master row), the master join
ratio is exactly 1.0.
Detail join ratios can be any nonnegative number. They can be
less than 1.0, since some master-detail relationships allow zero, one, or many
detail rows per master, and the zero-detail case can dominate the statistics. In the
example, note that the average Employees row matches (joins to) a row
in Departments 98% of the time, while the average Departments
row matches (joins to) 20 Employees. Although you might guess
approximate join ratios from knowledge of what the tables represent, query these
values when possible from the real, full-volume data distributions. As with
filter ratios, you might need to estimate join ratios during the development
phase of an application.
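To make these elements concrete, here is a minimal sketch of how you might record a query diagram as plain data. It is written in Python rather than SQL and is entirely my own illustration; the class and field names are hypothetical, not part of the diagramming method. It encodes Figure 5-1: two nodes with their filter ratios, and one link carrying the detail and master join ratios.

from dataclasses import dataclass

@dataclass
class Node:
    alias: str           # table alias from the FROM clause
    filter_ratio: float  # underlined number; 1.0 when the table has no filter

@dataclass
class Link:
    detail: str                # alias at the nonunique (upper) end
    master: str                # alias at the unique (arrow, lower) end
    detail_join_ratio: float   # average detail rows per master row
    master_join_ratio: float   # average master rows per detail row (never above 1.0)

# Figure 5-1: E (Employees) above, D (Departments) below
nodes = [Node("E", 0.1), Node("D", 0.5)]
links = [Link(detail="E", master="D", detail_join_ratio=20.0, master_join_ratio=0.98)]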
5.2.2 What Query Diagrams Leave Out
As important as what query diagrams include is what they do not
include. The following sections describe several things that query diagrams
omit, and explain why that is the case.
5.2.2.1 Select lists
Query diagrams entirely exclude any
reference to the list of columns and expressions that a query selects
(everything between SELECT and FROM, that is). Query
performance is almost entirely determined by which rows in the database you
touch and how you reach them. What you do
with those rows, which columns you return, and which expressions you calculate
are almost irrelevant to performance. The main, but rare, exception to this rule
is when you occasionally select so few columns from a table that it turns out
the database can satisfy the query from data in an index, avoiding access to the
base table altogether. Occasionally, this
index-only access can lead to major savings, but it has little bearing on
decisions you make about the rest of the execution plan. You should decide
whether to try index-only access as a last step in the tuning process and only
if the best execution plan without this strategy turns out to be too slow.
5.2.2.2 Ordering and aggregation
The diagram excludes any indication of ordering (ORDER BY), grouping (GROUP BY), or post-grouping filtering (HAVING).
These operations are almost never important. The sort step these generally imply
is not free, but you can usually do little to affect its cost, and its cost is
usually not the problem behind a badly performing query.
5.2.2.3 Table names
Query diagrams usually abstract table names
to table aliases. Once you have the necessary table statistics, most other
details are immaterial to the tuning problem. For example, it makes no
difference what the tables are named or which physical entities they
represent. In the end, you must be able to translate the result back to actions
on the original SQL and on the database (actions such as creating a new index,
for example). However, when you are
solving the abstract tuning problem, the more abstract the node names, the
better. Analogously, when your objective is to correctly add a series of
numbers, it's no help to worry about whether you are dealing with items in an
inventory or humans in a hospital. The more abstractly you view a problem, the
clearer it is and the more clearly you notice analogies with similar, past
problems that might have dealt with different entities.
5.2.2.4 Detailed join conditions
The details of join conditions are lost when
you abstract joins as simple arrows with a couple of numbers coming from outside
of the SQL. As long as you know the statistics of the join, the details of the
join columns and how they are compared do not matter.
5.2.2.5 Absolute table sizes (as opposed to relative sizes)
The diagram does not
represent table sizes. However, you can generally infer relative table sizes
from the detail join ratio, by the upper end of the
link. Given a query diagram, you need to know overall table sizes to estimate
overall rows returned and runtime of the query, but it turns out that you do not
need this information to estimate relative
runtimes of the alternatives and therefore to find the best alternative. This is
a helpful result, because you often must tune a query to run well not just on a
single database instance, but also on a whole wide range of instances across
many customers. Different customers frequently have different absolute table
sizes, but the relative sizes of those tables tend not to vary so much, and join
and filter ratios tend to vary far less; in fact, they tend to vary little
enough that the differences can be ignored.
5.2.2.6 Filter condition details
The details of filter conditions are lost
when you abstract them to simple numbers. You can choose an optimum path to the
data without knowing anything about the details of how, or using which columns,
the database excludes rows from a query result. You need only know how
effective, numerically, each filter is for the purpose of excluding rows. Once
you know that optimum abstract path, you need to return to the detailed
conditions, in many cases, to deduce what to change. You can change indexes to
enable that optimum path, or you can change the SQL to enable the database to
exploit the indexes it already has, but this final step is simple once you know
the optimum abstract path.
5.2.3 When Query Diagrams Help the Most
Just as some word
problems are too simple to really require abstraction (e.g., "If Johnny has five
apples and eats two, how many apples does he have left?"), some queries, like
the one in Example 5-1, are so
simple that an abstracted representation of the problem might not seem
necessary. If all queries were two-way joins, you could manage fine without
query diagrams. However, in real-world applications, it is common to join 6 or
more tables, and joining 20 or more is not as rare as you might think,
especially among the queries you will tune, since these more complex queries are
more likely to need manual tuning. (My personal record is a 115-way join, and I
routinely tune joins with more than 40 tables.) The more fully normalized your database and the more
sophisticated your user interface, the more likely it is that you will have
many-way joins. It is a myth that current databases cannot handle 20-way joins,
although it is no myth that databases (some more than others) are much more
likely to require manual tuning to get decent performance for such complex
queries, as compared to simple queries.
Apart from the obvious advantages of stripping away irrelevant
detail and focusing attention on just the abstract essentials, query diagrams
offer another, more subtle benefit: practice problems are much easier to
generate! In this chapter, I begin each problem with realistic SQL, but I cannot
use real examples from proprietary applications I have tuned. So, for good
examples, I invent an application database design and imagine real business
scenarios that might generate each SQL statement. Once I have thoroughly
demonstrated the process of reducing a SQL statement to a query diagram, the
rest of the book will have problems that start
with the query diagrams. Similarly, most of a math text focuses on abstract
math, not on word problems. Abstraction yields compact problems and efficient
teaching, and abstraction further saves me the bother of inventing hundreds of
realistic-sounding word problems that correspond to the mathematical concepts I
need to teach. You might not worry much about how much bother this saves me, but it saves you
time as well. To better learn SQL tuning, one of the best exercises you can do
is to invent SQL tuning problems and solve them. If you start with realistic
SQL, you will find this a slow and painful process, but if you just invent
problems expressed through query diagrams, you will find you can crank out
exercises by the hundreds.
5.2.4 Conceptual Demonstration of Query Diagrams in Use
I have stated without
proof that query diagrams abstract all the data needed to answer the most
important questions about tuning. To avoid trying your patience and credulity
too severely, I will demonstrate a way to use the query diagram in Figure 5-1 to tune its query in
detail, especially to find the best join order. It turns out that the best
access path, or execution plan, for most multitable queries involves indexed
access to the first, driving table, followed by nested loops to the rest of the
tables, where those nested loops follow indexes to their join keys, in some
optimum join order.
However, for many-way joins, the number of possible join orders
can easily be in the high millions or billions. Therefore, the single most
important result will be a fast method to find the best join order.
Before I demonstrate the complex process of optimizing join
order for a many-way join using a query diagram, let's just see what the diagram
in Figure 5-1 tells us about
comparative costs of the two-way join in the underlying SQL query, if you were
to try both possible join orders. To a surprisingly good approximation, the cost
of all-indexed access, with nested loops, is proportional to the number of table
rows the database accesses, so I will use rows touched as the proxy for query
cost. If you begin with the detail node E, you cannot work out an
absolute cost without assuming some rowcount for that table. However, the
rowcount is missing from the diagram, so let it be C and follow the consequences algebraically:
-
Using an index on the filter for E, the database must begin by touching 0.1 x C rows in the driving table E.
-
Following nested loops to master table D, the master join ratio shows that the database will reach 0.1 x C x 0.98 rows of D.
-
Adding the results of Steps 1 and 2, the total count of table rows touched is (0.1 x C)+(0.1 x C x 0.98).
-
The total rowcount the database will reach, then, factoring out C and 0.1 from both terms, is C x (0.1 x (1+0.98)), or 0.198 x C.
Beginning with the other end of the join, let's still express
the cost in terms of C, so first work out the
size of D in terms of C. Given the size
of D in terms of C, the cost of the
query in terms of C follows:
-
For every row in D, the database finds on average 20 rows in E, based on the detail join ratio. However, just 98% of the rows in E even match rows in D, so the total rowcount C must be 20/0.98 times larger than the rowcount of D. Conversely, then, the rowcount of D is C x 0.98/20.
-
Following an index to reach only the filter rows on D results in reaching half these rows for the driving table for this alternative, or C x 0.5 x 0.98/20.
-
Following the join to E, then, the database reaches 20 times as many rows in that table: 20 x C x 0.5 x 0.98/20, or C x 0.5 x 0.98.
-
Adding the two rowcounts and factoring out the common terms as before, find the total cost: C x (0.5 x 0.98 x ((1/20)+1)), or 0.5145 x C.
The absolute cost is not really the question; the relative cost
is, so you can choose the best plan and, whatever C is, 0.198 x C will
surely be much lower than 0.5145 x C. The query
diagram has all the information you need to know that the plan driving from
E and joining to D will surely be much faster (about 2.6 times
faster) than the plan driving from D and joining to E. If you
found these alternatives to be much closer, you might worry whether the cost
estimate based on simple rowcounts was not quite right, and I will deal with
that issue later. The message here is that the query diagram answers the key
questions that are relevant to optimizing a query; you just need an efficient
way to use query diagrams for complex queries.
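If you want to check this arithmetic mechanically, the following minimal sketch (Python, my own illustration; the variable names are not part of the method) reproduces both cost estimates from the diagram's ratios alone, treating the unknown Employees rowcount as C = 1, since only relative cost matters.

# Ratios read directly from Figure 5-1
filter_E, filter_D = 0.1, 0.5             # filter ratios on E and D
detail_ratio, master_ratio = 20.0, 0.98   # E rows per D row; D rows per E row

C = 1.0  # rowcount of Employees; any value works because only relative cost matters

# Plan 1: drive from E, then follow the join down to D
rows_E = filter_E * C
rows_D = rows_E * master_ratio
cost_drive_from_E = rows_E + rows_D            # 0.198 * C

# Plan 2: drive from D, then follow the join up to E
rowcount_D = C * master_ratio / detail_ratio   # size of D expressed in terms of C
rows_D2 = filter_D * rowcount_D
rows_E2 = rows_D2 * detail_ratio
cost_drive_from_D = rows_D2 + rows_E2          # 0.5145 * C

print(cost_drive_from_E, cost_drive_from_D, cost_drive_from_D / cost_drive_from_E)
# prints roughly 0.198, 0.5145, and 2.6: driving from E is about 2.6 times cheaper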
5.2.5 Creating Query Diagrams
Now
that you see what a query diagram means, here are the steps to create one from
an SQL statement, including some steps that will not apply to the particular
query shown earlier, but that you will use for later queries:
-
Begin with an arbitrary choice of table alias in the FROM clause, and place this in the middle of a blank page. For now, I will refer to this table as the focus table, simply meaning that this table will be the current point from which to add further details to the query diagram. I will choose alias E as the starting point for the example, just because it is first in the query.
-
Search for join conditions that match up to a single value of the focus table's primary key. For each such join that you find, draw an arrow pointing down toward the focus table alias from above, labeling the other end of the arrow with the alias at the other end of the join. (You will usually find at most one join into a table from above, for reasons I will discuss later.) When the link represents an outer join, add an arrowhead halfway along the link and point the arrowhead toward the optional table. In the example, there is no join to the Employees table's primary key, presumed to be Employee_ID. If you want to be certain that Department_ID is not a poorly named primary key of Employees, you can check the table's declared keys and unique indexes. If you suspect that it might really be unique but that the database designer failed to declare or enforce that uniqueness with either a declared key or index, you can check for uniqueness with SELECT COUNT(*), COUNT(DISTINCT Department_ID) FROM Employees;. However, you might be surprised how rarely there is any doubt about where the unique end of the join lies, just from column and table naming. You can generally get by just fine with intelligent guesses, revisiting those guesses only if the answer you come up with performs worse than you expect or require.
-
Search for join conditions that go from a foreign key in the focus table to a primary key in another table, and draw arrows for those joins pointing down from the focus table, with aliases for the joined-to tables on the lower end of each arrow. When a link represents an outer join, add an arrowhead pointing to the optional table halfway along the link.
-
Shift focus to another, previously unvisited node in the diagram and repeat Steps 2 and 3 until you have nodes that represent every alias in the FROM clause and arrows that represent every join. (For example, I represent a multipart join as a single arrow from a multipart foreign key to a multipart primary key.) Usually, you will find just a single arrow pointing down into a node, so you will normally search for fresh downward-pointing joins from nodes already on the lower, arrow end of a join. This usually leads to an upside-down tree structure, descending from a single detail table at the top.
-
After completing all the nodes and links, fill in the numbers for filter ratios and join ratios based, if possible, on production-application table statistics. If you don't have production data, then estimate as best you can. It is not necessary to add join ratios alongside links that represent outer joins. You almost always find that the optional table in an outer join (on the (+) side of the join, in old Oracle notation, or following the LEFT OUTER keywords in the newer style) has no filter condition, leaving the filter ratio equal to 1.0, which you denote by omitting it from the diagram.
-
Place an asterisk next to the filter ratio for any filter that is guaranteed to return at most a single row. This is not a function just of the ratio and the rowcount of the table, since a condition might return an average of a single row without guaranteeing a maximum of a single row. To guarantee at most a single row, you should have a unique index or a well-understood application constraint that creates a true guarantee.
Guaranteeing Uniqueness
Sometimes, a filter
condition is so close to guaranteeing a unique match that it is tempting to
treat it as unique. Usually, you can get away with this, but I find that it is
useful to first try to make the guarantee complete, by filling in some missing
condition or correcting the database design. For example, you might have a
filter Order_ID=:1 for an Orders table, with the primary key
of Orders that consists of the Order_ID and
Company_ID columns, but Company_ID is a single, dominant value
99.9% of the time. The most likely reason the developer left out any condition
on Company_ID is simply that he forgot that it is sometimes not equal
to the dominant value and that Order_ID does not always specify a
unique order by itself. I have found that when filter conditions are close to
unique, the intent is almost invariably to make them unique, and the missing
condition or conditions should be added to achieve the original intended
functionality.
Similar comments apply to almost-unique joins. When the join
conditions imply a many-to-almost-always-one join, chances are high that the
query or the database design should be corrected to guarantee a perfect
many-to-one join.
If you have available production data, it is the ideal source
of filter and join ratios. For Example 5-1, you would perform the following
queries (Q1 through Q5) to determine these ratios
rigorously:
Q1: SELECT COUNT(*) A1 FROM Employees WHERE Exempt_Flag='Y';
A1: 1,000
Q2: SELECT COUNT(*) A2 FROM Employees;
A2: 10,000
Q3: SELECT COUNT(*) A3 FROM Departments WHERE US_Based_Flag='Y';
A3: 245
Q4: SELECT COUNT(*) A4 FROM Departments;
A4: 490
Q5: SELECT COUNT(*) A5 FROM Employees E, Departments D WHERE E.Department_ID=D.Department_ID;
A5: 9,800
With A1 through A5 representing the results
these queries return, respectively, do the following math:
-
To find the filter ratio for the E node, find A1/A2.
-
To find the filter ratio for the D node, find A3/A4.
-
To find the detail join ratio, find A5/A4, the rowcount of the unfiltered join divided by the master-table rowcount.
-
To find the master join ratio, find A5/A2, the rowcount of the unfiltered join divided by the detail-table rowcount.
For example, the results from queries Q1 through
Q5—1,000, 10,000, 245, 490, and 9,800, respectively—would yield
precisely the filter and join ratios in Figure 5-1, and if you multiplied all of these
query results by the same constant, you would still get the same ratios.
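A quick sketch of the same arithmetic (Python, purely illustrative) shows how the four ratios in Figure 5-1 fall out of the counts A1 through A5:

a1, a2, a3, a4, a5 = 1_000, 10_000, 245, 490, 9_800

filter_ratio_E    = a1 / a2   # 0.1
filter_ratio_D    = a3 / a4   # 0.5
detail_join_ratio = a5 / a4   # 20.0: unfiltered join rowcount / master-table rowcount
master_join_ratio = a5 / a2   # 0.98: unfiltered join rowcount / detail-table rowcount

print(filter_ratio_E, filter_ratio_D, detail_join_ratio, master_join_ratio)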
5.2.6 A More Complex Example
Now I illustrate the diagramming process for a query that is
large enough to get interesting. I will diagram the query in Example 5-2, which expresses outer joins in the
older Oracle style (try diagramming this query yourself first, if you feel
ambitious).
Example 5-2. A more complex query, with eight joined tables
SELECT C.Phone_Number, C.Honorific, C.First_Name, C.Last_Name, C.Suffix,
       C.Address_ID, A.Address_ID, A.Street_Address_Line1, A.Street_Address_Line2,
       A.City_Name, A.State_Abbreviation, A.ZIP_Code, OD.Deferred_Shipment_Date,
       OD.Item_Count, ODT.Text, OT.Text, P.Product_Description, S.Shipment_Date
FROM Orders O, Order_Details OD, Products P, Customers C,
     Shipments S, Addresses A, Code_Translations ODT, Code_Translations OT
WHERE UPPER(C.Last_Name) LIKE :Last_Name||'%'
  AND UPPER(C.First_Name) LIKE :First_Name||'%'
  AND OD.Order_ID = O.Order_ID
  AND O.Customer_ID = C.Customer_ID
  AND OD.Product_ID = P.Product_ID(+)
  AND OD.Shipment_ID = S.Shipment_ID(+)
  AND S.Address_ID = A.Address_ID(+)
  AND O.Status_Code = OT.Code
  AND OT.Code_Type = 'ORDER_STATUS'
  AND OD.Status_Code = ODT.Code
  AND ODT.Code_Type = 'ORDER_DETAIL_STATUS'
  AND O.Order_Date > :Now - 366
ORDER BY C.Customer_ID, O.Order_ID DESC, S.Shipment_ID, OD.Order_Detail_ID;
As before, ignore all parts of the query except the
FROM and WHERE clauses. All tables have intuitively named
primary keys except Code_Translations, which has a two-part primary
key: (Code_Type, Code).
Note that when you find a single-table equality condition on
part of a primary key, such as in this query on Code_Translations,
consider that condition to be part of the join,
not a filter condition. When you treat a single-table condition as part of a
join, you are normally looking at a physical table that has two or more logical
subtables. I call these apples-and-oranges
tables: tables that hold somewhat different, but related entity types
within the same physical table. For purposes of optimizing a query on these
subtables, you are really interested only in the statistics for the specific
subtable the query references. Therefore, qualify all queries for table
statistics with the added condition that specifies the subtable of interest,
including the query for overall rowcount (SELECT COUNT(*) FROM
<Table_Name>), which becomes, for this example, a subtable
rowcount (SELECT COUNT(*) FROM Code_Translations WHERE
Code_Type='ORDER_STATUS').
If you wish to complete the skeleton of the query diagram
yourself, as an exercise, follow the steps for creating a query diagram shown
earlier up to Step 4, and skip ahead to Figure 5-4 to see if you got it right.
5.2.6.1 Diagram joins to the first focus
For Step 1, place the alias O,
arbitrarily, in the center of the diagram. For Step 2, search for any join to
O's primary key, O.Order_ID. You should find the join from
OD, which is represented in Figure 5-2 as a downward arrow from
OD to O. To help remember that you have diagrammed this join,
cross it out in the query.
Figure 5-2. Beginnings of the diagram for Example 5-2
5.2.6.2 Diagram joins from the first focus
Moving on to Step 3, search for the other joins that involve Orders
(joins from its foreign keys rather than joins to Order_ID), and add these as
arrows pointing down from O to C and to OT, since
these joins reach Customers and Code_Translations on their
primary keys. At this point, the diagram matches Figure 5-3.
Figure 5-3. After Step 3 for the second diagram
5.2.6.3 Change focus and repeat
This completes all joins that originate from alias O.
To remember which part is complete, cross out the O in the
FROM clause, and cross out the joins already diagrammed with links.
Step 4 indicates you should repeat Steps 2 and 3 with a new focus, but it does
not go into details about how to choose that focus. If you try every node as the
focus once, you will complete the diagram correctly regardless of order, but you
can save time with a couple of rules of thumb:
-
Try to complete upper parts of the diagram first. Since you have not yet tried the top node OD as a focus, start there.
-
Use undiagrammed joins to find a new focus that has at least some work remaining. Checking the list of joins that have not yet been crossed out would lead to the following list of potential candidates: OD, P, S, A, and ODT. However, out of this list, only OD is now on the diagram, so it is the only focus now available to extend the diagram. If you happened to notice that there are no further joins to C and OT, you could cross them out in the FROM clause as requiring no further work.
By following these rules of thumb you can complete the diagram
faster. If at any stage you find the diagram getting crowded, you can redraw
what you have to spread things out better. The spatial arrangement of the nodes
is only for convenience; it carries no meaning beyond the meaning conveyed by
the arrows. For example, I have drawn C to the left of OT, but
you could switch these if it helps make the rest of the diagram fit the
page.
You find no joins to the primary key of Order_Details
(to OD.Order_Detail_ID, that is), so all remaining links from OD point
downward to P, S, and ODT, each of which OD joins to on
its primary key. Since S and P are
optional tables in outer joins, add downward-pointing arrowheads midway down
these links. Cross out the joins from OD that these three new links
represent. This leaves just one join left, an outer join from S to the
primary key of A. Therefore, shift focus one last time to S
and add one last arrow pointing downward to A, with a midpoint arrowhead indicating this outer
join. Cross out the last join and you need not shift focus again, because all
joins and all aliases are on the diagram.
At this stage, if you've been following along, your join
diagram should match the connectivity shown in Figure 5-4, although the spatial arrangement
is immaterial. I call this the query skeleton or
join skeleton; once you complete it, you need
only fill in the numbers to represent relevant data distributions that come from
the database, as in Step 5 of the diagramming process.
Figure 5-4. The query skeleton
5.2.6.4 Compute filter and join ratios
There are no filter conditions on the
optional tables in the outer joins, so you do not need statistics regarding
those tables and joins. Also, recall that the apparent filter conditions
OT.Code_Type = 'ORDER_STATUS' and
ODT.Code_Type = 'ORDER_DETAIL_STATUS' are part of the
joins to their respective tables, not true filter conditions, since they are
part of the join keys to reach those tables. That leaves only the filter
conditions on customer name and order date as true filter conditions. If you
wish to practice the method for finding the join and filter ratios for the query
diagram, try writing the queries to work out those numbers and the formulae for
calculating the ratios before looking further ahead.
The selectivity of the conditions on first and last names
depends on the lengths of the matching patterns. Work out an estimated
selectivity, assuming that :Last_Name and :First_Name
typically are bound to the first three letters of each name and that the end
users search on common three-letter strings (as found in actual customer names)
proportionally more often than uncommon strings. Since this is an Oracle
example, use Oracle's expression SUBSTR(Char_Col, 1, 3) to return the
first three characters of each of these character columns.
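The weighted-average selectivity that query Q1 of Example 5-3 (later in this section) computes has a simple interpretation, sketched below in Python with invented name counts (my own illustration, not part of the book's SQL): if end users search each three-letter prefix pair in proportion to how many customers carry it, the expected filter ratio is the frequency-weighted average of each group's own selectivity, which reduces to the sum of the squared group counts divided by the square of the total count.

# Hypothetical counts of customers per (first-3-of-first-name, first-3-of-last-name) group;
# a real Customers table would have many thousands of much smaller groups.
group_counts = {("JOH", "SMI"): 900, ("MAR", "JON"): 500, ("XYZ", "QUA"): 10}
N = sum(group_counts.values())

# Probability a search uses a given prefix pair is about n/N; that search then matches n/N of the rows.
weighted_avg_selectivity = sum((n / N) * (n / N) for n in group_counts.values())

# Same quantity the SQL computes: SUM(COUNT(*)*COUNT(*)) / (SUM(COUNT(*)) * SUM(COUNT(*)))
assert abs(weighted_avg_selectivity - sum(n * n for n in group_counts.values()) / (N * N)) < 1e-12
print(weighted_avg_selectivity)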
Recall that for the apples-and-oranges table
Code_Translations, you should gather statistics for specific types
only, just as if each type was in its own, separate physical table. In other
words, if your query uses order status codes in some join conditions, then query
Code_Translations not for the total number of rows in that table, but
rather for the number of order status rows. Only table rows for the types
mentioned in the query turn out to significantly influence query costs. You may
end up querying the same table twice, but for different subsets of the table. Example 5-3 queries
Code_Translations once to count order status codes and again to count
order detail status codes.
The queries in Example 5-3 yield all the data you need for the
join and filter ratios.
Example 5-3. Queries for query-tuning statistics
Q1: SELECT SUM(COUNT(*)*COUNT(*))/(SUM(COUNT(*))*SUM(COUNT(*))) A1
    FROM Customers
    GROUP BY UPPER(SUBSTR(First_Name, 1, 3)), UPPER(SUBSTR(Last_Name, 1, 3));
A1: 0.0002
Q2: SELECT COUNT(*) A2 FROM Customers;
A2: 5,000,000
Q3: SELECT COUNT(*) A3 FROM Orders WHERE Order_Date > SYSDATE - 366;
A3: 1,200,000
Q4: SELECT COUNT(*) A4 FROM Orders;
A4: 4,000,000
Q5: SELECT COUNT(*) A5 FROM Orders O, Customers C WHERE O.Customer_ID = C.Customer_ID;
A5: 4,000,000
Q6: SELECT COUNT(*) A6 FROM Order_Details;
A6: 12,000,000
Q7: SELECT COUNT(*) A7 FROM Orders O, Order_Details OD WHERE OD.Order_ID = O.Order_ID;
A7: 12,000,000
Q8: SELECT COUNT(*) A8 FROM Code_Translations WHERE Code_Type = 'ORDER_STATUS';
A8: 4
Q9: SELECT COUNT(*) A9 FROM Orders O, Code_Translations OT
    WHERE O.Status_Code = OT.Code AND Code_Type = 'ORDER_STATUS';
A9: 4,000,000
Q10: SELECT COUNT(*) A10 FROM Code_Translations WHERE Code_Type = 'ORDER_DETAIL_STATUS';
A10: 3
Q11: SELECT COUNT(*) A11 FROM Order_Details OD, Code_Translations ODT
     WHERE OD.Status_Code = ODT.Code AND Code_Type = 'ORDER_DETAIL_STATUS';
A11: 12,000,000
Beginning with filter ratios, get the weighted-average filter
ratio for the conditions on Customers' first and last name directly
from A1. The result of that query is 0.0002. Find the filter ratio on
Orders from A3/A4, which comes out to 0.3. Since no
other alias has any filters, the filter ratios on the others are 1.0, which you
imply just by leaving filter ratios off the query diagram for the other
nodes.
Find the detail join ratios, to place alongside the upper end
of each inner-join arrow, by dividing the count on the join of the two
tables by the count on the lower table (the master table of that master-detail
relationship). The ratios for the upper ends of the joins from OD to
O and ODT are 3 (A7/A4) and 4,000,000
(A11/A10), respectively.
The ratios for the upper ends of the joins from O to
C and OT are 0.8 (from A5/A2) and 1,000,000
(or 1m, from A9/A8),
respectively.
Find the master join ratios for the lower ends of each
inner-join arrow, by dividing the count for the join by the upper table count.
When you have mandatory foreign keys and referential integrity (foreign keys
that do not fail to point to existing primary-key values), you find master join
ratios equal to exactly 1.0, as in every case in this example, from
A7/A6, A11/A6, A5/A4, and
A9/A4.
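If you want to verify this larger set of numbers, a similar sketch (Python, again just an illustration of the arithmetic) reproduces the ratios of Figure 5-5 from the counts in Example 5-3:

a1 = 0.0002        # weighted-average selectivity of the name filters on Customers
a2 = 5_000_000     # Customers rowcount
a3 = 1_200_000     # Orders in the last year
a4 = 4_000_000     # Orders rowcount
a5 = 4_000_000     # Orders joined to Customers
a6 = 12_000_000    # Order_Details rowcount
a7 = 12_000_000    # Order_Details joined to Orders
a8 = 4             # ORDER_STATUS codes
a9 = 4_000_000     # Orders joined to ORDER_STATUS codes
a10 = 3            # ORDER_DETAIL_STATUS codes
a11 = 12_000_000   # Order_Details joined to ORDER_DETAIL_STATUS codes

filter_C = a1               # 0.0002
filter_O = a3 / a4          # 0.3

# Detail join ratios: join rowcount / master-table rowcount
detail_OD_to_O   = a7 / a4     # 3
detail_OD_to_ODT = a11 / a10   # 4,000,000
detail_O_to_C    = a5 / a2     # 0.8
detail_O_to_OT   = a9 / a8     # 1,000,000

# Master join ratios: join rowcount / detail-table rowcount (all exactly 1.0 here)
master_ratios = [a7 / a6, a11 / a6, a5 / a4, a9 / a4]

print(filter_O, detail_OD_to_O, detail_O_to_C, master_ratios)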
Check for any unique filter conditions that you would annotate
with an asterisk (Step 6). In the case of this example, there are no such
conditions.
Figure 5-5. The completed query diagram for the second example
5.2.7 Shortcuts
Although the full process of completing a detailed, complete
query diagram for a many-way join looks, and is, time-consuming, you can employ
many shortcuts that usually reduce the process to a few minutes or even
less:
-
You can usually guess from the names which columns are the primary keys.
-
The master join ratio (on the primary-key end of a join) is almost always exactly 1.0, unless the foreign key is optional. Unless you have reason to suspect that the foreign key is frequently null or frequently points to no-longer-existing primary-key values, just leave it off, implying a value of 1.0. The master join ratio is never greater than 1.0.
-
Some databases implement constraints that rigorously guarantee referential integrity. Even without such constraints, well-built applications maintain referential integrity, or at least near-integrity, in their tables. If you guarantee, or at least assume, referential integrity, you can assume that nonnull foreign keys always point to valid primary keys. In this case, you do not need to run the (comparatively slow) query for joined-table counts to find the join factors. Instead, you just need the percentage of rows that have a null foreign key and the two table counts. To compute those, begin by executing SQL similar to this:
SELECT COUNT(*) D, COUNT(<Foreign_Key_Column>) F FROM <Detail_Table>;
SELECT COUNT(*) M FROM <Master_Table>;
Letting D be the first count from the first query, F be the second count from the same query, and M be the count from the second query, the master join ratio is just F/D, while the detail join ratio is F/M (see the short sketch following this list).
-
Join ratios rarely matter; usually, you can just guess. The detail join ratio matters most in the unusual case when it is much less than 1.0, and it matters somewhat whether the detail join factor is close to 1.0 or much larger. You can usually guess closely enough to get in the right ballpark.
-
Informed guesses for join factors are sometimes best! Even if your current data shows surprising statistics for a specific master-detail relationship, you might not want to depend on those statistics for manual query tuning, since they might change with time or might not apply to other database instances that will execute the same SQL. (This last issue is especially true if you are tuning for a shared application product.)
-
Filter ratios matter the most but usually vary by orders of magnitude, and, when filter ratios are not close, order-of-magnitude guesses suffice. You almost never need more precision than the first nonzero digit (one significant figure, if you remember the term from high school science) to tune a query well for any of the ratios, so round the numbers you show. (For example, 0.043 rounds to 0.04 with one significant figure, and 472 rounds to 500.)
-
Just as for join ratios, informed guesses for filter ratios are sometimes best. An informed guess will likely yield a result that works well across a range of customers and over a long time. Overdependence on measured values might lead to optimizations that are not robust between application instances or over time.
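As a tiny illustration of the referential-integrity shortcut above (Python, with made-up counts of my own), the two join ratios follow directly from the three counts D, F, and M, with no need to run the join itself:

D = 12_000_000   # rows in the detail table
F = 11_900_000   # detail rows with a nonnull foreign key
M = 4_000_000    # rows in the master table

# With referential integrity, every nonnull foreign key finds exactly one master,
# so the join rowcount equals F.
master_join_ratio = F / D   # about 0.99; never greater than 1.0
detail_join_ratio = F / M   # about 3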
The key value of diagramming is its effect on your
understanding of the tuning problem. With practice, you can often identify the
best plan in your head, without physically drawing the query diagram at all!
5.3 Interpreting Query Diagrams
Before I go further, take some time just to
understand the content of the query diagram, so that you see more than just a
confusing, abstract picture when you look at one. Given a few rules of thumb,
interpreting a query diagram is simple. You might already be familiar with
entity-relationship diagrams. There is a helpful, straightforward mapping
between entity-relationship diagrams and the
skeletons of query diagrams. Figure 5-6 shows the query skeleton for Example
5-1, alongside the corresponding subset of the entity-relationship
diagram.
Figure 5-6. The query skeleton compared to an entity-relationship diagram
The entity-relationship diagram encodes the database design,
which is independent of the query. Therefore, the query skeleton that encodes
the same many-to-one relationship (with the arrow pointing to the one) also comes from the database design, not from the
query. The query skeleton simply designates which subset of the database design
is immediately relevant, so you restrict attention to just that subset encoded
in the query skeleton. When the same table appears under multiple aliases, the
query diagram also, in a sense, explodes the entity-relationship diagram,
showing multiple joins to the same table as if they were joins to clones of that
table; this clarifies the tuning problem these multiple joins involve.
With join ratios on the query skeleton, you
encode quantitative data about the actual rows, in place of the qualitative
many indications in the entity-relationship
diagram. The detail join ratio says how many, on average, for the
many side of a many-to-one relationship, and the master join ratio says
how often zero when you find a many-to-zero-or-one relationship. This too is a function of the underlying
data, though not of the database design, and it is independent of the query.
These join ratios can vary across multiple database instances that run the same
application with different detailed data, but within an instance, the join
ratios are fixed across all queries that perform the same joins.
Only when you add filter ratios do you
really pick up data that is specific to a given query (combined with
data-distribution data), because filter conditions come from the query, not from
the underlying data. This data shows the relative size of each subset of each
table the query requires.
Query diagrams for correctly written queries (as I will show
later) almost always have a single detail table at the top of the tree, with
arrows pointing down to master (or lookup) tables
below and further arrows (potentially) branching down from those. When you find
this normal form for a query diagram, the query turns out to have a simple,
natural interpretation:
A query is a question asked about the detail entities that map to that top detail table, with one or more joins to master tables below to find further data about those entities stored elsewhere for correct normalization.
For example, a query joining Employees and
Departments is really just a question about employees, where the
database must go to the Departments table for employee information,
like Department_Name, that you store in the Departments table
for correct normalization.
Questions about business entities that are important enough to
have tables are natural in a business application, and these questions
frequently require several levels of joins to find inherited data stored in
master tables. Questions about strange, unnatural combinations of entities are
not natural to a business, and when you examine
query diagrams that fail to match the normal form, you will frequently find that
these return useless results, results other than what the application requires,
in at least some corner cases.
Following are the rules for query diagrams
that match the normal form. The queries behind these normal-form diagrams are
easy to interpret as sensible business questions about a single set of entities
that map to the top table:
-
The query maps to one tree.
-
The tree has one root, exactly one table with no join to its primary key. All nodes other than the root node have a single downward-pointing arrow linking them to a detail node above, but any node can be at the top end of any number of downward-pointing arrows.
-
All joins have downward-pointing arrows (joins that are unique on at least one end).
-
Outer joins are unfiltered, pointing down, with only outer joins below outer joins.
-
The question that the query answers is basically a question about the entity represented at the top (root) of the tree (or about aggregations of that entity).
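To make the structural rules concrete, here is a small sketch (Python, my own illustration; the skeleton is represented simply as (detail, master) pairs, one per downward arrow) that checks the tree-shape rules for the skeleton of Figure 5-4. It checks only the structural rules; the outer-join and filter rules still require judgment.

from collections import Counter

# Each pair is (detail_alias, master_alias): one downward arrow in the skeleton of Figure 5-4
links = [("OD", "O"), ("OD", "P"), ("OD", "S"), ("OD", "ODT"),
         ("O", "C"), ("O", "OT"), ("S", "A")]
aliases = {a for link in links for a in link}

def is_normal_form(aliases, links):
    master_counts = Counter(master for _, master in links)
    roots = [a for a in aliases if master_counts[a] == 0]
    if len(roots) != 1:
        return False   # exactly one node with no join to its primary key
    if any(master_counts[a] != 1 for a in aliases if a != roots[0]):
        return False   # every other node hangs below exactly one detail node
    if len(links) != len(aliases) - 1:
        return False   # a tree has one fewer link than nodes
    reachable, frontier = {roots[0]}, [roots[0]]
    while frontier:                       # everything must be reachable from the root
        node = frontier.pop()
        for detail, master in links:
            if detail == node and master not in reachable:
                reachable.add(master)
                frontier.append(master)
    return reachable == aliases

print(is_normal_form(aliases, links))    # True for the skeleton of Figure 5-4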
5.4 Simplified Query Diagrams
You will see that much of the detail on a full query diagram is unnecessary for all but the rarest problems. When you focus on the essential elements, you usually need only the skeleton of the diagram and the approximate filter ratios. You occasionally need join ratios, but usually only when either the detail join ratio is less than about 1.5 or the master join ratio is less than 0.9. Unless you have reason to suspect these uncommon values for the master-detail relationship, you can save yourself the trouble of even measuring these values. This, in turn, means that less data is required to produce the simplified join diagrams. You won't need table rowcounts for tables without filters. In practice, many-way joins usually have filters only on at most 3-5 tables, so this makes even the most complex query easy to diagram, without requiring many statistics-gathering queries.
Stripping away the usually unnecessary detail that I've just described, you can simplify Figure 5-5 to Figure 5-7.
Figure 5-7. Query diagram for Figure 5-5, simplified
Note that the detail join ratio from C to O in Figure 5-5 is less than 1.5, so continue to show it even in the simplified diagram in Figure 5-7.
When it comes to filters, even approximate numbers are often unnecessary if you know which filter is best and if the other competing filters do not share the same parent detail node. In this case, you can simply indicate the best filter with a capital F and lesser filters with a lowercase f. Further simplify Figure 5-7 to Figure 5-8.
Figure 5-8. Query diagram for Figure 5-7, fully simplified
Note that the detail join ratio from C to O is less than 1.5, so continue to show it even in the fully simplified diagram in Figure 5-8.
Although I've removed the filter ratios from Figure 5-8, you should continue to place an asterisk next to any unique filters (filters guaranteed to return no more than one row). You should also indicate actual filter values for lesser filters that share the same parent detail node. For example, if you have lesser filters on nodes B and C in Figure 5-9, show their actual filter ratios, as illustrated, since they share the parent detail node A.
Figure 5-9. Fully simplified query diagram, showing filter ratios for a shared parent
In practice, you can usually start with a simplified query diagram and add detail only as necessary. If the execution plan (which I explain how to derive in Chapter 6) you find from the simplified problem is so fast that further improvement is pointless, you are finished. (You might be surprised how often this happens.) For example, a batch query that runs in just a few seconds a few times per day is fast enough that further improvement is pointless. Likewise, you need not further tune any online query that runs in under 100 milliseconds that the end user community, as a whole, runs fewer than 1,000 times per day. If after this first round of tuning you think further improvement would still be worth the trouble, you can quickly check the feasibility of more improvement by checking whether you missed important join ratios. The fastest way to do this is to ask whether the single best filter accounts for almost all of the overall reduction in rowcount versus a wholly unfiltered query of just the most detailed table. Assuming that the join diagram maps out as an upside-down tree, the default expectation is that the whole query, without filters, would return the same number of rows as the most detailed table at the root of the join tree (at the top, that is).
With filters, you expect that each filter reduces that rowcount returned from the most detailed table (at the root of the join tree) by the filter ratio. If the best filter ratio times the rowcount of the most detailed table accounts for close to the number of rows that the whole query returns, you know you have not missed any important filter, and the simplified diagram suffices. On the other hand, if the most detailed table's rowcount times the best filter (or what you thought was the best filter!) would yield far more rows than the actual query yields, then you might have missed an important source of row reduction and you should gather more statistics. If the product of all filter ratios (calculated or guessed) times the rowcount of the most detailed table does not come close to the whole-query rowcount, you should suspect that you need further information. In particular, you might have hidden join filters, which are join ratios that unexpectedly turn out to be much less than 1.0; recognizing these and using them for a better plan can yield important further gains.
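As a rough sketch of this sanity check (Python, with hypothetical function and variable names of my own; the sample numbers are invented), compare the rowcount the query actually returns with what the root-table rowcount and your known filter ratios predict:

def check_for_missed_filters(root_rowcount, filter_ratios, actual_query_rowcount,
                             tolerance=2.0):
    """Flag a likely missed filter or hidden join filter when the known
    filters cannot account for the observed row reduction. The factor-of-two
    tolerance is an arbitrary choice for this sketch."""
    predicted = root_rowcount
    for ratio in filter_ratios:        # each filter multiplies the expected rowcount
        predicted *= ratio
    if predicted > tolerance * actual_query_rowcount:
        return ("Known filters predict about %.0f rows but the query returns %.0f; "
                "look for a missed filter or a hidden join filter (a join ratio "
                "well below 1.0)." % (predicted, actual_query_rowcount))
    return "The known filters account for the observed rowcount; the simplified diagram suffices."

# Hypothetical numbers: a 12,000,000-row root table and filter ratios of 0.3 and 0.0002
print(check_for_missed_filters(12_000_000, [0.3, 0.0002], actual_query_rowcount=720))

If the known filters account for the observed rowcount, the simplified diagram is enough; if they fall far short, gather more statistics before trusting the simplified picture.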