Chapter 2. Data-Access Basics


Come, fill the Cup, and in the fire of Spring
The Winter garment of Repentance fling:
The bird of Time has but a little way
To fly—and Lo! the Bird is on the Wing.
—Omar Khayyam, The Rubaiyat (translated by Edward FitzGerald)
You need a clear understanding of the operations of arithmetic to solve an algebra problem. Similarly, you must understand how a database reaches data in individual tables and how it joins data from multiple tables before you can understand how to combine these operations for an optimized execution plan. This book focuses on access methods that are most important to real-world queries and points out which methods are rarely or never useful.
You may find this chapter to be deceptively named; some of these data-access "basics" are quite advanced and obscure, because even the most basic of database operations can be quite involved at the detail level. I urge you not to get discouraged, though. While I include lots of gory detail for those few who really want it and for the relatively rare cases for which it is useful, you can tune quite well with just a passing understanding of indexed access and nested-loops joins. Optimizing a query to make it run faster requires only a high-level understanding of the material in this chapter.
I present this chapter in all its gory detail, though, for two reasons:
  • Some readers will find the later material much easier to follow and remember if they have a concrete, detailed picture in mind when I refer to specific methods of table access and table joins in later chapters. For example, such readers would have a hard time following and remembering rules of thumb about when to prefer hash joins over nested-loops joins if they knew these join methods only as black-box processes. If you are such a concrete thinker (like myself), this chapter, in all its detail, will help you understand the rest of the book.
  • The same people who tune queries are often asked awkward questions, like "Why does this query take 12 times longer to return 200 rows than this other query takes to return 1,000 rows?" Another common question is "Shouldn't we be using [some exotic database feature] to speed up this query?" Only with a detailed understanding of the basics discussed in this chapter is it possible to answer such questions well.
A word of explanation up front: many of the specifics in this chapter come from the behavior of Oracle. I find that highly specific descriptions help intuition in performance and tuning, because you can hold a detailed, concrete picture in your head. I could have chosen another database to describe table layouts and table-access and join methods, but no single choice would please everyone. I have found that, for the most part, the differences between database brands really do not matter to SQL tuning. In the few cases in which a vendor-specific implementation matters, I do describe the differences in detail.

2.1 Caching in the Database

All relational databases use some variant of the same general caching scheme to minimize physical I/O, which involves accessing disk storage, in favor of pure logical I/O, or memory-only data access. (Any data access is logical I/O. Memory-only I/O is pure logical I/O, while I/O from disk is both logical and physical I/O.) Figure 2-1 illustrates this basic caching approach.


Figure 2-1. Data caching
The long, horizontally stretched gray rectangle (which would be really long if it included the 100,000 blocks excised from the middle) represents a large segment of memory available to all the database sessions at once. This memory segment, known as the block buffer cache, contains uniform-sized blocks of data (usually 2KB-16KB, depending on the database configuration) copied from disk. The blocks contain recently accessed data from the database tables and indexes. The narrow gray rectangles represent these individual blocks.
With minor variations, the cache is populated and maintained as follows: every time the database must access a block of data not already in cache, it requests a read from disk (physical I/O) and places the newly populated block at the head end of the buffer list. Since the list length is fixed while the database runs, adding a block at the head end forces the block at the tail end of the list to fall off (i.e., to no longer be cached).
Under the covers, operations in the cache are managed by pointers in a type of linked list. The new head block is actually the same piece of memory as the old tail block, overwritten by new data and with pointers flipped to change its place on the list.


When, more commonly, the database finds a data block it needs already in the cache (requiring pure logical I/O), it removes that block from its current location and relocates it to the head of the list. Since a block involved in a logical I/O is just moved rather than added to the list, no block is pushed off the tail of the list. Again, the database handles the logical block move by pointers; it doesn't physically copy the data within memory.
Since blocks move back to the head of the list with every logical I/O, the cache ends up ordered from most recently used (MRU) blocks at the head end, to least recently used (LRU) blocks at the tail end. Frequently touched blocks are defined to be hot, while rarely touched blocks are cold. However, how hot or cold a block is depends on whether you are talking about it being touched by logical I/O or by physical I/O. The most frequently used blocks are hot from the point of view of logical I/O, though they might be cold from the point of view of physical I/O if they never leave the cache.
The block that falls off cache as a result of a physical I/O is the LRU block on the list. For this reason, the caching algorithm I'm describing is often called an LRU caching algorithm, and it is a common approach to caching throughout programming, not just for relational databases. The theory is that the hottest data will cycle to the head of the list and therefore live longer on the list. Really hot data might never leave the list at all. Keeping the most frequently used data in the cache enables rapid access to that data, which in turn enables queries to run faster than they would otherwise.
When a database requests physical I/O, such a request does not necessarily cause a physical disk to move a read head and spin to the correct sector on the disk. Disk subsystems and operating systems do their own caching, and as a result, what the database sees as physical I/O is much faster, on average, than people often realize. Physical I/O is more expensive than logical I/O, so cache misses (in the database cache) matter, but they don't matter as much as you might imagine.


The LRU caching scheme has several important implications for tuning:
  • The MRU blocks are the fastest to access, because they will be in the cache. It is a myth, though, that cached access is effectively free. When you count the full processing-time costs tied to logical I/O, it is from 30 to 200 times faster than physical I/O. However, this is not so fast that you can ignore logical I/O, because caching is usually so good that logical I/O happens on the order of 100 times more often than physical I/O. You can still have serious performance problems from the CPU costs of unnecessary logical I/O, even after you eliminate all physical I/O. If you work to reduce logical I/O, physical I/O will mostly take care of itself, as long as you have a reasonably sized cache.
  • It takes several minutes without logical I/O for a block on a typical, well-configured system to migrate from the head of the list to the tail and fall off, so the hottest any block can be in terms of physical I/O is one I/O every few minutes. With more frequent access than that, the block becomes well cached and sees less physical I/O.
  • Table rows and index entries that share a block with other data that is frequently accessed benefit from their proximity to that hot data. The more effectively you arrange for hot data to clump together within tables and indexes, the more effective the cache is.
  • Since a database cache is a shared cache, end users benefit from other end users' recent access to hot data, usually finding most (often 99% or more) of the data they want without physical I/O.
  • You will do little physical I/O on blocks that you access repeatedly within the same query, even if these blocks are normally cold, because your query itself will make them hot, and therefore cached, for most of the query.
  • Small tables and indexes (tables with fewer than 10,000 rows and indexes with fewer than 90,000 entries) tend to be almost perfectly cached. Occasionally, a small table or index is touched so infrequently that it is not well cached. However, infrequent access implies that any resulting slow performance will not add up to a large problem. Even when a query against a small object requires physical I/O, not much physical I/O will be performed, because the first few physical I/Os will place the entire object in the cache for the duration of the query. I use the term self-caching to describe the process by which a query to normally cold blocks makes those blocks hot for the duration of the query. Self-caching lets you safely ignore potential physical I/O to small tables and indexes, even in the uncommon case in which small objects are cold.
  • Even the largest indexes tend to be at least moderately well cached, because indexes are much more compact than tables, and because access plans involving these indexes tend to hit mostly the hotter parts of the indexes.
  • Only large tables (tables holding over 1,000,000 rows) tend to be poorly cached, unless you systematically touch only some hot subset of a table. Minimizing the number of different blocks that you hit in the largest tables, to avoid excessive physical I/O, is particularly important to performance.
  • All else being equal (i.e., given alternatives that require the same number of logical I/Os), favor the alternative that drives to hotter blocks. This is often the alternative that goes through an index, since index blocks and table blocks reached by indexed access tend to be hotter than random table blocks.


2.2 Tables

Tables are the fundamental core of a relational database. Relational theory describes tables as abstract objects, ascribing no significance to the order of the rows or the columns that make up a table. However, tables also exist in physical form on disk in your database server, with physical ordering that affects performance. When an application queries for those physical bytes stored on disk or cached in memory, the server processes must have some way to reach them.
The physical layout of table rows affects the performance of reads of those rows, so it is important to understand the types of tables and how they affect the layout. Figure 2-2 shows four different physical tables, illustrating four basic patterns of table growth and aging, and shows how these patterns affect data layouts.
Figure 2-2. Physical table growth and aging


Tables occupy one or more contiguous areas of disk space (called extents on Oracle[1]) that the server can read with minimal read-head movement and maximal efficiency. The database organizes table rows in blocks, which are too small to show here, usually 2KB-16KB in size. These blocks are constant-sized across most or all of the database (depending on the vendor). The blocks are the smallest units that the database reads from disk and caches, as discussed earlier. As formerly empty blocks within an extent become occupied, the high-water mark—the highest point of the table that the database needs to scan—rises toward the top of the extent until, reaching the top, it triggers allocation of a new extent.[2] Above the high-water mark is space reserved for future writes, but not space that the database will ever touch for a read. The high-water mark does not move downward unless you rebuild or truncate the table. Figure 2-2 illustrates growth patterns described in the following sections.
[1] Technically, an extent is contiguous according to virtual disk addressing, which is all the database software knows about. At a lower level, a RAID or other disk-striping/mirroring system can translate these contiguous virtual addresses to blocks on different disks, but you can safely ignore this subtlety when tuning SQL.
[2] This most closely describes Oracle, but the differences between vendors, in this area, are not important to SQL tuning.

2.2.1 Continuous Growth

The continuous growth pattern, shown for T1 in Figure 2-2, is the most common pattern among transaction tables, which continually acquire new rows but almost never lose old rows. It is often regrettable that old rows stay around long after they have outlived their usefulness, but deciding what is truly safe to purge is hard (and scary) work, even ignoring the effort of writing the routines to do the work. Somehow, this work always ends up at the end of the priority list for a product's features (and who needs it in the initial release of a product?), much to the delight of disk vendors.
In continuous growth tables, the level of interest in rows tends to drop with their age, making the newest rows, conveniently stored together at the top of the table, both the most likely to be queried and the best to cache. When the newest rows are the hottest, the natural clustering of new rows makes optimum use of the cache, and even a very large table can see a good cache-hit ratio (the fraction of logical I/Os that avoid physical I/Os) if you use indexed access that avoids the older rows.
A query that touches all of a continuous growth table (up to the high-water mark, that is), then discards all but some of the newest rows, will look good when the table is new and tiny. However, whole-table-access runtime grows linearly, assuming a constant table-growth rate, and will likely become intolerable. An access path that touches only new rows will maintain roughly constant efficiency, given steady table growth, since you will have a roughly constant number of rows created in the last week or so.
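To make the contrast concrete, here is a minimal sketch in Oracle SQL, assuming a hypothetical transaction table Orders with a Creation_Date column (the names are illustrative, not from the text):

CREATE INDEX Orders_Creation_Date ON Orders (Creation_Date);

-- Indexed access to just the newest rows: roughly constant work per
-- query, given a steady table-growth rate, no matter how old and
-- large the table becomes.
SELECT Order_ID, Order_Total
FROM Orders
WHERE Creation_Date > SYSDATE - 7;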

2.2.2 Purge Eldest

The purge eldest pattern, shown for T2 in Figure 2-2, I call the Ouroboros pattern, after the mythical snake that continually eats its own tail. In this table, the oldest rows are periodically purged (all of the oldest rows, not just some subset), completely emptying their blocks and making them available for inserts of the newest rows. The high-water mark need not move once this table reaches mature size, assuming you delete rows (once you begin deletes) at the same rate you insert them. The head of this snake (holding the newest rows) is always chasing the tail (holding the oldest rows), which retreats with every fresh delete. From the point of view of keeping the newest rows physically close together, this pattern is as good as the continuous growth pattern, with the added advantage that, since table growth halts once purging begins, the whole table has a better chance to be well cached. Note that this is an idealized case that is rarely seen, since retention of a few of the oldest rows, or a growth rate that exceeds the purge rate, will tend to gradually mix old and new rows ever more thoroughly.

2.2.3 Purge, Not by Age

The purge, not by age pattern, shown for T3 in Figure 2-2, reflects deletes that are not age-driven. Blocks become available for fresh inserts as soon as they have empty space that exceeds some threshold (typically 60% empty on Oracle), staying eligible until empty space falls below another threshold (typically 10% empty on Oracle). This happens to blocks that are scattered fairly randomly throughout the table, so new rows scatter accordingly, and caching becomes harder with time both as a result of average blocks being emptier and as a result of interesting rows being more spread out. However, such a pattern of purging tends to imply that the level of interest in rows is not age-related, tending to make such a table hard to cache regardless of whether purging scatters new rows.

2.2.4 Complete Purge and Regrowth

The complete purge and regrowth pattern, shown for T4 in Figure 2-2, reflects a wholesale purge with the beginnings of regrowth. In Figure 2-2, the entire table contents were recently deleted and the oldest rows shown are not that old, since the table has a long way to grow before it reaches its old high-water mark. This pattern is similar to the continuous growth pattern shown in T1, except that, since the table was formerly large and has not been rebuilt, the high-water mark has not fallen back, forcing full table scans to do just as much physical I/O as before the purge.
Oracle's TRUNCATE command, as an alternative to DELETE, can drop the high-water mark. On any database, you can also lower the high-water mark by dropping and rebuilding the table.
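For example, against a hypothetical Order_History table (note that TRUNCATE is DDL: it is fast, but it removes all rows and cannot be rolled back):

DELETE FROM Order_History;    -- empties the table; the high-water mark stays put
TRUNCATE TABLE Order_History; -- empties the table and drops the high-water mark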





2.3 Indexes

Indexes are less fundamental, functionally, than tables. Indexes are really just a means to reach table rows quickly. They are essential to performance, but not functionally necessary.

2.3.1 B-Tree Indexes

The most common and most important type of index, by far, is the B-tree index, which reflects a tree structure that is balanced (hence the B) to a constant depth from the root to the leaf blocks along every branch. Figure 2-3 shows a three-deep B-tree, likely reflecting an index pointing to 90,000-27,000,000 rows in a table, the typical size range for three-deep B-trees.
Figure 2-3. A three-level B-tree index


Like an index in a book, a database index helps the database quickly find references to some value or range of values that you need from a table. For example, indexing Last_Name in a Persons table allows rapid access to the list of table rows where Last_Name='SMITH' or where Last_Name>='X' AND Last_Name<'Y'. Unlike indexes in a book, however, database indexes are almost effortless to use; the database is responsible for walking through the indexes it chooses to use, to find the rows a query requires, and it usually finds the right indexes to use without being told. Databases don't always make the right choice, however, and the material in this book exists largely to address such problems.
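For example, here is a sketch of the Persons example in SQL (the index name is hypothetical):

CREATE INDEX Persons_Last_Name ON Persons (Last_Name);

-- Either of these conditions can drive a range scan on the new index:
SELECT * FROM Persons WHERE Last_Name = 'SMITH';
SELECT * FROM Persons WHERE Last_Name >= 'X' AND Last_Name < 'Y';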
Every index has a natural sort order, usually ascending in the order that is natural to the type of the indexed column. For example, the number 11 falls between 10 and 12, but the character string '11' falls between '1' and '2'. Indexes often cover multiple columns, but you can think of such indexes as having a single sort key made up of the concatenation of the multiple column values, with suitable padding to cause sorting on the second column to begin only after completing the sort on the first column.
Index access always begins at the single root block, which holds up to about 300 ranges of index values covered by data in the table. These ranges usually carry pointers to branch blocks for each range, when the index as a whole covers more than about 300 rows (the actual number depending on block and column sizes, mostly).[3] An index on a table with fewer than 300 rows typically has only the root block, which contains the pointers that lead directly to the indexed rows in the table. Each of these pointers takes the form of a block address and a row number within the block, for each indexed row. In any case, no matter how large the table, you can assume that the root block is perfectly cached, since every use of the index goes through that single block.
[3] I speak here of indexed rows, as opposed to table rows, because not all indexes point to all rows of a table. Most notably, Oracle indexes do not contain entries that point to rows with null values on all the indexed columns, so an index on a mostly null column can be very small, even on a very large table, if few rows have non-null values for that column.
A block address and row number within the block, together, are called a rowid. Indexes use rowids to point to rows in tables.


Assuming a table has more than about 300 indexed rows, the database follows a pointer in the root block to a branch block that covers the beginning of the range of values you requested in your query. If the table has more than about 90,000 indexed rows, the branch block, in turn, contains subranges with pointers to blocks in the next lower level. Finally (at the first branch level, with a table in the 300-90,000 indexed-row range), the database arrives at a leaf block with an exact value that matches the beginning of the range you want (assuming the range you queried has any rows at all) and a rowid for the first row in the range.
If the condition that drives access to the index potentially points to a range of multiple rows, the database performs an index range scan over the leaf blocks. An index range scan is the operation of reading (usually with logical I/O from cache) an index range over as many leaf blocks as necessary. The leaf blocks consist of value/rowid pairs in a list sorted by indexed value. The database sorts entries that repeat a value according to the order of the rowids, reflecting the natural order in which the database stores rows in the physical table. At the end of each leaf-block list, a pointer identifies the next leaf block, where the sorted list continues. If the table has multiple rows that satisfy the index range condition, the database follows a pointer from the first row to the next index entry in the range (over 99% of the time, the next index entry is within the same index leaf block), and so on, until it reaches the end of the range. Thus, each read of a range of sorted values requires one walk down the index tree, followed by one walk across the sorted leaf-block values.
Usually, a range scan touches only a single leaf block, since a single leaf block holds around 300 values, enough to satisfy most medium-sized range scans without leaving the block. However, a large range scan can traverse a long list of leaf blocks.


With a ballpark number of 300 entries per index block, you will find about 300 branch blocks on the first level. This is still a small enough number to cache well, so you can assume that reads of these index blocks are logical only, with no physical I/O. At the bottom, leaf level of the index, where the index uses three or more levels, there might be many more than 300 leaf blocks. However, even the leaf level of an index has a block count of roughly n/300, where n is the number of indexed rows, and the database can cache 1,000 or more blocks of an index efficiently, if it uses the index enough to really matter.
When the entire index is too large to cache well, you will see some physical I/O when accessing it. However, keep in mind that indexes generally cover just the entity properties that define the parts of the table that are most frequently accessed. Therefore, a database rarely needs to cache all of a large index—it needs to cache only a small subset of the index blocks that point to interesting rows—so even large indexes usually see excellent cache-hit ratios. The bottom line is this: when you compare alternative execution plans, you can ignore costs of physical I/O to indexes. When physical I/O happens at all, physical I/O to tables will almost always dominate physical I/O to indexes.

2.3.2 Index Costs

Having extra indexes around cannot hurt query performance, as long as you avoid using the wrong indexes. Indexes have their downsides, though. Optimizers choose the wrong indexes more often than you might think, so you might be surprised how many query performance problems are created by adding indexes to a database. Even if you have complete confidence that a new index will never take the place of a better index in a query execution plan, add indexes with caution and restraint.
In an ideal world, the only performance cost of adding indexes would come when you add, delete, or update rows. On quiet tables, the performance cost of indexes is never a problem, but on a busy, growing table, it can be a major cost. Inserts into an index are usually not a problem, especially on Oracle, which handles index-block locking for uncommitted work especially gracefully.
Deletes are more difficult than inserts, because B-trees do not behave reversibly when you add rows and then delete them. An index that has seen heavy deletes ends up with nearly empty blocks that are less efficient to cache and read than the same index, pointing to the same rows, would be had those deletes never occurred. Occasional, expensive index rebuilds are necessary to restore indexes that have experienced heavy deletes to full efficiency.
Updates are the most expensive index operation. When an update changes at least one indexed column, the database treats the update as both an insert (of the new value) and a delete (of the old value). This expensive two-part update, which is not necessary in tables, is necessary for indexes because updating index values actually changes where a row belongs in an index structure. Fortunately, indexes on primary keys are almost never updated, given correct database design, and updates of foreign keys are uncommon. The indexes that present the greatest danger to update performance are indexes on non-key columns that have real-world meanings that change with time, such as status columns on entities that frequently change status.
Some indexes exist for reasons that are independent of performance—for example, to enforce uniqueness. The need to enforce uniqueness is usually justification enough for a unique index, and, generally, unique indexes are also selective enough to be safe and useful. However, create nonunique indexes with caution; they are not free, and once you create them it is almost impossible to get rid of them without undue risk to performance, since it is very hard to prove that no important query requires any given index. When solving performance problems, I frequently advise creating new indexes. When I do so, though, I almost always have in mind at least one specific query, which runs often enough to matter and has proven unable to run fast enough without the new index.
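For illustration, a sketch of the two cases in SQL (the table and index names are hypothetical):

-- The need to enforce uniqueness is usually justification enough:
CREATE UNIQUE INDEX Orders_Order_ID ON Orders (Order_ID);

-- A nonunique index should be tied to at least one specific, frequent
-- query that has proven unable to run fast enough without it:
CREATE INDEX Orders_Customer_ID ON Orders (Customer_ID);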




2.4 Uncommon Database Objects

Simple tables and B-tree indexes suffice for almost all database needs. However, you should at least have a passing awareness of less common database object types, if only to argue against futile attempts to solve problems with wrong and exotic tools. This section describes the more popular special object types.

2.4.1 Index-Organized Tables

Index-organized tables are just indexes that don't point to tables. They are a nifty feature in Oracle, but you can approximate them in any database. Occasionally, a database will find all the columns it needs for a query in an index and will use that index without even touching the corresponding table. If you created an index that happened to have all the columns of a table, you might like to dispose of the table altogether. Index-organized tables handle just this scenario, saving space and the cost of maintaining a separate table. Since index-organized tables have no table to point to, they also avoid the need to include rowids, packing in more rows per block than ordinary indexes on the same columns. This better compactness makes index-organized tables easier to cache than ordinary indexes on the same columns. When you have an index just past the point at which it acquires an extra index level, replacing the index-table combination with a more compact, index-organized table will eliminate that extra level.
Consider using index-organized tables under the following circumstances:
  • Rows are not much longer than their index key. Tables generally store data more compactly and efficiently than an index with the same data. If rows are long, compared to their key, a much more compact key index read followed by reads of a plain table will work better, or at least as well, and more flexibly.
  • You almost always reach rows through a single index, probably through all or part of the primary key index. You can create ordinary, secondary indexes on index-organized tables, but when you use those secondary indexes, you defeat the purpose of the index-organized table.
  • You sometimes read rows in large ranges based on the sort order of an index. Large range reads within an index are particularly efficient compared to reading the same rows scattered randomly throughout an ordinary table. This factor in favor of index-organized tables tends to be in conflict with the previous factor, since access through a primary key is usually unique access, not access over a large range. With multipart primary keys, though, you sometimes read row ranges on partial keys, so you will occasionally see both of these factors in favor of index-organized tables.
  • You add, delete, and modify rows infrequently. Ordinary tables handle frequent data changes better than index-organized tables. In Online Transaction Processing (OLTP) applications, large tables tend to have frequent data changes; otherwise, they would not grow large. This tends to argue against large, index-organized tables in the OLTP world.
If you like the idea of an index-organized table but you are not using Oracle, you can get almost the same read-time benefits by building ordinary indexes with all the columns you need in your common queries. Even on Oracle, you can follow this strategy when you prefer to leave large, rarely needed columns out of an index for greater compactness.
The biggest drawback of adding columns to ordinary indexes comes when adding columns to already-unique indexes. You can specify a leading subset of the columns in an index-organized table as the unique key, but ordinary unique indexes enforce uniqueness only over the whole combination of columns. Unique indexes usually exist both to speed performance and to enforce uniqueness over the key. However, if you add non-key columns to an ordinary unique key index, you defeat correct enforcement of key uniqueness. You can solve this problem with two indexes: a narrow one just to enforce key uniqueness and a wider one to provide index-only access to table columns. However, databases tend not to choose a wider index when a narrower one already has all the columns needed for a unique read, so you might need to put forth extra effort to force the use of the wider index.
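To make this concrete, here is a sketch of the Oracle syntax for an index-organized table, followed by the two-index workaround described above for an ordinary heap table, using a hypothetical Order_Details design:

-- Index-organized: the primary-key index is the table; there is no
-- separate heap and no rowids to store.
CREATE TABLE Order_Details (
    Order_ID    NUMBER,
    Line_Number NUMBER,
    Product_ID  NUMBER,
    Quantity    NUMBER,
    CONSTRAINT Order_Details_PK PRIMARY KEY (Order_ID, Line_Number)
) ORGANIZATION INDEX;

-- On an ordinary (heap) table, the approximate equivalent is a narrow
-- index to enforce key uniqueness plus a wider index for index-only reads:
CREATE UNIQUE INDEX Order_Details_Key ON Order_Details (Order_ID, Line_Number);
CREATE INDEX Order_Details_Wide
    ON Order_Details (Order_ID, Line_Number, Product_ID, Quantity);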

2.4.2 Single-Table Clusters

As a feature, single-table clusters have been around longer than their closely related index-organized tables. A single-table cluster physically arranges table rows in the order of some key value. When interest in rows correlates well with that sort value, such an arrangement improves the hit ratio by keeping hot rows close together. The problem, a showstopper usually, is that it is hard to keep a table organized in sorted order unless rows arrive in sorted order. (And if they arrive in sorted order, you do not need to cluster; natural table ordering will work fine!) If rows do not arrive in sorted order, the database must leave room to shoehorn in new rows where they belong; otherwise, it must periodically reorder rows. Reordering is expensive, and leaving space for later rows wastes space and harms cache efficiency. The bottom line on single-table clusters is that I have never needed them to get good performance, and I do not expect you will either.

2.4.3 Multitable Clusters

Like single-table clusters, multitable clusters are presorted on some key, but multitable clusters have rows, in the same blocks, from multiple tables that join on that key, making the join between those tables extra fast. If you do not access the clustered tables together, having the other table's rows in the blocks you read just gets in the way, so a key question is how often you join the different tables in your application queries. If you have two or more always-joined tables that map one-to-one to each other (i.e., tables that share a unique key), multitable clusters probably work pretty well. But, in that case, why make them separate tables at all? Instead, just put the superset of all columns into a single table. More often, some master table has a one-to-many relationship with its details, such as Orders and Order_Details. Here, the problem becomes the fluidity of most one-to-many relationships. In the cluster block, you must allow for the possibility of many details and at the same time avoid wasting too much space when it turns out there is just one (or even no) detail row. As for single-table clusters, this leads to a tradeoff between wasted space and reordering costs. Just as for single-table clusters, I have never needed multitable clusters for good performance, and you probably will not either.

2.4.4 Partitioned Tables

Partitioned tables are another Oracle feature, originated largely to handle maintenance on truly huge tables. Imagine you have a trade-tracking system for a major brokerage. You probably have an immense history of individual trades, but you rarely have interest in trades older than one year. You want to keep efficient, continuous access to the recent trades, while having some chance to eventually archive really old trades without disrupting your system.
Partitioned tables look like a bunch of subtables, or partitions, that you can maintain independently. For example, you can take one partition offline without disrupting access to the others. Query statements need to refer explicitly only to the partitioned table name that represents the whole collection. The partitions are organized according to some range condition that determines which partition owns which rows. This range condition often uses a date, such as Trade_Date in our example, for which each partition covers a substantial date range, such as a month or a year. As long as the query conditions exclude some of the partitions, the query can skip access to those partitions; it can even run if those partitions are entirely offline. Partitioned tables have some great advantages for ease of administration, but in my own experience I have never needed them just to solve a performance problem. I expect, though, that they help for some very large tables, like those in the trade-tracking example, for which the rows naturally organize along a partition condition, usually a date. From the point of view of SQL tuning, you can treat partitioned tables essentially as normal large tables.
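For illustration, a sketch of the Oracle range-partitioning syntax for the trade-tracking example (all names and ranges are hypothetical):

CREATE TABLE Trades (
    Trade_ID   NUMBER,
    Trade_Date DATE,
    Symbol     VARCHAR2(10),
    Quantity   NUMBER
)
PARTITION BY RANGE (Trade_Date)
(
    PARTITION trades_2002 VALUES LESS THAN (TO_DATE('2003/01/01', 'YYYY/MM/DD')),
    PARTITION trades_2003 VALUES LESS THAN (TO_DATE('2004/01/01', 'YYYY/MM/DD'))
);

-- A condition on Trade_Date lets the database skip partitions entirely:
SELECT * FROM Trades
WHERE Trade_Date >= TO_DATE('2003/06/01', 'YYYY/MM/DD');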

2.4.5 Bit-Mapped Indexes

Bit-mapped indexes are designed largely to address special concerns in data warehousing. The chief virtue of bit-mapped indexes is that they allow you to efficiently combine multiple, not-very-selective conditions covered by different indexes to obtain a single short list of rows that satisfy all conditions at once. However, a single, multicolumn index will allow much the same thing, but without the disadvantages that I describe in the next paragraph, so I do not expect that bit-mapped indexes are often useful; certainly, I have never needed one to tune a query. (I have done relatively little data-warehousing work, so bit-mapped indexes might be more useful to you than to me, if data warehousing is your venue.)
Each stored value of a bit-mapped index points to what amounts to a list of yes/no bits that map to the whole list of table rows, with yes bits mapping to table rows that have that value for the indexed column. These bit strings are easy to AND and OR together with other bit strings of other bit-mapped indexes to combine conditions across bit-mapped values on multiple indexes. The big catch is that such bit strings are expensive to maintain in sync with frequently changing table contents, especially updates in the middle of a table. Bit-mapped indexes work best for tables that are mostly read-only; however, large tables do not grow large by being read-only, and small tables do not require special indexes for efficient data access. The best scenario for success is precisely the data-warehouse scenario for which bit-mapped indexes were designed: a data-warehouse database that is read-only during the day and periodically, probably during nights or weekends, does whole-table refreshes from some transactional database in which the tables grow steadily.
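For example, a sketch against a hypothetical read-mostly Sales fact table:

CREATE BITMAP INDEX Sales_Region ON Sales (Region);
CREATE BITMAP INDEX Sales_Product_Type ON Sales (Product_Type);

-- Neither condition is very selective alone, but the database can AND
-- the two bitmaps to produce one short list of qualifying rows:
SELECT COUNT(*) FROM Sales
WHERE Region = 'WEST' AND Product_Type = 'HARDWARE';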

2.5 Single-Table Access Paths

The most basic query requests some subset of a single table. Such queries are rarely interesting from a tuning perspective, but even the most complex query, joining many tables, starts with a single driving table. You should choose the access path to the driving table in a multitable query, just as you would for the single-table query that would result if you stripped the multitable query down, removing the joins, the nondriving tables, and the conditions on the nondriving tables. Every query optimization problem therefore includes the choice of an optimum single-table access path to the driving table. The table implementations and table-access methods differ slightly between the database vendors. To be both accurate and concrete, in this section I will describe table access on Oracle for illustration purposes, but the differences between Oracle and other database brands are not important to SQL tuning.

2.5.1 Full Table Scans

The most basic access path to a table is the full table scan, reading the whole table without an index. Figure 2-4 illustrates this method, as applied to a typical Oracle table.
Figure 2-4. A full table scan

Oracle recognizes that, since it is going to read the entire table, it ought to request physical I/O in parts larger than the block size, in this case reading 64KB at a time. This results in fewer but larger physical reads, which are faster than many small physical reads covering the same blocks. Not all database vendors follow this method, but it turns out to matter less than you might expect, because disk subsystems and operating systems usually read larger segments, even when the database requests a single block. The database might issue many small read requests, but these translate, in the lower system layers, into a few large read requests, with many smaller requests satisfied out of the disk subsystem cache. The reads continue from the first block to the high-water mark, including empty blocks along the way. Caching only allows the database to avoid a physical multiblock I/O when every block in the 64KB multiblock set is already in cache. The database reads the blocks of small to medium-sized tables into cache in the usual way, and they expire in the usual few minutes if no other query touches them. Caching entire small or medium-sized tables is often useful, and they end up remaining in cache if the database sees frequent full table scans of such tables.
Large tables present a danger to the caching strategy: the average block in a large table is unlikely to be needed often. If the database followed the usual caching strategy, a large table scan could flush most of the more interesting blocks (from indexes and other tables) out of the cache, harming the performance of most other queries. Fortunately, this is not usually a problem, because blocks from a large full table scan generally go straight to the tail of the cache, where they stay only long enough to satisfy the ongoing query, usually getting replaced by the next group of blocks from the same scan. (This behavior of large full table scans is one of the exceptions to the LRU caching behavior I described earlier.)

2.5.2 Indexed Table Access

The most important costs of a full table scan usually come in the CPU: the cost of examining every block below the high-water mark and the cost of examining every row within those blocks.
Unless you are reading a tiny table (in which case, any access method is likely fine) or reading a large fraction of a larger table, you should ensure that your queries reach the rows you need through an index. Figure 2-5 illustrates an index-based table-access method.
Figure 2-5. Indexed table access


The database begins with some indexed value that defines the beginning of a range of indexed values that meet some query criteria. Starting at the root block, it finds ranges and subranges that lead it down the index tree to the leaf block or blocks that store indexed values that satisfy the query criteria. The database finds the rowids that go with these values and follows them to the precise table blocks, and rows within those blocks, that satisfy the query.
Let's compare the indexed and full table scan access methods with a concrete example. Figure 2-6 shows two paths to read five rows, shown in black, from a 40-block table, which would typically contain around 3,200 rows.
Figure 2-6. Indexed access versus a full table scan

In this example, the table is too small for its index to likely have more than two levels, so the database goes straight from the index root block to the leaf block that holds the start of the five-row range that the query requires. (The middle branch level, which an index this small does not need, is shown in gray.) Probably, the database finds all five index entries in the same leaf block, but it might need to hop to the next leaf block to complete the range if the range started right at the block boundary. Armed with the five rowids, the database goes to the table blocks that hold the five rows. In one case in the example, the database performs two sequential logical I/Os to the same table block, because the rows happen to be close together, but the second logical I/O will surely be cached. (This is an example of the benefit of interesting rows that happen to cluster together in a physical table.)
In the full-table-scan alternative, if the table blocks are not in cache to begin with, the database performs five 64KB reads, covering the whole table up to the high-water mark. Then, in CPU, the database steps through all 40 blocks and all 3,200 rows, discarding all but the 5 rows that match the query condition. If the database had no cache and you cared only about the time to move read heads to perform physical I/O, you would count seven physical I/O operations for the indexed plan and five for the full table scan and choose the full table scan. However, a small table and a smaller index, such as these, are likely completely cached, and 7 logical I/Os to individual blocks are cheaper than 40 logical I/Os, even for a full table scan. Apart from logical-I/O costs, the indexed plan avoids CPU costs of looking at over 3,000 rows you do not need.
Both plans, you might suspect, would be fast enough that the difference would not matter,[4] since efficiency against small tables is not that important in single-table reads. Expanding this example to larger-sized tables, though, the questions become more interesting, mixing physical and logical I/O with runtime differences that are long enough to matter.
[4] I am not trying to waste your time with an unimportant example. The difference would matter if I scaled the example up to larger tables and indexes, but such an example would be impossible to illustrate in the detail shown in Figure 2-6. I use this smaller example to get the general point across.

2.5.3 Choosing Between a Full Table Scan and Indexed Access

A superficial analysis often favors full table scans. However, a more careful analysis requires you to take into account several considerations that commonly make indexed reads more favorable than they appear superficially:
  • Index reads are almost always cached.
  • Table blocks reached by an index tend to be hotter and are more likely to be cached, because indexed reads are specific to rows you (and others, too, probably) really want, while full table scans treat all rows equally.
  • A single block is more likely to be cached than a multiblock group, so the effective cache-hit ratio on a table is better for an indexed read. For example, if a randomly scattered half of the blocks in a table are cached, the hit ratio for single-block reads to the table is 50%, but the probability of finding all eight blocks of a multiblock read already cached is just 0.5 to the eighth power, or about 0.4%. To reach a 50% effective hit ratio for eight-block reads, you would need a 91.7% hit ratio on randomly cached individual blocks (the arithmetic is spelled out just after this list).
  • Disk subsystems usually make single-block reads effectively multiblock, converting nearby single-block reads into virtual I/O, so the seeming advantage of multiblock reads for full table scans is less than it would seem.
  • Indexed reads examine only a small part of each block, the rows you want, instead of every row in the block, saving CPU time.
  • Indexed reads usually scale better as a table grows, giving stable performance, whereas a full table scan becomes steadily worse, even while it might start out a little better than the indexed plan.
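To check the arithmetic in the third bullet: if a randomly scattered fraction p of a table's blocks is cached, an eight-block read finds all eight blocks in cache with probability p^8. For p = 0.5, that is 0.5^8 ≈ 0.0039, or about 0.4%. For the effective eight-block hit ratio to reach 50%, you need p^8 = 0.5, or p = 0.5^(1/8) ≈ 0.917, the 91.7% single-block hit ratio cited above.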
The choice to favor indexed access or a full table scan depends on the fraction of the table that the single-table query will read. The database's optimizer will make this choice for you, but not always correctly. If SQL is slow enough to merit tuning, you need to decide for yourself. Here are some general fraction-read ranges to use in choosing your best strategy:

>20% of the rows
Favor full table scans.


<0.5% of the rows
Favor indexed access.


0.5%-20% of the rows
It depends.
The 0.5%-20% range is awkward to handle. Conditions should especially favor indexed access for you to prefer an index at the 20% end of this range. Likewise, don't consider a full table scan at the 0.5% end of the range unless conditions strongly favor a table scan. Here are some factors pertaining to particular queries that tend to favor indexed access toward the higher end of the 0.5%-20% range:
  • The table is well-clustered on the indexed column, resulting in self-caching over the range. Multiple logical I/Os will hit the same blocks, and the later reads of those blocks will likely remain in cache after the earlier reads put them there (if necessary).
  • The query accesses hotter-than-average rows, resulting in better caching over the indexed range than the full table scan will see over the whole table.
  • The query goes in on one value only, reaching rows in rowid order. Where you have exact equality conditions on the fully indexed key, reading rowids for that single key value, the index scan returns those rowids in sorted order. Where the database requires physical I/O, this results in an access pattern much like the full table scan, with the read head moving smoothly from the beginning of the range to the end. Since close-together rows get read sequentially, self-caching is particularly likely, both in the database's cache and in the disk I/O subsystem's cache, when that subsystem does read-ahead.
    On the other hand, if you access a range of values, such as Retirement_Date BETWEEN '2002/01/01' and '2003/01/01', you will find a whole series of sorted rowid lists for each date in that range. The read-head movement will look much more random and therefore will be less efficient. Self-caching might not even work in this case if the runtime of the query exceeds the life of the blocks in the cache. Even if you drive off an equality, you can get this less efficient alternative if you have a multicolumn index. For example, Last_Name='Smith' is really a range condition on an index of (Last_Name, First_Name), since this full pair has many values that satisfy the single-column condition.
The precise formulas that control the tradeoff between full table-scan performance and range-scan performance are complex and not very useful, because you'd only be able to guess at the inputs (such as the relative hit ratios between blocks reached by the range scan and other table blocks). All this sounds hideously complex, I know, if you happen to be in that awkward 0.5%-20% range, but, in practice, the problem of handling this middle range turns out to be pretty simple:
  • If a table is big enough for the difference between a full table scan and indexed table access to matter, you better have a condition that is selective enough to use an index; otherwise, you are likely returning more rows than are useful! I will later describe in more detail why few well-designed queries need to return a significant fraction (even 1%) of a large table. Real-world applications exist mostly to give end users convenient access to data. When end users work with data online, they find it inconvenient to handle large data volumes. End users are somewhat more tolerant of large data volumes in reports, but even then a report that provides more data than an end user can digest is not well designed.
  • If you're in doubt about whether to use a full table scan or indexed access, just time both alternatives; trial and error works fine when you have only a couple of choices. Keep in mind, though, that whichever alternative you test first will have an unfair disadvantage if you run the experiments close together, since the second experiment will find blocks cached by the first experiment. I usually run each alternative twice, in close succession, and use the second runtime for each, which generally finds perfect caching. If the runtimes come out close, I might repeat experiments 10 minutes or more apart to measure more realistic physical I/O costs, repeating the first experiment 10 minutes after the second, to observe for reproducibility.
  • Do not look for small improvements. If the two alternatives are close to the same, just stick with the execution plan you already have. Changing a single statement to squeeze a few percent improvement is not likely worth the effort. You should usually look for twofold or better improvements in runtimes when tuning—these are possible surprisingly often—when the performance is slow enough for such improvements to matter.


2.6 Calculating Selectivity

When tuning queries, you can best picture nonjoin (single-table) conditions as filters. These filters pass through the table rows that the application needs (the rows satisfying the conditions), while discarding the rows that the application does not need (the rows failing to satisfy the query conditions). The application has functional reasons to exclude unneeded rows, but from a performance point of view the job of the filter is to save the database work and time. The trick of tuning is to avoid work, as much as possible, on rows destined for the trash heap.
Selectivity is the power of a filter as a row-excluding tool, expressed as the fraction of table rows that the filter passes through. The database can apply filters at three points, with varying success at minimizing query cost:

Determining the index range condition
Conditions that determine the limits of an index range scan require no work at all on the excluded rows.


Determining the table rows reached from the index
Sometimes, conditions do not determine the index range limits but nevertheless can be evaluated in the index before reaching the table. These conditions require touching ultimately excluded row entries in the index, but not in the table.


Determining the rows returned after table access
If a condition requires columns held in the table but not in an index, a database cannot evaluate that condition until it reads the table rows. Filters evaluated in the table are of no value in reducing the costs of reaching the rows of that table, though they reduce network costs because the excluded rows do not need to be returned. If the filtered table is not the last or the only table touched, any filter also reduces the costs of joins to other tables later in the execution plan.

2.6.1 Filter Selectivity

In this section, I explain how to calculate the selectivity of conditions on a table. Let's begin with some definitions:

Individual-condition filter selectivity
The fraction of table rows that satisfy a single condition on that table.


Multiple-condition filter selectivity
The fraction of table rows that satisfy the combination of conditions referring exclusively to that table.


Filter independence
The assumption, usually true, that you can calculate multiple-condition selectivity as the simple product of individual-condition selectivity fractions. For example, a condition on a person's first name and a condition on a person's ZIP code are logically independent. You can assume that the fraction of rows that have both the correct first name and the correct ZIP code will be roughly the product of the fraction of rows that have the correct name and the fraction that have the correct ZIP code. For example, if 1/100th of the rows contain the desired first name and 1/500th of the rows contain the desired ZIP code, then the multiple-condition filter selectivity is 1/100 x 1/500 = 1/50,000.


Filter redundancy
The opposite of filter independence. The truth of one condition guarantees the truth of another. For example, a condition on the ZIP code likely guarantees a single value for telephone area code, so the selectivity of conditions on both would be no better than the selectivity of the ZIP code alone. You can always test for complete or partial filter redundancy by calculating the multicondition filter selectivity based on filter independence and seeing whether it equals the actual selectivity of the combination of the conditions.
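For example, here is a sketch of such a test, assuming hypothetical Zip_Code and Area_Code columns on a Customers table (the column names are illustrative only):

SELECT COUNT(*) FROM Customers;   -- total rows
SELECT COUNT(*) FROM Customers WHERE Zip_Code = '94303';
SELECT COUNT(*) FROM Customers WHERE Area_Code = '650';
SELECT COUNT(*) FROM Customers
WHERE Zip_Code = '94303' AND Area_Code = '650';

-- Under filter independence, the last count would be close to
-- total x (zip_count/total) x (area_count/total). If it comes out
-- close to zip_count itself, the area-code filter is redundant.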
When you tune a query and evaluate filter selectivity, start by asking if the query stands alone or if it represents a whole group of queries. If it represents a group of queries, ask about the distribution of values within that group. For example, consider the query:[5]
[5] Note that I usually replace the SELECT list with ... in my examples. It turns out that the list of columns and expressions that you select has little impact on query performance, which is dominated by the list of tables in the FROM clause and the conditions in the WHERE clause.
SELECT ... FROM Orders WHERE Unpaid_Flag='Y';
This query has (we would hope) high selectivity, being true for a small fraction of the complete history of orders. If you can count on finding a value of 'Y' for Unpaid_Flag in the query, you might want an index on that column. But if the query is part of a group that just as often searches for the unselective condition Unpaid_Flag='N', you should likely avoid the index. In this example, the meaning of the flag is likely special to the query, driving the very purpose of the query (to find bills to send out), so you can count on finding primarily queries against 'Y', which is the rare value.
Yes, I promised earlier that you did not need to understand the application to tune the SQL. Really, you do not. You can always ask the developers if the application SQL will consistently point to the rare indexed value. However, you might be surprised how much you can deduce about an application with just a little thought about what makes sense, given the names of the tables and columns.


To calculate the selectivity of the condition Unpaid_Flag='Y', begin by executing the following two queries:
SELECT COUNT(*) FROM Orders WHERE Unpaid_Flag='Y';
SELECT COUNT(*) FROM Orders;
The selectivity of the condition is the first count divided by the second.
On the other hand, consider the query:
SELECT ... FROM Order_Details WHERE Order_ID=:id;
End users would query for details of orders roughly at random. You could assume this even if the application replaced the bind variable :id with an actual value, since an application would have no reason to always query the same order. The query against any particular ID does not stand alone, but represents a whole family of queries that you should tune as a unit. End users are as likely to query one order as another, so you could estimate the filter selectivity with:
SELECT 1/COUNT(DISTINCT Order_ID) FROM Order_Details;
A more subtle case arises when the end user might query on any value, but the end user is more likely to query on common values than uncommon values. For example, if operators bring up customers by finding all customers that have the last name of the calling customer, they will bring up common last names more often than uncommon ones with a query such as:
SELECT ... FROM Customers WHERE Last_Name = 'SMITH';
Here, if you just counted distinct names, as you counted order IDs earlier, you would see an over-optimistic selectivity that assumed you were just as likely to search for Last_Name='KMETEC' as for Last_Name='SMITH'. Each last name has a selectivity of n(i)/C, where n(i) is the count of rows with the ith nonnull last name and C is the count of all rows in the table. If choosing any last name were equally probable, you could just average n(i)/C over all the last names. That average would equal one over the number of distinct names. However, the probability of searching on a last name in this scenario is n(i)/C', where C' is the count of rows having nonnull last names. Therefore, you really need the sum of the selectivities times the probability of seeing each selectivity (i.e., the sum of (n(i)/C') x (n(i)/C)) over all last names. Since C' is also the sum of the individual n(i) values, you can compute the filter selectivity in SQL as follows:
    SELECT SUM(COUNT(Last_Name)*COUNT(Last_Name))/
            (SUM(COUNT(Last_Name))*SUM(COUNT(*))) 
    FROM Customers GROUP BY Last_Name;
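    To see what the weighting does, try the formula on hypothetical counts: suppose the table holds just ten customers with nonnull last names (so C=C'=10), six named SMITH, three named JONES, and one named KMETEC. Averaging over distinct names would estimate a selectivity of 1/3, but the weighted sum gives (6x6 + 3x3 + 1x1)/(10x10) = 46/100 = 0.46, correctly reflecting that most searches go after the common name SMITH.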

    2.6.2 Index Range-Condition Selectivity

    Index range-condition selectivity is the fraction of table rows the database examines in an index while scanning the index. Every index range scan must have two endpoints that define the beginning and end of the range. These endpoints can be equivalent to positive or negative infinity, meaning that the range can be unbounded on either end, though not generally on both ends. The range can exclude either or both of the range endpoints, depending on the nature of the bounding condition. For example, the range condition (Salary > 4000 AND Salary < 8000) excludes both endpoints with > and < bounding conditions.
    There are common problems that prevent a database from finding a concrete beginning and end for a range scan, even given a very selective condition on the indexed column, generally preventing effective use of the index. It is hard or even impossible (depending on the function) to translate a condition on SomeFunction(Some_Column) into a single range condition on a simple index leading with Some_Column. Generally, a database does not even try to do so.
    Function-based indexes, supported on Oracle, are the main exception to this rule. They allow you to specifically index the result of some expression on a table's columns—for example, UPPER(Last_Name)—to drive indexed access to conditions on that expression—for example, UPPER(Last_Name) LIKE 'SMITH%'.
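    For example, on Oracle you might create and use such an index as follows (the index name is hypothetical):
    CREATE INDEX Customers_Upper_Last_Name
        ON Customers (UPPER(Last_Name));
    
    SELECT ... FROM Customers WHERE UPPER(Last_Name) LIKE 'SMITH%';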


    Therefore, expressions that do more than simply name a column do not usually enable indexed access to ranges defined on that column, often making indexes useless for queries using those expressions. This is a double-edged sword: it forces you to rewrite some queries to enable the index that you want to use, but it is also a useful tool, allowing you to rewrite other queries to prevent the database from using indexes that you do not want to use.
    A subtle example of a function that disables use of an index is when you compare expressions of different types and the database applies a type-conversion function implicitly. For example, the type-inconsistent condition CharacterColumn=94303 implicitly becomes, in Oracle, TO_NUMBER(CharacterColumn)=94303. To resolve this problem and enable use of an index on the character column, perform the conversion explicitly on the other side. For example:
    Replace CharacterColumn=94303 with CharacterColumn=TO_CHAR(94303) or with CharacterColumn='94303', so that the indexed column no longer sits inside an implicit conversion function.
    Comparing a character expression to anything else in Oracle results in the character value converting to the other type, unless you convert the other side of the comparison explicitly. This is a bit surprising, since numbers and dates can always convert without error to character strings, but character strings frequently fail to convert to numbers and dates.


    Even when comparing numbers to numbers, type conversion can lead to difficulties when a database vendor supports different number types, such as integer and decimal types. Type conversions also interfere with efficient, indexed join paths for which a foreign key of one type points to a primary key of another type.
    To avoid the problem of implicit type conversion preventing the use of indexes, your database design should always use foreign keys with the same type as the primary keys they reference.
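    A minimal sketch of this rule, with hypothetical table definitions:
    CREATE TABLE Orders (
        Order_ID NUMBER PRIMARY KEY);
    CREATE TABLE Order_Details (
        Order_Detail_ID NUMBER PRIMARY KEY,
        -- Same NUMBER type as the key it references, so the join
        -- O.Order_ID=D.Order_ID requires no implicit conversion:
        Order_ID NUMBER REFERENCES Orders(Order_ID));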


    Several rules apply when you figure out how large an index range scan will be:
    • Conditions on the leading column of the index under consideration are suited to provide a range start or endpoint. Conditions on later columns of the same index are not suited to provide a range start or endpoint, unless you also have exact equality conditions on all preceding columns of that index. For example, an index on (Date_Column, ID_Number) applied to the condition Date_Column >= TO_DATE('2003/01/01', 'YYYY/MM/DD') delivers a range scan fully determined by the date condition. Further conditions on the second column, such as ID_Number=137, do not further narrow the index range scan. To further narrow the range based on the second condition, you have to look at a long list of ranges, one for each possible value of Date_Column that satisfies the first condition, but databases do not do this. However, if you flip the index column order to (ID_Number, Date_Column), then the same two conditions together define the range endpoints and you get a much shorter, faster index range scan. (See the sketch at the end of this section for a concrete illustration of this column-order effect.)
    Since query conditions on dates (and even more so on date-times, which are dates that include a time component) tend not to be equality conditions, multicolumn indexes usually should not lead with a date-type column.


    • Most databases assume that Indexed_Col IS NULL denotes too large a range to be useful, and they ignore such conditions when establishing range endpoints.
    DB2 is the exception to this rule. Oracle does not create index entries for rows whose values are null for every column in the index. DB2 does, though, and it appears to treat nulls as just another value, having the same likelihood of appearing as other values. This tends to make DB2 all too likely to choose a plan driving from an is-null condition, since nulls, where allowed, tend to be much more common than individual nonnull values.


    • The database assumes that Indexed_Col IS NOT NULL covers too large a range to be useful, so the database will not drive to an index from this condition. In rare cases, having any nonnull value is so rare that an index range scan over all possible nonnull values is beneficial. In such cases, if you can figure out a safe lower or upper limit to the range of all possible values, you can enable a range scan with a condition such as Positive_ID_Column > -1 or Date_Column > TO_DATE('0001/01/01','YYYY/MM/DD').
    • Indexed_Char_Col LIKE 'ABC%' establishes the start and end of a legal range for an index, with Indexed_Char_Col as the leading column of the index. (LIKE is SQL's pattern-matching comparator, and % is the pattern-matching wildcard.)
    • Indexed_Char_Col LIKE 'ABC%DEF' also establishes the start and end of a legal range, but in this case only the 'ABC%' part of the comparison string contributes to narrowing the index range scan.
    • Indexed_Number_Column LIKE '123%' does not establish the start and end of a legal range, because the LIKE comparison is meaningful only when applied to character strings and the database must implicitly convert Indexed_Number_Column to a character string to check this condition, disabling any index leading with Indexed_Number_Column. In terms native to numbers, this condition would point to a whole series of ranges:
      (Indexed_Number_Column >= 123 AND Indexed_Number_Column < 124) OR 
      (Indexed_Number_Column >= 1230 AND Indexed_Number_Column < 1240) OR 
      (Indexed_Number_Column >= 12300 AND Indexed_Number_Column < 12400) OR...
    • Indexed_Char_Col LIKE '%ABC%' does not establish the start and end of a legal range, because the leading wildcard implies this pattern might be matched anywhere in the entire range of the index.
    • Equality (=), BETWEEN, and most inequality (<, <=, >, >=) conditions on first columns of indexes establish legitimate index range endpoints.
    • The not-equal-to inequality condition, usually expressed as != or <>, does not establish an index range, because the database assumes this condition to be too unselective to be worth indexed access. If the excluded value covers almost all the rows, with other values being rare, you can usefully enable indexed access by replacing the negative condition Column!='DominantValue' with Column IN (<list of all other values, all of which are rare>), although it could prove inconvenient to keep this list up-to-date as the application evolves.
    • Series of conditions combined by OR or by an IN list can lead to a series of range scans, but only if every such condition points to a legal range. For example, IDColumn IN (123,124,125) yields three legal equality conditions that deliver three range scans, and (Name='Smith' OR Name='Jones') yields a pair of range scans, but (Name='Smith' OR Name IS NULL) does not enable use of an index (except on DB2), since IS NULL does not mark a legal range.
    If you have conditions that do not specify a restricted range on at least the first column of an index, the only way to use an index at all is to perform a full index scan, a scan of every index entry, across all leaf blocks. Databases will not usually choose full index scans, because to read an entire index is expensive and full index scans usually end up pointing to much of the table as well, with a net cost greater than the competing full table scan. You can force full index scans (often without meaning to) by adding a query hint (more on these later) that instructs the database to use an index even though it would not choose to use that index on its own. More often than not, adding such a hint is a mistake, a desperate attempt to get something better than a full table scan. Usually, this happens when a developer values an index too much, actually leading to a worse plan than the plan the database would choose without help.
    Even in cases for which a full index scan is better than the competing full table scan, it is almost surely worse than a third alternative: driving from a different (sometimes new) index, or rewriting a condition that applies a function to the indexed column so that it permits a narrow range scan.
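    Returning to the first rule in the list above, the column-order effect looks like this in a concrete sketch (index names hypothetical):
    CREATE INDEX Orders_Date_First ON Orders (Date_Column, ID_Number);
    CREATE INDEX Orders_ID_First ON Orders (ID_Number, Date_Column);
    
    SELECT ...
      FROM Orders
     WHERE Date_Column >= TO_DATE('2003/01/01', 'YYYY/MM/DD')
       AND ID_Number = 137;
    Against Orders_Date_First, only the date condition bounds the range scan, and the database merely discards entries failing ID_Number=137 as it walks the index; against Orders_ID_First, both conditions together define a far narrower range.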

    2.6.3 Selectivity on Table Rows Reached from the Index

    Since indexes are compact and usually far better cached than tables, the selectivity of an index range scan matters less than the efficiency of the table access it drives. Even when an index and table are perfectly cached, selectivity on the table matters more than selectivity on the index, because you read about 300 index rows in a single logical I/O to a leaf block, but you must perform a separate logical I/O for every table row. This brings us to the second part of evaluating the usefulness of an index: how narrowly will the index restrict access to the table? Fortunately, database implementations are smart enough to check all conditions as early in the execution plan as possible, before touching a table, when the index allows. This reduces table reads whenever an index includes columns required to evaluate a condition, even if the database cannot use that condition to determine the range-scan endpoints. For example, consider a Persons table with one index that includes the Area_Code, Phone_Number, Last_Name, and First_Name columns. Consider a query against this table with the conditions:
    WHERE Area_Code=916 AND UPPER(First_Name)='IVA'
    Only the first condition contributes to the endpoints of the index range scan. The second condition, on the fourth indexed column, fails to narrow that index range scan for three reasons:
    • The second index column, Phone_Number, has no equality condition specified (no condition at all, for that matter).
    • The third column, Last_Name, also has no equality condition specified.
    • The condition on First_Name is disabled as range-limiting by the UPPER function.
    Fortunately, none of these reasons prevents the database from checking the second condition before accessing the table. Because the index contains the First_Name column, the database can test conditions against that column using data from the index. In this case, the database uses the Area_Code condition to define an index range scan. Then, as it looks at each index entry in the range, the database discards entries with first names other than Iva. The likely result is just one or two table-row reads on this selective combination of conditions.
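    For concreteness, the index described above might be created like this (the index name is hypothetical):
    CREATE INDEX Persons_Phonebook
        ON Persons (Area_Code, Phone_Number, Last_Name, First_Name);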
    Let's examine this situation more closely to decide whether it would be worthwhile to create a better index for this combination of conditions. Both Area_Code and First_Name will see skewed distributions. You will query the common area codes and common first names more often than the uncommon ones, so follow the approach for skewed distributions and find the individual selectivity factors:
    SELECT SUM(COUNT(Area_Code)*COUNT(Area_Code))/
            (SUM(COUNT(Area_Code))*SUM(COUNT(*))) 
    FROM Persons GROUP BY Area_Code;
    
    SELECT SUM(COUNT(First_Name)*COUNT(First_Name))/
            (SUM(COUNT(First_Name))*SUM(COUNT(*))) 
    FROM Persons GROUP BY First_Name;
    Let's say that the first calculation yields a selectivity of 0.0086, and the second yields a selectivity of 0.018. Let's further assume that you have a 1,000,000-row table and the index references 200 rows per leaf block (fewer than usual, because this is an unusually wide index key). The condition on Area_Code alone determines the number of index blocks scanned, so first find the number of rows that condition references. The calculation is 0.0086 x 1,000,000 = 8,600, indicating that the database scans 43 (8,600/200) leaf blocks. (You can always neglect root and branch block reads in single-table queries.) These 43 leaf blocks are likely well cached and require about a millisecond to read.
    Databases can perform about 60,000 logical I/Os per second on typical CPUs. However, your mileage will vary considerably. Throughout this book, I include figures like 300 index row entries per leaf block and 60,000 logical I/Os per second because I have found them to be roughly correct and useful for an intuitive understanding of tuning priorities. Across a wide range of database and CPU vendors and a wide range of block and column sizes, these numbers can easily be off by a factor of four or more, and performance will no doubt improve greatly in the years after this book is published. Your intuitive understanding from knowing at least the order-of-magnitude ballpark for these numbers will still help, though, and I hope you find remembering a number easier than remembering a complex, conditional range.


    Going to the table with 8,600 reads could be a problem, but, fortunately, you pick up the additional selectivity of the condition on First_Name before reaching the table, reducing the row count to about 155 (8,600 x 0.018) rows. You will see about 155 logical I/Os to the table, with a few physical I/Os, since you will see a less favorable hit ratio on the table, compared to the more compact index.
    The multicolumn filter selectivity, 0.0086 x 0.018 = 0.000155, is well into the range that favors indexed access, even though the individual selectivity of the first column alone is in the gray area. Notice that even if you modify the query or the data model to pick up the full selectivity in the index range conditions, the cost will go down only marginally, eliminating most of the 43 well-cached index leaf-block reads and speeding the query by about a millisecond. This will have no effect on the 155 more expensive table-block reads. (You could modify the application to do a case-sensitive match on the first name, and create a new index to cover just these two columns. Alternatively, you could modify the table to store the uppercase first names and index the new column together with area code.) Therefore, even though this query and index are not perfectly suited to one another, the index is good enough to use. It is not worthwhile to add a new index and change the query for the small potential gain.

    2.6.4 Combining Indexes

    Occasionally, the database finds equality-condition filters that point to different single-column indexes. By combining index range scans on these indexes, the database can pick up the multicolumn filter selectivity before reaching the table. Let's reuse the earlier example of a query that specifies area code and customer first name, but replace the multicolumn index with single-column indexes on the two columns. Further, replace the condition on UPPER(First_Name) with a simple match on the column, assuming the column already stores uppercase values. Typical conditions are then:
    WHERE Area_Code=415 AND First_Name='BRUCE';
    You can now assume that you have closer to the typical 300 rows or more per index leaf block, since these are single-column indexes on short columns. Using the same single-column filter selectivity factors and assuming as before that the table holds 1,000,000 rows, you can again predict that the Area_Code condition will yield 8,600 (0.0086 x 1,000,000) rows, requiring 29 (8,600/300) index leaf blocks. The second index range scan, for First_Name, will yield 18,000 (0.018 x 1,000,000) rows, requiring 60 (18,000/300) index leaf blocks. Since these are both equality conditions, the database finds the two rowid lists for these two conditions presorted, and it can easily merge these two sorted lists to find the same 155 table rowids found with the earlier multicolumn index.
    The operation just described, merging lists of rowids from two indexes, is called the index AND-EQUAL MERGE operation, and the database can do this between any number of equality conditions on single-column indexes. The cost of the table access is the same as finding the rows with one multicolumn index, but you must read more index leaf blocks—in this case, 89 (29+60) instead of the earlier example of 43.
    As you can see, the AND-EQUAL MERGE is more expensive than a multicolumn index, but it might be better than using just one single-column index. But this case, in which the AND-EQUAL MERGE is reasonably attractive, is rare. Almost always, the most selective single-column condition is fine by itself and the cost of the less selective index range scan exceeds the savings in table access, or the added cost of the AND-EQUAL MERGE justifies creating a multicolumn index. When you see an AND-EQUAL MERGE in an execution plan, you should almost always either suppress use of the less selective index or create a multicolumn index to use instead. The new multicolumn index should start with the more selective column—in this case, Area_Code—and should usually take the place of any index on that column alone. Such an index sometimes enables you to drop the index on the other column as well.
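    A sketch of the index change this advice implies, with hypothetical index names on the Persons table from the earlier example:
    -- Before: two single-column indexes that invite an AND-EQUAL MERGE:
    CREATE INDEX Persons_Area_Code ON Persons (Area_Code);
    CREATE INDEX Persons_First_Name ON Persons (First_Name);
    
    -- After: one multicolumn index, leading with the more selective
    -- column and replacing the index on that column alone:
    DROP INDEX Persons_Area_Code;
    CREATE INDEX Persons_Area_First ON Persons (Area_Code, First_Name);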


    2.7 Joins

    Single-table queries quickly pale as interesting tuning problems. Choices are few enough that even trial and error will quickly lead you to the best execution plan. Multitable queries yield much more interesting problems. To tune multitable queries, you need to understand the different join types and the tradeoffs between the different join execution methods.

    2.7.1 Join Types

    Let's begin by trying to understand clearly what a multitable query means. First, consider how databases interpret joins, the operations that combine rows from multiple tables into the result a query demands. Let's begin with the simplest imaginable multitable query:
    SELECT ... FROM Orders, Order_Details;
    With no WHERE clause at all, the database has no instructions on how to combine rows from these two large tables, and it does the logically simplest thing: it returns every possible combination of rows from the tables. If you had 1,000,000 orders and 5,000,000 order details, you would get (if you could wait long enough) 5,000,000,000,000 rows back from the query! This is the rarely used and even more rarely useful Cartesian join. The result, every combination of elements from two or more sets, is known as the Cartesian product. From a business perspective, you would have no interest in combining data from orders and order details that had no relationship to each other. When you find Cartesian joins, they are almost always a mistake.
    The most common, but still very rare, exception to this rule is when one of the tables returns only a single row. In that case, you can view a Cartesian join query as a more sensible combination of results from a single-row query appended, for convenience, to results of a logically independent multirow query.


    2.7.1.1 Inner joins
    Any given order-processing application would surely need details pertaining to given orders, so you aren't too likely to see a Cartesian join. Instead, you would more likely see a join condition that tells the database how the tables relate:
    SELECT ... FROM Orders O, Order_Details D WHERE O.Order_ID=D.Order_ID;
    Or, shown in the newer-style notation:
    SELECT ... FROM Orders O 
                  INNER JOIN Order_Details D ON O.Order_ID=D.Order_ID;
    Logically, you can still think of this as a Cartesian product with a filtered result: "Give me all the combinations of orders and order details, but discard those combinations that fail to share the same order ID." Such a join is referred to as an inner join. Even in the worst cases, databases virtually always come up with a better way to find the requested rows than this brute-force method of Cartesian product first, followed by discards based on unmet join conditions. This is fortunate indeed, since a many-way Cartesian join between several large tables could take years or even eons to complete. Most joins, like the one shown, consist of an equality between a foreign key in some detail table matched to a primary (unique) key in some master table, but any condition at all that mentions two or (rarely) more tables is a join condition.
    From a procedural perspective, the database has several choices for how best to join the tables in the preceding queries:
    • Start with the master table and find the matching details.
    • Start with the detail table and look up the matching master rows.
    • Go to both sets of rows independently (but not with a Cartesian product) and somehow merge those rows that match up on the joined column values.
    While they yield identical functional results, these alternatives often yield radically different performance, so this book will explain at length how to choose the best alternative.
    2.7.1.2 Outer joins
    A common alternative to the inner join is the outer join. An outer join is easiest to describe in procedural terms: start with rows from a driving table and find matching rows, where possible, from an outer-joined table, but create an artificial all-nulls matching row from the outer-joined table whenever the database finds no physical matching row. For example, consider the old-style Oracle query:
    SELECT ..., D.Department_Name FROM Employees E, Departments D
           WHERE E.Department_ID=D.Department_ID(+);
    In the newer style notation, on any database, you query the same result with:
    SELECT ..., D.Department_Name 
           FROM Employees E 
                LEFT OUTER JOIN Departments D 
                                ON E.Department_ID=D.Department_ID;
    These queries return data about all employees, including data about their departments. However, when employees have no matching department, the query returns the employee data anyway, filling in inapplicable selected fields, such as D.Department_Name, with nulls in the returned result. From the perspective of tuning, the main implication of an outer join is that it eliminates a path to the data that drives from the outer-joined table to the other table. For the example, the database cannot start from the Departments table and find the matching rows in the Employees table, since the database needs data for all employees, not just the ones with matching departments. Later, I will show that this restriction on outer joins, limiting the join order, is not normally important, because you would not normally want to join in the forbidden order.

    2.7.2 Join Execution Methods

    The join types determine which results a query requires but do not specify how a database should execute those queries. When tuning SQL, you usually take as a given which results a query requires, but you should control the execution method to force good performance. To choose execution methods well, you need to understand how they work.
    2.7.2.1 Nested-loops joins
    The simplest method of performing an efficient join between two or more tables is the nested-loops join, illustrated in Figure 2-7.
    Figure 2-7. Nested-loops joins


    The query execution begins with what amounts to a single-table query of the driving table (the first table the database reads), using only the conditions that refer solely to that table. Think of the leftmost box in Figure 2-7, with the attached crank, as a machine to perform this single-table query. It separates uninteresting rows (destined for the wastebasket at lower left) from interesting rows (rows that satisfy the single-table query conditions) from the driving table. Since the query is a join, the database does not stop here. Instead, it passes the result rows from that first box to the next box. The job of the second box is to take rows from the first box, one at a time, find matching rows from the first joined-to table, then again discard rows that do not meet query conditions on tables so far touched, while passing on rows that meet these conditions. The database usually performs this matching step in a nested-loops join by indexed lookups on the join key, repeated for each row from the driving table.
    When the join is an inner join, the database discards rows from the earlier step that fail to find matches in the joined-to table. When the join is an outer join, the database fills in joined-to table values with nulls when it fails to find a match, retaining all rows from the earlier step. This process continues with further boxes to join each of the rest of the tables in the same way until the query is finished, leaving a fully joined result that satisfies all query conditions and joins.
    Internally, the database implements this execution plan as a nested series of loops—the outermost loop reads rows off the driving table, the next loop finds matching rows in the first joined table, and so on—which is why these are called nested-loops joins. Each point in the process needs to know only where it is at that moment and the contents of the single result row it is building, so this process requires little memory to execute and no scratch space on disk. This is one of the reasons that nested-loops plans are robust: they can deliver huge result sets from huge tables without ever running out of memory or disk scratch space, as long as you can wait long enough for the result. As long as you choose the right join order, nested loops work well for most sensible business queries. They either deliver the best performance of all join methods, or they are close enough to the best performance that the robustness advantage of this join method weighs in its favor, in spite of a minor performance disadvantage.
    When speaking of robustness, I speak only of the join method. Independent of the join, the query might call for a large sorted result (for example, with an ORDER BY clause) that requires a large combination of memory and disk scratch space, regardless of the join method.
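    On Oracle, for example, you can request a nested-loops plan explicitly with hints; here is a minimal sketch using Oracle's ORDERED and USE_NL hints (and this book's convention of eliding the select list):
    SELECT /*+ ORDERED USE_NL(D) */ ...
      FROM Orders O, Order_Details D
     WHERE O.Order_ID=D.Order_ID;
    ORDERED asks the optimizer to join in FROM-clause order, making Orders the driving table, and USE_NL(D) asks it to reach Order_Details by nested loops.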


    2.7.2.2 Hash joins
    Sometimes, the database should access the joined tables independently and then match rows where it can and discard unmatchable rows. The database has two ways to do this: hash joins, which I discuss in this section, and sort-merge joins, which I discuss next. Figure 2-8 illustrates a hash join.
    Figure 2-8. A hash join


    In Figure 2-8, each of the top two boxes with cranks acts like an independently optimized single-table query. Based on table and index statistics, the cost-based optimizer[6] estimates which of these two independent tables will return fewer rows after discarding filtered rows. It chooses to hash the complete result from that single-table query. In other words, it performs some randomizing mathematical function on the join key and uses the result of that function to assign each row to a hash bucket. The ideal hash algorithm spreads the rows fairly evenly between a number of buckets approximating the number of rows.
    [6] Oracle's rule-based optimizer will never choose a hash join, but its cost-based optimizer often will. The other database vendors do not offer rule-based optimizers.
    Since the hash buckets are based on the smaller result set, you can hope those hash buckets fit entirely in memory, but, if necessary, the database temporarily dedicates scratch disk space to hold the buckets. It then executes the larger query (as shown in the upper-right box in Figure 2-8), returning the driving rowset. As each row exits this step, the database executes the same hash function on its join key and uses the hash-function result to go directly to the corresponding hash bucket for the other rowset. When it reaches the right hash bucket, the database searches the tiny list of rows in that bucket to find matches.
    Note that there may not be any matches. When the database finds a match (illustrated in the lower-middle box with a crank), it returns the result immediately or sends the result on to the next join, if any. When the database fails to find a match, the database discards that driving row.
    For the driving rowset, a hash join looks much like a nested-loops join; both require that the database hold just the current row in memory as it makes a single pass through the rowset. For the smaller, prehashed rowset, however, the hash join approach is less robust if the prehashed rowset turns out to be much larger than expected. A large prehashed rowset could require unexpected disk scratch space, performing poorly and possibly even running out of space. When you are fully confident that the smaller rowset is and always will be small enough to fit in memory, you should sometimes favor the hash join over a nested-loops join.
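    Again on Oracle, a hash join can be requested in the same style; a minimal sketch with the USE_HASH hint:
    SELECT /*+ ORDERED USE_HASH(D) */ ...
      FROM Orders O, Order_Details D
     WHERE O.Order_ID=D.Order_ID;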
    2.7.2.3 Sort-merge joins
    The sort-merge join, shown in Figure 2-9, also reads the two tables independently, but, instead of matching rows by hashing, it presorts both rowsets on the join key and merges the sorted lists. In the simplest implementation, you can imagine listing the rowsets side-by-side, sorted by the join keys. The database alternately walks down the two lists, in a single pass down each list, comparing top rows, discarding rows that fall earlier in the sort order than the top of the other list, and returning matching rows. Once the two lists are sorted, the matching process is fast, but presorting both lists is expensive and nonrobust, unless you can guarantee that both rowsets are small enough to fit well within memory.
    Figure 2-9. A sort-merge join

    When you compare a sort-merge join with a nested-loops alternative, the sort-merge join has roughly the same potential advantages that hash joins sometimes have. However, when you compare a sort-merge join with a hash join, the hash join has all the advantages: it avoids placing the larger rowset into memory or scratch disk space, with essentially no downside cost. Therefore, when you have a choice between a sort-merge join and a hash join, never choose a sort-merge join.
    2.7.2.4 Join methods summary
    In summary, you should usually choose nested loops for both performance and robustness. You will occasionally benefit by choosing hash joins, when independent access of the two tables saves significantly on performance while carrying a small enough robustness risk. Never choose sort-merge joins, unless you would prefer a hash join but find that method unavailable (for example, as with Oracle's rule-based optimizer). In later chapters, I will treat nested-loops joins as the default choice. The rules that cover when you can benefit by overriding that default are complex and depend on the material in the later chapters. The main purpose of this section was to explain the internal database operations behind that later discussion.



