4.1 Mapping Collections
In any real application you'll be managing lists and groups of things. Java provides a healthy and useful set of library classes to help with this: the Collections utilities. Hibernate provides natural ways of mapping database relationships onto Collections, and these are usually very convenient to work with. You do need to be aware of a couple of semantic mismatches, though they are generally minor. The biggest is that the Collections framework doesn't provide 'bag' semantics, which might frustrate some experienced database designers. This gap isn't Hibernate's fault, and Hibernate even makes some effort to work around it.
NOTE
Bags are like sets, except that the same value can appear more than once.
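To see the mismatch in plain Java terms (nothing Hibernate-specific here, just the standard Collections classes), note that a Set silently collapses duplicates, while the only standard collection that keeps them, a List, also imposes an ordering; there is no unordered, duplicate-friendly 'bag' in the library. A minimal sketch:

import java.util.*;

public class BagDemo {
    public static void main(String[] args) {
        Collection list = new ArrayList();   // ordered, duplicates allowed
        Collection set = new HashSet();      // unordered, duplicates collapsed
        list.add("Ferry Corsten");
        list.add("Ferry Corsten");
        set.add("Ferry Corsten");
        set.add("Ferry Corsten");
        System.out.println(list.size());     // prints 2
        System.out.println(set.size());      // prints 1
    }
}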
NOTE
As usual, the examples assume you followed the steps in the previous chapters. If not, download the example source as a starting point.
The information we need to keep track of for artists is, at least initially, pretty simple. We'll start with just the artist's name. And each track can be assigned a set of artists, so we know whom to thank or blame for the music, and can look up all the tracks by an artist we like. (It really is critical to allow more than one artist to be assigned to a track, yet surprisingly few music management programs get this right. The task of adding a separate link to keep track of composers is left as a useful exercise for the reader after understanding this example.)
Example 4-1. Mapping document for the Artist class
1 <?xml version="1.0"?>
2 <!DOCTYPE hibernate-mapping PUBLIC "-//Hibernate/Hibernate Mapping DTD 2.0//EN"
3 "http://hibernate.sourceforge.net/hibernate-mapping-2.0.dtd">
4
5 <hibernate-mapping>
6
7 <class name="com.oreilly.hh.Artist" table="ARTIST">
8 <meta attribute="class-description">
9 Represents an artist who is associated with a track or album.
10 @author Jim Elliott (with help from Hibernate)
11 </meta>
12
13 <id name="id" type="int" column="ARTIST_ID">
14 <meta attribute="scope-set">protected</meta>
15 <generator class="native"/>
16 </id>
17
18 <property name="name" type="string">
19 <meta attribute="use-in-tostring">true</meta>
20 <column name="NAME" not-null="true" unique="true" index="ARTIST_NAME"/>
21 </property>
22
23 <set name="tracks" table="TRACK_ARTISTS" inverse="true">
24 <meta attribute="field-description">Tracks by this artist</meta>
25 <key column="ARTIST_ID"/>
26 <many-to-many class="com.oreilly.hh.Track" column="TRACK_ID"/>
27 </set>
28
29 </class>
30
31 </hibernate-mapping>
Our mapping for the name property on lines 18-21 introduces a couple of refinements to both the code generation and schema generation phases. The use-in-tostring meta tag causes the generated class to show the artist's name as well as the cryptic synthetic ID when it is printed, as an aid for debugging (you can see the result near the bottom of Example 4-3). And expanding the column attribute into a full-blown tag allows us finer-grained control over the nature of the column, which we use in this case to add an index for efficient lookup and sorting by name.
Notice that we can represent the fact that an artist is associated with one or more tracks quite naturally in this file (lines 23-27). This tells Hibernate to add a property named tracks to our Artist class, whose type is an implementation of java.util.Set. This will use a new table named TRACK_ARTISTS to link to the Track objects for which this Artist is responsible. The attribute inverse="true" is explained in the discussion of Example 4-2, where the bidirectional nature of this association is examined.
The TRACK_ARTISTS table we just called into existence will contain two columns: TRACK_ID and ARTIST_ID. Any rows appearing in this table will mean that the specified Artist object has something to do with the specified Track object. The fact that this information lives in its own table means that there is no restriction on how many tracks can be linked to a particular artist, nor how many artists are associated with a track. That's what is meant by a 'many-to-many' association. [4.1]
[4.1] If concepts like join tables and many-to-many associations aren't familiar, spending some time with a good data modeling introduction would be worthwhile. It will help a lot when it comes to designing, understanding, and talking about data-driven projects. George Reese's Java Database Best Practices (O'Reilly) has one, and you can even view this chapter online at www.oreilly.com/catalog/javadtabp/chapter/ch02.pdf.
On the flip side, since these links are in a separate table you have to perform a join query in order to retrieve any meaningful information about either the artists or the tracks. This is why such tables are often called 'join tables.' Their whole purpose is to be used to join other tables together.
Finally, notice that unlike the other tables we've set up in our schema, TRACK_ARTISTS does not correspond to any mapped Java object. It is used only to implement the links between Artist and Track objects, as reflected by Artist's tracks property.
As seen on line 24, the field-description meta tag can be used to provide JavaDoc descriptions for collections and associations as well as plain old value fields. This is handy in situations where the field name isn't completely self-documenting.
The tweaks and configuration choices provided by the mapping document, especially when aided by meta tags, give you a great deal of flexibility over how the source code and database schema are built. Nothing can quite compare to the control you can obtain by writing them yourself, but most common needs and scenarios appear to be within reach of the mapping-driven generation tools. This is great news, because they can save you a lot of tedious typing!
With that in place, let's add the collection of Artists to our Track class. Edit Track.hbm.xml to include the new artists property as shown in Example 4-2 (the new content is shown in bold).
Example 4-2. Adding an artist collection to the Track mapping file
...
<property name="playTime" type="time">
<meta attribute="field-description">Playing time</meta>
</property>
<set name="artists" table="TRACK_ARTISTS">
<key column="TRACK_ID"/>
<many-to-many class="com.oreilly.hh.Artist" column="ARTIST_ID"/>
</set>
<property name="added" type="date">
<meta attribute="field-description">When the track was created</meta>
</property>
...
This adds a similar Set property named artists to the Track class. It uses the same TRACK_ARTISTS join table introduced in Example 4-1 to link to the Artist objects we mapped there. This sort of bidirectional association is very useful, but it's important to let Hibernate know explicitly what's going on by marking one end of the association as 'inverse.' In the case of a many-to-many association like this one, the choice of which side to call the inverse mapping isn't crucial. The fact that the join table is named TRACK_ARTISTS makes the link from artists back to tracks the more natural choice for the inverse end, if only from the perspective of people trying to understand the database. Hibernate itself doesn't care, as long as we mark one of the directions as inverse. That's why we did so on line 23 of Example 4-1.
While we're updating the Track mapping document we might as well beef up the title property along the lines of what we did for name in Artist:
<property name="title" type="string">
<meta attribute="use-in-tostring">true</meta>
<column name="TITLE" not-null="true" index="TRACK_TITLE"/>
</property>
With the new and updated mapping files in place, we're ready to rerun ant codegen to update the Track source code, and create the new Artist source. This time Hibernate reports that two files are processed, as expected. If you look at Track.java you'll see the new Set-valued property artists has been added, and toString() has been enhanced. Example 4-3 shows the content of the new Artist.java.
Example 4-3. Code generated for the Artist class
package com.oreilly.hh;

import java.io.Serializable;
import java.util.Set;
import org.apache.commons.lang.builder.EqualsBuilder;
import org.apache.commons.lang.builder.HashCodeBuilder;
import org.apache.commons.lang.builder.ToStringBuilder;

/**
 * Represents an artist who is associated with a track or album.
 * @author Jim Elliott (with help from Hibernate)
 *
 */
public class Artist implements Serializable {

    /** identifier field */
    private Integer id;

    /** nullable persistent field */
    private String name;

    /** persistent field */
    private Set tracks;

    /** full constructor */
    public Artist(String name, Set tracks) {
        this.name = name;
        this.tracks = tracks;
    }

    /** default constructor */
    public Artist() {
    }

    /** minimal constructor */
    public Artist(Set tracks) {
        this.tracks = tracks;
    }

    public Integer getId() {
        return this.id;
    }

    protected void setId(Integer id) {
        this.id = id;
    }

    public String getName() {
        return this.name;
    }

    public void setName(String name) {
        this.name = name;
    }

    /**
     * Tracks by this artist
     */
    public Set getTracks() {
        return this.tracks;
    }

    public void setTracks(Set tracks) {
        this.tracks = tracks;
    }

    public String toString() {
        return new ToStringBuilder(this)
            .append("id", getId())
            .append("name", getName())
            .toString();
    }

    public boolean equals(Object other) {
        if ( !(other instanceof Artist) ) return false;
        Artist castOther = (Artist) other;
        return new EqualsBuilder()
            .append(this.getId(), castOther.getId())
            .isEquals();
    }

    public int hashCode() {
        return new HashCodeBuilder()
            .append(getId())
            .toHashCode();
    }

}
Once the classes are created, we can use ant schema to build the new database schema that supports them.
Of course you should watch for error messages when generating your source code and building your schema, in case there are any syntax or conceptual errors in the mapping document. Not all exceptions that show up are signs of real problems you need to address, though. In experimenting with evolving this schema, I ran into some stack traces because Hibernate tried to drop foreign key constraints that hadn't been set up by previous runs. The schema generation continued past them, scary as they looked, and worked correctly. This may be something that will improve in future versions (of Hibernate or HSQLDB), or it may just be a wart we learn to live with.
The generated schema contains the tables we'd expect, along with indices and some clever foreign key constraints. As our object model gets more sophisticated, the amount of work (and expertise) being provided by Hibernate is growing nicely. The full output from the schema generation is rather long, but Example 4-4 shows highlights.
Example 4-4. Excerpts from our new schema generation
[schemaexport] create table TRACK_ARTISTS (
[schemaexport] ARTIST_ID INTEGER not null,
[schemaexport] TRACK_ID INTEGER not null,
[schemaexport] primary key (TRACK_ID, ARTIST_ID)
[schemaexport] )
...
[schemaexport] create table ARTIST (
[schemaexport] ARTIST_ID INTEGER NOT NULL IDENTITY,
[schemaexport] name VARCHAR(255) not null,
[schemaexport] unique (name)
[schemaexport] )
...
[schemaexport] create table TRACK (
[schemaexport] TRACK_ID INTEGER NOT NULL IDENTITY,
[schemaexport] title VARCHAR(255) not null,
[schemaexport] filePath VARCHAR(255) not null,
[schemaexport] playTime TIME,
[schemaexport] added DATE,
[schemaexport] volume SMALLINT
[schemaexport] )
...
[schemaexport] alter table TRACK_ARTISTS add constraint FK72EFDAD84C5F92B foreign
key (TRACK_ID) references TRACK
[schemaexport] alter table TRACK_ARTISTS add constraint FK72EFDAD87395D347
foreign key (ARTIST_ID) references ARTIST
[schemaexport] create index ARTIST_NAME on ARTIST (name)
[schemaexport] create index TRACK_TITLE on TRACK (title)
NOTE
Cool! I didn't even know how to do some of that stuff in HSQLDB!
Figure 4-1 shows HSQLDB's tree view representation of the schema after these additions. I'm not sure why two separate indices are used to set up the uniqueness constraint on artist names, but that seems to be an implementation quirk in HSQLDB itself, and this approach will work just fine.
Figure 4-1. The HSQLDB graphical tree view of our updated schema
4.2 Persisting Collections
Our first task is to beef up the CreateTest class to take advantage of the new richness in our schema, creating some artists and associating them with tracks.
Example 4-5. Utility methods to help find and create artists, and to link them to tracks
1 package com.oreilly.hh;
2
3 import net.sf.hibernate.*;
4
5 import net.sf.hibernate.cfg.Configuration;
6
7 import java.sql.Time;
8 import java.util.*;
9
10 /**
11 * Create more sample data, letting Hibernate persist it for us.
12 */
13 public class CreateTest {
14
15 /**
16 * Look up an artist record given a name.
17 * @param name the name of the artist desired.
18 * @param create controls whether a new record should be created if
19 * the specified artist is not yet in the database.
20 * @param session the Hibernate session that can retrieve data
21 * @return the artist with the specified name, or <code>null</code> if no
22 * such artist exists and <code>create</code> is <code>false</code>.
23 * @throws HibernateException if there is a problem.
24 */
25 public static Artist getArtist(String name, boolean create,
26 Session session)
27 throws HibernateException
28 {
29 Query query = session.getNamedQuery(
30 "com.oreilly.hh.artistByName");
31 query.setString("name", name);
32 Artist found = (Artist)query.uniqueResult();
33 if (found == null && create) {
34 found = new Artist(name, new HashSet());
35 session.save(found);
36 }
37 return found;
38 }
39
40 /**
41 * Utility method to associate an artist with a track
42 */
43 private static void addTrackArtist(Track track, Artist artist) {
44 track.getArtists().add(artist);
45 }
As is so often the case when working with Hibernate, this code is pretty simple and self-explanatory. (Do notice that line 8 has changed—we used to import java.util.Date, but we're now importing the whole util package to work with Collections. The '*' is bold to highlight this, but it's easy to miss when scanning the example.)
We'll want to reuse the same artists if we create multiple tracks for them—that's the whole point of using an Artist object rather than just storing strings—so our getArtist() method (starting at line 15) does the work of looking them up by name. The uniqueResult() method it uses on line 32 is a convenience feature of the Query interface, perfect in situations like this where we know we'll get either one result or none. It saves us the trouble of getting back a list of results, checking its length, and extracting the first result if it's there. We get back either the single result or null if there were none. (Hibernate would throw an exception if there were more than one result, but the unique constraint on the name column prevents that.)
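For comparison, here's roughly the boilerplate that uniqueResult() replaces; this is just a sketch of the equivalent hand-written logic, not code you need to add to getArtist():

// Rough manual equivalent of query.uniqueResult() (sketch only):
List results = query.list();
Artist found = null;
if (!results.isEmpty()) {
    // We rely on the unique constraint to guarantee at most one match
    found = (Artist) results.get(0);
}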
So all we need to do is check for null on line 33, and create a new Artist (lines 34-35) if we didn't find one and we're supposed to.
If we left out the session.save() call, our artists would remain transient. (Itinerant painters? Sorry.) Hibernate is helpful enough to throw an exception if we try to commit our transaction in this state, by detecting references from persistent Track instances to transient Artist instances.
The addTrackArtist() method (lines 40-45) is almost embarrassingly simple. It's just ordinary Java Collections code that grabs the Set of artists belonging to a Track and adds the specified Artist to it. Can that really do everything we need? Where's all the database manipulation code we normally have to write? Welcome to the wonderful world of object/relational mapping tools!
You might have noticed that getArtist() uses a named query to retrieve the Artist record. In Example 4-6, we will add that at the end of Artist.hbm.xml (actually, we could put it in any mapping file, but this is the most sensible place, since it relates to Artist records).
Example 4-6. Artist lookup query to be added to the artist mapping document
<query name="com.oreilly.hh.artistByName">
<![CDATA[
from com.oreilly.hh.Artist as artist
where upper(artist.name) = upper(:name)
]]>
</query>
We use the upper() function to perform a case-insensitive comparison of artists' names, so that we retrieve the artist even if the capitalization used during lookup differs from what's stored in the database. This sort of case-insensitive but case-preserving approach, a user-friendly concession to the way humans like to work, is worth implementing whenever possible. Databases other than HSQLDB may have a different name for the function that converts strings to uppercase, but there should be one available.
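If you want to convince yourself that the lookup really is case-insensitive, a quick check along these lines (hypothetical test code, not part of the examples) should find the artist we stored as 'PPK' even though the capitalization doesn't match:

// Hypothetical sanity check: capitalization shouldn't matter during lookup
Artist artist = CreateTest.getArtist("ppk", false, session);
System.out.println((artist == null) ? "not found" : "found: " + artist.getName());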
Now we can use this infrastructure to actually create some tracks with linked artists. Example 4-7 shows the remainder of the CreateTest class with the additions marked in bold. Edit your copy to match (or download it to save the typing).
Example 4-7. Revisions to main method of CreateTest.java in order to add artist associations
1 public static void main(String args[]) throws Exception {
2 // Create a configuration based on the properties file we've put
3 // in the standard place.
4 Configuration config = new Configuration();
5
6 // Tell it about the classes we want mapped, taking advantage of
7 // the way we've named their mapping documents.
8 config.addClass(Track.class).addClass(Artist.class);
9
10 // Get the session factory we can use for persistence
11 SessionFactory sessionFactory = config.buildSessionFactory();
12
13 // Ask for a session using the JDBC information we've configured
14 Session session = sessionFactory.openSession();
15 Transaction tx = null;
16 try {
17 // Create some data and persist it
18 tx = session.beginTransaction();
19
20 Track track = new Track("Russian Trance",
21 "vol2/album610/track02.mp3",
22 Time.valueOf("00:03:30"), new Date(),
23 (short)0, new HashSet());
24 addTrackArtist(track, getArtist("PPK", true, session));
25 session.save(track);
26
27 track = new Track("Video Killed the Radio Star",
28 "vol2/album611/track12.mp3",
29 Time.valueOf("00:03:49"), new Date(),
30 (short)0, new HashSet());
31 addTrackArtist(track, getArtist("The Buggles", true, session));
32 session.save(track);
33
34
35 track = new Track("Gravity's Angel",
36 "vol2/album175/track03.mp3",
37 Time.valueOf("00:06:06"), new Date(),
38 (short)0, new HashSet());
39 addTrackArtist(track, getArtist("Laurie Anderson", true, session));
40 session.save(track);
41
42 track = new Track("Adagio for Strings (Ferry Corsten Remix)",
43 "vol2/album972/track01.mp3",
44 Time.valueOf("00:06:35"), new Date(),
45 (short)0, new HashSet());
46 addTrackArtist(track, getArtist("William Orbit", true, session));
47 addTrackArtist(track, getArtist("Ferry Corsten", true, session));
48 addTrackArtist(track, getArtist("Samuel Barber", true, session));
49 session.save(track);
50
51 track = new Track("Adagio for Strings (ATB Remix)",
52 "vol2/album972/track02.mp3",
53 Time.valueOf("00:07:39"), new Date(),
54 (short)0, new HashSet());
55 addTrackArtist(track, getArtist("William Orbit", true, session));
56 addTrackArtist(track, getArtist("ATB", true, session));
57 addTrackArtist(track, getArtist("Samuel Barber", true, session));
58 session.save(track);
59
60 track = new Track("The World '99",
61 "vol2/singles/pvw99.mp3",
62 Time.valueOf("00:07:05"), new Date(),
63 (short)0, new HashSet());
64 addTrackArtist(track, getArtist("Pulp Victim", true, session));
65 addTrackArtist(track, getArtist("Ferry Corsten", true, session));
66 session.save(track);
67
68 track = new Track("Test Tone 1",
69 "vol2/singles/test01.mp3",
70 Time.valueOf("00:00:10"), new Date(),
71 (short)0, new HashSet());
72 session.save(track);
73
74 // We're done; make our changes permanent
75 tx.commit();
76
77 } catch (Exception e) {
78 if (tx != null) {
79 // Something went wrong; discard all partial changes
80 tx.rollback();
81 }
82 throw e;
83 } finally {
84 // No matter what, close the session
85 session.close();
86 }
87
88 // Clean up after ourselves
89 sessionFactory.close();
90 }
91 }
The changes to the existing code are pretty minimal. First we need to map our new Artist class, which takes just one method call on line 8 (again, thanks to the naming convention we've been following to link our mapping documents to their classes). The lines that created the three tracks from Chapter 3 need only a single new parameter each, to supply an initially empty set of Artist associations (lines 23, 30, and 38). Each also gets a new follow-up line establishing an association to the artist for that track. We could have structured this code differently, by writing a helper method to create the initial HashSet containing the artist, so we could do this all in one line. The approach we actually used scales better to multi-artist tracks, as the next section illustrates.
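In case you're curious, the helper method just mentioned might look something like this (a sketch only; the artistSet() method is hypothetical and isn't part of the example source):

/**
 * Hypothetical helper: build a Set containing a single artist so that a
 * track and its lone artist link could be created in one statement.
 */
private static Set artistSet(String name, Session session)
    throws HibernateException
{
    Set artists = new HashSet();
    artists.add(getArtist(name, true, session));
    return artists;
}

A track could then be created with a call like new Track("Russian Trance", ..., artistSet("PPK", session)), but as soon as a track needs two or three artists the approach in Example 4-7 reads better.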
The largest chunk of new code, lines 42-66, simply adds three new tracks to show how multiple artists per track are handled. If you like electronica and dance remixes (or classical for that matter), you know how important an issue that can be. Because we set the links up as collections, it's simply a matter of adding each artist link to the tracks. Finally, lines 68-72 add a track with no artist associations to see how that behaves, too. Now you can run ant ctest to create the new sample data containing tracks, artists, and associations between them.
A useful trick if you're making changes to your test data creation program and you want to try it again starting from an empty database is to issue the command ant schema ctest. This tells Ant to run the schema and ctest targets one after the other. Running schema blows away any existing data; then ctest gets to create it anew.
NOTE
Of course, in real life you'd be getting this data into the database in some other way—through a user interface, or as part of the process of importing the actual music.
4.3 Retrieving Collections
Example 4-8. QueryTest.java enhanced in order to display artists associated with tracks
1 package com.oreilly.hh;
2
3 import net.sf.hibernate.*;
4 import net.sf.hibernate.cfg.Configuration;
5
6 import java.sql.Time;
7 import java.util.*;
8
9 /**
10 * Retrieve data as objects
11 */
12 public class QueryTest {
13
14 /**
15 * Retrieve any tracks that fit in the specified amount of time.
16 *
17 * @param length the maximum playing time for tracks to be returned.
18 * @param session the Hibernate session that can retrieve data.
19 * @return a list of {@link Track}s meeting the length restriction.
20 * @throws HibernateException if there is a problem.
21 */
22 public static List tracksNoLongerThan(Time length, Session session)
23 throws HibernateException
24 {
25 Query query = session.getNamedQuery(
26 "com.oreilly.hh.tracksNoLongerThan");
27 query.setTime("length", length);
28 return query.list();
29 }
30
31 /**
32 * Build a parenthetical, comma-separated list of artist names.
33 * @param artists the artists whose names are to be displayed.
34 * @return formatted list, or an empty string if the set was empty.
35 */
36 public static String listArtistNames(Set artists) {
37 StringBuffer result = new StringBuffer();
38 for (Iterator iter = artists.iterator(); iter.hasNext(); ) {
39 Artist artist = (Artist)iter.next();
40 result.append((result.length() == 0) ? "(" : ", ");
41 result.append(artist.getName());
42 }
43 if (result.length() > 0) {
44 result.append(") ");
45 }
46 return result.toString();
47 }
48
49 /**
50 * Look up and print some tracks when invoked from the command line.
51 */
52 public static void main(String args[]) throws Exception {
53 // Create a configuration based on the properties file we've put
54 // in the standard place.
55 Configuration config = new Configuration();
56
57 // Tell it about the classes we want mapped, taking advantage of
58 // the way we've named their mapping documents.
59 config.addClass(Track.class).addClass(Artist.class);
60
61 // Get the session factory we can use for persistence
62 SessionFactory sessionFactory = config.buildSessionFactory();
63
64 // Ask for a session using the JDBC information we've configured
65 Session session = sessionFactory.openSession();
66 try {
67 // Print the tracks that will fit in seven minutes
68 List tracks = tracksNoLongerThan(Time.valueOf("00:07:00"),
69 session);
70 for (ListIterator iter = tracks.listIterator() ;
71 iter.hasNext() ; ) {
72 Track aTrack = (Track)iter.next();
73 System.out.println("Track: \"" + aTrack.getTitle() + "\" " +
74 listArtistNames(aTrack.getArtists()) +
75 aTrack.getPlayTime());
76 }
77 } finally {
78 // No matter what, close the session
79 session.close();
80 }
81
82 // Clean up after ourselves
83 sessionFactory.close();
84 }
85 }
The first thing we add is a little utility method (lines 31-47) to format the set of artist names nicely, as a comma-delimited list inside parentheses, with proper spacing, or as nothing at all if the set of artists is empty.
NOTE
How easy was that?
Next, as with CreateTest, we need to tell Hibernate to map our new Artist class on line 59. Since all the interesting new multi-artist tracks are longer than five minutes, we increase the cutoff in our query to seven minutes so we can see some (line 68). Finally we call listArtistNames() at the proper position in the println() statement describing the tracks found (line 74).
Example 4-9 shows the new output from ant qtest.
Example 4-9. QueryTest output with artist information
% ant qtest
Buildfile: build.xml
prepare:
compile:
[javac] Compiling 1 source file to /Users/jim/Documents/Work/OReilly/
Hibernate/Examples/ch04/classes
qtest:
[java] Track: "Russian Trance" (PPK) 00:03:30
[java] Track: "Video Killed the Radio Star" (The Buggles) 00:03:49
[java] Track: "Gravity's Angel" (Laurie Anderson) 00:06:06
[java] Track: "Adagio for Strings (Ferry Corsten Remix)" (Ferry Corsten,
William Orbit, Samuel Barber) 00:06:35
[java] Track: "Test Tone 1" 00:00:10
BUILD SUCCESSFUL
Total time: 17 seconds
11.940u 1.510s 0:18.06 74.4% 0+0k 0+7io 0pf+0w
You'll notice two things. First, that this is much easier to interpret than the columns of numbers in Figure 4-2. And second, it worked! Even in the 'tricky' case of the test tone track without any artist mappings, Hibernate takes the friendly approach of creating an empty artists Set, sparing us from peppering our code with the null checks we'd otherwise need to avoid crashing with NullPointerExceptions.
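In other words, inside the loop of Example 4-8 a fragment like this would be safe even for the artist-less test tone track (a sketch; you don't need to add it):

// No null check needed: Hibernate hands back an empty Set, not null
Set artists = aTrack.getArtists();
System.out.println(artists.isEmpty());        // "true" for "Test Tone 1"
System.out.println(listArtistNames(artists)); // prints nothing (empty string)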
NOTE
But wait, there's more! No additional code needed...
4.4 Using Bidirectional Associations
In our creation code, we established links from tracks to artists simply by adding Java objects to the appropriate collections. Hibernate did the work of translating these associations and groupings into the necessary cryptic entries in a join table it created for that purpose, letting us establish and probe these relationships with easy, readable code. But remember that we made this association bidirectional—the Artist class has a collection of Track associations too. We didn't bother to store anything in there.
The great news is that we don't have to. Because of the fact that we marked this as an inverse mapping in the Artist mapping document, Hibernate understands that when we add an Artist association to a Track, we're implicitly adding that Track as an association to the Artist at the same time.
This convenience works only when you make changes to the 'primary' mapping, in which case they propagate to the inverse mapping. If you make changes only to the inverse mapping, in our case the Set of tracks in the Artist object, they will not be persisted. This unfortunately means your code must be sensitive to which mapping is the inverse.
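If your code also navigates from artists to tracks while the objects are still in memory (without reloading them from the database), it's worth keeping both collections consistent yourself. Here is a sketch of how addTrackArtist() could be extended to do that; only the Track side is required for persistence, as just described:

/**
 * Sketch: link a track and artist in both directions so the in-memory
 * objects stay consistent. Only the track.getArtists() side (the
 * non-inverse mapping) is what Hibernate actually persists.
 */
private static void addTrackArtist(Track track, Artist artist) {
    track.getArtists().add(artist);  // this side drives the join table
    artist.getTracks().add(track);   // keeps the inverse side consistent in memory
}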
Let's build a simple interactive graphical application that can help us check whether the artist to track links really show up. It will let you type in an artist's name, and show you all the tracks associated with that artist. A lot of the code is very similar to our first query test. Create the file QueryTest2.java and enter the code shown in Example 4-10.
Example 4-10. Source for QueryTest2.java
1 package com.oreilly.hh;
2
3 import net.sf.hibernate.*;
4 import net.sf.hibernate.cfg.Configuration;
5
6 import java.sql.Time;
7 import java.util.*;
8 import java.awt.*;
9 import java.awt.event.*;
10 import javax.swing.*;
11
12 /**
13 * Provide a user interface to enter artist names and see their tracks.
14 */
15 public class QueryTest2 extends JPanel {
16
17 JList list; // Will contain tracks associated with current artist
18 DefaultListModel model; // Lets us manipulate the list contents
19
20 /**
21 * Build the panel containing UI elements
22 */
23 public QueryTest2() {
24 setLayout(new BorderLayout());
25 model = new DefaultListModel();
26 list = new JList(model);
27 add(new JScrollPane(list), BorderLayout.SOUTH);
28
29 final JTextField artistField = new JTextField(30);
30 artistField.addKeyListener(new KeyAdapter() {
31 public void keyTyped(KeyEvent e) {
32 SwingUtilities.invokeLater(new Runnable() {
33 public void run() {
34 updateTracks(artistField.getText());
35 }
36 });
37 }
38 });
39 add(artistField, BorderLayout.EAST);
40 add(new JLabel("Artist: "), BorderLayout.WEST);
41 }
42
43 /**
44 * Update the list to contain the tracks associated with an artist
45 */
46 private void updateTracks(String name) {
47 model.removeAllElements(); // Clear out previous tracks
48 if (name.length() < 1) return; // Nothing to do
49 try {
50 // Ask for a session using the JDBC information we've configured
51 Session session = sessionFactory.openSession();
52 try {
53 Artist artist = CreateTest.getArtist(name, false, session);
54 if (artist == null) { // Unknown artist
55 model.addElement("Artist not found");
56 return;
57 }
58 // List the tracks associated with the artist
59 for (Iterator iter = artist.getTracks().iterator() ;
60 iter.hasNext() ; ) {
61 Track aTrack = (Track)iter.next();
62 model.addElement("Track: \"" + aTrack.getTitle() +
63 "\", " + aTrack.getPlayTime());
64 }
65 } finally {
66 // No matter what, close the session
67 session.close();
68 }
69 } catch (Exception e) {
70 System.err.println("Problem updating tracks:" + e);
71 e.printStackTrace();
72 }
73 }
74
75 private static SessionFactory sessionFactory; // Used to talk to Hibernate
76
77 /**
78 * Set up Hibernate, then build and display the user interface.
79 */
80 public static void main(String args[]) throws Exception {
81 // Load configuration properties, read mappings for persistent classes.
82 Configuration config = new Configuration();
83 config.addClass(Track.class).addClass(Artist.class);
84
85 // Get the session factory we can use for persistence
86 sessionFactory = config.buildSessionFactory();
87
88 // Set up the UI
89 JFrame frame = new JFrame("Artist Track Lookup");
90 frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
91 frame.setContentPane(new QueryTest2());
92 frame.setSize(400, 180);
93 frame.setVisible(true);
94 }
95 }
The bulk of the novel code in this example deals with setting up a Swing user interface. It's actually a rather primitive interface, and won't resize nicely, but dealing with such details would make the code larger, and really falls outside the scope of this book. If you want examples of how to build rich, quality Swing interfaces, check out our Java Swing, Second Edition (O'Reilly). It's much thicker so it has room for all that good stuff.
NOTE
Yes, this is a shameless plug.
The only item I want to highlight in the constructor is the KeyListener that gets added to artistField (lines 30-38). This rather tricky bit of code creates an anonymous class whose keyTyped() method is invoked whenever the user types in the artist text field. That method, lines 31-37, tries to update the track display by checking whether the field now contains a recognized artist name. Unfortunately, at the time the method gets invoked, the text field has not yet been updated to reflect the latest keystroke, so we're forced to defer the actual display update to a second anonymous class (the Runnable instance created on lines 32-36) via the invokeLater() method of SwingUtilities. This technique causes the update to happen when Swing 'gets around to it,' which in our case means the text field will have finished updating itself.
The updateTracks() method that gets called at that point is where the interesting Hibernate stuff happens. It starts by clearing the list on line 47, discarding any tracks it might have previously been displaying. If the artist name is empty, that's all it does. Otherwise, it opens a Hibernate session on line 51 and tries to look up the artist using the getArtist() method we wrote in CreateTest. This time we tell it not to create an artist if it can't find the one we asked for, so we'll get back a null if the user hasn't typed the name of a known artist. If that's the case, we just display a message to that effect (line 55).
If we do find an Artist record, on the other hand, line 59 iterates over any Track records found in the artist's set of associated tracks, and lines 61-63 display information about each one. All this will test whether the inverse association has worked the way we'd like it to. Finally (no pun intended), lines 65-68 make sure to close the session when we leave the method, even if we leave it by way of an exception. You don't want to leak sessions—that's a good way to bog down and crash your whole database environment.
The main() method starts out with the same Hibernate configuration steps we've seen before in lines 81-86, then creates and displays the user interface frame in lines 89-93. Line 90 sets the interface up to end the program when it's closed. After displaying the frame, main() returns. From that point on, the Swing event loop is in control.
Once you've created (or downloaded) this source file, you also need to add a new target, shown in Example 4-11, to the end of build.xml (the Ant build file) to invoke this new class.
Example 4-11. Ant target for running the new query test
<target name="qtest2" description="Run a simple Artist exploration GUI"
depends="compile">
<java classname="com.oreilly.hh.QueryTest2" fork="true">
<classpath refid="project.class.path"/>
</java>
</target>
NOTE
This is very similar to the existing 'qtest' target; copy and tweak that.
Now you can fire it up by typing ant qtest2 and play with it for yourself. Figure 4-3 shows the program in action, displaying tracks for one of the artists in our sample data.
Figure 4-3. A simple artist tracks browser
4.5 Working with Simple Collections
The collections we've been looking at so far have all contained associations to other objects, which is appropriate for a chapter titled 'Collections and Associations,' but isn't the only kind you can use with Hibernate. You can also define mappings for collections of simple values, like strings, numbers, and nonpersistent value classes.
4.5.1 How do I do that?
Suppose we want to be able to record some number of comments about each track in the database. We want a new property called comments to contain the String values of each associated comment. The new mapping in Track.hbm.xml looks a lot like what we did for artists, only a bit simpler:
<set name="comments" table="TRACK_COMMENTS">
<key column="TRACK_ID"/>
<element column="COMMENT" type="string"/>
</set>
Since we're able to store an arbitrary number of comments for each Track, we're going to need a new table to put them in. Each comment will be linked to the proper Track through the track's id property.
Rebuilding the databases with ant schema shows how this gets built in the database:
[schemaexport] create table TRACK_COMMENTS (
[schemaexport] TRACK_ID INTEGER not null,
[schemaexport] COMMENT VARCHAR(255)
[schemaexport] )
[schemaexport] alter table TRACK_COMMENTS add constraint FK105B26884C5F92B
foreign key (TRACK_ID) references TRACK
NOTE
Data modeling junkies will recognize this as a 'one-to-many' relationship.
After updating the Track class via ant codegen, we need to add another Set at the end of each constructor invocation in CreateTest.java, for the comments. For example:
track = new Track("Test Tone 1",
"vol2/singles/test01.mp3",
Time.valueOf("00:00:10"), new Date(),
(short)0, new HashSet(), new HashSet());
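The regenerated Track class presumably exposes the new collection through accessors along these lines (a sketch of what the code generator typically produces, not the literal generated file):

/** persistent field */
private Set comments;

public Set getComments() {
    return this.comments;
}

public void setComments(Set comments) {
    this.comments = comments;
}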
Then we can assign a comment on the following line:
track.getComments().add("Pink noise to test equalization");
A quick ant ctest will compile and run this (making sure you've not forgotten to add the second HashSet to any tracks), and you can check data/music.script to see how it's stored in the database. Or add another loop after the track println() in QueryTest.java to print the comments for the track that was just displayed:
for (Iterator comIter = aTrack.getComments().iterator() ;
comIter.hasNext() ; ) {
System.out.println(" Comment: " + comIter.next());
}
Then ant qtest will give you output like this:
...
[java] Track: "Test Tone 1" 00:00:10
[java] Comment: Pink noise to test equalization
A Look at HQL
HQL queries have already been used a few times in previous chapters. It's worth spending a little time looking at how HQL differs from SQL and some of the useful things you can do with it. As with the rest of this notebook, our intention is to provide a useful introduction and some examples, not a comprehensive reference.
Example 9-1. The simplest HQL query
from Track
There's not much to it, is there? This is the HQL equivalent of Example 8-1, in which we built a criteria query on the Track class and supplied no criteria.
By default, Hibernate automatically "imports" the names of each class you map, which means you don't have to provide a fully qualified package name when you want to use one; the simple class name is enough. As long as you've not mapped more than one class with the same name, you don't need to use fully qualified class names in your queries. You are certainly free to do so if you prefer—as I have in this book, to help readers remember that the queries are expressed in terms of Java data beans and their properties, not database tables and columns (as you'd find in SQL). Example 9-2 produces precisely the same result as our first query.
Example 9-2. Explicit package naming, but still pretty simple
from com.oreilly.hh.Track
If you do have more than one mapped class with the same name, you can either use the fully qualified package name to refer to each, or you can assign an alternate name for one or both classes using an import tag in their mapping documents. You can also turn off the auto-import facility for a mapping file by adding auto-import="false" to the hibernate-mapping tag's attributes.
You're probably used to queries being case-insensitive, since SQL behaves this way. For the most part, HQL acts the same, with the important exceptions of class and property names. Just as in the rest of Java, these are case-sensitive, so you must get the capitalization right.
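Concretely, that means the query strings you embed in Java are case-sensitive wherever they mention mapped classes and properties, even though the HQL keywords themselves aren't. A quick sketch (using Session.find() directly rather than a named query; the broken variants are shown only as comments):

// HQL keywords can be written in any case, but mapped names cannot:
List tracks = session.find("FROM Track");
// session.find("from track");      // fails: no mapped class named "track"
// session.find("from Track as t where t.playtime is not null");
//                                  // fails: the property is named playTime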
Let's look at an extreme example of how HQL differs from SQL, by pushing its polymorphic query capability to its logical limit.
9.1.1 How do I do that?
A powerful way to highlight the fundamental difference between SQL and HQL is to consider what happens when you query "from java.lang.Object". At first glance this might not even seem to make sense! In fact, Hibernate supports queries that return polymorphic results. If you've got mapped classes that extend each other, or have some shared ancestor or interface, whether you've mapped the classes to the same table or to different tables, you can query the superclass. Since every Java object extends Object, this query asks Hibernate to return every single entity it knows about in the database.
We can test this by making a quick variant of our query test. Copy QueryTest.java to QueryTest3.java, and make the changes shown in Example 9-3 (most of the changes, which don't show up in the example, involve deleting example queries we don't need here).
Example 9-3. Look, Toto, we're not in SQL anymore
1 package com.oreilly.hh;
2
3 import net.sf.hibernate.*;
4 import net.sf.hibernate.cfg.Configuration;
5
6 import java.util.*;
7
8 /**
9 * Retrieve all persistent objects
10 */
11 public class QueryTest3 {
12
13 /**
14 * Look up and print all entities when invoked from the command line.
15 */
16 public static void main(String args[]) throws Exception {
17 // Create a configuration based on the properties file we've put
18 // in the standard place.
19 Configuration config = new Configuration();
20
21 // Tell it about the classes we want mapped, taking advantage of
22 // the way we've named their mapping documents.
23 config.addClass(Track.class).addClass(Artist.class);
24 config.addClass(Album.class);
25
26 // Get the session factory we can use for persistence
27 SessionFactory sessionFactory = config.buildSessionFactory();
28
29 // Ask for a session using the JDBC information we've configured
30 Session session = sessionFactory.openSession();
31 try {
32 // Print every mapped object in the database
33 List all = session.find("from java.lang.Object");
34 for (ListIterator iter = all.listIterator() ;
35 iter.hasNext() ; ) {
36 System.out.println(iter.next());
37 }
38 } finally {
39 // No matter what, close the session
40 session.close();
41 }
42
43 // Clean up after ourselves
44 sessionFactory.close();
45 }
46 }
We added line 24 because Hibernate will only return objects whose mappings it has been asked to load, and we'd like to see everything we can put in the database. To be sure you've got all the sample data created, run ant schema ctest atest. Line 33 invokes the odd and powerful query, and line 36 prints what we've found. We can't do much more than call toString() on the objects we get back because we don't know what they might be. Object is not very deep as a shared interface.
There's one more step we need to take before we can run this. Add another target to build.xml, as shown in Example 9-4.
NOTE
As usual, these examples assume you've set up the environment by following the preceding chapters. You can also just use the downloadable sample source for this chapter.
Example 9-4. Invoking the über-query
<target name="qtest3" description="Retrieve all mapped objects"
depends="compile">
<java classname="com.oreilly.hh.QueryTest3" fork="true">
<classpath refid="project.class.path"/>
</java>
</target>
Then you can test it by running ant qtest3. You'll see results like Example 9-5.
Example 9-5. Everything Hibernate can find
% ant qtest3
Buildfile: build.xml
prepare:
compile:
[javac] Compiling 1 source file to /Users/jim/Documents/Work/OReilly/
Hibernate/Examples/ch09/classes
qtest3:
[java] com.oreilly.hh.Album@397cea[id=0,title=Counterfeit e.p.,tracks=[com.
oreilly.hh.AlbumTrack@5e60f2[track=com.oreilly.hh.
Track@aaa10[id=7,title=Compulsion,sourceMedia=Compact Disc]], com.oreilly.hh.
AlbumTrack@2ed1e8[track=com.oreilly.hh.Track@231e35[id=8,title=In a Manner of
Speaking,sourceMedia=Compact Disc]], com.oreilly.hh.AlbumTrack@d6f84a[track=com.
oreilly.hh.Track@9432e0[id=9,title=Smile in the Crowd,sourceMedia=Compact Disc]],
com.oreilly.hh.AlbumTrack@46ca65[track=com.oreilly.hh.
Track@9830bc[id=10,title=Gone,sourceMedia=Compact Disc]], com.oreilly.hh.
AlbumTrack@91e365[track=com.oreilly.hh.Track@a79447[id=11,title=Never Turn Your
Back on Mother Earth,sourceMedia=Compact Disc]], com.oreilly.hh.
AlbumTrack@e823ac[track=com.oreilly.hh.Track@f801c4[id=12,title=Motherless
Child,sourceMedia=Compact Disc]]]]
[java] com.oreilly.hh.Artist@e7736c[id=0,name=PPK,actualArtist=<null>]
[java] com.oreilly.hh.Artist@a59727[id=1,name=The
Buggles,actualArtist=<null>]
[java] com.oreilly.hh.Artist@1f7fbe[id=2,name=Laurie
Anderson,actualArtist=<null>]
[java] com.oreilly.hh.Artist@f42d53[id=3,name=William
Orbit,actualArtist=<null>]
[java] com.oreilly.hh.Artist@ba63a2[id=4,name=Ferry
Corsten,actualArtist=<null>]
[java] com.oreilly.hh.Artist@e668c2[id=5,name=Samuel
Barber,actualArtist=<null>]
[java] com.oreilly.hh.Artist@d67d61[id=6,name=ATB,actualArtist=<null>]
[java] com.oreilly.hh.Artist@e9a555[id=7,name=Pulp
Victim,actualArtist=<null>]
[java] com.oreilly.hh.Artist@6a4aef[id=8,name=Martin L.
Gore,actualArtist=<null>]
[java] com.oreilly.hh.Track@4dba7f[id=0,title=Russian
Trance,sourceMedia=Compact Disc]
[java] com.oreilly.hh.Track@906563[id=1,title=Video Killed the Radio
Star,sourceMedia=VHS Videocassette Tape]
[java] com.oreilly.hh.Track@f068b9[id=2,title=Gravity's
Angel,sourceMedia=Compact Disc]
[java] com.oreilly.hh.Track@6b54ef[id=3,title=Adagio for Strings (Ferry
Corsten Remix),sourceMedia=Compact Disc]
[java] com.oreilly.hh.Track@954549[id=4,title=Adagio for Strings (ATB
Remix),sourceMedia=Compact Disc]
[java] com.oreilly.hh.Track@f7dab1[id=5,title=The World
'99,sourceMedia=Digital Audio Stream]
[java] com.oreilly.hh.Track@36d1d7[id=6,title=Test Tone 1,sourceMedia=<null>]
[java] com.oreilly.hh.Track@aaa10[id=7,title=Compulsion,sourceMedia=Compact
Disc]
[java] com.oreilly.hh.Track@231e35[id=8,title=In a Manner of
Speaking,sourceMedia=Compact Disc]
[java] com.oreilly.hh.Track@9432e0[id=9,title=Smile in the
Crowd,sourceMedia=Compact Disc]
[java] com.oreilly.hh.Track@9830bc[id=10,title=Gone,sourceMedia=Compact
Disc]
[java] com.oreilly.hh.Track@a79447[id=11,title=Never Turn Your Back on Mother
Earth,sourceMedia=Compact Disc]
[java] com.oreilly.hh.Track@f801c4[id=12,title=Motherless
Child,sourceMedia=Compact Disc]
BUILD SUCCESSFUL
Total time: 17 seconds
9.1.2 What just happened?
Well, it's pretty remarkable if you think about it, because Hibernate had to do several separate SQL queries in order to obtain the results for us. A whole lot of work went on behind the scenes to hand us that list of every known persisted entity. Although it's hard to imagine a situation where you'll actually need to do something exactly like this, it certainly highlights some of the interesting capabilities of HQL and Hibernate.
And there are certainly times when slightly less comprehensive queries will be very useful to you, so it's worth keeping in mind that tablespanning polymorphic queries are not only possible, but easy to use.
There are some limitations that come into play when you run an HQL query that requires multiple separate SQL queries to implement. You can't use order by clauses to sort the entire set of results, nor can you use the Query interface's scroll() method to walk through them.
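If you do need the combined results in a particular order, one workaround is to sort the list in memory after Hibernate hands it back. Here's a sketch (not part of the example code) that orders the all list from Example 9-3 by class name and then by each object's toString():

// Impose an order ourselves, since "order by" can't span the several
// SQL queries a polymorphic HQL query may require.
Collections.sort(all, new Comparator() {
    public int compare(Object o1, Object o2) {
        int byClass = o1.getClass().getName().compareTo(o2.getClass().getName());
        return (byClass != 0) ? byClass : o1.toString().compareTo(o2.toString());
    }
});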
9.1.3 What about...
...Associations and joins? These are easy to work with as well. You can traverse associations by simply following the property chain, using periods as delimiters. To help you refer to a particular entity in your query expressions, HQL lets you assign aliases, just like SQL. This is particularly important if you want to refer to two separate entities of the same class, for example:
from com.oreilly.hh.Track as track1
which is equivalent to
from com.oreilly.hh.Track track1
The version you'll use will most likely depend on what you're used to, or the style guidelines established for your project.
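Whichever form you choose, the alias is what you use to refer to the entity's properties in the rest of the query. A quick sketch (calling Session.find() directly with a positional parameter, rather than going through a named query):

// The "track" alias lets us reach the entity's properties in the where clause
List longTracks = session.find(
    "from com.oreilly.hh.Track as track where track.playTime > ?",
    Time.valueOf("00:07:00"), Hibernate.TIME);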
We'll see examples of joins below, once we introduce enough other HQL elements to make them interesting.
9.2 Selecting Properties and Pieces
The queries we've been using so far have returned entire persistent objects. This is the most common use of an object/relational mapping service like Hibernate, so it should come as no surprise. Once you've got the objects, you can use them in whatever way you need to within the familiar realm of Java code. There are circumstances where you might want only a subset of the properties that make up an object, though, such as producing reports. HQL can accommodate such needs, in exactly the same way you'd use ordinary SQL—projection in a select clause.
9.2.1 How do I do that?
Suppose we want to change QueryTest.java to display only the titles of the tracks that meet our search criteria, and we want to extract only that information from the database in the first place. We'd start by changing the query of Example 3-9 to retrieve only the title property. Edit Track.hbm.xml to make the query look like Example 9-6.
Example 9-6. Obtaining just the titles of the short tracks
<query name="com.oreilly.hh.tracksNoLongerThan">
<![CDATA[
select track.title from com.oreilly.hh.Track as track
where track.playTime <= :length
]]>
</query>
Make sure the tracksNoLongerThan() method in QueryTest.java is set up to use this query. (If you edited it to use criteria queries in Chapter 8, change it back to the way it was in Example 3-10. To save you the trouble of hunting that down, it's reproduced as Example 9-7.)
Example 9-7. HQL-driven query method, using the query mapped in Example 9-6
public static List tracksNoLongerThan(Time length, Session session)
throws HibernateException
{
Query query = session.getNamedQuery(
"com.oreilly.hh.tracksNoLongerThan");
query.setTime("length", length);
return query.list();
}
Finally, the main() method needs to be updated, as shown in Example 9-8, to reflect the fact that the query method is now returning the title property rather than entire Track records. This property is defined as a String, so the method now returns a List of Strings.
Example 9-8. Changes to QueryTest's main() method to work with the title query
// Print the titles of tracks that will fit in five minutes
List titles = tracksNoLongerThan(Time.valueOf("00:05:00"),
session);
for (ListIterator iter = titles.listIterator() ;
iter.hasNext() ; ) {
String aTitle = (String)iter.next();
System.out.println("Track: " + aTitle);
}
Those changes are pretty simple, and the relationship between the return type of the query and the list elements we see in Java is straightforward. Depending on what data you've set up, running this version using ant qtest will result in output similar to Example 9-9. (If you've not got any data, or you want it to look just like this, recreate the test data before displaying it by running ant schema ctest atest qtest.)
Example 9-9. Listing just the titles of tracks no more than five minutes long
qtest:
[java] Track: Russian Trance
[java] Track: Video Killed the Radio Star
[java] Track: Test Tone 1
[java] Track: In a Manner of Speaking
[java] Track: Gone
[java] Track: Never Turn Your Back on Mother Earth
[java] Track: Motherless Child
9.2.2 What about...
...Returning more than one property? You can certainly do this, and the properties can come from multiple objects if you're using a join, or if your query object has components or associations (which are, after all, a very convenient form of object-oriented join). As you'd expect from SQL, all you do is list the properties you'd like, separated by commas. As a simple example, let's get the IDs as well as the titles for our tracks in this query. Tweak Track.hbm.xml so the query looks like Example 9-10.
Example 9-10. Selecting multiple properties from an object
<query name="com.oreilly.hh.tracksNoLongerThan">
<![CDATA[
select track.id, track.title from com.oreilly.hh.Track as track
where track.playTime <= :length
]]>
</query>
We don't need to change the query method at all; it still invokes this query by name, passes in the same named parameter, and returns the resulting list. But what does that list contain now? We'll need to update our loop in main() so that it can show both the IDs and the titles.
In situations like this, when it needs to return multiple, separate values for each "row" in a query, each entry in the List returned by Hibernate will contain an array of objects. Each array contains the selected properties, in the order they're listed in the query. So we'll get a list of two-element arrays; each array will contain an Integer followed by a String.
Example 9-11 shows how we can update main() in QueryTest.java to work with these arrays.
Example 9-11. Working with multiple, separate properties in query results
// Print IDs and titles of tracks that will fit in five minutes
List titles = tracksNoLongerThan(Time.valueOf("00:05:00"),
session);
for (ListIterator iter = titles.listIterator() ;
iter.hasNext() ; ) {
Object[] aRow = (Object[])iter.next();
Integer anID = (Integer)aRow[0];
String aTitle = (String)aRow[1];
System.out.println("Track: " + aTitle + " [ID=" + anID + ']');
}
Running ant qtest after these changes produces output like Example 9-12.
Example 9-12. Listing titles and IDs
qtest:
[java] Track: Russian Trance [ID=0]
[java] Track: Video Killed the Radio Star [ID=1]
[java] Track: Test Tone 1 [ID=6]
[java] Track: In a Manner of Speaking [ID=8]
[java] Track: Gone [ID=10]
[java] Track: Never Turn Your Back on Mother Earth [ID=11]
[java] Track: Motherless Child [ID=12]
I hope that while looking at this example you thought "that's an awkward way to work with Track properties." If you didn't, compare Example 9-11 with lines 48-56 of Example 3-5. The latter is more concise and natural, and it prints even more information about the tracks. If you're extracting information about a mapped object, you're almost always better off taking full advantage of the mapping capability to extract an actual instance of the object, so you can work with its properties with the full expressive and type-safe capabilities of Java.
NOTE
Was this some sort of cruel joke?
So why did I show it at all? Well, there are situations where retrieving multiple values in an HQL query can make sense: you might want just one property from each of a couple of mapped classes, for example. Or you might want to return a group of related classes by listing the class names in the select clause. For such cases it's worth knowing this technique. There may also be significant performance benefits if your mapped object has dozens of large (or non-lazily associated) properties, and you're only interested in one or two.
There is another surprising trick you can use to impose a good object structure even when you're building reports that select a bunch of properties from disparate mapped objects. HQL lets you construct and return an arbitrary object within your select clause. So you could create an ad hoc reporting class whose properties reflect the values needed by your report, and return instances of this class in the query instead of cryptic Object arrays. If we'd defined a TrackSummary class with id and title properties and an appropriate constructor, our query could have used:
select new TrackSummary(track.id, track.title)
instead of:
select track.id, track.title
and we wouldn't have needed any of the array manipulation in the code that works with the results. (Again, in this case, it would still have made more sense to simply return the entire Track, but this is useful when you're working with properties from multiple objects or even synthetic results like aggregate functions, as demonstrated below.)
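Here is a sketch of what such a reporting class might look like (TrackSummary is hypothetical; it isn't part of the example source, and it doesn't need a mapping document of its own because it's never persisted):

package com.oreilly.hh;

/**
 * Hypothetical ad hoc reporting class holding just the values a report needs.
 * HQL can instantiate it directly via "select new TrackSummary(...)".
 */
public class TrackSummary {
    private Integer id;
    private String title;

    public TrackSummary(Integer id, String title) {
        this.id = id;
        this.title = title;
    }

    public Integer getId() { return id; }
    public String getTitle() { return title; }

    public String toString() {
        return title + " [ID=" + id + ']';
    }
}

With that in place, the loop in main() could simply cast each result to TrackSummary instead of unpacking an Object array.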
9.3 Sorting
It should come as no surprise that you can use a SQL-style "order by" clause to control the order in which your output appears. I've alluded to this several times in earlier chapters, and it works just like you'd expect. You can use any property of the objects being returned to establish the sort order, and you can list multiple properties to establish sub-sorts within results for which the first property values are the same.
9.3.1 How do I do that?
Sorting is very simple: you list the values that you want to use to sort the results. As usual, where SQL uses columns, HQL uses properties. For Example 9-13, let's update the query in Example 9-10 so that it displays the results in reverse alphabetical order.
NOTE
As in SQL, you specify an ascending sort using "asc" and a descending sort with "desc".
Example 9-13. Addition to Track.hbm.xml that sorts the results backwards by title
<query name="com.oreilly.hh.tracksNoLongerThan">
<![CDATA[
select track.id, track.title from com.oreilly.hh.Track as track
where track.playTime <= :length
order by track.title desc
]]>
</query>
The output from running this is as you'd expect (Example 9-14).
Example 9-14. Titles and IDs in reverse alphabetical order
% ant qtest
Buildfile: build.xml
prepare:
[copy] Copying 1 file to /Users/jim/Documents/Work/OReilly/Hibernate/
Examples/ch09/classes
compile:
qtest:
[java] Track: Video Killed the Radio Star [ID=1]
[java] Track: Test Tone 1 [ID=6]
[java] Track: Russian Trance [ID=0]
[java] Track: Never Turn Your Back on Mother Earth [ID=11]
[java] Track: Motherless Child [ID=12]
[java] Track: In a Manner of Speaking [ID=8]
[java] Track: Gone [ID=10]
9.4 Working with Aggregate Values
Especially when writing reports, you'll often want summary information from the database: "How many? What's the average? The longest?" HQL can help with this, by offering aggregate functions like those in SQL. In HQL, of course, these functions apply to the properties of persistent classes.
9.4.1 How do I do that?
Let's try some of this in our query test framework. First, add the query in Example 9-15 after the existing query in Track.hbm.xml.
Example 9-15. A query collecting aggregate information about tracks
<query name="com.oreilly.hh.trackSummary">
<![CDATA[
select count(*), min(track.playTime), max(track.playTime)
from com.oreilly.hh.Track as track
]]>
</query>
I was tempted to try asking for the average playing time as well, but unfortunately HSQLDB doesn't know how to calculate averages for non-numeric values, and this property is stored in a column of type time.
Next we need to write a method to run this query and display the results. Add the code in Example 9-16 to QueryTest.java, after the tracksNoLongerThan() method.
Example 9-16. A method to run the trackSummary query
/**
* Print summary information about all tracks.
*
* @param session the Hibernate session that can retrieve data.
* @throws HibernateException if there is a problem.
**/
public static void printTrackSummary(Session session)
throws HibernateException
{
Query query = session.getNamedQuery("com.oreilly.hh.trackSummary");
Object[] results = (Object[])query.uniqueResult();
System.out.println("Summary information:");
System.out.println(" Total tracks: " + results[0]);
System.out.println(" Shortest track: " + results[1]);
System.out.println(" Longest track: " + results[2]);
}
Since we're only using aggregate functions in the query, we know we'll only get one row of results back. This is another opportunity to use the uniqueResult() convenience method offered by the Query interface. It saves us the trouble of getting back a list and extracting the first element. As discussed in the "Selecting Properties and Pieces" section above, since we've asked for multiple distinct values, that result will be an Object array, whose elements are the values we requested in the same order we put them in the query.
We also need to add a line to main() to call this method. We can put it after the end of the loop in which we print details about selected tracks, as shown in Example 9-17.
Example 9-17. Addition to main() in QueryTest.java to display the new summary information
...
System.out.println("Track: " + aTitle + " [ID=" + anID + ']');
}
printTrackSummary(session);
} finally {
// No matter what, close the session
...
With these additions, we get new output when running ant qtest (Example 9-18).
Example 9-18. The summary output
...
qtest:
[java] Track: Video Killed the Radio Star [ID=1]
[java] Track: Test Tone 1 [ID=6]
[java] Track: Russian Trance [ID=0]
[java] Track: Never Turn Your Back on Mother Earth [ID=11]
[java] Track: Motherless Child [ID=12]
[java] Track: In a Manner of Speaking [ID=8]
[java] Track: Gone [ID=10]
[java] Summary information:
[java] Total tracks: 13
[java] Shortest track: 00:00:10
[java] Longest track: 00:07:39
That was pretty easy. Let's try something trickier—pulling information from joined tables. Tracks have a collection of artists associated with them. Suppose we want to get summary information about the tracks associated with a particular artist, rather than for all tracks. Example 9-19 shows what we'd add to the query.
Example 9-19. Summarizing tracks associated with an artist
<query name="com.oreilly.hh.trackSummary">
<![CDATA[
select count(*), min(track.playTime), max(track.playTime)
from com.oreilly.hh.Track as track
where :artist in elements(track.artists)
]]>
</query>
We've added a where clause to narrow down the tracks we want to see, using a named parameter, artist. HQL provides another use for the in operator. While you can use it in the normal SQL sense to give a list of possible values for a property, you can also do what we've done here. This statement tells Hibernate we are interested in tracks whose artists collection contains a specified value. To call this version of the query, beef up printTrackSummary() a little, as shown in Example 9-20.
Example 9-20. Enhancing printTrackSummary() to work with a specific artist
/**
* Print summary information about tracks associated with an artist.
*
* @param artist the artist in whose tracks we're interested
* @param session the Hibernate session that can retrieve data.
* @throws HibernateException if there is a problem.
**/
public static void printTrackSummary(Artist artist, Session session)
throws HibernateException
{
Query query = session.getNamedQuery("com.oreilly.hh.trackSummary");
query.setParameter("artist", artist);
Object[] results = (Object[])query.uniqueResult();
System.out.println("Summary of tracks by " + artist.getName() + ':');
System.out.println(" Total tracks: " + results[0]);
System.out.println(" Shortest track: " + results[1]);
System.out.println(" Longest track: " + results[2]);
}
Wasn't much to that, was there? Finally, the line that calls this method needs another parameter to specify an artist. Use the handy getArtist() method in CreateTest.java once again. Change the method call in QueryTest.java's main() method to look like it does in Example 9-21.
Example 9-21. Calling the enhanced printTrackSummary()
...
System.out.println("Track: " + aTitle + " [ID=" + anID + ']');
}
printTrackSummary(CreateTest.getArtist("Samuel Barber",
false, session), session);
} finally {
// No matter what, close the session
...
Now when you run ant qtest you'll see information that looks like Example 9-22.
Example 9-22. Running the summary query for tracks by Samuel Barber
qtest:
[java] Track: Video Killed the Radio Star [ID=1]
[java] Track: Test Tone 1 [ID=6]
[java] Track: Russian Trance [ID=0]
[java] Track: Never Turn Your Back on Mother Earth [ID=11]
[java] Track: Motherless Child [ID=12]
[java] Track: In a Manner of Speaking [ID=8]
[java] Track: Gone [ID=10]
[java] Summary of tracks by Samuel Barber:
[java] Total tracks: 2
[java] Shortest track: 00:06:35
[java] Longest track: 00:07:39
9.4.2 What just happened?
This took so little effort that it's worth taking a minute to appreciate how much Hibernate actually did for us. The getArtist() method we called returned the Artist instance corresponding to Samuel Barber. We were able to pass this entire object as a named parameter to our HQL query, and Hibernate knows enough about how to put together join queries using the Artist's id property and the TRACK_ARTISTS table to implement the complicated condition we expressed so concisely in Example 9-19.
The results we got reflect the two remixes of "Adagio for Strings" in the sample data. They don't show up in the detailed track list because they're both longer than five minutes.
NOTE
Just try doing something like that with vanilla SQL!
9.5 Writing Native SQL Queries
Given the power and convenience of HQL, and the way it dovetails so naturally with the objects in your Java code, why wouldn't you want to use it? Well, there might be some special feature supported by the native SQL dialect of your project's database that HQL can't exploit. If you're willing to accept the fact that using this feature will make it harder to change databases in the future, Hibernate will let you write queries in that native dialect while still helping you write expressions in terms of properties and translate the results to objects. (If you didn't want this help, you could just use a raw JDBC connection to run a plain SQL query, of course.)
Another circumstance in which it might be nice to meet your database halfway is if you're in the process of migrating an existing JDBC-based project to Hibernate, and you want to take small steps rather than thoroughly rewriting each query right away.
9.5.1 How do I do that?
If you're embedding your query text inside your Java source code, you use the Session method createSQLQuery() instead of Example 3-8's createQuery(). Of course, you know better than to code like that, so I won't even show you an example. The better approach is to put the query in a mapping document like Example 3-9. The difference is that you use a sql-query tag rather than the query tag we've seen up until now. You also need to tell Hibernate the mapped class you want to return, and the alias that you're using to refer to it (and its properties) in the query.
As a somewhat contrived example, suppose we want to know all the tracks that end exactly halfway through the last minute they're playing (in other words, the time display on the jukebox would be h:mm:30). An easy way to do that would be to take advantage of HSQLDB's built-in SECOND function, which gives you the seconds part of a Time value. Since HQL doesn't know about functions that are specific to HSQLDB's SQL dialect, this will push us into the realm of a native SQL query. Example 9-23 shows what it would look like; add this after the HQL queries in Track.hbm.xml.
Example 9-23. Embedding a native SQL dialect query in a Hibernate mapping
<sql-query name="com.oreilly.hh.tracksEndingAt">
<return alias="track" class="com.oreilly.hh.Track"/>
<![CDATA[
select {track.*}
from TRACK as {track}
where SECOND({track}.PLAYTIME) = :seconds
]]>
</sql-query>
The return tag tells Hibernate we're going to be using the alias track in our query to refer to a Track object. That allows us to use the shorthand {track.*} in the query body to refer to all the columns from the TRACK table we need in order to create a Track instance. (Notice that everywhere we use the alias in the query body we need to enclose it in curly braces. This gets us "out of" the native SQL environment so we can express things in terms of Hibernate-mapped classes and properties.)
The where clause in the query uses the HSQLDB SECOND function to narrow our results to include only tracks whose length has a specified number in the seconds part. Happily, even though we're building a native SQL query, we can still make use of Hibernate's nice named query parameters. In this case we're passing in a value named "seconds" to control the query. (You don't use curly braces to tell Hibernate you're using a named parameter even in an SQL query; its parser is smart enough to figure this out.)
The code that uses this mapped SQL query is no different than our previous examples using HQL queries. The getNamedQuery() method is used to load both kinds, and they both implement the Query interface. So our Java method invoking this query should look familiar. Add the code in Example 9-24 after the printTrackSummary() method in QueryTest.java.
Example 9-24. Calling a native SQL mapped query
/**
* Print tracks that end some number of seconds into their final minute.
*
* @param seconds, the number of seconds at which we want tracks to end.
* @param session the Hibernate session that can retrieve data.
* @throws HibernateException if there is a problem.
**/
public static void printTracksEndingAt(int seconds, Session session)
throws HibernateException
{
Query query = session.getNamedQuery("com.oreilly.hh.tracksEndingAt");
query.setInteger("seconds", seconds);
List results = query.list();
for (ListIterator iter = results.listIterator() ; iter.hasNext() ; ) {
Track aTrack = (Track)iter.next();
System.out.println("Track: " + aTrack.getTitle() +
", length=" + aTrack.getPlayTime());
}
}
Finally, add some lines to main() that call this method. Example 9-25 shows them added after the invocation of printTrackSummary().
Example 9-25. Calling printTracksEndingAt() to display tracks ending at a half minute
...
printTrackSummary(CreateTest.getArtist("Samuel Barber",
false, session), session);
System.out.println("Tracks ending halfway through final minute:");
printTracksEndingAt(30, session);
} finally {
// No matter what, close the session
...
These changes produce the additional output shown in Example 9-26 when ant qtest is run.
Example 9-26. Sample output from the native SQL query
qtest:
[java] Track: Video Killed the Radio Star [ID=1]
...
[java] Summary of tracks by Samuel Barber:
[java] Total tracks: 2
[java] Shortest track: 00:06:35
[java] Longest track: 00:07:39
[java] Tracks ending halfway through final minute:
[java] Track: Russian Trance, length=00:03:30
There's a lot more tedium and attention to detail required in using a native SQL query than an HQL query (especially when your query starts getting complex or referring to multiple tables), but it's nice to know that it is possible on the rare occasions where you really need one.
9.5.2 What about...
...Well, lots of things? You undoubtedly suspect this chapter barely scratches the surface of what you can do with HQL. That's definitely true. When you start combining some of these capabilities, and working with collections, associations, and powerful expressions, you can achieve some remarkable things. We can't possibly cover them all in this introduction, so you'll want to take a close look at the HQL section and examples in the Hibernate reference documentation, and do some experimentation on your own.
When you look through the Hibernate Query Language chapter in the reference documentation, be sure to look at the interesting things you can use in expressions, especially as they apply to collections. Don't miss the way you can use array bracket notation to select elements of indexed collections—you can even put arithmetic expressions inside the brackets.
The "Tips and Tricks" section that follows their longer examples gives some useful advice about working efficiently in different database environments, and using a variety of Hibernate interfaces to achieve useful results in ways you might not think of, especially if you're coming from a SQL background.
Hopefully, this discussion has helped you get a grounding in the basics, and it will serve as a starting point and anchor for the explorations on which you will embark!
NOTE
This isn't your father's SQL....
Example 9-1. The simplest HQL query
from Track
There's not much to it, is there? This is the HQL equivalent of Example 8-1, in which we built a criteria query on the Track class and supplied no criteria.
By default, Hibernate automatically "imports" the names of each class you map, which means you don't have to provide a fully qualified package name when you want to use it; the simple class name is enough. As long as you've not mapped more than one class with the same name, you don't need to use fully qualified class names in your queries. You are certainly free to do so if you prefer—as I have in this book, to help readers remember that the queries are expressed in terms of Java data beans and their properties, not database tables and columns (like you'd find in SQL). Example 9-2 produces precisely the same result as our first query.
Example 9-2. Explicit package naming, but still pretty simple
from com.oreilly.hh.Track
If you do have more than one mapped class with the same name, you can either use the fully qualified package name to refer to each, or you can assign an alternate name to one or both classes using an import tag in their mapping documents. You can also turn off the auto-import facility for a mapping file by adding auto-import="false" to the hibernate-mapping tag's attributes.
You're probably used to queries being case-insensitive, since SQL behaves this way. For the most part, HQL acts the same, with the important exceptions of class and property names. Just as in the rest of Java, these are case-sensitive, so you must get the capitalization right.
Let's look at an extreme example of how HQL differs from SQL, by pushing its polymorphic query capability to its logical limit.
9.1.1 How do I do that?
A powerful way to highlight the fundamental difference between SQL and HQL is to consider what happens when you query "from java.lang.Object". At first glance this might not even seem to make sense! In fact, Hibernate supports queries that return polymorphic results. If you've got mapped classes that extend each other, or have some shared ancestor or interface, whether you've mapped the classes to the same table or to different tables, you can query the superclass. Since every Java object extends Object, this query asks Hibernate to return every single entity it knows about in the database.
We can test this by making a quick variant of our query test. Copy QueryTest.java to QueryTest3.java, and make the changes shown in Example 9-3 (most of the changes, which don't show up in the example, involve deleting example queries we don't need here).
Example 9-3. Look, Toto, we're not in SQL anymore
1 package com.oreilly.hh;
2
3 import net.sf.hibernate.*;
4 import net.sf.hibernate.cfg.Configuration;
5
6 import java.util.*;
7
8 /**
9 * Retrieve all persistent objects
10 */
11 public class QueryTest3 {
12
13 /**
14 * Look up and print all entities when invoked from the command line.
15 */
16 public static void main(String args[]) throws Exception {
17 // Create a configuration based on the properties file we've put
18 // in the standard place.
19 Configuration config = new Configuration();
20
21 // Tell it about the classes we want mapped, taking advantage of
22 // the way we've named their mapping documents.
23 config.addClass(Track.class).addClass(Artist.class);
24 config.addClass(Album.class);
25
26 // Get the session factory we can use for persistence
27 SessionFactory sessionFactory = config.buildSessionFactory();
28
29 // Ask for a session using the JDBC information we've configured
30 Session session = sessionFactory.openSession();
31 try {
32 // Print every mapped object in the database
33 List all = session.find("from java.lang.Object");
34 for (ListIterator iter = all.listIterator() ;
35 iter.hasNext() ; ) {
36 System.out.println(iter.next());
37 }
38 } finally {
39 // No matter what, close the session
40 session.close();
41 }
42
43 // Clean up after ourselves
44 sessionFactory.close();
45 }
46 }
We added line 24 because Hibernate will only return objects whose mappings it has been asked to load, and we'd like to see everything we can put in the database. To be sure you've got all the sample data created, run ant schema ctest atest . Line 33 invokes the odd and powerful query, and line 36 prints what we've found. We can't do much more than call toString() on the objects we get back because we don't know what they might be. Object is not very deep as a shared interface.
There's one more step we need to take before we can run this. Add another target to build.xml, as shown in Example 9-4.
NOTE
As usual, these examples assume you've set up the environment by following the preceding chapters. You can also just use the downloadable sample source for this chapter.
Example 9-4. Invoking the über-query
<target name="qtest3" description="Retrieve all mapped objects"
depends="compile">
<java classname="com.oreilly.hh.QueryTest3" fork="true">
<classpath refid="project.class.path"/>
<java>
</target>
Then you can test it by running ant qtest3 . You'll see results like Example 9-5.
Example 9-5. Everything Hibernate can find
% ant qtest3
Buildfile: build.xml
prepare:
compile:
[javac] Compiling 1 source file to /Users/jim/Documents/Work/OReilly/
Hibernate/Examples/ch09/classes
qtest3:
[java] com.oreilly.hh.Album@397cea[id=0,title=Counterfeit e.p.,tracks=[com.
oreilly.hh.AlbumTrack@5e60f2[track=com.oreilly.hh.
Track@aaa10[id=7,title=Compulsion,sourceMedia=Compact Disc]], com.oreilly.hh.
AlbumTrack@2ed1e8[track=com.oreilly.hh.Track@231e35[id=8,title=In a Manner of
Speaking,sourceMedia=Compact Disc]], com.oreilly.hh.AlbumTrack@d6f84a[track=com.
oreilly.hh.Track@9432e0[id=9,title=Smile in the Crowd,sourceMedia=Compact Disc]],
com.oreilly.hh.AlbumTrack@46ca65[track=com.oreilly.hh.
Track@9830bc[id=10,title=Gone,sourceMedia=Compact Disc]], com.oreilly.hh.
AlbumTrack@91e365[track=com.oreilly.hh.Track@a79447[id=11,title=Never Turn Your
Back on Mother Earth,sourceMedia=Compact Disc]], com.oreilly.hh.
AlbumTrack@e823ac[track=com.oreilly.hh.Track@f801c4[id=12,title=Motherless
Child,sourceMedia=Compact Disc]]]]
[java] com.oreilly.hh.Artist@e7736c[id=0,name=PPK,actualArtist=<null>]
[java] com.oreilly.hh.Artist@a59727[id=1,name=The
Buggles,actualArtist=<null>]
[java] com.oreilly.hh.Artist@1f7fbe[id=2,name=Laurie
Anderson,actualArtist=<null>]
[java] com.oreilly.hh.Artist@f42d53[id=3,name=William
Orbit,actualArtist=<null>]
[java] com.oreilly.hh.Artist@ba63a2[id=4,name=Ferry
Corsten,actualArtist=<null>]
[java] com.oreilly.hh.Artist@e668c2[id=5,name=Samuel
Barber,actualArtist=<null>]
[java] com.oreilly.hh.Artist@d67d61[id=6,name=ATB,actualArtist=<null>]
[java] com.oreilly.hh.Artist@e9a555[id=7,name=Pulp
Victim,actualArtist=<null>]
[java] com.oreilly.hh.Artist@6a4aef[id=8,name=Martin L.
Gore,actualArtist=<null>]
[java] com.oreilly.hh.Track@4dba7f[id=0,title=Russian
Trance,sourceMedia=Compact Disc]
[java] com.oreilly.hh.Track@906563[id=1,title=Video Killed the Radio
Star,sourceMedia=VHS Videocassette Tape]
[java] com.oreilly.hh.Track@f068b9[id=2,title=Gravity's
Angel,sourceMedia=Compact Disc]
[java] com.oreilly.hh.Track@6b54ef[id=3,title=Adagio for Strings (Ferry
Corsten Remix),sourceMedia=Compact Disc]
[java] com.oreilly.hh.Track@954549[id=4,title=Adagio for Strings (ATB
Remix),sourceMedia=Compact Disc]
[java] com.oreilly.hh.Track@f7dab1[id=5,title=The World
'99,sourceMedia=Digital Audio Stream]
[java] com.oreilly.hh.Track@36d1d7[id=6,title=Test Tone 1,sourceMedia=<null>]
[java] com.oreilly.hh.Track@aaa10[id=7,title=Compulsion,sourceMedia=Compact
Disc]
[java] com.oreilly.hh.Track@231e35[id=8,title=In a Manner of
Speaking,sourceMedia=Compact Disc]
[java] com.oreilly.hh.Track@9432e0[id=9,title=Smile in the
Crowd,sourceMedia=Compact Disc]
[java] com.oreilly.hh.Track@9830bc[id=10,title=Gone,sourceMedia=Compact
Disc]
[java] com.oreilly.hh.Track@a79447[id=11,title=Never Turn Your Back on Mother
Earth,sourceMedia=Compact Disc]
[java] com.oreilly.hh.Track@f801c4[id=12,title=Motherless
Child,sourceMedia=Compact Disc]
BUILD SUCCESSFUL
Total time: 17 seconds
9.1.2 What just happened?
Well, it's pretty remarkable if you think about it, because Hibernate had to do several separate SQL queries in order to obtain the results for us. A whole lot of work went on behind the scenes to hand us that list of every known persisted entity. Although it's hard to imagine a situation where you'll actually need to do something exactly like this, it certainly highlights some of the interesting capabilities of HQL and Hibernate.
And there are certainly times when slightly less comprehensive queries will be very useful to you, so it's worth keeping in mind that table-spanning polymorphic queries are not only possible, but easy to use.
There are some limitations that come into play when you run an HQL query that requires multiple separate SQL queries to implement. You can't use order by clauses to sort the entire set of results, nor can you use the Query interface's scroll() method to walk through them.
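If you really do need the combined results in some particular order, one workaround—just a sketch, not something we added to QueryTest3—is to sort the returned list in memory, using whatever comparison makes sense for your mix of objects. Something along these lines would work with the all list from Example 9-3:

// Sort the heterogeneous results by their string forms in memory, since
// "order by" can't span the separate SQL queries Hibernate issues here.
Collections.sort(all, new Comparator() {
        public int compare(Object o1, Object o2) {
            return o1.toString().compareTo(o2.toString());
        }
    });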
9.1.3 What about...
...Associations and joins? These are easy to work with as well. You can traverse associations by simply following the property chain, using periods as delimiters. To help you refer to a particular entity in your query expressions, HQL lets you assign aliases, just like SQL. This is particularly important if you want to refer to two separate entities of the same class, for example:
from com.oreilly.hh.Track as track1
which is equivalent to
from com.oreilly.hh.Track track1
The version you'll use will most likely depend on what you're used to, or the style guidelines established for your project.
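Returning to property chains for a moment, here's a quick illustration—only a sketch, not one of the mapped queries in this chapter, and it leans on the actualArtist association you can see in the Example 9-5 output. An inline query that walks from an artist to the artist it's an alias for might look like:

Query query = session.createQuery(
    "from com.oreilly.hh.Artist as artist " +
    "where artist.actualArtist.name = :name");
query.setString("name", "Martin L. Gore");
List aliases = query.list();

Behind the scenes, Hibernate expands the dotted path into the join it needs.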
We'll see examples of joins below, once we introduce enough other HQL elements to make them interesting.
9.2 Selecting Properties and Pieces
The queries we've been using so far have returned entire persistent objects. This is the most common use of an object/relational mapping service like Hibernate, so it should come as no surprise. Once you've got the objects, you can use them in whatever way you need to within the familiar realm of Java code. There are circumstances where you might want only a subset of the properties that make up an object, though, such as producing reports. HQL can accommodate such needs, in exactly the same way you'd use ordinary SQL—projection in a select clause.
9.2.1 How do I do that?
Suppose we want to change QueryTest.java to display only the titles of the tracks that meet our search criteria, and we want to extract only that information from the database in the first place. We'd start by changing the query of Example 3-9 to retrieve only the title property. Edit Track.hbm.xml to make the query look like Example 9-6.
Example 9-6. Obtaining just the titles of the short tracks
<query name="com.oreilly.hh.tracksNoLongerThan">
<![CDATA[
select track.title from com.oreilly.hh.Track as track
where track.playTime <= :length
]]>
</query>
Make sure the tracksNoLongerThan() method in QueryTest.java is set up to use this query. (If you edited it to use criteria queries in Chapter 8, change it back to the way it was in Example 3-10. To save you the trouble of hunting that down, it's reproduced as Example 9-7.)
Example 9-7. HQL-driven query method, using the query mapped in Example 9-6
public static List tracksNoLongerThan(Time length, Session session)
    throws HibernateException
{
    Query query = session.getNamedQuery(
        "com.oreilly.hh.tracksNoLongerThan");
    query.setTime("length", length);
    return query.list();
}
Finally, the main() method needs to be updated, as shown in Example 9-8, to reflect the fact that the query method is now returning the title property rather than entire Track records. This property is defined as a String, so the method now returns a List of Strings.
Example 9-8. Changes to QueryTest's main() method to work with the title query
// Print the titles of tracks that will fit in five minutes
List titles = tracksNoLongerThan(Time.valueOf("00:05:00"),
                                 session);
for (ListIterator iter = titles.listIterator() ;
     iter.hasNext() ; ) {
    String aTitle = (String)iter.next();
    System.out.println("Track: " + aTitle);
}
Those changes are pretty simple, and the relationship between the return type of the query and the list elements we see in Java is straightforward. Depending on what data you've set up, running this version using ant qtest will result in output similar to Example 9-9. (If you've not got any data, or you want it to look just like this, recreate the test data before displaying it by running ant schema ctest atest qtest .)
Example 9-9. Listing just the titles of tracks no more than five minutes long
qtest:
[java] Track: Russian Trance
[java] Track: Video Killed the Radio Star
[java] Track: Test Tone 1
[java] Track: In a Manner of Speaking
[java] Track: Gone
[java] Track: Never Turn Your Back on Mother Earth
[java] Track: Motherless Child
9.2.2 What about...
...Returning more than one property? You can certainly do this, and the properties can come from multiple objects if you're using a join, or if your query object has components or associations (which are, after all, a very convenient form of object-oriented join). As you'd expect from SQL, all you do is list the properties you'd like, separated by commas. As a simple example, let's get the IDs as well as the titles for our tracks in this query. Tweak Track.hbm.xml so the query looks like Example 9-10.
Example 9-10. Selecting multiple properties from an object
<query name="com.oreilly.hh.tracksNoLongerThan">
<![CDATA[
select track.id, track.title from com.oreilly.hh.Track as track
where track.playTime <= :length
]]>
</query>
We don't need to change the query method at all; it still invokes this query by name, passes in the same named parameter, and returns the resulting list. But what does that list contain now? We'll need to update our loop in main() so that it can show both the IDs and the titles.
In situations like this, when a query needs to return multiple, separate values for each "row" of results, each entry in the List returned by Hibernate will contain an array of objects. Each array contains the selected properties, in the order they're listed in the query. So we'll get a list of two-element arrays; each array will contain an Integer followed by a String.
Example 9-11 shows how we can update main() in QueryTest.java to work with these arrays.
Example 9-11. Working with multiple, separate properties in query results
// Print IDs and titles of tracks that will fit in five minutes
List titles = tracksNoLongerThan(Time.valueOf("00:05:00"),
                                 session);
for (ListIterator iter = titles.listIterator() ;
     iter.hasNext() ; ) {
    Object[] aRow = (Object[])iter.next();
    Integer anID = (Integer)aRow[0];
    String aTitle = (String)aRow[1];
    System.out.println("Track: " + aTitle + " [ID=" + anID + ']');
}
Running ant qtest after these changes produces output like Example 9-12.
Example 9-12. Listing titles and IDs
qtest:
[java] Track: Russian Trance [ID=0]
[java] Track: Video Killed the Radio Star [ID=1]
[java] Track: Test Tone 1 [ID=6]
[java] Track: In a Manner of Speaking [ID=8]
[java] Track: Gone [ID=10]
[java] Track: Never Turn Your Back on Mother Earth [ID=11]
[java] Track: Motherless Child [ID=12]
I hope that while looking at this example you thought "that's an awkward way to work with Track properties." If you didn't, compare Example 9-11 with lines 48-56 of Example 3-5. The latter is more concise and natural, and it prints even more information about the tracks. If you're extracting information about a mapped object, you're almost always better off taking full advantage of the mapping capability to extract an actual instance of the object, so you can work with its properties with the full expressive and type-safe capabilities of Java.
NOTE
Was this some sort of cruel joke?
So why did I show it at all? Well, there are situations where retrieving multiple values in an HQL query can make sense: you might want just one property from each of a couple of mapped classes, for example. Or you might want to return a group of related classes by listing the class names in the select clause. For such cases it's worth knowing this technique. There may also be significant performance benefits if your mapped object has dozens of large (or non-lazily associated) properties, and you're only interested in one or two.
There is another surprising trick you can use to impose a good object structure even when you're building reports that select a bunch of properties from disparate mapped objects. HQL lets you construct and return an arbitrary object within your select clause. So you could create an ad hoc reporting class whose properties reflect the values needed by your report, and return instances of this class in the query instead of cryptic Object arrays. If we'd defined a TrackSummary class with id and title properties and an appropriate constructor, our query could have used:
select new TrackSummary(track.id, track.title)
instead of:
select track.id, track.title
and we wouldn't have needed any of the array manipulation in the code that works with the results. (Again, in this case, it would still have made more sense to simply return the entire Track, but this is useful when you're working with properties from multiple objects or even synthetic results like aggregate functions, as demonstrated below.)
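For concreteness, here's what such a class might look like—a minimal sketch; we never actually define TrackSummary in the examples, and these property names simply mirror the constructor call above:

package com.oreilly.hh;

/**
 * Hypothetical ad hoc reporting class for the query sketch above.
 */
public class TrackSummary {
    private int id;
    private String title;

    /** Constructor matching "select new TrackSummary(track.id, track.title)". */
    public TrackSummary(int id, String title) {
        this.id = id;
        this.title = title;
    }

    public int getId() { return id; }
    public String getTitle() { return title; }
}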
9.3 Sorting
It should come as no surprise that you can use a SQL-style "order by" clause to control the order in which your output appears. I've alluded to this several times in earlier chapters, and it works just like you'd expect. You can use any property of the objects being returned to establish the sort order, and you can list multiple properties to establish sub-sorts within results for which the first property values are the same.
9.3.1 How do I do that?
Sorting is very simple: you list the values that you want to use to sort the results. As usual, where SQL uses columns, HQL uses properties. For Example 9-13, let's update the query in Example 9-10 so that it displays the results in reverse alphabetical order.
NOTE
As in SQL, you specify an ascending sort using "asc" and a descending sort with "desc".
Example 9-13. Addition to Track.hbm.xml that sorts the results backwards by title
<query name="com.oreilly.hh.tracksNoLongerThan">
<![CDATA[
select track.id, track.title from com.oreilly.hh.Track as track
where track.playTime <= :length
order by track.title desc
]]>
</query>
The output from running this is as you'd expect (Example 9-14).
Example 9-14. Titles and IDs in reverse alphabetical order
% ant qtest
Buildfile: build.xml
prepare:
[copy] Copying 1 file to /Users/jim/Documents/Work/OReilly/Hibernate/
Examples/ch09/classes
compile:
qtest:
[java] Track: Video Killed the Radio Star [ID=1]
[java] Track: Test Tone 1 [ID=6]
[java] Track: Russian Trance [ID=0]
[java] Track: Never Turn Your Back on Mother Earth [ID=11]
[java] Track: Motherless Child [ID=12]
[java] Track: In a Manner of Speaking [ID=8]
[java] Track: Gone [ID=10]
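Sub-sorts work just as you'd guess: list more than one property, separated by commas, and each additional property breaks ties left by the ones before it. As a quick sketch (built inline here rather than added to Track.hbm.xml), you could keep the length restriction but sort by descending play time and then alphabetically within each length:

Query query = session.createQuery(
    "select track.id, track.title from com.oreilly.hh.Track as track " +
    "where track.playTime <= :length " +
    "order by track.playTime desc, track.title asc");
query.setTime("length", Time.valueOf("00:05:00"));
List results = query.list();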
9.4 Working with Aggregate Values
Especially when writing reports, you'll often want summary information from the database: "How many? What's the average? The longest?" HQL can help with this, by offering aggregate functions like those in SQL. In HQL, of course, these functions apply to the properties of persistent classes.
9.4.1 How do I do that?
Let's try some of this in our query test framework. First, add the query in Example 9-15 after the existing query in Track.hbm.xml.
Example 9-15. A query collecting aggregate information about tracks
<query name="com.oreilly.hh.trackSummary">
<![CDATA[
select count(*), min(track.playTime), max(track.playTime)
from com.oreilly.hh.Track as track
]]>
</query>
I was tempted to try asking for the average playing time as well, but unfortunately HSQLDB doesn't know how to calculate averages for nonnumeric values, and this property is stored in a column of type time.
Next we need to write a method to run this query and display the results. Add the code in Example 9-16 to QueryTest.java, after the tracksNoLongerThan() method.
Example 9-16. A method to run the trackSummary query
/**
 * Print summary information about all tracks.
 *
 * @param session the Hibernate session that can retrieve data.
 * @throws HibernateException if there is a problem.
 **/
public static void printTrackSummary(Session session)
    throws HibernateException
{
    Query query = session.getNamedQuery("com.oreilly.hh.trackSummary");
    Object[] results = (Object[])query.uniqueResult();
    System.out.println("Summary information:");
    System.out.println(" Total tracks: " + results[0]);
    System.out.println(" Shortest track: " + results[1]);
    System.out.println(" Longest track: " + results[2]);
}
Since we're only using aggregate functions in the query, we know we'll only get one row of results back. This is another opportunity to use the uniqueResult() convenience method offered by the Query interface. It saves us the trouble of getting back a list and extracting the first element. As discussed in the "Selecting Properties and Pieces" section above, since we've asked for multiple distinct values, that result will be an Object array, whose elements are the values we requested in the same order we put them in the query.
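Incidentally, when a query selects just a single value—say, only the count—uniqueResult() hands back that one object directly rather than wrapping it in an array. A quick sketch (an inline query, not one of our named queries); casting to Number saves us from caring exactly which numeric wrapper the database and Hibernate choose:

Query query = session.createQuery(
    "select count(*) from com.oreilly.hh.Track as track");
Number total = (Number)query.uniqueResult();
System.out.println("Total tracks: " + total);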
We also need to add a line to main() to call this method. We can put it after the end of the loop in which we print details about selected tracks, as shown in Example 9-17.
Example 9-17. Addition to main() in QueryTest.java to display the new summary information
...
System.out.println("Track: " + aTitle + " [ID=" + anID + ']');
}
printTrackSummary(session);
} finally {
// No matter what, close the session
...
With these additions, we get new output when running ant qtest (Example 9-18).
Example 9-18. The summary output
...
qtest:
[java] Track: Video Killed the Radio Star [ID=1]
[java] Track: Test Tone 1 [ID=6]
[java] Track: Russian Trance [ID=0]
[java] Track: Never Turn Your Back on Mother Earth [ID=11]
[java] Track: Motherless Child [ID=12]
[java] Track: In a Manner of Speaking [ID=8]
[java] Track: Gone [ID=10]
[java] Summary information:
[java] Total tracks: 13
[java] Shortest track: 00:00:10
[java] Longest track: 00:07:39
That was pretty easy. Let's try something trickier—pulling information from joined tables. Tracks have a collection of artists associated with them. Suppose we want to get summary information about the tracks associated with a particular artist, rather than for all tracks. Example 9-19 shows what we'd add to the query.
Example 9-19. Summarizing tracks associated with an artist
<query name="com.oreilly.hh.trackSummary">
<![CDATA[
select count(*), min(track.playTime), max(track.playTime)
from com.oreilly.hh.Track as track
where :artist in elements(track.artists)
]]>
</query>
We've added a where clause to narrow down the tracks we want to see, using a named parameter, artist. HQL provides another use for the in operator. While you can use it in the normal SQL sense to give a list of possible values for a property, you can also do what we've done here. This statement tells Hibernate we are interested in tracks whose artists collection contains a specified value. To call this version of the query, beef up printTrackSummary() a little, as shown in Example 9-20.
Example 9-20. Enhancing printTrackSummary() to work with a specific artist
/**
 * Print summary information about tracks associated with an artist.
 *
 * @param artist the artist in whose tracks we're interested
 * @param session the Hibernate session that can retrieve data.
 * @throws HibernateException if there is a problem.
 **/
public static void printTrackSummary(Artist artist, Session session)
    throws HibernateException
{
    Query query = session.getNamedQuery("com.oreilly.hh.trackSummary");
    query.setParameter("artist", artist);
    Object[] results = (Object[])query.uniqueResult();
    System.out.println("Summary of tracks by " + artist.getName() + ':');
    System.out.println(" Total tracks: " + results[0]);
    System.out.println(" Shortest track: " + results[1]);
    System.out.println(" Longest track: " + results[2]);
}
Wasn't much to that, was there? Finally, the line that calls this method needs another parameter to specify an artist. Use the handy getArtist() method in CreateTest.java once again. Change the method call in QueryTest.java's main() method to look like it does in Example 9-21.
Example 9-21. Calling the enhanced printTrackSummary()
...
System.out.println("Track: " + aTitle + " [ID=" + anID + ']');
}
printTrackSummary(CreateTest.getArtist("Samuel Barber",
false, session), session);
} finally {
// No matter what, close the session
...
Now when you run ant qtest you'll see information that looks like Example 9-22.
Example 9-22. Running the summary query for tracks by Samuel Barber
qtest:
[java] Track: Video Killed the Radio Star [ID=1]
[java] Track: Test Tone 1 [ID=6]
[java] Track: Russian Trance [ID=0]
[java] Track: Never Turn Your Back on Mother Earth [ID=11]
[java] Track: Motherless Child [ID=12]
[java] Track: In a Manner of Speaking [ID=8]
[java] Track: Gone [ID=10]
[java] Summary of tracks by Samuel Barber:
[java] Total tracks: 2
[java] Shortest track: 00:06:35
[java] Longest track: 00:07:39
9.4.2 What just happened?
This took so little effort that it's worth taking a minute to appreciate how much Hibernate actually did for us. The getArtist() method we called returned the Artist instance corresponding to Samuel Barber. We were able to pass this entire object as a named parameter to our HQL query, and Hibernate knows enough about how to put together join queries using the Artist's id property and the TRACK_ARTISTS table to implement the complicated condition we expressed so concisely in Example 9-19.
The results we got reflect the two remixes of "Adagio for Strings" in the sample data. They don't show up in the detailed track list because they're both longer than five minutes.
NOTE
Just try doing something like that with vanilla SQL!
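Incidentally, the more conventional use of in mentioned above—matching a property against an explicit list of values—looks just like it would in SQL. A throwaway sketch (an inline query, not one of our named queries):

Query query = session.createQuery(
    "from com.oreilly.hh.Track as track " +
    "where track.title in ('Gone', 'Russian Trance')");
List tracks = query.list();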
9.5 Writing Native SQL Queries
Given the power and convenience of HQL, and the way it dovetails so naturally with the objects in your Java code, why wouldn't you want to use it? Well, there might be some special feature supported by the native SQL dialect of your project's database that HQL can't exploit. If you're willing to accept the fact that using this feature will make it harder to change databases in the future, Hibernate will let you write queries in that native dialect while still helping you write expressions in terms of properties and translate the results to objects. (If you didn't want this help, you could just use a raw JDBC connection to run a plain SQL query, of course.)
Another circumstance in which it might be nice to meet your database halfway is if you're in the process of migrating an existing JDBC-based project to Hibernate, and you want to take small steps rather than thoroughly rewriting each query right away.
9.5.1 How do I do that?
If you're embedding your query text inside your Java source code, you use the Session method createSQLQuery() instead of Example 3-8's createQuery(). Of course, you know better than to code like that, so I won't even show you an example. The better approach is to put the query in a mapping document like Example 3-9. The difference is that you use a sql-query tag rather than the query tag we've seen up until now. You also need to tell Hibernate the mapped class you want to return, and the alias that you're using to refer to it (and its properties) in the query.
As a somewhat contrived example, suppose we want to know all the tracks that end exactly halfway through the last minute they're playing (in other words, the time display on the jukebox would be h:mm:30). An easy way to do that would be to take advantage of HSQLDB's built-in SECOND function, which gives you the seconds part of a Time value. Since HQL doesn't know about functions that are specific to HSQLDB's SQL dialect, this will push us into the realm of a native SQL query. Example 9-23 shows what it would look like; add this after the HQL queries in Track.hbm.xml.
Example 9-23. Embedding a native SQL dialect query in a Hibernate mapping
<sql-query name="com.oreilly.hh.tracksEndingAt">
<return alias="track" class="com.oreilly.hh.Track"/>
<![CDATA[
select {track.*}
from TRACK as {track}
where SECOND({track}.PLAYTIME) = :seconds
]]>
</sql-query>
The return tag tells Hibernate we're going to be using the alias track in our query to refer to a Track object. That allows us to use the shorthand {track.*} in the query body to refer to all the columns from the TRACK table we need in order to create a Track instance. (Notice that everywhere we use the alias in the query body we need to enclose it in curly braces. This gets us "out of" the native SQL environment so we can express things in terms of Hibernate-mapped classes and properties.)
The where clause in the query uses the HSQLDB SECOND function to narrow our results to include only tracks whose length has a specified number in the seconds part. Happily, even though we're building a native SQL query, we can still make use of Hibernate's nice named query parameters. In this case we're passing in a value named "seconds" to control the query. (You don't use curly braces to tell Hibernate you're using a named parameter even in an SQL query; its parser is smart enough to figure this out.)
The code that uses this mapped SQL query is no different than our previous examples using HQL queries. The getNamedQuery() method is used to load both kinds, and they both implement the Query interface. So our Java method invoking this query should look familiar. Add the code in Example 9-24 after the printTrackSummary() method in QueryTest.java.
Example 9-24. Calling a native SQL mapped query
/**
 * Print tracks that end some number of seconds into their final minute.
 *
 * @param seconds the number of seconds at which we want tracks to end.
 * @param session the Hibernate session that can retrieve data.
 * @throws HibernateException if there is a problem.
 **/
public static void printTracksEndingAt(int seconds, Session session)
    throws HibernateException
{
    Query query = session.getNamedQuery("com.oreilly.hh.tracksEndingAt");
    query.setInteger("seconds", seconds);
    List results = query.list();
    for (ListIterator iter = results.listIterator() ; iter.hasNext() ; ) {
        Track aTrack = (Track)iter.next();
        System.out.println("Track: " + aTrack.getTitle() +
            ", length=" + aTrack.getPlayTime());
    }
}
Finally, add some lines to main() that call this method. Example 9-25 shows them added after the invocation of printTrackSummary().
Example 9-25. Calling printTracksEndingAt() to display tracks ending at a half minute
...
printTrackSummary(CreateTest.getArtist("Samuel Barber",
false, session), session);
System.out.println("Tracks ending halfway through final minute:");
printTracksEndingAt(30, session);
} finally {
// No matter what, close the session
...
These changes produce the additional output shown in Example 9-26 when ant qtest is run.
Example 9-26. Sample output from the native SQL query
qtest:
[java] Track: Video Killed the Radio Star [ID=1]
...
[java] Summary of tracks by Samuel Barber:
[java] Total tracks: 2
[java] Shortest track: 00:06:35
[java] Longest track: 00:07:39
[java] Tracks ending halfway through final minute:
[java] Track: Russian Trance, length=00:03:30
There's a lot more tedium and attention to detail required in using a native SQL query than an HQL query (especially when your query starts getting complex or referring to multiple tables), but it's nice to know that it is possible on the rare occasions where you really need one.
9.5.2 What about...
...Well, lots of things? You undoubtedly suspect this chapter barely scratches the surface of what you can do with HQL. That's definitely true. When you start combining some of these capabilities, and working with collections, associations, and powerful expressions, you can achieve some remarkable things. We can't possibly cover them all in this introduction, so you'll want to take a close look at the HQL section and examples in the Hibernate reference documentation, and do some experimentation on your own.
When you look through the Hibernate Query Language chapter in the reference documentation, be sure to look at the interesting things you can use in expressions, especially as they apply to collections. Don't miss the way you can use array bracket notation to select elements of indexed collections—you can even put arithmetic expressions inside the brackets.
The "Tips and Tricks" section that follows their longer examples gives some useful advice about working efficiently in different database environments, and using a variety of Hibernate interfaces to achieve useful results in ways you might not think of, especially if you're coming from a SQL background.
Hopefully, this discussion has helped you get a grounding in the basics, and it will serve as a starting point and anchor for the explorations on which you will embark!
NOTE
This isn't your father's SQL....
Chapter 6. Deducing the Best Execution Plan
For Tyme ylost may nought recovered be.
—Geoffrey Chaucer, Troilus and Criseyde
Just as reducing a word problem to abstract mathematics is
usually the hardest part of solving the problem, you will usually find that
producing the query diagram is harder than deducing the best execution plan from
the query diagram. Now that you know the hard part, how to translate a query
into a query diagram, I demonstrate the easy part. There are several questions you need to
answer to fully describe the optimum execution plan for a query:
-
How do you reach each table in the execution plan, with a full table scan or one or more indexes, and which indexes do you use, if any?
-
How do you join the tables in the execution plan?
-
In what order do you join the tables in the execution plan?
Out of these three questions, I make a case that the only hard
question, and the main point of the query diagram, is the question of join
order. If you begin by finding the optimum join order, which is nearly decoupled
from the other questions, you will find that answers to the other two questions
are usually obvious. In the worst cases, you might need to try experiments to
answer the other two questions, but these will require at most one or two
experiments per table. If you did not have a systematic way to answer the
join-order question, you would require potentially billions of experiments to
find the best plan.
6.1 Robust Execution Plans
A subset of all possible execution plans can
be described as robust. While such plans are not
always quite optimum, they are almost
always close to optimum in real-world queries, and they have desirable
characteristics, such as predictability and low likelihood of errors during
execution. (A nonrobust join can fail altogether, with an out-of-TEMP-space
error if a hash or sort-merge join needs more space than is available.) Robust
plans tend to work well across a wide range of likely data distributions that
might occur over time or between different database instances running the same
application. Robust plans are also relatively forgiving of uncertainty and
error; with a robust plan, a moderate error in the estimated selectivity of a
filter might lead to a moderately suboptimal plan, but not to a disastrous plan.
When you use robust execution plans, you can almost always solve a SQL tuning
problem once, instead of solving it many times as you encounter different data
distributions over time and at different customer sites.
Robust execution plans tend to have the following
properties:
-
Their execution cost is proportional to rows returned.
-
They require almost no sort or hash space in memory.
-
They need not change as all tables grow.
-
They have moderate sensitivity to distributions and perform adequately across many instances running the same application, or across any given instance as data changes.
-
They are particularly good when it turns out that a query returns fewer rows than you expect (when filters are more selective than they appear).
Robustness requirements imply that you should usually choose
to:
-
Drive to the first table on a selective index
-
Drive down to primary keys before you drive up to nonunique foreign keys
If you consider only robust plans, robustness rules alone
answer the first two questions of finding the best execution plan, leaving only
the question of join order:
-
You will reach every table with a single index, an index on the full filter condition for the first table, and an index on the join key for each of the other tables.
-
You will join all tables by nested loops.
I later discuss when you can sometimes safely and profitably
relax the robustness requirement for nested-loops joins, but for now I focus on
the only remaining question for robust plans: the join order. I also later
discuss what to do when the perfect execution plan is unavailable, usually
because of missing indexes, but for now, assume you are looking for a truly
optimum robust plan, unconstrained by missing indexes.
6.2 Standard Heuristic Join Order
-
Drive first into the table with the best (nearest to zero) filter ratio; that table becomes the driving table.
-
With nested loops—driving through the full, unique, primary-key indexes—drive down as long as possible, first to the best (nearest to zero) remaining filters.
-
Only when necessary, follow nested loops up diagram links (against the direction of the arrow) through full, nonunique, foreign-key indexes.
These steps might not be perfectly clear now. Don't worry. The
rest of this chapter explains each of these steps in detail. The heuristic is
almost easier to demonstrate than to describe.
When the driving table turns out to be several levels below the
top detail table (the root table, so-called
because it lies at the root of the join tree), you will have to return to Step 2
after every move up the tree in Step 3. I describe some rare refinements for
special cases, but by and large, finding the optimum plan is that simple, once
you have the query diagram!
After tuning thousands of queries from real-world applications
that included tens of thousands of queries, I can state with high confidence
that these rules are just complex enough. Any
significant simplification of these rules will leave major, common classes of
queries poorly tuned, and any addition of complexity will yield significant
improvement only for relatively rare classes of queries.
There is one subtlety to consider when Steps 2 and 3 mention
following join links up or down: the tables reached in the plan so far are
consolidated into a single virtual node, for purposes of choosing the next step
in the plan. Alternatively, it might be easier to visualize the tables reached
so far in the plan as one cloud of nodes. From the cloud of
already-reached nodes, it makes no difference to the rest of the plan how
already-reached table nodes are arranged within that cloud, or in what order you
reached them. The answer to the question "Which table comes next?" is completely
independent of the order or method you used to join any earlier tables. Which
tables are in the cloud affects the boundaries of the cloud and matters, but how
they got there is ancient history, so to speak, and no longer relevant to your
next decision.
When you put together an execution plan following the rules,
you might find yourself focused on the most-recently-joined table, but this is a
mistake. Tables are equally joined upward or downward from the cloud if they are
joined upward or downward from any member of the set of already-joined tables,
not necessarily the most-recently-joined table. You might even find it useful to
draw the expanding boundaries of the cloud of so-far-reached tables as you
proceed through the steps, to clarify in your mind which tables lie just outside
the cloud. The relationship of remaining tables to the cloud clarifies whether
they join to the cloud from above or from below, or do not even join to the
cloud directly, being ineligible to join until you join further intermediate
tables. I further illustrate this point later in the chapter.
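To make the procedure concrete, here is a rough sketch of the heuristic in code. It is purely illustrative and not from this chapter: the Node class, its filter ratios, and its link lists are invented stand-ins for a real query diagram, and the sketch ignores the special-case refinements discussed later.

import java.util.*;

/** Illustrative sketch of the standard heuristic join order. */
public class JoinOrderSketch {

    /** One table node in a query diagram. */
    static class Node {
        final String name;
        final double filterRatio;                 // 1.0 means "no filter"
        final List downJoins = new ArrayList();   // master tables below this node
        final List upJoins = new ArrayList();     // detail tables above this node
        Node(String name, double filterRatio) {
            this.name = name;
            this.filterRatio = filterRatio;
        }
    }

    /** Returns a join order: drive from the best filter, then always prefer
     *  downward joins, moving upward only when no downward join remains. */
    static List joinOrder(List allNodes) {
        Node driving = (Node) allNodes.get(0);
        for (Iterator i = allNodes.iterator(); i.hasNext(); ) {
            Node n = (Node) i.next();
            if (n.filterRatio < driving.filterRatio) driving = n;    // Step 1
        }
        List order = new ArrayList();
        Set cloud = new HashSet();                // tables reached so far
        order.add(driving);
        cloud.add(driving);
        while (order.size() < allNodes.size()) {
            Node next = pick(cloud, true);                   // Step 2: go down
            if (next == null) next = pick(cloud, false);     // Step 3: else go up
            if (next == null) break;              // disconnected diagram; give up
            order.add(next);
            cloud.add(next);
        }
        return order;
    }

    /** Best not-yet-reached node joined to the cloud in the given direction. */
    static Node pick(Set cloud, boolean downward) {
        Node best = null;
        for (Iterator i = cloud.iterator(); i.hasNext(); ) {
            Node reached = (Node) i.next();
            List links = downward ? reached.downJoins : reached.upJoins;
            for (Iterator j = links.iterator(); j.hasNext(); ) {
                Node candidate = (Node) j.next();
                if (!cloud.contains(candidate) &&
                        (best == null || candidate.filterRatio < best.filterRatio)) {
                    best = candidate;             // prefer the best remaining filter
                }
            }
        }
        return best;
    }
}

The point to notice in the sketch is that each pass looks outward from the whole cloud of already-reached tables, not just from the most recently joined one.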
6.3 Simple Examples
Nothing illustrates the method better than examples, so I
demonstrate the method using the query diagrams built in Chapter 5, beginning
with the simplest case, the two-way join, shown again in Figure 6-1.
Figure 6-1. A simple two-way join
Applying Step 1 of the method, first ask which node offers the
best (lowest) effective filter ratio. The answer is E, since
E's filter ratio of 0.1 is less than D's ratio of 0.5. Driving
from that node, apply Step 2 and find that the best (and only) downstream node
is node D, so go to D next. You find no other tables, so that
completes the join order. Following the rules for a robust execution plan, you
would reach E with an index on its filter, on Exempt_Flag.
Then, you would follow nested loops to the matching departments through the
primary-key index on Department_ID for Departments. By brute
force, in Chapter 5, I already
showed the comforting result that this plan is in fact the best, at least in
terms of minimizing the number of rows touched.
6.3.1 Join Order for an Eight-Way Join
So far, so good, but two-way joins are too easy to need an
elaborate new method, so let's continue with the next example, the eight-way
join. Eight tables can, in theory, be joined in 8-factorial join orders (40,320
possibilities), enough to call for a systematic method. Figure 6-2 repeats the earlier problem from Chapter
5.
Figure 6-2. A typical eight-way join
Following the heuristics outlined earlier, you can determine
the optimal join order for the query diagrammed in Figure 6-2 as follows:
-
Find the table with the lowest filter ratio. In this case, it's C, with a ratio of 0.0002, so C becomes the driving table.
-
From C, it is not possible to drive down to any table through a primary-key index. You must therefore move upward in the diagram.
-
The only way up from C is to O, so O becomes the second table to be joined.
-
After reaching O, you find that you can now drive downward to OT. Always drive downward when possible, and go up only when you've exhausted all downward paths. OT becomes the third table to be joined.
-
There's nothing below OT, so return to O and move upward to OD, which becomes the fourth table to be joined.
-
The rest of the joins are downward and are all unfiltered, so join to S, P, and ODT in any order.
-
Join to A at any point after it becomes eligible, after the join to S places it within reach.
Therefore, you find the optimum join order to be C;
O; OT; OD; S, P, and ODT
in any order; and A at any point after S. This dictates 12
equally good join orders out of the original 40,320 possibilities. An exhaustive
search of all possible join orders confirms that these 12 are equally good and
are better than all other possible join orders, to minimize rows visited in
robust execution plans.
This query diagram might strike you as too simple to represent
a realistic problem, but I have found this is not at all the case. Most queries
with even many joins have just one or two filters, and one of the filter ratios
is usually obviously far better than any other.
For the most part, simply driving to that best filter first,
following downward joins before upward joins, and perhaps picking up one or two
minor filters along the way, preferably sooner rather than later, is all it
takes to find an excellent execution plan. That plan is almost certainly the
best or so close to the best that the difference does not matter. This is where
the simplified query diagrams come in. The fully simplified query diagram, shown
in Figure 6-3, with the best
filter indicated by the capital F and the other filter by lowercase
f, leads to the same result with only qualitative filter
information.
Figure 6-3. Fully simplified query diagram for the same eight-way join
I will return to this example later and show that you can
slightly improve on this result by relaxing the requirement for a fully robust
execution plan and using a hash join, but for now, I focus on teaching complete
mastery of the skill of optimizing for the best robust plan. I already showed
the 12 best join orders, and I need one of these for further illustration to
complete the solution of the problem. I choose (C, O, OT, OD, ODT,
P, S, A) as the join order to illustrate.
6.3.2 Completing the Solution for an Eight-Way Join
The rest of the solution is to apply the robust-plan rules to
get the desired join order in a robust plan. A robust plan calls for nested
loops through indexes, beginning with the filter index on the driving table and
followed by indexes on the join keys. Here is the best plan in detail (refer
back to Chapter 5 for the
original query and filter conditions):
-
Drive to Customers through an index on its filter condition (the customer-name filter).
-
Join, with nested loops, to Orders on an index on the foreign key Customer_ID.
-
Join, with nested loops, to Code_Translations (OT) on its primary-key index (Code_Type, Code).
-
Join, with nested loops, to Order_Details on an index on the foreign key Order_ID.
-
Join, with nested loops, to Code_Translations (ODT) on its primary-key index (Code_Type, Code).
-
Outer join, with nested loops, to Products on its primary-key index Product_ID.
-
Outer join, with nested loops, to Shipments on its primary-key index Shipment_ID.
-
Outer join, with nested loops, to Addresses on its primary-key index Address_ID.
-
Sort the final result as necessary.
Any execution plan that failed to follow this join order,
failed to use nested loops, or failed to use precisely the indexes shown would
not be the optimal robust plan chosen here. Getting the driving table and index
right is the key problem 90% of the time, and this example is no exception. The
first obstacle to getting the right plan is to somehow gain access to the
correct driving filter index for Step 1. In Oracle, you might use a functional
index on the uppercase values of the Last_Name and First_Name
columns, to avoid the dilemma of driving to an index with a complex expression.
In other databases, you might recognize that the name values are, or should be,
always stored in uppercase, or you might denormalize with new, indexed columns
that repeat the names in uppercase, or you might change the application to
require a case-sensitive search. There are several ways around this specific
problem, but you would need to choose the right driving table to even discover
the need.
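As a sketch of the Oracle option just mentioned (the index name is hypothetical), the function-based index might be created like this:
-- Lets a filter on UPPER(Last_Name), UPPER(First_Name) drive to an index.
CREATE INDEX Customer_Name_Upper
  ON Customers (UPPER(Last_Name), UPPER(First_Name))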
Once you make it possible to get the driving table right, are
your problems over? Almost certainly, you have indexes on the necessary primary
keys, but good database design does not (and should not) guarantee that every
foreign key has an index, so the next likely issue is to make sure you have
foreign-key indexes Orders(Customer_ID) and
Order_Details(Order_ID). These enable the necessary nested-loops joins
upward for a robust plan starting with Customers.
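If either index is missing, the fix is ordinary DDL along these lines (the index names are illustrative):
-- Foreign-key indexes that enable the upward nested-loops joins.
CREATE INDEX Orders_Customer_ID_N1 ON Orders (Customer_ID);
CREATE INDEX Order_Details_Order_ID_N1 ON Order_Details (Order_ID);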
Another potential problem is that the optimizer might choose a
join method other than nested loops for one or more of the tables, and you might
need hints or other techniques to force nested loops. If the optimizer takes that
course, it will likely also choose a different access
method for the tables joined without nested loops, reaching all the rows
that can join at once.
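In Oracle, for example, one way to steer the optimizer toward the all-nested-loops plan is to combine the ORDERED and USE_NL hints, as in this sketch (the hint list and the elided select list and conditions are illustrative, and further hints might still be needed):
SELECT /*+ ORDERED USE_NL(O OT OD ODT P S A) */ ...
FROM Customers C, Orders O, Code_Translations OT, Order_Details OD,
     Code_Translations ODT, Products P, Shipments S, Addresses A
WHERE ...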
In this simple case, with just the filters shown, join order is
likely the least of the problems, as long as you get the driving table right and
have the necessary foreign-key indexes.
6.3.3 A Complex 17-Way Join
Figure 6-4
shows a deliberately complex example to fully illustrate the method. I have left
off the join ratios, making Figure 6-4 a partially simplified query
diagram, since the join ratios do not affect the rules you have learned so far.
I will step through the solution to this problem in careful detail, but you
might want to try it yourself first, to see which parts of the method you
already fully understand.
Figure 6-4. A complex 17-way join

For Step 1, you quickly find that B4 has the best
filter ratio, at 0.001, so choose that as the driving table. It's best to reach
such a selective filter through an index; so, in real life, if this were an
important enough query, you might create a new index to use in driving to
B4. For now though, we'll just worry about the join order. Step 2
dictates that you next examine the downward-joined nodes C4 and
C5, with a preference to join to better-filtered nodes first.
C5 has a filter ratio of 0.2, while C4 has a filter ratio of
0.5, so you join to C5 next. At this point, the beginning join order is
(B4, C5), and the cloud around the so-far-joined tables looks like Figure 6-5.
Figure 6-5. So-far-joined cloud, with two tables joined

If C5 had one or more nodes connected below, you would
now have to compare them to C4, but since it does not, Step 2 offers
only the single choice of C4. When you widen the cloud boundaries to
include C4, you find no further nodes below the cloud, so you move on
to Step 3, find the single node A2 joining the cloud from above, and
add it to the building join order, which is now (B4, C5, C4, A2). The
cloud around the so-far-joined tables now looks like Figure 6-6.
Figure 6-6. So-far-joined cloud, after four tables joined

Note that I left in the original two-node cloud in gray. In
practice, you need not erase the earlier clouds; just redraw new clouds around
them. Returning to Step 2, find downstream of the current joined-tables cloud a
single node, B3, so put it next in the join order without regard to any
filter ratio it might have. Extend the cloud boundaries to include B3,
so you now find nodes C2 and C3 applicable under Step 2, and
choose C2 next in the join order, because its filter ratio of 0.5 is
better than the implied filter ratio of 1.0 on unfiltered C3. The join
order so far is now (B4, C5, C4, A2, B3, C2). Extend the cloud
further around C2. This brings no new downstream nodes into play, so
Step 2 now offers only C3 as an alternative. The join order so far is
now (B4, C5, C4, A2, B3, C2, C3), and Figure 6-7 shows the current join cloud.
Figure 6-7. So-far-joined cloud, with seven tables joined

Step 2 now offers two new nodes below the current join cloud,
D1 and D2, with D1 offering the better filter ratio.
Since neither of these has nodes joined below, join to them in filter-ratio
order and proceed to Step 3, with the join order so far now (B4,
C5, C4, A2, B3, C2, C3, D1, D2). This completes the entire branch from
A2 down, leaving only the upward link to M (the main table
being queried) to reach the rest of the join tree, so Step 3 takes you next to
M. Since you have reached the main table at the root of the join tree,
Step 3 does not apply for the rest of the problem. Apply Step 2 until you have
reached the rest of the tables. Immediately downstream of M (and of the
whole join cloud so far), you find A1 and A3, with only
A1 having a filter, so you join to A1 next. Now, the join
order so far is (B4, C5, C4, A2, B3, C2, C3, D1, D2, M, A1),
and Figure 6-8 shows the
current join cloud.
Figure 6-8. So-far-joined cloud, with 11 tables joined

Immediately downstream of the join cloud you find nodes
B1, B2, and A3, but none of these has a filter, so
look for the best filter ratios two steps away. You
find such two-away filters on C1 and B5, and C1 has the better,
so add B1 and C1 to the running join order. Now, the join
order so far is (B4, C5, C4, A2, B3, C2, C3, D1, D2, M, A1, B1,
C1). You still find no better prospect than the remaining two-away filter
on B5, so join next to A3 and B5, in that order. Now,
the only two nodes remaining, B2 and C6, are both eligible to
join next, being both directly attached to the join cloud. Choose C6
first, because it has a filter ratio of 0.9, which is better than the implied
filter ratio of 1.0 for the unfiltered join to B2. This completes the
join order: (B4, C5, C4, A2, B3, C2, C3, D1, D2, M, A1, B1, C1, A3, B5, C6,
B2).
Apart from the join order, the rules specify that the database
should reach table B4 on the index for its filter condition and the
database should reach all the other tables in nested loops to the indexes on
their join keys. These indexed join keys would be primary keys for downward
joins to C5, C4, B3, C2, C3,
D1, D2, A1, B1, C1, A3,
B5, C6, and B2, and foreign keys for A2
(pointing to B4) and M (pointing to A2). Taken
together, this fully specifies a single optimum robust plan out of the
17-factorial (355,687,428,096,000) possible join orders and all possible join
methods and indexes. However, this example is artificial in two respects:
-
Real queries rarely have so many filtered nodes, so it is unlikely that a join of so many tables would have a single optimum join order. More commonly, there will be a whole range of equally good join orders, as I showed in the previous example.
-
The later part of the join order matters little to the runtime, as long as you get the early join order right and reach all the later tables through their join keys. In this example, once the database reached node M, and perhaps A1, with the correct path, the path to the rest of the tables would affect the query runtime only slightly. In most queries, even fewer tables really matter to the join order, and often you will do fine with just the correct driving table and nested loops to the other tables following join keys in any order that the join tree permits.
6.4 A Special Case
Usually, following the tuning-process steps
without deviation works fine, but the problem shown in Figure
6-4 turns out to offer, potentially, one last trick you could apply,
especially with Oracle.
6.4.1 The Oracle Solution
Return to Figure
6-4 and consider a special case: all tables other than M are
relatively small and well cached, but M is very large and therefore
poorly cached and much more expensive to access than the other tables.
Furthermore, M is a special combinations
table to express a many-to-many relationship between A1 and
A2. An example of such a combinations table would be a table of
actor/movie combinations for a movie-history database, linking Movies
and Actors, when the relationship between these is many-to-many. In
such a combinations table, it is natural to use a two-part primary key made up
of the IDs of the linked tables—in this case, Movie_ID and
Actor_ID. As is commonly the case, this combinations table has an index
on the combination of foreign keys pointing to the tables A2 and
A1. For this example, assume the order of the keys within that index
has the foreign key pointing to A2 first, followed by the foreign key
pointing to A1.
Consider the costs of accessing each table in the plan I found
earlier as the original solution to Figure
6-4. You find low costs for the tables up to M, then a much higher
cost for M because you get many more rows from that table than the
previous tables and accessing those rows leads to physical I/O. Following
M, the database joins to just as many rows in A1 (since
M has no filter), but these are much cheaper per row, because they are
fully cached. Then, the filter in A1 drops the rowcount back down to a
much lower number for the remaining tables in the plan. Therefore, you find a
cost almost wholly dominated by the cost of accessing M, and it would
be useful to reduce that cost.
As it happens, in this unusual case, you find an opportunity to
get from the foreign key in M pointing to A2 to the foreign
key pointing to A1, stored in the same multicolumn index in M,
without ever touching the table M! The database will later need to read
rows from the table itself, to get the foreign key pointing to A3 and
probably to get columns in the query SELECT list. However, you can
postpone going to the table until after the database reaches filtered tables
A1 and C1. Therefore, the database will need to go only to 18%
as many rows in M (0.3 x 0.6, picking up the filter ratios 0.3 on
A1 and 0.6 on C1) as it would need to read if the database
went to the table M as soon as it went to the index for M.
This greatly reduces the query cost, since the cost of reading rows in table
M, itself, dominates in this particular case.
No database makes it particularly easy to decouple index reads
from table reads; a table read normally follows an index read immediately,
automatically, even when this is not optimal. However, Oracle does allow for a
trick that solves the problem, since Oracle SQL can explicitly reference rowids.
In this case, the best join order is (B4, C5, C4, A2, B3, C2, C3, D1, D2,
MI, A1, B1, C1, MT, A3, B5, C6, B2). Here, MI is the index on
M(FkeyToA2,FkeyToA1), inserted into the join order where M was
originally. MT is the table M, accessed later in the plan
through the ROWID from MI and inserted into the join order
after picking up the filters on A1 and C1. The trick is to
refer to M twice in the FROM clause, once for the index-only
access and once for a direct ROWID join, as follows, assuming that the
name of the index on M(FkeyToA2,FkeyToA1) is
M_DoubleKeyInd:
Select /*+ ORDERED INDEX(MI M_DoubleKeyInd) */ MT.Col1, MT.Col2,...
...
FROM B4, C5, C4, A2, B3, C2, C3, D1, D2, M MI,
     A1, B1, C1, M MT, A3, B5, C6, B2
WHERE ...
  AND A2.Pkey=MI.FKeyToA2
  AND A1.Pkey=MI.FKeyToA1
  AND MI.ROWID=MT.ROWID
  AND...
So, two joins to M are really cheaper than one in this
unusual case! Note the two hints in this version of the query:
-
The ORDERED hint instructs the optimizer to join the tables in the order they appear in the FROM clause, which places the index-only alias MI and the table alias MT exactly where the derived join order requires them.
-
The INDEX(MI M_DoubleKeyInd) hint forces access to the alias MI through the two-column foreign-key index, which is what makes the index-only read possible.
Other hints might be necessary to get the rest of the plan just
right. Also note the unusual rowid-to-rowid join between MI and
MT, and note that the only references to MI are in the
FROM clause and in the WHERE clause conditions shown. These
references require only data (the two foreign keys and the rowid) stored in the
index. This is crucial: Oracle avoids going to the table M as soon as
it reaches the index on the primary key to M only because MI
refers solely to the indexed columns and the rowids that are also stored in the
index. Columns in the SELECT clause and elsewhere in the
WHERE-clause conditions (such as a join to A3, not shown) all
come from MT. Because of all this, the optimizer finds that the only
columns required for MI are already available from the index. It counts
that join as index-only. The direct-ROWID join to MT occurs
later in the join order, and any columns from the M table are selected
from MT.
However, the technique I've just described is not usually
needed; the combination of circumstances that makes it pay off is rare.
6.4.2 Solving the Special Case Outside of Oracle
If you cannot refer directly to rowids in your WHERE
clause, can this trick still work in some form? The only part of the trick that
depends on rowids is the join between MI and MT. You could
also join those table aliases on the full primary key. The cost of the early
join to MI would be unchanged, but the later join to MT would
require looking up the very same index entry you already reached in the
index-only join to MI for those rows that remain. This is not as
efficient, but note that these redundant hits on the index will surely be
cached, since the execution plan touches the same index entries moments before,
leaving only the cost of the extra logical I/O. Since the database throws most
of the rows out before it even reaches MT, this extra logical I/O is
probably much less than the savings on access to the table itself.
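Reusing the hypothetical names from the Oracle version, the portable form of the trick simply joins the two aliases of M on the full two-part key instead of on ROWID, roughly as follows:
SELECT MT.Col1, MT.Col2,...
...
FROM B4, C5, C4, A2, B3, C2, C3, D1, D2, M MI,
     A1, B1, C1, M MT, A3, B5, C6, B2
WHERE ...
  AND A2.Pkey=MI.FKeyToA2
  AND A1.Pkey=MI.FKeyToA1
  -- Join the index-only alias to the table alias on the full key,
  -- rather than on ROWID, for databases without rowid joins.
  AND MT.FKeyToA2=MI.FKeyToA2
  AND MT.FKeyToA1=MI.FKeyToA1
  AND...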
6.5 A Complex Example
Now,
I demonstrate an example with a less straightforward join order, one that
requires more attention to the whole join cloud. You will find that the sequence
of next-best tables can jump all over the query diagram in cases like the one I
describe in this section. Consider the problem in Figure 6-9, and try it yourself before you
read on.
Figure 6-9. Another problem with the same join skeleton
Here, you see, as is quite common, that the best filter falls
on the root detail table.
Since you will drive from the root detail table, Step 3 will
never apply; you have no nodes upstream of the starting point. The cloud of
tables joined so far will grow downward from the top, but keep in mind that you
can find the next-best node anywhere along the boundary of this cloud, not
necessarily near the last table joined. Try to find the best join order yourself
before you read on.
The first eligible nodes are A1, A2, and
A3, and the best filter ratio lies on A1, so A1 falls
second in the join order. After extending the join cloud to A1, add
B1 and B2 to the eligible nodes, and A2 and
A3 are still eligible. Between these four nodes, A3 has the
best filter ratio, so join to it next. The join order, so far, is (M, A1,
A3), and the join cloud now looks like Figure 6-10.
Figure 6-10. Join cloud after the first three tables

The list of eligible nodes attached to the join cloud is now
B1, B2, A2, and B5. B1 has the best
filter ratio among these, so join to it next, extending the cloud and adding
C1 to the list of eligible nodes, which is now B2,
A2, B5, and C1. Among these, C1 is the best,
so join to it next and extend the cloud further. C1 has no downstream
nodes, so proceed to the next-best node on the current list, A2, which
adds B3 and B4 to the list of eligible nodes, which is now
B2, B5, B3, and B4. The join order, so far,
is (M, A1, A3, B1, C1, A2), and the join cloud now looks like
Figure 6-11.
Figure 6-11. Join cloud after the first six tables

Among the eligible nodes, B4 has, by far, the best
filter ratio, so it comes next. (It would have been great to join to it earlier,
but it did not become eligible until the database reached A2.) The join
order to B4 adds C4 and C5 to the eligible list,
which now includes B2, B5, B3, C4, and
C5. Of these, C5 is by far the best, so it comes next. The
join order, so far, is (M, A1, A3, B1, C1, A2, B4, C5), and the join
cloud now looks like Figure
6-12.
Figure 6-12. Join cloud after the first eight tables

Eligible nodes attached below the join cloud now include
B2, B3, B5, and C4, and the best filter
among these is B5. B5 adds C6 to the eligible list,
and the next-best on the list is C4, which adds no new node to the
list. C6 is the next-best node, but it also adds no new node to the
eligible-nodes list, which is now just B2 and B3. Neither of
these even has a filter, so you look for two-away filters and find that
B3 at least gives access to the filter on C2, so you join to
B3 next. The join order, so far, is (M, A1, A3, B1, C1,
A2, B4, C5, B5, C4, C6, B3), and the join cloud now looks like Figure 6-13.
Figure 6-13. Join cloud after the first 12 tables

You now find eligible nodes B2, C2, and
C3, and only C2 has a filter, so join to C2 next. It
has no downstream node, so choose between B2 and C3 and again
use the tiebreaker that C3 at least gives access to two-away filters on
D1 and D2, so join C3 next. The join order, so far,
is now (M, A1, A3, B1, C1, A2, B4, C5, B5, C4, C6, B3, C2,
C3). The eligible downstream nodes are now B2, D1,
and D2. At this point in the process, the eligible downstream nodes are
the only nodes left, having no nodes further downstream. Just sort the nodes
left by filter ratio, and complete the join order: (M, A1, A3, B1,
C1, A2, B4, C5, B5, C4, C6, B3, C2, C3, D1, D2, B2). In real queries, you
usually get to the point of just sorting immediately attached nodes sooner. In
the common special case of a single detail table with only direct joins to
master-table lookups (dimension tables, usually), called a star join, you sort master nodes right from the
start.
Given the optimal order just derived, complete the
specification of the execution plan by calling for access to the driving table
M from an index for the filter condition on that table. Then join the
other tables using nested loops joins that follow indexes on those tables'
primary keys.
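In Oracle, for instance, you might pin down this plan with hints roughly as follows; M_Filter_Index is a hypothetical name for the index on M's filter column, and other hints might still prove necessary:
SELECT /*+ ORDERED INDEX(M M_Filter_Index)
           USE_NL(A1 A3 B1 C1 A2 B4 C5 B5 C4 C6 B3 C2 C3 D1 D2 B2) */ ...
FROM M, A1, A3, B1, C1, A2, B4, C5, B5, C4, C6, B3, C2, C3, D1, D2, B2
WHERE ...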
6.6 Special Rules for Special Cases
The heuristic rules so far handle most cases
well and nearly always generate excellent, robust plans. However, there are some
assumptions behind the rationale for these rules, which are not always true.
Surprisingly often, even when the assumptions are wrong, they are right enough to yield a plan that is at least close to
optimum. Here, I lay these assumptions out and examine more sophisticated rules
to handle the rare cases in which deviations from the assumptions matter:
-
Intermediate query results in the form of Cartesian products lead to poor performance. If you do not follow the joins when working out a join order, the result is a Cartesian product between the first set of rows and the rows from the leaped-to node. Occasionally, this is harmless, but even when it is faster than a join order following the links, it is usually dangerous and scales poorly as table volumes increase.
-
Detail join ratios are large, much greater than 1.0. Since master join ratios (downward on the join tree) are never greater than 1.0, this makes it much safer to join downward as much as possible before joining upward, even when upward joins give access to more filters. (The filters on the higher nodes usually do not discard more rows than the database picks up going to the more detailed table.) Even when detail join ratios are small, the one-to-many nature of the join offers at least the possibility that they could be large in the future or at other sites running the same application. This tends to favor SQL that is robust for the case of a large detail join ratio, except when you have high confidence that the local, current statistics are relatively timeless and universal.
-
The table at the root of the join tree is significantly larger than the other tables, which serve as master tables or lookup tables to it. This assumption follows from the previous assumption. Since larger tables have poorer hit ratios in cache and since the rowcount the database reads from this largest table is often much larger than most or all other rowcounts it reads, the highest imperative in tuning the query is to minimize the rowcount read from this root detail table.
-
When tables are big enough that efficiency matters, there will be one filter ratio that is much smaller than the others. Near-ties, two tables that have close to the same filter ratio, are rare. If the tables are large and the query result is relatively small, as useful query results almost always are, then the product of all filter ratios must be quite small. It is much easier to get this small result with one selective filter, sometimes combined with a small number of fairly unselective filters, than with a large number of comparable, semiselective filters. Coming up with a business rationale for lots of semiselective filters turns out to be difficult, and, empirically speaking, I could probably count on one hand the number of times I've seen such a case in over 10 years of SQL tuning. Given one filter that is much more selective than the rest, the way to guarantee reading the fewest rows from that most important root detail table is to drive the query from the table with that best filter.
-
The rowcount that the query returns, even before any possible aggregation, will be small enough that, even for tiny master tables, there is little or no incentive to replace nested loops through join keys with independent table reads followed by hash joins. Note that this assumption falls apart if you do the joins with a much higher rowcount than the query ultimately returns. However, the heuristics are designed to ensure that you almost never find a much higher rowcount at any intermediate point in the query plan than the database returns from the whole query.
The following sections add rules that handle rare special cases
that go against these assumptions.
6.6.1 Safe Cartesian Products
Consider the query diagrammed in Figure 6-14. Following the
usual rules (and breaking ties by choosing the leftmost node, just for
consistency), you drive into the filter on T1 and join to M
following the index on the foreign key pointing to T1. You then follow
the primary-key index into T2, discarding in the end the rows that fail
to match the filter on T2. If you assume that T1 has 100 rows,
based on the join ratios M must have 100,000 rows and T2 must
have 100 rows.
Figure 6-14. A query with a potentially good Cartesian-product execution plan
The plan just described would touch 1 row in T1, 1,000
rows in M (1% of the total), and 1,000 rows in T2 (each row in
T2 10 times, on average), before discarding all but 10 rows from the
result. Approximating query cost as the number of rows touched, the cost would
be 2,001. However, if you broke the rules, you could get a plan that does not
follow the join links. You could perform nested loops between T1 and
T2, driving into their respective filter indexes. Because there is no
join between T1 and T2 the result would be a Cartesian product
of all rows that meet the filter condition on T1 and all rows that meet
the filter condition on T2. For the table sizes given, the resulting
execution plan would read just a single row from each of T1 and
T2. Following the Cartesian join of T1 and T2, the
database could follow an index on the foreign key that points to T1,
into M, to read 1,000 rows from M. Finally, the database would
discard the 990 rows that fail to meet the join condition that matches
M and T2.
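A query shape consistent with Figure 6-14 might look roughly like this sketch, with hypothetical key and filter column names; the essential point is that T1 and T2 each join to M but not to each other, which is what makes the Cartesian product between them possible:
SELECT ...
FROM T1, M, T2
WHERE M.FKey_To_T1 = T1.PKey_ID
  AND M.FKey_To_T2 = T2.PKey_ID
  AND T1.Filter_Col = :b1    -- matches about 1 row in T1
  AND T2.Filter_Col = :b2    -- matches about 1 row in T2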
Using the Cartesian product, the plan costs just 1,002, using
the rows-touched cost function.
What happens if the table sizes double? The original plan cost,
following the join links, exactly doubles, to 4,002, in proportion to the number
of rows the query will return, which also doubles. This is normal for robust
plans, which have costs proportional to the number of rows returned. However,
the Cartesian-product plan cost is less well behaved: the database reads 2 rows
from T1; then, for each of those rows, reads the same 2 rows of
T2 (4 rows in all); then, with a Cartesian product that has 4 rows,
reads 4,000 rows from M. The query cost, 4,006, now is almost the same
as the cost of the standard plan. Doubling table sizes once again, the standard
plan costs 8,004, while the Cartesian-product plan costs 16,020. This
demonstrates the lack of robustness in most Cartesian-product execution plans,
which fail to scale well as tables grow. Even without table growth,
Cartesian-product plans tend to behave less predictably than standard plans,
because filter ratios are usually averages across the possible values. A filter
that matches just one row on average might sometimes match 5 or 10 rows for some
values of the variable. When a filter for a standard, robust plan is less
selective than average, the cost will scale up proportionally with the number of
rows the query returns. When a Cartesian-product plan runs into the same filter
variability, its cost might scale up as the square of the filter variability, or
worse.
Sometimes, though, you can have the occasional advantages of
Cartesian products safely. You can create a Cartesian product of as many
guaranteed-single-row sets as you like with perfect safety, with an inexpensive,
one-row result. You can even combine a one-row set with a multirow set and be no
worse off than if you read the multirow set in isolation from the driving table.
The key advantage of robust plans is rigorous avoidance of execution plans that
combine multiple multirow sets. You might recall that in Chapter 5 the rules
required you to place an asterisk next to unique filter conditions (conditions
guaranteed to match at most a single row). I haven't mentioned these asterisks
since, but they finally come into play here.
Consider Figure
6-15. Note that you find two unique filters, on B2 and C3.
Starting from the single row of C3 that matches its unique filter
condition, you know it will join to a single row of D1 and D2,
through their primary keys, based on the downward-pointing arrows to D1
and D2. Isolate this branch, treating it as a separate, single-row
query. Now, query the single row of B2 that matches its unique filter
condition, and combine these two independent queries by Cartesian product for a
single-row combined set.
Figure 6-15. A query with unique filter conditions
Placing these single-row queries first, you find an initial
join order of (C3, D1, D2, B2) (or (B2, C3, D1, D2);
it makes no difference). If you think of this initial prequeried single-row
result as an independent operation, you find that tables A1 and
B3 acquire new filter conditions, because you can know the values of
the foreign keys that point to B2 and C3 before you perform
the rest of the query. The modified query now looks like Figure 6-16, in which the already-executed
single-row queries are covered by a gray cloud, showing the boundaries of the
already-read portion of the query.
Figure 6-16. A query with unique filter conditions, with single-row branches preread
Upward-pointing arrows show that the initial filter condition
on A1 combines with a new filter condition on the foreign key into
B2 to reach a combined selectivity of 0.06, while the initially
unfiltered node B3 acquires a filter ratio of 0.01 from its foreign-key
condition, pointing into C3.
You can now optimize the remaining portion of the query,
outside the cloud, exactly as if it stood alone, following the standard rules.
You then find that A3 is the best driving table, having the best filter
ratio. (It is immaterial that A3 does not join directly to B2
or C3, since a Cartesian product with the single-row set is safe.) From
there, drive down to B5 and C6, then up to M. Since
A1 acquired added selectivity from its inherited filter on the foreign
key that points to B2, it now has a better filter ratio than
A2, so join to A1 next. So far, the join order is (C3, D1,
D2, B2, A3, B5, C6, M, A1), and the query, with a join cloud, looks like Figure 6-17.
Figure 6-17. A query with unique filter conditions, with single-row branches preread and a join cloud around the next five nodes read
Since you preread B2, the next eligible nodes are
B1 and A2, and A2 has the better filter ratio. This
adds B3 and B4 to the eligible list, and you find that the
inherited filter on B3 makes it the best next choice in the join order.
Completing the join order, following the normal rules, you reach B4,
C5, B1, C4, C2, and C1, in that
order, for a complete join order of (C3, D1, D2, B2, A3, B5, C6, M, A1, A2,
B3, B4, C5, B1, C4, C2, C1).
Even if you have just a single unique filter condition, follow
this process of prereading that single-row node or branch, passing the filter
ratio upward to the detail table above and optimizing the resulting remainder of
the diagram as if the remainder of the diagram stood alone. When the unique
condition is on some transaction table, not some type or status table, that
unique condition will also usually yield the best filter ratio in the query. In
this case, the resulting join order will be the same order you would choose if
you did not know that the filter condition was unique. However, when the best
filter is not the unique filter, the best join order can jump the join skeleton, which is to say that it does
not reach the second table through a join key that points to the first table.
6.6.2 Detail Join Ratios Close to 1.0
Treat
upward joins like downward joins when the join ratio is close to 1.0 and when
this allows access to useful filters (low filter ratios) earlier in the
execution plan. When in doubt, you can try both alternatives. Figure 6-18 shows a case in which two of the upward
joins are no worse than downward joins. Before you look at the solution,
try working it out yourself.
Figure 6-18. A case with detail join ratios equal to 1.0
As usual, drive to the best filter, on B4, with an
index, and reach the rest of the tables with nested loops to the join-key
indexes. Unlike previous cases, you need not complete all downward joins before
considering an upward join with a join ratio equal to 1.0.
As usual, look for filters in the immediately adjacent nodes
first, and find that the first two best join opportunities are C5 and
then C4. Next, you have only the opportunity for the upward join to
unfiltered node A2, which you would do next even if the detail join
ratio were large. (The low join ratio to A2 turned out not to matter.)
So far, the join order is (B4, C5, C4, A2).
From the cloud around these nodes, find immediately adjacent
nodes B3 (downward) and M (upward). Since the detail join
ratio to M is 1.0, you need not prefer to join downward, if other
factors favor M. Neither node has a filter, so look at filters on nodes
adjacent to them to break the tie. The best filter ratio adjacent to M
is 0.3 (on A1), while the best adjacent to B3 is 0.5 (on
C2), favoring M, so join next to M and A1.
The join order at this point is (B4, C5, C4, A2, M, A1). Now that the
database has reached the root node, all joins are downward, so the usual rules
apply for the rest of the optimization, considering immediately adjacent nodes
first and considering nodes adjacent to those nodes to break ties. The complete
optimum join order is (B4, C5, C4, A2, M, A1, B3, C2, B1, C1, A3, B5, C6,
C3, D1, (D2, B2)).
Note that, even in this specially contrived case designed to
show an exception to the earlier rule, you find only modest improvement for
reaching A1 earlier than the simple heuristics would allow, since the
improvement is relatively late in the join order.
6.6.3 Join Ratios Less than 1.0
If either the detail join ratio or the master
join ratio is less than 1.0, you have, in effect, a join that is, on average,
[some number]-to-[less than 1.0]. Whether the less-than-1.0 side of that join is
capable of being to-many is immaterial to the
optimization problem, as long as you are confident that the current average is
not likely to change much on other database instances or over time. If a
downward join with a normal master join ratio of 1.0 is preferred over a to-many
upward join, a join that follows a join ratio of less than 1.0 in any direction
is preferred even more. These join ratios that are less than 1.0 are, in a
sense, hidden filters that discard rows when you
perform the joins just as effectively as explicit single-node filters discard
rows, so they affect the optimal join order like filters.
6.6.3.1 Rules for join ratios less than 1.0
You need three new rules to account for the effect of
smaller-than-normal join ratios on choosing the optimum join order:
-
When choosing a driving node, all nodes on the filtered side of a join inherit the extra benefit of the hidden join filter. Specifically, if the join ratio less than 1.0 is J and the node filter ratio is R, use J x R when choosing the best node to drive the query. This has no effect when comparing nodes on the same side of a join filter, but it gives nodes on the filtered side of a join an advantage over nodes on the unfiltered side of the join.
-
When choosing the next node in a sequence, treat all joins with a join ratio J (a join ratio less than 1.0) like downward joins, and use J x R as the effective node filter ratio when comparing nodes, where R is the single-node filter ratio of the node reached through that filtering join.
-
However, for master join ratios less than 1.0, consider whether the hidden filter is better treated as an explicit foreign-key-is-not-null filter. Making the is-not-null filter explicit allows the detail table immediately above the filtering master join also to inherit the adjusted selectivity J x R for purposes of both choice of driving table and join order from above. See the following sections for more details on this rule.
6.6.3.2 Detail join ratios less than 1.0
The
meaning of the small join ratio turns out to be quite different depending on
whether it is the master join ratio or the detail join ratio that is less than
1.0. A detail join ratio less than 1.0 denotes the possibility of multiple
details, when it is more common to have no details of that particular type than
to have more than one. For example, you might have an Employees table
linked to a Loans table to track loans the company makes to a few top
managers as part of their compensation. The database design must allow for some
employees to have multiple loans, but far more employees have no loans from the
company at all, so the detail join ratio would be nearly 0. For referential
integrity, the Employee_ID column in Loans must point to a
real employee; that is its only purpose, and all loans in this table are to
employees. However, there is no necessity at all for an Employee_ID to
correspond to any loan. The Employee_ID column of Employees
exists (like any primary key) to point to its own row, not to point to rows in
another table, and there is no surprise when the join fails to find a match in
the upward direction, pointing from primary key to foreign key.
Since handling of detail join ratios less than 1.0 turns out to
be simpler, though less common, I illustrate that case first. I'll elaborate the
example of the previous paragraph to try to lend some plausibility to the new
rules. Beginning with a query that joins Employees to Loans,
add a join to Departments, with a filter that removes half the
departments. The result is shown in Figure 6-19, with each node labeled by the
initial of its table.
Figure 6-19. Simple example with a filtering join
Let there be 1,000 employees, 10 departments, and 10 loans to 8
of those employees. Let the only filter be the filter that discards half the
departments. The detail join ratio to Loans must be 0.01, since after
joining 1,000 employees to the Loans table, you would find only the 10
loans. The original rules would have you drive to the only filtered table,
Departments, reading 5 rows, joining to half the employees, another 500
rows, then joining to roughly half the loans (from the roughly 4 employees in
the chosen half of the departments). The database would then reach 5 loans,
while performing 496 unsuccessful index range scans on the Employee_ID
index for Loans using Employee_IDs of employees without
loans.
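The query being diagrammed might look roughly like this; apart from Employee_ID, the column names and the department filter are illustrative:
SELECT ...
FROM Loans L, Employees E, Departments D
WHERE L.Employee_ID = E.Employee_ID
  AND E.Department_ID = D.Department_ID
  AND D.Region = 'West'    -- hypothetical filter discarding half the departments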
On the other hand, if the Loans table inherits the
benefit of the filtering join, you would choose to drive from Loans,
reading all 10 of them, then go to the 10 matching rows in Employees (8
different rows, with repeats to bring the total
to 10). Finally, join to Departments 10 times, when the database
finally discards (on average) half. Although the usual objective is to minimize
rows read from the top table, this example demonstrates that minimizing reads to
the table on the upper end of a strongly filtering detail join is nowhere near
as important as minimizing rows read in the much larger table below it.
How good would a filter on Employees have to be to
bring the rows read from Employees down to the number read in the
second plan? The filter would have to be exactly as good as the filtering join
ratio, 0.01. Imagine that it were even better, 0.005, and led to just 5
employees (say, a filter on Last_Name). In that case, what table should
you join next? Again, the original rules would lead you to Departments,
both because it is downward and because it has the better filter ratio. However,
note that from 5 employees, the database will reach, on average, just 0.05
loans, so you are much better off joining to Loans before joining to
Departments.
6.6.3.3 Optimizing detail join ratios less than 1.0 with the rules
Figure 6-20
illustrates another example with a detail join ratio under 1.0. Try working out
the join order before you read on.
Figure 6-20. Example with a detail join ratio less than 1.0
First, examine the effect of the join ratio on the choice of
driving table. In Figure
6-21, I show the adjustments to filter ratios from the perspective of
choosing the driving table. After these adjustments, the effective filter of
0.003 on M is best, so drive from M. From this point, revert
to the original filter ratios to choose the rest of the join order, because,
when driving from any node (M, A2, or B2) on the
detail side of that join, this join ratio no longer applies. In a more complex
query, it might seem like a lot of work to calculate all these effective filter
values for many nodes on one side of a filtering join. In practice, you just
find the best filter ratio on each side (0.01 for A1 and 0.03 for
M, in this case) and make a single adjustment to the best filter ratio
on the filtered side of the join.
Figure 6-21. Effective filters for choosing the driving table
When choosing the rest of the join order, compare the original
filter ratios on A1 and A2, and choose A1 next.
Comparing B1 to A2, choose B1, and find the join
order so far: (M, A1, B1). The rest of the join order is fully
constrained by the join skeleton, for a complete join order of (M, A1, B1,
A2, B2).
Figure 6-22
leads to precisely the same result. It makes no difference that this time the
lowest initial filter ratio is not directly connected to the filtering join; all
nodes on the filtered side of the join get the benefit of the join filter when
choosing the driving table, and all nodes on the other side do not. Neither
A1 nor A2 offers a filter, so, driving from M, choose
A1 first for its better two-away filter on B1, and find the
same join order as in the last example.
Figure 6-22. Another example with a detail join ratio less than 1.0
In Figure
6-23, M and B2 get the same benefit from the filtering
join, so simply compare unadjusted filter ratios and choose B2. From
there, the join order is fully constrained by the join skeleton: (B2, A2, M,
A1, B1).
Figure 6-23. An example comparing only filters on the same side of the filtering join
In Figure
6-24, again you compare only filtered nodes on the same side of the
filtering join, but do you see an effect on later join order?
Figure 6-24. Another example comparing only filters on the same side of the filtering join
The benefit of the filtering join follows only if you follow
the join in that direction. Since you drive from A2, join to M
with an ordinary one-to-many join from A2, which you should postpone as
long as possible. Therefore, join downward to B2 before joining upward
to M, even though M has the better filter ratio. The join
order is therefore (A2, B2, M, A1, B1).
Try working out the complete join order for Figure 6-25 before you read on.
Figure 6-25. One last example with a detail join ratio less than 1.0
Here, note that the adjustment to filter ratios when choosing
the driving table is insufficient to favor the filtered side of the join; the
best filter, on A1, is still favored. Where do you go from there, though?
Now, the filtering join does matter. This theoretically one-to-many join is
really usually one-to-zero, so, even though it is upward on the diagram, you
should favor it over an ordinary downward join with a normal join ratio
(unshown, by convention) of 1.0. For purposes of choosing the next table, the
effective filter on M is J x R (0.1 x 0.3=0.03), better than the filter on
B1, so join to M next. However, when you compare A2
and B1, compare simple, unadjusted filter ratios, because you have, in
a sense, already burned the benefit of the filtering join to M. The
full join order, then, is (A1, M, B1, A2, B2).
6.6.3.4 Master join ratios less than 1.0
A master join ratio less than 1.0 implies that some detail rows fail to find a matching master row. This can happen for two reasons:
-
The relationship to the master does not apply (or is unknown) in some cases, where the foreign key to that master is null.
-
The relationship to the master table is corrupt in some cases, where the nonnull foreign-key value fails to point to a legitimate master record. Since the only legitimate purpose of a foreign key is to point unambiguously to a matching master record, nonnull values of that key that fail to join to the master are failures of referential integrity. These referential-integrity failures happen in this imperfect world, but the ideal response is to fix them, either by deleting detail records that become obsolete when the application eliminates their master records, or by fixing the foreign key to point to a legitimate master or to be null. It is a mistake to change the SQL to work around a broken relationship that ought to be fixed soon, so you should generally ignore master join ratios that are less than 1.0 for such failed referential integrity.
The first case is common and legitimate for some tables. For
example, if the company in the earlier Loans-table example happened to
be a bank, they might want a single Loans table for all loans the bank
makes, not just those to employees, and in such a Loans table
Employee_ID would apply only rarely, nearly always being null. However,
in this legitimate case, the database need not perform the join to pick up this
valuable row-discarding hidden join filter. If the database has already reached
the Loans table, it makes much more sense to make the filter explicit,
with a condition Employee_ID IS NOT NULL in the query. This
way, the execution engine will discard the unjoinable rows as soon as it reaches
the Loans table. You can choose the next join to pick up another filter
early in the execution plan, without waiting for a join to
Employees.
In the following examples, assume that master join ratios less
than 1.0 come only from sometimes-null foreign keys, not from failed referential
integrity. Choose the driving table by following the rules of this section. If
the driving table reaches the optional master table from above, make the
is-not-null condition explicit in the query and migrate the selectivity of that
condition into the detail node filter ratio. Consider the SQL diagram in Figure 6-26.
Figure 6-26. A query with a filtering master join
First, does the master join ratio affect the choice of driving
table? Both sides of the join from A1 to B1 can see the
benefit of this hidden join filter, and A1 has the better filter ratio
to start with. Nodes attached below B1 would also benefit, but there
are no nodes downstream of B1. No other node has a competitive filter
ratio, so drive from A1 just as if there were no hidden filter. To see
the best benefit of driving from A1, make explicit the is-not-null
condition on A1's foreign key that points to B1, with an added
clause:
A1.ForeignKeyToB1 IS NOT NULL
This explicit addition to the SQL enables the database to
perform the first join to another table with just the fraction (0.01 x
0.1=0.001) of the rows in A1. If you included the column
ForeignKeyToB1 in the driving filter index, the database could even
avoid touching the unwanted rows in A1 at all. Where do you join next?
Since the database has already picked up the hidden filter (made no longer
hidden) from the now-explicit condition A1.ForeignKeyToB1 IS NOT NULL,
you have burned that extra filter, so compare B1 and B2 as if
there were no filtering join.
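As a sketch of that driving index (the index name and the filter column are hypothetical), you might create:
-- With the foreign key included, rows where ForeignKeyToB1 is null can be
-- discarded from the index alone, without visiting the A1 table.
CREATE INDEX A1_Filter_FK ON A1 (Filter_Col, ForeignKeyToB1)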
Comparing B1 and B2 for their simple filter
ratios, choose B2 first, and choose the order (A1, B2, B1, M, A2,
B3).
Now, consider the SQL diagram in Figure 6-27, and try to work it out yourself
before reading further.
Figure 6-27. Another query with a filtering master join
Again, the filtering join has no effect on the choice of
driving table, since the filter ratio on M is so much better than even
the adjusted filter ratios on A1 and B1. Where do you join
next? When you join from the unfiltered side of the filtering master join, make
the hidden filter explicit with an is-not-null condition on
ForeignKeyToB1. When you make this filter explicit, the join of the
remaining rows has an effective join ratio of just 1.0, like most master joins,
and the adjusted SQL diagram looks like Figure 6-28.
Figure 6-28. Adjusting the SQL diagram to make the master-join filter explicit
Now, it is clear that A1 offers a better filter than
A2, so join to A1 first. After reaching A1, the
database now has access to B1 or B2 as the next table to join.
Comparing these to A2, you again find a better choice than A2.
You join to B2 next, since you have already burned the benefit of the
filtering join to B1. The complete optimum order, then, is (M, A1,
B2, A2, B1, B3).
Next, consider the more complex problem represented by Figure 6-29, and try to solve
it yourself before you read on.
Figure 6-29. Yet another query with a filtering master join
Considering first the best effective driving filter, adjust the
filter ratio on A1 and the filter ratios below the filtering master
join, and find that C1 offers an effective filter of 0.1 x 0.02=0.002. You
could make the filter explicit on A1 as before with an is-not-null
condition on the foreign key that points to B2, but this does not add
sufficient selectivity to A1 to make it better than C1. The
other alternatives are unadjusted filter ratios on the other nodes, but the best
of these, 0.005 on M, is not as good as the effective driving filter on
C1, so choose C1 for the driving table. From there, the
filtering master join is no longer relevant, because the database will not join
in that direction, and you find the full join order to be (C1, B2, A1, B1,
M, A2, B3).
How would this change if the filter ratio on A1 were
0.01? By making the implied is-not-null condition on ForeignKeyToB2
explicit in the SQL, as before, you can make the multiple-condition filter
selectivity 0.1 x 0.01=0.001, better than the effective filter ratio on
C1, making A1 the best choice for driving table. With the join
filter burned, you then choose the rest of the join order based on the simple
filter ratios, finding the best order: (A1, B2, C1, B1, M, A2, B3).
6.6.4 Close Filter Ratios
Occasionally, you find filter ratios that
fall close enough in size to consider relaxing the heuristic rules to take
advantage of secondary considerations in the join order. This might sound like
it ought to be a common, important case, but it rarely matters, for several
reasons:
-
When tables are big enough to matter, the application requires a lot of filtering to return a useful-sized (not too big) set of rows for a real-world application, especially if the query serves an online transaction. (End users do not find long lists of rows online or in reports to be useful.) This implies that the product of all filters is a small number when the query includes at least one large table.
-
Useful queries rarely have many filters, usually just one to three.
-
With few filters (but with the product of all filters being a small number), at least one filter must be quite small and selective. It is far easier to achieve sufficient selectivity reliably with one selective filter, potentially combined with a small number of at-best moderately selective filters, than with a group of almost equally selective filters.
-
With one filter that is much more selective than the rest, usually much more selective than the rest put together, the choice of driving table is easy.
-
Occasionally, you find near-ties on moderately selective filter ratios in choices of tables to join later in the execution plan. However, the order of later joins usually matters relatively little, as long as you start with the right table and use a robust plan that follows the join skeleton.
-
My own experience tuning queries confirms that it is rarely necessary to examine these secondary considerations. I have had to do this less than once per year of intensive tuning.
Nevertheless, if you have read this far, you might want to know
when to at least consider relaxing the simple rules, so here are some rule
refinements that have rare usefulness in the case of near-ties:
-
Prefer to drive to smaller tables earlier. After you choose the driving table, the true benefit/cost ratio of joining to the next master table is (1-R)/C, where R is the filter ratio and C is the cost per row of reading that table through the primary key. Smaller tables are better cached and tend to have fewer index levels, making C smaller and making the benefit/cost ratio tend to be better for smaller tables.
-
Prefer to drive to tables that allow you to reach other tables with even better filter ratios sooner in the execution plan. The general objective is to discard as many rows as early in the execution plan as possible. Good (low) filter ratios achieve this well at each step, but you might need to look ahead a bit to nearby filters to see the full benefit of reaching a node sooner.
-
To compare nodes when choosing a driving table, compare the absolute values of the filter ratios directly; 0.01 and 0.001 are not close, but a factor of 10 apart, not a near-tie at all. Near-ties for the driving filter almost never happen, except when the application queries small tables either with no filters at all or with only moderately selective filters. In cases of queries that touch only small tables, you rarely need to tune the query; automated optimization easily produces a fast plan.
-
To compare nodes when choosing later joins, compare 1-R, where R is each node's filter ratio. In this context, filter ratios of 0.1 and 0.0001 are close. Even though these R values are a factor of 1,000 apart, the values of 1-R differ only by 10%, and the benefit/cost ratio (see the first rule in this list) would actually favor the less selective filter if that node were much smaller.
Figure 6-30
shows the first example with a near-tie that could lead to a break with the
original simple heuristic rules. Try to work it out yourself before moving
on.
Figure 6-30. A case to consider exceptions to the simple rules
When choosing the driving node here, make no exception to the
rules; M has far and away the best filter ratio from the perspective of
choosing the driving table. The next choice is between A1 and
A2, and you would normally prefer the lower filter ratio on
A2. When you look at the detail join ratios from A1 to
M and from A2 to M, you find that A1 and
A2 are the same rowcount, so you have no reason on account of size to
prefer either. However, when you look below these nearly tied nodes, note that
A1 provides access to two nodes that look even better than A2.
You should try to get to these earlier in the plan, so you would benefit
moderately by choosing A1 first. After choosing A1 as the
second table in the join order, the choice of third table is clear; B2
is both much smaller and better-filtered than A2.
The join order, so far, is (M, A1, B2). The choice of
the next table is less obvious. C1 has a slightly worse filter ratio
than A2 but is much smaller, by a factor of 300 x 2,000, so its cost
per row is surely low enough to justify putting it ahead of A2 as well.
The join order, so far, is now (M, A1, B2, C1), and the next eligible
nodes are B1 and A2. The values for these nodes (1-R) are 0.5 and 0.7, respectively, and B1 is
half the size of A2, making its expected cost per row at least a bit
lower. If B1 were right on the edge of needing a new level in its
primary-key index, A2 would likely have that extra index level and
B1 might be the better choice next. However, since each level of an
index increases the capacity by about a factor of 300, it is unlikely that the
index is so close to the edge that this factor of 2 makes the difference.
Otherwise, it is unlikely that the moderate size difference matters enough to
override the simple rule based on filter ratio. Even if B1 is better
than A2, by this point in the execution plan it will not make enough
difference to matter; all four tables joined so far have efficiently filtered
rows to this point. Now, the cost of these last table joins will be tiny,
regardless of the choice, compared to the costs of the earlier joins. Therefore,
choose the full join order (M, A1, B2, C1, A2, B1, B3).
Now, consider the problem in Figure 6-31, and try your hand again before
you read on.
Figure 6-31. Another case to consider exceptions to the simple rules
This looks almost the same as the earlier case, but the filter
on M is not so good, and the filters on the A1 branch are
better, in total. The filter on M is twice as good as the filter on the
next-best table, C1, but C1 has other benefits — much smaller
size, by a factor of 2,000 x 300 x 10, or 6,000,000, and it has proximity to
other filters offering a net additional filter of 0.2 x 0.4 x 0.5=0.04. When you
combine all the filters on the A1 branch, you find a net filter ratio
of 0.04 x 0.1=0.004, more than a factor of 10 better than the filter ratio on
M.
The whole point of choosing a driving table, usually choosing
the one with the lowest filter ratio, is particularly to avoid reading more than
a tiny fraction of the biggest table (usually the root table), since costs of
reading rows on the biggest table tend to dominate query costs. However, here
you find the database will reach just 8% as many rows in M if it begins
on the A1 branch, rather than driving directly to the filter on
M, so C1 is the better driving table by a comfortable margin.
From there, just follow the normal rules, and find the full join order (C1,
B2, A1, B1, M, A2, B3).
All these exceptions to the rules sound somewhat fuzzy and
difficult, I know, but don't let that put you off or discourage you. I add the
exceptions to be complete and to handle some rare special cases, but you will
see these only once in a blue moon. You will almost always do just fine, far
better than most, if you just apply the simple rules at the beginning of this
chapter. In rare cases, you might find that the result is not quite as good as
you would like, and then you can consider, if the stakes are really high,
whether any of these exceptions might apply.
6.6.5 Cases to Consider Hash Joins
When a well-optimized query returns a modest
number of rows, it is almost impossible for a hash join to yield a significant
improvement over nested loops. However, occasionally, a large query can see significant benefit from
hash joins, especially hash joins to small or well-filtered tables.
When you drive from the best-filtered table in a query, any
join upward, from master table to detail table, inherits the selectivity of the
driving filter and every other filter reached so far in the join order. For
example, consider Figure
6-32. Following the usual rules, you drive from A1 with a filter
ratio of 0.001 and reach two other filters of 0.3 and 0.5 on B1 and
B2, respectively, on two downward joins. The next detail table join, to
M, will normally reach that table, following the index on the foreign
key that points to A1, on a fraction of rows equal to 0.00015 (0.001 x
0.3 x 0.5). If the detail table is large enough to matter and the query does not
return an unreasonably large number of rows, this strategy almost guarantees that nested loops following foreign keys into detail tables will win. Other join
methods, such as a hash join that accesses the detail table independently,
through its own filter, will read a larger fraction of that same table, since
the best nested-loops alternative drives from the best filter ratio in the
query.
Figure 6-32. Illustration of a query that favors hash joins
On the other hand, when you join downward to a master table,
you might be joining to a much smaller table, and the cost of reaching the rows
through the primary key might be larger than the cost of reading the table
independently for a hash join. From the statistics in Figure 6-32, you would find 3,000 times as
many rows in A1 as in B1. Even discarding 999 out of 1,000
rows from A1, the database would join to each row in B1 an
average of three times. Assume that A1 has 3,000,000 rows and
B1 has 1,000 rows. After winnowing A1 down to 3,000 rows,
using the driving filter, the database would join to rows in the 1,000-row table
B1 3,000 times. If the database drove into B1 with
B1's own filter, it would need to read just 300 (0.3 x 1,000) rows, at
roughly 1/10th the cost. Since the query reaches over 20% of the rows
in B1, the database would find an even lower cost by simply performing
a full table scan of B1 and filtering the result before doing the hash
join. Therefore, while leaving the rest of the query cost unchanged, choosing a
hash join to B1 would eliminate almost all the cost of the B1
table reads, compared to the standard robust plan that reads B1 through
nested loops.
So,
especially when you see large detail join ratios, hash joins to master tables
can be fastest. How big is the improvement, though? In the example, the
fractional improvement for this one table was high, over 90%, but the absolute
improvement was quite modest, about 9,000 logical I/Os (6,000 in the
two-level-deep key index and 3,000 on the table), which would take about 150
milliseconds. You would find no physical I/O at all to a table and index of this
size. On the other hand, you would find the query reads about 2,250 (3,000 x 0.3
x 0.5 x 5) rows from the 15,000,000-row table M with 3,600 logical
I/Os.
These 3,600 logical I/Os, especially the 2,250 to the table
itself, will lead to hundreds of physical I/Os for such a large, hard-to-cache
table. At 5-10 milliseconds per physical I/O, the reads to M could take
seconds. This example is typical of cases in which hash joins perform best;
these cases tend to deliver improvements only in logical I/Os to the smallest
tables, and these improvements tend to be slight compared to the costs of the
rest of the query.
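To tie these numbers together, here is a minimal Python sketch reproducing the logical-I/O arithmetic for Figure 6-32. The rowcounts and filter ratios are the ones quoted above; the factor of 5 up to M is the ratio of M's 15,000,000 rows to A1's 3,000,000, and the per-probe index depths (two levels for B1's primary key, three for the foreign-key index into M) match the counts quoted in the text.

# Figure 6-32 arithmetic, using the rowcounts and filter ratios quoted above.
a1_rows, b1_rows, m_rows = 3_000_000, 1_000, 15_000_000
a1_filter, b1_filter, b2_filter = 0.001, 0.3, 0.5

driving_rows = a1_rows * a1_filter            # 3,000 rows kept in A1

# Nested loops into B1: each of the 3,000 rows probes a two-level-deep
# primary-key index (2 I/Os) and reads one table block (1 I/O).
nested_loops_b1 = driving_rows * (2 + 1)      # about 9,000 logical I/Os

# Hash-join alternative: read B1 once through its own filter instead.
hash_read_b1 = b1_rows * b1_filter            # 300 rows, roughly 1/10th the cost

# Join up to M: rows reached, and logical I/Os through the foreign-key index
# on M (three index levels per range scan, plus one I/O per table row).
m_join_rows = driving_rows * b1_filter * b2_filter * (m_rows / a1_rows)  # 2,250
m_logical_ios = driving_rows * b1_filter * b2_filter * 3 + m_join_rows   # 3,600

print(round(nested_loops_b1), round(hash_read_b1),
      round(m_join_rows), round(m_logical_ios))   # -> 9000 300 2250 3600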
A couple of rough formulae might help you choose the best join
method:
- H=C x R
- L=C x D x F x N
The variables in these formulas are defined as follows:
- H: The number of logical I/Os necessary to read the master table independently, for purposes of a hash join.
- C: The rowcount of the master table.
- R: The master table's filter ratio. Assume the database reaches the table independently, through an index range scan on that filter.
- L: The number of logical I/Os to read the master table through nested loops on its primary key.
- D: The detail join ratio for the link that leads up from the master table, which normally tells you how much smaller it is than the detail table above it.
- F: The product of all filter ratios, including the driving filter ratio, reached before this join.
- N: The number of logical I/Os required per row read through the primary-key index.
Since a primary-key index on up to about 300 rows normally fits in a single root block, N=2 (1 for the index root block and 1 for the table) for C up to 300. N=3 for C between roughly 300 and 90,000, and N=4 for C between 90,000 and 27,000,000. Since you will normally drive from the best filter ratio, F<R even if the plan picks up no additional filters after the driving-table filter.
H is less than L, favoring the hash join for logical I/O costs, when
R<D x F x N. F<R when you drive
from the node with the best filter ratio. N is
small, as shown, since B-trees branch heavily at each level. Therefore, to favor
a hash join, either F is close in magnitude to
R, or D is large,
making this a join to a much smaller master table. This same calculation will
sometimes show a logical-I/O savings when joining to a large master table, but
use caution here. If the master table is too large to cache easily, physical I/O
comes into play and tends to disfavor the hash join in two ways:
- Since table rows are much less well cached than index blocks, the hash-join cost advantage (if any) when physical I/O costs dominate mostly compares just the table-row counts, asking whether R<D x F. Since I/O to the index is so much better cached than I/O to the table, costs dominated by physical I/O lose the N factor for index-block reads. Without the N factor, hash joins are less favored.
- If C x R is large, the hash join might have to write prehashed rows to disk and read them back, making the hash join much more expensive and potentially leading to out-of-disk-space errors for the disk caching the hash join needs during query execution. This is the robustness risk of forgoing nested loops.
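As a sanity check on these formulas, here is a minimal Python sketch that encodes the N rule above and the H-versus-L comparison (logical I/O only; the physical-I/O caveats in the two points above are not modeled). The sample values are the Figure 6-32 numbers for the join to B1: C=1,000, R=0.3, D=3,000, and F=0.001 (the driving filter alone).

def index_ios_per_row(rowcount):
    """N: logical I/Os per row read through the primary-key index."""
    if rowcount <= 300:
        return 2              # root block + table block
    if rowcount <= 90_000:
        return 3
    return 4                  # up to roughly 27,000,000 rows

def compare_join_methods(C, R, D, F):
    N = index_ios_per_row(C)
    H = C * R                 # read the master table independently for a hash join
    L = C * D * F * N         # read it through nested loops on its primary key
    # H < L is equivalent to the R < D x F x N test in the text.
    return round(H), round(L), "hash join" if H < L else "nested loops"

print(compare_join_methods(C=1_000, R=0.3, D=3_000, F=0.001))
# -> (300, 9000, 'hash join'): the same 300-versus-9,000 comparison worked out earlier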
In all, it is difficult to find real queries in which the hash-join savings over the best robust plan are worth much bother. Almost the only cases of large savings for hash joins appear when the query starts with a poor driving filter and hits large tables, which almost inevitably means the whole query will return an unreasonably large rowset.
However, none of this implies that hash joins are a mistake;
cost-based optimizers look for small improvements as well as large, and they are
good at finding cases in which hash joins help a bit. While I almost never go
out of my way to force a hash join, I usually do not attempt to override the
optimizer's choice when it chooses one for me on a small table, as it often
does, as long as the join order and index choices are still good. If in doubt,
you can always experiment. First, find the best robust nested-loops plan.
Then, replace nested loops to a master table with an optimized
single-table access path to the same table and a hash join at the same point in
the join order. This change will not affect costs to the other tables,
decoupling the choice from the other optimization choices. Use whichever choice
is fastest, keeping in mind that you must rerun tests or run tests far apart to
avoid a caching bias against the first test.
The earlier example, in Section
6.3.1, of the problem diagrammed in Figure
6-2 illustrates well a minor performance improvement available through hash
joins. Notice that the detail join ratios above nodes OT and
ODT are in the millions, indicating that these are joins to a tiny set
of rows of the Code_Translations table. Since each set of eligible rows
for the given values of Code_Type is so small, you will require less
logical I/O to read the whole rowset once, probably through an index on the
Code_Type column, and then perform a hash join, rather than go to each row (more than once, probably) as you need it through nested loops. Either
alternative is cheap, though, and will make up just a small fraction of the
overall query runtime, since the number of rows the query returns is modest and
the physical index and table blocks of Code_Translations will be fully
cached.