Wednesday 14 September 2011

Best Practices for String Procedures and Tracing

Question:
What is the best way to write generic procedures to work on strings and/or lists of strings? For example, what is the best way to write a function to parse a delimited string into a collection, with a specific separator? I assume that function parameters should be declared as VARCHAR2(with no length) and "internal" strings should be declared as VARCHAR2(32767). What is the best way of representing a list of strings in the most generic way possible? Also, is there a good way to extend the code to support CLOBs (character large objects) as well, without duplicating the code (to cover every possible combination of parameter types)?
Answer:
Sounds as if you like to write generic, reusable code. So do I. It is intellectually stimulating and can save me lots of time, because I can avoid writing the same things over and over again.
On the other hand, it is also important to recognize that a generic program is usually more complex and harder to test than more-specific variations on an algorithm. Working out requirements for something that should be truly generic is never as easy as it seems at first glance. Furthermore, we often venture unknowingly into the dangerous terrain of overdesign. "Gee, why don't I make this program really flexible by adding five other parameters to modify the default behavior of the program?"
Before I answer your questions, I'd like to offer some high-level advice on your next generic utility creation adventure:
• Don't overdesign. Be very practical about what you need now and what you can clearly see you will need in the near future. Don't daydream about possible requirements as you design your program.
• Don't overcode. Make sure that you hew closely to the design you have settled on. Again, when you are writing generic code, you will find yourself thinking about more and more possibilities, even as you write the program.
• Cut your losses. You might find as you start to build your generic program that it is tougher than you expected. Listen to your intuition. If the little voice inside your head is saying, "Yes, I know I can do this . . . but is it really all that important?" you should think seriously about abandoning your generic approach and instead write a more specific program from scratch that meets your current requirements.
Having said all that, your question covers three areas:
1. How should you declare "internal" strings—local variables inside your generic functions? When declaring a local variable of type VARCHAR2, you must specify a maximum length. The question then becomes how best to do this. Should you declare it as VARCHAR2(32767), because that is the largest possible value? Will that use excessive amounts of memory, if only small string values are manipulated?
2. What is the best way to declare and work with lists of strings?
3. How can you best write programs to work with both VARCHAR2 variables and CLOBs?
Here are my answers, followed by examples and additional explanations:
1. The best way to declare your local variables is to define a subtype that is based on the biggest possible string size in a shared utility package. All of your string functions can then reference that subtype when declaring local variables. If you create a single package containing all your string functions, you can declare the subtype in that package. And you will not use up any more memory than needed, because Oracle Database allocates memory as needed for string variables declared with a length of 32,767 characters.
2. To work with a list of strings, you need to declare the list based on an already-defined collection type. You can use a type offered by Oracle Database, such as DBMS_SQL.VARCHAR2S or DBMS_SQL.VARCHAR2A. You can also define your own type and then reference that type.
3. You should use CLOBs only when you need to manipulate strings longer than 32,767 bytes or characters in length. Assuming, then, that you are running Oracle9i Database Release 2, you should generally be able to use the same code for many of your functions, because the newer versions of Oracle Database allow you to execute built-in string functions natively against CLOBs, as well as VARCHAR2 string variables.
Let's take a look at each of these answers in more detail.
Avoiding hard-coded VARCHAR2 constraints. A program I often use to demonstrate various best-practices techniques is a simple variant on SUBSTR, which I call BETWNSTR. The built-in SUBSTR function takes a string, the starting location, and the number of characters, and returns that portion of the string. You can get more information about SUBSTR from the Oracle Database SQL Reference.
SUBSTR is dandy, but I often have the starting and ending locations in a string and want the substring between those two positions. To use SUBSTR, I must calculate the number of characters between start and end. I find that I always forget the relevant formula (end - start + 1). So I put together an itty-bitty function to remember the formula, as follows:
FUNCTION betwnstr (
string_in IN VARCHAR2
, start_in IN PLS_INTEGER
, end_in IN PLS_INTEGER
)
RETURN VARCHAR2
IS
l_return VARCHAR2(32767);
BEGIN
l_return := SUBSTR ( string_in, start_in, end_in - start_in + 1 );
RETURN l_return;
END betwnstr;
Here are two notes to make about this implementation:
• I have declared a local variable, but I don't really need it here. It is present to demonstrate the best practice of how to best declare such variables.
• This is a simplified implementation of BETWNSTR to demonstrate best practices. It does not handle all the cases a real "between string" utility should.
Code Listing 1: string_pkg with BETWNSTR
CREATE OR REPLACE PACKAGE string_pkg
IS
SUBTYPE maxvarchar2_t IS VARCHAR2 ( 32767 );

FUNCTION betwnstr (
string_in IN VARCHAR2
, start_in IN PLS_INTEGER
, end_in IN PLS_INTEGER
)
RETURN VARCHAR2;
END string_pkg;
/

CREATE OR REPLACE PACKAGE BODY string_pkg
IS
FUNCTION betwnstr (
string_in IN VARCHAR2
, start_in IN PLS_INTEGER
, end_in IN PLS_INTEGER
)
RETURN VARCHAR2
IS
l_return maxvarchar2_t;
BEGIN
l_return := SUBSTR ( string_in
, start_in
, end_in - start_in + 1 );
RETURN l_return;
END betwnstr;
END string_pkg;
/
Now I am going to move the BETWNSTR function into a package and move the hard-coded local variable declaration, as shown in Listing 1. Notice that I have now declared the following subtype:
SUBTYPE maxvarchar2_t IS
VARCHAR2 ( 32767 );
I use that subtype in my declaration of the local variable.
It is true that I do still have a hard-coding of the number 32767 in my package, but it appears only once and serves as the single point of definition for the new, biggest VARCHAR2-allowed datatype.
If Oracle Database increases the maximum size for a VARCHAR2, I can make a change in one place and recompile string_pkg and all programs that use this package, and then my application will be able to immediately take advantage of the expanded datatype.
Work with lists of strings. As noted above, to work with a list of strings you need to declare the list based on an already-defined collection type. You can use a type offered by Oracle Database, such as DBMS_SQL.VARCHAR2S (maximum of 255 characters per string) or DBMS_SQL.VARCHAR2A (maximum of 32,767 characters per string, introduced in Oracle9i Database Release 2). You can also define your own type, again perhaps in your centralized string package, and then reference that type.
Suppose I want to add a function to my string package that accepts a list of elements separated by a delimiter (a comma-delimited list, or a string such as that returned by DBMS_UTILITY.FORMAT_CALL_STACK, delimited by new-line characters) and returns a collection whose elements are those strings between delimiters. I can take one of two approaches:
1. Use a collection type defined by Oracle Database and available automatically to all programmers, such as those found in the DBMS_SQL package. Here is an example of the function header for this approach:
CREATE OR REPLACE PACKAGE
string_pkg
IS
SUBTYPE maxvarchar2_t
IS VARCHAR2 ( 32767 );

FUNCTION betwnstr (
string_in IN VARCHAR2
, start_in IN PLS_INTEGER
, end_in IN PLS_INTEGER
)
RETURN VARCHAR2;

FUNCTION list_to_collection (
string_in IN VARCHAR2
, delim_in IN VARCHAR2
DEFAULT ' , '
)
RETURN DBMS_SQL.varchar2a;
END string_pkg;
/
2. Alternatively, I define my own collection type in the package and then use that in the function interface:
CREATE OR REPLACE PACKAGE
string_pkg
IS
SUBTYPE maxvarchar2_t IS
VARCHAR2 ( 32767 );

TYPE maxvarchar2_aat IS
TABLE OF maxvarchar2_t
INDEX BY PLS_INTEGER;
FUNCTION betwnstr (
string_in IN VARCHAR2
, start_in IN PLS_INTEGER
, end_in IN PLS_INTEGER
)
RETURN VARCHAR2;

FUNCTION list_to_collection (
string_in IN VARCHAR2
, delim_in IN VARCHAR2
DEFAULT ' , '
)
RETURN maxvarchar2_aat;
END string_pkg;
/
The first option (using DBMS_SQL.VARCHAR2A) saves some typing and can be used with programs that are already using collections based on the DBMS_SQL type. The downside is that it introduces an external dependency in this package. Because the DBMS_SQL package is present in all versions of Oracle Database, it's hard to get too worried about this external dependency.
Generally, though, you are better off avoiding such dependencies or at least minimizing the number of dependencies in a program. The second option (a collection type defined directly within the package specification) allows you to define a completely self-contained utility.
Functions for VARCHAR2s and CLOBs. In Oracle9i Database Release 2 and above, you can in many circumstances treat VARCHAR2 and CLOB interchangeably. Visit the "PL/SQL Semantics for LOBs" section in the Oracle Database Application Developer's Guide—Large Objects, to get more details about how Oracle Database will implicitly convert between VARCHAR2 and CLOB, and how you can apply many of the standard VARCHAR2 built-in functions directly to CLOBs.
Because CLOBs can be much larger than VARCHAR2 strings, you can think of a VARCHAR2 variable as being (logically) a subtype of CLOB. So a function that works with CLOBs should also work with VARCHAR2s.
One might then be tempted to say, "Let's just use CLOBs as the standard datatype for string functions." I could then build a VARCHAR2 "skin" that simply invokes the CLOB-based algorithm for its implementation. Given the performance characteristics of CLOB, however, it is recommended that you use CLOB only when you're sure that VARCHAR2 (32767) is not big enough.
So from a performance standpoint, I will want to implement the CLOB variation of list_to_collection, one that accepts a CLOB and returns a collection of CLOBs, with a copy-and-paste operation, simply replacing the VARCHAR2 declarations with CLOBs wherever appropriate.
And until the performance characteristics of CLOBs improve, you will generally want to create separate programs for this datatype.
Avoiding the hard-coding of the 32,767 literal. Now, I have a problem with the above program: I have hard-coded the 32767 maximum length into my call to SUBSTR. I could get around this problem by adding a constant to the package specification like this:
CREATE OR REPLACE PACKAGE string_pkg
IS
c_max_varchar2_len CONSTANT PLS_INTEGER := 32767;
SUBTYPE maxvarchar2_t IS VARCHAR2 (32767);
I can then use this constant in my call to SUBSTR:
16 FOR clob_index IN 1 .. l_clobs.COUNT
17 LOOP
18 l_return ( clob_index ) :=
19 SUBSTR ( l_clobs ( clob_index )
20 , 1
21 , c_max_varchar2_len );
22 END LOOP;
This is better, but I still have multiple occurrences of the 32767 literal in my package specification.
If I am properly obsessed with avoiding repetition of such values in my code, I could take things a step further and take advantage of conditional compilation as follows:
ALTER SESSION SET plsql_ccflags = 'max_varchar2_length:32767'
/

CREATE OR REPLACE PACKAGE string_pkg
IS
SUBTYPE maxvarchar2_t IS VARCHAR2 ( $$max_varchar2_length );
and then my function implementation would look like this:
16 FOR clob_index IN 1 .. l_clobs.COUNT
17 LOOP
18 l_return ( clob_index ) :=
19 SUBSTR ( l_clobs ( clob_index )
20 , 1
21 , $$max_varchar2_length );
22 END LOOP;
For more information on conditional compilation, check out:
/technology/tech/pl_sql/pdf/Plsql_Conditional_Compilation.pdf
/technology/tech/pl_sql/pdf/Plsql_Conditional_Compilation_PPT.pdf
Conclusions. We should all pay attention to the opportunities to create reusable programs, especially handy string functions. In the process of creating these generic utilities, we should do everything we can to make the code easy to maintain (avoid hard-codings and redundant algorithms), while still ensuring that the code is efficient enough to be useful in a production environment.
Code Listing 2: string_fun Package with full BETWNSTR implementation
The following script was created by Bryn Llewellyn, PL/SQL Product Manager, to compare VARCHAR2 and CLOB performance.
Code Listing 3: Script to compare VARCHAR2 and CLOB performance
Using the CLOB implement as the foundation algorithm. Recognizing that CLOBs are slower than VARCHAR2s, I thought I would still show you how you could implement the CLOB version of "list to collection" as the foundation algorithm and then implement the VARCHAR2 version on top of that. It is a useful technique for avoiding duplication of application logic, one that you should be entirely comfortable applying to your own situations. Of course, avoiding code redundancy usually cannot trump performance bottlenecks.
Here we go...
I will add a CLOB variation of list_to_collection to my package, one that accepts a CLOB and returns a collection of CLOBs, each element of which contains a delimited item from the incoming CLOB.
My string package specification now needs another collection type and function header:
CREATE OR REPLACE PACKAGE string_pkg
IS
... previous VARCHAR2 elements ...

TYPE clob_aat IS TABLE OF CLOB
INDEX BY PLS_INTEGER;

FUNCTION cloblist_to_collection (
string_in IN CLOB
, delim_in IN VARCHAR2 DEFAULT ','
)
RETURN clob_aat;
END string_pkg;
/
Now, I could implement this function with a copy-and-paste operation, but I would really much rather have just one instance of the parsing algorithm used in the list_to_collection function. So here is what I will do:
1. Create cloblist_to_collection by copying list_to_collection.
2. Change all relevant declarations from VARCHAR2 to the CLOB types (individual variables and collection types).
3. Replace the algorithm inside list_to_collection with a call to cloblist_to_collection.
Let's take a look. Here is the header and declaration section of cloblist_to_collection:
FUNCTION cloblist_to_collection (
string_in IN CLOB
, delim_in IN VARCHAR2 DEFAULT ','
)
RETURN clob_aat
IS
l_loc PLS_INTEGER;
l_row PLS_INTEGER := 1;
l_startloc PLS_INTEGER := 1;
l_return clob_aat;
Notice that I have changed the string_in datatype to CLOB, and return the clob_aat collection type instead of maxvarchar2_aat. The body of this function remains unchanged — it looks exactly the same as it did when it implemented list_to_collection.
Now I must "reinvent" the list_to_collection function. Here is my implementation, with explanation below:
1 FUNCTION list_to_collection (
2 string_in IN VARCHAR2
3 , delim_in IN VARCHAR2 DEFAULT ','
4 )
5 RETURN maxvarchar2_aat
6 IS
7 l_clobs clob_aat;
8 l_return maxvarchar2_aat;
9 BEGIN
10 -- Parse the string as a CLOB.
11 l_clobs := cloblist_to_collection (
12 TO_CLOB ( string_in ), delim_in );
13
14 -- Copy the individual items to the string collection.
15 -- Use SUBSTR to avoid VALUE_ERROR exceptions.
16 FOR clob_index IN 1 .. l_clobs.COUNT
17 LOOP
18 l_return ( clob_index ) :=
19 SUBSTR ( l_clobs ( clob_index )
20 , 1
21 , 32767);
22 END LOOP;
23
24 RETURN l_return;
25* END list_to_collection;


Line(s) Significance
7-8 Declare a CLOB collection for use in the call to cloblist_to_collection, and another collection to be returned by the function.
11-12 Convert the VARCHAR2 string to CLOB and then pass it to the cloblist_to_collection to do the parsing.
16-22 Copy the contents of the CLOB collection to the "maximum VARCHAR2" collection. Use SUBSTR to take only the first 32,767 characters of each CLOB, to avoid raising VALUE_ERROR.
Note: instead of truncating and possibly losing string data with SUBSTR, you might want to adjust the algorithm to break up individual items that are longer than 32767 characters and pass them back as separate items in the "maximum VARCHAR2" collection

No comments:

Post a Comment