aspose file tools*
The moose likes JDBC and the fly likes A Tricky Stored Procedure Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login


Win a copy of Spring in Action this week in the Spring forum!
JavaRanch » Java Forums » Databases » JDBC
Bookmark "A Tricky Stored Procedure" Watch "A Tricky Stored Procedure" New topic
Author

A Tricky Stored Procedure

Corey McGlone
Ranch Hand

Joined: Dec 20, 2001
Posts: 3271
This may be getting a bit off topic, but it's still primarily SQL so, hopefully, I won't get too bad of a thrashing for posting this here.

I'm trying to write a rather complicated stored procedure (in MS SQL Server 2000). It accesses a lot of tables and uses a number of temporary tables to store intermediate data. Let me try to explain my conundrum without letting too many details get in the way.

I have a table (temporary) in which I have gathered a large amount of data. I have a column for customer number, a column for transaction type, and a column for date of transaction. So, my table might look something like this:



Now, what I want to get from this table is this:

For each customer, for each transaction type, return the most recent transaction.

So, in the above table, I would want lines 1, 3, 4, 5, and 8 to be returned. The other rows do not need to be included in the result set. As you've probably already noticed, the table is already sorted: first by customer number, next by transaction type, and finally by transaction date. I'm really struggling with how I can get the data I need from this table.

If I was pulling this information to a Java application, I'd simply use a ResultSet and iterate through it, grabbing the data I needed and discarding the rest. Unfortuantely, this data is not going to a Java application and I'm not even the one writing the application.

Any advice you might give is greatly appreciated.

Thanks,
Corey


SCJP Tipline, etc.
Michael Matola
whippersnapper
Ranch Hand

Joined: Mar 25, 2001
Posts: 1751
    
    2
In plain SQL?



For each customer, for each transaction type

Hey, that's what tells us we want to join on both customer and transaction type in the subquery!

You can omit the "and t2.tx_date <= sysdate" if you know your data won't contain dates in the future.

(Hey, didn't we already do this?)

As you've probably already noticed, the table is already sorted:

Why should that matter?

Any sort you want to rely on should be specified in an "order by" section in your query (or have the calling application sort).
Jeanne Boyarsky
author & internet detective
Marshal

Joined: May 26, 2003
Posts: 30764
    
156

Michael,
Doesn't that query return any transactions that fall on the date for the latest transaction in the whole table? I think Corey wanted the latest for each customer.

For each customer, for each transaction type

Hey, that's what tells us we want to join on both customer and transaction type in the subquery!

In addition to telling us that we want to join, it tells us we need to group the results by customer and transaction type.

Corey,
This is definitely a good place for SQL or even stored procs!

My (untested) attempt at the query:


[Blog] [JavaRanch FAQ] [How To Ask Questions The Smart Way] [Book Promos]
Blogging on Certs: SCEA Part 1, Part 2 & 3, Core Spring 3, OCAJP, OCPJP beta, TOGAF part 1 and part 2
Michael Matola
whippersnapper
Ranch Hand

Joined: Mar 25, 2001
Posts: 1751
    
    2
Doesn't that query return any transactions that fall on the date for the latest transaction in the whole table?

No.

(I think you're overlooking the fact that this query has a correlated subquery.)

I think Corey wanted the latest for each customer.

As I understand his post, he wants the latest for each customer, for each transaction type.

Those are the results my query gives.

...group by t.customerId, t.tx_type...

Your query is invalid.

Problem is that you have a field in your select list (namely, tx_date, because of the t.*) that is neither (1) an aggregating function nor (2) specified explicitly in the group by section.

(Even if you would add transaction date to the group by, adding the (corrected) group by to the query would have zero effect on the results.)

In addition to telling us that we want to join, it tells us we need to group the results by customer and transaction type.

SQL's "group by" is not the only way to form subsets of data (groups) to operate on.

Another common way is the use of a correlated subquery. A correlated subquery is a subquery that "knows" about the main query it lives in. A correlated subquery joins back to (at least one of) the table(s) in the main select.

from coreyTable t2
where t2.customerId = t.customerId
and t2.tx_type = t.tx_type

Essentially forms groups of data to operate on, based on what fields we choose to join on in the subselect:



Now:

( select max( t2.tx_date )
from coreyTable t2
where t2.customerId = t.customerId
and t2.tx_type = t.tx_type
and t2.tx_date <= sysdate )

The function "max( t2.tx_date )" and restriction "and t2.tx_date <= sysdate" will operate on those groups and select the bolded dates:



And the correlated subquery effect as a whole will return only the rows per group that meet the specified conditions.
[ February 04, 2005: Message edited by: Michael Matola ]
Michael Matola
whippersnapper
Ranch Hand

Joined: Mar 25, 2001
Posts: 1751
    
    2
If anyone's still confused as to why the unbolded rows in each group get dropped from the results, I'd recommend stepping through the original data row by row, pretending you're the SQL engine processing my query, and deciding whether the given row makes it into the result set or is discarded.
Michael Matola
whippersnapper
Ranch Hand

Joined: Mar 25, 2001
Posts: 1751
    
    2
As an aside, if anyone's curious as to why I recommend the subquery approach in queries like this, instead of something like the following, which in this case produces similar results:

select max( t.tx_date ) , t.customerId , t.tx_type
from coreyTable t
group by t.customerId , t.tx_type

(1) If you take the non-subquery approach, you can't use a cutoff date such as tx_date <= sysdate. (I often work with data and queries where cutoff dates are important. Sometimes I have future dates in the data that need to be excluded.)

(2) If you take the non-subquery approach, if there are multiple max date rows per customer id and transaction type, they are suppressed by the implied distinct that you get when using an aggregating function. This may not be important to you. For me it often is.

(3) What about the "Other Data" fields in Corey's original data. In or out? Note that if you include additional fields in your select list in the non-subquery approach, it affects the grouping logic. Not so with the subquery approach.

(4) The non-subquery approach can get really ugly fast if you're joining all this to additional tables. Slightly less so with the subquery approach.
[ February 04, 2005: Message edited by: Michael Matola ]
Corey McGlone
Ranch Hand

Joined: Dec 20, 2001
Posts: 3271
Wow - thanks for all the help, everyone. Lots of really good information here. Just as a bit of background, I'm sitting here writing this stored procedure, which touches 11 tables to get back some data for a report. For people that do this stuff everyday, that's probably nothing to break a sweat at. For me...my head is swimming. *sigh*


(Hey, didn't we already do this?)


Kind of... My initial post was here. Notice the final statement from me? I had to clarify some business rules. Well, as luck would have it, the requirements have changed a bit on me...big surprise. That's why I'm here again. Now, instead of getting just the most recent from a table, I have to get multiple "most recents" from a single table, one for each grouping that I have.

The first query you gave me works like a charm, Michael. I really appreciate the help.

While working through this, I've run into another issue that I'm not quite sure how to solve. Let me try to explain...

There are multiple customers in the database and there are multiple service providers in the database. Each service provider provides service to multiple customers (and it is possible that more than one service providers serves a single customer, but I'm not sure that really matters).

Anyway, I start by building a temporary table of service providers (named #Providers) that we want to report on. That table includes some basic information, such as the primary key and the provider's name. So, that table might look like this:



Next, I want to get all of the customers that are served by these service providers. Because Customers and Service Providers have a many-to-many relationship, there's an intermediate table (Customers_Providers) that links the two. So, to get my list of customers, I do something like this:



My problem is that I need to get the additional information from the #Providers table, such as Provider name. I'd really like that information to go along with the Customer Number. Can I do something like this?



Will that work? Something about it seems wrong to me, but maybe it's just that I've been looking at this stuff so long that I don't know what I'm doing any more. :roll: At least it's almost the weekend.

Thanks again for the help, folks.
[ February 04, 2005: Message edited by: Corey McGlone ]
Corey McGlone
Ranch Hand

Joined: Dec 20, 2001
Posts: 3271
That seemed to have done the trick.

Well, after a lot of work, I think this stored procedure actually works. I still need to do some more error checking and what-not, but I think, at the very least, it will do what it's supposed to do.

Thanks, everyone, for the help.
Jeanne Boyarsky
author & internet detective
Marshal

Joined: May 26, 2003
Posts: 30764
    
156

Originally posted by Michael Matola:
(I think you're overlooking the fact that this query has a correlated subquery.)

Yes I was. Oops! Thanks for the explanation.
 
jQuery in Action, 2nd edition
 
subject: A Tricky Stored Procedure