This week's book giveaway is in the OCAJP 8 forum. We're giving away four copies of OCA Java SE 8 Programmer I Study Guide and have Edward Finegan & Robert Liguori on-line! See this thread for details.
This may be getting a bit off topic, but it's still primarily SQL so, hopefully, I won't get too bad of a thrashing for posting this here.
I'm trying to write a rather complicated stored procedure (in MS SQL Server 2000). It accesses a lot of tables and uses a number of temporary tables to store intermediate data. Let me try to explain my conundrum without letting too many details get in the way.
I have a table (temporary) in which I have gathered a large amount of data. I have a column for customer number, a column for transaction type, and a column for date of transaction. So, my table might look something like this:
Now, what I want to get from this table is this:
For each customer, for each transaction type, return the most recent transaction.
So, in the above table, I would want lines 1, 3, 4, 5, and 8 to be returned. The other rows do not need to be included in the result set. As you've probably already noticed, the table is already sorted: first by customer number, next by transaction type, and finally by transaction date. I'm really struggling with how I can get the data I need from this table.
If I was pulling this information to a Java application, I'd simply use a ResultSet and iterate through it, grabbing the data I needed and discarding the rest. Unfortuantely, this data is not going to a Java application and I'm not even the one writing the application.
Doesn't that query return any transactions that fall on the date for the latest transaction in the whole table?
(I think you're overlooking the fact that this query has a correlated subquery.)
I think Corey wanted the latest for each customer.
As I understand his post, he wants the latest for each customer, for each transaction type.
Those are the results my query gives.
...group by t.customerId, t.tx_type...
Your query is invalid.
Problem is that you have a field in your select list (namely, tx_date, because of the t.*) that is neither (1) an aggregating function nor (2) specified explicitly in the group by section.
(Even if you would add transaction date to the group by, adding the (corrected) group by to the query would have zero effect on the results.)
In addition to telling us that we want to join, it tells us we need to group the results by customer and transaction type.
SQL's "group by" is not the only way to form subsets of data (groups) to operate on.
Another common way is the use of a correlated subquery. A correlated subquery is a subquery that "knows" about the main query it lives in. A correlated subquery joins back to (at least one of) the table(s) in the main select.
from coreyTable t2 where t2.customerId = t.customerId and t2.tx_type = t.tx_type
Essentially forms groups of data to operate on, based on what fields we choose to join on in the subselect:
( select max( t2.tx_date ) from coreyTable t2 where t2.customerId = t.customerId and t2.tx_type = t.tx_type and t2.tx_date <= sysdate )
The function "max( t2.tx_date )" and restriction "and t2.tx_date <= sysdate" will operate on those groups and select the bolded dates:
And the correlated subquery effect as a whole will return only the rows per group that meet the specified conditions. [ February 04, 2005: Message edited by: Michael Matola ]
Joined: Mar 25, 2001
If anyone's still confused as to why the unbolded rows in each group get dropped from the results, I'd recommend stepping through the original data row by row, pretending you're the SQL engine processing my query, and deciding whether the given row makes it into the result set or is discarded.
Joined: Mar 25, 2001
As an aside, if anyone's curious as to why I recommend the subquery approach in queries like this, instead of something like the following, which in this case produces similar results:
select max( t.tx_date ) , t.customerId , t.tx_type from coreyTable t group by t.customerId , t.tx_type
(1) If you take the non-subquery approach, you can't use a cutoff date such as tx_date <= sysdate. (I often work with data and queries where cutoff dates are important. Sometimes I have future dates in the data that need to be excluded.)
(2) If you take the non-subquery approach, if there are multiple max date rows per customer id and transaction type, they are suppressed by the implied distinct that you get when using an aggregating function. This may not be important to you. For me it often is.
(3) What about the "Other Data" fields in Corey's original data. In or out? Note that if you include additional fields in your select list in the non-subquery approach, it affects the grouping logic. Not so with the subquery approach.
(4) The non-subquery approach can get really ugly fast if you're joining all this to additional tables. Slightly less so with the subquery approach. [ February 04, 2005: Message edited by: Michael Matola ]
Joined: Dec 20, 2001
Wow - thanks for all the help, everyone. Lots of really good information here. Just as a bit of background, I'm sitting here writing this stored procedure, which touches 11 tables to get back some data for a report. For people that do this stuff everyday, that's probably nothing to break a sweat at. For me...my head is swimming. *sigh*
(Hey, didn't we already do this?)
Kind of... My initial post was here. Notice the final statement from me? I had to clarify some business rules. Well, as luck would have it, the requirements have changed a bit on me...big surprise. That's why I'm here again. Now, instead of getting just the most recent from a table, I have to get multiple "most recents" from a single table, one for each grouping that I have.
The first query you gave me works like a charm, Michael. I really appreciate the help.
While working through this, I've run into another issue that I'm not quite sure how to solve. Let me try to explain...
There are multiple customers in the database and there are multiple service providers in the database. Each service provider provides service to multiple customers (and it is possible that more than one service providers serves a single customer, but I'm not sure that really matters).
Anyway, I start by building a temporary table of service providers (named #Providers) that we want to report on. That table includes some basic information, such as the primary key and the provider's name. So, that table might look like this:
Next, I want to get all of the customers that are served by these service providers. Because Customers and Service Providers have a many-to-many relationship, there's an intermediate table (Customers_Providers) that links the two. So, to get my list of customers, I do something like this:
My problem is that I need to get the additional information from the #Providers table, such as Provider name. I'd really like that information to go along with the Customer Number. Can I do something like this?
Will that work? Something about it seems wrong to me, but maybe it's just that I've been looking at this stuff so long that I don't know what I'm doing any more. :roll: At least it's almost the weekend.
Thanks again for the help, folks. [ February 04, 2005: Message edited by: Corey McGlone ]
Joined: Dec 20, 2001
That seemed to have done the trick.
Well, after a lot of work, I think this stored procedure actually works. I still need to do some more error checking and what-not, but I think, at the very least, it will do what it's supposed to do.
Thanks, everyone, for the help.
author & internet detective