This week's book giveaway is in the Java 8 forum.
We're giving away four copies of Java 8 in Action and have Raoul-Gabriel Urma, Mario Fusco, and Alan Mycroft on-line!
See this thread for details.
The moose likes JDBC and the fly likes retrieve data from huge result set Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login


Win a copy of Java 8 in Action this week in the Java 8 forum!
JavaRanch » Java Forums » Databases » JDBC
Bookmark "retrieve data from huge result set" Watch "retrieve data from huge result set" New topic
Author

retrieve data from huge result set

Marilyn de Queiroz
Sheriff

Joined: Jul 22, 2000
Posts: 9043
    
  10
I have a result set with 207,000 rows. Each ticket may be associated with one or more rows. I need to count the number of rows for each ticket and find the row for each ticket which has the most recent timestamp.

I'm thinking that I should let the database do the sorting (use "order by") because it seems like that result set is gonna take a lot of memory, but I'm wondering if I might be able to do something in addition to speed up the process.
[ June 03, 2006: Message edited by: Marilyn de Queiroz ]

JavaBeginnersFaq
"Yesterday is history, tomorrow is a mystery, and today is a gift; that's why they call it the present." Eleanor Roosevelt
Stefan Wagner
Ranch Hand

Joined: Jun 02, 2003
Posts: 1923

I guess you know database-indicees?


http://home.arcor.de/hirnstrom/bewerbung
Marilyn de Queiroz
Sheriff

Joined: Jul 22, 2000
Posts: 9043
    
  10
I know that database indices exist and that a runstat/reorg will speed up the sql call, but how will that help me find the last row for each ticket.

Oh, I just had a thought, what if I compare each row's ticket number to the next row's ticket number. If they're different, the current row must be the last row for that ticket.

Or is there a better way?
Jeanne Boyarsky
internet detective
Marshal

Joined: May 26, 2003
Posts: 29287
    
140

Marilyn,
If you use a desc (descending) index, the query for finding the last call will go much faster.

If your database supports a mechansim to only get the first (really the last since the sort is descending) row, the time will go down a lot for reading the data and transporting it across the network.

For one ticket, it would be something like:
select a.* from table a where exists (

For one ticket # it would be:
select *
from table
for numRows = 1 <-- syntax here varies greatly by database

For multiple tickets, you would need to put this in a subquery. I'm not sure how to do that part. Feel free to move this to JDBC to get more help


[Blog] [JavaRanch FAQ] [How To Ask Questions The Smart Way] [Book Promos]
Blogging on Certs: SCEA Part 1, Part 2 & 3, Core Spring 3, OCAJP, OCPJP beta, TOGAF part 1 and part 2
Jeanne Boyarsky
internet detective
Marshal

Joined: May 26, 2003
Posts: 29287
    
140

Originally posted by Marilyn de Queiroz:
Oh, I just had a thought, what if I compare each row's ticket number to the next row's ticket number. If they're different, the current row must be the last row for that ticket.

I'm not sure how that helps you. You still need the sort. And even if you realize you can stop going though records, you need to read the whole result set for the next ticket. I may be missing the point on this thought though.

I had another thought too. You may well get faster response time if you separate the two problems into two queries. You can efficiently get the # records of each ticket in one query. Trying to get the count and max time is going to be one monster query which will be hard for the database to optimize.
Stefan Wagner
Ranch Hand

Joined: Jun 02, 2003
Posts: 1923

not doing much sql the last years, but in former times, I would have tried:

Isn't that correct?
Or:

after already modifying and correctin the sql - doesn't this work too:


[ June 03, 2006: Message edited by: Stefan Wagner ]
[ June 03, 2006: Message edited by: Stefan Wagner ]
Marilyn de Queiroz
Sheriff

Joined: Jul 22, 2000
Posts: 9043
    
  10
Originally posted by Jeanne Boyarsky:
I'm not sure how that helps you. You still need the sort. And even if you realize you can stop going though records, you need to read the whole result set for the next ticket. I may be missing the point on this thought though.


Yes I was thinking of doing that after the sort. However, the descending sort and getting the first line sounds even better.

I had another thought too. You may well get faster response time if you separate the two problems into two queries. You can efficiently get the # records of each ticket in one query. Trying to get the count and max time is going to be one monster query which will be hard for the database to optimize.

I think using 2 queries is not an option for me at this time.
Marilyn de Queiroz
Sheriff

Joined: Jul 22, 2000
Posts: 9043
    
  10
For the other part of the query (besides the count part), let me try to clarify a little bit.

I need to do something like:
for each PROBLEM_ID
select
the first 500 chars from most recent ENTRY of the current PROBLEM_ID where ENTRY_TYPE='ADD NOTE'
(each problem_id can have 0 to an infinity 'ADD NOTE' entries)

I'm using a DB2 database. DB2 OS/390 7.1.1
[ June 03, 2006: Message edited by: Marilyn de Queiroz ]
Jeanne Boyarsky
internet detective
Marshal

Joined: May 26, 2003
Posts: 29287
    
140

Stefan,
Your query is perfect for getting the maximum timestamp. Since Marilyn needs other columns in the row with the maximum timestamp, the query gets a lot more complicated.
Jeanne Boyarsky
internet detective
Marshal

Joined: May 26, 2003
Posts: 29287
    
140

Marilyn,
db2 is good about letting you nest queries which helps in general (not necessarily here.)

The syntax for getting the first row is "for fetch first 1 rows only". However, I believe this can only be used at the end of a query, not in a subquery. I'm not positive though and don't have access to db2 at the moment to try it out.

Another alternative is to use a stored procedure if this is allowed in your architecture.
sinasi susam
Ranch Hand

Joined: Jul 15, 2005
Posts: 67
" because it seems like that result set is gonna take a lot of memory, but I'm wondering if I might be able to do something in addition to speed up the process."

Why do you think so?

ResultSet doesn't bring all datas at once.It stores with fetch size amount of data.aAd with every next() it takes one from fetched datas.If it comes to the end of the fetch,then it goes to db and fetches next.

Good news is that you dont have to worry about the memory usage

Bad news is ResultSet doesnt know how many records the sql brought.Because of its nature i told you above.
Therefore you can not know how many rows brought.
[ June 05, 2006: Message edited by: sinasi susam ]
Marilyn de Queiroz
Sheriff

Joined: Jul 22, 2000
Posts: 9043
    
  10
Any way I can know what the fetch size is? When I run it in a tool, I can set it to only display the first 100 rows or whatever, but when I run it from my app, how does that work?

Jeanne, thank you. I'm going to try that today.
Michael Matola
whippersnapper
Ranch Hand

Joined: Mar 25, 2001
Posts: 1740
    
    2
but how will that help me find the last row for each ticket.

Marilyn,

I've fielded questions a couple times here about this kind of date logic in SQL. Take a look at

this

and this.

(I did a post once too giving several reasons why the correlated subquery approach is superior to a different approach, but I can't see to find it.)

Anyhow, if you can't get the date logic to work after reading those old posts, post your table structure (including keys) and some sample data and results. That'll make it easier to write the query.

[Edited to fix broken links. -MM]
[ June 05, 2006: Message edited by: Michael Matola ]
Michael Matola
whippersnapper
Ranch Hand

Joined: Mar 25, 2001
Posts: 1740
    
    2
And here's the post where I defend the correlated subquery approach.
Michael Matola
whippersnapper
Ranch Hand

Joined: Mar 25, 2001
Posts: 1740
    
    2
Trying to get the count and max time

(and retrieve the fields in the row with the max timestamp)

is going to be one monster query which will be hard for the database to optimize.

It's nicer to have this split into two queries (one for the count another for the max row), but it can be done in one.



As for performance, I'd say test it and see.
[ June 05, 2006: Message edited by: Michael Matola ]
Marilyn de Queiroz
Sheriff

Joined: Jul 22, 2000
Posts: 9043
    
  10
In table named problem_history, I have
problem_id, entry_date, entry_time, entry_type, entry
for input.

I need a table consisting of only three columns:
problem_id,
entry (if entry_type is 'ADD NOTE', get the most recent entry),
count of rows where entry_type is 'Call Attached'

I can't have more than one row per problem_id and I can't have more than these three columns in my final table.

"fetch first 1 rows only" doesn't seem to work because it doesn't do it on a "per problem_id" basis. I was already working on the subqueries, but I'm not getting very far very fast.
Marilyn de Queiroz
Sheriff

Joined: Jul 22, 2000
Posts: 9043
    
  10
Here is one problem_id



Each row is another problem_id.
[ June 05, 2006: Message edited by: Marilyn de Queiroz ]
Marilyn de Queiroz
Sheriff

Joined: Jul 22, 2000
Posts: 9043
    
  10
This query works (sort of)


I get


It looks like I got the correct number of "Call Attached" calls, but I'm still getting too many "ADD NOTE" entries, one row for each entry rather than just the most recent.
[ June 14, 2006: Message edited by: Marilyn de Queiroz ]
Michael Matola
whippersnapper
Ranch Hand

Joined: Mar 25, 2001
Posts: 1740
    
    2
OK, you've given a lot more to go on, but still not everything.

In table named problem_history, I have
problem_id, entry_date, entry_time, entry_type, entry
for input.


OK. (It's real important to know that you've have entry_date and entry_time as two separate fields and not a single date/time field. That affects the date logic.)

I need a table consisting of only three columns:
problem_id,
entry (if entry_type is 'ADD NOTE', get the most recent entry),
count of rows where entry_type is 'Call Attached'


OK, but a table? I thought you were just writing a query here? Are you working with temporary tables? Shuttling some data around?

You say "if entry_type is 'ADD NOTE', get the most recent entry". What to do if entry_type is not 'ADD NOTE'? This is really important to know.

I can't have more than one row per problem_id

OK.

Here is one problem_id

Very good.

These 14 rows need to be converted to one row with three columns which would look something like this:

Converted. So if you're writing some conversion process is it really necessary to do all this with a single SQL statement?

I'm not sure what you mean by "keys".

Which field(s) in the table have been set up to uniquely identify a row. It's my assumption based on the table structure and data that you list that the primary key is the combination of three fields: problem_id, entry_date, and entry_time.

...

You've said

entry (if entry_type is 'ADD NOTE', get the most recent entry)

but elsewhere you've said

the first 500 chars from most recent ENTRY of the current PROBLEM_ID where ENTRY_TYPE='ADD NOTE'

Do you mean for each problem_id, find all rows where entry_type = 'ADD NOTE', then find the most recent ENTRY from among that subset?

Or do you mean for each problem_id, find the most recent ENTRY, then do one thing if the ENTRY_TYPE = 'ADD NOTE' and do something else if the ENTRY_TYPE doesn't equal 'ADD NOTE'?

In other words, what to do if this 15th row of data is the absolute most recent:



...

Anyhow, here's something a lot closer given the new info. But I didn't know what to do about the entry_type = 'ADD NOTE', so for now I didn't do anything.




(If the date/time was a single field, we would only have to take the max on that single field, instead of what I've done here: first taking the max date at (A), then taking the max time at (B) for the already-determined max date at (C).)
Michael Matola
whippersnapper
Ranch Hand

Joined: Mar 25, 2001
Posts: 1740
    
    2
This query works (sort of)

...from pfxtsd.problems p

:shocked:

Where'd the table "problems" come from? What's the story on that?

but I'm still getting too many "ADD NOTE" entries, one row for each entry rather than just the most recent.

You haven't qualified on dates anywhere.
Marilyn de Queiroz
Sheriff

Joined: Jul 22, 2000
Posts: 9043
    
  10
Originally posted by Michael Matola:
I need a table consisting of only three columns:
problem_id,
entry (if entry_type is 'ADD NOTE', get the most recent entry),
count of rows where entry_type is 'Call Attached'


OK, but a table? I thought you were just writing a query here? Are you working with temporary tables? Shuttling some data around?

Ok, not a table, exactly. When I run the query and get the results in a ResultSet, I (actually the code I'm supposed to use) just turn around and write to a csv file (with no other manipulation), but that file looks just like a table with headers, rows and columns. So I'm thinking in terms of writing from one table to another. Not too much of a stretch. I could override that method if I need to and make it have two queries, but I'd rather not if I don't have to.

You say "if entry_type is 'ADD NOTE', get the most recent entry". What to do if entry_type is not 'ADD NOTE'? This is really important to know.

If entry_type is not 'ADD NOTE' and entry_type is not "Call Attached", we can ignore that record. If it is "Call Attached", we count it. If it is 'ADD NOTE', we care only if it is the most recent. If there is no 'ADD NOTE' entry_type for that problem_id, the result is null. Clear as mud?

These 14 rows need to be converted to one row with three columns which would look something like this:

Converted. So if you're writing some conversion process is it really necessary to do all this with a single SQL statement?

Sorry, bad choice of words. The code in the class I'm using only makes one sql query and writes the file based on that query. I'll think about how easy it would be to change that.

I'm not sure what you mean by "keys".

Which field(s) in the table have been set up to uniquely identify a row. It's my assumption based on the table structure and data that you list that the primary key is the combination of three fields: problem_id, entry_date, and entry_time.

I believe that's true although I have been unable to get my describe statement to work on this particular table (appears to be a permissions problem). There is also a column titled probhist_id which is assigned which could also be the primary key. It's a number which means nothing to me.

...

You've said

entry (if entry_type is 'ADD NOTE', get the most recent entry)

but elsewhere you've said

the first 500 chars from most recent ENTRY of the current PROBLEM_ID where ENTRY_TYPE='ADD NOTE'

Do you mean for each problem_id, find all rows where entry_type = 'ADD NOTE', then find the most recent ENTRY from among that subset?


yes.

[B][I]In other words, what to do if this 15th row of data is the absolute most recent:


[/I][/B]

Just add it to the total number of Call_Attached_Totals for this problem_id.

...

[B][I]Anyhow, here's something a lot closer given the new info. But I didn't know what to do about the entry_type = 'ADD NOTE', so for now I didn't do anything.




(If the date/time was a single field, we would only have to take the max on that single field, instead of what I've done here: first taking the max date at (A), then taking the max time at (B) for the already-determined max date at (C).) [/I][/B]

Thank you so much, Mike, I'll try this first thing in the morning.
[ June 06, 2006: Message edited by: Marilyn de Queiroz ]
Marilyn de Queiroz
Sheriff

Joined: Jul 22, 2000
Posts: 9043
    
  10
Originally posted by Michael Matola:
This query works (sort of)

...from pfxtsd.problems p



Where'd the table "problems" come from? What's the story on that?

I need the join on the problems table to get the "first_contact_id" and the "close_date" and the "open_date" because this table is shared by clients other than FTT and open_date and close_date are limiting dates for the problem_id's.



LFMDate = LastFullMonth
Michael Matola
whippersnapper
Ranch Hand

Joined: Mar 25, 2001
Posts: 1740
    
    2
I'm still confused by some of this, so I'll try to be more clear about what I'm asking.

Let's go ahead and assume my proposed 15th row of data *is* part of the data you're operating on.

MM: You say "if entry_type is 'ADD NOTE', get the most recent entry". What to do if entry_type is not 'ADD NOTE'? This is really important to know.

MdQ: If entry_type is not 'ADD NOTE' and entry_type is not "Call Attached", we can ignore that record. If it is "Call Attached", we count it. If it is 'ADD NOTE', we care only if it is the most recent. If there is no 'ADD NOTE' entry_type for that problem_id, the result is null. Clear as mud?

When you say the result is null what exactly do you have in mind? In databases you can have a column/field value (something stored on a table) that's null. Or you can introduce a null as a column/field value as a result of a query, which is fine. [Results (2) below.]

Or do you mean that if there's an 'ADD NOTE' entry_type for that problem_id, but it's not the most recent overall, no row at all for that problem_id should show up in the results. [Results (3) below.]

MM: Do you mean for each problem_id, find all rows where entry_type = 'ADD NOTE', then find the most recent ENTRY from among that subset?

MdQ: yes.

MM: In other words, what to do if this 15th row of data is the absolute most recent:

MdQ: Just add it to the total number of Call_Attached_Totals for this problem_id.

Again, assuming my proposed 15th row is part of the data, should the results in this case be this:

(1) 6 Call Attached rows for problem_id 12295318. Last_Add_Note is string 14 because it's the most recent row across the set of rows where entry_type is 'ADD NOTE'.



Or this:

(2) 6 Call Attached rows for problem_id 12295318. Last_Add_Note is blank/null/empty (I used <> as a placeholder; we can decide the exact value later) because the most recent (overall) row (15) does not happen to have an entry_type = 'ADD NOTE'. We still want problem_id 12295318 to show up in the results.



Or this:

(3) Problem_id doesn't show up in the results because it's most recent (overall) row (15) does not happen to have an entry_type = 'ADD NOTE'.



Or something else I haven't thought of.
...

Similarly, if there's data like this,



Do you want that problem_id in the results? If so, like this?



Or not in the results at all?

Or is there some guarantee in the data that for every problem_id there is at least one row that has an entry_type = 'ADD NOTE'?
Badri Sarma
Ranch Hand

Joined: Apr 01, 2003
Posts: 72
you can limit the size of featching the records from the table,
below is the link
http://java.sun.com/j2se/1.4.2/docs/guide/jdbc/getstart/resultset.html


Thanks<br />Badri
Marilyn de Queiroz
Sheriff

Joined: Jul 22, 2000
Posts: 9043
    
  10
[B]Again, assuming my proposed 15th row is part of the data, should the results in this case be this:

(1) 6 Call Attached rows for problem_id 12295318. Last_Add_Note is string 14 because it's the most recent row across the set of rows where entry_type is 'ADD NOTE'.

[/B]

I believe this is the one they want.

...

[B]Similarly, if there's data like this,



Do you want that problem_id in the results? If so, like this?

[/B]

That's my understanding

Or is there some guarantee in the data that for every problem_id there is at least one row that has an entry_type = 'ADD NOTE'?

Definitely no guarantee.
[ June 06, 2006: Message edited by: Marilyn de Queiroz ]
Marilyn de Queiroz
Sheriff

Joined: Jul 22, 2000
Posts: 9043
    
  10
Thank you, Badrinath. I had never noticed that ResultSet method before.
Michael Matola
whippersnapper
Ranch Hand

Joined: Mar 25, 2001
Posts: 1740
    
    2
I think we got the query part worked out in the other thread dealing with this.
Marilyn de Queiroz
Sheriff

Joined: Jul 22, 2000
Posts: 9043
    
  10
I couldn't get the
select a, b, ( select c ) stuff to work. This is what I finally ended up with.


[ June 14, 2006: Message edited by: Marilyn de Queiroz ]
 
I agree. Here's the link: http://aspose.com/file-tools
 
subject: retrieve data from huge result set
 
Similar Threads
Query Question
Table Rows
ArrayList too slow?
Displaying Dynamic Tables
limit result through JDBC