File APIs for Java Developers
Manipulate DOC, XLS, PPT, PDF and many others from your application.
http://aspose.com/file-tools
The moose likes Performance and the fly likes How to handle Large Lists Big Moose Saloon
  Search | Java FAQ | Recent Topics
Register / Login


JavaRanch » Java Forums » Java » Performance
Reply Bookmark "How to handle Large Lists" Watch "How to handle Large Lists" New topic
Author

How to handle Large Lists

Paul Jacob
Greenhorn

Joined: Jul 25, 2009
Posts: 15

Hi,
I have an application where I get a very large list (approx range would be between 10,000 to 500, 000) of objects. I have to delete around 10-100 objects from this list around 10 times and also randomly access objects by index. So my basic question is should i decorate this list with any particular kind of list (I don't know the type of the list that is returned from the db since I am using spring jdbc) from java collections or apache collections?

help is much appreciated.
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 11691
How do you locate items to be removed?

What is the relation between deleting objects and randomly accessing objects by index? Is there a sequence of events or are they mixed.

Bill


Java Resources at www.wbrogden.com
Paul Jacob
Greenhorn

Joined: Jul 25, 2009
Posts: 15

Here is the sequence :
1) Retrieve large list from the db
2) create a new list which just has indexes of the above list (1...n)
3) shuffle this list
4) generate a random number between 1 and n and use this number to index into the shuffled list and get a value
5) this returned value is used to index into the large list ... an object is retrieved and and based on the value of this object, some similar objects are deleted from the large list
6) repeat from 2 to 5 (around 10 times).

As can be seen from the above, there are two lists a large object list from which items are deleted and an equally large list which just has integers but needs to be shuffled once.

The first list is obtained from our db layer where we use spring. The second list is ours..which we create and as of now is an arraylist.

Can you suggest any kind of optimization here?
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 11691
Perhaps I am missing something here, but why a second list, why not an array.

Since you know the size you will need you don't require the auto expansion of a list.

There is also no need that I can see for two randomizations - shuffle and random N.

based on the value of this object, some similar objects are deleted from the large list


How are these similar objects located?

Bill
Paul Jacob
Greenhorn

Joined: Jul 25, 2009
Posts: 15
This is kind of like a lottery application where the original large list is a list of lottery tickets and some random tickets need to be chosen from it.
So the idea of deletion is that when a lottery is chosen, I have to delete all tickets owned by the user whose ticket was chosen so that the next draw will not have his tickets. I am using the apache collection utils filter predicate for deletion.
I don't want to use an array for the shuffle because of the ready made shuffle method available in the collections class. I am ensuring that the list has the capacity while creating the list itself.
so my question remains....is arraylist a good enough list.. for the shuffle?

This message was edited 1 time. Last update was at by Morgan Stanley

William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 11691
Near as I can tell ArrayList is the simplest / least overhead implementation of List that you can use.

If it is your program that is doing the "draw" - why cant it just ignore hits that correspond to an existing winner and do another "draw"?

The overhead of iterating through the list and removing items must be huge.

Bill
Tim Holloway
Saloon Keeper

Joined: Jun 25, 2001
Posts: 11740

William Brogden wrote:Near as I can tell ArrayList is the simplest / least overhead implementation of List that you can use.

If it is your program that is doing the "draw" - why cant it just ignore hits that correspond to an existing winner and do another "draw"?

The overhead of iterating through the list and removing items must be huge.

Bill


If the application needs to access list elements randomly, a Map is MUCH faster. If you need both fixed-sequence AND random access, I believe the TreeMap will provide that in a more efficient form.


A lot the of modern-day software development platforms are designed to permit parcelling out work to those with the best aptitude for it. A lot of modern-day business is predicated on making one person do all the work, regardless of aptitude.
Ireneusz Kordal
Ranch Hand

Joined: Jun 21, 2008
Posts: 413
Morgan Stanley wrote: This is kind of like a lottery application where the original large list is a list of lottery tickets and some random tickets need to be chosen from it.
So the idea of deletion is that when a lottery is chosen, I have to delete all tickets owned by the user whose ticket was chosen so that the next draw will not have his tickets.

What is a final result of your algorithm ?
If - on the end - you need only n tickets chosen from 500.000, then it would be much faster to do the whole work on the database server using SQL/stored procedures,
and retrieve only this small final result.
Data transmission (through the network - I assume) is much more slower than copying it from the disk to the memory on the db server.

This message was edited 3 times. Last update was at by Ireneusz Kordal

Paul Jacob
Greenhorn

Joined: Jul 25, 2009
Posts: 15

Thanks to all for the replies.
The list has to be shuffled and then a random index selected to give a value. Think of this as a pack of cards which we randomly mix and then we pick out a card from it. I don't need the sorting capabilities of a treemap neither can i get the same level of randomness in a stored proc. So I guess the arraylist is the best option ( as william suggested).
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 12871

Morgan Stanley wrote:The list has to be shuffled and then a random index selected to give a value. Think of this as a pack of cards which we randomly mix and then we pick out a card from it.


But note that once you've shuffled the list, you have already applied enough randomization. Just taking the first element of the shuffled list (the top card from the shuffled pack) will give you a random element. Choosing some other element doesn't gain you anything.

Or to put it the other way around, if you wanted to choose a random element instead of the first element, then you wouldn't have to shuffle the list in the first place.
Pat Farrell
Rancher

Joined: Aug 11, 2007
Posts: 3433
I would take a completely different approach, and I sure would not have the whole list in memory. That doesn't scale.

Rather I'd use the Google Collections list functions, specifically the Predicate and perhaps the Function.transform.

Write a Predicate that does the random selection in one pass.

The key to performance using Google Collections is to work on Iterables, not List or Set collections. That way, you don't have to have the whole list in memory at once. Plus, its a ton easier to split this into parallel tasks for parallel cores.
 
 
subject: How to handle Large Lists
 
MyEclipse, The Clear Choice