Hi everyone, any chance I could get some design tips?
I have one million Point objects whose x and y values are floats with 2 decimal places. These points all lie between a top left corner of (0, 0) and a bottom right corner of (400, 200).
When the user specifies a top left and a bottom right Point for a rectangle (axis-aligned, no slanted diagonal lines), I have to return all of my Point objects that are inside the rectangle specified by the user - in less than a second.
I tried making two TreeSets -- one ordered by X and one ordered by Y -- where both reuse the same Point objects (I'm avoiding instantiating double the points). The first TreeSet is instantiated without problems, but when my application attempts to instantiate the second TreeSet, I get an out-of-memory error on the heap. If that had worked, I was planning on using TreeSet's subSet method to get all the Point objects in the user's selected X range, doing the same on the Y TreeSet for the points matching the user's specified Y range, and then taking the intersection of those two result lists to arrive at a list of the points inside the user's specified rectangle.
Tweaking the JVM memory settings is not an option. Neither is attaching third-party jar libraries -- though I can probably use their source code if that would help. It has to run on a regular $300 Walmart laptop with 2 GB of RAM and a modern processor.
Is there a different design I should be using?
Thanks so much
Putting a million anything into memory is going to cause problems, both with memory and performance. My first reaction is to think of an on-disk structure rather than an in-memory structure. Think of using RandomAccessFile as an array. Then I thought, is it necessary to pre-populate the structure? If the points occur at regular intervals, it may make sense to create them on-the-fly given the coordinates.
Just wild guesses on my part.
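The RandomAccessFile-as-array idea above could be sketched roughly like this, assuming fixed-size records of two floats (8 bytes) per point; the class and method names here are my own invention, not an existing API:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch: treat a file as an array of fixed-size (x, y)
// records, 8 bytes each (two floats), so point i lives at offset i * 8.
public class PointFile implements AutoCloseable {
    private static final int RECORD_SIZE = 2 * Float.BYTES; // 8 bytes

    private final RandomAccessFile file;

    public PointFile(String path) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
    }

    // Write point i without holding the whole data set in memory.
    public void write(long index, float x, float y) throws IOException {
        file.seek(index * RECORD_SIZE);
        file.writeFloat(x);
        file.writeFloat(y);
    }

    // Read back point i; returns {x, y}.
    public float[] read(long index) throws IOException {
        file.seek(index * RECORD_SIZE);
        return new float[] { file.readFloat(), file.readFloat() };
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

A million such records would be about 8 MB on disk, and only the records a query touches ever enter memory.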
I don't think your idea could work. Have a look at how the retainAll() method is implemented in lists (in short, it has O(n^2) complexity).
Is there some upper limit on the size of the user-specified rectangle? Smaller rectangles will naturally be processed much faster than large ones.
I can think of two approaches:
1) Given the requested precision of two decimal digits, it seems to me that you can have 40,000 (400*100) individual points along one axis and 20,000 points (200*100) along the other (or maybe 40,001*20,001, but that is just a detail). The coordinates could be expressed as ints, as there are exactly 800,000,000 distinct points in all - it is trivial to combine the two coordinates into a number in this range (0 to 799,999,999). You could then create a gigantic BitSet and just set the bits corresponding to the existing points. The BitSet should take around 100 MB of memory in this case - good enough for your specification, but you definitely need to verify whether a BitSet of this size can be created!
When answering the question, you'd have to loop through all possible points inside the specified rectangle (easy) and create the list and instances of Point to be returned on the fly. Maybe doable in under one second for small rectangles.
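A rough sketch of this BitSet approach; I've used the inclusive 40,001 x 20,001 variant and scaled the coordinates by 100, and all class and method names here are mine:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Sketch: scale coordinates by 100 so they become ints, pack (x, y)
// into one bit index, and mark existing points in a single large BitSet.
public class BitSetIndex {
    private static final int Y_VALUES = 20_001;                  // y in 0..200.00
    private final BitSet points = new BitSet(40_001 * Y_VALUES); // ~100 MB

    private static int toBit(float x, float y) {
        int xi = Math.round(x * 100);                            // 0..40_000
        int yi = Math.round(y * 100);                            // 0..20_000
        return xi * Y_VALUES + yi;
    }

    public void add(float x, float y) {
        points.set(toBit(x, y));
    }

    // Scan every candidate cell inside the query rectangle and
    // materialize {x, y} pairs on the fly, as described above.
    public List<float[]> query(float x1, float y1, float x2, float y2) {
        List<float[]> result = new ArrayList<>();
        int xLo = Math.round(x1 * 100), xHi = Math.round(x2 * 100);
        int yLo = Math.round(y1 * 100), yHi = Math.round(y2 * 100);
        for (int xi = xLo; xi <= xHi; xi++) {
            for (int yi = yLo; yi <= yHi; yi++) {
                if (points.get(xi * Y_VALUES + yi)) {
                    result.add(new float[] { xi / 100f, yi / 100f });
                }
            }
        }
        return result;
    }
}
```

Note the query cost grows with the area of the rectangle, not the number of stored points, which is why it only stays under a second for smallish rectangles.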
2) The other option is to store the points in an array (or ArrayList) ordered by x, then by y, and simply loop over the array to find them. You could create an index along the longer axis which would, for every possible x value (40,000 of them), contain a starting and ending index in the array, so that you could quickly eliminate large parts of it (you're actually trying to achieve the same thing with the subsets). Unfortunately, values along the other axis cannot be efficiently indexed (you'd need 40,000 indexes of 20,000 elements each, and your memory would certainly blow up). However, maybe you could employ binary search to find the boundaries along the y axis for every non-empty x coordinate.
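Here is one way the sorted-array idea might look, assuming a single up-front sort by x then y and a hand-rolled binary search for the lower x bound (all names are my own):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: keep the points in parallel arrays sorted by x, then y.
// A binary search finds the slice covering the query's x range;
// only that slice is tested against the y bounds.
public class SortedPoints {
    private final float[] xs;
    private final float[] ys;

    // Coordinates are packed as (xi, yi) ints into one long each
    // so a single Arrays.sort orders them by x, then y.
    public SortedPoints(float[] x, float[] y) {
        long[] packed = new long[x.length];
        for (int i = 0; i < x.length; i++) {
            packed[i] = ((long) Math.round(x[i] * 100) << 32)
                      | (Math.round(y[i] * 100) & 0xFFFFFFFFL);
        }
        Arrays.sort(packed);
        xs = new float[x.length];
        ys = new float[x.length];
        for (int i = 0; i < packed.length; i++) {
            xs[i] = (packed[i] >>> 32) / 100f;  // high 32 bits: x
            ys[i] = ((int) packed[i]) / 100f;   // low 32 bits: y
        }
    }

    // First index whose x >= key (binary-search lower bound).
    private int lowerBound(float key) {
        int lo = 0, hi = xs.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (xs[mid] < key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    public List<float[]> query(float x1, float y1, float x2, float y2) {
        List<float[]> result = new ArrayList<>();
        for (int i = lowerBound(x1); i < xs.length && xs[i] <= x2; i++) {
            if (ys[i] >= y1 && ys[i] <= y2) {
                result.add(new float[] { xs[i], ys[i] });
            }
        }
        return result;
    }
}
```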
I think it just might be possible to iterate over a million entries and test them in under a second, depending on the hardware. However, the GC (or classloader) can kick in at any time and delay the execution of your code. Unless you go for real-time Java, there are no guarantees that something will execute fast enough every time.
Also note that I might have missed some clever arithmetic trick to speed things up.
Do you really need to create all those Point objects ahead of time? And do you really need to return all of them?
Can you instead just return the coordinates needed and create the objects on the fly?
This may not be an option, but it is one I would explore.
There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors
Thanks for the tips.
I made a home-made spatial index using a 2D collection (a HashMap of HashMaps of Lists) -- where the outer map is keyed by X and the inner map by Y, each holding the list of points for that spatial index grid cell.
Instead of using every exact x/y value, I just used the lowest whole-number multiple of 10. Example: when encountering a Point with a value of (19.9, 20.0), I put it into my 2D HashMap spatial index at grid location (19, 20).
My X HashMap has keys of e.g. 10, 20, 30 ... 400.
My Y HashMap has keys of e.g. 10, 20, 30 ... 200.
Then when I receive the user's rectangle query, I know (based on the rectangle's top-left and bottom-right corners) exactly which HashMap keys I need to look in for possible point matches. Once I get each list of candidate results, I can filter them based on their actual two-decimal float values.
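A sketch of what such a nested-map grid could look like, assuming 10-unit cells (the class, cell size, and names here are my own assumptions, not the poster's actual code):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a grid spatial index: points are bucketed by their
// (x / CELL, y / CELL) cell, and a query only inspects the buckets
// its rectangle overlaps, then filters on exact coordinates.
public class GridIndex {
    private static final int CELL = 10; // assumed cell size

    // x bucket -> (y bucket -> points in that cell)
    private final Map<Integer, Map<Integer, List<float[]>>> grid = new HashMap<>();

    public void add(float x, float y) {
        grid.computeIfAbsent((int) (x / CELL), k -> new HashMap<>())
            .computeIfAbsent((int) (y / CELL), k -> new ArrayList<>())
            .add(new float[] { x, y });
    }

    public List<float[]> query(float x1, float y1, float x2, float y2) {
        List<float[]> result = new ArrayList<>();
        for (int gx = (int) (x1 / CELL); gx <= (int) (x2 / CELL); gx++) {
            Map<Integer, List<float[]>> column = grid.get(gx);
            if (column == null) continue;
            for (int gy = (int) (y1 / CELL); gy <= (int) (y2 / CELL); gy++) {
                List<float[]> cell = column.get(gy);
                if (cell == null) continue;
                // Candidate cell: filter on the exact float coordinates.
                for (float[] p : cell) {
                    if (p[0] >= x1 && p[0] <= x2 && p[1] >= y1 && p[1] <= y2) {
                        result.add(p);
                    }
                }
            }
        }
        return result;
    }
}
```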
This ends up running lightning fast on a cheap computer, but not as fast as what I plan to change it to: a quad tree, a data structure I see a lot of people using in GIS and video game programming. Sun/Oracle doesn't provide one, but there are a couple of good examples around online.
Incidentally, I thought of a similar solution on my way home on Monday.
There might be a possibility for further speedup. Certain squares are completely inside the user-specified rectangle (e.g. if the user specifies (5,5) to (45,45), you know that the squares from (10,10) to (40,40), 9 in total, are completely inside), so you do not have to loop over individual points in these squares and can add them in bulk. You'll have to test points only in squares partially covered by the user's rectangle. In the mentioned example, you will skip testing 9 squares out of a total of 25. You'll be able to skip more when the user specifies a larger rectangle, which is good. But maybe you're already doing this and just didn't mention it because it's obvious.
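The "fully covered square" test from the paragraph above might be sketched like this, again assuming 10-unit cells (names mine):

```java
// Minimal sketch: a cell needs per-point filtering only when it is
// partially covered by the query rectangle; fully covered cells can
// have their point lists added in bulk.
public class CellCoverage {
    static final int CELL = 10; // assumed cell size

    // True when cell (gx, gy), spanning [gx*10, (gx+1)*10] x
    // [gy*10, (gy+1)*10], lies entirely inside [x1,x2] x [y1,y2].
    static boolean fullyCovered(int gx, int gy,
                                float x1, float y1, float x2, float y2) {
        return gx * CELL >= x1 && (gx + 1) * CELL <= x2
            && gy * CELL >= y1 && (gy + 1) * CELL <= y2;
    }
}
```

With the (5,5)-(45,45) rectangle from the example, this marks exactly the 9 cells from (10,10) to (40,40) as fully covered.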
If you want to speed it up a tiny bit, do not use HashMaps. Convert each coordinate into an integer (e.g. (int)(x/10.0)) and use a two-dimensional array. Arrays are much faster than HashMaps, but on the other hand you do only a handful of lookups, so the speed-up will be very small.
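The array variant could look roughly like this, assuming 10-unit cells over the 400 x 200 area (the 40 x 20 bucket dimensions and names are my assumption):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: with 10-unit cells over a 400 x 200 area, a plain 40 x 20
// array of buckets replaces the nested HashMaps, so a cell lookup is
// two integer divisions and an array index instead of two map lookups.
public class ArrayGrid {
    @SuppressWarnings("unchecked")
    private final List<float[]>[][] cells = new List[40][20];

    public void add(float x, float y) {
        int gx = Math.min((int) (x / 10.0), 39); // clamp x == 400.0
        int gy = Math.min((int) (y / 10.0), 19); // clamp y == 200.0
        if (cells[gx][gy] == null) cells[gx][gy] = new ArrayList<>();
        cells[gx][gy].add(new float[] { x, y });
    }

    public List<float[]> cell(int gx, int gy) {
        return cells[gx][gy] == null ? List.of() : cells[gx][gy];
    }
}
```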
Another way is to focus on how your results are returned. Do you return a collection? You might be able to implement your own collection that joins up several independent lists. You would then put all the fully contained squares into the collection and create just one array for the points from partially covered squares, adding it to the collection later. You'll save the time to copy some of the points, but the speed-up will again be very small. If your users just iterate over the collection, you'll add a slight overhead to it, but it could still save some time overall (do measure it).
The last thing I can think of is converting your floats to ints (by multiplying them by 100) and using ints for comparisons instead of floats. However, on modern computers there may be no difference in speed after all.
Finally, if it is still too slow, try decreasing the size of the squares; you'll have fewer points to actually test for inclusion. However, measure the effect, as decreasing the size too much would certainly be counterproductive.
subject: Picking a Collection and Design for querying 1G Point objects