aspose file tools*
The moose likes Java in General and the fly likes Sorting records using multiple criteria, before writing to file Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login


Win a copy of Soft Skills this week in the Jobs Discussion forum!
JavaRanch » Java Forums » Java » Java in General
Bookmark "Sorting records using multiple criteria, before writing to file " Watch "Sorting records using multiple criteria, before writing to file " New topic
Author

Sorting records using multiple criteria, before writing to file

Kumar Raja
Ranch Hand

Joined: Mar 18, 2010
Posts: 530
    
    2

Hello All,

I'm writing a code which makes a bunch of calls to database from different tables, process them, and then write to the flat file. We are looking at atleast 100000 records to be written at any given day. The data/value objects that are collected are flattened to String. The record structure has 40+ columns with different data types (mostly doubles). Before writing to the files, the data need to be sorted based on certain business rules and column values. I can use a comparator to achieve this. At this point, I thought about two solutions. One comparing on the data objects itself but that is not a feasible solution, as the data objects do not contain all the data and some of the data is dynamically calculated and written along with each record . The other option is to sort on flattened String (because the records are of fixed length and the position of each column with in record is fixed). I went with the second option, and actuall did some testing with around 50000 records and the whole sorting operation took less than 3 - 4 secs. Given that this is a kind of batch operation, there is some relaxation on the total execution time with in SLA. But I certainly don't want to see Heap memory exception.

Now, the logic which I'm employing is given below. Though this is working fine and as required, I appreciate, if some one could suggest me a better approach or tweaks.

We write a columnA in the first 15 positions of record, columnB from 18 - 26 position, columnC in 27 and we need to be able to sort them in ascending order first on columnA, then on columnB, then on columnC.

and I'm using Collections.soft(list,RecordComparator);

The only thing, I'm worried here is Mergesort , which I think is used behind the scenes by Collections.soft operation and though each record is approximately 500 bytes (I did String.getBytes[].length), would it impact the jvm memory if we are sorting 100000 records. Otherwise, I need to write my own custom Sort algorithm.



Regards
KumarRaja

Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 40052
    
  28
Kumar Raja wrote:. . . The data/value objects that are collected are flattened to String.
Joshua Bloch, in Effective Java™ says that’s a very bad idea. You should create classes which encapsulate all those data
. . . some of the data is dynamically calculated and written along with each record
Does that mean you are denormalising your database? If you have redundant data there, you can omit them from the data classes.
. The other option is to sort on flattened String
That does not sound easy.
. . . did some testing with around 50000 records and the whole sorting operation took less than 3 - 4 secs. . . . suggest me a better approach or tweaks.
That looks like premature optimisation. You should get your sorting logic correct and leave the optimisation to the JVM; it can do it far better than you can.
. . . We write a columnA in the first 15 positions of record, columnB from 18 - 26 position, columnC in 27 and we need to be able to sort them in ascending order first on columnA, then on columnB, then on columnC.
That looks error-prone and difficult to maintain.
. . . . Mergesort , which I think is used behind the scenes by Collections.soft
Have you looked in the API? It does in fact use a form of merge sort
. . . each record is approximately 500 bytes . . .
A substring 15 chars long occupies 30 bytes; a double occupies 8 bytes. I think your Strings will occupy more memory than doubles.
Also, what about the decimal points? How will these Strings sort:It only works if the decimal points align vertically.
It is ages since I tried any timings on sorting, but a merge sort on a 1000000 element int[] took about 0.4seconds, on an old computer which failed mechanically a few weeks ago (a Pentium IV!), so your computer will probably have much faster performance. The compareTo() method on Strings is probably very fast, but d1 > d2 is even faster. So you won’t lose in performance, even if you have to use > twice in a compare() method.

What happens if you try SORT BY in your SQL statement to retrieve those data? Most databases are heavily optimised for that sort of retrieval. You might even get your data sorted before delivery if you can fit enough SORT BYs in.
If you get the data partially sorted [edit]or unsorted[/edit], create data objects. Create several comparators, because you might need to sort the data in different ways. You can create Comparators which compare lots of data values, or few data values. If you compare few, you may have to sort several times which might take longer, but makes it much easier to alter the sorting if your business rules ever change.
And if you want column A in ascending order, then column B, you might have to sort in reverse order. To get this sort of output. . . you sort on date of birth, then first name, then last name. You can do that with merge sort, which is “stable” but not with quick sort.
Kumar Raja
Ranch Hand

Joined: Mar 18, 2010
Posts: 530
    
    2

Thanks Ritchie for each of the points below.

I just realized that, flattening to String is actually taking more bytes, than I initially thought. I did not take "A substring 15 chars long occupies 30 bytes; a double occupies 8 bytes. I think your Strings will occupy more memory than doubles. Also, what about the decimal points? How will these Strings sort:" into consideration.

The reason I went with working directly on String object is, since I get data from different sources, in different objects, I did not want to create another Value object which encapsulates only the data that is needed for me, (Though the data returns a lot). But I think it is a better option and easily maintainable, if I actually create my encapsulating object and work on that.



That looks like premature optimisation. You should get your sorting logic correct and leave the optimisation to the JVM; it can do it far better than you can.


This is also one of the reasons, I posted in this forum, to check if my algorithm used in my Comparator is correct (, Assuming that now we are working with a value object rather than a String).
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 40052
    
  28
I didn’t look at the Comparator code. I just saw substring and thought (eek!).
You can have several data objects and put them together into a Wrapper object.A bit tedious to write, but you can probably persuade an IDE to do 90% of the work for you. You can check in each of the Data classes that the values they encapsulate are in the correct ranges, eg weight > 0.0.
Jayesh A Lalwani
Bartender

Joined: Jan 17, 2008
Posts: 2448
    
  28

Couldn't you simplify your comparison logic to



Regarding Sorting in database, if you can manage to sort in database, you can "Stream" the data right into your file. You don't need to fetch all the data in memory, sort it and then write to the file. Instead you ask the database to sort it for you, retrieve the rows in batches, and write to the file as you retrieve. This will limit the memory consumption on your server.
Kumar Raja
Ranch Hand

Joined: Mar 18, 2010
Posts: 530
    
    2

Hi Jayesh,

Sorting the data in database is not an option, as our data is dependent on predefined views and not all sorting criteria available in the result set and some are calculated at run time. But as Ritchie mentioned, using a value object is a better option.

Yes, returning delta if !=0 is also a good option.


Kumar Raja
Ranch Hand

Joined: Mar 18, 2010
Posts: 530
    
    2

Campbell Ritchie wrote:I didn’t look at the Comparator code. I just saw substring and thought (eek!).
You can have several data objects and put them together into a Wrapper object.[code=java]/**
* blahblabhblah ... immutable class (if you are careful about the other fields) ...
*/
A bit tedious to write, but you can probably persuade an IDE to do 90% of the work for you. You can check in each of the Data classes that the values they encapsulate are in the correct ranges, eg weight > 0.0.


Thanks Ritchie. I will follow this approach.
 
I agree. Here's the link: http://aspose.com/file-tools
 
subject: Sorting records using multiple criteria, before writing to file