Matching data between two collections

Nick Kelly
Ranch Hand

Joined: Jan 28, 2005
Posts: 45
Hi - I'm looking for a clean way to perform data matching between two collections.

Let me try to explain the problem.

I have a system where I have certain data currently stored - call this currentData.

I make a call to an external system which has an updated version of this data in a different format - call this externalData.

I need to end up with a consolidated version of this data which will replace my current data - call this newData.

Previously we didn't care about what we had in our "currentData" - so the code looked like:



However, now we need to keep certain data from our current record that isn't returned from the external system.

So the updated code would look something like:



So far, easy enough (hopefully - if I've explained it properly!) - but here comes the twist.

The "matching" criteria we have is very "fuzzy".

For example, if we have a date, a description and a price, we may call it a match if all three are the same on a current record and an external record.

But if we haven't already found a match we may also call it a match if the description and the price are the same and the date is within 10 days.

Or we may also call it a match if two dates are the same even if the description and price don't match etc. etc.

I'm not sure of the best way to code this. For example, the following doesn't look nice:



Can any of you nice people suggest a better way to do this?
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 38519
You have a datum which matches a customer, and in the other repository a more recent datum matching the same customer, and you want to update the older repository? Is it something like that?

When I see customer and data, the word Map immediately comes into my head . . .

Is there any way you can take some sort of index out of the data to use as a "key"? Then you can use the remainder of the datum as a "value". Then, maybe, you can use the put or putAll methods of your destination Map to transfer the data. Have a look at the Map interface and the HashMap class and see whether you think that would work.
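As a rough illustration of the Map idea above, here is a minimal sketch that only works when an exact key can be formed. The Item class, its localNote field and the merge method are hypothetical names, not code from this thread; the key is assumed to be the date, description and price taken together.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeyedMerge {

    // Hypothetical record type: date, description and price come from the thread;
    // localNote stands in for data held locally that the external system does not return.
    static final class Item {
        final LocalDate date;
        final String description;
        final long priceInCents;
        final String localNote; // null on external records

        Item(LocalDate date, String description, long priceInCents, String localNote) {
            this.date = date;
            this.description = description;
            this.priceInCents = priceInCents;
            this.localNote = localNote;
        }

        // Composite key; lists compare element by element, so this works as a HashMap key.
        List<Object> key() {
            return Arrays.<Object>asList(date, description, priceInCents);
        }
    }

    // Builds newData from externalData, carrying localNote over on an exact key match.
    static List<Item> merge(List<Item> currentData, List<Item> externalData) {
        Map<List<Object>, Item> currentByKey = new HashMap<>();
        for (Item current : currentData) {
            currentByKey.put(current.key(), current);
        }
        List<Item> newData = new ArrayList<>();
        for (Item external : externalData) {
            Item match = currentByKey.get(external.key());
            String note = (match == null) ? null : match.localNote;
            newData.add(new Item(external.date, external.description, external.priceInCents, note));
        }
        return newData;
    }

    public static void main(String[] args) {
        Item current = new Item(LocalDate.of(2010, 1, 5), "Widget", 1999, "keep me");
        Item external = new Item(LocalDate.of(2010, 1, 5), "Widget", 1999, null);
        List<Item> newData = merge(Arrays.asList(current), Arrays.asList(external));
        System.out.println(newData.get(0).localNote);
    }
}
```

An external record whose key is not present in currentData simply ends up with no local data attached.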
Nick Kelly
Ranch Hand

Joined: Jan 28, 2005
Posts: 45
Thanks for your reply Campbell

I see what you're saying and that would work if we had a unique "key" we could use.

For example, if we could say that two records are the same when the date, description and price all match, and different otherwise.

The problem is, we need to be able to find the best match possible on more vague criteria.

For example, if we can't find a match on all three pieces of data then we could match on two.

Or if the date is within a week we may consider that a "match".

I don't think maps can help us in this situation as we don't have a "key" that we can use here.

I think I'll need to start with something like the following



Then I'm not sure where I go after this - maybe something like:



Any thoughts?
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 38519
You could have two items with the same description and same date, but different price. Then you are making the assumption that these are in fact a partial match and probably the same item. How can you be sure they are the same item?

How about putting them into sorted sets, using comparators for date, for date and price, or for date, price and description? Would that help, or not?
How about putting the lot into a List, sorting with a price comparator, then description comparator, then by date? That reverse order sorting will give you a List ordered by date, then description within date, then price within description (at least I think it would). Then you can create two Lists and iterate through them looking for matches. You may have to go backwards and forwards to get matches.
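A possible sketch of the sorting suggestion, assuming a hypothetical Item class with the three fields mentioned in the thread. Because List.sort is stable, sorting by price, then description, then date yields the same order as the single chained comparator used here.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ItemSorter {

    // Hypothetical record type using the three fields named in the thread.
    static final class Item {
        final LocalDate date;
        final String description;
        final long priceInCents;

        Item(LocalDate date, String description, long priceInCents) {
            this.date = date;
            this.description = description;
            this.priceInCents = priceInCents;
        }

        @Override
        public String toString() {
            return date + " " + description + " " + priceInCents;
        }
    }

    // Date first, then description within date, then price within description.
    static final Comparator<Item> BY_DATE_THEN_DESCRIPTION_THEN_PRICE =
            Comparator.comparing((Item item) -> item.date)
                    .thenComparing(item -> item.description)
                    .thenComparingLong(item -> item.priceInCents);

    // Returns a sorted copy and leaves the input list untouched.
    static List<Item> sorted(List<Item> items) {
        List<Item> copy = new ArrayList<>(items);
        copy.sort(BY_DATE_THEN_DESCRIPTION_THEN_PRICE);
        return copy;
    }

    public static void main(String[] args) {
        List<Item> items = new ArrayList<>();
        items.add(new Item(LocalDate.of(2010, 3, 2), "Widget", 500));
        items.add(new Item(LocalDate.of(2010, 3, 1), "Widget", 700));
        items.add(new Item(LocalDate.of(2010, 3, 1), "Gadget", 900));
        for (Item item : sorted(items)) {
            System.out.println(item);
        }
    }
}
```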

Anybody else got any ideas, please? I feel I am scraping the bottom of the barrel for ideas, and often other people can see another solution.
Nick Kelly
Ranch Hand

Joined: Jan 28, 2005
Posts: 45
Thanks for your help Campbell.

Yes it's possible we would end up matching data that is not actually the same item.

Unfortunately that is a known risk but we don't have a unique key to match the items so it is something we have to live with.

What I'm really looking for is some sort of algorithm to get the best fit match between two sets of data.

I was hoping there was something standard out there but I can't find anything.

I'm not sure how sorting would work as there are a number of different criteria that we are matching on.

For example, if we prioritise sorting on date, we may miss a match on description and price whose date is slightly further away than another record's.

Maybe something like the following would work (basically a cleaner version of what I had in the original post).
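One way to structure the fuzzy matching is as an ordered list of rules applied in passes, strictest first, so a loose rule can only claim records that no stricter rule has matched. This is a sketch under assumptions, not a standard algorithm and not the code originally posted here: the Item class, the RULES list and the match method are hypothetical, and the three rules and the 10-day window are taken from the examples earlier in the thread.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

public class TieredMatcher {

    // Hypothetical record type using the three fields named in the thread.
    static final class Item {
        final LocalDate date;
        final String description;
        final long priceInCents;

        Item(LocalDate date, String description, long priceInCents) {
            this.date = date;
            this.description = description;
            this.priceInCents = priceInCents;
        }
    }

    static boolean sameDescriptionAndPrice(Item a, Item b) {
        return a.description.equals(b.description) && a.priceInCents == b.priceInCents;
    }

    // Rules ordered from strictest to loosest (current record first, external second).
    static final List<BiPredicate<Item, Item>> RULES = Arrays.<BiPredicate<Item, Item>>asList(
            (c, e) -> c.date.equals(e.date) && sameDescriptionAndPrice(c, e),
            (c, e) -> sameDescriptionAndPrice(c, e)
                    && Math.abs(ChronoUnit.DAYS.between(c.date, e.date)) <= 10,
            (c, e) -> c.date.equals(e.date));

    // Maps each external record to the current record it matched, or to null.
    // Each current record can be claimed by at most one external record.
    static Map<Item, Item> match(List<Item> currentData, List<Item> externalData) {
        Map<Item, Item> matches = new LinkedHashMap<>();
        for (Item external : externalData) {
            matches.put(external, null);
        }
        List<Item> unclaimed = new ArrayList<>(currentData);
        for (BiPredicate<Item, Item> rule : RULES) {
            for (Item external : externalData) {
                if (matches.get(external) != null) {
                    continue; // already matched by a stricter rule
                }
                Iterator<Item> candidates = unclaimed.iterator();
                while (candidates.hasNext()) {
                    Item current = candidates.next();
                    if (rule.test(current, external)) {
                        matches.put(external, current);
                        candidates.remove();
                        break;
                    }
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Item current = new Item(LocalDate.of(2010, 1, 5), "Widget", 1999);
        Item external = new Item(LocalDate.of(2010, 1, 8), "Widget", 1999);
        Map<Item, Item> matches = match(Arrays.asList(current), Arrays.asList(external));
        System.out.println(matches.get(external) == current);
    }
}
```

The result is greedy: within a pass the first qualifying current record wins, so it does not guarantee the best overall assignment. New rules can be added to the list without touching the loop.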

Apurv Adarsh
Greenhorn

Joined: Feb 07, 2009
Posts: 12
We had a similar problem once, but we planned to write a data migration stored procedure for it. It was quite simple there.
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 38519
Apurv Adarsh wrote: . . . data migration stored procedure for it. It was quite simple there.
And how did you sort it out?
Apurv Adarsh
Greenhorn

Joined: Feb 07, 2009
Posts: 12
One query with exclusive-or statements.
 