Flagging duplicate lines when comparing the input from two separate text files

Mark Johnstone
Greenhorn

Joined: Sep 25, 2010
Posts: 17
Hi,

I am trying to write a program to compare the input for each line from two separate text files and flag the duplicates.

When I run the program, I get no output. Thanks for any help you can give in advance.

DB1_records.txt file contents
--------------------------------------------------------------------------------
12345678
01245673


DB2_records.txt file contents
--------------------------------------------------------------------------------
12345678
01245642


fred rosenberger
lowercase baba
Bartender

Joined: Oct 02, 2003
Posts: 11139
    
  16

If I recall correctly, each time you call readLine(), it, well... reads a line.

So, on line 13, you read the first line of file1
on line 15, you read the first line of file 2
in line 17, you read the SECOND line of file 1 and file 2, which are not equal.

Your program then exits.
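
A minimal sketch of the usual way around this, assuming a position-by-position comparison is what is wanted: call readLine() exactly once per file on each pass and keep the result in a variable. The class and method names are made up for illustration, and StringReader stands in for the two text files.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class CorrespondingLines {
    // Hypothetical helper: returns the lines that are identical at the
    // same position in both inputs.
    static List<String> matchingLines(Reader first, Reader second) {
        List<String> matches = new ArrayList<>();
        try (BufferedReader in1 = new BufferedReader(first);
             BufferedReader in2 = new BufferedReader(second)) {
            while (true) {
                // One readLine() per file per pass; the values are kept in variables.
                String a = in1.readLine();
                String b = in2.readLine();
                if (a == null || b == null) {
                    break; // stop as soon as either input runs out
                }
                if (a.equals(b)) {
                    matches.add(a);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return matches;
    }

    public static void main(String[] args) {
        // In-memory stand-ins for DB1_records.txt and DB2_records.txt
        Reader db1 = new StringReader("12345678\n01245673\n");
        Reader db2 = new StringReader("12345678\n01245642\n");
        System.out.println(matchingLines(db1, db2)); // prints [12345678]
    }
}
```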


There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109
    
    6

Mark Johnstone wrote:Hi,

I am trying to write a program to compare the input for each line from two separate text files and flag the duplicates.


In addition to the coding error already pointed out, you need to be aware of a couple things:

1) Before you write a single line of code, you should make sure that you understand the requirements completely and can precisely describe them in English. (This doesn't apply in real-world on-the-job programming so much, but for beginners I would consider it the word of the gods.)

2) The task you describe is tricky to describe precisely.

3) As you've stated it here, you have a very vague description.

Now, maybe you just deliberately provided a quick summary here for brevity's sake, and you actually understand exactly what you need to do. On the other hand, you might not have considered the problem that carefully.

For instance, the simplest interpretation of your problem statement is to indicate where corresponding lines match.


In this case, you compare file1, line 1 to file2, line1; F1L2 to F2L2, and so on. The first lines match, the second lines don't, and the third lines do.

But what about this?


Here there are zero matches according to the simple corresponding-line comparison, but our human perceptions are great at detecting that 4 lines are identical, but just shifted down by one. If your requirements say you're supposed to detect those, it gets trickier.

Or, maybe you're just supposed to find any line in F1 that occurs anywhere at all in F2?

And what if a given line appears N times in F1 and M times in F2?

You don't need to address these comments here; I just want to make sure you're aware of the general issue of clear requirements and how it applies in this particular case.
Brian Burress
Ranch Hand

Joined: Jun 30, 2003
Posts: 122
I think Fred answers your immediate question.

Depending on what you are really trying to accomplish, you may need to look at how you are processing the files to account for missing records. In the file samples provided, the first record is a dup and should be "matched". How do you "resynch" position in the file as you process it?

The algorithm I think you were trying to use seems to be: read file 1, read file 2; compare; then read both files again. Given the example below you'd flag "a" but not "c" or "d", as the comparisons would be line 1 vs line 1; line 2 vs line 2; etc.

File 1:
a
b
c
d


File 2
a
c
d


You'll want to consider a file match solution which would require some ordering of records.
As another option depending on the sizes of the files, consider some java objects which the file could be stored in for comparisons (ex: load file 1 into a list of some sort; and then process file 2 and look up where the line exists in the list).
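
A rough sketch of that second option, assuming both files are small enough to hold in memory. The class and method names are invented for illustration, and the lines are given as in-memory lists rather than read from disk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ListLookup {
    // Hypothetical helper: returns each line of file 2 that also occurs
    // anywhere in file 1, in file 2's order.
    static List<String> linesAlsoInFirst(List<String> file1Lines, List<String> file2Lines) {
        List<String> found = new ArrayList<>();
        for (String line : file2Lines) {
            // List.contains is a linear scan, which is fine for small files
            if (file1Lines.contains(line)) {
                found.add(line);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> file1 = Arrays.asList("a", "b", "c", "d");
        List<String> file2 = Arrays.asList("a", "c", "d");
        System.out.println(linesAlsoInFirst(file1, file2)); // prints [a, c, d]
    }
}
```

For larger files, a HashSet in place of the list would make each lookup cheaper, at the cost of losing duplicate counts.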
Brian Burress
Ranch Hand

Joined: Jun 30, 2003
Posts: 122
Jeff Verdegan wrote: . . . But what about this? . . .


Your comments are much more detailed than mine. We need a "synchronized" option for posting replies ;)
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 37874
    
  22
Jeff Verdegan wrote: . . . This doesn't apply in real-world on-the-job programming so much, . . .
Oh yes, it does! (Yes, it’s still pantomime season).

Maybe in real-world situations the requirements are simpler and more intuitive.
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109
    
    6

Campbell Ritchie wrote:
Jeff Verdegan wrote: . . . This doesn't apply in real-world on-the-job programming so much, . . .
Oh yes, it does!


I can't remember ever having complete requirements before coding started. Is a pure Waterfall method even used any more?

(Yes, it’s still pantomime season).


Erm?
Mark Johnstone
Greenhorn

Joined: Sep 25, 2010
Posts: 17
I am trying to accomplish Jeff's second approach.

Mark Johnstone
Greenhorn

Joined: Sep 25, 2010
Posts: 17
Can you help with the program? Though, your analysis is spot on.
Wendy Gibbons
Bartender

Joined: Oct 21, 2008
Posts: 1107

Jeff Verdegan wrote:
Campbell Ritchie wrote:
Jeff Verdegan wrote: . . . This doesn't apply in real-world on-the-job programming so much, . . .
Oh yes, it does!


I can't remember ever having complete requirements before coding started. Is a pure Waterfall method even used any more?

(Yes, it’s still pantomime season).


Erm?

The response is "oh no it doesn't", not "erm", and it carries on until someone is kidnapped by the monster.

Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109
    
    6

Mark Johnstone wrote:Can you help with the program? Though, your analysis is spot on.


You seem to have missed my very first point.
Mark Johnstone
Greenhorn

Joined: Sep 25, 2010
Posts: 17
If I change the looping structure that will complicate things further
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109
    
    6

Wendy Gibbons wrote:
Jeff Verdegan wrote:
Campbell Ritchie wrote:
Jeff Verdegan wrote: . . . This doesn't apply in real-world on-the-job programming so much, . . .
Oh yes, it does!


I can't remember ever having complete requirements before coding started. Is a pure Waterfall method even used any more?

(Yes, it’s still pantomime season).


Erm?

The response is "oh no it doesn't", not "erm", and it carries on until someone is kidnapped by the monster.



D'OH! Can't believe I missed that. "Pantomime" threw me off.

I mean, Oh no it doesn't!
Mark Johnstone
Greenhorn

Joined: Sep 25, 2010
Posts: 17
Please can we come back to the problem? I need to use this tool for a project I am working on. Thanks for any help you can give.
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109
    
    6

Mark Johnstone wrote: Please can we come back to the problem? I need to use this tool for a project I am working on. Thanks for any help you can give.


You're still ignoring my very first point: you have not clearly and precisely specified your requirements. Additionally, you haven't made clear exactly what specific problem you're having now.
Mark Johnstone
Greenhorn

Joined: Sep 25, 2010
Posts: 17
My requirements are irrespective of the order that the numbers are listed, matches from both columns should be flagged

e.g.

List 1
1
5
6
4

List 2
1
2
4

The output would be 1 and 4
Joanne Neal
Rancher

Joined: Aug 05, 2005
Posts: 3410
    
  12
Have you fixed the problem that Fred pointed out in the very first reply? If so, and it's still not working, post your new code.


Joanne
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7484
    
  18

Mark Johnstone wrote:My requirements are irrespective of the order that the numbers are listed,

In which case you are not comparing input (cf. diff), you are comparing values.

My suggestion: Read each file into a Set and run Set1.retainAll(Set2). That will give you the values that are common to both.
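
A small sketch of that idea, using the two example lists from earlier in the thread as in-memory stand-ins for the files. The class and method names are invented. Note that retainAll modifies the set it is called on, so the sketch works on a copy.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

class CommonValues {
    // Hypothetical helper: returns the values present in both collections.
    static Set<String> common(Collection<String> first, Collection<String> second) {
        Set<String> result = new TreeSet<>(first); // copy, so the input is left untouched
        result.retainAll(second);                  // keep only the values also found in second
        return result;
    }

    public static void main(String[] args) {
        Set<String> list1 = new TreeSet<>(Arrays.asList("1", "5", "6", "4"));
        Set<String> list2 = new TreeSet<>(Arrays.asList("1", "2", "4"));
        System.out.println(common(list1, list2)); // prints [1, 4]
    }
}
```

Because a Set holds each value at most once, a line repeated in either file appears only once in the result.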

Winston


Isn't it funny how there's always time and money enough to do it WRONG?
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109
    
    6

Mark Johnstone wrote:My requirements are irrespective of the order that the numbers are listed, matches from both columns should be flagged

e.g.

List 1
1
5
6
4

List 2
1
2
4

The output would be 1 and 4


You're still missing my comments about precise and complete requirements.

Here's another example: If a line appears M times in F1 and N times in F2, how many times do you want it to appear in the result? Possible values I can imagine are: 1, M, N, min(M, N), max(M, N).

Note that simply answering that question does not constitute precise, complete requirements. It's only one example of what's not covered by what you've said so far.
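
To make the difference concrete, here is a sketch of just one of those policies, min(M, N), chosen purely for illustration and not because the requirements call for it. The class and method names are invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DuplicatePolicy {
    // Counts how many times each line occurs.
    static Map<String, Integer> countLines(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            counts.merge(line, 1, Integer::sum);
        }
        return counts;
    }

    // One possible policy: a line occurring M times in f1 and N times in f2
    // is reported min(M, N) times.
    static List<String> commonWithMinCount(List<String> f1, List<String> f2) {
        Map<String, Integer> remaining = countLines(f2);
        List<String> result = new ArrayList<>();
        for (String line : f1) {
            Integer left = remaining.get(line);
            if (left != null && left > 0) {
                result.add(line);
                remaining.put(line, left - 1); // use up one occurrence from f2
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> f1 = Arrays.asList("x", "x", "x", "y");
        List<String> f2 = Arrays.asList("x", "x", "z");
        System.out.println(commonWithMinCount(f1, f2)); // prints [x, x]
    }
}
```

A Set-based approach would report "x" once for the same input, which is exactly the kind of difference the requirements need to settle.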
Mark Johnstone
Greenhorn

Joined: Sep 25, 2010
Posts: 17
You are overcomplicating the problem. A developer should be able to write a program based on this requirement.
Unless you have a program you would like to share with me which comes close to solving the problem at hand, please don't post a response in this thread
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109
    
    6

Mark Johnstone wrote:You are overcomplicating the problem.


No, I'm not. I'm pointing out a glaring hole in the requirements you've stated. You may understand what needs to be done, and that's fine. You don't necessarily have to provide every detail here. I'm just making sure that 1) You do actually have all the bases covered, and 2) If you're asking for help with some specific part, that you provide enough info so that those who would help you know what is needed. As it stands, there are multiple valid interpretations of what you've described.

A developer should be able to write a program based on this requirement.


Yes, but he would be guessing at how to handle certain situations.

Unless you have a program you would like to share with me which comes close to solving the problem at hand


This site is NotACodeMill.

, please don't post a response in this thread


This is a public forum, and it is not for you to say who posts here or what sort of responses they provide. If you feel my comments have been inappropriate or abusive, by all means, report them to a moderator.
Mark Johnstone
Greenhorn

Joined: Sep 25, 2010
Posts: 17
Winston, I have tried your suggestion as per below but when I uncomment the line



I get an exception.

Do you know how I can modify the below program to get the result? I feel that I'm close, but not there yet..



Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109
    
    6

Mark Johnstone wrote:Winston, I have tried your suggestion as per below but when I uncomment the line



I get an exception.


People will be better able to help you if you copy/paste the exact, complete error message. There are many possible exceptions and many possible causes for them, and looking at your code, nothing jumps out at me that should be causing any exception on that line.
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109
    
    6

Also, this won't work, and my guess is that it's the cause of your problem:



You're skipping every other line, because you're calling readLine() twice for each pass through the loop. What's probably happening is that the while condition is not giving a null, but it's consuming the last line, then, inside the loop, since the while consumed the last line, readLine() is giving null, so you're calling s1.add(null). I would guess that TreeSet does not allow nulls, but because of this bug, you're attempting to add one, and getting an IllegalArgumentException or NullPointerException.
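
A reconstruction of the kind of loop described above, to show the symptom. It is a guess at the shape of the code, not the original program. The class and method names are invented, and StringReader stands in for the file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Set;
import java.util.TreeSet;

class DoubleReadBug {
    // Deliberately buggy: readLine() is called twice per pass through the loop.
    static Set<String> readSkippingLines(String text) {
        Set<String> s1 = new TreeSet<>();
        try (BufferedReader in = new BufferedReader(new StringReader(text))) {
            while (in.readLine() != null) { // first read: tested against null, then discarded
                s1.add(in.readLine());      // second read: may be null at end of input
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return s1;
    }

    public static void main(String[] args) {
        // Even number of lines: no exception, but only every second line is kept.
        System.out.println(readSkippingLines("1\n5\n6\n4\n")); // prints [4, 5]
        // Odd number of lines: the final add receives null.
        try {
            readSkippingLines("1\n2\n4\n");
        } catch (NullPointerException e) {
            System.out.println("NullPointerException on odd line count");
        }
    }
}
```

A TreeSet using natural ordering rejects null with a NullPointerException, which is what the odd-length input runs into here.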

(As a side note, it seems from this code that we can infer that the answer to my earlier question about what to do when there are M occurrences of some line in F1 and N occurrences in F2, that the result should produce one occurrence of that line. This was not made clear earlier, as far as I could tell.)
Mark Johnstone
Greenhorn

Joined: Sep 25, 2010
Posts: 17
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109
    
    6

Mark Johnstone wrote:


Yup. See my follow-up post.

Do you understand what I mean about calling readLine() twice for each pass through the loop?
Mark Johnstone
Greenhorn

Joined: Sep 25, 2010
Posts: 17
No sir
Joanne Neal
Rancher

Joined: Aug 05, 2005
Posts: 3410
    
  12

On line 1 you read a line from the file and check if it's null.
On line 3 you read another line from the file and add it to your Set.
You've read two lines from the file but only added one to the Set.

If you've got an even number of lines in the file, the only problem will be that only half the lines will be added to the Set.
If there is an odd number of lines in the file, the last line will eventually be read on line 1 and checked against null. It won't be null, so you read the next line from the file on line 3 - there isn't one, so readLine returns null, you try to add it to the Set and the add method throws an NPE.
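
The usual idiom for avoiding this is to read once per pass, store the result in a variable, and test that variable. A minimal sketch, with an invented class name and a StringReader standing in for the file:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Set;
import java.util.TreeSet;

class ReadEachLineOnce {
    // Reads every line exactly once and adds it to a sorted set.
    static Set<String> readLines(String text) {
        Set<String> lines = new TreeSet<>();
        try (BufferedReader in = new BufferedReader(new StringReader(text))) {
            String line;
            // The parentheses around the assignment are required,
            // because != binds more tightly than =.
            while ((line = in.readLine()) != null) {
                lines.add(line); // the same value that was just tested
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(readLines("1\n2\n4\n"));    // prints [1, 2, 4]
        System.out.println(readLines("1\n5\n6\n4\n")); // prints [1, 4, 5, 6]
    }
}
```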
Mark Johnstone
Greenhorn

Joined: Sep 25, 2010
Posts: 17
I changed the code. Would this do the trick?



Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7484
    
  18

Mark Johnstone wrote:I changed the code. Would this do the trick?

Looks much better to me; just don't forget those brackets around the assignment portion.

Winston
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109
    
    6

Mark Johnstone wrote:I changed the code. Would this do the trick?



Rather than ask here (or, in addition to asking here, if you want to ask just in case you missed something), you should print out s1 and s2 and make sure they match their respective files. Try it for both even and odd number of lines, since the initial problem was a "2-for-1" issue.

 