Compare Unix filesystem listing lines between 2 systems

Jack Bush
Ranch Hand

Joined: Oct 20, 2006
Posts: 235

Hi All,

I am trying to compare the root filesystem listings of two Solaris 10 x86 systems with the following code snippet, but I am not sure whether it is the correct approach, since it took forever to complete and did not write any differences to C:\\FolderMissingFiles.txt:



Below is a small sample of these files. I know that C:\\JupiterSolarisFilesystemListing.txt (181786 lines) is a subset of C:\\VenusSolarisFilesystemListing.txt (183333 lines):





I am not sure whether the string comparison (if (venus_line.compareTo(jupiter_line) != 0)) is working. I hope I won't have to compare individual subfolder names using venus_line.split("/") starting from the root, e.g. (/var/sadm/install/admin/default - /var, sadm, pkg, SUNWocfd….).

This code snippet uses a suggestion from http://www.coderanch.com/t/277350/Streams/java/Comparing-two-huge-files.

I am running Java 7 on Windows 7 (64bit) with 4.0GB RAM.

Your suggestion would be much appreciated.

Thanks,

Jack

Martin Vajsar
Sheriff

Joined: Aug 22, 2010
Posts: 3611
    

I assume this is just an exercise. If it is not, you should use the File Compare (fc.exe) and perhaps the sorter (sort.exe) that comes with Windows.

One problem of your code is that you're re-reading the whole Jupiter file for every line read from Venus file. That's going to take ages, especially as you output every line you read to standard output.

The file listings you have should be sorted by name. So perhaps obtain the listings again and specify a sort-by-name order, or just re-sort the files using Windows sort tool.

Once the listings are sorted, open both files and read them in parallel. If the lines you read are exactly the same, read next line from both files. If they are not the same, the line which is "less than the other" (use the String.compareTo() method) is missing from the other file. Handle that and read a line only from the file from which the "less than the other" line came from. Repeat from start. You also need to handle the situation where end-of-file is reached for some file, but I'll leave that up to you.
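For what it's worth, the parallel read described above could be sketched roughly like this in Java. It is only a minimal illustration working on in-memory lists that are assumed to be already sorted by String.compareTo(); the class and method names (SortedListingDiff, compare) are made up for the example, and it includes the end-of-file handling that was left as an exercise.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SortedListingDiff {
    /** Result holder: lines found in only one of the two sorted inputs. */
    static final class Diff {
        final List<String> onlyInFirst = new ArrayList<>();
        final List<String> onlyInSecond = new ArrayList<>();
    }

    /** Both lists must already be sorted by String.compareTo(). */
    static Diff compare(List<String> first, List<String> second) {
        Diff diff = new Diff();
        int i = 0;
        int j = 0;
        while (i < first.size() && j < second.size()) {
            int c = first.get(i).compareTo(second.get(j));
            if (c == 0) {
                // Same line in both listings: advance both.
                i++;
                j++;
            } else if (c < 0) {
                // The smaller line is missing from the other listing;
                // advance only the side it came from.
                diff.onlyInFirst.add(first.get(i++));
            } else {
                diff.onlyInSecond.add(second.get(j++));
            }
        }
        // One input is exhausted: everything left in the other is unmatched.
        while (i < first.size()) {
            diff.onlyInFirst.add(first.get(i++));
        }
        while (j < second.size()) {
            diff.onlyInSecond.add(second.get(j++));
        }
        return diff;
    }

    public static void main(String[] args) {
        // Tiny made-up sample data, not taken from the real listings.
        List<String> venus = Arrays.asList("/dev", "/etc", "/var", "/var/sadm");
        List<String> jupiter = Arrays.asList("/etc", "/opt", "/var");
        Diff d = compare(venus, jupiter);
        System.out.println("Only in first:  " + d.onlyInFirst);
        System.out.println("Only in second: " + d.onlyInSecond);
    }
}
```

For the real files, the lists could be filled with Files.readAllLines(path, charset) from Java 7; the same loop also works with two BufferedReaders if the files should not be held in memory.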

You might also avoid sorting, read all lines from the files into sets and compare these sets.

And no, you should not need to parse and split directory names.
Jack Bush
Ranch Hand

Joined: Oct 20, 2006
Posts: 235
Hi Martin,

Thanks for helping out on this thread.

Martin Vajsar wrote:
I assume this is just an exercise. If it is not, you should use the File Compare (fc.exe) and perhaps the sorter (sort.exe) that comes with Windows.


I have used both diff (fc.exe) & sort (sort.exe) on both Solaris & Windows, but have found the results too difficult to compare due to the large number of lines in both files. More importantly, sort appears to have stripped out parent folder names as well as inserted special characters such as the following:



It simply destroyed the original content altogether. One of the key problems I suspect that both sort and "if (venus_line.compareTo(jupiter_line) != 0)" are encountering is the forward slash ( / ), which causes the line to no longer be treated as a single string. Please provide an example of how you would run the sort.exe command; perhaps you are sorting by field using the forward slash as a separator?


Martin Vajsar wrote:
One problem of your code is that you're re-reading the whole Jupiter file for every line read from Venus file. That's going to take ages, especially as you output every line you read to standard output.

Efficiency is not essential in this case but getting it to work properly is much more important. I have taken out all the debug print statements in the meantime.

Martin Vajsar wrote:
You might also avoid sorting, read all lines from the files into sets and compare these sets.


Which sort statement are you referring to? Could you provide an example since I am not familiar with using sets and could not follow what you are suggesting?


In short, I am looking for a one-off working solution. It appears that half of my challenge is getting the two files sorted before comparing them, according to your suggestion. If so, let's get the correct sort.exe syntax and see whether the same code picks up any missing lines.

It is unfortunate that I could not attach VenusSolarisFilesystemListing due to the many restrictions on attachments, which rule out .txt, .zip and files with no extension.

Btw, is there any Java class that works the same way as AWK and supports comparing a single line / string against another file?

Thanks again,

Jack
Martin Vajsar
Sheriff

Joined: Aug 22, 2010
Posts: 3611
    

Hi Jack,

Let's have a look at the Windows sort first. You can type sort /? on the Windows command line to get help. Generally, you'll use

    sort input-file /O output-file

That command should not alter the input file in any way.

Sort should not do anything other than reorder existing lines. It certainly should not strip parent folders from paths. I'd suggest using a text file viewer to look for the lines you got as the sort output (e.g. (copy).gz, or those ../../../../.. paths) in the original input files. I'd be very surprised if these lines weren't there. In other words, verify that the input files were created in accordance with your expectations.

Note: I didn't happen to think about unix-style line endings being a problem for Windows sort before, but a quick experiment on my PC indicates it handles them well (I'm on Windows 7 x64 too). You might try to verify that on your box, though.

Similarly, String.compareTo() method certainly does not treat forward slashes specially. It does not even recognize end-of-line characters in strings (they are not ignored, they take part in the comparison, but they do not mark an end of string while comparing or anything similar). I don't see any caveats here, apart from the fact that the way non-alphanumeric characters in Strings are compared might be counter-intuitive perhaps. If the Windows sort compared strings differently from Java, that would be a problem; it should not happen, unless the paths on your source systems contain some special or national characters.
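To make the point about String.compareTo() concrete, here is a small sketch with made-up paths (the class name CompareToDemo and the helper sign() are just for illustration). It shows that slashes are compared like any other character, and where the ordering might look counter-intuitive.

```java
class CompareToDemo {
    /** Reduces compareTo() to -1, 0 or 1 so the results are easy to read. */
    static int sign(String x, String y) {
        return Integer.signum(x.compareTo(y));
    }

    public static void main(String[] args) {
        // Identical paths compare equal; slashes are ordinary characters.
        System.out.println(sign("/var/sadm/install", "/var/sadm/install")); // 0

        // A difference anywhere in the line is detected.
        System.out.println(sign("/var/sadm/install", "/var/sadm/pkg"));     // -1

        // Upper-case letters sort before lower-case ones.
        System.out.println(sign("/Var", "/var"));                           // -1

        // '.' (code 46) sorts before '/' (code 47).
        System.out.println(sign("/var.old", "/var/sadm"));                  // -1

        // Trailing whitespace takes part in the comparison.
        System.out.println(sign("/var ", "/var"));                          // 1
    }
}
```

If there is any doubt whether an external tool sorted the lines in this same order, sorting them in Java with Collections.sort(list) uses exactly the String.compareTo() ordering.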


How large are your input files and how many lines do they contain? If they can safely fit into your available memory (you may have to use the -Xmx parameter though), you could read the input files line by line and put the individual lines into sets (probably two HashSets). You could then use Set's methods to remove the common items from both sets. It takes more than a single call, and you might need to create a third set for that, but if you have a look at the methods the java.util.Set interface offers, the solution should be quite straightforward.
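A rough sketch of the set idea, assuming the lines have already been read into collections. The names ListingSetDiff, onlyIn and common are invented for this example; the copy made inside each method plays the role of the "third set", so the inputs are left untouched. TreeSet is used for the result only to get sorted output.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class ListingSetDiff {
    /** Lines of 'first' that do not occur in 'second', in compareTo() order. */
    static Set<String> onlyIn(Collection<String> first, Collection<String> second) {
        Set<String> result = new TreeSet<>(first);   // copy, inputs stay untouched
        result.removeAll(new HashSet<>(second));     // drop everything 'second' has
        return result;
    }

    /** Lines that occur in both listings. */
    static Set<String> common(Collection<String> first, Collection<String> second) {
        Set<String> result = new TreeSet<>(first);
        result.retainAll(new HashSet<>(second));     // keep only what 'second' has too
        return result;
    }

    public static void main(String[] args) {
        // Made-up, unsorted sample data; the set approach needs no sorting.
        List<String> venus = Arrays.asList("/var", "/etc", "/dev/cua0", "/opt");
        List<String> jupiter = Arrays.asList("/etc", "/var", "/usr");

        System.out.println("Only in Venus:   " + onlyIn(venus, jupiter));
        System.out.println("Only in Jupiter: " + onlyIn(jupiter, venus));
        System.out.println("In both:         " + common(venus, jupiter));
    }
}
```

Note that sets discard duplicate lines, which is harmless here as long as only the presence or absence of a path matters.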


Googling "java awk class" brings up Jawk, which looks promising. (I don't have any experience with either of these.)
Jack Bush
Ranch Hand

Joined: Oct 20, 2006
Posts: 235
Hi Martin,

sort input-file /o out-file works the same on both platforms, but I ended up using the following commands on Solaris, simply because its sort has a switch for stripping duplicate lines that is not available on Windows, I think:





Below is the updated code snippet, which only picked up the first line, (copy).gz, but not the last 7 device lines that exist only in VenusSolarisFilesystemListing1.txt:

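[code]
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.LineNumberReader;

// The post showed only main(); the class name is a placeholder.
public class CompareListings
{
    public static void main(String[] args)
    {
        File venusRootFile = new File("C:\\VenusSolarisFilesystemListing1.txt");
        File jupiterFile = new File("C:\\JupiterSolarisFilesystemListing1.txt");
        File folderMissingFile = new File("C:\\FolderMissingFiles.txt");
        BufferedReader venusInput = null;
        BufferedReader jupiterInput = null;
        BufferedWriter folderMissingOutput = null;

        try
        {
            venusInput = new BufferedReader( new FileReader(venusRootFile) );
            folderMissingOutput = new BufferedWriter( new FileWriter(folderMissingFile) );

            String venus_line;
            String jupiter_line;
            boolean LINE_MATCHED = false;

            // Outer loop: one pass over the Venus listing.
            while ((venus_line = venusInput.readLine()) != null)
            {
                // Inner loop: the Jupiter listing is re-opened and scanned for every Venus line.
                jupiterInput = new BufferedReader( new FileReader(jupiterFile) );
                LineNumberReader lineNoReader = new LineNumberReader(jupiterInput);
                int currentLineNoRead = 0;
                while ((jupiter_line = jupiterInput.readLine()) != null)
                {
                    currentLineNoRead = lineNoReader.getLineNumber();
                    if (venus_line.compareTo(jupiter_line) == 0)
                    {
                        LINE_MATCHED = true;
                        break;
                    }
                }
                // Write to the output file when no match was flagged.
                if (!LINE_MATCHED)
                {
                    jupiterInput.mark(currentLineNoRead);
                    folderMissingOutput.write(jupiter_line + "\n");
                    folderMissingOutput.flush();
                }
                if (jupiterInput != null)
                    jupiterInput.close();
            }
        }
        catch (IOException e)
        {
            // Added so the snippet compiles: the posted try block had no catch or finally.
            e.printStackTrace();
        }
    }
}
[/code]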

I cannot figure out why it did not pick up those missing bold lines at the end of C:\\VenusSolarisFilesystemListing.txt, and I suspect that it was due to one of the special characters (.:@,).

Any idea on where the issue is coming from?

Thanks again,

Jack
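Reading the snippet exactly as posted (this is an observation about the code, not a confirmed diagnosis of the actual run), two things stand out. First, LINE_MATCHED is never set back to false, so once any Venus line has matched, no later line can be reported as missing. Second, the write uses jupiter_line, which is null whenever the inner loop ran to the end without a match, rather than venus_line. The special characters are unlikely to matter, since compareTo() treats them like any other character.

Below is a minimal sketch that avoids both problems by swapping the inner re-read for a set lookup, as suggested earlier in the thread. The class and method names are invented, the three file paths are passed as arguments, and ISO-8859-1 is an assumption chosen only because it decodes any byte sequence without errors.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

class MissingLines {
    // Assumption: a single-byte charset, so odd bytes in file names never fail to decode.
    static final Charset CS = StandardCharsets.ISO_8859_1;

    /** Writes every line of 'venus' that does not occur in 'jupiter' to 'out'; returns the count. */
    static int writeMissing(Path venus, Path jupiter, Path out) throws IOException {
        // Read the Jupiter listing once instead of once per Venus line.
        Set<String> jupiterLines = new HashSet<>();
        try (BufferedReader in = Files.newBufferedReader(jupiter, CS)) {
            String line;
            while ((line = in.readLine()) != null) {
                jupiterLines.add(line);
            }
        }

        int missing = 0;
        try (BufferedReader in = Files.newBufferedReader(venus, CS);
             BufferedWriter w = Files.newBufferedWriter(out, CS)) {
            String line;
            while ((line = in.readLine()) != null) {
                // The decision is made per line, so no flag carries over between lines.
                if (!jupiterLines.contains(line)) {
                    w.write(line);      // report the Venus line itself
                    w.newLine();
                    missing++;
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: java MissingLines <venus> <jupiter> <output>");
            return;
        }
        int n = writeMissing(Paths.get(args[0]), Paths.get(args[1]), Paths.get(args[2]));
        System.out.println(n + " line(s) found only in " + args[0]);
    }
}
```

If the nested loops are kept instead, the equivalent fix would be to set LINE_MATCHED = false at the top of the outer loop and to write venus_line.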
 
 