My title probably does not accurately reflect my problem, so hopefully I can explain it without causing anybody's eyes to glaze over, or drool to drip from the corner of their mouths.
I am comparing two text files, which seems simple enough, but after messing with it for a couple of weeks, I am finding it harder to accomplish that I thought at first.
Essentially all I am trying to accomplish... both text files are loaded into ArrayLists (after being trimmed and split). The first file is the "good" file, the second file is the "changed" file.
I load the first String, and then loop string by string through the second file until a match is found. If a match is found, I want to remove the line from both lists (to avoid the overhead of having to 'search' it again)
If I get all of the way through the second file without a match, then a line is missing, and it's corresponding match needs to be added to the ArrayList labeled "missing".
Whatever is left over in the second file when all is said and done would be "added" lines of text.
In a perfect world, my code would work....
I am guessing that one of the problems that i am having is that by removing elements from 'clean', it is throwing off my count?? At any rate, if both files match, then I should have nothing left over at the end. For the purposes in which I intend to use this, I know for sure that there will be elements in at least two of the arrays...
And while I am thinking about it, it would probably be better to start the comparisons from the end of the ArrayLists to avoid the overhead when elements are removed.
I'm pretty sure I am missing something simple, but I can no longer see the trees for the forest...
EDIT: Or maybe I should be trying to do this in two passes?? [ January 04, 2005: Message edited by: C. Alan ]
<a href="http://www.security-forums.com/forum/viewforum.php?f=48" target="_blank" rel="nofollow">Malware Removal</a> - Get your system running like new again.
Are all the lines unique? If not you may have to remove all occurrences of a line from both sides instead of just the one.
Here's a compare algorithm for text files like Java source. I extracted the algorithm from some really smelly code I found on the net and did a new implementation on my Wiki.
--- For each each unique line of text create a symbol. The symbol state is: OldOnly, NewOnly, UniqueMatch (both files exactly once), or Other.
For each line, create a LineInfo object. Set state = symbol state and establish bidirectional links between UniqueMatch lines in the two files.
For each UniqueMatch in old create a "match block". Stretch match blocks forward and backward to include matching lines with any state, including other match blocks.
Build a Report of edit commands that can be used to tranform Old into New. Matching blocks generate match or move commands. Non-matching blocks generate insert, append, delete or change commands.
Iterate the commands to generate a report. --- And here's an old favorite if you work with sorted lists. It's really the "master-update" pattern from the days of processing cards and tapes.
[ January 05, 2005: Message edited by: Stan James ]
A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi
Joined: Dec 17, 2004
Thanks Stan. That's sort of what I was trying to do originally. For this app, all the lines will be unique. The text I am comparing consists of simply file names, lengths, and MD5 hashes. I guess a loose comparison would be similar to what TripWire does, merely checking to see what files have been added and changed after a system has intentionally been infected with adware, spyware, etc.. With the code Marc provided, that works flawlessly, and works much faster than how I was trying to do it before. I realize I am reinventing the wheel, so to speak, but it is as much for me to learn, as it is my reluctance to pay for any application that doesn't do precisely what I want it to do.
Now that I have an understanding of how to do it, I can try and do what I was initially trying to do in the first place, and that is to be able to compare registry files from before infection, and after infection. The goal is to see what has been changed, and create a .reg file our infected users can run to repair their registry files.
All keys will be unique, so I think that the smae method will work as far as comparisons go, but I am definatley going to be running into memory management issues that I will need to figure out how to address. My current methods of reading each text file into a Linked List will not work, as there will be upwards of 75,000 lines of text in each list. I suspect I will have to read one line at a time, and do it that way, unless anybody has any other clever solutions.
The actual hives themselves, in Hexadecimal form, are quite short, but for the moment, using them in that form is a little beyond my abilities. Given enough time, and the continuing kind help that I have received here so far, I'll figure it our eventually.
I'm not all that familiar with how the registry is implemented.* Is it ordered in any way, for example as it is viewed in the tree editor? If so, then you could implement the second method Stan mentioned (the card stack). Basically you scan down each file, advancing either one pointer or the other or both based on equality or ordering.
If you care to, can you describe in more detail the format of the files? Things like "it's in depth-first tree order" or "the full noode path is included on each line" would give you some great opportunities to optimize the task using a different algorithm.
* I just realized that you're probably just exporting the entire registry to a file rather than accessing it directly. If so, then I'd bet there is a well-specified ordering that you could capitalize on.
Joined: Dec 17, 2004
Yes David, that is all I am doing (exporting the registry as a text file)... that's what makes it huge. I am curious as to what your recommendations would be. Mine always seem to be use the hardest, most complicated means possible.
My next project after this one (lol..there is always one more little project) is to learn how to access the registry directly.
Joined: Jan 29, 2003
I'd probably try to convert the tree into fully qualified names so I could sort them and use that master-update algorithm, but I'm just old fashioned.