aspose file tools*
The moose likes Java in General and the fly likes String Compare with a bit of a twist Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login
JavaRanch » Java Forums » Java » Java in General
Bookmark "String Compare with a bit of a twist" Watch "String Compare with a bit of a twist" New topic
Author

String Compare with a bit of a twist

C. Alan
Ranch Hand

Joined: Dec 17, 2004
Posts: 30
My title probably does not accurately reflect my problem, so hopefully I can explain it without causing anybody's eyes to glaze over, or drool to drip from the corner of their mouths.

I am comparing two text files, which seems simple enough, but after messing with it for a couple of weeks, I am finding it harder to accomplish that I thought at first.

Essentially all I am trying to accomplish... both text files are loaded into ArrayLists (after being trimmed and split). The first file is the "good" file, the second file is the "changed" file.

I load the first String, and then loop string by string through the second file until a match is found. If a match is found, I want to remove the line from both lists (to avoid the overhead of having to 'search' it again)

If I get all of the way through the second file without a match, then a line is missing, and it's corresponding match needs to be added to the ArrayList labeled "missing".

Whatever is left over in the second file when all is said and done would be "added" lines of text.

In a perfect world, my code would work....



I am guessing that one of the problems that i am having is that by removing elements from 'clean', it is throwing off my count?? At any rate, if both files match, then I should have nothing left over at the end. For the purposes in which I intend to use this, I know for sure that there will be elements in at least two of the arrays...

And while I am thinking about it, it would probably be better to start the comparisons from the end of the ArrayLists to avoid the overhead when elements are removed.

I'm pretty sure I am missing something simple, but I can no longer see the trees for the forest...

EDIT: Or maybe I should be trying to do this in two passes??
[ January 04, 2005: Message edited by: C. Alan ]

<a href="http://www.security-forums.com/forum/viewforum.php?f=48" target="_blank" rel="nofollow">Malware Removal</a> - Get your system running like new again.
marc weber
Sheriff

Joined: Aug 31, 2004
Posts: 11343

If you don't need random access and you're going to be removing elements from the middle, then you might consider LinkedList instead of ArrayList. Then consider some built-in Collection functionality:
  • Since you're just iterating through the "good" list, why not use an Iterator?
  • Rather than iterating through the "infected" list looking for a match, why not just use indexOf(Object)?
  • When you do find a match, remove it from the "good" list using Iterator.remove(), and from the "infected" list using LinkedList.remove(int index).



  • [ January 05, 2005: Message edited by: marc weber ]

    "We're kind of on the level of crossword puzzle writers... And no one ever goes to them and gives them an award." ~Joe Strummer
    sscce.org
    C. Alan
    Ranch Hand

    Joined: Dec 17, 2004
    Posts: 30
    Thank you. I've never used Iterators before, and only dabbled with linked lists. I'll see how that works out.

    EDIT: Well, I have to admit that I am a bit humbled...indeed the solution was quite simple, and IMHO, elegant.

    Thanks again Marc. Happy New Year!
    [ January 05, 2005: Message edited by: C. Alan ]
    marc weber
    Sheriff

    Joined: Aug 31, 2004
    Posts: 11343

    Originally posted by C. Alan:
    ...the solution was quite simple, and IMHO, elegant...

    Thank you! Sometimes it just takes a fresh perspective -- as you mentioned above, with the forest and the trees.

    Actually, I see now that the try/catch block in my code is not required (since the exceptions thrown by Iterator methods are unchecked), so it would be more elegant without that.
    Stan James
    (instanceof Sidekick)
    Ranch Hand

    Joined: Jan 29, 2003
    Posts: 8791
    Are all the lines unique? If not you may have to remove all occurrences of a line from both sides instead of just the one.

    Here's a compare algorithm for text files like Java source. I extracted the algorithm from some really smelly code I found on the net and did a new implementation on my Wiki.

    ---
    For each each unique line of text create a symbol. The symbol state is: OldOnly, NewOnly, UniqueMatch (both files exactly once), or Other.

    For each line, create a LineInfo object. Set state = symbol state and establish bidirectional links between UniqueMatch lines in the two files.

    For each UniqueMatch in old create a "match block". Stretch match blocks forward and backward to include matching lines with any state, including other match blocks.

    Build a Report of edit commands that can be used to tranform Old into New.
    Matching blocks generate match or move commands. Non-matching blocks generate insert, append, delete or change commands.

    Iterate the commands to generate a report.
    ---
    And here's an old favorite if you work with sorted lists. It's really the "master-update" pattern from the days of processing cards and tapes.

    [ January 05, 2005: Message edited by: Stan James ]

    A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi
    C. Alan
    Ranch Hand

    Joined: Dec 17, 2004
    Posts: 30
    Thanks Stan. That's sort of what I was trying to do originally. For this app, all the lines will be unique. The text I am comparing consists of simply file names, lengths, and MD5 hashes. I guess a loose comparison would be similar to what TripWire does, merely checking to see what files have been added and changed after a system has intentionally been infected with adware, spyware, etc.. With the code Marc provided, that works flawlessly, and works much faster than how I was trying to do it before. I realize I am reinventing the wheel, so to speak, but it is as much for me to learn, as it is my reluctance to pay for any application that doesn't do precisely what I want it to do.

    Now that I have an understanding of how to do it, I can try and do what I was initially trying to do in the first place, and that is to be able to compare registry files from before infection, and after infection. The goal is to see what has been changed, and create a .reg file our infected users can run to repair their registry files.

    All keys will be unique, so I think that the smae method will work as far as comparisons go, but I am definatley going to be running into memory management issues that I will need to figure out how to address. My current methods of reading each text file into a Linked List will not work, as there will be upwards of 75,000 lines of text in each list. I suspect I will have to read one line at a time, and do it that way, unless anybody has any other clever solutions.

    The actual hives themselves, in Hexadecimal form, are quite short, but for the moment, using them in that form is a little beyond my abilities. Given enough time, and the continuing kind help that I have received here so far, I'll figure it our eventually.
    David Harkness
    Ranch Hand

    Joined: Aug 07, 2003
    Posts: 1646
    I'm not all that familiar with how the registry is implemented.* Is it ordered in any way, for example as it is viewed in the tree editor? If so, then you could implement the second method Stan mentioned (the card stack). Basically you scan down each file, advancing either one pointer or the other or both based on equality or ordering.

    If you care to, can you describe in more detail the format of the files? Things like "it's in depth-first tree order" or "the full noode path is included on each line" would give you some great opportunities to optimize the task using a different algorithm.

    * I just realized that you're probably just exporting the entire registry to a file rather than accessing it directly. If so, then I'd bet there is a well-specified ordering that you could capitalize on.
    C. Alan
    Ranch Hand

    Joined: Dec 17, 2004
    Posts: 30
    Yes David, that is all I am doing (exporting the registry as a text file)... that's what makes it huge. I am curious as to what your recommendations would be. Mine always seem to be use the hardest, most complicated means possible.

    My next project after this one (lol..there is always one more little project) is to learn how to access the registry directly.
    Stan James
    (instanceof Sidekick)
    Ranch Hand

    Joined: Jan 29, 2003
    Posts: 8791
    I'd probably try to convert the tree into fully qualified names so I could sort them and use that master-update algorithm, but I'm just old fashioned.
     
    I agree. Here's the link: http://aspose.com/file-tools
     
    subject: String Compare with a bit of a twist