HELP! picking words from a file and copying it into a new file
|
bob spencer
Greenhorn
Joined: Apr 10, 2003
Posts: 5
|
|
Someone please help me. I need to write a Java program that takes the contents of an HTML file (ignoring the tags) and copies the individual words into a separate master file, with the name of the HTML file each word came from next to it. When I repeat this with another file, it should put those words into the same master file without overwriting the previous data, i.e.

test.html:
< H1 > This is a Heading < H1 >

masterfile.txt:
This test.html
is test.html
a test.html
Heading test.html
|
 |
Layne Lund
Ranch Hand
Joined: Dec 06, 2001
Posts: 3061
|
|
|
What have you done so far on this program?
|
Java API Documentation
The Java Tutorial
|
 |
Cindy Glass
"The Hood"
Sheriff
Joined: Sep 29, 2000
Posts: 8521
|
|
|
You might want to read up on RegEx to help you bypassing those tags.
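To make the regex suggestion concrete, here is a minimal sketch of tag stripping with java.util.regex. The class name TagStripper and the method stripTags are made up for illustration, and the two patterns are only an assumption about what "bypassing the tags" should mean: whole script blocks are dropped first, then every remaining tag is replaced by a space. A regex like this is not a full HTML parser (comments and a '>' inside an attribute value will confuse it), so treat it as a starting point.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class TagStripper {

    // Matches a whole <script>...</script> block, contents included.
    private static final Pattern SCRIPT_BLOCK = Pattern.compile(
            "<script\\b.*?</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Non-greedy match of a single tag; DOTALL lets '.' cross line breaks.
    private static final Pattern TAG = Pattern.compile("<.*?>", Pattern.DOTALL);

    // Returns the text with script blocks and tags replaced by spaces.
    static String stripTags(String html) {
        String withoutScripts = SCRIPT_BLOCK.matcher(html).replaceAll(" ");
        return TAG.matcher(withoutScripts).replaceAll(" ");
    }
}
```

Replacing each tag with a space rather than with nothing keeps words on either side of a tag from running together.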
|
"JavaRanch, where the deer and the Certified play" - David O'Meara
|
 |
Leslie Chaim
Ranch Hand
Joined: May 22, 2002
Posts: 336
|
|
Bob, welcome to JavaRanch. This is a nice parsing problem. There are many things you will need to consider, such as what exactly counts as a word, and unwanted content such as the body of a SCRIPT tag, which you probably don't need. You will also need to avoid duplicates in masterfile.txt, and a whole bunch of other things.

I think you should use the power of regex to strip the HTML tags. Read a paragraph, or even the whole file, into a variable <code>HTMLtext</code>. Start with the simple pattern <.*?> to strip the tags (using String.replaceAll). Then split the words using \s+.

I have tested this approach with Perl and it looks like a good starting point. I ran this in cygwin and it seems to work fine. There is a lot in this Perl one-liner, but it follows the same logic I outlined. Combined with the shell, we do a 'sort -u' for uniqueness and use '>>' to append to masterfile.txt. I will just explain the Perl bit:

<code>perl</code> The command.
<code>-0</code> (That's zero.) Defines the record separator; usually an octal value is supplied. In this case nothing is supplied, so the separator is the null character, and a text file containing no null bytes is read as a single record (<code>-0777</code> is the explicit way to slurp the whole file).
<code>-n</code> Tells perl to loop implicitly over the input (which is supplied by '*.html').
<code>-e</code> Takes a Perl statement as an argument. Then the single quote starts, to protect the Perl snippet from shell interpretation.
<code>s</code> Substitute (in Java you'd say <code>replace</code>).
<code>/</code> The delimiter that begins the pattern.
<code><.*?></code> The pattern to search for: a '<', followed by any characters .*? (non-greedy), and a closing '>'.
<code>//</code> The closing delimiter and the replacement (which is nothing).
<code>s</code> The /s modifier lets '.' match the newline character (in Java you would <code>compile</code> with <code>Pattern.DOTALL</code>).
<code>g</code> The /g modifier stands for replace globally, that is, all occurrences (in Java use <code>replaceAll</code>).
<code>print</code> Simple, I hope (in Java you do <code>System.out.print</code>).
<code>map {"$_ $ARGV\n"}</code> Evaluates its argument (split) and creates a string in the { BLOCK } from each word, followed by $ARGV, which Perl sets to the name of the file currently being read.
<code>split</code> The argument to map, which splits the input on the default whitespace pattern (\s+).

Then there is the closing quote, followed by the rest of the command.

Cheers, Leslie
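For Bob's Java version, the same steps (strip tags, split on whitespace, keep unique words, append to the master file) could be sketched roughly as below. This is only an illustration under some assumptions: the class and method names (MasterFileIndexer, extractWords, appendWords) are invented, the files are assumed to be UTF-8, and it relies on java.nio.file, so it needs Java 7 or later. A TreeSet stands in for 'sort -u', and StandardOpenOption.APPEND plays the role of '>>'.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical class, named only for this sketch.
class MasterFileIndexer {

    // Returns the distinct words of the HTML text, sorted, with tags removed.
    static Set<String> extractWords(String html) {
        // (?s) is the inline form of Pattern.DOTALL, so a tag may span lines.
        String text = html.replaceAll("(?s)<.*?>", " ");
        Set<String> words = new TreeSet<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Appends one "word fileName" line per distinct word to the master file.
    static void appendWords(Path htmlFile, Path masterFile) throws IOException {
        String html = new String(Files.readAllBytes(htmlFile), StandardCharsets.UTF_8);
        String name = htmlFile.getFileName().toString();
        List<String> lines = new ArrayList<>();
        for (String word : extractWords(html)) {
            lines.add(word + " " + name);
        }
        // CREATE makes the file on first use; APPEND keeps earlier entries.
        Files.write(masterFile, lines, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path master = Paths.get("masterfile.txt");
        for (String arg : args) {
            appendWords(Paths.get(arg), master);
        }
    }
}
```

Running it as java MasterFileIndexer test.html and later with another file adds to masterfile.txt instead of replacing it. Words are unique per HTML file only; running the same file twice would repeat its entries, so a check against the existing master file would still be needed for that case.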
|
Normal is in the eye of the beholder
|
 |
Leslie Chaim
Ranch Hand
Joined: May 22, 2002
Posts: 336
|
|
HTML is not my strength. Why doesn't the <code>code</code> tag work all the time?
|
 |
Layne Lund
Ranch Hand
Joined: Dec 06, 2001
Posts: 3061
|
|
First of all, <code> and </code> aren't even valid HTML tags. Secondly, HTML is not enabled on this board, according to the message next to the text area where we type messages. However, if you replace < > with [ ], this will use the UBB tags, which will mark the code similarly to what you expect. Layne [ April 10, 2003: Message edited by: Layne Lund ]
|
 |
Leslie Chaim
Ranch Hand
Joined: May 22, 2002
Posts: 336
|
|
Thanks, Layne, for the clarification. I have never tried this, but I guess UBB tags will work inline. Let's see... (If only there were a way to see the post before posting.)
|
 |
Leslie Chaim
Ranch Hand
Joined: May 22, 2002
Posts: 336
|
|
|
So there you go, it doesn't. So how would I get this idea?
|
 |
Cindy Glass
"The Hood"
Sheriff
Joined: Sep 29, 2000
Posts: 8521
|
|
Leslie, If you click the little pen and paper icon on any of your own posts, then you can edit them.
|
 |
Layne Lund
Ranch Hand
Joined: Dec 06, 2001
Posts: 3061
|
|
|
Just put all the code between the [ code ] and [ /code ] tags. This will set it off between lines, as in your previous post. You should only have one set of these tags surrounding a single block of code. Individual lines of code can just go on their own lines in between the tags.
|
 |