HELP! picking words from a file and copying it into a new file

bob spencer
Greenhorn

Joined: Apr 10, 2003
Posts: 5
Someone please help me.
I need to write a Java program that takes the contents of an HTML file (ignoring the tags) and copies the individual words into a separate master file, with the name of the HTML file each word came from next to it. When I repeat this with another file, it should put those words into the same master file without overwriting the previous data, e.g.:
test.html
< H1 > This is a Heading < /H1 >
masterfile.txt
This test.html
is test.html
a test.html
heading test.html
Layne Lund
Ranch Hand

Joined: Dec 06, 2001
Posts: 3061
What have you done so far on this program?


Java API Documentation
The Java Tutorial
Cindy Glass
"The Hood"
Sheriff

Joined: Sep 29, 2000
Posts: 8521
You might want to read up on RegEx to help you bypass those tags.


"JavaRanch, where the deer and the Certified play" - David O'Meara
Leslie Chaim
Ranch Hand

Joined: May 22, 2002
Posts: 336
Bob,
Welcome to Javaranch.
This is a nice parsing problem, Bob.
There are many things you will need to consider, such as what exactly counts as a word, and unwanted content such as the body of a SCRIPT tag, which you probably don't need. You will also need to avoid duplicates in masterfile.txt, and a whole bunch of other things.
I think you should use the power of regex to strip the HTML tags. Read a paragraph, or even the whole file, into a variable <code>HTMLtext</code>. Start with the simple pattern <.*?> to strip the tags (using String.replaceAll). Then split the words using \s+.
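In Java, those two steps might look roughly like the sketch below. The class name TagStripper and the sample string are made up for illustration. One deliberate difference from the description above: each tag is replaced with a space rather than with nothing, so that words on either side of a tag do not run together. Note that without Pattern.DOTALL the '.' does not match a newline, so a tag that spans several lines is not removed by this version.

```java
import java.util.ArrayList;
import java.util.List;

public class TagStripper {

    // Returns the whitespace-separated words of html, with everything
    // between a '<' and the next '>' removed.
    public static List<String> words(String html) {
        // Replace each tag with a space (an assumption of this sketch),
        // then trim so that split() does not yield a leading empty string.
        String text = html.replaceAll("<.*?>", " ").trim();
        List<String> result = new ArrayList<>();
        if (text.isEmpty()) {
            return result;
        }
        for (String word : text.split("\\s+")) {
            result.add(word);
        }
        return result;
    }

    public static void main(String[] args) {
        String htmlText = "<H1>This is a Heading</H1>";
        // Print each word followed by the (hypothetical) source file name.
        for (String word : words(htmlText)) {
            System.out.println(word + " test.html");
        }
    }
}
```

This does nothing about SCRIPT contents, punctuation, or duplicates; it only shows the strip-then-split idea.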
I have tested this approach with Perl and it sounds like a good starting point:

I ran this in Cygwin and it seems to work fine. There is a lot in this Perl one-liner, but it follows the same logic that I outlined. The shell part of the command runs the output through 'sort -u' for uniqueness and uses '>>' to append to masterfile.txt.
I will just explain the Perl bit:
<code>perl</code> the command
<code>-0</code> (that's zero) sets the input record separator; usually an octal value follows it. Here no digits are supplied, so the separator is the null character, and a text file that contains no null bytes is slurped whole. (The conventional way to ask for slurp mode explicitly is -0777.)
<code>-n</code> tells perl to loop implicitly over the input (which is supplied by '*.html').
<code>-e</code> takes a perl statement as an argument.
Then comes the opening single quote, which protects the perl snippet from shell interpretation.
<code>s</code> substitute (in Java you'd say <code>replace</code>)
<code>/</code> Delimiter to begin the pattern
<code><.*?></code> The pattern to search for: a '<' followed by any characters .*? (non-greedy) and a closing '>'.
<code>//</code> The closing delimiter and the replacement (which is nothing)
<code>s</code> The /s modifier lets '.' match the newline character (in Java you would <code>compile</code> with <code>Pattern.DOTALL</code>; <code>Pattern.MULTILINE</code> only changes how ^ and $ match).
<code>g</code> The /g modifier stands for replace globally, that is, all occurrences (in Java use <code>replaceAll</code>).
<code>print</code> simple, I hope (in Java you do <code>System.out.print</code>).
<code>map {"$_ $ARGV\n"}</code> evaluates its argument (split) and, in the { BLOCK }, builds a string for each word: the word followed by $ARGV, which Perl sets to the name of the file currently being read
<code>split</code> The argument to map, which splits the input on whitespace (the default behaves like \s+, ignoring leading whitespace)
Then there is the closing quote followed by the rest of the command.
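For comparison, here is a rough Java sketch of the same pipeline. The class and method names (MasterFileBuilder, entries, appendTo) are invented for this example, and tags are replaced with a space rather than with nothing so that adjacent words stay apart. A TreeSet stands in for 'sort -u', but only within one HTML file: this sketch does not check for lines already present in the master file. The append option plays the part of '>>'.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.regex.Pattern;

public class MasterFileBuilder {

    // DOTALL plays the role of Perl's /s: '.' also matches line terminators,
    // so a tag that is split across lines is still removed.
    private static final Pattern TAG = Pattern.compile("<.*?>", Pattern.DOTALL);

    // Builds the sorted, de-duplicated "word filename" lines for one HTML file.
    public static List<String> entries(Path htmlFile) throws IOException {
        // Read the whole file at once, like Perl's slurp mode.
        String html = new String(Files.readAllBytes(htmlFile), StandardCharsets.UTF_8);
        String text = TAG.matcher(html).replaceAll(" ").trim();
        TreeSet<String> unique = new TreeSet<>();
        if (!text.isEmpty()) {
            for (String word : text.split("\\s+")) {
                unique.add(word + " " + htmlFile.getFileName());
            }
        }
        return new ArrayList<>(unique);
    }

    // Appends the entries to the master file, creating it if it does not exist.
    // Existing content is never overwritten.
    public static void appendTo(Path masterFile, Path htmlFile) throws IOException {
        Files.write(masterFile, entries(htmlFile), StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path master = Paths.get("masterfile.txt");
        // Each command-line argument is treated as the path of one HTML file.
        for (String name : args) {
            appendTo(master, Paths.get(name));
        }
    }
}
```

You would run it with the HTML file names as arguments; running it again later with other files adds their words to the end of the same masterfile.txt.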
Cheers,
Leslie


Normal is in the eye of the beholder
Leslie Chaim
Ranch Hand

Joined: May 22, 2002
Posts: 336
HTML is not my strength. Why doesn't the <code>code</code> tag work all the time?
Layne Lund
Ranch Hand

Joined: Dec 06, 2001
Posts: 3061
First of all, <code> and </code> aren't even valid HTML tags. Secondly, HTML is not enabled on this board, according to the message next to the text area where we type messages. However, if you replace < > with [ ], this will use the UBB tags, which will mark the code similarly to what you expect.
Layne
[ April 10, 2003: Message edited by: Layne Lund ]
Leslie Chaim
Ranch Hand

Joined: May 22, 2002
Posts: 336
Thanks, Layne, for the clarification. I have never tried this, but I guess UBB tags will work inline.
Let's see... (If only there were a way to see the post before posting.)
Leslie Chaim
Ranch Hand

Joined: May 22, 2002
Posts: 336
So there you go, it doesn't. So how would I get this idea?
Cindy Glass
"The Hood"
Sheriff

Joined: Sep 29, 2000
Posts: 8521
Leslie,
If you click the little pen and paper icon on any of your own posts, then you can edit them.
Layne Lund
Ranch Hand

Joined: Dec 06, 2001
Posts: 3061
Just put all the code between the [ code ] and [ /code ] tags. This will put it between lines like in your previous post. You should only have one set of these tags surrounding a single block of code. Individual lines of code can just be on their own lines in between the tags.
 