
Regular Expression to filter text from html file

Nitin Menon
Ranch Hand

Joined: Jun 13, 2007
Posts: 79
I need to extract the content words and keywords from a group of .html files. That is, I need everything in an HTML file except the HTML tags and whatever is written inside them. The one exception is the meta tag: there I need to extract the keywords specified in it. I have tried a few things, but they are not leading anywhere. Can anyone please help?
Thanks in advance!
Jeanne Boyarsky
internet detective
Marshal

Joined: May 26, 2003
Posts: 30123
    

What did you try?

I usually build up my regular expressions gradually. Can you match:
  • An open tag
  • A close tag
  • Both
  • An attribute
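
For illustration, here is a minimal sketch of building the patterns up one step at a time with `java.util.regex`. The class name `TagPatterns` and the `count` helper are made up for this example, and the patterns assume reasonably well-formed markup: a `>` inside an attribute value, or an unquoted attribute value, will trip them up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagPatterns {
    // Step 1: an open tag, e.g. <p>, <meta name="x"> or <br/>
    static final Pattern OPEN_TAG =
            Pattern.compile("<[a-zA-Z][a-zA-Z0-9]*(?:\\s[^>]*)?/?>");

    // Step 2: a close tag, e.g. </p>
    static final Pattern CLOSE_TAG =
            Pattern.compile("</[a-zA-Z][a-zA-Z0-9]*\\s*>");

    // Step 3: both, by making the leading slash optional
    static final Pattern ANY_TAG =
            Pattern.compile("</?[a-zA-Z][a-zA-Z0-9]*(?:\\s[^>]*)?/?>");

    // Step 4: a quoted attribute; group 1 is the name,
    // group 2 a double-quoted value, group 3 a single-quoted value
    static final Pattern ATTRIBUTE =
            Pattern.compile("([a-zA-Z][\\w-]*)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')");

    // Counts how many times the pattern matches in the input.
    static int count(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String sample = "<p class=\"intro\">Hello <b>world</b></p>";
        System.out.println("open tags:  " + count(OPEN_TAG, sample));
        System.out.println("close tags: " + count(CLOSE_TAG, sample));
        System.out.println("all tags:   " + count(ANY_TAG, sample));
        System.out.println("text only:  " + ANY_TAG.matcher(sample).replaceAll(""));
    }
}
```

Testing each pattern against a small sample like this before combining them makes it much easier to see which step breaks when the real files are fed in.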


    Martin Vajsar
    Sheriff

    Joined: Aug 22, 2010
    Posts: 3606
        

    Wouldn't an HTML parser be better suited to the task? I personally wouldn't want to maintain code that parses HTML using regular expressions.

    I have no experience with HTML parsers personally, but googling for "Java HTML parser" yields some promising links.
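
As a rough sketch of what the parser route could look like without any third-party library, the JDK ships a lenient HTML parser in `javax.swing.text.html.parser`. The class name `ParserTextExtractor` and its methods are hypothetical, invented for this example; a dedicated HTML parsing library may well be the better choice for real-world pages.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class ParserTextExtractor extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();
    private final StringBuilder keywords = new StringBuilder();

    // Called by the parser for each run of text between tags.
    @Override
    public void handleText(char[] data, int pos) {
        text.append(data).append(' ');
    }

    // Called for tags without a body, which includes <meta>.
    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag != HTML.Tag.META) {
            return;
        }
        Object name = attrs.getAttribute(HTML.Attribute.NAME);
        Object content = attrs.getAttribute(HTML.Attribute.CONTENT);
        if (name != null && content != null
                && "keywords".equalsIgnoreCase(name.toString())) {
            keywords.append(content);
        }
    }

    String text() {
        return text.toString().trim();
    }

    String keywords() {
        return keywords.toString();
    }

    // Parses the given HTML string and returns the populated extractor.
    static ParserTextExtractor parse(String html) {
        ParserTextExtractor extractor = new ParserTextExtractor();
        try {
            new ParserDelegator().parse(new StringReader(html), extractor, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return extractor;
    }

    public static void main(String[] args) {
        ParserTextExtractor result = parse(
                "<html><head><meta name=\"keywords\" content=\"java, regex\"></head>"
                + "<body><p>Hello <b>world</b></p></body></html>");
        System.out.println("keywords: " + result.keywords());
        System.out.println("text:     " + result.text());
    }
}
```

The appeal of this approach is that the parser, not a hand-written pattern, decides where tags and attributes begin and end.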
    Nitin Menon
    Ranch Hand

    Joined: Jun 13, 2007
    Posts: 79
    Sorry for the late reply; I was away. I found a solution: I wrote the regular expressions as a series of steps.
    Thank you, Martin and Jeanne!
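
The actual code was not posted in this thread, so the following is only a guess at what such a series of steps might look like. The class `HtmlTextFilter`, its method names, and the order of the steps are all assumptions: pull the keywords out of the meta tag first, then drop script/style blocks and comments, then strip the remaining tags, then tidy the whitespace. Character entities such as `&amp;` are left undecoded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlTextFilter {
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern NAME_KEYWORDS =
            Pattern.compile("\\bname\\s*=\\s*[\"']keywords[\"']", Pattern.CASE_INSENSITIVE);
    private static final Pattern CONTENT_ATTR =
            Pattern.compile("\\bcontent\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')",
                    Pattern.CASE_INSENSITIVE);
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("<(script|style)\\b[^>]*>.*?</\\1\\s*>",
                    Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    private static final Pattern TAG =
            Pattern.compile("<[/!]?[a-zA-Z][^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Step 1: collect the comma-separated values of <meta name="keywords" content="...">.
    static List<String> keywords(String html) {
        List<String> result = new ArrayList<>();
        Matcher meta = META_TAG.matcher(html);
        while (meta.find()) {
            String tag = meta.group();
            if (!NAME_KEYWORDS.matcher(tag).find()) {
                continue;
            }
            Matcher content = CONTENT_ATTR.matcher(tag);
            if (content.find()) {
                String value = content.group(1) != null ? content.group(1) : content.group(2);
                for (String keyword : value.split(",")) {
                    String trimmed = keyword.trim();
                    if (!trimmed.isEmpty()) {
                        result.add(trimmed);
                    }
                }
            }
        }
        return result;
    }

    // Steps 2-5: remove script/style blocks, comments and tags, then collapse whitespace.
    static String text(String html) {
        String s = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        s = COMMENT.matcher(s).replaceAll(" ");
        s = TAG.matcher(s).replaceAll(" ");
        return WHITESPACE.matcher(s).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Demo</title>"
                + "<meta name=\"keywords\" content=\"java, regex , html\"></head>"
                + "<body><!-- note --><p>Hello <b>world</b></p></body></html>";
        System.out.println(keywords(html));
        System.out.println(text(html));
    }
}
```

Each tag is replaced with a space rather than an empty string so that words in adjacent elements do not run together; the cost is that a word split by inline markup, such as `wor<b>ld</b>`, comes out as two words.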
     
     