
Removing all kinds of tags

 
Ranch Hand
Posts: 38
Hi,
I am working on a project that collects a lot of data from the web (saved in .80 files). The pages contain every possible kind of markup: HTML, XML, HTML inside XML, and CSS, as well as pages like this one: http://www.usustatesman.com/se/the-statesman-rss-1.544390

I need to remove ALL the tags (of ANY kind) from the content of these pages and get the plain text.

Is there any parser that can do this for me, or any other way to remove these tags?

Thank you so much!
 
Sheriff
Posts: 5555
I would probably forget about using Java altogether and just use sed with the search expression <[^<>]*> and a delete action (an empty replacement). That will get rid of all <tags> <that> </look> </like> <this> and leave everything else as is.
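
The sed suggestion above might look roughly like the sketch below. The helper name strip_tags and the .txt output name in the commented loop are assumptions for illustration only; the .80 extension comes from the question. Note that sed works line by line, so a tag that is split across two lines will not be matched, and the text inside style or script elements is left in place because only the tags themselves are deleted.

```shell
# Hypothetical helper: delete everything that looks like <...> from stdin.
# The pattern matches "<", then any run of characters that are neither "<"
# nor ">", then ">". The g flag applies it to every match on a line.
strip_tags() {
  sed -e 's/<[^<>]*>//g'
}

# Possible batch use (output naming is an assumption, not from the thread):
# for f in *.80; do strip_tags < "$f" > "${f%.80}.txt"; done

# Small demonstration on one line of markup.
printf '%s\n' '<p>Hello <b>world</b></p>' | strip_tags
# prints: Hello world
```

If the pages need entity decoding (for example &amp;amp;) or the removal of script and style content, a real HTML parser is likely a better fit than a single regular expression.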