
HTML to Text via Screen scrape

Dean O'olish
Greenhorn

Joined: Mar 03, 2009
Posts: 16
hello,

I am tasked with providing my client the ability to view a web page as text (think of a "T" icon in the top right corner that is clickable from a JSP). The client would like to be able to simply cut and paste what they see in front of them into another source. I bet many will say just highlight the HTML and cut it, which is what we originally proposed; however, they say they are also getting hidden markup, which isn't what they want.

In any case I need to find a way to get the HTML into text. I was thinking screen scrape, but am having a hard time passing the information via AJAX since a web page is pretty large and the querystring can only have so much information.

Does anyone know of a good library to do this (whether it is via screen scrape or not)?

Thanks for your help,

- DTE
Bear Bibeault
Author and ninkuma
Marshal

Joined: Jan 10, 2002
Posts: 60997
    

"Dean Ooooh", please check your private messages for an important administrative matter.

I've also moved this post to a more appropriate location.


Bear Bibeault
Author and ninkuma
Marshal

Joined: Jan 10, 2002
Posts: 60997
    

Dean Oo wrote:but am having a hard time passing the information via AJAX since a web page is pretty large and the querystring can only have so much information.

Then don't pass it in the query string.
Dean O'olish
Greenhorn

Joined: Mar 03, 2009
Posts: 16
I am not, nor would I want to... I am just explaining what I have tried or was thinking...

Bear Bibeault wrote:
Dean Oo wrote:but am having a hard time passing the information via AJAX since a web page is pretty large and the querystring can only have so much information.

Then don't pass it in the query string.
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541
    

First step would be to determine the requirements.

It doesn't make sense to say you want to "get the HTML into text", because HTML is already text. And there's this "hidden markup"... what is that?

And when your client is viewing a web page, what are they using to do that? A browser connected to the web page? A text editor looking at the source code... the HTML?

And are these web pages just anything they can get to on the web, or just the web pages served by your site?
Dean O'olish
Greenhorn

Joined: Mar 03, 2009
Posts: 16
I am trying to get the HTML and parse it into plain text; however, I am having a hard time figuring out a way to get that HTML content from either the front end or the back end. I tried capturing its dynamic information (it depends on request and session information) at the servlet and then parsing it... but I cannot figure that out.


I next thought I would try to screen scrape and pass the HTML content as a string to the server through a POST, but the content was too big, and Ajax is just annoying with something like this.


Any suggestions on how to get the rendered content of a JSP into my servlet action (using Struts)?

Thanks



Paul Clapham wrote:First step would be to determine the requirements.

It doesn't make sense to say you want to "get the HTML into text", because HTML is already text. And there's this "hidden markup"... what is that?

And when your client is viewing a web page, what are they using to do that? A browser connected to the web page? A text editor looking at the source code... the HTML?

And are these web pages just anything they can get to on the web, or just the web pages served by your site?
Eric Pascarello
author
Rancher

Joined: Nov 08, 2001
Posts: 15376
    
"content was too big" What is too big?

Have you gone through basic Ajax tutorials on how to send post data to the server and display a result on the page? If you have not, it might be a good time to try it and not jump right into a big task.

Eric
Dean O'olish
Greenhorn

Joined: Mar 03, 2009
Posts: 16
I have, I know Ajax/JS very well... using MooTools, Prototype, jQuery as well as my own libraries. I've decided not to go with the idea of screen scraping and am trying to figure out how I can get the HTML content as a String through my Struts Action class. I've seen some posts where others have either suggested or tried creating an HttpServletResponseWrapper subclass, but I have not been able to make it work for me as of yet...

My work does not want to use a filter nor custom tags, which leaves me with wrapping the Response.
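[Editor's sketch] The response-wrapping idea mentioned here comes down to handing the rendering code a writer backed by an in-memory buffer instead of the real output stream. The plain-Java sketch below models only that part, so it runs without a servlet container. The class name `BufferedPage` is hypothetical. In a real application the same `getWriter()` override would presumably live in a subclass of `javax.servlet.http.HttpServletResponseWrapper` that is passed to the JSP when the request is dispatched; that wiring is an assumption and is not shown.

```java
import java.io.CharArrayWriter;
import java.io.PrintWriter;

// Hypothetical stand-in for a response wrapper: it models only the
// getWriter() part, so it needs no servlet container to run.
final class BufferedPage {
    private final CharArrayWriter buffer = new CharArrayWriter();
    private final PrintWriter writer = new PrintWriter(buffer);

    // The rendering code (a JSP, in the real case) writes here.
    PrintWriter getWriter() {
        return writer;
    }

    // Flush pending output and return everything written so far.
    String getContent() {
        writer.flush();
        return buffer.toString();
    }

    public static void main(String[] args) {
        BufferedPage page = new BufferedPage();
        page.getWriter().print("<html><body><p>Hello</p></body></html>");
        System.out.println(page.getContent());
    }
}
```

Once the rendered markup is available as a String, it can be handed to whatever HTML-to-text step comes next.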

I think I may have to brush up on my OutputStream class. What I am trying to accomplish is:

1. view the HTML from the request in a readable form to make sure all the content is there.
2. take that readable content and put it through a proprietary library that turns a String of HTML into a text file
3. display that text file to the user
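[Editor's sketch] Step 2 in the list above relies on a proprietary library that is not shown in the thread. As a rough stand-in, the HTML parser that ships with the JDK (`javax.swing.text.html.parser.ParserDelegator`) can collect the text nodes of a page. The class name `HtmlToText` and the choice to join text runs with a single space are assumptions made for illustration, not what the poster's library does.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: keeps the text nodes of an HTML string, drops the tags.
final class HtmlToText {
    static String toPlainText(String html) {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Separate consecutive text runs with one space.
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(data);
            }
        };
        try {
            // 'true' tells the parser to ignore any charset declared in the markup.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Reading from a StringReader should not fail; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toPlainText("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

A dedicated HTML library would likely preserve line breaks and block structure better; this only shows that tag stripping can happen on the server once the markup is in a String.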


any suggestions would be greatly appreciated

Thanks




Eric Pascarello wrote:"content was too big" What is too big?

Have you gone through basic Ajax tutorials on how to send post data to the server and display a result on the page? If you have not, it might be a good time to try it and not jump right into a big task.

Eric
Eric Pascarello
author
Rancher

Joined: Nov 08, 2001
Posts: 15376
    
Why are you converting it to a textfile?

I would request the page that you are grabbing the contents for and output back the response to the client. No need to store it in a txt file unless you plan on using that at some later time and do not want to fetch it again.

Eric
Dean O'olish
Greenhorn

Joined: Mar 03, 2009
Posts: 16
Our client wants to view the JSP as a text file (I don't know why, but I have to build it) with the HTML stripped.

How do you propose that I call the JSP and get a hold of the rendered content?


Eric Pascarello wrote:Why are you converting it to a textfile?

I would request the page that you are grabbing the contents for and output back the response to the client. No need to store it in a txt file unless you plan on using that at some later time and do not want to fetch it again.

Eric
Eric Pascarello
author
Rancher

Joined: Nov 08, 2001
Posts: 15376
    
If you are going to request the file via Ajax it is not going to be a textfile in the responseText.

Eric
Dean O'olish
Greenhorn

Joined: Mar 03, 2009
Posts: 16
Eric,

As I've stated, I'm not doing that anymore and am trying to get the information from the request.

Eric Pascarello wrote:If you are going to request the file via Ajax it is not going to be a textfile in the responseText.

Eric
Eric Pascarello
author
Rancher

Joined: Nov 08, 2001
Posts: 15376
    
Off-topic: can I ask why you type your response before the quoted text? It is a rather backward convention.

Eric
Dean O'olish
Greenhorn

Joined: Mar 03, 2009
Posts: 16
Eric Pascarello wrote: Off-topic: can I ask why you type your response before the quoted text? It is a rather backward convention.

Eric


Easier to read
Bear Bibeault
Author and ninkuma
Marshal

Joined: Jan 10, 2002
Posts: 60997
    

Dean O'olish wrote:Easier to read

For who? Certainly not the typical reader of these forums.
Dean O'olish
Greenhorn

Joined: Mar 03, 2009
Posts: 16
Ya, I enjoy scrolling down to the bottom of a long thread just to read a one-line comment... It's a lot easier on top, Bear.

Bear Bibeault wrote:
Dean O'olish wrote:Easier to read

For who? Certainly not the typical reader of these forums.
Christophe Verré
Sheriff

Joined: Nov 24, 2005
Posts: 14687
    

Wouldn't it be even easier not to quote the previous post at all?


David Newton
Author
Rancher

Joined: Sep 29, 2008
Posts: 12617

A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?

Top-posting is wrong, and always has been.

On-topic: use a filter; escape response, change header.
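
[Editor's sketch] One reading of "escape response" is that the captured markup is HTML-escaped before it is written back, so the browser displays it as literal text instead of rendering it. "Change header" most likely refers to the Content-Type, for example `response.setContentType("text/plain")` in the servlet API. The escaping half can be sketched in plain Java; the class name `HtmlEscaper` is hypothetical, and the filter wiring itself is not shown.

```java
// Hypothetical helper: replaces the characters that are special in HTML
// with character references so markup is shown instead of interpreted.
final class HtmlEscaper {
    static String escape(String html) {
        StringBuilder out = new StringBuilder(html.length() + 16);
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("<p class=\"x\">a & b</p>"));
    }
}
```

If the response is served as text/plain instead, escaping should not be needed, since the browser will not interpret the markup at all. Neither option strips the tags, which is what the client in this thread asked for.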
 