Parse string and make every URL into hyperlink

Leo Li-Fan Chen
Greenhorn

Joined: Jul 17, 2012
Posts: 3
Dear all,

I have a requirement to write a function that accepts a String that contains html content (from webpage or email), transform every URL found into hyperlink, and return that String.

e.g.
www.yahoo.com
becomes
<a href="http://www.yahoo.com">www.yahoo.com</a>

Of course, if the URL is already inside an <a> tag, it should be left as it is; I believe this is the most difficult requirement.

Does anyone know where I can find sample code, or an open-source implementation?

thanks


Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8223

Leo Li-Fan Chen wrote:I have a requirement to write a function that accepts a String that contains html content (from webpage or email), transform every URL found into hyperlink, and return that String.

Right, just to make your requirements clear:
You want to search a piece of HTML text and convert all strings that are outside <a> tags to something like
<a href="{whatever}">{whatever}</a>
is that right?

Assuming that's right, you'll probably need a parser (e.g., SAX or DOM). You could do it in a rudimentary way with regexes, but you're likely to run into embedding situations that make it difficult to guarantee 100% success. And if you need a parser, you'll need some way of converting to XHTML first (there are tons of converters out there, but I personally like JTidy, simply because it's a Java port of HTMLTidy, the granddaddy of converters).

After that it's simply a case of deciding what makes a "potential hyperlink" and wrapping it as described. If that's your question, have a look at String.replace().

Winston


Carey Brown
Ranch Hand

Joined: Nov 19, 2001
Posts: 208

Seems like the slippery part of your problem is nailing down exactly what denotes a "URL". Something ending in ".com"? "www.yahoo.com?this=is%20also%20a%20url" ? What if the URL gets broken across lines?


Programs that I've seen attempt this require that a leading "http://" or "https://" be present.



Sent from my IBM 360 mainframe
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8223

Carey Brown wrote:Programs that I've seen attempt this require that a leading "http://" or "https://" be present.

Totally agree with what you say, but a leading "www." might not be a bad candidate either.
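
To make the "what counts as a URL" question concrete, here is a minimal sketch of one possible rule: a candidate starts with "http://", "https://" or "www." and runs until whitespace, a quote or an angle bracket. The class name UrlCandidates and the exact pattern are assumptions for illustration, not a complete URL grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlCandidates {
    // Assumed rule: a leading scheme or "www." followed by any run of
    // characters that are not whitespace, quotes or angle brackets.
    static final Pattern URL =
            Pattern.compile("(?i)\\b(?:https?://|www\\.)[^\\s<>\"']+");

    // Returns every candidate found in the text, in order of appearance.
    static List<String> find(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(find("See www.yahoo.com?this=is%20also%20a%20url or http://example.org/a b"));
    }
}
```

With this rule a bare "yahoo.com" is not detected, a URL broken across lines is cut at the line break, and trailing punctuation such as a full stop is swallowed into the match, so it would need tightening for real input.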

Winston
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 39834
And welcome to the Ranch
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8223

Campbell Ritchie wrote:And welcome to the Ranch

Dang! Keep forgetting that.

Welcome Leo.

Winston
Leo Li-Fan Chen
Greenhorn

Joined: Jul 17, 2012
Posts: 3
Winston Gutkowski wrote:
Leo Li-Fan Chen wrote:I have a requirement to write a function that accepts a String that contains html content (from webpage or email), transform every URL found into hyperlink, and return that String.

Right, just to make your requirements clear:
You want to search a piece of HTML text and convert all strings that are outside <a> tags to something like
<a href="{whatever}">{whatever}</a>
is that right?

Thanks for the prompt response.
Sorry I didn't make it very clear.

My task is to search a piece of HTML text and convert URLs (www., http://, https://, rtps://) that are
1. within <body></body>
2. outside any html tags
3. not already enclosed by <a></a>, in other words, not hyperlinked
to <a href="{whatever}">{whatever}</a>.

The purpose is to make the URLs (www., http://, https://, rtps://) clickable when the HTML text is ultimately passed to WebView.

I'd prefer not to use any library, though; we need to minimize our app's footprint.
I believe point number 3 is the most difficult task.
e.g.
If I detect a URL (e.g. www.yahoo.com), how far before that URL should I scan for a possible <a> tag?

Anyone had experience writing this kind of algorithm?
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8223

Leo Li-Fan Chen wrote:I'd prefer not to use any library though, need to minimize our app's footprint.
I believe point number 3 is the most difficult task.

Actually, if you don't want to use a library, #2 is likely to be by far the most difficult.

Assuming that you actually have a String, let's say called 'toLink', that contains a potential 'hyperlink' (e.g. "www.yahoo.com", which, BTW, violates what you said above, because it doesn't start with "http://") that you know is outside of any tags, then changing it is as simple as:

String hyperlink = toLink.replaceFirst("(.*)", "<a href=\"$1\">$1</a>");

But, as I say, without a parser, it's that "knowing it's outside of any tags" that's going to be the problem.(*)

You can actually make the replacement quite a bit smarter too, but I'll leave you to read up on regexes.

Winston

[Edit] (*) Actually, I suspect that even the "outside of any tags" is wrong. What about:
"<p>some text that contains http://www.google.com in here</p>"?
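
For what it's worth, here is a rough sketch of how the "is it outside any tag and outside any <a> element" question could be answered without a parser and without scanning backwards: split the input into tags and text runs in one forward pass and keep a counter of open <a> elements. The class name Linkifier, the URL rule and the decision to prepend "http://" to "www." links are all assumptions; it also assumes that '>' never appears inside an attribute value and that there are no comments or scripts to confuse it. The "within <body>" restriction is not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Linkifier {
    // Assumed tag shape: "<" or "</", a tag name, then anything up to ">".
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");
    private static final Pattern URL =
            Pattern.compile("(?i)\\b(?:https?://|www\\.)[^\\s<>\"']+");

    static String linkify(String html) {
        StringBuilder out = new StringBuilder();
        Matcher tag = TAG.matcher(html);
        int pos = 0;
        int anchorDepth = 0; // > 0 while we are inside <a>...</a>
        while (tag.find()) {
            // Text between the previous tag and this one.
            appendText(out, html.substring(pos, tag.start()), anchorDepth == 0);
            out.append(tag.group()); // tags are copied through untouched
            if (tag.group(2).equalsIgnoreCase("a")) {
                anchorDepth += tag.group(1).isEmpty() ? 1 : -1;
                if (anchorDepth < 0) anchorDepth = 0; // stray </a>
            }
            pos = tag.end();
        }
        appendText(out, html.substring(pos), anchorDepth == 0);
        return out.toString();
    }

    // Copies text to out, wrapping URL candidates only when wrap is true.
    private static void appendText(StringBuilder out, String text, boolean wrap) {
        if (!wrap) {
            out.append(text);
            return;
        }
        Matcher m = URL.matcher(text);
        int last = 0;
        while (m.find()) {
            String url = m.group();
            String href = url.regionMatches(true, 0, "www.", 0, 4) ? "http://" + url : url;
            out.append(text, last, m.start())
               .append("<a href=\"").append(href).append("\">")
               .append(url).append("</a>");
            last = m.end();
        }
        out.append(text, last, text.length());
    }
}
```

Because URLs inside attribute values are part of a tag, they are never touched, and text inside an existing anchor is skipped simply because the counter is non-zero when it is reached.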
Aditya Jha
Ranch Hand

Joined: Aug 25, 2003
Posts: 227

Leo Li-Fan Chen wrote:I believe point number 3 is the most difficult task.
e.g.
If I detect a URL (e.g. www.yahoo.com), how far before that URL should I scan for a possible <a> tag?

So, let's say you blindly replace each URL with <a href="{0}">{0}</a>.

For the URLs which are already wrapped by anchor element (<a href="{0}">foo and bar</a>), it would mean:

<a href="<a href="{0}">{0}</a>">foo and bar</a>

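Now, if in this string we replace the character sequence "<a href=" (both quotes included) with a single " and the sequence ">{0}</a>" (both quotes included) with a single ", I believe we'll be able to remove the extra wrapping and get back <a href="{0}">foo and bar</a>.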

You will need to perform the above replacement for each URL (on the string you constitute after wrapping found URL in <a> element).

Haven't tested it... just thinking aloud.
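
A small sketch of the idea for a single, already-known URL, just to show the mechanics. The class and method names are made up for illustration, and the three steps are plain literal replacements, not regexes.

```java
final class BlindWrap {
    // Wraps every occurrence of url, then undoes the wrapping wherever the
    // URL was already the value of an href attribute.
    static String wrapThenUnwrap(String html, String url) {
        String wrapped = "<a href=\"" + url + "\">" + url + "</a>";
        String blind = html.replace(url, wrapped);
        return blind
                .replace("\"<a href=\"", "\"")           // quote + <a href= + quote
                .replace("\">" + url + "</a>\"", "\"");  // quote + >url</a> + quote
    }

    public static void main(String[] args) {
        String html = "Visit http://a.example/x or <a href=\"http://a.example/x\">foo and bar</a>";
        System.out.println(wrapThenUnwrap(html, "http://a.example/x"));
    }
}
```

This only repairs the href attribute. If the visible text of an existing link is itself the URL, that text still ends up wrapped a second time, so the approach would need an extra step for that case.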
Aditya Jha
Ranch Hand

Joined: Aug 25, 2003
Posts: 227

Winston Gutkowski wrote:you'll probably need a parser (ie, Sax or DOM). You could do it in a rudimentary way with regexes, but you're likely to run into embedding situations that make it difficult to guarantee 100% success.

This holds true when we have at least some level of control over how the HTML is going to be constructed. In practice, I have faced situations where tidying up the HTML resulted in closing of (usually) unclosed elements. The most problematic ones were empty <div> elements, which, when closed like <div />, started messing up rendering on certain browsers (no prize for guessing which one).

Using regular expressions is no doubt a rather crude approach, but in such cases it almost becomes necessary. And I must confess, I'm a huge fan of regexes, so I may be a little biased here.

If tidying up is not an issue for the given HTML, then I fully agree with your suggestion of JTidy - a quite handy tool.
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8223

Aditya Jha wrote:So, let's say you blindly replace each URL with <a href="{0}">{0}</a>.

Actually, I think Leo covered that with his "outside of any tags" statement, but that's what's going to be difficult.

Winston
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8223

Aditya Jha wrote:I have faced situations where tidying up the HTML resulted in closing of (usually) unclosed elements. The most problematic ones were empty <div> elements, which, when closed like <div />, started messing up rendering on certain browsers (no prize for guessing which one ).

Hm. You surprise me. I put several million lines of HTML through it back when it was still called htmltidy, and can count the formatting issues we had on the fingers of one hand; and those mostly involved overlapping tags. And I'm quite sure that some of it had <div> tags (maybe not that much though).

But admittedly, it's been quite a while since I used it.

Winston
Panagiotis Kalogeropoulos
Rancher

Joined: May 27, 2011
Posts: 99

For parsing html documents, you could use the classes found in javax.swing.text.html and javax.swing.text.html.parser packages. For instance the following simple program will print the text found inside the html <a> tags:
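
A minimal sketch of such a program, written to match the description in this post, with the numbered comments (1) to (7) marking the steps it refers to. The class name, the helper method and the sample HTML are assumptions.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final class AnchorTextPrinter {
    // Collects the text found inside <a> elements of the given HTML.
    static List<String> anchorTexts(String html) {
        final List<String> texts = new ArrayList<>();
        // (1) the concrete parser implementation
        HTMLEditorKit.Parser parser = new ParserDelegator();
        // (2) the callback that receives tags and text
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean insideAnchor;

            // (3) called when an opening tag is read
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) insideAnchor = true;
            }

            // (4) called when a closing tag is read
            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.A) insideAnchor = false;
            }

            // (5) called for text content
            @Override
            public void handleText(char[] data, int pos) {
                if (insideAnchor) texts.add(new String(data));
            }
        };
        // (6) a Reader over the html document
        Reader reader = new StringReader(html);
        try {
            // (7) start parsing; true means ignore charset declarations
            parser.parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>before <a href=\"www.test.com\">test</a> after</p></body></html>";
        for (String text : anchorTexts(html)) {
            System.out.println(text);
        }
    }
}
```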




All you need to do is to get an instance of HTMLEditorKit.Parser (see comment 1) which we obtain by calling its concrete implementation, ParserDelegator. We need that instance in order to invoke the 'parse' method, which of course will parse our html document. You will also need an instance of HTMLEditorKit.ParserCallback, which allows us to read, handle and manipulate html tags and text. Its methods are empty implementations, so you will have to decide which ones you will need to override (depending on what you want to do with the document). I am creating an instance of it in comment 2, while in comments 3, 4 and 5 I am overriding the methods that we need at this moment. Each method will be invoked when the parser is at an "appropriate" place. For instance, if we have the tag <a href="www.test.com">test</a>, the handleStartTag will be invoked when the parser reads the beginning of the tag (the <a), the handleEndTag will be invoked when the parser is at the end of the tag (the </a>) and the handleText will be invoked when the parser reads the text (the 'test' in our case). In the end, we create a Reader with the html document (comment 6) and call the parse method (comment 7) in order for the parser to start reading the document.

In case the above seemed very frustrating to you, don't worry at all! Html parsing in Java is quite complicated, but the minute you understand how it works, you can do many things with it. Of course you can extend the above example to the task that you are facing, which is to read text in the body of the html document to see if it is a url. You can surely give it a try to see if that fits your needs and come back with any questions that you may have.
Leo Li-Fan Chen
Greenhorn

Joined: Jul 17, 2012
Posts: 3
Thanks for your info.
But as I am developing this algorithm on Android, I won't have access to any of the libraries you mentioned.
Another thing is, to avoid parsing through the same HTML content multiple times, we already have a parsing framework like so



so I need to stick to this approach to avoid traversing the content multiple times, which is why I can't use libraries.
So basically I have to implement parseURLLink(ch).
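
One possible shape for a per-character hook, sketched under several assumptions: the method is called once for every character in document order, a '<' always starts a tag, a '>' always ends one, and a finish() call (a hypothetical addition) collects the result at the end. Only parseURLLink is a name from this post; everything else is made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StreamingLinker {
    private static final Pattern URL =
            Pattern.compile("(?i)\\b(?:https?://|www\\.)[^\\s<>\"']+");

    private final StringBuilder out = new StringBuilder();
    private final StringBuilder pending = new StringBuilder(); // current tag or text run
    private boolean inTag;
    private int anchorDepth; // > 0 while inside <a>...</a>

    // Called once per character; buffers until a tag or text run is complete.
    void parseURLLink(char ch) {
        if (!inTag && ch == '<') {
            flushText();
            inTag = true;
            pending.append(ch);
        } else if (inTag && ch == '>') {
            pending.append(ch);
            flushTag();
            inTag = false;
        } else {
            pending.append(ch);
        }
    }

    // Flushes whatever is still buffered and returns the rewritten HTML.
    String finish() {
        if (inTag) {
            out.append(pending); // unterminated tag: copy as-is
            pending.setLength(0);
        } else {
            flushText();
        }
        return out.toString();
    }

    private void flushTag() {
        String tag = pending.toString();
        if (tag.matches("(?i)<a(\\s[^>]*)?>")) {
            anchorDepth++;
        } else if (tag.matches("(?i)</a\\s*>") && anchorDepth > 0) {
            anchorDepth--;
        }
        out.append(pending); // tags are never modified
        pending.setLength(0);
    }

    private void flushText() {
        if (anchorDepth > 0) {
            out.append(pending); // already hyperlinked
        } else {
            Matcher m = URL.matcher(pending);
            int last = 0;
            while (m.find()) {
                String url = m.group();
                String href = url.regionMatches(true, 0, "www.", 0, 4) ? "http://" + url : url;
                out.append(pending, last, m.start())
                   .append("<a href=\"").append(href).append("\">")
                   .append(url).append("</a>");
                last = m.end();
            }
            out.append(pending, last, pending.length());
        }
        pending.setLength(0);
    }
}
```

The content is still traversed only once; the price is that a text run has to be buffered until the next '<' before it can be checked for URLs.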
Aditya Jha
Ranch Hand

Joined: Aug 25, 2003
Posts: 227

Winston Gutkowski wrote:
Aditya Jha wrote:So, let's say you blindly replace each URL with <a href="{0}">{0}</a>.

Actually, I think Leo covered that with his "outside of any tags" statement, but that's what's going to be difficult.

Winston

I'm not sure I'm getting you here. What I proposed was a (possible) solution to point 3 (the rest of my post).
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8223

Leo Li-Fan Chen wrote:Another thing is, to avoid parsing through the same HTML content multiple times, we already have a parsing framework like so

I'm sorry, but that isn't a framework, that's a character iterator. In fact it isn't even that.

You plainly have a character array that contains some or all of your html content.
My advice: either make it a String instead, or convert it to one; then you'll at least be able to use regexes (I assume; I know nothing about Android, but the website seems to suggest you can).

Doing it with regexes is going to be a slow and painful task, so I really suggest you try and find a parser. According to this page, JSoup will work, but I've never tried it.

And even then, it seems to me that you've got a fair bit of work ahead of you to work out exactly when you can do the replace and when you can't.

Winston
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8223

Aditya Jha wrote:I'm not sure I'm getting you here. What I proposed was a (possible) solution to point 3 (the rest of my post).

I was referring to your "blindly replace each URL" statement. I don't think that's what Leo had in mind at all.

Winston
Aditya Jha
Ranch Hand

Joined: Aug 25, 2003
Posts: 227

Winston Gutkowski wrote:
Aditya Jha wrote:I'm not sure I'm getting you here. What I proposed was a (possible) solution to point 3 (the rest of my post).

I was referring to your "blindly replace each URL" statement. I don't think that's what Leo had in mind at all.

Winston

Right. So, the 2 string replacements after the "blind replace" (as mentioned in my post) will help him negate the bad effect of the blind replace.
Panagiotis Kalogeropoulos
Rancher

Joined: May 27, 2011
Posts: 99

Leo Li-Fan Chen wrote:I am developing this algorithm on Android


Oops, I didn't know that. Now that you've mentioned it, you could use the Html class (found in the android.text package), and more specifically its fromHtml(String source) method. This method returns a Spanned instance, which you can use to parse the document. If you look at them, the methods have more or less the same meaning as the ones found in the HTMLEditorKit.ParserCallback that I mentioned in my previous post, so it should be easier for you to understand how they work. You can also take a look at TagSoup, which is used internally by the fromHtml method, in case you need to know how things work.
 