Extracting a string from HTML Source Code

 
Marcus Rauchfuss
Ranch Hand
Hi all,

I want to extract a specific String from HTML. Specifically, I want to extract the String in between <...>

So far, I've got this


The problem I have is when I change the last parameter in this line:



to



i.e. the generic alternative, I get this error message:

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -366
at java.lang.String.substring(Unknown Source)
at main.HTMLGrabber.main(HTMLGrabber.java:45)


Is there a better and simpler way to extract a substring?

Thank you.

 
Stefan Evans
Bartender
Do you understand the error message? String index out of range: -366
Do you understand WHY you are getting it?
What does the number -366 indicate?

You do realise that there will be more than one closing tag in the text you are searching, and the indexOf method starts searching from the start of the string...
 
Marcus Rauchfuss
Ranch Hand
Yes, Stefan, I am aware of what this message means, and yes, Stefan, I know there are other > characters in the code (there are lots of > in HTML, and I actually do a lot of HTML pushing in my day job). My question was:

Is there a better and simpler way to extract a substring?


because I am aware the one I have chosen does not work.
 
Stefan Evans
Bartender
Sorry if that came across as a little patronizing (a little?!?).
I got focused on fixing the issue rather than answering your actual question.

I'll continue in that line for a moment:
The substring method takes two arguments, a start index and an end index.
Your change has evidently made the calculated end index fall before the start index, because indexOf finds the first match in the whole string.
You could possibly fix this by using the version of the indexOf method that takes a starting point to search from.

ie:
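A minimal sketch of that fix, not the original poster's code: the class name, the marker strings and the sample HTML below are all made up for illustration. The only real point is the two-argument indexOf(String, int), which looks for the end marker starting at the start position, so the end index can never land before the start index. (As far as I know, on Java 8 and earlier the number in that exception message is the end index minus the begin index, so -366 would mean the end index came out 366 characters before the start.)

```java
// Hypothetical helper for illustration; marker strings and sample HTML are assumptions.
class IndexOfExtractor {

    /**
     * Returns the text between startMarker and the first endMarker that
     * follows it, or null if either marker is missing.
     */
    static String between(String source, String startMarker, String endMarker) {
        int start = source.indexOf(startMarker);
        if (start < 0) {
            return null; // start marker not found
        }
        start += startMarker.length(); // skip past the marker itself

        // Search for the end marker only from 'start' onwards, so an earlier
        // occurrence (for example the '>' that closes <html>) is never picked up.
        int end = source.indexOf(endMarker, start);
        if (end < 0) {
            return null; // no end marker after the start marker
        }
        return source.substring(start, end);
    }

    public static void main(String[] args) {
        String html = "<html><head>"
                + "<meta name=\"keywords\" content=\"java, html, parsing\">"
                + "</head><body></body></html>";
        System.out.println(between(html, "content=\"", "\">"));
    }
}
```

Returning null when a marker is missing is just one choice; throwing an exception or returning an empty String would work as well, depending on what the caller expects.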




To answer your question: Is there a better way?
Well there is another way: Using a regular expression to capture the part you are interested in.

Using regular expressions to parse full HTML is not generally recommended, but if all you are after is the content of the meta keywords tag, then it should be something like:
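A rough sketch of the regex approach, using only java.util.regex. The pattern is an assumption about what the page looks like: it expects name="keywords" to come directly before content="..." inside the meta tag, with double quotes around both values. Real pages may order or quote the attributes differently, which is exactly why regex is fragile for HTML. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; the pattern assumes a fixed attribute order.
class RegexExtractor {

    // Group 1 captures everything between the quotes of the content attribute.
    private static final Pattern META_KEYWORDS = Pattern.compile(
            "<meta\\s+name=\"keywords\"\\s+content=\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    /** Returns the content of the meta keywords tag, or null if the pattern does not match. */
    static String keywords(String html) {
        Matcher m = META_KEYWORDS.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<html><head>"
                + "<meta name=\"keywords\" content=\"java, html, parsing\">"
                + "</head><body></body></html>";
        System.out.println(keywords(html));
    }
}
```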

 
Junilu Lacar
Marshal
I would just use one of the many libraries out there to parse the HTML, then find the meta element whose name attribute is keywords.
 
Liutauras Vilda
Sheriff
http://www.oracle.com/technetwork/articles/java/json-1973242.html
Marcus, have a look at this. You could use JsonParser.
 
Junilu Lacar
Marshal

Liutauras Vilda wrote: http://www.oracle.com/technetwork/articles/java/json-1973242.html
Marcus, have a look at this. You could use JsonParser.



He doesn't have JSON though, he's trying to parse HTML.
 
Liutauras Vilda
Sheriff

Junilu Lacar wrote:

Liutauras Vilda wrote: http://www.oracle.com/technetwork/articles/java/json-1973242.html
Marcus, have a look at this. You could use JsonParser.



He doesn't have JSON though, he's trying to parse HTML.


Sorry, I was too hasty. I meant jsoup. Thanks for correcting me.
Marcus, please ignore my previous post; Junilu is absolutely right.
I do apologise for the misleading post.

Here is what I wanted to post for Marcus.
http://jsoup.org
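For what it's worth, a short sketch of how this might look with jsoup, assuming the jsoup jar is on the classpath and that the page carries a meta tag with name="keywords". The class name and sample HTML are hypothetical; Jsoup.parse, select, first and attr are the library calls doing the work.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper for illustration; requires the third-party jsoup library.
class JsoupExtractor {

    /** Returns the content of the meta keywords tag, or null if the page has none. */
    static String keywords(String html) {
        Document doc = Jsoup.parse(html);
        // CSS-style selector: a <meta> element whose name attribute is "keywords".
        Element meta = doc.select("meta[name=keywords]").first();
        return meta == null ? null : meta.attr("content");
    }

    public static void main(String[] args) {
        String html = "<html><head>"
                + "<meta name=\"keywords\" content=\"java, html, parsing\">"
                + "</head><body></body></html>";
        System.out.println(keywords(html));
    }
}
```

Unlike the indexOf and regex routes, this does not care about attribute order, quoting style or extra whitespace inside the tag, because the parser handles those details.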
 