mj zammit

Ranch Hand
since Nov 16, 2008

Recent posts by mj zammit

I have a variable "text".
This variable contains the HTML of a Web page.
I want to parse this HTML to find the href values of each <a> tag.

My algorithm is as so:
1. get html of a web page
2. place the html content of the Web page in the variable "text"
3. parse the html content to find the <a> tag. This is done using the regular expression "(?id)<a\\s+(.*?)/?>"
4. For each <a> tag found, parse through it again to find the attribute value of href. This is done using the regular expression "HREF\\s*=\\s*(\"([^\"]+)\"|\'([^']+)\'|[^'\"])"
5. get href's attribute value. This is done in the following line - res_url = m2.group(1);
6. change the variable res_url. Done using the method check_res_url()
7. the href's attribute value is then replaced with the variable new_res_url. This is done using m2.replaceAll(new_res_url);

This replacement must be shown in the variable "text" (the html of the Web page)



When I then checked the HTML page after this method had run, no replacements had been made where there should have been. For example, instead of having
I wanted to see in the html page.

The regular expressions that I am using work for me for now (I still have to go over them), but replaceAll() is not doing what I expected it to do. I am not sure whether it has something to do with the algorithm of my program.
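The seven steps above can be sketched in a single pass over "text". This is only a minimal sketch under assumptions: the combined pattern, the class name HrefRewriter and the stand-in checkResUrl() (which here always returns "www.hello.com") are hypothetical and not the original code. It uses Matcher.appendReplacement() and appendTail() so that the rewritten href values end up in the full page rather than in a separate tag string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefRewriter {
    // Group 1: everything from "<a" up to and including "href=".
    // Groups 2-4: the value, whether double-quoted, single-quoted or unquoted.
    private static final Pattern A_HREF = Pattern.compile(
        "(<a\\s[^>]*?\\bhref\\s*=\\s*)(?:\"([^\"]*)\"|'([^']*)'|([^\\s'\">]+))",
        Pattern.CASE_INSENSITIVE);

    // Hypothetical stand-in for check_res_url(); the real method is not shown in the post.
    static String checkResUrl(String url) {
        return "www.hello.com";
    }

    static String rewrite(String text) {
        Matcher m = A_HREF.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Exactly one of the three value groups is non-null for each match.
            String url = m.group(2) != null ? m.group(2)
                       : m.group(3) != null ? m.group(3)
                       : m.group(4);
            String replacement = m.group(1) + "\"" + checkResUrl(url) + "\"";
            // quoteReplacement() keeps '$' and '\' in the new URL from being
            // interpreted as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "<p><a class=\"x\" href='tree.html'>Tree</a></p>";
        text = rewrite(text); // Strings are immutable, so the result must be assigned back
        System.out.println(text);
    }
}
```

The key point is the assignment text = rewrite(text): no Matcher method changes the String it was created from.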

12 years ago
I am managing to find the tags and the attribute with the regular expressions I have (for the time being, of course).
But what I do not understand is why the found attribute value is not being replaced by an optimized string in the HTML page itself (i.e. "text").
12 years ago
It's true my regular expressions are not perfect.
Like you pointed out, why bother placing (?i) when I already have Pattern.CASE_INSENSITIVE set.
I am new to regular expressions, and those in the code provided are still trial and error.
Even though they are not perfect, I am using them so I can build a working program for now, and I will refine them later.

As for Pattern.MULTILINE, I am still uncertain what it provides. If I understood correctly, by using Pattern.MULTILINE I am also covering the case where the string I am searching for is spread over 2 lines.
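A small sketch may help separate the flags here. In java.util.regex, Pattern.MULTILINE only changes where ^ and $ match (at every line, not just at the ends of the input), while Pattern.DOTALL is the flag that lets '.' match a line terminator, which is what a tag spread over 2 lines needs. Also, the 'd' in "(?id)" is the embedded form of Pattern.UNIX_LINES. The class name FlagDemo and the sample strings are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagDemo {
    // Counts how many times the regex matches the input with the given flags.
    static int count(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // A tag whose attributes are split across two lines.
        String tag = "<a href=\"tree.html\"\n   class=\"x\">";

        // '.' does not cross the newline unless DOTALL is set.
        System.out.println(count("<a\\s+(.*?)>", 0, tag));                 // 0
        System.out.println(count("<a\\s+(.*?)>", Pattern.MULTILINE, tag)); // 0
        System.out.println(count("<a\\s+(.*?)>", Pattern.DOTALL, tag));    // 1

        // MULTILINE only affects the anchors ^ and $.
        String lines = "<a\nhref=x>";
        System.out.println(count("^href", 0, lines));                 // 0
        System.out.println(count("^href", Pattern.MULTILINE, lines)); // 1
    }
}
```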

For the question of why I am escaping single quotes: since HTML is not always well formed, I am assuming that attribute values can be found between double quotes (" "), between single quotes (' '), or with no quotes whatsoever.
Example: href = "value" or href = 'value' or href = value
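A sketch of a pattern that covers these three forms is below. Two details of the expression quoted in this thread are worth noting: its unquoted alternative [^'\"] has no quantifier, so it matches a single character only, and group(1) is the outer group, so for quoted values it includes the quotes. The version here is an assumption about what is wanted, not a drop-in fix: it uses one capture group per form, and the class name HrefValue and method hrefOf() are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefValue {
    // Group 1: double-quoted value, group 2: single-quoted value,
    // group 3: unquoted value (runs until whitespace, a quote or '>').
    private static final Pattern HREF = Pattern.compile(
        "href\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s'\">]+))",
        Pattern.CASE_INSENSITIVE);

    // Returns the href value without its quotes, or null if the tag has no href.
    static String hrefOf(String tag) {
        Matcher m = HREF.matcher(tag);
        if (!m.find()) {
            return null;
        }
        for (int g = 1; g <= 3; g++) {
            if (m.group(g) != null) {
                return m.group(g);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(hrefOf("<a href = \"value\">")); // value
        System.out.println(hrefOf("<a href = 'value'>"));   // value
        System.out.println(hrefOf("<a HREF = value>"));     // value
    }
}
```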

But my question isn't about regular expressions. My question is about how to replace the text matched by a regular expression with a new string. I know I must use the replaceAll() method, but I don't think the logic in the code sample provided is correct.

What I am trying to do is parse an HTML page,
for example:

<center>
<div align="center">

</div>
</center>

<table border="0">



I use the regular expression

(?id)<a\\s+(.*?)/?>

to find a tag.
In this case I am looking for the <a> tag.

m4.group()

will find me the tag, which from the following example would be


I then parse this string to find the href attribute value using the second regular expression

HREF\\s*=\\s*(\"([^\"]+)\"|\'([^']+)\'|[^'\"])


m2.group(1)

would give me the attribute value.

This is the string I would like to change. For example, I want to change it to "www.hello.com".
But when I do

String text1 = m2.replaceAll(new_res_url);

the HTML page contained in the variable "text" is not automatically changed, and I don't seem to know how to do that.

Any suggestions on this matter will be greatly appreciated.
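One likely reason "text" stays the same, sketched below under assumptions: m2 was created from the tag string returned by m4.group(), so m2.replaceAll() builds a new String from that tag only, and since Java Strings are immutable, neither the tag nor "text" is modified. The sample page, the class name ReplaceAllDemo and the helper rewriteTag() are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceAllDemo {
    // Returns a NEW string; the tag passed in is left untouched.
    static String rewriteTag(String tag, String oldUrl, String newUrl) {
        Matcher m2 = Pattern.compile(Pattern.quote(oldUrl)).matcher(tag);
        return m2.replaceAll(Matcher.quoteReplacement(newUrl));
    }

    public static void main(String[] args) {
        String text = "<p><a href=\"tree.html\">Tree</a></p>";
        String tag = "<a href=\"tree.html\">"; // what m4.group() would return here

        String newTag = rewriteTag(tag, "tree.html", "www.hello.com");
        System.out.println(newTag); // the rewritten tag
        System.out.println(text);   // still contains tree.html

        // The change only reaches the page once the result is written back.
        text = text.replace(tag, newTag);
        System.out.println(text);
    }
}
```

String.replace() swaps every identical occurrence of the tag, which is fine for a demonstration but may not be what a real page needs.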
12 years ago
Please help!!!

I am parsing an HTML page using regular expressions and replacing each matched value with a new value.

The code is:


My problem is that when I check the text (the text I am referring to here is the HTML of the web page) after the replacement is done, nothing has changed. (The regular expressions did find what I needed, so they did work for some sites, and the method check_res_url() does work.)
Any suggestions would be greatly appreciated.
12 years ago
Solved it... I had to use a while loop so that every match I find gets shown.
12 years ago
Hi

I have the following code, where I parse an HTML text and use a regex to get all the src attribute values.
The regex works fine on a subset of sites.
My problem is reading the values the regex found... it always gives me an error.
The error is the following:
Exception in thread "main" java.lang.IllegalStateException: No match found
at java.util.regex.Matcher.group(Matcher.java:468)
at java.util.regex.Matcher.group(Matcher.java:428)
at httpclientregex.Main.main(Main.java:52)
Java Result: 1



And the code is as follows:


Am I missing something?

Any suggestions would be greatly appreciated.
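Since the code itself is not shown, this is only a guess at the cause: Matcher.group() throws IllegalStateException ("No match found") when it is called before a successful find() or matches(), or after find() has returned false. A minimal sketch of the usual loop follows; the src pattern and the class name SrcExtractor are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SrcExtractor {
    // Simplified pattern: only handles quoted src values.
    private static final Pattern SRC = Pattern.compile(
        "\\bsrc\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    static List<String> srcValues(String html) {
        List<String> values = new ArrayList<>();
        Matcher m = SRC.matcher(html);
        // group() is only valid after find() has returned true,
        // so every read happens inside the loop.
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "<img src=\"a.png\"><script src='b.js'></script>";
        System.out.println(srcValues(html)); // [a.png, b.js]
    }
}
```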
12 years ago
Hi

I need to change all the relative URLs found on a webpage to absolute URLs.

Now I know that if the webpage's address is "http://www.hello.com/index.html", then every relative URL found on the webpage, for example "tree.html", must be prefixed with the webpage's hostname, giving "http://www.hello.com/tree.html".

But what if that is not the hostname it needs?
For example, what if "http://www.hello.com" is a page that is part of a main page called "http://www.greetings.com", so that every relative URL found on the "http://www.hello.com" page requires the hostname of the "http://www.greetings.com" page?
How can I accommodate this?

I am using HttpClient and I am not sure if it does this automatically.

Any comments will be greatly appreciated

MJ
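A sketch of one way to do the resolution with the standard library, using java.net.URI.resolve() so that cases like "../" and a leading "/" are handled without string concatenation. Which base URL to pass in is the open question: normally it is the URL the page was actually retrieved from, but an HTML <base href="..."> element in the page, if present, overrides it, which would cover the "greetings.com" case. The class name UrlResolver and the sample URLs are made up for illustration.

```java
import java.net.URI;

class UrlResolver {
    // Resolves a link found in a page against that page's base URL.
    // Links that are already absolute are returned unchanged.
    static String toAbsolute(String baseUrl, String link) {
        return URI.create(baseUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.hello.com/index.html";
        System.out.println(toAbsolute(page, "tree.html"));  // http://www.hello.com/tree.html
        System.out.println(toAbsolute(page, "/img/x.png")); // http://www.hello.com/img/x.png
        System.out.println(toAbsolute(page, "http://other.example/a.css"));

        // If the page declared <base href="http://www.greetings.com/">,
        // that value would be used as the base instead.
        System.out.println(toAbsolute("http://www.greetings.com/", "tree.html"));
    }
}
```

URI.create() throws IllegalArgumentException on malformed input, so links scraped from real pages may need cleaning or a try/catch first.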
Thanks for all your replies.
Chris, will your suggestion help with redirections?
I am getting the HTML of the web page so I can do some transformations on it.
I am using this architecture mainly because of time constraints; I did not have the necessary resources to learn about transparent proxies.
Also, what I need at the moment is a working prototype...
Yes, I am invoking my application with a Web browser.
I see...
So I must correct the href values in the HTML before I send it to the Web browser?
It should be noted that my Java application is a web server.
It accepts the clients' requests, parses them, and sends the URL (e.g. www.naturenet.com) to a proxy class that retrieves the HTML using HTTPClient.
What I need to understand is how it is calling for the CSS files automatically, if all I ask is for it to get the Web page.
Okay.
When I run my application, I noticed it automatically calls for these CSS files while I call for a specific Web page. Now how can I intercept this action so that I can change the relative values to absolute values before it tries to retrieve them?
Where can I look to solve this problem?
Hey
The Java application I am building is, in a sense, replicating the way a browser works.
I am using HTTPClient to help me do so.
When I get the HTML contents of the Web page www.naturenet.com, it tries at the same time to retrieve the CSS files of this web page. This fails, since the CSS files have a relative value and not an absolute one. I noticed that the CSS files are referenced in the Web page's HTML <link> tag.
Does this mean that when retrieving the HTML page of this site, it will also scan through the HTML retrieved and try to get all the href values from the link tag?
Also, if this is being done, does that mean that as soon as I get the HTML I must change all these relative href values to absolute ones for HTTPClient to be able to retrieve them?

Any comments will be greatly appreciated, since they have always helped me move in the right direction.
Hi
I am downloading HTTPClient to use with NetBeans 6.1.
I don't know how to do the following:

Once you've downloaded HttpClient and dependencies you will need to put them on your classpath.


This was found on the HTTPClient download page.

Please help...