How to check if a web page was changed

 
John Gerso
Greenhorn
... with the least amount of network traffic and disk space/memory use.
I'm trying to make a small application that will:
1- check a list of URLs to see if any of them has changed since the last check
2- for each one that has changed, fetch a particular part of that page and do something with it.

I would like to know if there is any way to check whether a web page has changed, other than storing a copy of the previous page and comparing (if that is the only way, I guess I'll have to go with it).

Thanks a lot for the help in advance.

Forgot to mention: the implementation language is not a problem. I'm (usually) a fast learner and don't mind learning a new language.
 
Paul Clapham
Marshal
You can send an "If-Modified-Since" header as part of the request you send for the page. If it hasn't been changed since that date, you'll get a 304 response code. Wikipedia article: List of HTTP header fields.
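
A minimal sketch of such a conditional request in Java, assuming the URL is http/https and the server honours the header. The class name `ConditionalGet` and its method names are made up for illustration; only `HttpURLConnection`, `URI` and `DateTimeFormatter` come from the JDK. It treats 304 as "unchanged", 200 as "changed", and anything else as an error.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper class; the name is an assumption for this sketch.
class ConditionalGet {

    // HTTP dates are always expressed in GMT, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
    private static final DateTimeFormatter HTTP_DATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
                    .withZone(ZoneOffset.UTC);

    // Formats an instant the way the If-Modified-Since header expects it.
    static String httpDate(Instant when) {
        return HTTP_DATE.format(when);
    }

    // Returns false on 304 (Not Modified), true on 200, and throws otherwise.
    static boolean changedSince(String pageUrl, Instant lastCheck) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(pageUrl).toURL().openConnection();
        try {
            conn.setRequestMethod("GET");
            conn.setRequestProperty("If-Modified-Since", httpDate(lastCheck));
            int status = conn.getResponseCode();
            if (status == HttpURLConnection.HTTP_NOT_MODIFIED) {
                return false; // server says nothing changed since lastCheck
            }
            if (status == HttpURLConnection.HTTP_OK) {
                return true; // server sent the page, so it is new or ignores the header
            }
            throw new IOException("Unexpected HTTP status " + status);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java ConditionalGet <url>");
            return;
        }
        Instant oneDayAgo = Instant.now().minusSeconds(24 * 60 * 60);
        System.out.println(changedSince(args[0], oneDayAgo) ? "changed" : "not modified");
    }
}
```

Note that a 200 response only proves the server sent the page again. A server that ignores the header will answer 200 every time.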
 
John Gerso
Greenhorn
So if I understood correctly, I need to send a request for the page with an "If-Modified-Since" header set to the date of my last check, and then read the response. I'll check if it works (and if I know how to do it).
Thanks a lot for the fast response.
 
Paul Clapham
Marshal
Yup, that's right. That's what browsers do, in fact. When they read a page they cache it locally along with the time they read it. Then the next time they get a request for the page, they send that header, and if they get the 304 response, they display the cached version instead of downloading the page again.
 
Pat Farrell
Rancher
Paul's answer is technically correct. That is, if the webserver responds as the RFC states, then you are set.

There is also a larger, more philosophical question: how do you tell if the web page has changed if the webserver lies?

The normal way to answer this is to not trust the webserver or its dates, but to get the contents and pass the bytes through a cryptographically strong hash function such as SHA256. Compare the new hash value with the one from the previous check, and if they are different, the page is different.
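
A rough sketch of that hashing step in Java, using `java.security.MessageDigest` from the JDK. The class name `PageHash` and its two methods are hypothetical names chosen for this example, and `fetch` assumes the page is small enough to read into memory in one go.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class; the name is an assumption for this sketch.
class PageHash {

    // Returns the SHA-256 digest of the bytes as 64 lowercase hex characters.
    static String sha256Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Downloads the raw bytes of the page (no caching, no conditional headers).
    static byte[] fetch(String pageUrl) throws IOException {
        try (InputStream in = URI.create(pageUrl).toURL().openStream()) {
            return in.readAllBytes();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java PageHash <url>");
            return;
        }
        System.out.println(sha256Hex(fetch(args[0])));
    }
}
```

This still downloads the whole page on every check, so it saves disk space and memory rather than network traffic.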
 
John Gerso
Greenhorn

Pat Farrell wrote: Paul's answer is technically correct. That is, if the webserver responds as the RFC states, then you are set.

There is also a larger, more philosophical question: how do you tell if the web page has changed if the webserver lies?

The normal way to answer this is to not trust the webserver or its dates, but to get the contents and pass the bytes through a cryptographically strong hash function such as SHA256. Compare the two hash values, and if they are different, the page is different.



I did run into this problem. The web pages I'm checking are all created dynamically, which means that the creation time is always the same as the current time.
Shall I be using java.security to encrypt and decrypt the data?

And just a small question: why encrypt the data? Couldn't I just check line by line?

Thanks a lot for the help.
 
Greenhorn
SHA cannot be decrypted. It is a one-way hash function, not encryption.

From what I can tell, SHA256 was suggested because hashing the page's source always produces a hash value of a fixed length. Any change to the page, even a very small one, will produce a completely different hash value. That makes two versions of a page easy to compare.

 
Pat Farrell
Rancher

John Gerso wrote: Shall I be using java.security to encrypt and decrypt the data?



I did not suggest encrypting the data. I said run it through the SHA256 hash function, get the resulting hash value and use that to compare.
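
As a sketch of the "use that to compare" part: the program only needs to remember one short hash per URL, not a copy of each page, which is also why this beats a line-by-line comparison for the original goal of using little disk space and memory. `ChangeTracker` and `hasChanged` are made-up names for illustration, and the in-memory map is an assumption. A real application would probably persist the hashes between runs.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

// Hypothetical class: remembers the last SHA-256 hash seen for each URL.
class ChangeTracker {

    private final Map<String, String> lastHashByUrl = new HashMap<>();

    // Records the hash of the new body and reports whether it differs from
    // the previously recorded one. A URL seen for the first time counts as changed.
    boolean hasChanged(String url, byte[] body) {
        String current = sha256Base64(body);
        String previous = lastHashByUrl.put(url, current);
        return !current.equals(previous);
    }

    private static String sha256Base64(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        ChangeTracker tracker = new ChangeTracker();
        byte[] v1 = "<p>price: 10</p>".getBytes(StandardCharsets.UTF_8);
        byte[] v2 = "<p>price: 11</p>".getBytes(StandardCharsets.UTF_8);
        System.out.println(tracker.hasChanged("page", v1)); // true (first sighting)
        System.out.println(tracker.hasChanged("page", v1)); // false (same bytes)
        System.out.println(tracker.hasChanged("page", v2)); // true (one character differs)
    }
}
```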
 
John Gerso
Greenhorn
Got it now (kinda embarrassed for misreading that...).
Thanks a lot for the help. I'll try to find out which class lets me perform the operation needed.
 
Paul Clapham
Marshal

John Gerso wrote: I did run into this problem. The kind of web pages I'm checking are all created dynamically, which means that the creation time is always the same as the current time.



Ah. I wouldn't have called those "web pages" then. I assumed because you called them that, they were static pages.
 
John Gerso
Greenhorn
The project works almost perfectly. I'm still not pleased with the amount of time it takes to process one page and retrieve the needed image, but I'll get there.
Thanks everyone for the help.

Paul Clapham wrote:

John Gerso wrote: I did run into this problem. The kind of web pages I'm checking are all created dynamically, which means that the creation time is always the same as the current time.

Ah. I wouldn't have called those "web pages" then. I assumed because you called them that, they were static pages.



I'm a complete newbie in web stuff. Could you tell me what I should call dynamic pages in the future, so the same problem doesn't happen again?
 
Paul Clapham
Marshal
That's a good question, actually. As a designer of dynamically generated web content, I think of static pages as "web pages", but I don't have a word I attach to dynamically generated content. That just seems different to me, especially when I'm writing AJAX code which refreshes only part of the screen. However, the person at the browser can't tell the difference between the two types of content and doesn't care anyway.

So I suppose my confusion really just came from my biases in how I think about things and not from any bad choice of terminology on your part.
 