Can a web crawler extract the HTML source of a URL that only works in a cookie-enabled browser?
Lester Burnham
Rancher
Joined: Oct 14, 2008
Posts: 1337
Some can, some can't, so it depends on which one in particular you're asking about. If you haven't settled on a specific one, start here: http://java-source.net/open-source/crawlers
Sunil Baboo
Greenhorn
Joined: Aug 12, 2010
Posts: 15
Hi,
Thanks for your quick response. Well, it's java spider. I know it works well for other web links, but I am not sure it can work for links that require a cookie-enabled browser.
Thank you
Lester Burnham
Rancher
Joined: Oct 14, 2008
Posts: 1337
What is "java spider" - some software you wrote? Downloaded? Bought?
Sunil's referring to JSpider in the link you provided.
Cheers,
Naren (SCJP, SCDJWS and SCWCD)
Sunil Baboo
Greenhorn
Joined: Aug 12, 2010
Posts: 15
Actually, my concern is:
I have an application that crawls web links. If a link requires a login, it logs in as well. But some sites require a cookie-enabled browser. When I attempt to log in to such a site, I am always sent back to the login page. In my application, the crawler first requests the login page, retrieves the cookie information from the response headers, and then requests another page after login, attaching the cookies the server sent. This has been working for most sites, but not for sites that ask for a cookie-enabled browser. Am I missing something that would let my application crawl those pages after login?
Sites that need cookies enabled usually store the session ID in a cookie and do not support URL-based session IDs (URL rewriting).
So if your cookie handling isn't working, you don't get a session, and hence you get the login page again.
Hope that helps
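To illustrate the point above: a crawler that handles cookies by hand has to collect the Set-Cookie headers from every response (including the response to the login POST and any redirects, not only the login page) and send the name=value pairs back in a Cookie header. Below is a minimal sketch of that conversion in Java. The class and method names are hypothetical, and cookie attributes such as Path, Domain and Expires are deliberately ignored, so this is an assumption-laden illustration rather than a complete cookie implementation.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: names are illustrative, not taken from any crawler library.
public class CookieHeaderSketch {

    // Builds the value of a "Cookie" request header from raw "Set-Cookie"
    // response header values. Only the leading name=value pair of each
    // header is kept; attributes after the first ';' are dropped.
    public static String toCookieHeader(List<String> setCookieValues) {
        Map<String, String> cookies = new LinkedHashMap<>();
        for (String raw : setCookieValues) {
            String pair = raw.split(";", 2)[0].trim();
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue; // skip malformed entries with no cookie name
            }
            // A later cookie with the same name replaces the earlier value.
            cookies.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
        }
        StringBuilder header = new StringBuilder();
        for (Map.Entry<String, String> e : cookies.entrySet()) {
            if (header.length() > 0) {
                header.append("; ");
            }
            header.append(e.getKey()).append('=').append(e.getValue());
        }
        return header.toString();
    }

    public static void main(String[] args) {
        List<String> fromServer = Arrays.asList(
                "JSESSIONID=abc123; Path=/; HttpOnly",
                "lang=en; Path=/");
        // prints: JSESSIONID=abc123; lang=en
        System.out.println(toCookieHeader(fromServer));
    }
}
```

If the site issues a fresh session cookie in the login response and the crawler keeps sending the older one from the login page, the server will likely treat the next request as unauthenticated, which would match the "always back at the login page" symptom.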
Hi Sunil,
Ideally, web applications should fall back to URL rewriting for session handling when cookies are not enabled. But it seems that the sites you are accessing do not do that. For the sites you are unable to log in to, the client has to accept cookies and send them back, just as a browser with cookies enabled does.
Sunil Baboo
Greenhorn
Joined: Aug 12, 2010
Posts: 15
Hello Amit,
Yes, you are right: sites that need cookies enabled store the session in a cookie, so a crawler fails on a site that demands a cookie-enabled browser. Am I right?
I know that the crawler has no cookies enabled. Can we enable cookies in the crawler through code, something like a crawler with an embedded web browser?
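On enabling cookies by code: if the crawler fetches pages with the JDK's own java.net.HttpURLConnection, one possible approach is to install a java.net.CookieManager as the default CookieHandler (available since Java 6). The connection classes then store cookies from responses and attach them to later requests automatically, without an embedded browser. The sketch below assumes that setup; the class name and the example.com URLs are hypothetical, and crawlers built on another HTTP library (for example Apache HttpClient) would use that library's own cookie store instead.

```java
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpCookie;
import java.net.URI;
import java.util.List;

// Hypothetical setup class for a crawler that uses java.net.HttpURLConnection.
public class CrawlerCookieSetup {

    // Installs a JVM-wide, in-memory cookie store. Call once at startup,
    // before the first request is made.
    public static CookieManager enableCookies() {
        // ACCEPT_ALL is an assumption for a crawler; the default policy
        // (ACCEPT_ORIGINAL_SERVER) is stricter.
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        CookieHandler.setDefault(manager);
        return manager;
    }

    public static void main(String[] args) {
        CookieManager manager = enableCookies();

        // Simulate a session cookie received from a login response.
        URI login = URI.create("http://example.com/login");
        manager.getCookieStore().add(login, new HttpCookie("JSESSIONID", "abc123"));

        // Cookies stored for the host are returned for other pages on it,
        // which is what a later HttpURLConnection request would send.
        List<HttpCookie> forNextPage =
                manager.getCookieStore().get(URI.create("http://example.com/members"));
        System.out.println("Cookies for next request: " + forNextPage.size());
    }
}
```

With this in place, the manual step of copying Set-Cookie values into a Cookie header should no longer be needed for HttpURLConnection-based requests, including cookies set on redirects during login.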