Hey! I am trying to build a search engine, for which I need a crawler. The idea is to crawl from a starting point and recursively open subsequent pages. That is not good enough to crawl the entire internet, so I need to know if there is some way I can query the DNS so that I can switch from one domain to another.
Why do you need to query the DNS for switching between domains? The java.net.URL class (which I'm assuming you use to access remote hosts) handles either domain names or IP addresses, so whichever is used on the pages you are crawling should work fine without additional work.
When building the crawler, be sure to observe the rules laid down by robots.txt and any applicable meta tags about retrieving and indexing pages.
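To illustrate the point, here is a minimal sketch of how a link found on a page can be turned into an address to fetch, whether it points to the same host or a different one. The class name LinkResolver and the www.abc.com page path are made up for this example; only java.net.URI is real API. No DNS call appears anywhere, because name lookup happens inside the networking classes when the connection is opened.

```java
import java.net.URI;

final class LinkResolver {
    /**
     * Resolves an href found on a page against that page's own address.
     * Relative links stay on the same host; absolute links may name any host.
     */
    static URI resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).normalize();
    }

    public static void main(String[] args) {
        // Hypothetical page used only for illustration.
        String page = "http://www.abc.com/music/index.html";

        // A relative link stays on www.abc.com.
        System.out.println(resolve(page, "bands.html"));

        // An absolute link moves to another domain with no extra work.
        System.out.println(resolve(page, "http://www.metallica.com/").getHost());
    }
}
```

The resolved address can then be handed to java.net.URL (for example via toURL()) and opened in the same way as the starting page.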
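As a rough idea of what observing robots.txt can look like, here is a simplified sketch. It assumes the robots.txt text has already been downloaded, looks only at the groups addressed to "User-agent: *", and treats each Disallow value as a plain path prefix. Allow lines, wildcards, and meta tags are deliberately ignored, so this is not a complete implementation of the rules. The class name RobotsRules is invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class RobotsRules {
    private final List<String> disallowed = new ArrayList<>();

    /** Collects the Disallow prefixes from every group that names "User-agent: *". */
    static RobotsRules parse(String robotsTxt) {
        RobotsRules rules = new RobotsRules();
        boolean applies = false;       // does the current group address "*"?
        boolean lastWasAgent = false;  // consecutive User-agent lines share one group
        for (String raw : robotsTxt.split("\\R")) {
            String line = raw.replaceFirst("#.*$", "").trim(); // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                boolean matches = value.equals("*");
                applies = lastWasAgent ? (applies || matches) : matches;
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                // An empty Disallow value means "nothing is disallowed".
                if (field.equals("disallow") && applies && !value.isEmpty()) {
                    rules.disallowed.add(value);
                }
            }
        }
        return rules;
    }

    /** Returns false if the path starts with any collected Disallow prefix. */
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

A crawler would call parse once per host and check isAllowed for each path before fetching it.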
Hey, I am using the java.net.URL class itself. It helps me establish a remote connection. But say, for instance, I am looking for the word metallica, and the most probable result is www.metallica.com. If I don't switch between domains from the starting point, there is no way I can reach that site in the crawling process. So I need to switch between domains. If there is some way I can do this, please let me know. Thanks.
Joined: Mar 22, 2005
Not following at all - are you searching or crawling? You seem to combine them in a way I don't understand. Either way, there is no need to access DNS information, so what exactly do you mean by "switching between domains"? The URL class doesn't care which domain you access, or whether you change domains with every other call.
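To make that concrete, here is a small sketch of a crawl loop that keeps a queue of addresses to visit and a set of addresses already seen. Moving to another domain needs no special step: an absolute link on a page simply becomes the next entry in the queue. The names FrontierSketch and LinkSource are invented for this example, and LinkSource is a stand-in for "download the page and extract its links", which is left out so the sketch runs without network access. The maxPages limit is there on purpose so the loop stays bounded.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class FrontierSketch {
    /** Hypothetical stand-in for fetching a page and extracting its hrefs. */
    interface LinkSource {
        List<String> linksOn(URI page);
    }

    /** Breadth-first walk from start, visiting at most maxPages addresses. */
    static List<URI> crawl(URI start, LinkSource source, int maxPages) {
        Deque<URI> frontier = new ArrayDeque<>();
        Set<URI> seen = new HashSet<>();
        List<URI> visited = new ArrayList<>();
        frontier.add(start);
        seen.add(start);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            URI page = frontier.poll();
            visited.add(page);
            for (String href : source.linksOn(page)) {
                // Relative links stay on the host; absolute links may change it.
                // A malformed href would throw IllegalArgumentException here.
                URI next = page.resolve(href).normalize();
                if (seen.add(next)) {
                    frontier.add(next);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // A made-up two-site "web" used instead of real HTTP requests.
        Map<String, List<String>> fakeWeb = new HashMap<>();
        fakeWeb.put("http://www.abc.com/", Arrays.asList("links.html"));
        fakeWeb.put("http://www.abc.com/links.html",
                Arrays.asList("http://www.metallica.com/", "/"));
        LinkSource source =
                page -> fakeWeb.getOrDefault(page.toString(), Collections.emptyList());
        for (URI u : crawl(URI.create("http://www.abc.com/"), source, 10)) {
            System.out.println(u);
        }
    }
}
```

A real crawler would also need politeness delays and a robots.txt check before each fetch; those are omitted here to keep the sketch short.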
By the way, you should really use the same login for posting, even if you post from different IP addresses. It's quite confusing otherwise, and leaves questions as to who is actually posting here.
Joined: Mar 10, 2006
k.. my prob is.. i need to crawl the entire internet.. how do i do it..? i thot i cud start at some site say www.abc.com and recursively crawl the tree. as well as switch domains.
Several points in no particular order:
In a forum like this you should UseRealWords. Not everybody here speaks English as their native language, and it's hard enough to follow discussions as it is. Abbreviations like "thot", "cud", "2" and "4" are appropriate in text messaging, but not in a forum like this.
"Crawling the whole internet" is a very dubious proposition. My suggestion would be: don't do it. You're hogging bandwidth, putting unnecessary load on people's servers, and of course, it's not going to work (the internet is kinda big these days).
If you need search, use Google. If you need a crawler, use one of the existing ones. Just be aware that crawlers aren't welcome everywhere (read my earlier remark about robots.txt and related HTTP headers).
The fact that you still think that accessing different domains is a problem indicates that you should do some research about TCP/IP and Java networking.
Originally posted by Ulf Dittmer: Abbreviations like "thot", "cud", "2" and "4" are appropriate in text messaging, but not in a forum like this.
Actually, I wouldn't consider that "appropriate" in text messaging either. Honestly, if you've got something to say that doesn't fit into a single text message with the words spelled out in full, you'd better make a call instead or simply ask the recipient to call you back.