If something is accessible without restriction, then it's liable to get indexed. How it was generated does not matter. The drawback of a PDF is that you can't add a NOINDEX meta tag to it. You can set up a robots.txt file for your site, though.
Note that both these approaches rely on the spider cooperating. Google does so, but other spiders may not. If you want to be sure that your information is safe, don't make it publicly available.
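One option for PDFs, where a meta tag is not possible, is the X-Robots-Tag HTTP response header, which Google documents as the header equivalent of the robots meta tag. Below is a minimal sketch of deciding which value to send. The class name, method name, and the /private/ prefix are made up for illustration, and like robots.txt this only helps with spiders that cooperate.

```java
import java.util.Optional;

class RobotsHeader {
    // Hypothetical prefix for PDFs that should stay out of search results.
    static final String PRIVATE_PREFIX = "/private/";

    // Returns the X-Robots-Tag value to send for a path, or empty if no header is needed.
    public static Optional<String> robotsTagFor(String path) {
        if (path != null && path.startsWith(PRIVATE_PREFIX)) {
            return Optional.of("noindex, nofollow");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // In a servlet, the returned value would be passed to
        // response.setHeader("X-Robots-Tag", value) before writing the PDF bytes.
        System.out.println(robotsTagFor("/private/report.pdf"));
        System.out.println(robotsTagFor("/public/brochure.pdf"));
    }
}
```

This is a request to the crawler, not access control, so it does not replace authentication for anything sensitive.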
Joined: Mar 28, 2007
Thanks for your reply, Tim. I will password-protect the PDFs now.
Another thing you can do besides setting up a robots.txt file is to check the User-Agent header to see where the request is coming from. However, it should be noted that some badly behaved spiders present themselves as Firefox or IE. If you need to prevent all spiders from grabbing your sensitive data, you should never expose it on a publicly accessible page.
That can be spoofed, just like the Referer header, so it can't be relied upon for anything that matters (like security).
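To make the User-Agent check and its weakness concrete, here is a rough sketch. The class name, method name, and token list are illustrative assumptions, and the list is nowhere near exhaustive. It recognises a crawler that identifies itself honestly, and is fooled by one that sends a browser string.

```java
import java.util.List;
import java.util.Locale;

class CrawlerCheck {
    // Substrings that some well-known crawlers put in their User-Agent header.
    // Illustrative only, not a complete list.
    static final List<String> BOT_TOKENS = List.of("googlebot", "bingbot", "slurp");

    // True if the header value contains one of the known crawler tokens.
    public static boolean looksLikeCrawler(String userAgent) {
        if (userAgent == null) {
            return false;
        }
        String ua = userAgent.toLowerCase(Locale.ROOT);
        return BOT_TOKENS.stream().anyMatch(ua::contains);
    }

    public static void main(String[] args) {
        String honest = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
        // A spider can send exactly this string, and the check cannot tell the difference.
        String disguised = "Mozilla/5.0 (Windows NT 6.1; rv:33.0) Gecko/20100101 Firefox/33.0";
        System.out.println(looksLikeCrawler(honest));
        System.out.println(looksLikeCrawler(disguised));
    }
}
```

In a servlet the header value would come from request.getHeader("User-Agent"). Since the client chooses that value, a check like this is only good for things like statistics, never for protecting data.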
It's better to password-protect the PDFs, then. In my case there are two types of PDFs: ones that are publicly available and searchable in Google, and private ones that must not be indexed. So I created two servlets, one that serves the public PDFs and another for the private PDFs. The private PDF servlet is now SSO password protected, so Google won't be able to crawl it.
I am going to test this on the live server tomorrow morning, but it is currently working fine on my test system.
Just following on from what everyone else has said: if something is accessible without restriction, then Google will find and index it. I work in SEO professionally and can honestly say that if you don't want Google to see something, then it must be password protected. Personally, I prefer to put all sensitive information in a /secure/ directory that requires authentication before anything in it can be accessed, as opposed to placing a password on the document itself as has been described above.
Everything discussed above (robots.txt, noindex, etc.) is a guideline that Google may choose to ignore, and often does.
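The /secure/ directory idea can be sketched as a single decision: anything under that prefix is served only to an authenticated user, everything else is served to anyone. The class and method names here are hypothetical, and how the authenticated flag gets set (SSO, container login, and so on) is left out. A real application would more usually declare this as a security constraint in web.xml or enforce it in a filter than hand-roll it.

```java
class SecureGate {
    // Prefix taken from the /secure/ directory suggestion above.
    static final String SECURE_PREFIX = "/secure/";

    // Returns the HTTP status to answer with: 200 (serve the file)
    // or 401 (the caller must authenticate first).
    public static int statusFor(String path, boolean authenticated) {
        boolean needsLogin = path != null && path.startsWith(SECURE_PREFIX);
        return (needsLogin && !authenticated) ? 401 : 200;
    }

    public static void main(String[] args) {
        System.out.println(statusFor("/secure/contract.pdf", false)); // anonymous spider or user
        System.out.println(statusFor("/secure/contract.pdf", true));  // logged-in user
        System.out.println(statusFor("/docs/brochure.pdf", false));   // public document
    }
}
```

Unlike robots.txt or noindex, this does not depend on the spider cooperating: an unauthenticated request never receives the document at all.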
subject: Hide Servlet Response from appearing in Google Search