Getting the URL by crawling whose content-type is not text/html

Raihan Jamal
Ranch Hand

Joined: Mar 23, 2010
Posts: 86
I can get all the URLs whose content type is text/html, but what if I want the URLs whose content type is not text/html? How can we check for that? For a String we can use a method, but it doesn't have anything like .. Any suggestions will be appreciated. And also

The key variable contains:



Below is the code that checks for text/html. I also tried checking for content types that are not text/html, but it still prints the URLs whose content type is text/html.




One approach is to check for each content type individually, for example application/pdf for PDF



and in the same way for XML. But is there any approach other than this?
John Jai
Bartender

Joined: May 31, 2011
Posts: 1776
Raihan Jamal wrote: but it doesn't have anything like ..

Well, then use !contains().
Raihan Jamal
Ranch Hand

Joined: Mar 23, 2010
Posts: 86
@John Jai, I tried doing it that way. You mean I should do it like this:



Right?

But that prints out every URL, and I want only the URLs whose content type is not text/html.
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541
    

So only do that for the header which is the Content-Type header, then.
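A rough sketch of that suggestion follows. The original loop did not survive in the thread, so this assumes it walks the Map<String, List<String>> returned by URLConnection.getHeaderFields(); the class name ContentTypeFilter and the method isNotHtml are made up for illustration.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; the names are not from the original post.
class ContentTypeFilter {

    // Returns true when a Content-Type header is present and its value
    // does not mention text/html. Returns false when the header is
    // missing or when it does mention text/html.
    static boolean isNotHtml(Map<String, List<String>> headers) {
        for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
            String name = entry.getKey();
            // The status line sits under a null key; skip it along with
            // every header that is not Content-Type.
            if (name == null || !name.equalsIgnoreCase("Content-Type")) {
                continue;
            }
            for (String value : entry.getValue()) {
                if (!value.toLowerCase(Locale.ROOT).contains("text/html")) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<String>> pdf = Map.of(
                "Server", List.of("Apache"),
                "Content-Type", List.of("application/pdf"));
        Map<String, List<String>> html = Map.of(
                "Server", List.of("Apache"),
                "Content-Type", List.of("text/html; charset=UTF-8"));
        System.out.println(isNotHtml(pdf));  // prints true
        System.out.println(isNotHtml(html)); // prints false
    }
}
```

Applying !contains("text/html") to every header value is what makes every URL print: headers such as Date or Server never contain text/html, so the test passes for them on every response.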
Raihan Jamal
Ranch Hand

Joined: Mar 23, 2010
Posts: 86
@Paul Clapham, the problem is that I am not sure how many content types there are. In my case I need two things: first, all the URLs whose content type is text/html or text/xhtml, and second, all the URLs whose content type is anything other than text/html or text/xhtml. One way is to print out each URL, look at its content type, and then add a branch for that content type to an if/else chain. But if somebody later adds pages of some other content type, I could miss that content type.
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541
    

Raihan Jamal wrote:@Paul Clapham, Problem is that I am not sure how many content-types are there.


Each URL has only one Content-Type header. And you get the headers as a Map, so there is no need to scan through the values (and especially there is no need to do that while ignoring the names of the headers). Just use "Content-Type" as the key to that map and you'll get the content type directly.
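As a minimal sketch of the direct lookup, assuming the headers come as a Map<String, List<String>> (the shape URLConnection.getHeaderFields() returns) and that the server sent the header name spelled exactly "Content-Type". The class and method names here are illustrative only.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper; the names are not from the original post.
class ContentTypeLookup {

    // Returns the first Content-Type value, or null when the header is absent.
    static String contentTypeOf(Map<String, List<String>> headers) {
        List<String> values = headers.get("Content-Type");
        return (values == null || values.isEmpty()) ? null : values.get(0);
    }

    public static void main(String[] args) {
        Map<String, List<String>> headers = Map.of(
                "Content-Type", List.of("text/html;charset=UTF-8"),
                "Server", List.of("Apache-Coyote/1.1"));
        String type = contentTypeOf(headers);
        // True only for responses that declare something other than text/html.
        boolean notHtml = type != null && !type.contains("text/html");
        System.out.println(type + " -> notHtml=" + notHtml);
    }
}
```

With a live connection, URLConnection.getContentType() returns the same header value without going through the map at all.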
John Jai
Bartender

Joined: May 31, 2011
Posts: 1776
Paul Clapham wrote:Each URL has only one Content-Type header

Yes Paul, I tried printing and each one had only one Content-Type.


Output
Parsing URL:- http://www.google.com with Content Type -> text/html; charset=ISO-8859-1
Response Headers size -> 9
Ignoring this key -> null=[HTTP/1.1 200 OK] with the content type -> text/html; charset=ISO-8859-1
Ignoring this key -> Date=[Mon, 11 Jul 2011 20:48:18 GMT] with the content type -> text/html; charset=ISO-8859-1
Ignoring this key -> Transfer-Encoding=[chunked] with the content type -> text/html; charset=ISO-8859-1
Ignoring this key -> Expires=[-1] with the content type -> text/html; charset=ISO-8859-1
Ignoring this key -> X-XSS-Protection=[1; mode=block] with the content type -> text/html; charset=ISO-8859-1
Ignoring this key -> Set-Cookie=[NID=48=WZhw08BQCg5h6jG63nibf5OJOba7oyVX763gZhjk7UHyGYjNMBOvLlNPl8Ov9FrcEjJJmaALYULZJmWhevIyqc4fhuB6fzuNKEqeSSpcKBvfX5wtZplSBWjQVN4KSLhn; expires=Tue, 10-Jan-2012 20:48:18 GMT; path=/; domain=.google.co.in; HttpOnly, PREF=ID=ce54dac9716f500a:FF=0:TM=1310417298:LM=1310417298:S=95wD8CclVxFdztxR; expires=Wed, 10-Jul-2013 20:48:18 GMT; path=/; domain=.google.co.in] with the content type -> text/html; charset=ISO-8859-1
Printing in if -> Content-Type=[text/html; charset=ISO-8859-1]
Ignoring this key -> Server=[gws] with the content type -> text/html; charset=ISO-8859-1
Ignoring this key -> Cache-Control=[private, max-age=0] with the content type -> text/html; charset=ISO-8859-1


Parsing URL:- http://java.sun.com/index.html with Content Type -> text/html
Response Headers size -> 6
Ignoring this key -> null=[HTTP/1.1 200 OK] with the content type -> text/html
Ignoring this key -> Date=[Mon, 11 Jul 2011 20:48:22 GMT] with the content type -> text/html
Ignoring this key -> Transfer-Encoding=[chunked] with the content type -> text/html
Printing in if -> Content-Type=[text/html]
Ignoring this key -> Connection=[Transfer-Encoding, keep-alive] with the content type -> text/html
Ignoring this key -> Server=[Oracle-Application-Server-11g Oracle-Web-Cache-11g/11.1.1.2.0 (TH;max-age=300+0;age=145;ecid=184747478908160597,0)] with the content type -> text/html


Parsing URL:- http://www.coderanch.com/forums with Content Type -> text/html;charset=UTF-8
Response Headers size -> 9
Ignoring this key -> null=[HTTP/1.1 200 OK] with the content type -> text/html;charset=UTF-8
Ignoring this key -> Date=[Mon, 11 Jul 2011 20:48:25 GMT] with the content type -> text/html;charset=UTF-8
Ignoring this key -> Vary=[Accept-Encoding] with the content type -> text/html;charset=UTF-8
Ignoring this key -> Content-Length=[12055] with the content type -> text/html;charset=UTF-8
Ignoring this key -> Keep-Alive=[timeout=60, max=100] with the content type -> text/html;charset=UTF-8
Ignoring this key -> Set-Cookie=[JSESSIONID=B236BD20FC74B2D4378E9CC35D025A9F; Path=/] with the content type -> text/html;charset=UTF-8
Printing in if -> Content-Type=[text/html;charset=UTF-8]
Ignoring this key -> Connection=[Keep-Alive] with the content type -> text/html;charset=UTF-8
Ignoring this key -> Server=[Apache-Coyote/1.1] with the content type -> text/html;charset=UTF-8


Parsing URL:- http://www.youtube.com with Content Type -> text/html; charset=utf-8
Response Headers size -> 10
Ignoring this key -> null=[HTTP/1.1 200 OK] with the content type -> text/html; charset=utf-8
Ignoring this key -> X-Frame-Options=[SAMEORIGIN] with the content type -> text/html; charset=utf-8
Ignoring this key -> Date=[Mon, 11 Jul 2011 20:48:27 GMT] with the content type -> text/html; charset=utf-8
Ignoring this key -> Transfer-Encoding=[chunked] with the content type -> text/html; charset=utf-8
Ignoring this key -> Expires=[Tue, 27 Apr 1971 19:44:06 EST] with the content type -> text/html; charset=utf-8
Ignoring this key -> Set-Cookie=[GEO=9c8ba04d05dd1342cd45ad5ba46aa636cwsAAAAzSU4OYzMzThthmw==; path=/; domain=.youtube.com, VISITOR_INFO1_LIVE=-AqgGmOp620; path=/; domain=.youtube.com; expires=Wed, 07-Mar-2012 20:48:27 GMT, use_hitbox=72c46ff6cbcdb7c5585c36411b6b334edAEAAAAw; path=/; domain=.youtube.com] with the content type -> text/html; charset=utf-8
Printing in if -> Content-Type=[text/html; charset=utf-8]
Ignoring this key -> Server=[Apache] with the content type -> text/html; charset=utf-8
Ignoring this key -> X-Content-Type-Options=[nosniff] with the content type -> text/html; charset=utf-8
Ignoring this key -> Cache-Control=[no-cache] with the content type -> text/html; charset=utf-8

Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541
    

John Jai wrote:
Paul Clapham wrote:Each URL has only one Content-Type header

Yes Paul, I tried printing and each one had only one Content-Type.


Of course it did. That's how the HTTP specification says it is supposed to work. Anyway what would it mean for an HTTP response to have more than one content type? That wouldn't make any sense.
Pat Farrell
Rancher

Joined: Aug 11, 2007
Posts: 4646
    

Raihan Jamal wrote: Problem is that I am not sure how many content-types are there


You can never list all of the content types. They are added all the time, and you can't ever know the complete list.

All you can do is have the list that you support, and a graceful strategy for the ones you have no idea how to implement.
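One way this could look for the two groups asked about earlier in the thread is sketched below. The supported set holds only the two types named in the question (text/html and text/xhtml); the class name, the enum, and the UNKNOWN bucket for missing headers are assumptions made for the example.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical classifier; the names are not from the original post.
class ContentTypeBuckets {

    enum Bucket { HTML, OTHER, UNKNOWN }

    // The list "that you support": the two types named in the question.
    private static final Set<String> HTML_TYPES = Set.of("text/html", "text/xhtml");

    static Bucket classify(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return Bucket.UNKNOWN; // no Content-Type header at all
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        int semicolon = contentTypeHeader.indexOf(';');
        String mediaType = (semicolon >= 0
                ? contentTypeHeader.substring(0, semicolon)
                : contentTypeHeader).trim().toLowerCase(Locale.ROOT);
        if (mediaType.isEmpty()) {
            return Bucket.UNKNOWN; // header present but unusable
        }
        // Anything not on the supported list falls through to OTHER, so a
        // content type that appears later needs no new branch.
        return HTML_TYPES.contains(mediaType) ? Bucket.HTML : Bucket.OTHER;
    }

    public static void main(String[] args) {
        System.out.println(classify("text/html; charset=ISO-8859-1")); // prints HTML
        System.out.println(classify("application/pdf"));               // prints OTHER
        System.out.println(classify(null));                            // prints UNKNOWN
    }
}
```

Because only the HTML group is enumerated, the "everything else" group never has to be listed.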
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541
    

Pat Farrell wrote:
Raihan Jamal wrote: Problem is that I am not sure how many content-types are there


You can never list all of the content types. They are added all the time, and you can't ever know the complete list.


Ah yes... that could have been the meaning of that statement as well.
 