
Parsing a log file containing spaces and strings into separate tokens/fields

Abhishek Joshi
Greenhorn

Joined: Sep 30, 2010
Posts: 11
Hey guys

I am trying to parse a log file in which each line looks like this, for example:

2003-11-17 23:41:05 61.114.210.116 GET /origin.myfavoritecompany.com/carts/greenbutton/5door/img/side_btn12_off.gif 304 189 3 "http://www.greenbutton.myfavoritecompany.com/5door/packaging.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90)" "aid=YSy8ICUyUNf5n; sid=FyXXY04WaHwl"

Any pointers on how to parse the above? I was thinking of using a regex in Java, but wasn't sure how that would pan out. Any help would be much appreciated.

Here's what my sample program for the above does :

public static void main(String[] args) {

    Pattern pattern = Pattern.compile("^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2} [12]?[0-9]?[0-9](\\.[12]?[0-9]?[0-9]){3} [A-Z_]+ [^ ]+ \\d+ \\d+ \\d+ [^ ]+ [^ ]+ [^ ]+$", Pattern.CASE_INSENSITIVE);

    String[] words = pattern.split("2003-11-17 23:41:05 61.114.210.116 GET /origin.myfavoritecompany.com/carts/greenbutton/5door/img/side_btn12_off.gif 304 189 3 \"http://www.greenbutton.myfavoritecompany.com/5door/packaging.html\" \"Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90)\" \"aid=YSygOv8ICUyUNf5n; sid=FyXXY0ttVk4WaHwl\"\\");

    System.out.print(words.length);
}

Thanks.
Richard Tookey
Ranch Hand

Joined: Aug 27, 2012
Posts: 1035
    

There are libraries for parsing log files, and I'm sure somebody will post references, but if you are going to use regular expressions then please make them readable and maintainable. This is very unlikely to be exactly what you need, but something along the lines of


It is my experience that lines in a log file do not have a single fixed format, so you may have to create a pattern for each variant you wish to process.
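
A minimal sketch of that readable, piecewise style (not the code originally posted here). Each field gets its own named sub-pattern, and the fields are extracted with capturing groups through a Matcher rather than with Pattern.split, which treats the pattern as a delimiter. The class name, the constant names and the exact sub-patterns are assumptions drawn from the sample line in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogLineParser {
    // One small, named sub-pattern per field (names are illustrative).
    private static final String DATE   = "(\\d{4}-\\d{2}-\\d{2})";
    private static final String TIME   = "(\\d{2}:\\d{2}:\\d{2})";
    private static final String IP     = "(\\d{1,3}(?:\\.\\d{1,3}){3})";
    private static final String METHOD = "([A-Z]+)";
    private static final String TOKEN  = "(\\S+)";          // no embedded spaces
    private static final String NUMBER = "(\\d+)";
    private static final String QUOTED = "\"([^\"]*)\"";    // may contain spaces
    private static final String SP     = " ";

    // Assumed layout: date time ip method path n n n "referrer" "agent" "cookies"
    static final Pattern LINE = Pattern.compile(
            "^" + DATE + SP + TIME + SP + IP + SP + METHOD + SP + TOKEN
            + SP + NUMBER + SP + NUMBER + SP + NUMBER
            + SP + QUOTED + SP + QUOTED + SP + QUOTED + "$");

    // Returns the captured fields in order, or null if the line does not match.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1);
        }
        return fields;
    }

    public static void main(String[] args) {
        String sample = "2003-11-17 23:41:05 61.114.210.116 GET /carts/img/side_btn12_off.gif"
                + " 304 189 3 \"http://www.example.com/packaging.html\""
                + " \"Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90)\""
                + " \"aid=abc; sid=def\"";
        String[] fields = parse(sample);
        if (fields == null) {
            System.out.println("no match");
        } else {
            for (String field : fields) {
                System.out.println(field);
            }
        }
    }
}
```

Lines that follow a different layout simply fail to match, which is where a second pattern per variant would come in.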
Abhishek Joshi
Greenhorn

Joined: Sep 30, 2010
Posts: 11
Richard, I highly appreciate you taking the time to guide me.
Abhishek Joshi
Greenhorn

Joined: Sep 30, 2010
Posts: 11
While we are at it, let me ask another question:

If the above input were part of a file, would that change the way we process it with a regex? In other words, would it have to be pre-processed, e.g. to escape the quotes? Or does it suffice to read a file containing multiple such log lines and apply the above regex to each line?
Abhishek Joshi
Greenhorn

Joined: Sep 30, 2010
Posts: 11
As an update, I modified the code to input a log file, and read it a line at a time and parse it. Seems to work as expected.
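
For reference, a rough sketch of the line-at-a-time approach, under some assumptions: the pattern below is a simplified stand-in that only picks out the timestamp and client IP, and the class and method names are made up for illustration. No pre-processing is involved. The backslashes before the quotes are only needed inside Java string literals; quote characters read from a file arrive as ordinary characters.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogFileReader {
    // Simplified stand-in pattern: timestamp, client IP, then the rest of the line.
    static final Pattern LINE = Pattern.compile(
            "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}) (\\d{1,3}(?:\\.\\d{1,3}){3}) (.*)$");

    // Reads the source one line at a time and collects the IP of every matching line.
    // Lines that do not match are skipped. The caller owns (and closes) the source.
    static List<String> clientIps(Reader source) {
        List<String> ips = new ArrayList<>();
        BufferedReader in = new BufferedReader(source);
        try {
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = LINE.matcher(line);
                if (m.matches()) {
                    ips.add(m.group(2));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return ips;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java LogFileReader <logfile>");
            return;
        }
        // args[0] is a hypothetical path to the log file.
        try (Reader file = Files.newBufferedReader(Paths.get(args[0]))) {
            for (String ip : clientIps(file)) {
                System.out.println(ip);
            }
        }
    }
}
```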
Abhishek Joshi
Greenhorn

Joined: Sep 30, 2010
Posts: 11
Hi Richard,

I am stuck again on a similar regex. I am posting my code: any pointers will be appreciated!

Abhishek Joshi
Greenhorn

Joined: Sep 30, 2010
Posts: 11
Believe this should do it:

Richard Tookey
Ranch Hand

Joined: Aug 27, 2012
Posts: 1035
    

Since the hyphen regex does not capture the hyphen, it does not have a group, yet your comments indicate that you think it does! Don't try to create the whole regular expression and only then think about testing it. Start by looking for the IP address with the rest of the line optional (i.e. using (.*)); when that is working, add the hyphen detection; when that is working, move on to the next part, and so on.
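
The incremental approach might look roughly like the sketch below. The code under discussion is not shown in the thread, so the sample line (a common-log-style line), the class name and the step names are all assumptions. The point is that each step is tested before the next piece is added, and that a hyphen matched without parentheses adds no group, so the group numbers do not shift.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IncrementalRegex {
    private static final String IP   = "(\\d{1,3}(?:\\.\\d{1,3}){3})"; // group 1
    private static final String REST = "(.*)";                          // always the last group

    // Step 1: just the IP address, with the rest of the line optional.
    static final Pattern STEP1 = Pattern.compile("^" + IP + REST + "$");

    // Step 2: add the hyphen. It has no parentheses, so it is matched but not captured.
    static final Pattern STEP2 = Pattern.compile("^" + IP + " - " + REST + "$");

    // Step 3: add the second hyphen and a bracketed timestamp (group 2).
    static final Pattern STEP3 = Pattern.compile("^" + IP + " - - \\[([^\\]]+)\\]" + REST + "$");

    // Prints every group of the match so each step can be checked by eye.
    static void show(String label, Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        if (!m.matches()) {
            System.out.println(label + ": no match");
            return;
        }
        for (int i = 1; i <= m.groupCount(); i++) {
            System.out.println(label + " group " + i + ": " + m.group(i));
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample line, assumed for illustration only.
        String line = "192.168.1.10 - - [17/Nov/2003:23:41:05 +0000] \"GET /index.html HTTP/1.1\" 200 512";
        show("step 1", STEP1, line);
        show("step 2", STEP2, line);
        show("step 3", STEP3, line);
    }
}
```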

It does seem that you need to spend more time with the regex tutorial.
 
 