
Parsing a log file containing spaces and strings into separate tokens/fields

 
Abhishek Joshi
Greenhorn
Posts: 11
Hey guys

I am trying to parse a log file in which each line looks like this, for example:

2003-11-17 23:41:05 61.114.210.116 GET /origin.myfavoritecompany.com/carts/greenbutton/5door/img/side_btn12_off.gif 304 189 3 "http://www.greenbutton.myfavoritecompany.com/5door/packaging.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90)" "aid=YSy8ICUyUNf5n; sid=FyXXY04WaHwl"

Any pointers on how to parse the above? I was thinking of a regex in Java, but wasn't totally sure how that would pan out. Any help would be really appreciated.

Here's what my sample program for the above does :

import java.util.regex.Pattern;

public class LogParse {
    public static void main(String[] args) {
        // Pattern intended to describe one whole log line
        Pattern pattern = Pattern.compile("^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2} [12]?[0-9]?[0-9](\\.[12]?[0-9]?[0-9]){3} [A-Z_]+ [^ ]+ \\d+ \\d+ \\d+ [^ ]+ [^ ]+ [^ ]+$", Pattern.CASE_INSENSITIVE);

        // Split the sample line on the pattern and print how many pieces result
        String[] words = pattern.split("2003-11-17 23:41:05 61.114.210.116 GET /origin.myfavoritecompany.com/carts/greenbutton/5door/img/side_btn12_off.gif 304 189 3 \"http://www.greenbutton.myfavoritecompany.com/5door/packaging.html\" \"Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90)\" \"aid=YSygOv8ICUyUNf5n; sid=FyXXY0ttVk4WaHwl\"");

        System.out.print(words.length);
    }
}

Thanks.
 
Richard Tookey
Bartender
Posts: 1166
There are libraries for parsing log files and I'm sure somebody will post references but if you are going to use regular expressions then please make them readable and maintainable. This is very unlikely to be exactly what you need but something along the lines of


It is my experience that lines in a log file do not have a single fixed format, so you may have to create a pattern for each variant you wish to process.
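To illustrate the "readable and maintainable" advice, here is a minimal sketch (not the code originally posted in this reply) that builds the pattern from small named fragments and uses a Matcher with capture groups instead of split(). The class name LogLineParser, the fragment names, and the assumption that every line has exactly the eleven fields of the sample line are mine; real log files may well need further variants.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogLineParser {
    // One named fragment per kind of field; each contributes one capture group.
    private static final String DATE   = "(\\d{4}-\\d{2}-\\d{2})";
    private static final String TIME   = "(\\d{2}:\\d{2}:\\d{2})";
    private static final String IP     = "(\\d{1,3}(?:\\.\\d{1,3}){3})";
    private static final String METHOD = "([A-Z]+)";
    private static final String TOKEN  = "(\\S+)";        // no spaces allowed
    private static final String NUMBER = "(\\d+)";
    private static final String QUOTED = "\"([^\"]*)\""; // may contain spaces

    // Assumed layout: date time ip method path status bytes time referrer agent cookie
    static final Pattern LINE = Pattern.compile("^" + String.join(" ",
            DATE, TIME, IP, METHOD, TOKEN, NUMBER, NUMBER, NUMBER,
            QUOTED, QUOTED, QUOTED) + "$");

    // Returns the captured fields, or null if the line does not fit the pattern.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1); // group 0 is the whole match
        }
        return fields;
    }

    public static void main(String[] args) {
        String sample = "2003-11-17 23:41:05 61.114.210.116 GET /img/side_btn12_off.gif 304 189 3 "
                + "\"http://www.example.com/packaging.html\" "
                + "\"Mozilla/4.0 (compatible; MSIE 6.0)\" \"aid=abc; sid=def\"";
        String[] fields = parse(sample);
        if (fields == null) {
            System.out.println("Line did not match");
        } else {
            for (String f : fields) {
                System.out.println(f);
            }
        }
    }
}
```

The quoted fields get their own fragment because they contain spaces, which a "no spaces" token such as [^ ]+ cannot match.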
 
Abhishek Joshi
Greenhorn
Posts: 11
Richard, I highly appreciate you taking the time to guide me.
 
Abhishek Joshi
Greenhorn
Posts: 11
Whilst we are at this, let me ask another question:

If the above input were part of a file, would that change the way we process it via a regex? In other words, would it have to be pre-processed, e.g. to escape quotes? Or does just reading from a file that has multiple such log lines and using the above regex suffice?
 
Abhishek Joshi
Greenhorn
Posts: 11
As an update, I modified the code to input a log file, and read it a line at a time and parse it. Seems to work as expected.
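For anyone reading later: the backslash before each quote is only needed inside a Java string literal. Text read from a file arrives as plain characters, so no pre-processing is required. Below is a minimal sketch of the line-at-a-time approach; the class name LogFileReader, the method quotedFields and the simple quoted-field pattern are made up for illustration and are not the code I actually used.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogFileReader {
    // Illustrative pattern: captures the text inside each pair of double quotes.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    // Reads the source one line at a time and collects every quoted field.
    static List<String> quotedFields(Reader source) throws IOException {
        List<String> result = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                // The quotes in 'line' are ordinary characters; nothing to escape.
                Matcher m = QUOTED.matcher(line);
                while (m.find()) {
                    result.add(m.group(1));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java LogFileReader <logfile>");
            return;
        }
        // args[0] is the path to a log file supplied by the caller.
        try (Reader file = Files.newBufferedReader(Paths.get(args[0]))) {
            for (String field : quotedFields(file)) {
                System.out.println(field);
            }
        }
    }
}
```

The same loop works with a full-line pattern: compile it once outside the loop and call matcher(line) for each line read.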
 
Abhishek Joshi
Greenhorn
Posts: 11
Hi Richard,

I am stuck again on a similar regex. I am posting my code: any pointers will be appreciated!

 
Abhishek Joshi
Greenhorn
Posts: 11
Believe this should do it:

 
Richard Tookey
Bartender
Posts: 1166
17
Java Linux Netbeans IDE
Since the hyphen regex does not capture the hyphen, it does not have a group, yet your comments indicate that you think it does! Don't try to create the whole regular expression and then think about testing it. Start by looking for the IP address with the rest of the line optional (i.e. using (.*)), and when that is working add the hyphen detection, and when that is working move on to the next part, and so on.
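A rough sketch of that step-by-step approach follows. The sample line in it is invented, since it only assumes an IP address followed by " - " and then more text, and the names IncrementalRegex, STEP1, STEP2 and groups are hypothetical. Note that the hyphen in the second pattern is matched but sits outside any parentheses, so it adds no group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IncrementalRegex {
    private static final String IP = "(\\d{1,3}(?:\\.\\d{1,3}){3})";

    // Step 1: capture the IP address, let (.*) swallow the rest of the line.
    static final Pattern STEP1 = Pattern.compile("^" + IP + "(.*)$");

    // Step 2: require " - " after the IP. It is matched but not captured,
    // so the pattern still has exactly two groups.
    static final Pattern STEP2 = Pattern.compile("^" + IP + " - (.*)$");

    // Returns all capture groups for a matching line, or null if no match.
    static String[] groups(Pattern p, String line) {
        Matcher m = p.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount()];
        for (int i = 0; i < out.length; i++) {
            out[i] = m.group(i + 1);
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "61.114.210.116 - GET /index.html 200"; // invented sample
        for (Pattern p : new Pattern[] { STEP1, STEP2 }) {
            String[] g = groups(p, line);
            System.out.println(p.pattern() + " -> "
                    + (g == null ? "no match" : String.join(" | ", g)));
        }
    }
}
```

Once step 2 behaves as expected, the (.*) at the end can be replaced by the next fragment plus a new (.*), and the test repeated.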

It does seem that you need to spend more time with the regex tutorial.
 