Hi Raghav,
First a couple of comments:
I would not use the csv module here because your log files are not CSV files. You can just read each file line by line and extract the timestamp with line[0:20]. Using filter is a good idea because it avoids loading all the lines into memory, but it is usually more idiomatic to use generator expressions instead of map/filter, for example (x for x in L if x > 10) instead of filter(lambda x: x > 10, L).
To find all the log files I would use the glob module instead of os.listdir because it lets you use wildcard patterns (useful if the directory contains more than just the log files). Then it is a matter of sorting the paths to ensure the logs are processed chronologically, then reading each file line by line, extracting the timestamp and keeping only the lines that fall within the desired date/time range.
Here is how I would do it:
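A minimal sketch of that approach (the pattern argument and the exact 20-character timestamp format, 'YYYY-MM-DD HH:MM:SS ', are assumptions on my part; adjust the slice to match your actual log format):

```python
import glob

def lines_in_range(pattern, start, end):
    """Yield log lines whose timestamp t satisfies start <= t < end.

    Assumes each line begins with a timestamp in its first 20
    characters (e.g. '2021-03-05 12:34:56 ') that sorts
    lexicographically, and that start/end are strings in the
    same format.
    """
    # sorted() processes the files chronologically, assuming the
    # file names themselves sort in date order.
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            for line in f:
                timestamp = line[0:20]
                # include the start, exclude the end (half-open range)
                if start <= timestamp < end:
                    yield line
```

Called with e.g. lines_in_range('/var/log/app/*.log', '2021-03-01 00:00:00', '2021-04-01 00:00:00'), it yields matching lines lazily instead of building a list.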
The lines_in_range function works but is not efficient, because it opens every log file and reads every line, which could be an issue if you have lots of files and lots of lines overall. I have used a generator (yield) to prevent all the lines from being loaded into memory, which is a good start, but the run time could still be slow. To make this function more efficient you would have to test whether a log file is likely to contain lines within the range at all before reading all of its lines. This could be done by checking that the timestamp of the first line and/or the last modification date/time of the file itself fall within the range.
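That pre-filter could look something like this (the name file_may_overlap and the timestamp format are my assumptions, and it relies on lines being appended in chronological order):

```python
import datetime
import os

def file_may_overlap(path, start, end):
    """Cheap pre-filter: return False only when the file provably
    holds no line in the half-open range [start, end).

    start/end are timestamp strings like '2021-03-01 00:00:00';
    lines are assumed to be appended in chronological order.
    """
    # If the file was last modified before the range starts, every
    # line in it predates start, so the file can be skipped.
    mtime = datetime.datetime.fromtimestamp(os.path.getmtime(path))
    if mtime.strftime("%Y-%m-%d %H:%M:%S") < start:
        return False
    # If the first line's timestamp is already at or past the end
    # of the range, so is every later line.
    with open(path) as f:
        first = f.readline()
    if first and first[0:20] >= end:
        return False
    return True
```

lines_in_range could then skip any path for which this returns False, avoiding a full read of files that cannot contribute any lines.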
Also, note that I include the start of the range but exclude the end because 1) this is the usual convention in Python and 2) it prevents picking up the same line twice if you call the function multiple times with contiguous ranges (e.g. a-b then b-c).
Hope this helps.
Nic