Trying to extract data from multiple log files in python (kindly help if you know python)

 
Greenhorn
Posts: 1
I am trying to extract data from multiple log files stored in the same directory. I am passing a starting and an ending date as parameters. The log files are named in ascending order, and the format is:

I have been successful in extracting the data when the start and end dates are present in the same file. However, how do I extract the data when the start and end dates are in different files?

This is what I have tried so far:

 
Greenhorn
Posts: 13
Hi Raghav,

First a couple of comments:
  • I would not use the csv module here, because your log files are not CSV files. You can just read each file line by line and extract the timestamp using line[0:20].
  • Using filter is a good idea, because it avoids loading all the lines into memory at once. However, it is usually better to use generator expressions instead of map/filter. For example, (x for x in L if x > 10) instead of filter(lambda x: x > 10, L).


  • To find all the log files I would use the glob module instead of os.listdir, because it lets you use wildcard patterns (useful if the directory contains more than just the log files). Then it is a matter of sorting the paths so the logs are processed chronologically, reading each file line by line, extracting the timestamp, and keeping only the lines that fall within the desired date/time range.
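
    The filter-versus-generator-expression point above can be seen with a small self-contained example (the list L and the threshold are just illustrative values):

    ```python
    L = [4, 8, 15, 16, 23, 42]

    # Both forms are lazy iterators: no element is examined until you iterate.
    via_filter = filter(lambda x: x > 10, L)   # works, but needs a lambda
    via_genexp = (x for x in L if x > 10)      # same result, usually more readable

    assert list(via_filter) == [15, 16, 23, 42]
    assert list(via_genexp) == [15, 16, 23, 42]
    ```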

    Here is how I would do it:



    The lines_in_range function works, but it is not efficient: it opens every log file and reads every line, which could be an issue if you have many files and many lines overall. I have used a generator (yield) to prevent all the lines from being loaded into memory, which is a good start, but the run time could still be slow. To make the function more efficient, you would have to test whether a log file is likely to contain lines within the range at all before reading all of its lines. This could be done by checking that the timestamp of the first line and/or the last-modification time of the file itself fall within the range.
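
    The approach described above might be sketched roughly as follows; note this is only a sketch of the idea, and the 20-character timestamp prefix (e.g. "YYYY-MM-DD HH:MM:SS ") and the *.log naming pattern are assumptions, not details from the thread:

    ```python
    import glob
    import os

    def lines_in_range(log_dir, start, end):
        """Yield log lines whose leading timestamp t satisfies start <= t < end.

        Assumes each line starts with a lexically sortable timestamp in its
        first 20 characters, e.g. 'YYYY-MM-DD HH:MM:SS '.
        """
        # glob + sorted processes the files chronologically, relying on the
        # ascending file-name order mentioned in the question.
        for path in sorted(glob.glob(os.path.join(log_dir, "*.log"))):
            with open(path) as f:
                for line in f:
                    timestamp = line[0:20].strip()
                    # Half-open interval: include start, exclude end.
                    if start <= timestamp < end:
                        yield line
    ```

    Because it is a generator, you can pass it straight to a loop or list() without holding every matching line in memory at once.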

    Also, I am including the start but excluding the end because 1) this is the usual convention in Python, and 2) it prevents picking the same line twice if you call the function multiple times with contiguous ranges (e.g. a-b then b-c).
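
    A quick illustration of that half-open convention with plain string timestamps (the sample values are made up): the boundary timestamp b lands in exactly one of two contiguous ranges, never both.

    ```python
    stamps = ["2024-01-01 09:00:00", "2024-01-01 10:00:00", "2024-01-01 11:00:00"]

    a = "2024-01-01 09:00:00"
    b = "2024-01-01 10:00:00"
    c = "2024-01-01 12:00:00"

    first = [t for t in stamps if a <= t < b]   # range a-b
    second = [t for t in stamps if b <= t < c]  # contiguous range b-c

    # b is excluded from the first range and included in the second,
    # so no line is counted twice.
    assert first == ["2024-01-01 09:00:00"]
    assert second == ["2024-01-01 10:00:00", "2024-01-01 11:00:00"]
    ```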

    Hope this helps.

    Nic
     