JavaRanch » Java Forums » Languages » Clojure
Extracting and Comparing Data

Dan King
Ranch Hand

Joined: Mar 18, 2009
Posts: 84
I hacked together the following code to extract and compare data from two files. I'd appreciate feedback on its good and bad aspects, and on whether and how it can be improved. I've included sample data below the code. Thanks.




Employee Id  Name  Time In           Time Out          Dept.
mce0518      Jon   2011-01-01 06:00  2011-01-01 14:00  ER
mce0518      Jon   2011-01-02 06:00  2011-01-01 14:00  ER
mce0518      Jon   2011-01-04 06:00  2011-01-01 13:00  ICU
mce0518      Jon   2011-01-05 06:00  2011-01-01 13:00  ICU
mce0518      Jon   2011-01-05 17:00  2011-01-01 23:00  ER



Employee Id  Name  Time In           Time Out          Dept.
pdm1705      Jane  2011-01-01 06:00  2011-01-01 14:00  ER
pdm1705      Jane  2011-01-02 06:00  2011-01-01 14:00  ER
pdm1705      Jane  2011-01-05 06:00  2011-01-01 13:00  ER
pdm1705      Jane  2011-01-05 17:00  2011-01-01 23:00  ER
Sean Corfield
Ranch Hand

Joined: Feb 09, 2011
Posts: 261
    

A couple of things spring to mind:

1. Since you want to drop the headings in the files, do that first before splitting the (remaining) lines into tokens
2. You don't need doall in strip-data - doall is needed inside with-open to realize the sequence but not after that
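
To illustrate point 1, here is a minimal sketch rather than the code from this thread. It assumes a single heading line per file and whitespace-separated fields; sample-lines and tokenize-rows are hypothetical names used only for this example.

```clojure
(require '[clojure.string :as str])

;; Hypothetical in-memory stand-in for the lines read from one timesheet file.
(def sample-lines
  ["Employee Id Name Time In Time Out Dept."
   "mce0518 Jon 2011-01-01 06:00 2011-01-01 14:00 ER"
   "mce0518 Jon 2011-01-04 06:00 2011-01-01 13:00 ICU"])

(defn tokenize-rows
  "Drops the heading line, then splits each remaining line on whitespace."
  [lines]
  (->> lines
       rest                           ; skip the heading first
       (map #(str/split % #"\s+"))))  ; tokenize only the data rows

(tokenize-rows sample-lines)
```

Dropping the heading first means the tokenizing step never has to deal with a row whose shape differs from the data rows.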

I changed the code up a little, to use a main function with any number of arguments and then modified calls to accept any number of timesheets.
Some notes: I use Leiningen, so I created a new project called dan (lein new dan). Inside that, I placed jon.txt and jane.txt - your sample files - and edited src/dan/core.clj to look like this:

Now I can run the program with: lein run -m dan.core jon.txt jane.txt

If I have more files, I can add those too and the program will find the intersection of all matching timesheet entries.
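
As a rough sketch of how an intersection over any number of timesheets might look (not the program from this post): common-entries is a hypothetical name, and each timesheet is assumed to be already parsed into a collection of comparable entries.

```clojure
(require '[clojure.set :as set])

(defn common-entries
  "Returns the set of entries present in every timesheet.
   Each timesheet is a collection of comparable entries."
  [& timesheets]
  (if (seq timesheets)
    (apply set/intersection (map set timesheets))
    #{}))  ; set/intersection needs at least one set, so guard the empty case

;; Three hypothetical timesheets; only ["2011-01-01" "ER"] appears in all of them.
(common-entries [["2011-01-01" "ER"] ["2011-01-04" "ICU"]]
                [["2011-01-01" "ER"] ["2011-01-04" "ER"]]
                [["2011-01-01" "ER"]])
```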

Instead of (comp strip-data coll-data), you might prefer #(strip-data (coll-data %)).
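
The two forms behave the same; comp applies the rightmost function first. In the sketch below, coll-data and strip-data are simplified stand-ins written for illustration only, not the definitions from the original code.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-ins, assumed only for this example.
(defn coll-data [line] (str/split line #"\s+"))
(defn strip-data [tokens] (drop 2 tokens))

(def via-comp (comp strip-data coll-data))     ; coll-data runs first, then strip-data
(def via-fn   #(strip-data (coll-data %)))     ; same order, written out explicitly

(= (via-comp "mce0518 Jon 2011-01-01 ER")
   (via-fn   "mce0518 Jon 2011-01-01 ER"))
```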

I left #(drop 2 %) alone but was tempted to use (partial drop 2) instead - or you could just use nnext which is equivalent to #(next (next %)) although next returns nil if there are no more elements where drop (and rest) return an empty sequence. It doesn't matter in this case so it's just stylistic preference. (comp rest rest) would be another possibility.
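
A small sketch of those alternatives side by side; row is a made-up sample value.

```clojure
(def row ["mce0518" "Jon" "2011-01-01" "ER"])

;; All four produce the same elements for a row with more than two items.
(drop 2 row)
((partial drop 2) row)
(nnext row)
((comp rest rest) row)

;; The difference only shows when fewer than three elements are present:
(drop 2 ["only"])            ; empty sequence
((comp rest rest) ["only"])  ; empty sequence
(nnext ["only"])             ; nil
```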
Dan King
Ranch Hand

Joined: Mar 18, 2009
Posts: 84
Sean Corfield wrote:
2. You don't need doall in strip-data - doall is needed inside with-open to realize the sequence but not after that


Could you elaborate on why "doall" isn't needed in "strip-data"? As I understand it, "doall" forces lazy-sequences to be fully realized, and since "map" produces a lazy-sequence, shouldn't it be needed?

Also, I've modified the code to match time intervals (via Java interop) rather than straight string matches. Unfortunately, I'm encountering a null pointer exception when I attempt to determine interval overlaps. I've posted the code below. Do you see where or what is causing the problem?

Note: Everything up to "prep-data" works as expected - i.e. a collection of collections is made, where each sub-collection has two parts 1. A Joda-Time Interval 2. A String representing a department.

Also, the input data structure has changed; see below the code. Lastly, I too am now using Leiningen, so I had to add [joda-time "1.6.2"] to dependencies in project.clj.




Employee Id  Name  Time In              Time Out             Dept.
pdm1705      Jane  01/01/2011 06:00 AM  01/01/2011 02:00 PM  ER
pdm1705      Jane  01/02/2011 06:00 AM  01/02/2011 02:00 PM  ER
pdm1705      Jane  01/04/2011 06:00 AM  01/04/2011 01:00 PM  ER
pdm1705      Jane  01/05/2011 05:00 AM  01/05/2011 11:00 PM  ER


Employee Id  Name  Time In              Time Out             Dept.
mce0518      Jon   01/01/2011 06:00 AM  01/01/2011 02:00 PM  ER
mce0518      Jon   01/02/2011 06:00 AM  01/02/2011 02:00 PM  ER
mce0518      Jon   01/04/2011 06:00 AM  01/04/2011 01:00 PM  ICU
mce0518      Jon   01/05/2011 06:00 AM  01/05/2011 01:00 PM  ICU
mce0518      Jon   01/05/2011 05:00 PM  01/05/2011 11:00 PM  ER



Sean Corfield
Ranch Hand

Joined: Feb 09, 2011
Posts: 261
    

Dan King wrote: Could you elaborate on why "doall" isn't needed in "strip-data"? As I understand it, "doall" forces lazy-sequences to be fully realized, and since "map" produces a lazy-sequence, shouldn't it be needed?

I'll answer this question first, and then I'll take a look at your code later.

In general, you want to be lazy unless you need to force a result. doall is needed inside with-open because you need that sequence fully realized before the file is closed (without it, processing the line sequence would happen "later" and the reader would already have been closed).

You don't need to force realization in strip-data because you later process the entire sequence (in order to do the comparison / intersection etc).
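
A minimal sketch of the distinction, using a StringReader in place of a real file so it runs on its own; read-all-lines and upper-lines are hypothetical names for this example.

```clojure
(require '[clojure.string :as str])
(import '(java.io BufferedReader StringReader))

(defn read-all-lines
  "Reads every line of text through a reader that with-open closes.
   doall realizes the lazy line-seq while the reader is still open;
   without it, lines not yet read would be requested after the close."
  [text]
  (with-open [rdr (BufferedReader. (StringReader. text))]
    (doall (line-seq rdr))))

;; No doall needed here: the lines are already in memory, so map can
;; stay lazy until something actually consumes the result.
(def upper-lines
  (map str/upper-case (read-all-lines "a\nb\nc")))
```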

Printing something will force realization and thus cause processing of any necessary data, lazy or not. The great thing about being lazy wherever you can is that any computation that isn't actually needed will be avoided. It doesn't make much difference in your example but if you were processing large amounts of data and could figure out the result without processing all of the data, you could avoid unnecessary work.
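
To make the "avoid unnecessary work" point concrete, here is a small sketch with a call counter; expensive and calls are made-up names. It uses iterate as the source because it yields one element at a time (chunked sources such as vectors or ranges may realize a batch of elements at once).

```clojure
(def calls (atom 0))

(defn expensive
  "Stand-in for a costly computation; counts how often it runs."
  [x]
  (swap! calls inc)
  (* x x))

;; An infinite lazy sequence: nothing is computed when it is defined.
(def squares (map expensive (iterate inc 0)))

;; Consuming three elements runs expensive only for those three.
(def first-three (doall (take 3 squares)))

@calls
```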

Hope that helps?
Sean Corfield
Ranch Hand

Joined: Feb 09, 2011
Posts: 261
    

Dan, I think something's missing from your code - when I run it as-is, I get: Unable to resolve symbol: fmt in this context (core.clj:30) and indeed I don't see a function called fmt defined in your overlap? function. I replaced it with .toString and now I get the NPE in overlap? (as "expected").

The problem is that .overlap returns null if there's no overlap, and you're comparing it against false. interval is already appropriately truthy or falsey, so you can just use it as-is:
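
As a sketch of the idea only (not the code from this thread): the overlap function below is a hypothetical stand-in that works on plain [start end] number pairs instead of Joda-Time intervals, but it follows the same contract of returning nil when nothing overlaps.

```clojure
(defn overlap
  "Hypothetical stand-in for an interval overlap: takes two [start end]
   pairs of numbers and returns the overlapping pair, or nil if none."
  [[s1 e1] [s2 e2]]
  (let [s (max s1 s2)
        e (min e1 e2)]
    (when (< s e) [s e])))

(defn describe
  "Uses the result directly as the test; nil is falsey, so no
   comparison against false or nil is needed."
  [a b]
  (if-let [common (overlap a b)]
    (str "overlap " common)
    "no overlap"))

(describe [6 14] [5 13])
(describe [6 14] [17 23])
```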

At this point, you probably want to filter out the nil entries:
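
Two common ways to do that, sketched on made-up data (results is a hypothetical value standing in for the output of the comparison step):

```clojure
;; Hypothetical results: an overlap where one exists, nil where none does.
(def results [[6 13] nil [17 23] nil])

(remove nil? results)     ; drops only the nil entries

(keep identity results)   ; keep discards nil results of the function it applies

;; keep can also fold the mapping and the nil-removal into one step:
(keep #(when (even? %) (* % %)) [1 2 3 4])
```

(filter identity results) would work here too, but it also drops false, which matters if false is a legitimate value in the collection.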

BTW, Just as a stylistic point, (nth coll 0) feels really odd to me so I'd use (first coll) - and (second coll) instead of (nth coll 1).
Dan King
Ranch Hand

Joined: Mar 18, 2009
Posts: 84
Sean Corfield wrote:Hope that helps?

Thanks for the explanation; it did help. I was under the misimpression that in order to use 'map' I had to use 'doall' to realize the resulting sequence.

Furthermore, I found the cause of the NPE: within an 'if' expression I was checking for false rather than nil. After correcting this, the code almost functions as desired; unfortunately, there is one remaining issue:

The collection returned from "cmp-time-dept" contains nil values, when it should not (see line 32 in posted code). How can I eliminate the nil values?

NOTE: After posting the above, I saw that you had already posted a message. I see from your post that I failed to include the 'fmt' function in the second version of my code; I'm sorry for the confusion and inconvenience. Impressively, you found a work-around AND read my mind about removing the nil values. Thanks a bunch.

I've corrected the earlier (second version) code posting and I've included a final version below.

I agree that the 'first' and 'second' functions are better than 'nth coll n'. I've changed my code to use 'first' or 'second', but there is one instance where I used 'nth' to preserve stylistic continuity, since there is no 'third' function.


Sean Corfield
Ranch Hand

Joined: Feb 09, 2011
Posts: 261
    

Looks like our messages passed in the ether, since I already answered that question and addressed the false / nil issue. Idiomatic Clojure tends not to test specifically for false or nil unless both values are possible in an expression and you need to tell them apart: everything is truthy except false and nil, which are falsey.
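
A quick sketch of that rule; truthy? is a throwaway helper written for this example, not a core function.

```clojure
(defn truthy?
  "Reports how a value behaves when used as the test of an if."
  [x]
  (if x true false))

;; Only nil and false are falsey; 0, empty strings and empty collections are truthy.
(map truthy? [nil false 0 "" [] :a])

;; When the two falsey values must be told apart, test explicitly:
(nil? nil)      ; true
(nil? false)    ; false
(false? false)  ; true
(false? nil)    ; false
```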
Dan King wrote: I've changed my code to use 'first' or 'second', but there is one instance where I used 'nth' to preserve stylistic continuity, since there is no 'third' function.


Dan King
Ranch Hand

Joined: Mar 18, 2009
Posts: 84
Sean,

Out of curiosity and for future reference, what tool(s) do you use for debugging Clojure?

Also, if you don't mind, I'd like to ask you a non-Clojure web application development question. Would you prefer I ask it here on JavaRanch, on another forum, or by email? Thanks.
Sean Corfield
Ranch Hand

Joined: Feb 09, 2011
Posts: 261
    

Dan King wrote:Out of curiosity and for future reference, what tool(s) do you use for debugging Clojure?

I try to follow a TDD approach, as well as exercising code in small pieces in the REPL, and that helps me avoid "debugging" for the most part but not entirely, at which point I tend to add println calls into my code - ugly, but effective. Source level debugging is still fairly immature in Clojure (but work is being done in that area).
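
As a rough sketch of that workflow, assuming nothing beyond clojure.test from the standard distribution: minutes-worked and spy are hypothetical names made up for this example.

```clojure
(require '[clojure.test :refer [deftest is]])

(defn minutes-worked
  "Takes a [start end] pair in minutes since midnight and returns the duration."
  [[start end]]
  (- end start))

;; A small test, written alongside the function it exercises.
(deftest minutes-worked-test
  (is (= 480 (minutes-worked [360 840]))))

(defn spy
  "println-style debugging helper: prints a labeled value and returns it
   unchanged, so it can be dropped into the middle of a pipeline."
  [label x]
  (println label x)
  x)

(->> [[360 840] [360 780]]
     (map minutes-worked)
     (spy "durations:")
     (reduce +))
```

Because spy returns its argument, it can be removed again without changing the surrounding expression.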
 