Program only reads in 11338 of 44674 lines

Ed Dablin
Ranch Hand

Joined: Oct 09, 2012
Posts: 32
Hi all.
I am 2 weeks into starting to learn Java and this is my first forum post. Hope someone can help.
I have a database of 44,674 world airports. (Filesize 5 MByte). Each airport is on its own line in a CSV file.
I can open the file with Excel or Notepad++ or TextWrangler which all show 44,674 rows.
But when I run my program (below) it stops after reading in exactly 11338 lines.
I can't figure out why it does not read in the whole file. There are no error messages. The Eclipse console
output is 11338 for the count variable. The data on line 11338 is very similar to its neighbouring rows. Any ideas???
fred rosenberger
lowercase baba
Bartender

Joined: Oct 02, 2003
Posts: 10905

the simplest debugging techniques are often the best. Try printing out the lines as you read them - i.e. make sure it is reading every line, not skipping any, etc. Try it on a smaller file - if you have a file with 20 lines, does it only read 5? Does it read the first five, the last five, every fourth one, etc.

You need to know what it is REALLY doing before you try and figure out WHY it is doing it.


There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors
Ed Dablin
Ranch Hand

Joined: Oct 09, 2012
Posts: 32
Thanks, Fred. Yes I have done that. Actually my program is much larger. It reads in a line at a time and formats the data neatly, and calculates distance and bearings based on the lat & long data for each airport.
It works great for the first 11338 airports then it terminates. I have extracted this much smaller test program file to try to narrow down the issue. I'm stumped!!
fred rosenberger
lowercase baba
Bartender

Joined: Oct 02, 2003
Posts: 10905

Ok...a few other things to try...

did using a smaller file help? I.e. take only the first 100 lines from your current file and process them.

Take the 50 lines before and the 50 after...it is possible there is some funky non-printing character in your file. I would literally copy it, then edit it with notepad or something similar, and cut out the first 11,350ish lines, and then the last <however many> lines.

Is it possible to regenerate the source file? maybe something bad happened when you created/copied/moved it.

It is hard to say without actually running the code and looking at the data file, and posting an 11k line file here is probably not going to help anyone...

Ed Dablin
Ranch Hand

Joined: Oct 09, 2012
Posts: 32
Hi Fred, I have done that.
First of all, the program always quit on the line which contained airport identifier "BGUK".
I reduced the size of the airportsTest file in 5 steps and recorded the results:
airportsTest.csv of size 5360 kB and 44,674 airports. Program terminated at the "BGUK" line and count=11337.
I then deleted first approximately 11000 airports and saved as airportsTest1.
airportsTest1.csv of size 4130 kB and 33,675 airports. Program terminated at the "BGUK" line and count=338.
I then deleted first 300 airports.
airportsTest2.csv of size 4094 kB and 33,375 airports. Program terminated at the "BGUK" line and count=38.
I then deleted the last 33,000 airports.
airportsTest3.csv of size 44 kB and 374 airports. Program terminated at the "BGUK" line and count=38.
I then deleted the first 32 airports.
airportsTest4.csv of size 39 kB and 342 airports. Program terminated at the "BGUK" line and count=6.
I then deleted the last 332 airports.
airportsTest5.csv of size 2 kB and 10 airports. Program DID NOT TERMINATE EARLY when count=9 (corresponding to 10 lines).
SO, IN THE FINAL CASE THE PROGRAM SUCCEEDED IN GETTING PAST THE "BGUK" LINE.
Could it be there is some weird character in the last 340 (or so) lines of airportsTest4.csv ??
What weird character could bring the program to an early stop?
fred rosenberger
lowercase baba
Bartender

Joined: Oct 02, 2003
Posts: 10905

beats me...I'd probably open the (smallest) file that still dies with a hex editor and look for non-printing characters.

also...hasNextLine and nextLine can both throw exceptions...are you catching and printing those in your 'real' code?

I admit I am out of my element here...I'm really just spitballing.
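On the exception question: as far as the Scanner documentation goes, an I/O failure in the underlying source does not surface as an exception from hasNextLine(); the Scanner simply behaves as if the input had ended, and the error can be retrieved afterwards with ioException(). The sketch below shows that check. The class name, the Result holder and the command-line argument are made up for illustration; they are not from Ed's program.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Scanner;

class ScannerErrorCheck {

    /** Line count plus whatever I/O error the Scanner recorded (null if none). */
    static final class Result {
        final int lines;
        final IOException error;

        Result(int lines, IOException error) {
            this.lines = lines;
            this.error = error;
        }
    }

    /** Reads every line from the Scanner, then asks it whether its source failed. */
    static Result drain(Scanner in) {
        int count = 0;
        while (in.hasNextLine()) {
            in.nextLine();
            count++;
        }
        // ioException() returns the last IOException thrown by the underlying
        // Readable, or null if there was none.
        return new Result(count, in.ioException());
    }

    public static void main(String[] args) throws FileNotFoundException {
        if (args.length == 0) {
            System.out.println("usage: java ScannerErrorCheck <file>");
            return;
        }
        // args[0] is a hypothetical path to the CSV file being investigated.
        try (Scanner in = new Scanner(new File(args[0]))) {
            Result r = drain(in);
            System.out.println("lines read: " + r.lines);
            System.out.println("recorded I/O error: " + r.error);
        }
    }
}
```

If the second line of output is not null, the loop ended because of an error rather than a genuine end of file.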
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109

Ed Dablin wrote:
First of all, the program always quit on the line which contained airport identifier "BGUK".
...
I then deleted the last 332 airports.
airportsTest5.csv of size 2 kB and 10 airports. Program DID NOT TERMINATE EARLY when count=9 (corresponding to 10 lines).
SO, IN THE FINAL CASE THE PROGRAM SUCCEEDED IN GETTING PAST THE "BGUK" LINE.
Could it be there is some weird character in the last 340 (or so) lines of airportsTest4.csv ??
What weird character could bring the program to an early stop?


You could try opening the file with a hex editor and seeing what the actual characters are.

Maybe Scanner requires the platform-specific EOL and the last 340 lines have just \n instead of \r\n or something like that? (I wouldn't expect that--Scanner should work with any standard EOL convention, I'd think.)

You could also try using a BufferedReader and calling readLine() until that returns null to see if you get different results.
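A minimal sketch of the BufferedReader cross-check suggested above. The class name and the command-line argument are hypothetical, and FileReader is assumed here only because it uses the platform default encoding, just as a Scanner created without a charset would.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

class ReadLineCount {

    /** Counts lines by calling readLine() until it returns null. */
    static int countLines(Reader source) {
        try (BufferedReader in = new BufferedReader(source)) {
            int count = 0;
            while (in.readLine() != null) {
                count++;
            }
            return count;
        } catch (IOException e) {
            // Unlike Scanner, a read failure is reported here instead of hidden.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java ReadLineCount <file>");
            return;
        }
        // args[0] is a hypothetical path to the CSV file.
        System.out.println(countLines(new FileReader(args[0])));
    }
}
```

readLine() treats \n, \r and \r\n as line terminators, so mixed line endings alone should not change the count.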
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 36453
The likeliest character to bring it to an early halt is ctrl-D (\u0004) or ctrl-Z (\u001a) which are end‑of‑file characters in *nix and Windows respectively.
Alternative technique:
Try going through the entire file with the read() method of a FileReader in a loop. This returns an int for each character and -1 for end of file (I think). If the result is within the expected range, ignore it. Otherwise print it out as a hexadecimal number and whereabouts it appears.
Put the expected characters into a Set<Character> remembering to include \u0020 = space \u000a = \n and \u000d = \r. Or use the methods of the Character class.
If you get an error message like “Unexpected control character \u4321 at position 456789”, then you know you only have to read 456788 characters before you find it
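The character-by-character check described above could be sketched roughly as follows. The class and method names are hypothetical, and the rule for what counts as "expected" (anything that is not an ISO control character, plus \n, \r and tab) is an assumption; adjust it to the data.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class ControlCharFinder {

    /** Assumed rule: line ends, tab and every non-control character are fine. */
    static boolean isExpected(int c) {
        return c == '\n' || c == '\r' || c == '\t' || !Character.isISOControl(c);
    }

    /** Reads one char at a time and reports each unexpected one with its position. */
    static List<String> findUnexpected(Reader in) {
        List<String> hits = new ArrayList<>();
        try {
            long position = 0;
            int c;
            while ((c = in.read()) != -1) {   // read() returns -1 at end of stream
                if (!isExpected(c)) {
                    hits.add(String.format("\\u%04x at position %d", c, position));
                }
                position++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java ControlCharFinder <file>");
            return;
        }
        // args[0] is a hypothetical path; FileReader decodes with the default charset.
        try (Reader in = new BufferedReader(new FileReader(args[0]))) {
            for (String hit : findUnexpected(in)) {
                System.out.println("Unexpected control character " + hit);
            }
        }
    }
}
```

Positions are counted in characters from zero, so they depend on the encoding used to decode the file.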
Ed Dablin
Ranch Hand

Joined: Oct 09, 2012
Posts: 32
OK I think I've nailed it. Thanks for your help.
The early airports in the list are North American and don't have any non-standard characters. Later on, as we get into more exotic regions, there are all sorts of extended ASCII characters.
So I made up this listing to go through each byte of the file in turn and look out for dodgy characters. Tomorrow I will hunt down the exact character that causes Scanner to think EOF has been reached.
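A byte-level scan of the kind described might look like the sketch below. It is not Ed's actual listing: the class name, the method and the choice to report only the first line containing a byte outside 7-bit ASCII are all assumptions made for illustration.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class NonAsciiLocator {

    /**
     * Returns the 1-based number of the first line that contains a byte
     * of 0x80 or above, or -1 if every byte is plain ASCII.
     * Lines are assumed to end with \n (optionally preceded by \r).
     */
    static int firstNonAsciiLine(InputStream in) {
        try {
            int line = 1;
            int b;
            while ((b = in.read()) != -1) {   // read() yields 0..255, or -1 at the end
                if (b == '\n') {
                    line++;
                } else if (b >= 0x80) {
                    return line;
                }
            }
            return -1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java NonAsciiLocator <file>");
            return;
        }
        // args[0] is a hypothetical path to the CSV file.
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            System.out.println("first line with a non-ASCII byte: " + firstNonAsciiLine(in));
        }
    }
}
```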

Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109

There's a caveat with that approach. The read() method reads a byte. There may be valid 2-byte characters where one of the bytes is one of the end-of-stream characters.

If the encoding you're using (either the charsetName param you passed when creating your Scanner, or the default for your system) is a one-byte encoding (like ASCII) or doesn't match the encoding in which the file was written, that could be the source of incorrectly interpreting a valid character (or part of one) as end-of-stream, or as something else that might cause problems.

Your best bet is to find out conclusively what encoding was used to create the file, and use the same encoding to read it, and look at characters rather than bytes. If you can't find out the encoding for sure, try some common ones like UTF-8, ISO-8859-1, etc., and see what they turn up.

If you can get the same error on the same airport consistently with a smaller file, use that for your testing.
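One way to "try some common ones" is to decode the raw bytes strictly with each candidate and see which ones fail. The sketch below does that with a CharsetDecoder set to report errors rather than replace them. The class name and the list of candidates are assumptions, not something prescribed in this thread.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class EncodingProbe {

    /** True if the bytes decode in the given charset without any error. */
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java EncodingProbe <file>");
            return;
        }
        // args[0] is a hypothetical path; a 5 MB file fits in memory comfortably.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        Charset[] candidates = {
            StandardCharsets.US_ASCII, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1
        };
        for (Charset cs : candidates) {
            System.out.println(cs + ": " + (decodesCleanly(data, cs) ? "ok" : "decoding error"));
        }
    }
}
```

A clean result for ISO-8859-1 says little, because every byte value is a valid character in that encoding. A clean UTF-8 result on a file that contains bytes above 0x7f is much stronger evidence.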
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 36453
ISO8859-1 is the most likely encoding used.
Ed Dablin
Ranch Hand

Joined: Oct 09, 2012
Posts: 32
I've gone back to the source of the data and it is UTF-8.
Whatever that means!
I will read up about character encodings - a subject on which I confess total ignorance
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109

Ed Dablin wrote:I've gone back to the source of the data and it is UTF-8.
Whatever that means!
I will read up about character encodings - a subject on which I confess total ignorance


UTF-8 is a common encoding that covers the majority of the world's writing systems, or at least majority by measure of the population that uses them. It's the default encoding for a lot of software produced in the last 5 years or so. A character can be represented by 1, 2, 3, or 4 bytes. ASCII characters are represented by their normal 1-byte values; 2 bytes cover scripts such as Arabic, and 3 bytes cover most Japanese characters, among others.

When you create your Scanner, if you don't tell it otherwise, it will use the default encoding for your system. You can see what that is by calling java.nio.charset.Charset.defaultCharset(), and/or you can try explicitly creating the Scanner with a charset of "UTF-8". Between these options and the character or byte examination discussed earlier, you should get a better idea of what you're receiving and how it's being interpreted.
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 36453
UTF‑8 ought to be able to show every Unicode character. The Joel Spolsky article explains how UTF‑8 works. Every character ≥ 0x80 (=128) is represented by two or more bytes, all the bytes beginning with a 1‑bit. Every byte contains a 0‑bit somewhere, too. Control characters like ctrl-D or ctrl-Z which are not in the original text cannot therefore appear mysteriously when you encode it in UTF‑8.
I don’t know how a delete character (0x7f=127) would affect your scanner. The Scanner documentation is a bit vague about what it interprets as line end characters.
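The high-bit property described above can be checked directly by encoding a few characters and inspecting the bytes. This is a small sketch; the class name and the four sample code points (A, é, 日 and an emoji) are arbitrary choices for illustration.

```java
import java.nio.charset.StandardCharsets;

class Utf8HighBit {

    /** UTF-8 bytes for a single Unicode code point. */
    static byte[] utf8(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    }

    /** True if every byte of the encoded code point has its top bit set. */
    static boolean everyByteHasHighBit(int codePoint) {
        for (byte b : utf8(codePoint)) {
            if ((b & 0x80) == 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x65E5, 0x1F600};
        for (int cp : samples) {
            StringBuilder hex = new StringBuilder();
            for (byte b : utf8(cp)) {
                hex.append(String.format("%02x ", b & 0xFF));
            }
            System.out.printf("U+%04X -> %s(high bit on every byte: %b)%n",
                    cp, hex, everyByteHasHighBit(cp));
        }
    }
}
```

Only the ASCII sample has a byte below 0x80, which is why a byte such as 0x04 or 0x1a can only come from a control character that was already in the text.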
Ed Dablin
Ranch Hand

Joined: Oct 09, 2012
Posts: 32
It works! Now I get all 44,674 airports.
I changed my Scanner parameters as follows:
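A change of this kind typically means passing the charset name to the Scanner constructor. The sketch below is not Ed's actual code: the class name and file argument are hypothetical, and it assumes the Scanner(File, String charsetName) constructor with "UTF-8", as suggested earlier in the thread.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.nio.charset.Charset;
import java.util.Scanner;

class AirportLineCount {

    /** Counts the lines the Scanner can deliver. */
    static int countLines(Scanner in) {
        int count = 0;
        while (in.hasNextLine()) {
            in.nextLine();
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws FileNotFoundException {
        if (args.length == 0) {
            System.out.println("usage: java AirportLineCount <file>");
            return;
        }
        // Shows which encoding a Scanner would use if none were specified.
        System.out.println("platform default charset: " + Charset.defaultCharset());

        // args[0] is a hypothetical path to the CSV file. Naming the charset
        // makes the Scanner decode the file as UTF-8 regardless of the default.
        try (Scanner in = new Scanner(new File(args[0]), "UTF-8")) {
            System.out.println("lines read: " + countLines(in));
        }
    }
}
```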
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109

Campbell Ritchie wrote:Control characters like ctrl-D or ctrl-Z which are not in the original text cannot therefore appear mysteriously when you encode it in UTF‑8.


Are you saying the bytes corresponding to ctl-D and ctl-Z cannot appear as any part of a multibyte character in UTF-8? If that's what you're saying, then I have no idea what his specific problem was, other than that it appears to have been related to an encoding mismatch somehow. (Given that he says it now works, specifying UTF-8 for his Scanner.)

If that's not what you're contending, then it seems quite possible that one of those bytes was present as part of a multi-byte character, which was perfectly valid in UTF-8, but which showed up in "raw form" in whatever default encoding his Scanner was using.
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109

Ed Dablin wrote:It works! Now I get all 44,674 airports.
I changed my Scanner parameters as follows:


Awesome! Glad you got it fixed!

(And I have to admit to a mild curiosity as to what the source of the problem ultimately was. In particular, what java.nio.charset.Charset.defaultCharset() gives you, and what were the contents of the last successfully read line and first failed line before. Only if you have time and it's not too much trouble. Not a big deal at all though if you'd rather not muck with it any more.)
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 36453
No, I said not in the original text. If there is a ctrl-D it will be (char)0x0004, which will appear as a single byte (byte)0x04 or (byte)0b0000_0100.
What I meant is that when you encode into UTF‑8, you either have the ASCII (≤0x7f=127) value unchanged, or a byte whose most significant bit is 1. So UTF‑8 cannot manufacture spurious ctrl-D or ctrl-Z characters, but it can copy those already existing.

And well done getting it to work. I’d be interested to know what you found at the location where it stopped reading, too.
 