I am writing a program that splits a file into multiple smaller files. Afterwards I need to make sure that all of the data is present in the split files.
After the splitting operation I was thinking of counting the number of lines in the master file and then counting the number of lines in each of the smaller files. The sum of the lines in the smaller files should equal the number of lines in the master file. I'm worried this might take a long time, though, and am wondering if there is any quicker way of doing it.
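To make the idea concrete, here's a minimal sketch of that line-count audit. The class and method names are my own invention, and it assumes plain text files read with the platform default charset:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.List;

public class LineCountAudit {

    // Count lines with BufferedReader.readLine(), which treats
    // CR, LF, and CRLF terminators uniformly.
    static long countLines(String path) throws IOException {
        long count = 0;
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            while (in.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    // True if the split files' line counts sum to the master's count.
    static boolean audit(String master, List<String> parts) throws IOException {
        long sum = 0;
        for (String part : parts) {
            sum += countLines(part);
        }
        return sum == countLines(master);
    }
}
```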
Another possibility is the File.length() method. Should the sum of the lengths of the smaller files be the same as the length of the master file?
Sounds like your files are all text? A line-count audit should add up OK. A byte count might not, since you may lose one line terminator per file (except the last file) in the split, and the split might change the line break from CR to CRLF or vice versa.
Probably the best way to make sure you didn't lose or corrupt any data would be to put the files back together and check that the merged result matches the original, perhaps with a third-party compare tool.
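That merge-and-compare check can also be done in a few lines of code rather than with an external tool. A rough sketch (names are hypothetical, and it assumes the files fit comfortably in memory; for very large files you'd stream and compare in chunks instead):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class MergeCompare {

    // Concatenate the split files in order and compare the result
    // byte-for-byte against the original file.
    static boolean matchesOriginal(Path original, List<Path> parts) throws IOException {
        ByteArrayOutputStream merged = new ByteArrayOutputStream();
        for (Path part : parts) {
            merged.write(Files.readAllBytes(part));
        }
        return Arrays.equals(merged.toByteArray(), Files.readAllBytes(original));
    }
}
```

A byte-for-byte comparison like this catches everything at once: lost lines, altered terminators, and corrupted data.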
A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. — John Ciardi
Hm, Stan seems to be making several assumptions about how line terminators are being processed here, and I'm not sure they're warranted. I think the poster needs to determine whether things like line terminators will be changed during processing, and whether a new line terminator may be added at the end of each split file. Offhand I don't think either of those is necessary, though they may be desirable, or not, depending what this application is for.
If you're concerned about time: counting lines in a file ultimately requires reading each and every byte of the file to discover whether it's a line terminator (or part of one). If you're going to do that anyway, you might as well also compute some sort of checksum as you go, to verify that all the data is intact, not just the number of lines. The time spent calculating a checksum should be small compared to the time spent reading from the file in the first place. If you want something less reliable but much faster, and if you're not changing line terminators (something I typically find unnecessary or even undesirable anyway), then simply adding up the total file sizes should work reasonably well.