This week's book giveaway is in the OCMJEA forum.
We're giving away four copies of OCM Java EE 6 Enterprise Architect Exam Guide and have Paul Allen & Joseph Bambara on-line!
See this thread for details.
Better way to process flat file and encrypt

Balasubramaniam Muthusamy
Ranch Hand

Joined: Nov 30, 2010
Posts: 51
Hello everyone,
Could you please let me know a better way to read a data file (around 300 million records) and encrypt it using SHA-512? I am currently reading the file record by record, and each record contains around 20 fields. Each field is encrypted one by one, and finally the record is written to an output file. This has currently been running for more than 40 hours. Is there a better way to process it? Would threading concepts be helpful? Thanks in advance.

Bala
Tony Docherty
Bartender

Joined: Aug 07, 2007
Posts: 2253
I think you may get a better answer in the performance forum so I'll move the thread there for you.

The first thing to do, though, is to profile the application (record some timings) to find out which areas are taking the most time, because until you do that you don't know which bits you need to optimize.
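The timing Tony suggests doesn't need a full profiler; bracketing a suspected hot spot with System.nanoTime() is often enough to start. A minimal sketch (the field value and loop count below are made-up stand-ins):

```java
import java.security.MessageDigest;

public class TimingSketch {
    public static void main(String[] args) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-512");
        byte[] field = "sample-field-value".getBytes();   // made-up sample field

        long start = System.nanoTime();
        for (int i = 0; i < 100_000; i++) {
            md.digest(field);                             // suspected hot spot
        }
        long elapsed = System.nanoTime() - start;
        System.out.printf("100,000 digests took %d ms%n", elapsed / 1_000_000);
    }
}
```

Timing the read loop and the write loop in the same way shows which stage actually dominates before any optimization is attempted.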
Balasubramaniam Muthusamy
Ranch Hand

Joined: Nov 30, 2010
Posts: 51
Thank you. I think applying SHA-512 is what is taking most of the time. Thanks.
Tony Docherty
Bartender

Joined: Aug 07, 2007
Posts: 2253
I have already moved it there for you, but unless you provide some profiling information, people are unlikely to be able to give definitive answers.
I've just noticed you have edited your last post, so ignore the above.

I would imagine you are correct that the encryption is the bottleneck, but without actual timings you don't know that for sure, and I've been wrong many times before about where a bottleneck was occurring. Also, without actual timings you won't know whether any changes you make are actually improving the situation.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41631
What are you trying to accomplish? SHA-512 is a hash (or digest), not a cipher, so you won't be able to decrypt it. If you're trying to create a checksum for the file then there are existing tools that are much better suited (much faster).


Ping & DNS - my free Android networking tools app
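Ulf's distinction can be shown in a couple of lines: a SHA-512 digest is a fixed-size, one-way value; there is no key and no decrypt operation.

```java
import java.security.MessageDigest;

public class DigestDemo {
    // SHA-512 always yields a 64-byte digest, whatever the input size,
    // and there is no inverse operation to recover the input.
    public static byte[] sha512(byte[] input) throws Exception {
        return MessageDigest.getInstance("SHA-512").digest(input);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sha512("short".getBytes()).length);   // 64
        System.out.println(sha512(new byte[10_000]).length);     // 64
    }
}
```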
Balasubramaniam Muthusamy
Ranch Hand

Joined: Nov 30, 2010
Posts: 51
Thanks very much for your replies. As I said earlier, I need to process a file that contains around 450 to 500 million records with 30 to 40 columns. These columns have to be encrypted using SHA-512. At the moment we loop through the entire flat file, reading it record by record and column by column. Each column is digested using SHA-512, which returns a 64-byte digest, and those bytes are then processed further.
Finally the record is written to the output file, and the process continues until the last record.

Now I am trying to process the file using threads, rather than one record at a time, to improve the performance. What is the best way to process this file? Would a threaded approach help? Is there any better way? Please advise; your help will be much appreciated.

Thanks
Bala
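For reference, the threading idea above can be sketched with an ExecutorService; processRecord here is a hypothetical stand-in for the real per-record hashing, and whether this helps at all depends on whether the job is CPU-bound rather than I/O-bound.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelHashSketch {
    // Hypothetical per-record work: stands in for hashing the sensitive fields.
    static String processRecord(String record) {
        return record.toUpperCase();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        List<String> records = List.of("r1", "r2", "r3");   // stands in for the file reader
        List<Future<String>> results = new ArrayList<>();
        for (String r : records) {
            results.add(pool.submit(() -> processRecord(r)));
        }
        for (Future<String> f : results) {
            System.out.println(f.get());   // collected in submission order
        }
        pool.shutdown();
    }
}
```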
fred rosenberger
lowercase baba
Bartender

Joined: Oct 02, 2003
Posts: 11256


That does not make sense. "Encryption" implies that you will later want to "decrypt" it, and that is not possible with a hash.

You can think of it as like taking the sine of a number. If I said "The sine of the number is 0.5. Tell me the original.", you couldn't. Or if I said "I am a traveller who is now in St. Louis. Where did I come from?", again, you can't say. There are lots of ways to get from A to B, so if you only know B, you can't work back to A.


There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors
Balasubramaniam Muthusamy
Ranch Hand

Joined: Nov 30, 2010
Posts: 51
One thing to note here is that we are not going to encrypt all the columns; only a few columns will be encrypted, and as per my requirements those never need to be decrypted. I was just thinking about how to improve the processing speed...

Thanks much for your reply...
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12769
I suspect that the set-up time for encrypting millions of small chunks is what is time-consuming.

Is there some reason you can't just encrypt the whole file?

One thing here is that we are not going to encrypt all columns and only few columns will be encrypted and those doesn't need to decrypt as per my requirement.


If you are encrypting data that will never be decrypted, why not just throw it away?

Bill
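If Bill's suspicion about set-up cost is right, one easy change is to create the MessageDigest once and reuse it, rather than doing a provider lookup per field; only profiling can confirm whether the lookup actually dominates.

```java
import java.security.MessageDigest;

public class ReuseDigest {
    public static void main(String[] args) throws Exception {
        // Costly pattern: a provider lookup for every field.
        for (int i = 0; i < 3; i++) {
            MessageDigest perField = MessageDigest.getInstance("SHA-512");
            perField.digest(("field" + i).getBytes());
        }

        // Cheaper pattern: one instance per thread, reused for every field.
        // digest() resets the instance, so it is ready for the next field.
        MessageDigest shared = MessageDigest.getInstance("SHA-512");
        for (int i = 0; i < 3; i++) {
            shared.digest(("field" + i).getBytes());
        }
    }
}
```

Note that MessageDigest is not thread-safe, so a threaded version would need one instance per worker.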
Balasubramaniam Muthusamy
Ranch Hand

Joined: Nov 30, 2010
Posts: 51
These columns will then be loaded into a table for research purposes.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41631
No, they won't. As 3 people have said by now, you can't decrypt the data. Please read my previous post.
Balasubramaniam Muthusamy
Ranch Hand

Joined: Nov 30, 2010
Posts: 51
I think you've got me wrong... I never wanted to decrypt them. I just want to improve the performance of processing the file and encrypting. Thanks again.
fred rosenberger
lowercase baba
Bartender

Joined: Oct 02, 2003
Posts: 11256

If you are never going to decrypt them, then why spend the time encrypting them? Isn't that just a waste of time/processor power? Why not just null them out? Or make them all literally "XXXXXXXXXXXXXXXXXXXXXXXX"?

What do you think you gain by running them through the hash?
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12769
Please read the Wikipedia entry on the SHA family of hash functions to understand why people keep telling you this is a bad idea.

Bill
Balasubramaniam Muthusamy
Ranch Hand

Joined: Nov 30, 2010
Posts: 51
Thanks very much. I'll look for some alternatives. Thanks.
fred rosenberger
lowercase baba
Bartender

Joined: Oct 02, 2003
Posts: 11256

Folks around here will be happy to give you alternatives, but only if you tell us what you are really trying to accomplish.
R. Grimes
Ranch Hand

Joined: Aug 23, 2009
Posts: 42
I think if I had to encrypt a file that large, I might explore ECC (Elliptic Curve Cryptography).
Richard Tookey
Ranch Hand

Joined: Aug 27, 2012
Posts: 1050

R. Grimes wrote:I think if I had to encrypt a file that large, I might explore ECC (Elliptic Curve Cryptography).


I wouldn't! Elliptic Curve Cryptography is slow compared to AES, which is the industry standard. If I were going to encrypt, I might use a hybrid approach combining ECC with AES, where a random session key is used for the AES and ECC is used to encrypt the session key. Even then I would probably use RSA rather than ECC, since RSA is ubiquitous and ECC is not (at this time). Also, the hybrid approach would require a different session key for each cleartext, or database row, or column, or whatever unit of encryption is required.

On a more general point: I do know that some medical data is anonymised before being distributed for research purposes by, in essence, hashing the fields that might identify people. This allows researchers to identify common entities but not actual individuals. There has been some bad press over this, since on its own the hashing is sufficient to hide identities, but taken together with other publicly available data and a knowledge of the hash algorithms used, some (but not all) individuals can be identified. Further anonymisation can be achieved by using a keyed hash with the key kept very secret, but even this is not considered enough to completely protect identities.
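The keyed hash Richard describes is essentially an HMAC. A minimal sketch in the JCE (the key handling here is illustrative only, not a complete key-management scheme):

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class KeyedHashSketch {
    // Pseudonymise a field with HMAC-SHA512: equal inputs map to equal outputs,
    // so researchers can still join records, but without the secret key an
    // attacker cannot precompute hashes of guessed field values.
    public static byte[] pseudonymise(byte[] key, String field) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA512");
        mac.init(new SecretKeySpec(key, "HmacSHA512"));
        return mac.doFinal(field.getBytes());
    }
}
```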
R. Grimes
Ranch Hand

Joined: Aug 23, 2009
Posts: 42
Richard Tookey wrote:
R. Grimes wrote:I think if I had to encrypt a file that large, I might explore ECC (Elliptic Curve Cryptography).


I wouldn't! Elliptic Curve Cryptography is slow compared to AES, which is the industry standard.


Well, readers can refer to this document from Oracle, which apparently takes a different view, and decide which is best. See link.

A couple of noteworthy quotes:

"The Elliptic Curve Cryptosystem (ECC) offers the highest strength per bit of any known public-key cryptosystem today."


"We repeated these experiments using 2048-bit RSA keys and 193-bit ECC keys. We found ECC to perform better than RSA without any exceptions."

For a somewhat more recent document, if the above is too dated for you, I would refer to this 2010 abstract.

A noteworthy quote from this document is:

"From the above we conclude that, computationally speaking, cracking 160-bit ECC is at least three orders of magnitude harder than cracking 1024-bit RSA."

Or perhaps this presentation, given by Qualcomm in November 2012. See page 10 for speed comparisons.
Richard Tookey
Ranch Hand

Joined: Aug 27, 2012
Posts: 1050

R. Grimes wrote:
Richard Tookey wrote:
R. Grimes wrote:I think if I had to encrypt a file that large, I might explore ECC (Elliptic Curve Cryptography).


I wouldn't! Elliptic Curve Cryptography is slow compared to AES, which is the industry standard.


Well, readers can refer to this document from Oracle, which apparently takes a different view, and decide which is best. See link.
A couple of noteworthy quotes:

"The Elliptic Curve Cryptosystem (ECC) offers the highest strength per bit of any known public-key cryptosystem today."

"We repeated these experiments using 2048-bit RSA keys and 193-bit ECC keys. We found ECC to perform better than RSA without any exceptions."

For a somewhat more recent document, if the above is too dated for you, I would refer to this 2010 abstract.

A noteworthy quote from this document is:

"From the above we conclude that, computationally speaking, cracking 160-bit ECC is at least three orders of magnitude harder than cracking 1024-bit RSA."

Or perhaps this presentation, given by Qualcomm in November 2012. See page 10 for speed comparisons.


None of your references compare secret-key encryption using AES with public-key encryption using ECC; they compare ECC with other public-key cryptosystems such as RSA. I am not arguing for RSA or any other public-key algorithm; I'm arguing for using the industry standard for secret-key encryption, i.e. AES. Whether the OP should use AES on its own (see note 1) or in a hybrid system with one of the public-key encryption algorithms depends very much on the sort of data he is encrypting. The OP's evident confusion between encrypting and digesting makes it difficult for me to understand his requirements, but as I hinted in my previous post, I suspect he is trying to 'anonymise' data rather than encrypt it. I am just guessing, though.

Note 1 - using AES will pretty much always require one of the feedback modes, and for most feedback modes one also needs to add padding.
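Note 1 maps directly onto the JCE transformation string, which names the cipher, the feedback mode, and the padding in one go. A minimal sketch (the sample plaintext is made up):

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import java.security.SecureRandom;

public class AesCbcSketch {
    public static void main(String[] args) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);                                // 128-bit keys need no policy files
        SecretKey key = kg.generateKey();

        byte[] iv = new byte[16];                    // CBC needs a fresh random IV per message
        new SecureRandom().nextBytes(iv);

        // Transformation string: algorithm / feedback mode / padding.
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        byte[] ciphertext = c.doFinal("some column value".getBytes());

        c.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
        System.out.println(new String(c.doFinal(ciphertext)));   // some column value
    }
}
```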


Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7718

Richard Tookey wrote:None of your references compare secret key encryption using AES with public key encryption using ECC; they compare ECC with other public key crypto systems such as RSA.

I can't imagine anyone trying to compare any symmetric key encryption system (I assume that's what you mean by 'secret') with an asymmetric one, either for strength or speed, since they are likely to be orders of magnitude different.

As I recall, all government restrictions on key sizes are based on symmetric lengths; I don't even know if there are any such rules for asymmetric ones (though no doubt the bureaucrats have them tucked up their sleeve somewhere ).

Winston

Isn't it funny how there's always time and money enough to do it WRONG?
Articles by Winston can be found here
Richard Tookey
Ranch Hand

Joined: Aug 27, 2012
Posts: 1050

The point I was trying to make, Winston, is that I was advocating the use of AES rather than any public-key system, whether ECC, RSA or whatever. R. Grimes presented an argument that ECC was superior to RSA, but since I was not advocating RSA as the primary encryption algorithm, the argument was irrelevant. There may be a strong case for using a hybrid system (see section 13.6 of Practical Cryptography by Ferguson and Schneier), but the OP has not presented a use case, so it is difficult for me to judge. It is unusual for a public-key system to be used for bulk encryption (that is what symmetric secret-key ciphers are designed and optimized for), and a hybrid system, whether ECC+AES or RSA+AES, is the norm when encrypting files.

Yes - there are US restrictions on the size of the RSA modulus (it used to be 1024 bits, but I'm not sure of the current limit), and without the "Unlimited Strength" files installed the JCE enforces them. Presumably the NSA has also placed similar arbitrary and irrelevant restrictions on ECC key sizes.
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7718

Richard Tookey wrote:Presumably NSA has also placed similar arbitrary and irrelevant restrictions on ECC key sizes.

No doubt. I remember back when I was "security administrating" for the first time (more than 10 years ago now) being amazed that encryption laws came under the heading of "Weaponry", and fine limits, even then, were in the hundreds of millions of dollars.

Winston
R. Grimes
Ranch Hand

Joined: Aug 23, 2009
Posts: 42
Richard Tookey wrote:R. Grimes presented an argument that ECC was superior to RSA but since I was not advocating RSA as the primary encryption algorithm the argument was irrelevant.


Oh, I'm sorry. I thought that, in your post I was responding to, you said:

"Even then I would probably use RSA rather than ECC since RSA is ubiquitous and ECC is not (at this time) ."
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7718

R. Grimes wrote:Oh, I'm sorry. I thought that, in your post I was responding to, you said:
"Even then I would probably use RSA rather than ECC since RSA is ubiquitous and ECC is not (at this time) ."

I think what we're probably both saying is that most PK encryption systems don't actually use PKE all the time; they use it for the "handshake" (i.e., source verification and symmetric key exchange) and then hand over to a symmetric algorithm, so the efficiency of the PK algorithm is unlikely to make a huge amount of difference to overall throughput.

Winston
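That handshake-then-symmetric pattern looks roughly like this in the JCE. This is a sketch only, with no authentication or protocol framing:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.util.Arrays;

public class HybridSketch {
    public static void main(String[] args) throws Exception {
        // Asymmetric pair, used only for the "handshake".
        KeyPairGenerator kpg = KeyPairGenerator.getInstance("RSA");
        kpg.initialize(2048);
        KeyPair rsa = kpg.generateKeyPair();

        // Random session key that will do the bulk (symmetric) work.
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey session = kg.generateKey();

        // RSA encrypts only the small session key, never the bulk data.
        Cipher rsaCipher = Cipher.getInstance("RSA/ECB/PKCS1Padding");
        rsaCipher.init(Cipher.ENCRYPT_MODE, rsa.getPublic());
        byte[] wrapped = rsaCipher.doFinal(session.getEncoded());

        // The recipient recovers the session key and switches to AES.
        rsaCipher.init(Cipher.DECRYPT_MODE, rsa.getPrivate());
        byte[] recovered = rsaCipher.doFinal(wrapped);
        System.out.println(Arrays.equals(recovered, session.getEncoded()));   // true
    }
}
```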
 