I'm new to Java and I'm facing some problems with the Multimap tools.
I've got a text file that I parse to collect some data. From every line, I collect the sequence ID, the gene names and their corresponding alleles and, optionally, the comment about the sequence (if there is one).
The first aim of my work is to group the alleles by their corresponding genes (which is easy with the MultiMap, using as the value an ArrayList that contains all the alleles of a gene).
Here is an example of my text file (I simplified it so it's easier to understand):
So I need to get something like this:
which I managed to do.
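For reference, the grouping step described above can be sketched with plain `java.util` collections instead of a MultiMap library. This is only a minimal sketch: the `{gene, allele}` record layout, the class name `AlleleGrouping` and the sample names are assumptions for illustration, since the real file format is not shown here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AlleleGrouping {

    // Groups allele names under their gene, keeping genes and alleles in the
    // order they are first seen. Each record is a hypothetical {gene, allele}
    // pair; the real parsing step is not part of this sketch.
    static Map<String, List<String>> groupByGene(List<String[]> records) {
        Map<String, List<String>> allelesByGene = new LinkedHashMap<>();
        for (String[] record : records) {
            String gene = record[0];
            String allele = record[1];
            List<String> alleles =
                    allelesByGene.computeIfAbsent(gene, g -> new ArrayList<>());
            if (!alleles.contains(allele)) { // list each allele once per gene
                alleles.add(allele);
            }
        }
        return allelesByGene;
    }

    public static void main(String[] args) {
        List<String[]> records = List.of(
                new String[] {"GeneA", "allele1"},
                new String[] {"GeneB", "allele1"},
                new String[] {"GeneA", "allele2"},
                new String[] {"GeneA", "allele1"});
        // prints {GeneA=[allele1, allele2], GeneB=[allele1]}
        System.out.println(groupByGene(records));
    }
}
```

A `LinkedHashMap` is used so that iteration follows insertion order; a `TreeMap` would be the choice if the genes should come out alphabetically instead.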
The problem starts here. For every distinct allele, I need to calculate:
- the total number of sequences in which the allele appears
- the number of redundant sequences
- the number of non-redundant sequences
- the number of sequences that contain a comment
To sum up, once I have finished reading the file, I need to be able to say, for every allele, how many sequences are associated with it and, among those sequences, how many are redundant, how many are not, and how many contain a comment.
All this while keeping the order defined earlier, that is, with the alleles grouped by their corresponding genes.
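One possible way to keep per-allele counters while preserving the gene grouping is a map of maps. The sketch below only covers the total and the comment count, because the redundancy counts depend on how redundancy is defined. The `{sequenceId, gene, allele, comment}` record layout and the names `AlleleCounts` and `Counts` are assumptions made for the example, not the actual file format.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AlleleCounts {

    // Running counters for one allele of one gene.
    static class Counts {
        int total;       // lines on which this allele appears
        int withComment; // lines that also carry a non-empty comment
    }

    // Each record is a hypothetical {sequenceId, gene, allele, comment} array;
    // the comment may be null or empty. LinkedHashMap keeps first-seen order,
    // so alleles stay grouped under their gene.
    static Map<String, Map<String, Counts>> count(List<String[]> records) {
        Map<String, Map<String, Counts>> byGene = new LinkedHashMap<>();
        for (String[] r : records) {
            Counts c = byGene
                    .computeIfAbsent(r[1], g -> new LinkedHashMap<>())
                    .computeIfAbsent(r[2], a -> new Counts());
            c.total++;
            if (r[3] != null && !r[3].trim().isEmpty()) {
                c.withComment++;
            }
        }
        return byGene;
    }

    public static void main(String[] args) {
        List<String[]> records = List.of(
                new String[] {"seq1", "GeneA", "allele1", "partial"},
                new String[] {"seq2", "GeneA", "allele1", ""},
                new String[] {"seq3", "GeneB", "allele1", "pseudogene"},
                new String[] {"seq4", "GeneA", "allele2", ""});
        count(records).forEach((gene, alleles) ->
                alleles.forEach((allele, c) ->
                        System.out.println(gene + " " + allele
                                + ": total=" + c.total
                                + ", withComment=" + c.withComment)));
    }
}
```

The counters are updated while the file is read, so a single pass over the lines is enough and nothing needs to be re-sorted afterwards.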
For example, for allele 1 of Gene A, I need to get as output something like this:
What solutions would you suggest for my problem, please?
Any help will be really appreciated.
PS: sorry for my English, I'm French.
Joined: Sep 16, 2008
I don't know how many of these files you have and how many genes and sequences we're talking about. I would put all this data into a SQL database and do the sorting and counting there. By the way, how do you know whether an allele is redundant or not?
... oh, and about you being French, I totally forgive you
Joined: Jul 15, 2010
Hi Martin, and thank you very much for your reply.
Actually, I get this file from another program after submitting a request, so its content changes all the time, but the format always stays the same. That's why I preferred to process the file directly without using a database, since I can't get all the data in advance anyway.
About the redundancy of the sequences (not the alleles), it's easy (well, in theory): if two lines have the same allele AND the same sequence, then the sequence is redundant for this allele.
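As a rough sketch, that rule could be applied per allele by counting how often each sequence occurs. Two things here are assumptions and not part of the rule as stated: every line whose sequence occurs more than once for the allele is counted as redundant (including the first occurrence), and the sequence is compared as a plain string. The class name `RedundancyCount` is made up for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RedundancyCount {

    // Returns {redundant, nonRedundant} line counts for ONE allele.
    // 'sequences' holds the sequence found on each line where the allele appears.
    // Assumption: all lines sharing a repeated sequence count as redundant,
    // including the first occurrence; a sequence seen once is non-redundant.
    static int[] count(List<String> sequences) {
        Map<String, Integer> occurrences = new HashMap<>();
        for (String s : sequences) {
            occurrences.merge(s, 1, Integer::sum); // tally each sequence
        }
        int redundant = 0;
        int nonRedundant = 0;
        for (int n : occurrences.values()) {
            if (n > 1) {
                redundant += n;
            } else {
                nonRedundant++;
            }
        }
        return new int[] {redundant, nonRedundant};
    }

    public static void main(String[] args) {
        int[] result = count(List.of("ACGT", "ACGT", "TTGA"));
        // prints redundant=2, nonRedundant=1
        System.out.println("redundant=" + result[0] + ", nonRedundant=" + result[1]);
    }
}
```

If only the extra copies should count as redundant, the `redundant += n` line would become `redundant += n - 1` and the first copy would be added to the non-redundant count.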
Oh, and thanks for forgiving me... it's really not easy being French every day, lol. Just kidding.