Please explain how many duplicates you will have. One way you can do it is to put each String into a Set<String> and work out the intersections of those Sets. But I am a bit worried that you will end up with 1,000,000 Sets in memory simultaneously, which will be expensive in terms of memory consumption and performance. So it is quite likely there will be better ways to do it.
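As a rough illustration of the Set idea above, here is a minimal sketch of intersecting two Sets with `retainAll`. The class name, method name and sample values are made up for the example; nothing is known yet about the real data.

```java
import java.util.HashSet;
import java.util.Set;

public class SetIntersectionSketch {

    // Returns the elements present in both sets, leaving the inputs unmodified.
    static Set<String> intersection(Set<String> first, Set<String> second) {
        Set<String> result = new HashSet<>(first); // copy, so 'first' is not changed
        result.retainAll(second);                  // keep only what 'second' also contains
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical values standing in for two rows of the input.
        Set<String> rowA = new HashSet<>();
        rowA.add("alpha");
        rowA.add("beta");

        Set<String> rowB = new HashSet<>();
        rowB.add("beta");
        rowB.add("gamma");

        System.out.println(intersection(rowA, rowB)); // prints [beta]
    }
}
```

Doing this pairwise across a million Sets would be very slow, which is part of why a different approach is probably needed.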
Are those Strings common words? If so, you can probably save space by interning every single String as soon as you read it from the database.
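To show what interning does, here is a small sketch. The class name and the sample value are invented for the example. Two separately created Strings with the same content are different objects until `String.intern()` maps them to one shared instance.

```java
public class InternSketch {

    // True if both arguments resolve to the same pooled instance after interning.
    static boolean sharedAfterIntern(String a, String b) {
        return a.intern() == b.intern();
    }

    public static void main(String[] args) {
        // 'new String' forces two distinct objects, much like values read
        // one at a time from a file or a database would be.
        String first = new String("ranch");
        String second = new String("ranch");

        System.out.println(first == second);                  // prints false
        System.out.println(sharedAfterIntern(first, second)); // prints true
    }
}
```

This only saves memory if the same values really do repeat a lot; for mostly unique Strings it just adds work.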
Are you reading from a CSV file or a database? It is probably better to create an SQL query to look for duplicates, but I can't think how at the moment.
And welcome to the Ranch.
posted 8 months ago
Thank you for your time.
I really don't know how many duplicates there may be. Right now I am reading a CSV file, but I can load it into MySQL.
But how can I write a query for this when the duplicates are not in the same column and I do not know where they are?
Those are not common words.
I thought maybe I could manage it with a NoSQL database like MongoDB.