Introduction
As a new member of this forum I must first say that it seems you have quite a professional community going on here!
So, after searching the web for answers (including this forum), I decided to ask you directly for help on a Hadoop topic.
Before I go any further, I must say I'm a beginner in Hadoop and Big Data, which didn't stop my company from giving me an important project to handle.
For security reasons (imposed by my employer), I cannot share all the details of my work or other specific technical details. But if finding the help I need depends on those details, I might make an exception or two (just don't tell my boss...).
Environment & Problem Description
I work in a company where the Engineering Department produces an amazing number of CAD files (Computer-Aided Design). Over the years we have ended up with hundreds of thousands, if not millions, of files hosted on different filer systems. The engineers frequently need to access those files to modify, evolve, or consult the information inside. The problem is that even though an engineer knows precisely the name of the file they want, it takes quite a while (sometimes more than an hour) for the filer system to actually find it and send it back to the engineer's PC. That is because no indexing system exists on the filer: it tests every single inode until the correct one is found. The files are not very big (a few dozen MB each), but there are so many of them...
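To make the bottleneck concrete, here is a minimal sketch (in Python, with hypothetical names; this is an illustration, not our actual setup) of the difference between what the filer does today and what any indexing layer would do: a full walk of the tree on every request versus a one-time scan that afterwards answers lookups in constant time.

```python
import os
import tempfile

def linear_lookup(root, name):
    # What the filer effectively does today: walk every entry
    # under root until the requested file name is found.
    for dirpath, _dirnames, filenames in os.walk(root):
        if name in filenames:
            return os.path.join(dirpath, name)
    return None

def build_index(root):
    # What an indexing layer would do once, up front:
    # map each file name to its full path.
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for f in filenames:
            index[f] = os.path.join(dirpath, f)
    return index

# Demo on a throwaway directory tree (hypothetical file names).
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "projects", "cad"))
target = os.path.join(root, "projects", "cad", "part_42.dwg")
open(target, "w").close()

index = build_index(root)  # one-time cost, amortised over all lookups
print(linear_lookup(root, "part_42.dwg") == target)  # full walk per request
print(index["part_42.dwg"] == target)                # single dict lookup
```

With millions of files, the walk-per-request approach scales linearly with the file count, while the index answers each request in roughly constant time; that is the gap the engineers are feeling.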
So the project I've been given is to study whether Hadoop could help us index those files and send them to the engineers faster.
The Question(s)
Given that Hadoop has its own file system (HDFS), importing the data into Hadoop would double the disk space used. From what I understood, Hadoop can skip this step if the data is hosted on certain Linux file systems. The only problem is that I don't think one can install Hadoop on top of a filer system. Does anybody know whether that is even possible?
Whatever the answer to my previous question, the main question I would like to ask is the following.
The only need I have is to index that data. Once the data is indexed, there will be no data manipulation or processing done to it through Hadoop. The data is there only to be found very fast and sent back to a client PC. From my understanding, Hadoop is designed for data processing: it is made to create new "result" files from existing ones, not to serve back the data it already hosts. Would you agree with this statement?
All in all, should one use Hadoop to index this kind of data?
Would Hadoop do a better job at indexing files than other products?
What other products would you advise me to look into to solve this problem?
If more details are needed for you to form an opinion, please let me know and I'll give as many as possible.
Thank you in advance for your time and answers!
Any opinion is greatly appreciated!