I need some information to make a decision that matters a lot to me. I know this is a
bit lengthy and I request you to read through it.
Currently I am working as an application developer, using Java and
frameworks like Spring and Hibernate. Sometimes I also have to write custom frameworks
for specific purposes. I like designing applications, algorithms and frameworks more than
just learning an existing framework and using it. Yes, I do use and love Vim as my primary
editor, and I love to code.
Recently my boss offered me a position in which I will have to work with big data.
Below are the prerequisite skills for the position:
Hands-on experience in Core Java, Unix/Linux and SQL, and good analytical skills to grasp
and apply the concepts in Hadoop.
- Mandatory: Core java, Unix / Linux, SQL
- Good to have: Python, Spring / .Net / C++
- Highly desirable: Linux administration, Big Data Skills (Hadoop, HBase, Cassandra,
Spark, Splunk, Marklogic, MongoDB etc)
I have an intermediate understanding of SQL, Python, Unix/Linux. From a quick
search I have found that big data is about huge volumes of data and Hadoop is a framework
for handling such data.
Having read what I enjoy doing, do you guys think I will be comfortable with
big data? Will I get to design some framework / algorithm or write code from
scratch? Is big data only about data analysis and making sense of the data to a business?
Any light in this matter would be of immense help.
I see big data as more about organizing data through ETL (extract/transform/load) and data warehousing. Java is indeed the foundation for learning frameworks like Hadoop/HBase/Hive/MongoDB etc. Most of these (at least the ones I hear about often) are Java based.
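To make the ETL idea concrete, here is a toy sketch in plain Python. Everything in it (the field names, the in-memory "warehouse" list) is illustrative, not any particular tool's API; real ETL jobs would read from files or databases and load into a warehouse system.

```python
import csv
import io

# "Extract": read raw rows from a CSV source (here an in-memory string).
raw = io.StringIO("name,amount\nalice,10\nbob,not_a_number\ncarol,5\n")
rows = list(csv.DictReader(raw))

# "Transform": clean and filter -- drop rows whose amount isn't numeric.
def transform(row):
    try:
        return {"name": row["name"].title(), "amount": int(row["amount"])}
    except ValueError:
        return None  # bad record, skipped

clean = [r for r in (transform(row) for row in rows) if r is not None]

# "Load": here we just append to an in-memory "warehouse" table.
warehouse = []
warehouse.extend(clean)
print(warehouse)
```

The shape is always the same regardless of scale: pull raw records in, normalise and reject the bad ones, write the result somewhere queryable.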
As for switching, I can't say it's a 100% "switch". At some point there will be custom coding and/or in-house customization or similar to get things done.
For big data there are the 5 Vs: volume, velocity, variety, veracity, value.
- volume = the amount of data
- velocity = the time dimension of data
- variety = the different types of data
- veracity = the messiness of data
- value = the usefulness of data
Big data is a very broad term. Everyone is calling their project big data nowadays. If they think they are handling a lot of data, they call it Big Data. If they are using technology (for example, a NoSQL database) that someone else who is doing Big Data is using, then they call their project Big Data, never mind that they don't need that technology. As far as I can tell, people are tired of fighting with Oracle RDBMS, so they want to do Big Data. We should just call it ND: No Database. The term is becoming meaningless, like "The Cloud". Now that everyone is on the cloud, they can all do big data. Next they will do super slides, or something like that.
Seriously! I sound like someone who makes fun of technology he hasn't used, but that's not true. I've designed an ultra-scalable application that runs on the Amazon cloud. I started 5 years ago, when Big Data was just starting up. I've been doing Big Data since before Big Data was cool. Now everyone is Big Data, there are like 20 new technologies to learn, and everyone is coming up with BS like the 3 V's. Also, the Big Data technology stack has not been standardized yet. There is a huge mix of technologies that can come together in a Big Data project. Just knowing that it's a Big Data project doesn't tell you much. Go to two different companies, and what they are doing with their Big Data projects will be completely different.
The only thing it tells you is that you might have to learn something new. So, if you are going into this project, be ready to learn something new. That might be a good or a bad thing depending on your own personality. You might try asking what your exact role is going to be, but if it's a completely new project, they might not have decided on their tech stack yet, and might still be figuring out the structure of the team.
Jayesh is right: "Big data" is as much a marketing buzzword as a meaningful description of a particular field of work. But it is also a rapidly growing area and there are lots of interesting new technologies and applications emerging in the real world. I reckon there will be plenty of real opportunities in "Big Data" over the next few years.
I've recently moved from Oracle/PostgreSQL database application development (with some Java) into a pilot project looking at "Big Data", and I'm having to learn some of these technologies pretty fast too, which is a lot of fun (mostly...), so here's my perspective on some of the things you mentioned.
Python is easy to get started with and a really useful general purpose programming/scripting language. It seems to be widely used in many areas of data analysis and big data, partly because it's easier for "data scientists" to use than Java, and partly because you can develop prototypes or throwaway scripts for things like data-cleansing really easily. It also has lots of powerful libraries for serious data analysis and specialist purposes such as machine learning. There are also streaming Python APIs for some of the Big Data tools like Hadoop (via Hadoop streaming) and Apache Spark.
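As an example of the kind of throwaway data-cleansing script Python makes easy, here is a few lines that normalise a messy list of names. The data is made up; the point is how little ceremony this takes compared to writing the same thing in Java.

```python
# Throwaway cleanup: normalise case and whitespace, drop empties and duplicates.
messy = ["  Alice ", "BOB", "alice", "", "  ", "Carol\t"]

seen = set()
clean = []
for s in messy:
    s = s.strip().lower()      # normalise whitespace and case
    if s and s not in seen:    # skip blanks and already-seen names
        seen.add(s)
        clean.append(s)

print(clean)  # -> ['alice', 'bob', 'carol']
```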
Hadoop is a total nightmare to install/administer, but it's interesting from an application development perspective. It's based on distributed data/processing, using MapReduce jobs written e.g. in Java to process data in parallel on multiple server nodes. There is a scripting language called Pig, which is widely used, while Hive provides a SQL-like API for accessing data in the Hadoop file system. The easiest way to get started with Hadoop is via one of the pre-packaged VMs e.g. there is a short free introductory course in Hadoop with Cloudera at Udacity.
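To see what the MapReduce model actually does without installing anything, here is an in-memory word-count sketch in plain Python. The `mapper`/`shuffle`/`reducer` names are mine, not a Hadoop API; on a real cluster Hadoop runs the map and reduce phases in parallel across nodes and does the shuffle (grouping by key) between them.

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for each word on the line.
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle phase: group all values by key (Hadoop does this between phases).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce phase: sum the counts for each word.
    return (key, sum(values))

lines = ["big data big deal", "data is data"]
mapped = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(counts)  # -> {'big': 2, 'data': 3, 'deal': 1, 'is': 1}
```

A Java MapReduce job, a Pig script, or a Hive query over the same data would all boil down to this same map/group/reduce shape.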
However, there seems to be a trend away from relatively low-level Hadoop coding, e.g. in Pig or Java, and towards the use of more abstract libraries, e.g. Cascading, that allow you to code your processing at a higher level, perhaps using other languages such as Scala (Scalding) or Clojure (Cascalog).
Right now, the Apache Spark project is also getting a lot of attention as it offers an in-memory alternative to Hadoop's MapReduce approach, but still allows you to do distributed processing etc. It's written in Scala but there are Java and Python APIs as well. I think this looks really interesting.
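The chained-transformation style that Spark's RDD API encourages can be sketched with plain Python, no cluster needed. This toy uses generators and `functools.reduce` as stand-ins; a real PySpark job would build an RDD with `sc.parallelize` and chain `.filter`, `.map` and `.reduce` on it, with each step distributed across the cluster.

```python
from functools import reduce

data = range(1, 11)

# Like rdd.filter(...).map(...) -- keep the evens, square them (lazily).
squared_evens = (x * x for x in data if x % 2 == 0)

# Like rdd.reduce(...) -- combine all remaining values into one result.
total = reduce(lambda a, b: a + b, squared_evens)

print(total)  # 4 + 16 + 36 + 64 + 100 = 220
```

The appeal of Spark is that this same declarative pipeline runs distributed and (largely) in memory, which is where the speedup over disk-based MapReduce comes from.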
NoSQL is a parallel strand to Big Data and Hadoop e.g. some people are using Hadoop for batch processing and NoSQL for interactive processing with large volumes of data, while other poeple are using NoSQL instead of/alongside their traditional relational DBs. There are lots of different NoSQL databases and they each have different characteristiscs which make them useful for different purposes. Check out Seven Databases In Seven Weeks for a good introduction to various NoSQL databases. MongoDB seems to be the most widely used NoSQL database and is very easy to get started with e.g. check out the free online courses from MongoDB.
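To illustrate what makes a document database like MongoDB feel different from a relational one, here is a made-up "order" document in plain Python (the field names are invented for illustration, not a MongoDB schema). One nested document holds what a relational design would normalise into separate orders, customers and order_items tables.

```python
import json

# A "document" as a document database would store it: one nested,
# schemaless record instead of rows joined across several tables.
order = {
    "_id": "order-1001",
    "customer": {"name": "Alice", "city": "Leeds"},
    "items": [
        {"sku": "A42", "qty": 2, "price": 9.99},
        {"sku": "B17", "qty": 1, "price": 4.50},
    ],
}

# No joins needed: everything about the order is already in one place.
total = sum(item["qty"] * item["price"] for item in order["items"])
print(json.dumps({"order": order["_id"], "total": round(total, 2)}))
```

The trade-off, roughly, is read convenience and flexible schemas versus the cross-record consistency and ad-hoc joins a relational database gives you, which is why each NoSQL database suits some workloads and not others.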
Finally, there seems to be a lot of positive noise around Scala as the language of choice for Big Data in the "enterprise", and a broader awareness of the benefits of functional programming for Big Data e.g. see Dean Wampler's talk on Copious Data, the "Killer App" for Functional Programming.
So if some of these topics sound interesting to you, and you like learning new stuff, then why not give it a go?