Why does the Java Spark word count program require creating an RDD as an additional step, unlike in Scala?

 
Monica Shiralkar
Ranch Hand
In Java, for the Spark word count, one has to first create an RDD, as shown in the code below:
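(The original code block did not survive; below is a sketch of the standard Spark Java word count it presumably resembled, assuming sc is an existing JavaSparkContext and using placeholder hdfs:// paths:)

import java.util.Arrays;
import scala.Tuple2;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaPairRDD;

// Read the input file into an RDD of lines.
JavaRDD<String> textFile = sc.textFile("hdfs://...");

// Split each line into words, pair each word with 1, then sum the counts per word.
JavaPairRDD<String, Integer> counts = textFile
   .flatMap(s -> Arrays.asList(s.split(" ")).iterator())
   .mapToPair(word -> new Tuple2<>(word, 1))
   .reduceByKey((a, b) -> a + b);

counts.saveAsTextFile("hdfs://...");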
Whereas in Scala we can write the word count program as below, without first creating an RDD:
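(Again a sketch, this time of the standard Scala version, assuming sc is an existing SparkContext:)

// Read the input, split lines into words, pair each word with 1, sum per word.
val textFile = sc.textFile("hdfs://...")
val counts = textFile
   .flatMap(line => line.split(" "))
   .map(word => (word, 1))
   .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")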
Why does Java require the additional step of first creating an RDD?

Thanks
 
Marshal
What's an RDD?

Too hard a question for the “Beginning” forum: moving.
 
Piet Souris
Saloon Keeper
My guess is that this 'Java for Spark' has not been updated for the latest Java versions (>= 10). I reckon that once it is, you could write something like
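(Piet's snippet was not preserved; from Stephan's reply below it evidently used Java 10's var, so presumably something along these lines, reusing the word count variables from above:)

// With Java 10+ local variable type inference, the JavaPairRDD<String, Integer>
// type no longer has to be written out on the left-hand side.
var counts = textFile
   .flatMap(s -> Arrays.asList(s.split(" ")).iterator())
   .mapToPair(word -> new Tuple2<>(word, 1))
   .reduceByKey((a, b) -> a + b);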
 
Stephan van Hulst
Saloon Keeper

Monica Shiralkar wrote: Why does Java require the additional step of first creating an RDD?


Both programs do the exact same thing: they both create a Resilient Distributed Dataset (RDD). The difference is just that in the Java program you explicitly declare the type of the variable that holds the dataset, while in the Scala program the type of the variable is inferred.

As Piet has shown, if you use Java 10+ you can also use var to declare the variable; this has nothing to do with Apache Spark. However, I doubt you'll be able to use the collect() method, because an RDD is not a Stream.
 
Monica Shiralkar
Ranch Hand
Thanks. In the Java code we can see that an RDD is getting created. In the Scala code, is it not at all required to know that we are creating an RDD?
 
Stephan van Hulst
Saloon Keeper
Depends on what you find clearer.

I'd probably prefer to use the explicit type to avoid confusion with the Stream API.
 
Monica Shiralkar
Ranch Hand
Thanks. My question was wrong, as there is no difference and both are doing the same thing.

What is the reason for the Java code using the flatMap method, whereas the Scala code uses the map method?

Java code

textFile
   .flatMap(s -> Arrays.asList(s.split(" ")).iterator())
   .mapToPair(word -> new Tuple2<>(word, 1))

Scala code
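(The Scala snippet did not survive; presumably it was the matching fragment of the standard example:)

textFile
   .flatMap(line => line.split(" "))
   .map(word => (word, 1))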
 
Stephan van Hulst
Saloon Keeper
Honestly, you'd have to ask whoever wrote the code.
 
Monica Shiralkar
Ranch Hand
The questions I am trying to find answers for are: (1) Why does the Java code require the String array of split words to be converted into a List (using the asList method)? (2) Why does the Scala code not require mapToPair instead of the map method, as the Java code does? (3) Why does the Scala code not require new Tuple2(word, 1) instead of (word, 1)?
 
Stephan van Hulst
Saloon Keeper

Monica Shiralkar wrote: Why does the Java code require the String array of split words to be converted into a List (using the asList method)?


Because the flatMap() method requires a FlatMapFunction, which is a functional interface with a method that returns an Iterator. You can't get an Iterator from an array, only from an Iterable, such as a List.
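A minimal sketch of the shape involved, assuming Spark 2.x's Java API:

// FlatMapFunction<T, R> is a functional interface whose single method is
//    Iterator<R> call(T t) throws Exception
// An array has no iterator() method, but a List does, hence the asList wrapper.
FlatMapFunction<String, String> splitWords =
   s -> Arrays.asList(s.split(" ")).iterator();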

Why does the Scala code not require mapToPair instead of the map method, as the Java code does?


Because Scala has a concept known as 'implicit conversions'. An RDD of tuples is automatically converted so that you can call PairRDDFunctions on it. This is not possible in Java, so you need to tell Java explicitly that you want to work on an RDD of tuples after remapping the elements.
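For illustration, a sketch of the Scala side (assuming words is an RDD[String]):

// map yields an RDD[(String, Int)]; an implicit conversion in the RDD companion
// object (rddToPairRDDFunctions) wraps it in PairRDDFunctions, so reduceByKey
// is available without any mapToPair step.
val counts = words
   .map(word => (word, 1))
   .reduceByKey(_ + _)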

Why does the Scala code not require new Tuple2(word, 1) instead of (word, 1)?


Because tuples are a built-in type in Scala. When you write (a, b), it means the exact same thing as new Tuple2(a, b).
 
Monica Shiralkar
Ranch Hand
Thanks.
 
Monica Shiralkar
Ranch Hand

Stephan van Hulst wrote: Because Scala has a concept known as 'implicit conversions'. An RDD of tuples is automatically converted so that you can call PairRDDFunctions on it. This is not possible in Java, so you need to tell Java explicitly that you want to work on an RDD of tuples after remapping the elements.



Spark supports three languages: Java, Scala, and Python. Of these three, only Java requires mapToPair, whereas Python also uses the map method, like Scala. It means something like an implicit conversion must also be happening in the case of Python.
 
Stephan van Hulst
Saloon Keeper
I believe Python is duck-typed, which means that objects don't really have strict types and you can just call functions on them as long as they have certain properties. That means you don't have to convert the RDD to a specialized RDD of key-value pairs. If you call reduceByKey() on an object that is not an RDD of key-value pairs, it will probably cause an exception at runtime.
 
Monica Shiralkar
Ranch Hand
Thanks. I have understood that in Java we cannot transform to a Tuple (key-value pair) directly using map, and instead have to use the mapToPair method.

Stephan van Hulst wrote: Because the flatMap() method requires a FlatMapFunction, which is a functional interface with a method that returns an Iterator. You can't get an Iterator from an array, only from an Iterable, such as a List.



If the flatMap method requires a FlatMapFunction, then the flatMap method in Scala should require a FlatMapFunction too; but although we are using the flatMap method there as well, why are we not returning an Iterator there?
 
Stephan van Hulst
Saloon Keeper
Because TraversableOnce is more natural to use in Scala, and Array implicitly converts to TraversableOnce.

Seriously though, for questions about the design you should probably turn to the developers of the Spark API.
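For reference, Spark 2.x declares flatMap on the Scala RDD roughly as

def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]

// so the Array[String] returned by split(" ") is accepted directly:
textFile.flatMap(line => line.split(" "))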
 
Monica Shiralkar
Ranch Hand
I thought a good way to learn Scala is by doing a sample application. As Scala is used a lot in Spark, I thought of starting with the word count program and building a sample application to learn from. While doing word count, I had questions regarding my understanding of the word count ("hello world" of Spark) program. Thanks, some of the questions have been cleared up. I will put in the effort to understand the entire program completely.
 