
PairRDD: It’s simple to understand if you have ever worked with the MapReduce algorithm (many of the same ideas exist; the newer framework was created with optimization, performance and usability in mind). A PairRDD is an RDD of key-value pairs: each element is a Scala tuple with exactly two values, like (value1, value2), where value1 is the key and value2 is the value. A rich API is available for working with a PairRDD (paired data). You don’t have to write as much code as you do with MapReduce programming; just a few lines with the PairRDD API can accomplish the same thing in a distributed manner. In most places you will see the word count application used as the example of how 50 lines of MapReduce code can be rewritten with a Spark PairRDD in just 5 to 6 lines. Let’s discuss each PairRDD topic in detail from the MCSD certification perspective.
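
As a quick illustration, here is a minimal word count sketch using the PairRDD API; the input path and local master setting are placeholders, not from the original post:

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("WordCount").setMaster("local[*]"))
        val counts = sc.textFile("input.txt")        // "input.txt" is a placeholder path
          .flatMap(line => line.split("\\s+"))       // split each line into words
          .map(word => (word, 1))                    // build a PairRDD of (word, 1) tuples
          .reduceByKey(_ + _)                        // sum the counts for each key
        counts.collect().foreach(println)
        sc.stop()
      }
    }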

  1. Creating PairRDD: First of all, you need to understand the concept of a PairRDD and how it can be used. Once you understand this, learn how to convert a simple RDD into a PairRDD, and how data can be loaded directly as a PairRDD (a creation sketch follows this list). Similar to a simple RDD, a PairRDD also has actions and transformations. Also understand how serialization affects working with the keys and values of a PairRDD. Yes, you will get many questions around simple RDDs and PairRDDs; both combined will cover at least 30-35% of the questions. Most of the questions will be asked with a code snippet and sample data.
  2. PairRDD transformations: This is somewhat difficult to grasp, because reduce() on a simple RDD is an action, but on a PairRDD, reduceByKey() is a transformation. So how do you differentiate between transformations and actions? The concept remains the same: a transformation returns a new RDD, while an action returns the expected result. Transformations are evaluated lazily and only run when an action is applied to the RDD (see the reduceByKey sketch after this list).
  3. reduceByKey and groupByKey: You will certainly get questions based on these two PairRDD methods. So it is expected that you understand how these functions work with the data in an RDD, where and when data shuffling happens, what the issues with network I/O are, and which method should be preferred in which situation (the same sketch after this list contrasts the two).
  4. PairRDD functions: There are various functions available on a PairRDD for transformations and actions. You need to understand how to use them, for example join, leftOuterJoin, rightOuterJoin, union etc., and what role the key plays when a join operation is applied (see the join sketch after this list). Basically, you should have good exposure to the PairRDD API. You will get 4-5 questions covering both actions and transformations of PairRDD.
  5. Actions on PairRDD: Similarly, you should have good experience with the use of PairRDD actions (see the last sketch after this list). Certainly 2-3 questions from here.
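
For item 1, a minimal sketch of building a PairRDD from a simple RDD; the sample data is invented and sc is assumed to be an existing SparkContext:

    // An ordinary RDD of strings
    val lines = sc.parallelize(Seq("a,1", "b,2", "a,3"))

    // Convert it into a PairRDD by mapping each element to a two-value tuple
    val pairs = lines.map { line =>
      val Array(k, v) = line.split(",")
      (k, v.toInt)                              // (key, value) tuples make this a PairRDD
    }

    // keyBy is another way to derive a PairRDD from a simple RDD
    val byFirstChar = lines.keyBy(line => line.charAt(0))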
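
For items 2 and 3, a sketch contrasting reduceByKey and groupByKey, reusing the pairs RDD from the previous snippet. Both are transformations, so nothing runs until an action is applied:

    // reduceByKey combines values within each partition before shuffling,
    // so only partial sums travel over the network
    val summed = pairs.reduceByKey(_ + _)       // transformation: returns a new RDD

    // groupByKey shuffles every (key, value) pair across the network before
    // grouping, which is much heavier on network I/O for large data
    val grouped = pairs.groupByKey().mapValues(_.sum)

    // Only an action such as collect() triggers the actual computation
    summed.collect().foreach(println)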
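
For item 4, a small join sketch; the join is matched on the key of each tuple (sample data invented):

    val ages   = sc.parallelize(Seq(("alice", 30), ("bob", 25)))
    val cities = sc.parallelize(Seq(("alice", "Pune"), ("carol", "Delhi")))

    // Inner join keeps only keys present in both PairRDDs -> ("alice", (30, "Pune"))
    val inner = ages.join(cities)

    // Left outer join keeps all keys from the left side; missing right values become None
    val left = ages.leftOuterJoin(cities)       // includes ("bob", (25, None))

    // Union simply concatenates two RDDs of the same type (no key matching involved)
    val allAges = ages.union(sc.parallelize(Seq(("dave", 40))))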
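
For item 5, a few common PairRDD actions, again using the pairs RDD built above:

    // countByKey returns a local Map of key -> number of elements with that key
    val perKeyCounts: scala.collection.Map[String, Long] = pairs.countByKey()

    // collectAsMap brings the PairRDD to the driver as a Map (one value kept per key)
    val asMap: scala.collection.Map[String, Int] = pairs.collectAsMap()

    // lookup returns all values for a single key
    val aValues: Seq[Int] = pairs.lookup("a")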

Partitioning: A very important concept; you should be able to partition your data and understand how partitioning affects data processing and overall performance. For example, having more partitions gives you higher parallelism, but when data has to be shuffled across many partitions, the shuffle can kill performance. You should also know the different types of partitioner available. Yes, 2-3 questions on this section as well.
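
A minimal sketch of controlling partitioning with a HashPartitioner; the partition count of 8 is arbitrary, and pairs/sc are the same assumed objects as in the earlier sketches:

    import org.apache.spark.HashPartitioner

    // Repartition the PairRDD by key into 8 partitions
    val partitioned = pairs.partitionBy(new HashPartitioner(8))

    // Key-based operations like reduceByKey can reuse this partitioning
    // and avoid an extra shuffle
    val sums = partitioned.reduceByKey(_ + _)

    println(partitioned.partitioner)            // Some(HashPartitioner...) - partitioner is now set
    println(partitioned.getNumPartitions)       // 8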


  1. Apache Spark Professional Training with Hands On Lab Sessions 
  2. Oreilly Databricks Apache Spark Developer Certification Simulator
  3. Hortonworks Spark Developer Certification 
  4. Cloudera CCA175 Hadoop and Spark Developer Certification 

