HadoopExam Blogs

HadoopExam Learning Resources

Advanced Spark and MLlib: As you work more with the Spark framework, you will start using some advanced concepts. These are very helpful for writing efficient Spark programs. For example, do you know what broadcast variables and accumulators are in Spark, and how to use them? Can an RDD be broadcast directly, so that lookup data is available on each node of the Spark cluster? No, you need to convert the RDD into a Scala collection (for example, by collecting it to the driver) before broadcasting it. You will only get to know this if you have done some hands-on exercises; otherwise it is not that obvious. Spark developers have created a Machine Learning library (MLlib), in which well-known and widely used Machine Learning algorithms are already implemented. You should have some basic knowledge of Machine Learning and be able to identify which algorithms fall under supervised learning and which under unsupervised learning. You should know the difference between the two, what clustering and classification are, and have an understanding of recommendation engines. It is not expected that you are a Machine Learning expert. Let’s discuss each topic in a little more detail.

  1. Broadcast variable: Suppose you have two sets of data, one very small (it can fit into memory) and another very large, and you want to join them. To get good performance, you broadcast the small data set so that it is already available on each node. You must know which data can be broadcast and which should not. The main factor here is the size of the data: a data set small enough to fit comfortably in memory on every node is a good candidate for broadcasting.
  2. Accumulators: When you want to do some counting (like counters in the MapReduce framework), you can use the accumulators provided by the Spark framework. Tasks running on the worker nodes can only add to an accumulator; the final aggregated value can be read only at the driver. The most important concept with accumulators is that you should use them in actions only. Nobody is going to stop you from using accumulators in transformations, but they may not give correct results if a node crashes or if the same function is executed twice on different nodes for performance, because the update can then be applied more than once.
  3. Supervised vs. unsupervised learning: You must know the difference between the two. Supervised learning has a set of labeled data in which the expected output is already known, so there is something to supervise the outcome. In unsupervised learning there is no such labeled data and no known results.
  4. Classification: You will be given some data and need to classify it into pre-defined classes, e.g. an email spam filter.
  5. Clustering: Grouping of data, where you don’t know in advance which groups will be created when the algorithm is executed on the data.
  6. Recommendation: If you have purchased or viewed some products on a website, similar products will be recommended to you based on that.
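
To make the first two items concrete, here is a minimal sketch in plain Python. It does not use Spark at all; it only models the idea, so it can be run anywhere. The sample data, the Counter class and the function names are all made up for illustration. In PySpark the corresponding calls would roughly be rdd.collectAsMap() followed by sc.broadcast(...) for the lookup, and sc.accumulator(0) updated inside an action such as foreach for the counter.

```python
# Hypothetical sample data, invented for this sketch.
SMALL_LOOKUP = [("US", "United States"), ("IN", "India"), ("DE", "Germany")]
LARGE_PARTITIONS = [  # the "huge" data set, split into partitions
    [("order-1", "US"), ("order-2", "IN")],
    [("order-3", "XX"), ("order-4", "DE")],
]


class Counter:
    """Stand-in for an accumulator: tasks only add, the driver reads value."""

    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n


def join_partition(partition, lookup):
    """Join one partition against the in-memory lookup (no shuffle needed)."""
    return [(order_id, lookup[code])
            for order_id, code in partition
            if code in lookup]


def count_misses(partition, lookup, misses):
    """Models the body of an action: count records with no lookup match."""
    for _order_id, code in partition:
        if code not in lookup:
            misses.add(1)


def run_job():
    # The small data set is first turned into an ordinary collection (a dict),
    # which is the thing that would be broadcast to every node.
    lookup = dict(SMALL_LOOKUP)
    misses = Counter()
    joined = []
    for partition in LARGE_PARTITIONS:  # each loop pass models one task
        joined.extend(join_partition(partition, lookup))
        count_misses(partition, lookup, misses)
    return joined, misses.value  # only the "driver" reads the final count


if __name__ == "__main__":
    rows, missing = run_job()
    print(rows)
    print("records without a match:", missing)
```

Note that the counting is kept separate from the join on purpose: the join models a transformation, while the counting models an action, which is where accumulator updates belong.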

MLlib: You need to identify which library and API you will use for a particular Machine Learning algorithm.
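
As a study aid, the mapping from task to API can be written down as a small lookup table. The sketch below is plain Python and only stores class names as strings; it does not import Spark. The three tasks come from the list above, while the specific classes are examples chosen for illustration (each task has other algorithms as well), and what is available depends on your Spark version. The suggest function and the table name are hypothetical helpers.

```python
# Task -> (kind of learning, example DataFrame-based class, example RDD-based class).
# The classes are examples only, not a complete list.
MLLIB_CHOICES = {
    "classification": (
        "supervised",
        "org.apache.spark.ml.classification.LogisticRegression",
        "org.apache.spark.mllib.classification.NaiveBayes",
    ),
    "clustering": (
        "unsupervised",
        "org.apache.spark.ml.clustering.KMeans",
        "org.apache.spark.mllib.clustering.KMeans",
    ),
    "recommendation": (
        "collaborative filtering",
        "org.apache.spark.ml.recommendation.ALS",
        "org.apache.spark.mllib.recommendation.ALS",
    ),
}


def suggest(task):
    """Return the table entry for a task name, ignoring case and spaces."""
    key = task.strip().lower()
    if key not in MLLIB_CHOICES:
        raise KeyError("unknown task: " + task)
    return MLLIB_CHOICES[key]


if __name__ == "__main__":
    for name in ("Classification", "clustering", "recommendation"):
        kind, df_class, rdd_class = suggest(name)
        print(name, "|", kind, "|", df_class, "|", rdd_class)
```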


