HadoopExam Blogs

HadoopExam Learning Resources

DataFrame: One of the most useful abstractions offered to the Spark programmer. You can convert an RDD to a DataFrame, and a DataFrame can be converted back to an RDD. Once you have created a DataFrame (you can use Scala case classes to assign a schema to your RDD), you can run various SQL queries on it, which makes your overall work very simple. You should be able to convert an RDD to a DataFrame and vice versa, and know how to register DataFrames as temporary tables/views so that you can query them. Another advantage of using DataFrames is that Spark optimizes your queries before execution.

  1. Creating a DataFrame: As mentioned, you need to convert an existing RDD into a DataFrame using reflection. This means you should understand what case classes in Scala are and how to use them to convert an RDD to a DataFrame. You also need to be able to save query output back to the local or HDFS file system (see the first sketch after this list).
  2. Running queries: Once you have created a DataFrame, write code that registers it as a temporary table/view. Once registered, you can apply various queries to it. If you know basic SQL (select, where, joins, unions, subtract, etc.), it will be quite easy for you to work with Spark DataFrames. SQL is not the only option: DataFrames also come with a convenient API, and many people prefer the API over SQL, so you should be well versed in both approaches (see the second sketch after this list).
  3. UDFs: If you are familiar with SQL, you know that there are built-in functions such as count(), max(), and min(). These cover the common cases, but whenever you have a custom requirement that is not easily solved by the existing functions, you can create your own. Such functions are known as User Defined Functions (UDFs), and you must register them before you can use them in your SQL queries (see the third sketch after this list).
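A minimal sketch of the reflection-based conversion, assuming Spark 2.x with a SparkSession; the Employee case class, the sample rows, and the output path are illustrative only, not part of any exam syllabus.

import org.apache.spark.sql.SparkSession

// Case class defined at top level: Spark infers the DataFrame schema
// from its fields via reflection.
case class Employee(id: Int, name: String, salary: Double)

object RddToDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RddToDataFrame")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // enables the toDF() conversion on RDDs

    // RDD of case-class instances -> DataFrame
    val rdd = spark.sparkContext.parallelize(Seq(
      Employee(1, "Amit", 55000.0),
      Employee(2, "Rahul", 62000.0)))
    val df = rdd.toDF()
    df.printSchema()

    // DataFrame -> RDD of Rows, for the reverse direction
    val rowsRdd = df.rdd

    // Save output to a local or HDFS path (path is an assumption)
    df.write.mode("overwrite").json("/tmp/employees")
  }
}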
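Continuing from the sketch above (same spark session and df), the second sketch shows one query written both as SQL against a temporary view and with the DataFrame API. It assumes Spark 2.x, where createOrReplaceTempView supersedes the older registerTempTable; the view name and salary threshold are illustrative.

import org.apache.spark.sql.functions.col

// Register the DataFrame as a temporary view so SQL can see it
df.createOrReplaceTempView("employees")

// SQL style
val highPaidSql = spark.sql(
  "SELECT name, salary FROM employees WHERE salary > 60000")

// Equivalent DataFrame API style
val highPaidApi = df.filter(col("salary") > 60000).select("name", "salary")

highPaidSql.show()
highPaidApi.show()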
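The third sketch covers defining and registering a UDF, again continuing from the first sketch; the function name to_upper is a made-up example. Registering by name makes the function callable from SQL, while wrapping it with udf() makes it usable from the DataFrame API.

import org.apache.spark.sql.functions.{col, udf}

// A custom function that no built-in function covers (illustrative)
val toUpper = (s: String) => s.toUpperCase

// Register by name so it can be called from SQL queries
spark.udf.register("to_upper", toUpper)
spark.sql("SELECT to_upper(name) AS name_upper FROM employees").show()

// Wrap with udf() to call it through the DataFrame API instead
val toUpperUdf = udf(toUpper)
df.select(toUpperUdf(col("name")).as("name_upper")).show()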

Re-partitioning: We have already discussed partitioning in Spark and its impact on performance. Sometimes, given that impact, you will want to re-partition a DataFrame; you must know how to do that and what re-partitioning costs, as in the sketch below.
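A brief sketch of the two standard ways to change a DataFrame's partition count, continuing with the df from the earlier sketches; the partition counts are illustrative.

// Inspect the current partition count
println(df.rdd.getNumPartitions)

// repartition(n) performs a full shuffle and can increase or decrease
// the partition count; useful before a wide operation or a large write
val widened = df.repartition(100)

// coalesce(n) only merges existing partitions (no full shuffle), so it
// is cheaper but can only reduce the partition count
val narrowed = widened.coalesce(10)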


  1. Apache Spark Professional Training with Hands On Lab Sessions 
  2. Oreilly Databricks Apache Spark Developer Certification Simulator
  3. Hortonworks Spark Developer Certification 
  4. Cloudera CCA175 Hadoop and Spark Developer Certification 

Watch the training video below.
