
Spark Streaming: This is one of the reasons the Apache Spark framework became so popular in the BigData world. There were already many frameworks for processing continuous streams of data, but they were not convenient to work with, did not deliver the expected performance, and struggled with huge data volumes. In addition, the same event/message should not be processed twice. Another beautiful feature of Spark Streaming is the window operation on continuous data. All of this complexity is implemented inside the Spark framework, so you essentially need to know your business logic and the DStream API; handling the stream in a fault-tolerant manner is the Spark framework's responsibility. Let's discuss the topics which will be asked in the exam.

  1. Spark Streaming Architecture: You have to understand the concept of micro-batching in Spark Streaming, which cuts the live stream into small batches and exposes them as DStream objects; a DStream has its own transformation and action APIs. You should know the minimum batch interval (window size) you can configure in Spark Streaming, how to apply Spark functions on a DStream, and how to do analytics on a continuous stream of data, which is quite challenging (a batch-interval sketch follows after this list).
  2. DStream API: Live data arrives as small batches, and a DStream is nothing but a sequence of batches of RDDs; it represents a continuous stream of data. An RDD is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. (Learn Spark Streaming in detail from here.) A Discretized Stream (DStream), the basic abstraction in Spark Streaming, is a continuous sequence of RDDs (of the same type) representing a continuous stream of data. DStreams can either be created from live data (such as data from TCP sockets, Kafka, Flume, etc.) using a StreamingContext, or they can be generated by transforming existing DStreams using operations such as map, window and reduceByKeyAndWindow. While a Spark Streaming program is running, each DStream periodically generates an RDD, either from live data or by transforming the RDD generated by a parent DStream. The DStream class contains the basic operations available on all DStreams, such as map, filter and window. In addition, PairDStreamFunctions contains operations available only on DStreams of key-value pairs, such as groupByKeyAndWindow and join; these operations become automatically available on any DStream of pairs (e.g., DStream[(Int, Int)]) through implicit conversions. Internally, a DStream is characterized by a few basic properties: a list of other DStreams that it depends on, a time interval at which it generates an RDD, and a function used to generate an RDD after each time interval (see the transformation sketch after this list).
  3. DStream stateful operations: Apache Spark 1.6 improved support for stateful stream processing with a new API, mapWithState. The new API has built-in support for the common patterns that previously required hand-coding and optimization when using updateStateByKey (e.g., session timeouts). As a result, mapWithState can provide up to 10x higher performance compared to updateStateByKey. One of the most powerful features of Spark Streaming is its simple API for stateful stream processing: programmers only have to specify the structure of the state and the logic to update it, and Spark Streaming takes care of distributing the state across the cluster, managing it, transparently recovering from failures, and giving end-to-end fault-tolerance guarantees. While the older updateStateByKey operation also allows such stateful computations, the newer mapWithState operation makes the logic easier to express (a mapWithState sketch follows after this list).
  4. Actions on DStream: Just as you can apply actions to an RDD, you can apply output operations to a DStream. Once you apply an output operation, it initiates the computation on the DStream data. For example, you can save the DStream data using saveAsTextFiles and similar APIs, or process each batch with foreachRDD (see the output-operation sketch after this list).
  5. Window operations: We have already discussed the windowing functionality in Apache Spark a little. Suppose you are receiving a continuous stream of data every 100 milliseconds and you want to apply some calculation on all the data received in each 1-second window (like calculating the average bid price on the HadoopExam stock ticker). In this case you define a window length as a time duration, which must be a multiple of the batch interval; with a 100 ms batch interval, a 1-second window effectively covers the last 10 micro-batches. You also define a slide interval that controls how often the windowed computation runs. You need to understand this in depth, along with the various API methods you can use, such as countByWindow, reduceByWindow, countByValueAndWindow, etc. (a reduceByKeyAndWindow sketch follows after this list).
  6. Fault tolerance and exactly-once processing: When you are processing a continuous stream of data, it is very challenging to make sure that not a single bit of data is lost and that no data is processed twice. The Spark Streaming framework handles this for you, but you need to know how Spark takes care of the problem (checkpointing, write-ahead logs, and replayable sources such as Kafka). You will get questions based on this as well (a checkpointing sketch follows after this list).
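
Below are a few minimal Scala sketches for the topics above; the application names, hosts, ports, and paths are placeholders, not part of the syllabus. The first sketch shows how the batch interval passed to a StreamingContext defines the micro-batch size.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MicroBatchSketch").setMaster("local[2]")

    // One-second micro-batches: every second Spark Streaming cuts the incoming
    // data into a new RDD and hands it to the DStream pipeline. Intervals much
    // below a few hundred milliseconds are generally discouraged because of
    // per-batch scheduling overhead.
    val ssc = new StreamingContext(conf, Seconds(1))

    // Hypothetical source: a plain-text socket on localhost:9999.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.print() // an output operation, so the job has something to compute

    ssc.start()
    ssc.awaitTermination()
  }
}
```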
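
A sketch of basic DStream transformations, assuming the `lines` DStream from the previous sketch. Each transformation produces a new DStream whose RDDs are derived from the parent DStream's RDDs once per batch interval; the pair operations become available through the implicit conversion to PairDStreamFunctions.

```scala
// Assumes `lines: DStream[String]` from the StreamingContext sketch above.
val words  = lines.flatMap(_.split("\\s+"))   // DStream[String]
val pairs  = words.map(word => (word, 1))     // DStream[(String, Int)]
val counts = pairs.reduceByKey(_ + _)         // pair operation via implicit conversion
counts.print()
```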
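
A sketch of stateful processing with mapWithState, assuming the `pairs` DStream above: the mapping function keeps a running count per word, and the optional timeout covers the session-timeout pattern mentioned earlier. mapWithState requires a checkpoint directory; the path here is a placeholder.

```scala
import org.apache.spark.streaming.{Minutes, State, StateSpec}

ssc.checkpoint("hdfs:///tmp/state-checkpoint") // placeholder path, required for state

// For each key, combine the new value (if any) with the previously stored state.
val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (word, sum)
}

val runningCounts =
  pairs.mapWithState(StateSpec.function(mappingFunc).timeout(Minutes(30)))
runningCounts.print()
```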
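
A sketch of DStream output operations on the `counts` DStream: saveAsTextFiles writes one directory per batch, and foreachRDD lets you run arbitrary RDD actions on each batch. Both are output operations that trigger the actual computation.

```scala
// Writes a directory such as <prefix>-<batch time> for every micro-batch.
counts.saveAsTextFiles("hdfs:///tmp/wordcounts") // placeholder prefix

// Run an arbitrary RDD action on every batch; here, log a small sample.
counts.foreachRDD { rdd =>
  rdd.take(10).foreach(println)
}
```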
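
A sketch of a windowed aggregation with reduceByKeyAndWindow on the `pairs` DStream: word counts over the last 10 seconds, recomputed every 2 seconds. Both durations must be multiples of the batch interval chosen above.

```scala
val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b, // combine counts inside the window
  Seconds(10),               // window length
  Seconds(2)                 // slide interval
)
windowedCounts.print()
```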
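
Finally, a sketch of driver fault tolerance via checkpointing: StreamingContext.getOrCreate rebuilds the context (including window and state data and pending batches) from the checkpoint directory after a driver restart, instead of creating a fresh one. End-to-end exactly-once delivery additionally depends on replayable sources and idempotent or transactional output, which is worth being able to explain.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoint" // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("FaultTolerantSketch")
  val ssc  = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)
  // ... define sources, transformations and output operations here ...
  ssc
}

// Recover from the checkpoint if one exists, otherwise build a new context.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```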

Oreilly Databricks Spark Certification | Hortonworks HDPCD Spark Certification | Cloudera CCA175 Hadoop and Spark Developer Certifications | MCSD: MapR Certified Spark Developer

  1. Apache Spark Professional Training with Hands On Lab Sessions 
  2. Oreilly Databricks Apache Spark Developer Certification Simulator
  3. Hortonworks Spark Developer Certification 
  4. Cloudera CCA175 Hadoop and Spark Developer Certification 

