Mastering Apache Spark

Spark is fast emerging as an alternative to Hadoop MapReduce because of its speed. Spark programming is often necessary for complex processing loads involving huge data volumes that Hadoop cannot process in a timely manner. Its in-memory computing engine makes Spark the platform of choice for real-time analytics, which requires high-speed data ingestion and processing within seconds. A whole new generation of analytics applications is now emerging to process geo-location data, streaming web events, sensor data, and data received from mobile and wearable devices.

Register Now
or call us now on +91 9850033661


The program is designed to provide an overall conceptual framework and common design patterns. Key concepts in each area are explained and working code is provided. Participants will be able to run the examples and are expected to understand the code. While key concepts are explained, a detailed code walk-through is usually not feasible in the interest of time. The code is written in Java and Scala, so prior knowledge of these languages will help in understanding the code-level implementation of key concepts.

Training Highlights

Training Goals

To provide a thorough understanding of the concepts of in-memory distributed computing and the Spark API, so that participants can develop Spark programs of moderate complexity.


Day 1

  • Introduction to Big Data Analytics
    • What is Big Data? – The 3V Paradigm
    • Limitations of Conventional Technologies
    • Essentials of Distributed Computing
    • Introduction to Hadoop & Its Ecosystem
  • Spark Essentials
    • Spark Background & Overview
    • Spark Architecture & RDD Basics
    • Common Transformations & Actions
  • Setting Up a Spark Cluster
    • Exercise: Installing & Configuring a Spark 1.6.2 Cluster
    • Exercise: Simple Word Count Program Using Eclipse
    • Exercise: Analysing Stock Market Data
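The Day 1 word-count exercise can be sketched roughly as below. This is a minimal outline, not the course's exact code: it assumes Spark running in local mode, and the input/output paths passed on the command line are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark on all local cores -- no cluster needed for the exercise
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    val counts = sc.textFile(args(0))                        // one record per line
      .flatMap(_.toLowerCase.split("\\W+"))                  // split lines into words
      .filter(_.nonEmpty)
      .map(word => (word, 1))                                // pair RDD: (word, 1)
      .reduceByKey(_ + _)                                    // per-key aggregation

    counts.saveAsTextFile(args(1))
    sc.stop()
  }
}
```

The same `flatMap`/`map`/`reduceByKey` pipeline carries over directly to the stock-market exercise, with the ticker symbol as the key.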

Day 2

  • Working With Pair RDDs
      • Concepts of Key/Value Pair
      • Per Key Aggregations using Map, Shuffle & Reduce
      • Transformations & Actions on Pair RDDs
      • Exercise: Finding Company-wise Total Dividend Paid
      • Exercise: Determining the Top 5 Dividend-Paying Companies
      • Two Pair RDD Transformations – Joins in Spark
      • Exercise: Correlating Price Movement to Dividend Payment
  • Basic Input / Output
      • Various Sources of RDDs
      • Exercise: Loading & Saving Data from Flat Files
      • Introduction to HDFS
      • Exercise: Setting Up an HDFS Cluster
      • Loading & Saving Files in HDFS
  • Deploying On a Cluster
      • Spark Submit & Job Configuration
      • Job Execution Lifecycle
      • Introduction to YARN
      • Exercise: Deploying Spark Applications on a YARN Cluster
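The Day 2 pair-RDD topics fit together as in the sketch below: a per-key aggregation for total dividends, a `takeOrdered` action for the top 5, and a two-pair-RDD join against prices. The `dividends.csv` / `prices.csv` file names and the "symbol,value" line format are assumed for illustration only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DividendAnalysis {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DividendAnalysis"))

    // Parse a hypothetical "symbol,value" CSV line into a key/value pair
    def toPair(line: String): (String, Double) = {
      val fields = line.split(",")
      (fields(0), fields(1).toDouble)
    }

    val dividends = sc.textFile("dividends.csv").map(toPair)
    val prices    = sc.textFile("prices.csv").map(toPair)

    // Per-key aggregation (map -> shuffle -> reduce): total dividend per company
    val totalDividend = dividends.reduceByKey(_ + _)

    // Action: top 5 dividend-paying companies, ordered by total descending
    val top5 = totalDividend.takeOrdered(5)(Ordering.by { case (_, total) => -total })
    top5.foreach(println)

    // Two-pair-RDD transformation: inner join on the company symbol
    val joined = totalDividend.join(prices)   // RDD[(symbol, (totalDividend, price))]
    joined.saveAsTextFile("joined-output")

    sc.stop()
  }
}
```

Once packaged as a jar (name here is hypothetical), the same program can be deployed on a YARN cluster, per the Day 2 deployment topic: `spark-submit --class DividendAnalysis --master yarn dividend-analysis.jar`.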

Day 3

  • SparkSQL
        • SparkSQL Basics & Architecture
        • Exercise: Creating DataFrames from JSON, CSV & MySQL Tables
        • Creating Temporary Tables & Querying Them with HQL
        • Exercise: Analyzing and Summarizing Weather Data Using HQL
        • Storing & Caching Tables
        • Exercise: Setting Up a Metastore Service and Querying Tables Using External JDBC Applications
  • Hive Integration
        • Hive Formats & SerDes
        • Exercise: Working with Hive Tables & Formats – ORC & XML
        • Exercise: Creating & Using User Defined Functions(UDF)
        • Using SparkSQL shell to query Hive Tables
        • Using Beeline Shell
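The Day 3 SparkSQL flow, sketched with the Spark 1.6-era `SQLContext` API to match the course's Spark 1.6.2: load JSON into a DataFrame, register it as a temporary table, then summarize and cache it. The `weather.json` path and the `station`/`month`/`tempC` field names are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object WeatherSummary {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WeatherSummary"))
    val sqlContext = new SQLContext(sc)

    // Hypothetical input: one JSON object per line, e.g.
    // {"station": "PUNE", "month": "JAN", "tempC": 31.2}
    val weather = sqlContext.read.json("weather.json")
    weather.printSchema()   // schema is inferred from the JSON records

    // Register a temporary table and summarize it with an HQL query
    weather.registerTempTable("weather")
    val summary = sqlContext.sql(
      """SELECT station, month, AVG(tempC) AS avgTemp, MAX(tempC) AS maxTemp
        |FROM weather
        |GROUP BY station, month""".stripMargin)

    summary.cache()   // caching keeps the table in memory for repeated queries
    summary.show()
    sc.stop()
  }
}
```

With Hive support (Day 3's integration topic), `SQLContext` would be replaced by `HiveContext` and the same queries would run against Hive tables.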

Day 4

  • Spark Streaming
        • Spark Streaming Architecture & DStreams
        • Transformations on DStreams
        • Exercise: Counting Words from Streaming Data
        • Stateful & Stateless Word Count
        • Recovering from Faults in Spark Streaming
        • Introduction to Flume
        • Using Spark Streaming as a Flume Sink
        • Exercise: RealTime Analytics Design Patterns
  • Advanced Topics
        • Using Combiners to Avoid Multiple Actions & Data Columns
        • Accumulators & Broadcast Variables
        • Exercise: Counting Corrupted Records Using Accumulators
        • Exercise: Sharing a Lookup Table Across Partitions
        • Working on a Per-Partition Basis
  • Performance Tuning
        • Understanding Spark’s Execution Plan (DAG)
        • Using Logs and the Web UI to Identify Problems
        • Key Performance Considerations
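Several of the Day 4 topics can be combined in one sketch: a streaming word count with both stateless (per-batch) and stateful (running) variants, plus an accumulator counting bad records. It assumes a local socket source on `localhost:9999` (e.g. fed by `nc -lk 9999`) and a placeholder checkpoint directory; the 5-second batch interval is arbitrary.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))   // 5-second micro-batches
    ssc.checkpoint("checkpoint-dir")   // required for stateful ops & fault recovery

    // Accumulator counting empty/corrupted records (Day 4 accumulator exercise)
    val corrupted = ssc.sparkContext.accumulator(0L, "corrupted records")

    // Hypothetical source: text lines arriving on a local socket
    val lines = ssc.socketTextStream("localhost", 9999)
    val pairs = lines.flatMap { line =>
      if (line.trim.isEmpty) { corrupted += 1L; Nil }
      else line.trim.split("\\s+").toSeq
    }.map(word => (word, 1))

    // Stateless: counts within each 5-second batch only
    pairs.reduceByKey(_ + _).print()

    // Stateful: running totals across batches via updateStateByKey
    val running = pairs.updateStateByKey[Int] { (batch: Seq[Int], state: Option[Int]) =>
      Some(state.getOrElse(0) + batch.sum)
    }
    running.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

A broadcast variable (`ssc.sparkContext.broadcast(...)`) would be used the same way here as in batch jobs, e.g. to share a lookup table across partitions without re-shipping it with every task.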
