The Big Data Spark course is designed to impart in-depth knowledge of Big Data processing with Spark. The course is packed with real-life projects and case studies.
  • Next batch: 03 Feb, 2018
  • 80 hrs
  • 8 weekends (Sat & Sun)
About the Course
Spark Core
Spark Streaming with Kafka
Spark SQL, Data Frames and Data Sets
Machine Learning with Spark (MLLib)
Spark Integrations with HDFS, HBase, Cassandra, MongoDB, Redis, Elasticsearch, Flume and many others
Sentiment analysis using Spark and NLP, Twitter trends using Spark Streaming and Cassandra
Real-time weather reporting using Spark Streaming
Comparison between different Programming Paradigms
Scala Installation, sbt, Project Creation Steps
Scala Constructs
Control Structures - if-else, loops, match clause
Functions – declaration, partial functions, anonymous functions
Lazy evaluation
Immutable Collections – Array, List, Map & Set
Mutable Collections – ArrayBuffer, ListBuffer, mutable Map
Classes & Objects
Singleton object / companion class
Exception Handling (try-catch-finally, Try)
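A minimal sketch pulling several of these constructs together — pattern matching, anonymous functions, lazy evaluation, Try-based error handling, and a companion class/object pair (the Temperature class is purely illustrative):

```scala
import scala.util.Try

// A companion class/object pair: the private constructor is
// accessible from the companion's factory method
class Temperature private (val celsius: Double)
object Temperature {
  def apply(celsius: Double): Temperature = new Temperature(celsius)
}

object ScalaConstructsDemo {
  // Anonymous function assigned to a val
  val double: Int => Int = x => x * 2

  // Lazy evaluation: the body runs only on first access
  lazy val expensive: Int = { println("computed once"); 42 }

  // Pattern matching with a match clause and a guard
  def describe(n: Int): String = n match {
    case 0          => "zero"
    case x if x < 0 => "negative"
    case _          => "positive"
  }

  // Exception handling with Try instead of try-catch
  def safeDiv(a: Int, b: Int): Try[Int] = Try(a / b)

  def main(args: Array[String]): Unit = {
    println(describe(-3))                // negative
    println(double(21))                  // 42
    println(expensive)                   // prints "computed once", then 42
    println(safeDiv(10, 0).isFailure)    // true
    println(Temperature(100.0).celsius)  // 100.0
  }
}
```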
Python Introduction
Python Language constructs
Python Functions
Python Modules
Python Classes & Objects
Python tools – pip, PyPI
Python analytical modules like SciPy, NumPy, Pandas
Difference between MapReduce & Spark
Spark Overview
Components of Spark
Spark Features
Spark Architecture
Spark Cluster Overview
Spark Installation
Components of a Spark Application (Driver, Executor)
Spark Context & Cluster Manager
Creating RDDs
Working of an Application (DAG Scheduler, Task Scheduler, Executor, Tasks)
Job, Stage, pipeline, partition & task
Discussion about shuffle
Hands-on: Developing Spark using interactive shells and standalone applications
spark-shell, pyspark, spark-sql, spark-submit
Transformations (in detail: performance impact of distinct, intersection & ByKey operations)
Hands-on: Actions
Performance impact of shuffle
Saving RDD
Persisting RDD
Shared Variables (Broadcast and Accumulators)
Submitting Spark Application
Deploying Spark jobs on YARN, Spark Standalone, Mesos and local clusters
Best Practices
Tuning Spark Jobs
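Since RDD transformations are lazy and only an action triggers execution, the distinction can be sketched without a cluster using Scala's lazy views, which follow the same deferred-evaluation model (this is plain Scala, not the Spark API):

```scala
object LazyPipelineDemo {
  def main(args: Array[String]): Unit = {
    var evaluated = 0

    // "Transformations": building the pipeline runs nothing,
    // just like RDD transformations only build up a DAG
    val pipeline = (1 to 10).view
      .map { x => evaluated += 1; x * 2 }
      .filter(_ % 4 == 0)

    println(evaluated)     // 0 - nothing has executed yet

    // "Action": forces the whole pipeline to evaluate
    println(pipeline.sum)  // 60

    println(evaluated)     // 10 - each element was mapped exactly once
  }
}
```

In real Spark the same idea is why a long chain of `map`/`filter` calls returns instantly, while the first `collect`, `count` or `saveAsTextFile` triggers the actual job.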
Batch Interval
Spark Streaming Context
Data Sources
Stateful Transformations
Output operations – foreachRDD
transform & updateStateByKey
Tuning Spark Streaming Jobs
Working of Spark Streaming Application
Basic structure of a Spark Streaming application
Kafka–Spark Streaming integration (receiver-based approach, direct approach)
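The stateful-transformation idea can be sketched without a running cluster: a plain-Scala simulation of updateStateByKey's semantics, folding an update function over micro-batches (the `updateCount` helper and the batch data are illustrative, not Spark API calls — only the update function's shape matches what updateStateByKey expects):

```scala
object StatefulStreamDemo {
  // Same shape as the update function updateStateByKey takes:
  // the key's new values from this batch, plus its previous state
  def updateCount(newValues: Seq[Int], state: Option[Int]): Option[Int] =
    Some(newValues.sum + state.getOrElse(0))

  // Fold the update function over micro-batches, carrying state across them
  def run(batches: Seq[Seq[(String, Int)]]): Map[String, Int] =
    batches.foldLeft(Map.empty[String, Int]) { (state, batch) =>
      val grouped = batch.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }
      (state.keySet ++ grouped.keySet).map { k =>
        k -> updateCount(grouped.getOrElse(k, Seq.empty), state.get(k)).get
      }.toMap
    }

  def main(args: Array[String]): Unit = {
    // Three micro-batches of (word, 1) pairs, as a DStream would deliver them
    val batches = Seq(
      Seq("spark" -> 1, "kafka" -> 1),
      Seq("spark" -> 1),
      Seq("spark" -> 1, "kafka" -> 1)
    )
    val counts = run(batches)
    println(counts("spark"))  // 3
    println(counts("kafka"))  // 2
  }
}
```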
Introduction to Spark SQL
Usability across Applications
Data Sources
Features of Spark SQL
Query optimization
Catalyst Optimizer
SQL Context
Creating Spark Session
Creating Datasets
Working with Datasets
Creating from Data Sources – JSON, Parquet, ORC, CSV
Operating with Data Frames
Interoperating with RDDs (infer schema using reflection and programmatically)
Interconversion between RDDs and Data Frames
Actions on Data Frames
Load and save data into Hive Metastore
Saving Data Frame in different formats
Performance Tuning
Using Spark SQL over Streaming
Basics of Statistics & Machine Learning
Applications of Machine Learning
Numerical vs Categorical variables
Training & Test data
Supervised vs Unsupervised Learning
Spark MLlib data types – Vectors, LabeledPoint
Regression (Linear Regression) – sales forecasting
Classification (Logistic Regression) – predicting if a person has cancer
Recommendation (Collaborative Filtering) – movie recommendation
Spark ML pipeline
Pipeline components (Transformer, Estimator, Pipeline and Parameter)
Clustering and classification Examples with ML Pipeline
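MLlib fits regression models over distributed data, but the underlying statistics can be shown in a few lines: a minimal single-feature ordinary least squares in plain Scala (the toy sales-forecasting data is illustrative, not from the course projects):

```scala
object LinearRegressionDemo {
  // Ordinary least squares for one feature: fits y = slope * x + intercept
  def fit(points: Seq[(Double, Double)]): (Double, Double) = {
    val n = points.size.toDouble
    val meanX = points.map(_._1).sum / n
    val meanY = points.map(_._2).sum / n
    // slope = covariance(x, y) / variance(x)
    val slope = points.map { case (x, y) => (x - meanX) * (y - meanY) }.sum /
                points.map { case (x, _) => (x - meanX) * (x - meanX) }.sum
    (slope, meanY - slope * meanX)
  }

  def main(args: Array[String]): Unit = {
    // Toy (ad spend, sales) pairs lying exactly on the line y = 2x + 1
    val data = Seq((1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0))
    val (slope, intercept) = fit(data)
    println(f"slope=$slope%.1f intercept=$intercept%.1f")  // slope=2.0 intercept=1.0
  }
}
```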
Basics of R and Overview of SparkR
Install SparkR package
Creating Spark Data Frame
From Local Data Frames
From Data Sources
From Hive Tables
Working with Spark Data Frames
Grouping, Aggregation
Operating on Columns
Applying user-defined functions (dapply, dapplyCollect, gapply, gapplyCollect, spark.lapply), data type mapping
Run SQL queries on SparkR
GLM and K-means on SparkR
Model Persistence
Spark Integrations with HDFS, HBase, Cassandra, MongoDB, Redis, Elasticsearch, Flume and many others
Trainer details
Naga Mallikarjun Ineni
  • Open-source contributor
  • 6+ years in Hadoop ecosystem projects
  • Delivered training to 20+ batches / 600+ people since 2012