Spark is an open-source processing engine built around speed, ease of use, and analytics. If you have large amounts of data that need low-latency processing beyond what a typical MapReduce program can provide, Spark is the way to go. Learn how it runs up to 100 times faster than MapReduce for iterative algorithms and interactive data mining. Learn how it provides in-memory cluster computing for speed and supports Java, Python, R, and Scala APIs for ease of development. Learn how it handles a wide range of data processing scenarios by combining SQL, streaming, and complex analytics in the same application. Learn how it runs on Hadoop, on Mesos, standalone, or in the cloud, and how it accesses diverse data sources such as HDFS, Cassandra, HBase, and S3.
  • Next batch: 17 Nov, 2018
  • 80 hrs
  • 8 weekends (Sat & Sun)
About the Course
comparison between programming paradigms
Scala installation, sbt, project creation steps
Scala constructs
control structures: if-else, loops, match expressions
functions: declaration, partial functions, anonymous functions
lazy evaluation
immutable collections
mutable collections: ArrayBuffer, ListBuffer, mutable Map
classes and objects
singleton object/companion class
exception handling
Python introduction
Python language constructs
Python functions
Python modules
Python classes and objects
Python tools: pip, PyPI
Python analytical modules such as SciPy, NumPy, pandas
Difference between MapReduce & Spark
Spark Overview | Components of Spark | Spark Features | Spark Architecture | Spark Cluster Overview | Spark Installation
Components of a Spark Application (Driver, Executor) | Spark Context & Cluster Manager | RDD | Creating RDDs
Transformations | Actions | Working of an Application (DAG Scheduler, Task Scheduler, Executor, Tasks) | Job, Stage, Pipeline, Partition & Task | Discussion about shuffle | Hands-on: Developing Spark applications using interactive shells and standalone applications
spark-shell, pyspark, spark-sql, spark-submit | Transformations (in detail: performance impact of distinct, intersection & ByKey operations) | Hands-on: Actions | Performance impact of shuffle | Saving RDD | Persisting RDD
Shared Variables (Broadcast and Accumulators) | Submitting a Spark Application | Deploying Spark jobs on a YARN cluster, Spark Standalone cluster, Local and Mesos cluster | Best Practices | Tuning Spark Jobs
Applications | Architecture | DStream | Batch Interval | Receiver
Spark Streaming Context | Data Sources | Transformations | Actions | Stateful Transformations | Output operation – foreachRDD | transform & updateStateByKey | Persisting | Tuning Spark Streaming Jobs | Working of a Spark Streaming Application | Basic Structure of a Spark Streaming application | Kafka-Spark Streaming (Receiver-based approach, Direct approach)
Introduction to Spark SQL | Usability across Applications | Data Sources | Features of Spark SQL | Query optimization | Catalyst Optimizer | SQLContext | Creating a Spark Session | Creating Datasets | Working with Datasets | Creating from Data Sources – JSON, Parquet, ORC, CSV | Operating with DataFrames | Interoperating with RDDs (inferring the schema using reflection and specifying it programmatically) | Interconversion between RDDs and DataFrames | Persistence | Actions on DataFrames | Loading and saving data in the Hive Metastore | Saving DataFrames in different formats | Performance Tuning | Using Spark SQL over Streaming
Basics of Statistics & Machine Learning | Applications of Machine Learning | Numerical vs Categorical variables | Training & Test data | Supervised vs Unsupervised Learning | Spark MLlib data types – Vectors, LabeledPoint | Regression (Linear Regression) – sales forecasting | Classification (Logistic Regression) – predicting if a person has cancer | Recommendation (Collaborative Filtering) – movie recommendation | Spark ML pipeline | Pipeline components (Transformer, Estimator, Pipeline and Parameter) | Clustering and classification examples with ML Pipeline
Basics of R and Overview of SparkR | Installing the SparkR package | Creating a SparkDataFrame | From Local Data Frames | From Data Sources | From Hive Tables | Working with SparkDataFrames | Selection | Grouping, Aggregation | Operating on Columns | Applying user-defined functions (dapply, dapplyCollect, gapply, gapplyCollect, spark.lapply, data type mapping) | Running SQL queries in SparkR | GLM and K-means in SparkR | Model Persistence
Spark integrations with HDFS, HBase, Cassandra, MongoDB, Redis, Elasticsearch, Flume and many others
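As a small preview of the Transformations and Actions topics in the outline, here is a minimal sketch in plain Python. It is an analogy only and does not use the Spark API: generators stand in for lazy transformations, and a final function that consumes them stands in for an action. The function names (`flat_map_words`, `to_pairs`, `sum_by_key_and_collect`) and the sample lines are hypothetical, chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def flat_map_words(lines: Iterable[str]) -> Iterator[str]:
    """Lazy step (like a transformation): split each line into lowercase words."""
    for line in lines:
        yield from line.lower().split()


def to_pairs(words: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Lazy step (like a transformation): turn each word into a (word, 1) pair."""
    return ((word, 1) for word in words)


def sum_by_key_and_collect(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Eager step (like an action): consume the pipeline and sum values per key."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical sample input
lines = ["spark runs on hadoop", "spark runs on mesos"]

# Building the pipeline does no work yet; generators are evaluated lazily.
pipeline = to_pairs(flat_map_words(lines))

# Only this call iterates over the data and produces a result.
counts = sum_by_key_and_collect(pipeline)
print(counts)
```

In Spark itself the same idea applies at cluster scale: transformations only describe the computation, and work is scheduled when an action asks for a result.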
Trainer details
  • Architect working in a top MNC company in Bangalore
  • More than 10 years of experience in training
  • Has delivered trainings to 700+ professionals since 2012
  • Has worked on multiple real-time Hadoop and Spark trainings
  • Strong theoretical and practical knowledge