HADOOP
Hadoop is a framework for distributed processing of large data sets across clusters of commodity computers using a simple programming model. Unstructured data such as log files, Twitter feeds, media files, and internet data in general is becoming more and more relevant to businesses, and a large amount of it lands on our machines every day. The major challenge is not storing these large data sets but retrieving and analyzing them within the organization. Hadoop can store and analyze data held on different machines at different locations quickly and cost-effectively. It uses the MapReduce model, which divides a query into small parts and processes them in parallel.
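To make the "divide and process in parallel" idea concrete, here is a minimal in-memory sketch of a MapReduce-style word count. It is plain Python, not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`), the chunking scheme, and the sample log lines are all assumptions made for illustration, and a thread pool stands in for the cluster nodes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of input lines."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Group all emitted values by key, mimicking the shuffle step."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines, num_chunks=3):
    # Split the input into smaller parts (hypothetical chunking scheme;
    # on a real cluster the splits come from the file system blocks).
    chunks = [lines[i::num_chunks] for i in range(num_chunks)]
    # Run the map step on every chunk in parallel.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_phase, chunks))
    # Bring equal keys together, then reduce each key independently.
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    logs = ["error disk full", "warn disk slow", "error network down"]
    print(word_count(logs))
```

Each chunk is mapped without knowing about the others, which is what lets the work spread across machines; only the shuffle step needs to see all the intermediate pairs.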
  • Next batch: 01 Dec, 2018
  • 80 hrs
  • 8 weekends (Sat & Sun)
About the Course
Motivation for Hadoop and NoSQL (Big Data, Scalability, Problems with Traditional Systems, Distributed Systems) | An Overview of Hadoop | Comparison with SQL Databases / Data Warehouses / Other Distributed Systems | Hadoop Distributed File System | MapReduce Programming Model | YARN – Resource Management | Hadoop Common Utilities | Hadoop 1.0 vs Hadoop 2.0 | Hadoop Ecosystem Components
Hadoop Architecture (1.0 and 2.0) | Name Node (NN) | Data Node (DN) | Job Tracker (JT) | Task Tracker (TT) | Secondary Name Node (SNN) | Backup Node (BN) | Checkpoint Node (CN) | Resource Manager (RM) | Node Manager (NM) | Application Master (AM) | Job History Server (JHS) | Timeline Server
Apache Hadoop Distribution (Prerequisite Software Installation, Configuration Details, Local Mode, Pseudo-Distributed Mode, Fully Distributed Mode) | Cloudera Distribution - CDH (Prerequisite Software Installation, Cloudera Manager Installation, Creating CDH Cluster (Parcels, Packages)) | Hortonworks Distribution - HDP (Prerequisite Software Installation, Apache Ambari Installation, Creating HDP Cluster) | Planning of Hadoop Clusters (Hardware Details, Dev Clusters, Testing Clusters, Production Clusters)
Overview of HDFS | HDFS Architecture & Internals | HDFS Data Organization | Basic File System Operations | HDFS Commands [Admin + User] | HDFS Java Client API | Data Integrity, Compression, Data Archival | High Availability, Federation, Encryption | Data Backups, Short Circuit Reads, ACLs, Quotas | Upgrades, Storage Policies, Data Balancing, Snapshots | Web Interface, WebHDFS, HttpFS | HDFS Compatible File Systems | HDFS Metrics
Overview of YARN | YARN Vs MapReduce (Hadoop 1.0) | YARN Architecture and Internals | Resource Schedulers | YARN High Availability | YARN Commands | YARN Web Interface, YARN REST API | YARN Metrics | YARN Applications
Overview of MapReduce | Hadoop MapReduce Architecture and Internals | Difference between MR1 & MR2 | Hadoop MapReduce API Concepts | Mapper, Reducer, Partitioner, Shuffle, Combiner, Sorting, Counters | Hadoop MapReduce Data Flow | Hadoop MapReduce Job Template | Hadoop Data Types | Hadoop Serialization | Distributed Cache, Speculative Execution, Data Locality | Hadoop File Formats - Sequence File, Map File, Avro, etc. | Hadoop Streaming – Non-Java MapReduce Programming | Custom Data Types, Partitioners, Input/Output Formats | MapReduce Application Master | MapReduce Job History Server | MapReduce Commands | MapReduce Joins | MapReduce Hands-On | MapReduce Algorithms (2 Hours)
NoSQL and the CAP Theorem | Cassandra Internals - wide-column data store, high throughput, heavy loads, concurrent writes | Use Cases on Spark – Cassandra
Developing MapReduce Programs | Integration with Eclipse IDE | Monitoring MapReduce Jobs | Configuration Tuning | Debugging MapReduce Jobs | Task Profiling | Performance Tuning | Passing Job-Specific Parameters | Unit Testing with MRUnit
Provisioning and Monitoring Cluster | Configuration Management | Cluster Health Management | Cluster Metrics | Security | Commissioning - Decommissioning
Overview of Pig | Installation | Architecture and components | Pig Engine | Grunt | Pig Latin (Operators – Functions (UDFs) – sds – Macros – Data types – Storage types – Language constructs – Parameter substitution – Pig Commands – Pig unit testing) | Pig administration | Best practices
Overview of Hive | Installation | Architecture and components | Hive Query Language [HQL] (DDL – DML – DQL – DCL) | Functions (UDFs) | Views | Joins | Partitioning | Bucketing | Indexing | PL/HQL | Parameter Substitution | Hive Commands | Storage Handlers | File Formats and SerDe | Hive over Tez and Spark | HCatalog | Hive Security
Overview of HBase | Architecture and components | Installation | Data model (Conceptual view – Physical view) | DB operations | HBase clients (real-time & batch) | HBase Query Language (DDL – DML – DQL – DCL) | HBase tools (hbck – compaction – region splits and merge – WAL – Snapshots – Replication – Backups) | Hot spotting | HBase administration | HBase security | HBase integrations (Phoenix – Spark – MapReduce)
Overview of Sqoop | Installation | Commands (Import – Export – Job – Merge – Metastore – Eval – Codegen)
Overview of Oozie | Architecture and components | Installation | Oozie internals (Bundle engine – Coordinator engine – Workflow engine) | Workflow (Control nodes – Action nodes) | Oozie language – hPDL with a few examples
Overview of Flume | Architecture and components | Data flow | Events | Agents (Sources – Channels – Sinks) | Interceptors
ZooKeeper (1 hour) | Ambari | Avro | Accumulo | Spark | Flink | Mahout | Storm | DataFu | Kafka/Chukwa/Falcon | Drill/Impala | Lipstick | Sentry/Knox | Tajo/Presto | Giraph | Kudu | Thrift | MADlib | Hue | ORC | Gora | Cassandra | Tez | Lucene/Nutch | Parquet | Singa
Analyzing unstructured data (movie reviews, food reviews, and Twitter) for trends, sentiment analysis, and topic modelling.
Trainer details
NAGAMALLIKARJUNA
  • Architect working at a top MNC in Bangalore
  • Has delivered trainings to 700+ professionals since 2012
  • More than 10 years of experience in training
  • Has worked on multiple real-time Hadoop and Spark trainings
  • Strong theoretical and practical knowledge