Hadoop Analyst Training

Setting up Hadoop infrastructure with single-node and multi-node clusters on Amazon EC2 (CDH4).

Course at a Glance

Mode of learning : Online - Self Paced

Domain / Subject : Engineering & Technology

Function : Information Technology(IT)

Duration : 41 Hours

Difficulty : Basic

About the Course

The Hadoop Big Data online video certification Analyst course has been created for professionals working in data warehousing, business intelligence, databases, or mainframes, and for anyone comfortable with basic SQL who wants to pursue a career in designing, developing, and architecting Hadoop-based solutions.

It emphasizes understanding what Hadoop is, how data flows through it, and how it enables storage and large-scale processing, with a deep dive into Hive and Pig, an introduction to Impala and other Hadoop ecosystem projects, and basic administration such as installing a single-node and a multi-node cluster on EC2. The Hadoop tutorial provided as part of the training contains an in-depth description of the topics mentioned.

As part of the Hadoop data warehousing / analyst course, we also cover how ETL tools such as Pentaho or Talend can connect to the Hadoop ecosystem.

Key Objectives:

The key objectives of this online training are:

• Setting up Hadoop infrastructure with single-node and multi-node clusters on Amazon EC2 (CDH4).

• ETL tool connectivity with Hadoop, including real-world case studies.

• Detailed hands-on work with Impala for real-time queries on Hadoop.

• Writing Hive and Pig Scripts and working with Sqoop.

• Understanding YARN (MRv2), introduced in the Hadoop 2.0 release.

• Implementation of HBase, MapReduce Integration, Advanced Usage and Advanced Indexing.

• Work on a Real Life Project on Big Data Analytics and gain Hands on Project Experience.

• Implementing LinkedIn-style algorithms: identifying the shortest path to a 1st-level or 2nd-level connection using Map Reduce.

• Playing with datasets: a Twitter dataset for sentiment analysis, a weather dataset, and a loan dataset.

• Guidance and quizzes to prepare for professional certification exams such as Cloudera's.

• Ability to design and develop applications involving large data sets using the Hadoop ecosystem.

• 3 months of support for the latest version of the technology or product, shared in the form of recorded sessions.
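
The shortest-path objective above can be prototyped locally before being rewritten as a Map Reduce job. The sketch below is a plain Python breadth-first search over a toy adjacency list; all names and the graph itself are invented purely for illustration:

```python
from collections import deque

def connection_level(graph, start, target, max_level=2):
    """Breadth-first search: return 1 for a direct (1st-level)
    connection, 2 for a friend-of-a-friend, or None if the target
    is farther away than max_level hops."""
    visited = {start}
    frontier = deque([(start, 0)])
    while frontier:
        person, level = frontier.popleft()
        if level >= max_level:
            continue  # do not expand beyond the requested depth
        for friend in graph.get(person, []):
            if friend == target:
                return level + 1
            if friend not in visited:
                visited.add(friend)
                frontier.append((friend, level + 1))
    return None

# Toy network (hypothetical people, for illustration only)
network = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice"],
    "dave": ["bob"],
}
```

On a real social graph the same frontier expansion is done one MapReduce pass per BFS level, with each mapper emitting a node's neighbours and the reducer keeping the minimum distance seen per node.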

Module 1 – Introduction to Hadoop and its Ecosystem, Map Reduce and HDFS

  • Big Data, Factors Constituting Big Data
  • Hadoop and Hadoop Ecosystem
  • Map Reduce – Concepts of Map, Reduce, Ordering, Shuffle and Concurrency
  • Hadoop Distributed File System (HDFS) Concepts and its Importance
  • Deep Dive into Map Reduce – Execution Framework, Partitioner, Combiner, Data Types, Key/Value Pairs
  • HDFS Deep Dive – Architecture, Data Replication, Name Node, Data Node, Data Flow
  • Parallel Copying with distcp, Hadoop Archives
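
The Map, Shuffle and Reduce phases listed above can be mimicked in a few lines of plain Python. This is a conceptual sketch only (no Hadoop involved), showing how key/value pairs flow through each phase for a word count:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big", "data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
```

In a real cluster the map and reduce phases run in parallel on different nodes and the shuffle moves data across the network; the data flow, however, is exactly this.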

Module 2 – Hands on Exercises

  1. Installing Hadoop in Pseudo-Distributed Mode, Understanding Important Configuration Files, their Properties and Daemon Threads
  2. Accessing HDFS from Command Line
  3. Map Reduce – Basic Exercises
  4. Understanding Hadoop Eco-system

1. Introduction to Sqoop, Use Cases and Installation
2. Introduction to Hive, Use Cases and Installation
3. Introduction to Pig, Use Cases and Installation
4. Introduction to Oozie, Use Cases and Installation
5. Introduction to Flume, Use Cases and Installation
6. Introduction to YARN

Assignment – 1

Mini Project – Importing MySQL Data using Sqoop and Querying it using Hive
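
As a rough preview of the mini project's data flow (the table name, columns and values below are invented for illustration), the Sqoop-then-Hive pipeline can be simulated locally: a relational table is dumped to comma-delimited text, which is the default format Sqoop writes to HDFS, and that text is then scanned and aggregated the way a Hive query over an external table would do it:

```python
import csv
import io
import sqlite3

# Stand-in for the MySQL source table (hypothetical data)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)",
               [(1, 10.0), (2, 25.5), (3, 4.5)])

# "Sqoop import": dump rows as comma-delimited text,
# Sqoop's default HDFS file format
hdfs_file = io.StringIO()
writer = csv.writer(hdfs_file)
for row in db.execute("SELECT id, amount FROM orders"):
    writer.writerow(row)

# "Hive query": scan the delimited file and aggregate, as
# SELECT SUM(amount) FROM orders would over an external table
hdfs_file.seek(0)
total = sum(float(amount) for _id, amount in csv.reader(hdfs_file))
```

In the actual project, `sqoop import --connect jdbc:mysql://... --table orders` lands the files in HDFS and Hive reads them in place; the simulation only shows why the two tools compose so cleanly.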

Module 3 – Deep Dive in Map Reduce

  • How to Develop a Map Reduce Application, Writing Unit Tests
  • Best Practices for Developing, Writing and Debugging Map Reduce Applications
  • Joining Data Sets in Map Reduce
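
The join topic above usually refers to a reduce-side join: each map task tags its records with the data set they came from, the shuffle groups records sharing a key, and the reducer crosses the two sides. A minimal local sketch (the sample records are invented for illustration):

```python
from collections import defaultdict

# Hypothetical inputs: (user_id, name) and (user_id, item) records
users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (2, "lamp")]

def map_tagged(records, tag):
    """Map: emit (key, (source_tag, value)) so the reducer can
    tell which side of the join each record came from."""
    for key, value in records:
        yield key, (tag, value)

# Shuffle: group the tagged records by key
groups = defaultdict(list)
for key, tagged in list(map_tagged(users, "U")) + list(map_tagged(orders, "O")):
    groups[key].append(tagged)

# Reduce: cross the two sides for each key (an inner join)
joined = []
for key, tagged in groups.items():
    names = [v for t, v in tagged if t == "U"]
    items = [v for t, v in tagged if t == "O"]
    for name in names:
        for item in items:
            joined.append((key, name, item))
```

The same tagging trick is what Hive and Pig generate under the hood for a common join; map-side joins avoid the shuffle but require one side to fit in memory.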


Module 4 – Hive

1. Introduction to Hive

  • What Is Hive?
  • Hive Schema and Data Storage
  • Comparing Hive to Traditional Databases
  • Hive vs. Pig
  • Hive Use Cases
  • Interacting with Hive

2. Relational Data Analysis with Hive

  • Hive Databases and Tables
  • Basic HiveQL Syntax
  • Data Types
  • Joining Data Sets
  • Common Built-in Functions
  • Hands-On Exercise: Running Hive Queries on the Shell, Scripts, and Hue

3. Hive Data Management

  • Hive Data Formats
  • Creating Databases and Hive-Managed Tables
  • Loading Data into Hive
  • Altering Databases and Tables
  • Self-Managed Tables
  • Simplifying Queries with Views
  • Storing Query Results
  • Controlling Access to Data
  • Hands-On Exercise: Data Management with Hive

4. Hive Optimization

  • Understanding Query Performance
  • Partitioning
  • Bucketing
  • Indexing Data

5. Extending Hive

  • User-Defined Functions

6. Hands-on Exercises – Working with Huge Data Sets and Querying Extensively

7. User-Defined Functions, Optimizing Queries, Tips and Tricks for Performance Tuning


Module 5 – Pig

1. Introduction to Pig

  • What Is Pig?
  • Pig’s Features
  • Pig Use Cases
  • Interacting with Pig

2. Basic Data Analysis with Pig

  • Pig Latin Syntax
  • Loading Data
  • Simple Data Types
  • Field Definitions
  • Data Output
  • Viewing the Schema
  • Filtering and Sorting Data
  • Commonly-Used Functions
  • Hands-On Exercise: Using Pig for ETL Processing

3. Processing Complex Data with Pig

  • Complex/Nested Data Types
  • Grouping
  • Iterating Grouped Data
  • Hands-On Exercise: Analyzing Data with Pig

4. Multi-Dataset Operations with Pig

  • Techniques for Combining Data Sets
  • Joining Data Sets in Pig
  • Set Operations
  • Splitting Data Sets
  • Hands-On Exercise

5. Extending Pig

  • Macros and Imports
  • UDFs
  • Using Other Languages to Process Data with Pig
  • Hands-On Exercise: Extending Pig with Streaming and UDFs

6. Pig Jobs

Module 6 – Impala

1. Introduction to Impala

  • What is Impala?
  • How Impala Differs from Hive and Pig
  • How Impala Differs from Relational Databases
  • Limitations and Future Directions
  • Using the Impala Shell

2. Choosing the Best Tool (Hive, Pig or Impala)

Assignment – 2

Module 7 – Cluster Planning

  • Cluster Planning
  • Cluster Planning Explanation

Module 8 – Hadoop Cluster Setup and Running Map Reduce Jobs – Multinode Setup

  • Hadoop Multi-Node Cluster Setup using Amazon EC2 – Creating a 4-Node Cluster
  • Running Map Reduce Jobs on the Cluster
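
One common way to run jobs on such a cluster without writing Java is Hadoop Streaming, where the mapper and reducer are ordinary scripts that read stdin and write tab-separated key/value lines to stdout. The sketch below implements both phases as plain functions so the pipeline can be exercised locally; the cluster invocation in the comment is only indicative:

```python
def mapper(lines):
    """Streaming mapper: emit "word\t1" for each word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Streaming reducer: input arrives sorted by key, so the counts
    for one word are contiguous and can be summed in a single pass."""
    current, total = None, 0
    for line in sorted_lines:
        word, count = line.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

# On the cluster the two scripts run as separate processes, roughly:
#   hadoop jar hadoop-streaming.jar -mapper mapper.py \
#       -reducer reducer.py -input in/ -output out/
# Locally, the framework's work reduces to: map -> sort -> reduce
result = list(reducer(sorted(mapper(["big data big"]))))
```

Because Hadoop sorts by key between the phases, the reducer never needs to hold more than one word's running total in memory.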

Module 9 – Major Project – Putting It All Together and Connecting the Dots

  • Working with Large Data Sets, Steps Involved in Analyzing Large Data

Module 10 – ETL Connectivity with Hadoop Ecosystem

  • How ETL Tools Work in the Big Data Industry
  • Connecting to HDFS from an ETL Tool and Moving Data from the Local System to HDFS
  • Moving Data from a DBMS to HDFS
  • Working with Hive from an ETL Tool
  • Creating a Map Reduce Job in an ETL Tool
  • End-to-End ETL PoC Showing Hadoop Integration with an ETL Tool

Module 11 – Job and certification support

  • Major Project; Hadoop Development; Cloudera Certification Tips and Guidance; Mock Interview Preparation; Practical Development Tips and Techniques

Assignment – 3


