Big Data with Apache Hadoop Distributed File System and Hive

Duration: Hours

Training Mode: Online


Apache Hadoop is an open-source framework used to efficiently store and process large datasets ranging in size from gigabytes to petabytes. Locus IT has a decade of industry experience in Hadoop consulting, staffing, and training services.

a). Introduction to Hadoop

1. High Availability

2. Scaling

3. Advantages and Challenges 

b). Introduction to Big Data

1. What is big data

2. Big Data opportunities, Challenges

3. Characteristics of Big data 

c). Introduction to Hadoop

1. Hadoop Distributed File System

2. Comparing Hadoop & SQL

3. Industries using Hadoop

4. Data Locality

5. Hadoop Architecture

6. MapReduce & HDFS

7. Using the Hadoop single node image (Clone)

d). Hadoop Distributed File System (HDFS)

1. HDFS Design & Concepts

2. Blocks, NameNodes and DataNodes

3. HDFS High-Availability and HDFS Federation

4. Hadoop DFS: The Command-Line Interface

5. Basic File System Operations

6. Anatomy of File Read, File Write

7. Block Placement Policy and Modes

8. Configuration Files in Detail

9. Metadata, FsImage, Edit Log, Secondary NameNode and Safe Mode

10. Adding a new DataNode and decommissioning a DataNode dynamically (without stopping the cluster)

11. FSCK Utility (Block Report)

12. Overriding the default configuration at the system level and the programming level

13. HDFS Federation

14. ZooKeeper Leader Election Algorithm

15. Exercise and small use case on HDFS

e). MapReduce

1. MapReduce Functional Programming Basics

2. Map and Reduce Basics

3. How MapReduce Works

4. Anatomy of a MapReduce Job Run

5. Legacy Architecture: Job Submission, Job Initialization, Task Assignment, Task Execution, Progress and Status Updates

6. Job Completion, Failures

7. Shuffling and Sorting

8. Splits, Record reader, Partition, Types of partitions & Combiners

9. Optimization Techniques: Speculative Execution, JVM Reuse and Number of Slots

10. Types of Schedulers and Counters

11. Comparisons between Old and New API at code and Architecture Level

12. Getting the data from RDBMS into HDFS using Custom data types

13. Distributed Cache and Hadoop Streaming (Python, Ruby, and R)

14. YARN

15. Sequence Files and Map Files

16. Enabling Compression Codecs

17. Map-Side Join with Distributed Cache

18. Types of I/O Formats: MultipleOutputs, NLineInputFormat

19. Handling small files using CombineFileInputFormat

f). MapReduce Programming – Java Programming

1. Hands-on “Word Count” in MapReduce in standalone and pseudo-distributed mode

2. Sorting files using Hadoop Configuration API discussion

3. Emulating “grep” for searching inside a file in Hadoop

4. DB Input Format

5. Job Dependency API discussion

6. Input Format API discussion, Split API discussion

7. Custom Data type creation in Hadoop


g). NoSQL

1. ACID in RDBMS and BASE in NoSQL

2. CAP Theorem and Types of Consistency

3. Types of NoSQL Databases in detail

4. Columnar Databases in Detail (HBase and Cassandra)

5. TTL, Bloom Filters and Compaction

h). HBase

1. HBase Installation, Concepts

2. HBase Data Model and Comparison between RDBMS and NoSQL

3. Master & Region Servers

4. HBase Operations (DDL and DML) through Shell and Programming and HBase Architecture

5. Catalog Tables

6. Block Cache and sharding


8. Data Modeling (Sequential, Salted, Promoted and Random Keys)

9. Java APIs and REST Interface

10. Client-Side Buffering: processing 1 million records using client-side buffering

11. HBase Counters

12. Enabling Replication and HBase Raw Scans

13. HBase Filters

14. Bulk Loading and Coprocessors (Endpoints and Observers with programs)

15. Real-world use case consisting of HDFS, MR and HBase

i). Hive

1. Hive Installation, Introduction and Architecture

2. Hive Services, Hive Shell, Hive Server and Hive Web Interface (HWI)

3. Metastore, HiveQL

4. OLTP vs. OLAP

5. Working with Tables

6. Primitive data types and complex data types

7. Working with Partitions

8. User Defined Functions

9. Hive Bucketed Tables and Sampling

10. External partitioned tables, mapping data to a table partition, writing the output of one query to another table, multiple inserts

11. Dynamic Partition

12. Differences between ORDER BY, DISTRIBUTE BY and SORT BY

13. Bucketing and Sorted Bucketing with Dynamic partition

14. RCFile



17. Compression on Hive tables and Migrating Hive tables

18. Dynamic substitution in Hive and Different ways of running Hive

19. How to enable Update in Hive

20. Log Analysis on Hive

21. Access HBase tables using Hive

22. Hands-on Exercises

j). Pig

1. Pig Installation

2. Execution Types

3. Grunt Shell

4. Pig Latin

5. Data Processing

6. Schema on read

7. Primitive data types and complex data types

8. Tuple Schema, Bag Schema and Map Schema

9. Loading and Storing

10. Filtering, Grouping and Joining

11. Debugging commands (Illustrate and Explain)

12. Validations, Type Casting in Pig

13. Working with Functions

14. User Defined Functions

15. Types of Joins in Pig and Replicated Join in detail

16. SPLIT and Multiquery Execution

17. Error Handling, FLATTEN and ORDER BY

18. Parameter Substitution

19. Nested For Each

20. User Defined Functions, Dynamic Invokers and Macros

21. Accessing HBase using Pig; loading and writing JSON data using Pig

22. Piggy Bank

23. Hands-on Exercises


k). Sqoop

1. Sqoop Installation

2. Import Data (full table, subset only, target directory, protecting the password, file formats other than CSV, compression, controlling parallelism, all-tables import)

3. Incremental Import (importing only new data, last imported data, storing the password in the metastore, sharing the metastore between Sqoop clients)

4. Free-Form Query Import

5. Export data to RDBMS, Hive and HBase

6. Hands-on Exercises

l). HCatalog

1. HCatalog Installation

2. Introduction to HCatalog

3. HCatalog with Pig, Hive and MR

4. Hands-on Exercises

m). Flume

1. Flume Installation

2. Introduction to Flume

3. Flume Agents: Sources, Channels and Sinks

4. Logging user information from a Java program into HDFS using Log4j with Avro Source and Tail Source

5. Logging user information from a Java program into HBase using Log4j with Avro Source and Tail Source

6. Flume Commands

7. Flume use case: streaming data from Twitter into HDFS and HBase, then analyzing it using Hive and Pig

n). More Ecosystems

1. Hue (Hortonworks and Cloudera)

o). Oozie

1. Workflow (Start, Action, End, Kill, Join and Fork), Schedulers, Coordinators and Bundles, showing how to schedule Sqoop, Hive, MR and Pig jobs

2. Real-world use case that finds the top websites used by users of certain ages, scheduled to run every hour

3. ZooKeeper

4. HBase Integration with Hive and Pig

5. Phoenix

6. Proof of concept (POC)


p). Spark

1. Spark Overview

2. Linking with Spark, Initializing Spark

3. Using the Shell

4. Resilient Distributed Datasets (RDDs)

5. Parallelized Collections

6. External Datasets

7. RDD Operations

8. Basics, Passing Functions to Spark

9. Working with Key-Value Pairs

10. Transformations

11. Actions

12. RDD Persistence

13. Which Storage Level to Choose?

14. Removing Data

15. Shared Variables

16. Broadcast Variables

17. Accumulators

18. Deploying to a Cluster

19. Unit Testing

20. Migrating from pre-1.0 Versions of Spark


For more information on Hadoop training and staffing, contact the L&D Specialist at Locus IT.

Locus Academy has more than a decade of experience delivering Hadoop training and staffing for corporate clients across the globe. Participants in the Hadoop training are extremely satisfied and are able to apply what they learn in their ongoing projects.



Instead of relying on one large computer to store and process data, Hadoop clusters multiple computers so that massive datasets can be analyzed in parallel and more quickly.
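As a rough illustration of that parallel model, the sketch below counts words with separate map, shuffle and reduce steps in plain Python. It is not Hadoop code: the function names (`map_chunk`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical, and a local thread pool stands in for the machines of a cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle step: group all emitted counts by word."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(chunks, workers=2):
    # Assumption: each chunk stands in for a block of data held on a
    # separate machine; the thread pool only imitates cluster parallelism.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_chunk, chunks))
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["big data with hadoop", "hadoop stores big data"]))
```

In a real cluster the chunks would be HDFS blocks and the map and reduce steps would run as tasks on different nodes; the course sections on MapReduce cover that in detail.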