-
Hive Training
- Course Overview
- 1. Introduction to Hadoop Hive and its important modules
- 2. Data encapsulation and data analysis basics
- 3. Data transformation and format handling
- 4. Finding bugs and deriving useful information
- 5. Cloud data management and handling
- 6. Training solutions and workforce management
- 7. Rule validation and security controls
-
OpenRefine Course
This course covers the foundations of OpenRefine and its scripting language, GREL. You will learn how to: use facets and filters; leverage OpenRefine's point-and-click transformations and fuzzy-matching functions for quick but powerful data cleaning; write complex transformations in GREL, OpenRefine's scripting language; and call APIs and parse their results in OpenRefine. We developed the course in 2015 using OpenRefine 2.6-beta and do our best to keep the content current with the latest version of OpenRefine. It is possible that some third-party services referenced in the course are no longer available or have changed their interfaces; please let us know if you find out-of-date information. This course was developed by RefinePro in partnership with BigData University and Cognitive Class. Since 2011, RefinePro has developed training programs for OpenRefine, including free online courses, in-person or remote courses, and individual coaching sessions.
- General Information:
- Lesson 1 Introduction to OpenRefine
- Understand the principles of data preparation
- Getting familiar with the OpenRefine community and the software interface
- Installing OpenRefine
- Lesson 2 Data Mining and Discovery
- Learn the different facet types
- Learn to combine facets to create complex filters
- Learn how to sort data in OpenRefine
- Lesson 3 Data Preparation and Normalization
- Learn point-and-click data normalization (clustering, removing duplicates, splitting cells)
- Understand how history and undo/redo work
- Start using GREL to concatenate two fields
- Lesson 4 General Refine Expression Language
- Understand and master GREL syntax
- Learn basic GREL expressions to replace, split, and compare strings
- Lesson 5 Data Enrichment
- Join OpenRefine projects together
- Call an API to enrich your project
- Parse a JSON response from an API
-
Apache Flink Course
- 1. Demystify Scala
- a. Introduction to Scala
- b. Setup, Installation, and configuration of Scala
- c. Develop and execute Scala Programs
- d. Scala operators and features
- e. Different Functions, procedures, and Anonymous functions
- f. Deep dive into Scala APIs
- g. Collections: Arrays, Maps, Lists, Tuples, and Loops
- h. Advanced operations – Pattern matching (a short sketch follows this module)
- i. Eclipse IDE with Scala
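To make the module concrete, here is a minimal, self-contained Scala sketch of collections, higher-order functions, and pattern matching. The object name and sample data are illustrative only, not part of the course materials.

```scala
object ScalaBasics {
  def main(args: Array[String]): Unit = {
    val words = List("flink", "spark", "flink")   // immutable List
    // Group identical words and count them with a higher-order function
    val counts = words.groupBy(identity).map { case (w, ws) => (w, ws.size) }
    counts.foreach { case (w, n) => println(s"$w -> $n") }

    // Pattern matching on a tuple, with a guard
    val pair = ("flink", 2)
    pair match {
      case (name, n) if n > 1 => println(s"$name appears $n times")
      case (name, _)          => println(s"$name appears once")
    }
  }
}
```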
- 2. Object Oriented and Functional Programming
- a. Object oriented programming
- b. OOP concepts
- c. Constructor, getter, setter, singleton, overloading and overriding
- d. Type Inference, Implicit Parameters, Closures
- e. Lists, Maps and Map Operations
- f. Nested Classes, Visibility Rules
- g. Functional Structures
- h. Functional programming constructs
- 3. Introduction to Apache Flink
- a. Learn what Apache Flink is and why it is used
- b. Understand Features of Apache Flink
- c. Apache Flink architecture and Flink design principles
- d. Role of the master process – Job Manager
- e. Role of the worker process – Task Manager
- f. Workers, Slots and Resources
- g. Overview of Apache Flink APIs
- h. Understand the differences between Apache Spark and Apache Flink (Flink vs. Spark)
- 4. Master Flink Stack
- a. Distributed Streaming Dataflow at Runtime with Flink
- b. Apache Flink APIs
- c. Apache Flink Libraries
- d. Data Flow in Apache Flink
- e. Fault tolerance in Apache Flink
- 5. Setup and Installation of single node Flink
- a. Setup of Apache Flink environment and pre-requisites
- b. Installation and configuration of Flink on single node
- c. Troubleshooting the encountered problems
- 6. Setup and Installation of multi node Flink cluster and Cloud
- a. Setup environment on Cloud
- b. Install pre-requisites on all nodes
- c. Deploy Apache Flink on cluster and Cloud
- d. Play with Flink in cluster mode
- 7. Master DataStream API for Unbounded Streams
- a. Introduction to Flink DataStream API
- b. Different DataStream Transformations in Flink
- c. Various Data Sources – File based, Socket based, Collection based, Custom
- d. Responsibility of Data Sink in Apache Flink
- e. Iterations in DataStream APIs
- f. DataStream Execution Parameters – Fault tolerance, Controlling Latency (a word-count sketch follows)
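A minimal sketch of the DataStream API covered above: an unbounded word count over a socket-based source, assuming a recent Flink release with the Scala API on the classpath. The host, port, and job name are placeholders; feed it with `nc -lk 9999`.

```scala
import org.apache.flink.streaming.api.scala._

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.socketTextStream("localhost", 9999)   // socket-based source (placeholder host/port)
      .flatMap(_.toLowerCase.split("\\W+"))   // transformation
      .filter(_.nonEmpty)
      .map((_, 1))
      .keyBy(_._1)                            // partition the stream by word
      .sum(1)                                 // rolling aggregate per key
      .print()                                // data sink
    env.execute("Streaming Word Count")       // the plan is lazy; this runs it
  }
}
```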
- 8. Learn Flink DataSet APIs for Static Data
- a. Overview of DataSet APIs in Flink
- b. Various DataSet Transformations in Flink
- c. Different Data Sources – File based, Collection based, Generic
- d. Responsibility of Data Sink in Flink DataSet APIs
- e. Iteration Operators in DataSet APIs
- f. Operating on Data Objects in Functions – Object Reuse Disabled/Enabled
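The batch counterpart with the DataSet API, sketched under the same assumptions. Note the design difference from the DataStream job above: for DataSets, `print()` itself triggers execution, so no explicit `execute()` call is needed.

```scala
import org.apache.flink.api.scala._

object BatchWordCount {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    env.fromElements("to be or not to be")   // collection-based source
      .flatMap(_.split(" "))
      .map((_, 1))
      .groupBy(0)                            // group on the word field
      .sum(1)
      .print()                               // sink; triggers execution for DataSets
  }
}
```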
- 9. Play with Flink Table API and SQL (beta)
- a. Registering Tables in Flink
- b. Table Access and various Table API operators in Flink
- c. SQL on batch tables and Streaming Tables
- d. Writing Flink Tables to external sinks
- 10. Apache Flink Libraries
- a. Overview of Flink Libraries
- b. Flink CEP – Complex Event Processing library
- c. Apache Flink Machine Learning library
- d. Apache Flink Gelly – graph processing API and library
- 11. Flink Integration with other Big data tools
- a. Integrate Flink with Hadoop
- b. Process existing HDFS data with Flink
- c. Yarn and Flink integration
- d. Flink Data Streaming with Kafka
- e. Consume data in real time from Kafka
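A sketch of consuming Kafka from Flink in real time. It assumes the pre-1.14 `flink-connector-kafka` artifact (newer Flink releases replace `FlinkKafkaConsumer` with `KafkaSource`); the broker address, group id, and topic name are placeholders.

```scala
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

object KafkaToFlink {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092") // placeholder broker
    props.setProperty("group.id", "flink-demo")              // placeholder group
    // Each Kafka record's value arrives as one String element of the stream
    val source = new FlinkKafkaConsumer[String]("events", new SimpleStringSchema(), props)
    env.addSource(source).print()
    env.execute("Kafka to Flink")
  }
}
```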
- 12. Programming in Flink
- a. Parallel Data Flow in Flink
- b. Develop complex Streaming applications in Flink
- c. Handle Batch processing in Flink using DataSet APIs
- d. Troubleshooting and Debugging Flink Programs
- e. Best Practices of development in Flink
- f. Real time Apache Flink Project
-
CouchDB training
- DESCRIPTION
- Prerequisites
- Module 1: Installing CouchDB
- Introduction: CouchDB briefly
- Installation: Get up and running fast
- Technical Overview: Details of the CouchDB technology
- Basics: Getting started with CouchDB
- Module 2: Configuring CouchDB
- Base Configuration
- couch_peruser
- CouchDB HTTP Server
- Authentication and Authorization
- Compaction Configuration
- Logging
- Replicator
- Query Servers
- External Processes
- HTTP Resource Handlers
- CouchDB Internal Services
- Miscellaneous Parameters
- Proxying Configuration
- Module 3: CouchApp
- Design Functions
- View Functions
- Show Functions
- List Functions
- Update Functions
- Filter Functions
- Validate document update functions
- Guide to Views
- Introduction to Views
- Views Collation
- Joins With Views
- View Cookbook for SQL Jockeys
- Pagination Recipe
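To illustrate design documents and views, here is a sketch that stores a JavaScript map function plus the built-in `_count` reduce through CouchDB's HTTP API, then queries the view. It assumes a local CouchDB at `localhost:5984` with credentials `admin:secret` and an existing database `mydb`; all of these are placeholders. (Java 11+ for `java.net.http`.)

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.util.Base64

object CreateView {
  def main(args: Array[String]): Unit = {
    // Design document: one view with a JS map function and the built-in _count reduce
    val ddoc =
      """{"views":{"by_type":{"map":"function(doc){ if(doc.type) emit(doc.type, 1); }","reduce":"_count"}}}"""
    val auth   = "Basic " + Base64.getEncoder.encodeToString("admin:secret".getBytes)
    val client = HttpClient.newHttpClient()

    // PUT the design document into the database
    val put = HttpRequest.newBuilder(URI.create("http://localhost:5984/mydb/_design/demo"))
      .header("Authorization", auth)
      .header("Content-Type", "application/json")
      .PUT(HttpRequest.BodyPublishers.ofString(ddoc))
      .build()
    println(client.send(put, HttpResponse.BodyHandlers.ofString()).body())

    // Query the view, grouped by key (doc.type -> count)
    val get = HttpRequest.newBuilder(
        URI.create("http://localhost:5984/mydb/_design/demo/_view/by_type?group=true"))
      .header("Authorization", auth)
      .GET()
      .build()
    println(client.send(get, HttpResponse.BodyHandlers.ofString()).body())
  }
}
```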
- Module 4: CouchDB External APIs
- Module 5: Query Server
- Query Server Protocol
- reset
- add_lib
- map_doc
- reduce
- rereduce
- ddoc
- Raising errors
- Logging
- JavaScript
- Design functions context
- CommonJS Modules
- Module 6: Fauxton
- Installation
- Get the source
- Fauxton Setup
- Dev Server
- Deploy Fauxton
- Writing Addons
- Generating an Addon
- Routes and hooks
- Module 7: Cluster
- Setup
- Theory
- Node Management
- Database Management
- Sharding
- Module 8: JSON Structure
- All Database Documents
- Bulk Documents
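A sketch of the bulk-documents endpoint: `_bulk_docs` accepts a `{"docs": [...]}` array and creates or updates all the documents in a single request, returning one status object per document. Same placeholder server and credentials as in the view sketch above.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.util.Base64

object BulkInsert {
  def main(args: Array[String]): Unit = {
    val body = """{"docs":[{"type":"user","name":"ada"},{"type":"user","name":"grace"}]}"""
    val auth = "Basic " + Base64.getEncoder.encodeToString("admin:secret".getBytes)
    val req = HttpRequest.newBuilder(URI.create("http://localhost:5984/mydb/_bulk_docs"))
      .header("Authorization", auth)
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(body))
      .build()
    val resp = HttpClient.newHttpClient().send(req, HttpResponse.BodyHandlers.ofString())
    println(resp.body())  // one {id, rev} or error entry per submitted document
  }
}
```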
- Module 9: Troubleshooting
- Breaking Changes
- Error Messages
- Known Problems
- Official CouchDB bug tracker
-
Cassandra Training
- Cassandra Training Overview
- Objectives of the Course
- Who should do the course?
- What is Big Data
- Technology Landscape
- Big Data Relevance
- Distributed Systems and Challenges
- Why NoSQL Databases
- Relational DB vs. NoSQL
- Type of NoSQL Databases
- NoSQL Landscape
- CAP Theorem and Eventual Consistency
- Key Characteristics of NoSQL Database systems
- ACID vs BASE
- Cassandra Fundamentals
- Distributed and Decentralized
- Elastic Scalability
- High Availability and Fault Tolerance
- Tunable Consistency
- Row-Oriented
- Schema-Free
- High Performance
- The Cassandra Data Model
- The Relational Data Model
- A Simple Introduction
- Clusters
- Keyspaces
- Hands-on Session
- Installation and Setup of Cassandra
- Single Node Setup
- Multi-Node Cluster Setup
- Key Configurations for Cassandra
- CLI and Hands-On with Cassandra
- Cassandra Modeling
- Cassandra (Column Family NoSQL DB)
- Key Concepts
- Secondary Indexes in Cassandra
- Difference between Custom Indexes and Secondary Indexes
- Difference between Relational Modeling and Cassandra Modeling
- Key Points to note while modeling a Cassandra Database
- Patterns and Anti-Patterns in Cassandra Modeling
- Cassandra Architecture & Intro to CQL
- Anatomy of Reading operation in Cassandra
- Anatomy of the Write operation in Cassandra
- How deletes are handled in Cassandra
- System Keyspace
- Peer-to-Peer Model
- Logical Data Model: Keyspace, Column Family/Table, Rows, Columns
- Traditional Ring design vs. VNodes
- Partitioners: Murmur3, Random (MD5) and ByteOrdered
- Gossip and Failure Detection
- Anti-Entropy and Read Repair
- Memtables, SSTables and Commit Log
- Compaction fundamentals to reduce SSTable data files
- Hinted Handoff
- Compaction
- Bloom Filters, Tombstones
- Managers and Services
- VNodes
- Indexes and Caches
- Coordinator node
- Seed nodes
- Write/Read consistency levels: Any, One, Two, Three, Quorum
- Snitches: Dynamic snitching, Simple Snitch, Rack Inferring Snitch, Property File Snitch, Gossiping Property File Snitch
- Routing Client requests
- Nodetool commands: gossipinfo, cfstats, describering
- YAML file fundamentals
- Operations management web GUI
- Stress testing Cassandra
- CQL command fundamentals
- Cassandra API
- Key concepts for Reading and Write in Cassandra
- Tunable Consistency
- Simple Get, Multi-get Slice
- Range and Slice
- Slice Predicate
- Delete
- Hands-on CLI commands
- Cassandra cqlsh
- SQL over Cassandra
- Composite Keys
- Hands-on examples on CQL 3.0 (see the sketch below)
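A sketch of the CQL basics above, driven from Scala with the DataStax Java driver 4.x. It assumes Cassandra is reachable on `localhost:9042`; the keyspace and table names are illustrative.

```scala
import com.datastax.oss.driver.api.core.CqlSession

object CqlDemo {
  def main(args: Array[String]): Unit = {
    // With no explicit contact point, the 4.x driver connects to
    // 127.0.0.1:9042 and infers the local datacenter.
    val session = CqlSession.builder().build()
    session.execute(
      """CREATE KEYSPACE IF NOT EXISTS demo
        |WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}""".stripMargin)
    session.execute(
      "CREATE TABLE IF NOT EXISTS demo.users (user_id uuid PRIMARY KEY, name text)")
    // A composite key would be declared the same way, e.g.
    // PRIMARY KEY ((tenant), created_at)
    session.execute("INSERT INTO demo.users (user_id, name) VALUES (uuid(), 'ada')")
    session.execute("SELECT name FROM demo.users")
      .forEach(row => println(row.getString("name")))
    session.close()
  }
}
```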
- Cassandra Clients
- How to establish Client Connections
- Thrift Client
- Connection Pooling
- Auto-discovery and Failover in Hector
- Client with CQL
- Cassandra Monitoring and Administration
- Tuning Cassandra
- Backup and Recovery methods
- Balancing
- Bootstrapping
- Nodetool Commands
- Upgrades
- Monitoring critical metrics
- Bulk Loading Data to Cassandra
- Bulk Export of Data from Cassandra
- Hands-on Examples for each of them
- Cassandra Analytics Cluster
- Cassandra Hadoop Integration
- Cassandra Search Cluster
- Integration of Solr with Cassandra
- Search Query on Cassandra
-
Apache Storm
- Module 1
- Bayesian Law
- Hadoop Distributed Computing
- Legacy Architecture of Real-Time System
- Difference between Storm and Hadoop
- Fundamental concepts of Storm
- Storm Development Environment
- Real Life Storm Project
- Module 2
- Apache Storm Installation
- Storm Architecture
- Logical dynamics and components in Storm
- Topology in Storm
- Storm Execution Components
- Stream Grouping
- Tuple
- Spout
- Reliable versus Unreliable Messages
- Bolt Lifecycle
- Bolt Structure
- Bolt – normalization bolt
- Reliable versus Unreliable Bolts
- Multiple Streams
- Multiple Anchoring
- Using IBasicBolt to Ack Automatically
- Hands-On:
- Creating a Storm project in Eclipse
- Running Storm bolts and spouts
- Running a Twitter example using Storm (a bolt sketch follows)
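A sketch of a normalization-style bolt as discussed above. It extends `BaseBasicBolt` (an `IBasicBolt`), so each input tuple is acked automatically; the package names assume Storm 1.x+ (`org.apache.storm`), and the field names are illustrative.

```scala
import org.apache.storm.topology.base.BaseBasicBolt
import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer}
import org.apache.storm.tuple.{Fields, Tuple, Values}

// Lower-cases incoming sentences and emits one word per output tuple.
class NormalizeBolt extends BaseBasicBolt {
  override def execute(input: Tuple, collector: BasicOutputCollector): Unit =
    input.getString(0).toLowerCase.trim.split("\\s+").foreach { w =>
      if (w.nonEmpty) collector.emit(new Values(w))
    }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("word"))
}
```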
- Module 3
- Grouping and its different types
- Reliable and unreliable messaging
- How to get Data – Direct connection and Enqueued message
- Life cycle of a bolt
- Module 4
- Stream Grouping
- Fields Grouping
- All Grouping
- Custom Grouping
- Direct Grouping
- Global Grouping
- None Grouping
- Hands-On:
- Using different grouping techniques in Storm topologies
- Module 5
- What is Trident
- Trident Spouts
- Types of Trident Spouts
- Trident Spout components
- Trident spout Interface
- Trident filters, functions, and aggregators
- Hands-On:
- Implementing Trident spouts and bolts (sketched below)
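A sketch of the classic Trident word count, close to the storm-starter example: a batch spout, a split function, and a grouped persistent aggregation into an in-memory state. The stream name and sample sentence are illustrative.

```scala
import org.apache.storm.trident.TridentTopology
import org.apache.storm.trident.operation.builtin.Count
import org.apache.storm.trident.operation.{BaseFunction, TridentCollector}
import org.apache.storm.trident.testing.{FixedBatchSpout, MemoryMapState}
import org.apache.storm.trident.tuple.TridentTuple
import org.apache.storm.tuple.{Fields, Values}

// Trident function: split a sentence field into one tuple per word
class Split extends BaseFunction {
  override def execute(tuple: TridentTuple, collector: TridentCollector): Unit =
    tuple.getString(0).split(" ").foreach(w => collector.emit(new Values(w)))
}

object TridentWordCount {
  def build(): TridentTopology = {
    val spout = new FixedBatchSpout(new Fields("sentence"), 3,
      new Values("the cow jumped over the moon"))
    spout.setCycle(true)                     // replay the batches forever
    val topology = new TridentTopology()
    topology.newStream("sentences", spout)
      .each(new Fields("sentence"), new Split(), new Fields("word"))
      .groupBy(new Fields("word"))
      .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
    topology
  }
}
```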
- Module 6
- Transactional Topologies
- Partitioned Transactional Spouts
- Opaque Transactional Topologies
- Hands-On:
- Implementing transactional system using Transactional topologies
- Module 7: Introduction to Kafka & Messaging Systems
- Basic Kafka Concepts
- Kafka vs Other Messaging Systems
- Intra-Cluster Replication
- An Inside Look at Kafka’s Components
- Log Administration, Retention, and Compaction
- Hardware and Runtime Configurations
- Monitoring and Alerting
- Cluster Administration
- Securing Kafka
- Using Kafka Connect to Move Data
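To ground the Kafka module, here is a sketch of a plain Kafka consumer showing the basic poll loop, independent of Storm. The broker address, group id, and topic are placeholders.

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

object SimpleConsumer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")  // placeholder broker
    props.put("group.id", "demo-group")               // consumer group for offset tracking
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("events"))
    while (true) {
      val records = consumer.poll(Duration.ofMillis(500))  // fetch the next batch
      records.forEach(r => println(s"${r.key} -> ${r.value}"))
    }
  }
}
```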
-
Hadoop Training
- Hadoop Course Overview
- Hadoop Training Course Prerequisites
- Hadoop Course System Requirements
- Hadoop Training Course Duration
- Introduction to Hadoop
- High Availability
- Scaling
- Advantages and Challenges
- Introduction to Big Data
- What is big data
- Big Data opportunities and challenges
- Characteristics of Big data
- Introduction to Hadoop
- Hadoop Distributed File System
- Comparing Hadoop & SQL
- Industries using Hadoop
- Data Locality
- Hadoop Architecture
- Map Reduce & HDFS
- Using the Hadoop single node image (Clone)
- Hadoop Distributed File System (HDFS)
- HDFS Design & Concepts
- Blocks, Name nodes and Data nodes
- HDFS High-Availability and HDFS Federation
- Hadoop DFS: the Command-Line Interface
- Basic File System Operations
- Anatomy of File Read, File Write
- Block Placement Policy and Modes
- More detailed explanation about Configuration files
- Metadata, FS image, Edit log, Secondary Name Node and Safe Mode
- How to add a new Data Node dynamically and decommission a Data Node dynamically (without stopping the cluster)
- FSCK Utility (block report)
- How to override default configuration at system level and Programming level
- HDFS Federation
- ZooKeeper Leader Election Algorithm
- Exercise and small use case on HDFS (an API sketch follows)
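A sketch of the basic file system operations above through the HDFS Java API; the NameNode URI and paths are placeholders.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsOps {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://localhost:9000") // placeholder NameNode URI
    val fs = FileSystem.get(conf)

    val dir = new Path("/user/demo")
    fs.mkdirs(dir)
    val out = fs.create(new Path(dir, "hello.txt"))   // write a small file
    out.writeBytes("hello hdfs\n")
    out.close()

    // Basic listing: path and length of each entry
    fs.listStatus(dir).foreach(s => println(s"${s.getPath} (${s.getLen} bytes)"))
    fs.close()
  }
}
```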
- Map Reduce
- Map Reduce Functional Programming Basics
- Map and Reduce Basics
- How Map Reduce Works
- Anatomy of a Map Reduce Job Run
- Legacy Architecture -> Job Submission, Job Initialization, Task Assignment, Task Execution, Progress and Status Updates
- Job Completion, Failures
- Shuffling and Sorting
- Splits, Record reader, Partition, Types of partitions & Combiners
- Optimization Techniques -> Speculative Execution, JVM Reuse and Number of Slots
- Types of Schedulers and Counters
- Comparisons between Old and New API at code and Architecture Level
- Getting the data from RDBMS into HDFS using Custom data types
- Distributed Cache and Hadoop Streaming (Python, Ruby, and R)
- YARN
- Sequential Files and Map Files
- Enabling Compression Codecs
- Map side Join with distributed Cache
- Types of I/O Formats: Multiple Outputs, NLineInputFormat
- Handling small files using CombineFileInputFormat
- Map Reduce Programming – Java Programming
- Hands-on “Word Count” in Map Reduce in standalone and pseudo-distributed mode (sketched after this module)
- Sorting files using Hadoop Configuration API discussion
- Emulating “grep” for searching inside a file in Hadoop
- DB Input Format
- Job Dependency API discussion
- Input Format API discussion, Split API discussion
- Custom Data type creation in Hadoop
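A sketch of the word-count job referenced above, written in Scala against the new (`org.apache.hadoop.mapreduce`) API; input and output paths come from the command line.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Mapper: emit (word, 1) for every token in the input split
class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()
  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.toLowerCase.split("\\W+").filter(_.nonEmpty).foreach { w =>
      word.set(w); ctx.write(word, one)
    }
}

// Reducer: sum the counts for each word
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    values.forEach(v => sum += v.get())
    ctx.write(key, new IntWritable(sum))
  }
}

object WordCount {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(getClass)
    job.setMapperClass(classOf[TokenizerMapper])
    job.setCombinerClass(classOf[SumReducer])   // combiner reuses the reducer
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```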
- NOSQL
- ACID in RDBMS and BASE in NoSQL
- CAP Theorem and Types of Consistency
- Types of NoSQL Databases in detail
- Columnar Databases in Detail (HBASE and CASSANDRA)
- TTL, Bloom Filters and Compaction
- HBase
- HBase Installation, Concepts
- HBase Data Model and Comparison between RDBMS and NOSQL
- Master & Region Servers
- HBase Operations (DDL and DML) through Shell and Programming and HBase Architecture
- Catalog Tables
- Block Cache and sharding
- SPLITS
- DATA Modeling (Sequential, Salted, Promoted and Random Keys)
- JAVA API’s and Rest Interface
- Client-side buffering, and processing 1 million records using client-side buffering
- HBase Counters
- Enabling Replication and HBase RAW Scans
- HBase Filters
- Bulk Loading and Coprocessors (Endpoints and Observers with programs)
- Real-world use case combining HDFS, MR and HBASE (a client-API sketch follows)
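A sketch of basic HBase DML through the Java client API. It assumes a reachable cluster (ZooKeeper quorum from the default configuration) and an existing `users` table with an `info` column family; both names are placeholders.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseDemo {
  def main(args: Array[String]): Unit = {
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("users"))

    // Put: write one cell (row key, column family, qualifier, value)
    val put = new Put(Bytes.toBytes("row1"))
    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"))
    table.put(put)

    // Get: read the cell back
    val result = table.get(new Get(Bytes.toBytes("row1")))
    println(Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))))

    table.close(); conn.close()
  }
}
```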
- Hive
- Hive Installation, Introduction and Architecture
- Hive Services, Hive Shell, Hive Server and Hive Web Interface (HWI)
- Metastore, Hive QL
- OLTP vs. OLAP
- Working with Tables
- Primitive data types and complex data types
- Working with Partitions
- User Defined Functions
- Hive Bucketed Tables and Sampling
- External partitioned tables; mapping the data to the partitions in the table; writing the output of one query to another table; multiple inserts
- Dynamic Partition
- Differences between ORDER BY, DISTRIBUTE BY and SORT BY
- Bucketing and Sorted Bucketing with Dynamic partition
- RC File
- INDEXES and VIEWS
- MAPSIDE JOINS
- Compression on hive tables and Migrating Hive tables
- Dynamic substitution in Hive and different ways of running Hive
- How to enable updates in Hive
- Log Analysis on Hive
- Access HBASE tables using Hive
- Hands-on Exercises (a JDBC sketch follows)
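A sketch of running Hive QL from a program over JDBC, assuming HiveServer2 on `localhost:10000`; the table and query are illustrative.

```scala
import java.sql.DriverManager

object HiveJdbc {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
    val stmt = conn.createStatement()
    // A partitioned table, as covered in the module above
    stmt.execute(
      """CREATE TABLE IF NOT EXISTS logs (ts STRING, level STRING, msg STRING)
        |PARTITIONED BY (dt STRING)""".stripMargin)
    val rs = stmt.executeQuery("SELECT level, COUNT(*) AS n FROM logs GROUP BY level")
    while (rs.next()) println(s"${rs.getString(1)}: ${rs.getLong(2)}")
    conn.close()
  }
}
```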
- Pig
- Pig Installation
- Execution Types
- Grunt Shell
- Pig Latin
- Data Processing
- Schema on read
- Primitive data types and complex data types
- Tuple schema, BAG Schema and MAP Schema
- Loading and Storing
- Filtering, Grouping and Joining
- Debugging commands (Illustrate and Explain)
- Validations and type casting in PIG
- Working with Functions
- User Defined Functions
- Types of JOINS in pig and Replicated Join in detail
- SPLITS and Multiquery execution
- Error Handling, FLATTEN and ORDER BY
- Parameter Substitution
- Nested For Each
- User Defined Functions, Dynamic Invokers and Macros
- How to access HBASE using PIG; load and write JSON data using PIG
- Piggy Bank
- Hands-on Exercises (a PigServer sketch follows)
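A sketch of driving Pig Latin from a JVM program via `PigServer`, here in local execution mode; the input file, aliases, and output directory are placeholders.

```scala
import org.apache.pig.{ExecType, PigServer}

object PigDemo {
  def main(args: Array[String]): Unit = {
    val pig = new PigServer(ExecType.LOCAL)   // ExecType.MAPREDUCE for a cluster
    // Load, group, and count hits per IP in Pig Latin
    pig.registerQuery("logs = LOAD 'access.log' USING PigStorage(' ') AS (ip:chararray, url:chararray);")
    pig.registerQuery("by_ip = GROUP logs BY ip;")
    pig.registerQuery("hits = FOREACH by_ip GENERATE group AS ip, COUNT(logs) AS n;")
    pig.store("hits", "hits_out")             // writes the results under hits_out/
    pig.shutdown()
  }
}
```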
- SQOOP
- Sqoop Installation
- Import Data
- Incremental Import
- Free Form Query Import
- Export data to RDBMS, HIVE and HBASE
- Hands-on Exercises
- HCatalog
- HCatalog Installation
- Introduction to HCatalog
- Using HCatalog with PIG, HIVE and MR
- Hands-on Exercises
- Flume
- Flume Installation
- Introduction to Flume
- Flume Agents: Sources, Channels and Sinks
- Log user information into HDFS using a Java program with Log4j and the Avro source / tail source
- Log user information into HBASE using a Java program with Log4j and the Avro source / tail source
- Flume Commands
- Use case of Flume: stream data from Twitter into HDFS and HBASE, then do some analysis using HIVE and PIG
- More Ecosystems
- HUE (Hortonworks and Cloudera)
- Oozie
- Workflow: how to schedule Sqoop, Hive, MR and PIG jobs
- Real-world use case that finds the top websites used by users of certain ages, scheduled to run every hour
- Zookeeper
- HBASE Integration with HIVE and PIG
- Phoenix
- Proof of concept (POC)
- SPARK
- Spark Overview
- Linking with Spark, Initializing Spark
- Using the Shell
- Resilient Distributed Datasets (RDDs)
- Parallelized Collections
- External Datasets
- RDD Operations
- Basics, Passing Functions to Spark
- Working with Key-Value Pairs
- Transformations
- Actions
- RDD Persistence
- Which Storage Level to Choose?
- Removing Data
- Shared Variables
- Broadcast Variables
- Accumulators
- Deploying to a Cluster
- Unit Testing
- Migrating from pre-1.0 Versions of Spark
- Where to Go from Here
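To close the Spark section, a sketch of the RDD concepts above: a parallelized collection, the transformation/action distinction, and an accumulator as a shared variable. It assumes Spark 2.x or later; the app name and sample data are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RddBasics").setMaster("local[*]"))

    val data   = sc.parallelize(Seq("spark", "hadoop", "spark", "hive")) // parallelized collection
    val counts = data.map((_, 1)).reduceByKey(_ + _)   // transformations: lazy, build the lineage
    counts.collect().foreach(println)                  // action: triggers actual execution

    val total = sc.longAccumulator("total")            // shared variable updated on executors
    data.foreach(_ => total.add(1))
    println(s"records: ${total.value}")                // read back on the driver

    sc.stop()
  }
}
```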