Big Data Hadoop
Hadoop is a Big Data framework that gives businesses distributed data storage and parallel processing, so they can handle data at high volume and velocity and of wide variety, while addressing its value and veracity.
Course Contents
Hadoop uses distributed storage and parallel processing to handle big data and analytics jobs, breaking each workload into smaller tasks that run in parallel.
Intro
HDFS
MapReduce
Pig
Sqoop
Hive
HBase
Flume
Oozie, HCatalog
Mahout
Intro
- What is Big Data
- Need and significance of innovative technologies
- What is Hadoop
- 3 Vs (Characteristics)
- History of Hadoop and its Uses
- Different Components of Hadoop
- Various Hadoop Distributions
- Traditional Database vs Hadoop
HDFS
- Significance of HDFS in Hadoop
- HDFS Features
- Daemons of Hadoop and their functionalities
- NameNode
- DataNode
- JobTracker
- TaskTracker
- Secondary NameNode
- Data Storage in HDFS
- Blocks
- Heartbeats
- Data Replication
- HDFS Federation
- High Availability
- Accessing HDFS
- CLI (Command Line Interface) Unix and Hadoop Commands
- Java Based Approach
- Data Flow
- Anatomy of a File Read
- Anatomy of a File Write
- Hadoop Archives
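As a rough illustration of the Blocks and Data Replication topics above, the following Python sketch estimates how many blocks a file occupies and how much raw storage its replicas consume. The 128 MB block size and replication factor of 3 are the usual defaults in Hadoop 2.x and later; the function name `hdfs_footprint` and the sample file size are assumptions made for this example, not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 128  # assumed block size (the usual dfs.blocksize default in Hadoop 2+)
REPLICATION = 3      # assumed replication factor (the usual dfs.replication default)


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (number of blocks, total block replicas, raw storage in MB)."""
    # A file is cut into fixed-size blocks; the last block may be only partly full.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is stored `replication` times across different DataNodes.
    replicas = blocks * replication
    # A partly full last block only uses as much disk as the data it holds,
    # so raw storage is the file size times the replication factor.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb


if __name__ == "__main__":
    # Hypothetical 1000 MB file
    print(hdfs_footprint(1000))
```

Under these assumptions a 1000 MB file is split into 8 blocks, stored as 24 block replicas, and takes about 3000 MB of raw disk across the cluster.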
MapReduce
- Introduction to MapReduce
- MapReduce Architecture
- MapReduce Programming Model
- MapReduce Algorithm and Phases
- Data Types
- Input Splits and Records
- Blocks vs Splits
- Basic MapReduce Program
- Driver Code
- Mapper Code
- Reducer Code
- Combiner and Shuffler
- Creating Input and Output formats in MapReduce Jobs
- File Input / Output Format
- Text Input / Output Format
- Sequence File Input / Output Format, etc.
- Data Localization in MapReduce
- Distributed Cache
- A Sample MapReduce Program
- Identity Mapper
- Identity Reducer
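To give a feel for the mapper, shuffle, reducer, and driver roles listed above, here is a minimal word-count sketch in plain Python. It is an in-process simulation of the programming model, not Hadoop's Java API; the function names (`mapper`, `shuffle`, `reducer`, `run_job`) and the sample input are assumptions chosen for illustration.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce phase: sum the counts collected for one word."""
    return key, sum(values)


def run_job(lines):
    """Driver: wire the phases together (in one process, not on a cluster)."""
    mapped = [pair for line in lines for pair in mapper(line)]
    grouped = shuffle(mapped)
    # Keys are sorted before reduction, mirroring the sorted reducer input.
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    print(run_job(["big data hadoop", "hadoop stores big data"]))
```

On a real cluster each mapper would run against its own input split and the shuffle would move data between nodes; here everything runs sequentially so the data flow is easy to follow.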
Pig
- Introduction to Apache Pig
- MapReduce vs Apache Pig
- SQL vs Apache Pig
- Different Data types in Apache Pig
- Modes of Execution in Apache Pig
- Local Mode
- MapReduce or Distributed Mode
- Execution Mechanism
- Grunt shell
- Script
- Embedded
- Data Processing Operators
- Loading and Storing Data
- Filtering Data
- Grouping and Joining Data
- Sorting Data
- Combining and Splitting Data
- How to write a simple Pig Script
- UDFs in Pig
Sqoop
- Introduction to Sqoop
- Sqoop Architecture and Internals
- MySQL client and server installation
- How to connect to a relational database using Sqoop
- Sqoop Commands
- Different flavors of imports
- Export
- Hive imports
Hive
- The Metastore
- Comparison with Traditional Databases
- Schema on Read Versus Schema on Write
- Updates, Transactions, and Indexes
- HiveQL
- Data Types
- Operators and Functions
- Tables
- Managed Tables and External Tables
- Static Partitions and Dynamic Partitions
- Partitions and Buckets
- Storage Formats
- Importing Data
- Altering Tables
- Dropping Tables
- Querying Data
- Sorting and Aggregating
- Hive Query Language
- MapReduce Scripts
- Joins
- Subqueries
- Views
- User-Defined Functions
- Writing a UDF
- Writing a UDAF
- Limitations of Hive
- Hive vs Pig
HBase
- Introduction to HBase
- HBase vs HDFS
- Use Cases
- Basic Concepts
- Column families
- Scans
- HBase Architecture
- ZooKeeper
- SQL databases vs NoSQL databases
- Clients
- REST
- Thrift
- Java Based
- Avro
- MapReduce integration
- MapReduce over HBase
- Schema definition
- Basic CRUD Operations
Flume
- Introduction to Flume
- Uses of Flume
- Flume Architecture
- Flume Master
- Flume Collectors
- Flume Agents
Oozie, HCatalog
- Introduction to Oozie
- Uses of Oozie
- Oozie workflow basics
Mahout
- Introduction to Mahout