Prerequisites: basic SQL; familiarity with Python would be helpful.
This training program will help participants learn:
- Big Data and Hadoop
- Spark
- Kafka
- Hive and Sqoop
- AWS
Introduction to Big Data
- Definition of Big Data
- Characteristics (Volume, Velocity, Variety, Veracity, Value)
- Importance and Use Cases
Big Data Ecosystem
- Overview of Big Data Tools and Technologies
- Hadoop Ecosystem
- NoSQL Databases
- Data Warehousing Solutions
Introduction to Hadoop
- What is Hadoop?
- History of Hadoop
- Use Cases and Advantages
Hadoop Architecture
- HDFS (Hadoop Distributed File System)
- MapReduce
- YARN (Yet Another Resource Negotiator)
Lab
- Setting up Hadoop cluster
- Basic HDFS commands
- Running a simple MapReduce job
Hadoop Installation
- Installing Hadoop
- Configuring Hadoop
Lab
- Hands-on installation of Hadoop on a local machine or a cloud instance
HDFS Architecture
- HDFS Components (NameNode, DataNode, Secondary NameNode)
- HDFS Read/Write Process
Lab
- Exploring HDFS architecture and components
HDFS Commands
- Basic HDFS Commands (mkdir, ls, put, get, rm, etc.)
- HDFS File Permissions
Lab
- Practicing HDFS commands
- Managing file permissions in HDFS
MapReduce Basics
- Introduction to MapReduce
- MapReduce Workflow
Lab
- Writing a basic MapReduce program
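As a warm-up for this lab, the map, shuffle, and reduce steps of the MapReduce workflow can be sketched in plain Python. This is a single-process illustration only, not Hadoop's API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big ideas", "data moves fast"]))
```

On a real cluster the mapper and reducer run on different nodes and the shuffle moves data over the network; the sketch keeps only the data flow.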
YARN Architecture
- YARN Components (ResourceManager, NodeManager, ApplicationMaster)
- YARN Workflow
Lab
- Exploring YARN architecture and components
Introduction to Hive
- What is Hive?
- Hive vs RDBMS
- Use Cases of Hive
Hive Architecture
- Hive Components (Driver, Compiler, Metastore, etc.)
- Hive Query Language (HQL)
- Data Storage in Hive
Lab
- Setting up Hive
- Exploring Hive components
Hive Installation
- Installing and Configuring Hive
- Hive Shell and Beeline
Lab
- Installation of Hive on Hadoop cluster
- Basic commands in Hive Shell and Beeline
Hive Data Types
- Primitive Data Types
- Collection Data Types
Lab
- Creating tables with various data types
Hive Tables
- Managed Tables
- External Tables
- Partitioned Tables
- Bucketing
Lab
- Creating and managing tables
- Working with partitioned and bucketed tables
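The idea behind partitioned and bucketed tables can be sketched in plain Python before trying it in Hive. The sketch assumes a hypothetical table with columns user_id and country, partitioned by country and bucketed by user_id into 4 buckets. The modulo rule below is a simplification: Hive applies its own hash function to the bucketing column before taking the remainder.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count for this example


def partition_path(row):
    """Each partition value gets its own directory, e.g. country=US."""
    return f"country={row['country']}"


def bucket_id(row, num_buckets=NUM_BUCKETS):
    """Simplified bucketing rule: column value modulo the number of buckets."""
    return row["user_id"] % num_buckets


def layout(rows):
    """Group user_ids by (partition directory, bucket number)."""
    files = defaultdict(list)
    for row in rows:
        files[(partition_path(row), bucket_id(row))].append(row["user_id"])
    return dict(files)


def prune(files, country):
    """Partition pruning: keep only the files under the requested partition."""
    wanted = f"country={country}"
    return {key: ids for key, ids in files.items() if key[0] == wanted}


if __name__ == "__main__":
    rows = [
        {"user_id": 1, "country": "US"},
        {"user_id": 5, "country": "US"},
        {"user_id": 2, "country": "IN"},
        {"user_id": 6, "country": "US"},
    ]
    files = layout(rows)
    print(files)
    print(prune(files, "IN"))
```

A query that filters on country only needs to read the matching directory, which is the effect the lab demonstrates with real tables.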
Hive File Formats
- TextFile
- SequenceFile
- ORC (Optimized Row Columnar)
- Parquet
Lab
- Creating and using tables with different file formats
HiveQL Basics
- SQL Select statements
- Filtering and Sorting Data
- Joins in Hive
Lab
- Writing basic HiveQL queries
- Performing joins in Hive
HiveQL (Hive Query Language)
Advanced HiveQL
- Subqueries
- Views and Indexes
- User-Defined Functions (UDFs)
- Windowing and Analytics Functions
Lab
- Writing advanced HiveQL queries
- Creating and using UDFs
Hive DDL Operations
- Create, Alter, Drop Table
- Create, Alter, Drop Database
Lab
- Practicing DDL operations
Hive DML Operations
- Insert, Update, Delete
- Load Data
Lab
- Practicing DML operations
Hive Optimization
- Query Optimization Techniques
- Indexing and Bucketing
- Partition Pruning
Lab
- Optimization techniques and performance tuning
Hive Indexes
- Creating and Managing Indexes
- Using Indexes to Improve Performance
Lab
- Creating and using indexes to optimize queries
Hive Project
- Use case for the Hive project
Lab
- Creating a use case using data
Introduction to Spark
- What is Apache Spark?
- Spark vs Hadoop MapReduce
- Introduction to Spark components (Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX)
Spark Architecture
- Spark Architecture and Components
- RDD (Resilient Distributed Dataset)
- DAG (Directed Acyclic Graph)
- Spark Cluster Modes (Standalone, YARN, Mesos)
Lab
- Setting up Spark cluster
- Understanding RDDs and DAGs
Spark Installation
- Installing and Configuring Spark
- Spark Shell and Notebooks
Lab
- Installation of Spark on Hadoop cluster or local machine
- Using Spark Shell and Notebooks
Working with RDDs
- Creating RDDs
- Transformations and Actions
- Lazy Evaluation
Lab
- Creating and manipulating RDDs
- Performing transformations and actions
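Lazy evaluation is easier to see in a toy model than in a cluster. The class below is a hypothetical stand-in written for this outline, not Spark's RDD API: transformations (map, filter) only record a step, and nothing runs until an action (collect, count) is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: return a new dataset with one more recorded step.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also recorded, not executed.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: replay every recorded step over the source data.
        results = []
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                results.append(item)
        return results

    def count(self):
        # Action: built on collect() for simplicity.
        return len(self.collect())


if __name__ == "__main__":
    calls = []

    def square(x):
        calls.append(x)  # record when the function actually runs
        return x * x

    dataset = LazyDataset(range(5)).map(square).filter(lambda x: x % 2 == 0)
    print(calls)              # still empty: no action has been called yet
    print(dataset.collect())  # the action triggers the recorded steps
```

In Spark the recorded steps form the DAG that the scheduler turns into stages; the toy model keeps only the "record now, run later" behaviour.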
Advanced RDD Operations
- Key-Value Pair RDDs
- Aggregations
- RDD Persistence and Caching
Lab
- Performing advanced RDD operations
- Persisting and caching RDDs
Spark Configurations
- Spark Configuration Properties
- Tuning Parallelism
- Memory Management
Lab
- Configuring Spark for optimal performance
- Managing memory and parallelism
Introduction to Spark SQL
- Spark SQL Overview
- DataFrames and Datasets
- SQL Queries
Lab
- Working with DataFrames and Datasets
- Executing SQL queries in Spark
DataFrame Operations
- Creating DataFrames
- Transformations on DataFrames
- Aggregations and Joins
Lab
- Creating and transforming DataFrames
- Performing aggregations and joins
Working with Structured Data
- Schemas and Encoders
- Reading and Writing Data
- Integrating with Hive
Lab
- Defining schemas and encoders
- Reading/writing data from various sources
Introduction to Spark Streaming
- What is Spark Streaming?
- Discretized Stream (DStream)
- Spark Streaming Architecture
Lab
- Setting up a streaming application
- Understanding DStream architecture
Working with DStreams
- Creating DStreams
- Transformations on DStreams
- Window Operations on DStreams
Lab
- Creating and transforming DStreams
- Performing window operations on DStreams
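Window operations combine the most recent micro-batches each time the window slides. The sketch below models this in plain Python, with the window length and slide interval expressed as a number of batches; windowed_counts is a hypothetical helper written for this outline, not part of the Spark Streaming API.

```python
from collections import Counter, deque


def windowed_counts(batches, window_length, slide_interval):
    """Count items over a sliding window of micro-batches.

    window_length and slide_interval are measured in batches.
    Returns a list of (batch_number, counts) pairs, one per slide.
    """
    window = deque(maxlen=window_length)  # old batches drop off automatically
    results = []
    for batch_number, batch in enumerate(batches, start=1):
        window.append(batch)
        if batch_number % slide_interval == 0:
            totals = Counter()
            for windowed_batch in window:
                totals.update(windowed_batch)
            results.append((batch_number, dict(totals)))
    return results


if __name__ == "__main__":
    batches = [["a", "b"], ["a"], ["c"], ["a", "c"]]
    for batch_number, counts in windowed_counts(batches, window_length=3, slide_interval=2):
        print(batch_number, counts)
```

With a window of 3 batches sliding every 2 batches, the second result covers batches 2 to 4, so the "b" from batch 1 no longer appears.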
Spark Performance Tuning
- Optimization Techniques
- Memory Management
- Serialization
- Resource optimization
- Code optimization
- Handling skew
Lab
- Performance tuning and optimization
- Managing memory and serialization
Best Practices in Spark
- Coding Best Practices
- Monitoring and Debugging Spark Applications
Lab
- Writing efficient Spark code
- Monitoring and debugging Spark applications
Introduction to Kafka
- What is Apache Kafka?
- Kafka vs Other Messaging Systems
- Use Cases of Kafka
Kafka Architecture
- Kafka Components (Broker, ZooKeeper, Producer, Consumer, Topic, Partition)
- Kafka Workflow
- Kafka Cluster Architecture
Lab
- Setting up a Kafka cluster
- Understanding Kafka components and workflow
Kafka Installation
- Installing and Configuring Kafka
- Kafka CLI Commands
Lab
- Installation of Kafka on local machine or cloud instance
- Using Kafka CLI commands
Kafka Producers
- Producing Messages to Kafka
- Producer API
- Advanced Producer Configurations
Lab
- Writing a simple producer
- Sending messages to Kafka topics
Kafka Consumers
- Consuming Messages from Kafka
- Consumer API
- Advanced Consumer Configurations
Lab
- Writing a simple consumer
- Reading messages from Kafka topics
Producer and Consumer Best Practices
- Producer Acknowledgments
- Consumer Offsets and Group Management
- Idempotent Producers and Exactly Once Semantics
Lab
- Implementing best practices for producers and consumers
- Managing offsets and ensuring message delivery
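Offset management can be explored without a broker using a small in-memory model. ToyPartition below is a hypothetical class written for this outline, not the Kafka client API. It models one partition as an append-only log and stores, per consumer group, the offset of the next record to read, which is also what a committed offset means in Kafka.

```python
class ToyPartition:
    """In-memory stand-in for a single topic partition (append-only log)."""

    def __init__(self):
        self.log = []
        self.committed = {}  # consumer group -> offset of the next record to read

    def append(self, message):
        """Producer side: append a record and return its offset."""
        self.log.append(message)
        return len(self.log) - 1

    def poll(self, group, max_records=10):
        """Consumer side: return (offset, message) pairs from the committed position."""
        start = self.committed.get(group, 0)
        return list(enumerate(self.log[start:start + max_records], start=start))

    def commit(self, group, offset):
        """Mark the record at `offset` as processed for this group."""
        self.committed[group] = offset + 1


if __name__ == "__main__":
    partition = ToyPartition()
    for message in ["a", "b", "c"]:
        partition.append(message)

    print(partition.poll("group-1", max_records=2))
    # Polling again without committing returns the same records,
    # which is how a crash before commit leads to redelivery.
    print(partition.poll("group-1", max_records=2))
    partition.commit("group-1", 1)
    print(partition.poll("group-1"))
```

Committing only after processing gives at-least-once delivery; a second group starts from offset 0 because each group tracks its own position.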
Kafka Streams
- Introduction to Kafka Streams
- Stream Processing Concepts
- KStream and KTable
Lab
- Implementing stream processing applications using Kafka Streams
Advanced Kafka Streams
- State Stores and Windowing
- Joins and Aggregations
- Error Handling and Reprocessing
Lab
- Building stateful stream processing applications
- Performing joins and aggregations on streams
Kafka Streams Best Practices
- Scaling and Fault Tolerance
- Testing and Debugging
- Performance Tuning
Lab
- Implementing best practices for Kafka Streams applications
- Optimizing and scaling stream processing
KSQL
- Introduction to KSQL
- Querying Kafka Topics with SQL
- Building Streaming Applications with KSQL
Lab
- Writing KSQL queries to process streams
- Building streaming applications using KSQL
Introduction to Kafka Connect
- What is Kafka Connect?
- Connectors and Tasks
- Use Cases of Kafka Connect
Setting Up Kafka Connect
- Installing and Configuring Kafka Connect
- Kafka Connect Properties and Workers
Lab
- Setting up Kafka Connect on a local machine or cluster
- Configuring workers and properties
Source Connectors
- Introduction to Source Connectors
- Common Source Connectors (JDBC, FileSource)
- Configuring and Running Source Connectors
Lab
- Setting up and using source connectors to ingest data into Kafka topics
Sink Connectors
- Introduction to Sink Connectors
- Common Sink Connectors (HDFS, Elasticsearch)
- Configuring and Running Sink Connectors
Lab
- Setting up and using sink connectors to export data from Kafka topics
Kafka Connect Transformations
- Single Message Transforms (SMTs)
- Custom Transformations
- Error Handling in Connectors
Lab
- Implementing single message transforms
- Writing custom transformations for connectors
Kafka Use Cases
- Common Kafka Use Cases (Messaging, Log Aggregation, Stream Processing)
- Real-World Kafka Deployments
Lab
- Discussing real-world Kafka use cases
- Designing and implementing a Kafka-based solution
Introduction to Sqoop
- What is Apache Sqoop?
- Use Cases of Sqoop
- Sqoop Architecture
Sqoop Installation
- Installing and Configuring Sqoop
- Sqoop Command-Line Interface
Lab
- Installing Sqoop on Hadoop cluster or local machine
- Configuring Sqoop properties
Basic Sqoop Import
- Importing Data from RDBMS to HDFS
- Importing Data to Hive
- Importing Data to HBase
Lab
- Importing tables from MySQL to HDFS
- Importing data into Hive tables
Advanced Sqoop Import
- Incremental Imports
- Free-form Query Imports
- Importing Data to HBase
Lab
- Performing incremental imports
- Using free-form queries for import
- Importing data into HBase
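The logic of an append-mode incremental import can be sketched in a few lines of Python. The sketch assumes a source table with a monotonically increasing id column (a hypothetical schema) and mirrors the idea behind Sqoop's --incremental append, --check-column, and --last-value options: only rows whose check column exceeds the last imported value are fetched, and the new high-water mark is saved for the next run.

```python
def incremental_append(source_rows, check_column, last_value):
    """Return the rows newer than last_value and the updated high-water mark."""
    new_rows = [row for row in source_rows if row[check_column] > last_value]
    # If nothing new arrived, keep the previous high-water mark.
    new_last_value = max((row[check_column] for row in new_rows), default=last_value)
    return new_rows, new_last_value


if __name__ == "__main__":
    table = [
        {"id": 1, "name": "alpha"},
        {"id": 2, "name": "beta"},
        {"id": 3, "name": "gamma"},
    ]
    rows, last_value = incremental_append(table, "id", last_value=1)
    print(rows, last_value)

    # A second run with the saved value imports nothing new.
    print(incremental_append(table, "id", last_value))
```

This covers append-only sources; tables whose existing rows are updated need a timestamp-based approach instead.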
Basic Sqoop Export
- Exporting Data from HDFS to RDBMS
- Exporting Data from Hive to RDBMS
Lab
- Exporting data from HDFS to MySQL
- Exporting Hive table data to PostgreSQL
Advanced Sqoop Export
- Exporting Data from HBase to RDBMS
- Performance Tuning for Export Operations
Lab
- Exporting data from HBase to MySQL
- Tuning export performance
Introduction to AWS
- Overview of AWS
- Key AWS Services
- AWS Global Infrastructure (Regions and Availability Zones)
Amazon S3 Basics
- Introduction to S3
- Buckets and Objects
- S3 Storage Classes
Lab
- Creating S3 buckets
- Uploading and managing objects in S3
S3 Security and Management
- Bucket Policies and IAM Policies
- Versioning and Lifecycle Policies
- Logging and Monitoring
Lab
- Configuring bucket policies
- Implementing versioning and lifecycle rules
- Enabling S3 logging and monitoring
Advanced S3 Features
- S3 Transfer Acceleration
- Cross-Region Replication
- S3 Select
- S3 Event Notifications
Lab
- Using S3 Transfer Acceleration
- Setting up cross-region replication
- Querying data with S3 Select
CloudWatch Basics
- Introduction to CloudWatch
- CloudWatch Metrics
- CloudWatch Alarms
Lab
- Creating CloudWatch alarms
- Monitoring AWS resources using CloudWatch metrics
Advanced CloudWatch
- CloudWatch Logs
- CloudWatch Events
- CloudWatch Dashboards
Lab
- Setting up CloudWatch Logs
- Creating CloudWatch Events
- Building CloudWatch Dashboards
Introduction to Athena
- What is Amazon Athena?
- Use Cases of Athena
- Querying Data with Athena
Lab
- Basic operations on Athena
Athena Querying
- Writing SQL Queries in Athena
- Using Athena with S3
- Performance Tuning
Lab
- Querying S3 data using Athena
- Optimizing Athena queries for performance
Redshift Basics
- Introduction to Redshift
- Redshift Architecture
- Use Cases of Redshift
Redshift Setup and Management
- Creating Redshift Clusters
- Managing Cluster Performance
- Redshift Security
Lab
- Creating and managing Redshift clusters
- Configuring security settings for Redshift
Advanced Redshift
- Redshift Spectrum
- Data Loading and Unloading
- Performance Tuning
Lab
- Using Redshift Spectrum
- Loading and unloading data
- Optimizing Redshift performance
RDS Basics
- Introduction to RDS
- RDS Instances
- RDS Use Cases
RDS Setup and Management
- Creating RDS Instances
- Backup and Restore
- RDS Security
Lab
- Creating and managing RDS instances
- Configuring backup and restore options
Advanced RDS Features
- Reading data from RDS
- DBeaver Connection
Lab
- Querying data
AWS DynamoDB
- Introduction to NoSQL databases and DynamoDB
- Data modeling and schema design
- Provisioned vs. On-demand capacity
Lab
- Setting up DynamoDB tables and indexes
- Querying and scanning data in DynamoDB
- DynamoDB transactions and best practices
- Participants will also learn about DynamoDB Streams for real-time data processing
AWS Kinesis Family
- Overview of Kinesis Streams, Firehose, and Analytics
- Real-time data processing with Kinesis Streams
- Analytics and insights using Kinesis Analytics
Lab
- Setting up Kinesis streams and data ingestion
- Data transformation with Kinesis Firehose
- Integrating Kinesis with Lambda for real-time processing
- Participants will also explore Kinesis Data Analytics for real-time insights and anomaly detection
AWS SQS
- Introduction to message queues and SQS
- FIFO vs. Standard queues
- Scaling and performance considerations
Lab
- Creating SQS queues and managing message lifecycle
- Message visibility and handling
- Integrating SQS with Lambda for event-driven processing
- Participants will also practice implementing message-driven architectures using SQS
AWS SNS
- Overview of pub/sub messaging with SNS
- Topic-based vs. Direct messaging
- Integrating SNS with other AWS services
Lab
- Creating SNS topics and subscriptions
- Message filtering and delivery policies
- Handling message delivery failures
- Participants will explore SNS message attributes for fine-grained message filtering and dead-letter queues for handling failed messages
AWS Lambda
- Serverless computing concepts and introduction to Lambda
- Event sources and function invocation
- Managing Lambda functions and monitoring performance
Lab
- Creating Lambda functions and triggers
- Writing Lambda functions in different programming languages
- Integrating Lambda with other AWS services
- Participants will also learn about Lambda@Edge for extending Lambda to the edge of the AWS global network
AWS EMR
- Overview of EMR (Elastic MapReduce) and big data processing
- Setting up EMR clusters and configuring applications
- Running and monitoring EMR jobs
Lab
- Setting up EMR clusters with specific configurations
- Running sample big data processing jobs
- Monitoring job performance and scaling EMR clusters if needed
- Participants will explore EMR integration with Apache Spark and Apache Hadoop for distributed data processing
AWS Glue
- Introduction to AWS Glue and its role in ETL processes
- Options available in Glue
- Parameters and libraries
Lab
- Setting up AWS Glue jobs and understanding ETL workflows
- Configuring Glue crawlers to discover and catalog data
- Writing Glue scripts for data transformation
AWS Glue Connection
- Overview of AWS Glue Connection and its role in data source access
Lab
- Creating Glue connections to various data sources such as Amazon S3, RDS, Redshift, etc.
- Configuring connection properties and access permissions
AWS Glue Crawler
- Understanding AWS Glue Crawler for automated data cataloging
Lab
- Configuring Glue crawlers to automatically discover and catalog data from various sources
- Defining crawler schedules and output formats
Git and GitLab
- Git installation
- Git Commands
- Git Branches
- GitLab integration
Lab
- Setting up Git and pushing changes to a repository
DevOps
- CI/CD process
- DBeaver
- PuTTY/SSH