Data Engineering

Prerequisites

Basic SQL; familiarity with Python would be helpful.

Learning Objective

This training program will help participants learn:

- Big Data and Hadoop
- Spark
- Kafka
- Hive and Sqoop
- AWS

Content Outline

 Introduction to Big Data 

- Definition of Big Data
- Characteristics (Volume, Velocity, Variety, Veracity, Value)
- Importance and Use Cases 

Big Data Ecosystem

- Overview of Big Data Tools and Technologies
- Hadoop Ecosystem
- NoSQL Databases
- Data Warehousing Solutions

 Introduction to Hadoop

- What is Hadoop?
- History of Hadoop
- Use Cases and Advantages          

 Hadoop Architecture   

- HDFS (Hadoop Distributed File System)
- MapReduce
- YARN (Yet Another Resource Negotiator) 

Lab

- Setting up Hadoop cluster
- Basic HDFS commands
- Running a simple MapReduce job 

 Hadoop Installation   

- Installing Hadoop
- Configuring Hadoop   

Lab

- Hands-on installation of Hadoop on a local machine or a cloud instance                                              

 HDFS Architecture     

- HDFS Components (NameNode, DataNode, Secondary NameNode)
- HDFS Read/Write Process 

Lab

- Exploring HDFS architecture and components

 HDFS Commands         

- Basic HDFS Commands (mkdir, ls, put, get, rm, etc.)
- HDFS File Permissions

Lab

- Practicing HDFS commands
- Managing file permissions in HDFS                     

 MapReduce Basics      

- Introduction to MapReduce
- MapReduce Workflow 

Lab

- Writing a basic MapReduce program                                                                            
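
As a rough illustration of the MapReduce workflow covered in this module, the sketch below imitates the map, shuffle, and reduce phases in plain Python on a single machine. It does not use the Hadoop API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample lines are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the grouped counts for each word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Chain the three phases into one word-count job."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    # Hypothetical sample input, standing in for files stored in HDFS.
    sample = ["big data big ideas", "data pipelines move data"]
    print(word_count(sample))
```

In a real cluster the map and reduce functions run in parallel on many nodes and the shuffle moves data over the network; here everything runs in one process so the three phases are easy to see.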

 YARN Architecture     

- YARN Components (ResourceManager, NodeManager, ApplicationMaster)
- YARN Workflow

Lab

- Exploring YARN architecture and components

 Introduction to Hive 

- What is Hive?
- Hive vs RDBMS
- Use Cases of Hive               

 Hive Architecture    

- Hive Components (Driver, Compiler, Metastore, etc.)
- Hive Query Language (HQL)
- Data Storage in Hive 

Lab

- Setting up Hive
- Exploring Hive components                                           

 Hive Installation    

- Installing and Configuring Hive
- Hive Shell and Beeline

Lab

- Installation of Hive on Hadoop cluster
- Basic commands in Hive Shell and Beeline                 

 Hive Data Types      

- Primitive Data Types
- Collection Data Types      

Lab

- Creating tables with various data types

 Hive Tables          

- Managed Tables
- External Tables
- Partitioned Tables
- Bucketing 

Lab

- Creating and managing tables
- Working with partitioned and bucketed tables             

 Hive File Formats    

- TextFile
- SequenceFile
- ORC (Optimized Row Columnar)
- Parquet 

Lab

- Creating and using tables with different file formats

 HiveQL Basics        

- SELECT Statements
- Filtering and Sorting Data
- Joins in Hive 

Lab

- Writing basic HiveQL queries
- Performing joins in Hive                                 

HiveQL (Hive Query Language)

 Advanced HiveQL      

- Subqueries
- Views and Indexes
- User-Defined Functions (UDFs)
- Windowing and Analytics Functions 


Lab

- Writing advanced HiveQL queries
- Creating and using UDFs                               
 

 Hive DDL Operations  

- Create, Alter, Drop Table
- Create, Alter, Drop Database       

Lab

- Practicing DDL operations                                                                    

 Hive DML Operations  

- Insert, Update, Delete
- Load Data           

Lab

- Practicing DML operations                                                                                      

 Hive Optimization    

- Query Optimization Techniques
- Indexing and Bucketing
- Partition Pruning 

Lab

- Optimization techniques and performance tuning                                          

 Hive Indexes         

- Creating and Managing Indexes
- Using Indexes to Improve Performance 

Lab

- Creating and using indexes to optimize queries                                          

Hive Project

- Use case for the Hive project

Lab

- Building the use case with data

 Introduction to Spark 

- What is Apache Spark?
- Spark vs Hadoop MapReduce
- Intro to components of Spark (Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX)

 Spark Architecture    

- Spark Architecture and Components
- RDD (Resilient Distributed Dataset)
- DAG (Directed Acyclic Graph)
- Spark Cluster Modes (Standalone, YARN, Mesos) 

Lab

- Setting up Spark cluster
- Understanding RDDs and DAGs                                

 Spark Installation    

- Installing and Configuring Spark
- Spark Shell and Notebooks       

Lab

- Installation of Spark on Hadoop cluster or local machine
- Using Spark Shell and Notebooks 

 Working with RDDs     

- Creating RDDs
- Transformations and Actions
- Lazy Evaluation 

Lab

- Creating and manipulating RDDs
- Performing transformations and actions                  
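
To make the difference between transformations and actions more concrete, here is a minimal sketch of lazy evaluation in plain Python. It is an analogy only and does not use the Spark API: LazyDataset, its methods, and the double helper are hypothetical names invented for this example. Transformations (map, filter) merely record what should happen; the action (collect) is what actually runs the recorded steps.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=None):
        self._source = source
        self._steps = steps or []  # pending transformations, not yet executed

    def map(self, fn):
        """Transformation: returns a new dataset with one more recorded step."""
        return LazyDataset(self._source, self._steps + [("map", fn)])

    def filter(self, predicate):
        """Transformation: returns a new dataset with one more recorded step."""
        return LazyDataset(self._source, self._steps + [("filter", predicate)])

    def collect(self):
        """Action: only now is the recorded pipeline applied to the data."""
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


if __name__ == "__main__":
    calls = []

    def double(x):
        calls.append(x)  # record each invocation to show when work happens
        return x * 2

    dataset = LazyDataset(range(5)).map(double).filter(lambda x: x > 4)
    print(calls)              # nothing has run yet
    print(dataset.collect())  # the action triggers the whole pipeline
    print(calls)
```

Spark builds a DAG of such recorded steps and can optimize it before running; this toy version simply replays the steps in order.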

 Advanced RDD Operations 

- Key-Value Pair RDDs
- Aggregations
- RDD Persistence and Caching 

Lab

- Performing advanced RDD operations
- Persisting and caching RDDs                        

 Spark Configurations  

- Spark Configuration Properties
- Tuning Parallelism
- Memory Management 

Lab

- Configuring Spark for optimal performance
- Managing memory and parallelism             

 Introduction to Spark SQL 

- Spark SQL Overview
- DataFrames and Datasets
- SQL Queries    

Lab

- Working with DataFrames and Datasets
- Executing SQL queries in Spark                    

 DataFrame Operations  

- Creating DataFrames
- Transformations on DataFrames
- Aggregations and Joins 

Lab

- Creating and transforming DataFrames
- Performing aggregations and joins               

 Working with Structured Data 

- Schemas and Encoders
- Reading and Writing Data
- Integrating with Hive 

Lab

- Defining schemas and encoders
- Reading/writing data from various sources              

 Introduction to Spark Streaming 

- What is Spark Streaming?
- Discretized Stream (DStream)
- Spark Streaming Architecture 

Lab

- Setting up a streaming application
- Understanding DStream architecture                 

 Working with DStreams 

- Creating DStreams
- Transformations on DStreams
- Window Operations on DStreams 

Lab

- Creating and transforming DStreams
- Performing window operations on DStreams           

 Spark Performance Tuning 

- Optimization Techniques
- Memory Management
- Serialization
- Resource Optimization
- Code Optimization
- Handling Skew

Lab

- Performance tuning and optimization
- Managing memory and serialization

 Best Practices in Spark 

- Coding Best Practices
- Monitoring and Debugging Spark Applications 

Lab

- Writing efficient Spark code
- Monitoring and debugging Spark applications              

 Introduction to Kafka 

- What is Apache Kafka?
- Kafka vs Other Messaging Systems
- Use Cases of Kafka 

 Kafka Architecture    

- Kafka Components (Broker, ZooKeeper, Producer, Consumer, Topic, Partition)
- Kafka Workflow
- Kafka Cluster Architecture 

Lab

- Setting up a Kafka cluster
- Understanding Kafka components and workflow              

 Kafka Installation    

- Installing and Configuring Kafka
- Kafka CLI Commands              

Lab

- Installation of Kafka on local machine or cloud instance
- Using Kafka CLI commands    

 Kafka Producers       

- Producing Messages to Kafka
- Producer API
- Advanced Producer Configurations 

Lab

- Writing a simple producer
- Sending messages to Kafka topics                          

 Kafka Consumers       

- Consuming Messages from Kafka
- Consumer API
- Advanced Consumer Configurations 

Lab

- Writing a simple consumer
- Reading messages from Kafka topics                        

 Producer and Consumer Best Practices 

- Producer Acknowledgments
- Consumer Offsets and Group Management
- Idempotent Producers and Exactly Once Semantics 

Lab

- Implementing best practices for producers and consumers
- Managing offsets and ensuring message delivery 

 Kafka Streams         

- Introduction to Kafka Streams
- Stream Processing Concepts
- KStream and KTable 

Lab

- Implementing stream processing applications using Kafka Streams                         

 Advanced Kafka Streams 

- State Stores and Windowing
- Joins and Aggregations
- Error Handling and Reprocessing 

Lab

- Building stateful stream processing applications
- Performing joins and aggregations on streams 

 Kafka Streams Best Practices 

- Scaling and Fault Tolerance
- Testing and Debugging
- Performance Tuning 

Lab

- Implementing best practices for Kafka Streams applications
- Optimizing and scaling stream processing 

 KSQL                  

- Introduction to KSQL
- Querying Kafka Topics with SQL
- Building Streaming Applications with KSQL

Lab

- Writing KSQL queries to process streams
- Building streaming applications using KSQL   

 Introduction to Kafka Connect 

- What is Kafka Connect?
- Connectors and Tasks
- Use Cases of Kafka Connect 

 Setting Up Kafka Connect 

- Installing and Configuring Kafka Connect
- Kafka Connect Properties and Workers 

Lab

- Setting up Kafka Connect on a local machine or cluster
- Configuring workers and properties 

 Source Connectors     

- Introduction to Source Connectors
- Common Source Connectors (JDBC, FileSource)
- Configuring and Running Source Connectors 

Lab

- Setting up and using source connectors to ingest data into Kafka topics                

 Sink Connectors       

- Introduction to Sink Connectors
- Common Sink Connectors (HDFS, Elasticsearch)
- Configuring and Running Sink Connectors 

Lab

- Setting up and using sink connectors to export data from Kafka topics                   

 Kafka Connect Transformations 

- Single Message Transforms (SMTs)
- Custom Transformations
- Error Handling in Connectors 

Lab

- Implementing single message transforms
- Writing custom transformations for connectors 

 Kafka Use Cases       

- Common Kafka Use Cases (Messaging, Log Aggregation, Stream Processing)
- Real-World Kafka Deployments 

Lab

- Discussing real-world Kafka use cases
- Designing and implementing a Kafka-based solution 

 Introduction to Sqoop 

- What is Apache Sqoop?
- Use Cases of Sqoop
- Sqoop Architecture 

 Sqoop Installation    

- Installing and Configuring Sqoop
- Sqoop Command-Line Interface    

Lab

- Installing Sqoop on Hadoop cluster or local machine
- Configuring Sqoop properties     

 Basic Sqoop Import    

- Importing Data from RDBMS to HDFS
- Importing Data to Hive
- Importing Data to HBase 

Lab

- Importing tables from MySQL to HDFS
- Importing data into Hive tables                  

 Advanced Sqoop Import 

- Incremental Imports
- Free-form Query Imports
- Importing Data to HBase 

Lab

- Performing incremental imports
- Using free-form queries for import
- Importing data into HBase 

 Basic Sqoop Export    

- Exporting Data from HDFS to RDBMS
- Exporting Data from Hive to RDBMS 

Lab

- Exporting data from HDFS to MySQL
- Exporting Hive table data to PostgreSQL           

 Advanced Sqoop Export 

- Exporting Data from HBase to RDBMS
- Performance Tuning for Export Operations 

Lab

- Exporting data from HBase to MySQL
- Tuning export performance                         

 Introduction to AWS    

- Overview of AWS
- Key AWS Services
- AWS Global Infrastructure (Regions and Availability Zones)
 

 Amazon S3 Basics       

- Introduction to S3
- Buckets and Objects
- S3 Storage Classes 

Lab

- Creating S3 buckets
- Uploading and managing objects in S3                             

 S3 Security and Management 

- Bucket Policies and IAM Policies
- Versioning and Lifecycle Policies
- Logging and Monitoring 

Lab

- Configuring bucket policies
- Implementing versioning and lifecycle rules
- Enabling S3 logging and monitoring 

 Advanced S3 Features   

- S3 Transfer Acceleration
- Cross-Region Replication
- S3 Select
- S3 Event Notifications

Lab

- Using S3 Transfer Acceleration
- Setting up cross-region replication
- Querying data with S3 Select 

CloudWatch Basics      

- Introduction to CloudWatch
- CloudWatch Metrics
- CloudWatch Alarms 

Lab

- Creating CloudWatch alarms
- Monitoring AWS resources using CloudWatch metrics        

 Advanced CloudWatch    

- CloudWatch Logs
- CloudWatch Events
- CloudWatch Dashboards 

Lab

- Setting up CloudWatch Logs
- Creating CloudWatch Events
- Building CloudWatch Dashboards  

 Introduction to Athena 

- What is Amazon Athena?
- Use Cases of Athena
- Querying Data with Athena

Lab

- Basic operations on Athena 

 Athena Querying        

- Writing SQL Queries in Athena
- Using Athena with S3
- Performance Tuning 

Lab

- Querying S3 data using Athena
- Optimizing Athena queries for performance             

Redshift Basics        

- Introduction to Redshift
- Redshift Architecture
- Use Cases of Redshift 

Redshift Setup and Management 

- Creating Redshift Clusters
- Managing Cluster Performance
- Redshift Security 

Lab

- Creating and managing Redshift clusters
- Configuring security settings for Redshift  

Advanced Redshift    

- Redshift Spectrum
- Data Loading and Unloading
- Performance Tuning  

Lab

- Using Redshift Spectrum
- Loading and unloading data
- Optimizing Redshift performance  

RDS Basics   

- Introduction to RDS
- RDS Instances
- RDS Use Cases                  

RDS Setup and Management 

- Creating RDS Instances
- Backup and Restore
- RDS Security 

Lab

- Creating and managing RDS instances
- Configuring backup and restore options            

Advanced RDS Features 

- Reading data from RDS
- DBeaver Connection

Lab

- Querying data
 

AWS DynamoDB

- Introduction to NoSQL databases and DynamoDB
- Data modeling and schema design
- Provisioned vs. On-demand capacity  

Lab

- Setting up DynamoDB tables and indexes
- Querying and scanning data in DynamoDB 
- DynamoDB transactions and best practices 
- Using DynamoDB Streams for real-time data processing

AWS Kinesis Family     

- Overview of Kinesis Streams, Firehose, and Analytics
- Real-time data processing with Kinesis Streams
- Analytics and insights using Kinesis Analytics

Lab

- Setting up Kinesis streams and data ingestion
- Data transformation with Kinesis Firehose
- Integrating Kinesis with Lambda for real-time processing
- Participants will also explore Kinesis Data Analytics for real-time insights and anomaly detection

 AWS SQS                 

- Introduction to message queues and SQS
- FIFO vs. Standard queues
- Scaling and performance considerations 

Lab

- Creating SQS queues and managing message lifecycle
- Message visibility and handling
- Integrating SQS with Lambda for event-driven processing
- Participants will also practice implementing message-driven architectures using SQS

AWS SNS                 

- Overview of pub/sub messaging with SNS
- Topic-based vs. Direct messaging
- Integrating SNS with other AWS services 

Lab 

- Creating SNS topics and subscriptions
- Message filtering and delivery policies
- Handling message delivery failures
- Participants will explore SNS message attributes for fine-grained message filtering and dead-letter queues for handling failed messages

AWS Lambda 

- Serverless computing concepts and introduction to Lambda
- Event sources and function invocation
- Managing Lambda functions and monitoring performance 

Lab

- Creating Lambda functions and triggers 
- Writing Lambda functions in different programming languages
- Integrating Lambda with other AWS services
- Participants will also learn about Lambda@Edge for extending Lambda to the edge of the AWS global network          

AWS EMR

- Overview of EMR (Elastic MapReduce) and big data processing
- Setting up EMR clusters and configuring applications
- Running and monitoring EMR jobs 

Lab 

- Setting up EMR clusters with specific configurations
- Running sample big data processing jobs 
- Monitoring job performance and scaling EMR clusters if needed 
- Participants will explore EMR integration with Apache Spark and Apache Hadoop for distributed data processing               

AWS Glue

- Introduction to AWS Glue and its role in ETL processes
- Options available in Glue
- Parameters and libraries

Lab

- Setting up AWS Glue jobs and understanding ETL workflows 
- Configuring Glue crawlers to discover and catalog data 
- Writing Glue scripts for data transformation               

AWS Glue Connection

- Overview of AWS Glue Connection and its role in data source access  

Lab

- Creating Glue connections to various data sources such as Amazon S3, RDS, Redshift, etc.
- Configuring connection properties and access permissions     

AWS Glue Crawler        

- Understanding AWS Glue Crawler for automated data cataloging         

Lab

- Configuring Glue crawlers to automatically discover and catalog data from various sources
- Defining crawler schedules and output formats

Git and GitLab

- Git installation
- Git Commands
- Git Branches
- GitLab integration

Lab

- Setting up Git and pushing changes to a repository

DevOps

- CI/CD process
- DBeaver
- PuTTY/SSH
