Radiant

Big Data Overview 1

Introduction to Big Data

- Definition of Big Data
- Characteristics (Volume, Velocity, Variety, Veracity, Value)
- Importance and Use Cases

Big Data Overview 2

Big Data Ecosystem

- Overview of Big Data Tools and Technologies
- Hadoop Ecosystem
- NoSQL Databases
- Data Warehousing Solutions

Hadoop Fundamentals 1

Introduction to Hadoop

- What is Hadoop?
- History of Hadoop
- Use Cases and Advantages

Hadoop Fundamentals 2

Hadoop Architecture

- HDFS (Hadoop Distributed File System)
- MapReduce
- YARN (Yet Another Resource Negotiator)

Lab

- Setting up Hadoop cluster
- Basic HDFS commands
- Running a simple MapReduce job

Hadoop Fundamentals 3

Hadoop Installation

- Installing Hadoop
- Configuring Hadoop

Lab

- Hands-on installation of Hadoop on a local machine or a cloud instance

HDFS (Hadoop Distributed File System) 1

HDFS Architecture

- HDFS Components (NameNode, DataNode, Secondary NameNode)
- HDFS Read/Write Process

Lab

- Exploring HDFS architecture and components

HDFS (Hadoop Distributed File System) 2

HDFS Commands

- Basic HDFS Commands (mkdir, ls, put, get, rm, etc.)
- HDFS File Permissions

Lab

- Practicing HDFS commands
- Managing file permissions in HDFS

MapReduce

MapReduce Basics

- Introduction to MapReduce
- MapReduce Workflow

Lab

- Writing a basic MapReduce program
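
As a rough illustration of the MapReduce workflow listed above, the sketch below runs a word count through separate map, shuffle, and reduce steps in a single Python process. It does not use Hadoop at all; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are assumptions made for this example only.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group values by key, mimicking the step between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order over an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data is big", "data is everywhere"]
    print(word_count(sample))
```

In a real cluster the map and reduce functions run on different nodes and the framework performs the shuffle; here all three are plain function calls so the data flow is easy to follow.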

YARN (Yet Another Resource Negotiator)

YARN Architecture

- YARN Components (Resource Manager, Node Manager, Application Master)
- YARN Workflow

Lab

- Exploring YARN architecture and components

Hive Fundamentals 1

Introduction to Hive

- What is Hive?
- Hive vs RDBMS
- Use Cases of Hive

Hive Fundamentals 2

Hive Architecture

- Hive Components (Driver, Compiler, Metastore, etc.)
- Hive Query Language (HQL)
- Data Storage in Hive

Lab

- Setting up Hive
- Exploring Hive components

Hive Fundamentals 3

Hive Installation

- Installing and Configuring Hive
- Hive Shell and Beeline

Lab

- Installation of Hive on Hadoop cluster
- Basic commands in Hive Shell and Beeline

Hive Data Model 1

Hive Data Types

- Primitive Data Types
- Collection Data Types

Lab

- Creating tables with various data types

Hive Data Model 2

Hive Tables

- Managed Tables
- External Tables
- Partitioned Tables
- Bucketing

Lab

- Creating and managing tables
- Working with partitioned and bucketed tables

Hive Data Model 3

Hive File Formats

- TextFile
- SequenceFile
- ORC (Optimized Row Columnar)
- Parquet

Lab

- Creating and using tables with different file formats

HiveQL (Hive Query Language) 1

HiveQL Basics

- SQL Select statements
- Filtering and Sorting Data
- Joins in Hive

Lab

- Writing basic HiveQL queries
- Performing joins in Hive

HiveQL (Hive Query Language) 2

Advanced HiveQL

- Subqueries
- Views and Indexes
- User-Defined Functions (UDFs)
- Windowing and Analytics Functions

Lab

- Writing advanced HiveQL queries
- Creating and using UDFs

HiveQL (Hive Query Language) 3

Hive DDL Operations

- Create, Alter, Drop Table
- Create, Alter, Drop Database

Lab

- Practicing DDL operations

HiveQL (Hive Query Language) 4

Hive DML Operations

- Insert, Update, Delete
- Load Data

Lab

- Practicing DML operations

Hive Optimization and Performance 1

Hive Optimization

- Query Optimization Techniques
- Indexing and Bucketing
- Partition Pruning

Lab

- Optimization techniques and performance tuning

Hive Optimization and Performance 2

Hive Indexes

- Creating and Managing Indexes
- Using Indexes to Improve Performance

Lab

- Creating and using indexes to optimize queries

Hive Project

- Use case for a Hive project

Lab

- Creating a use case using data

Spark Fundamentals 1

Introduction to Spark

- What is Apache Spark?
- Spark vs Hadoop MapReduce
- Intro to components of Spark (Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX)

Spark Fundamentals 2

Spark Architecture

- Spark Architecture and Components
- RDD (Resilient Distributed Dataset)
- DAG (Directed Acyclic Graph)
- Spark Cluster Modes (Standalone, YARN, Mesos)

Lab

- Setting up Spark cluster
- Understanding RDDs and DAGs

Spark Fundamentals 3

Spark Installation

- Installing and Configuring Spark
- Spark Shell and Notebooks

Lab

- Installation of Spark on Hadoop cluster or local machine
- Using Spark Shell and Notebooks

Spark Core 1

Working with RDDs

- Creating RDDs
- Transformations and Actions
- Lazy Evaluation

Lab

- Creating and manipulating RDDs
- Performing transformations and actions
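
The distinction between transformations, actions, and lazy evaluation can be sketched without Spark. The toy class below is a hypothetical stand-in for an RDD, not the Spark API: `map` and `filter` only record what should happen, and nothing is computed until an action (`collect` or `count`) is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: returns a new dataset and computes nothing yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also only recorded, not executed.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        # Apply the recorded steps to each element, one element at a time.
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    def collect(self):
        # Action: triggers the computation and returns all results.
        return list(self._run())

    def count(self):
        # Action: triggers the computation and returns the number of results.
        return sum(1 for _ in self._run())


if __name__ == "__main__":
    squares = LazyDataset(range(1, 6)).map(lambda x: x * x)
    odd_squares = squares.filter(lambda v: v % 2 == 1)
    print(odd_squares.collect())
```

Each action in this sketch recomputes the whole chain from the source, which is the behaviour that persistence and caching are meant to avoid.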

Spark Core 2

Advanced RDD Operations

- Key-Value Pair RDDs
- Aggregations
- RDD Persistence and Caching

Lab

- Performing advanced RDD operations
- Persisting and caching RDDs

Spark Core 3

Spark Configurations

- Spark Configuration Properties
- Tuning Parallelism
- Memory Management

Lab

- Configuring Spark for optimal performance
- Managing memory and parallelism

Spark SQL and DataFrames 1

Introduction to Spark SQL

- Spark SQL Overview
- DataFrames and Datasets
- SQL Queries

Lab

- Working with DataFrames and Datasets
- Executing SQL queries in Spark

Spark SQL and DataFrames 2

DataFrame Operations

- Creating DataFrames
- Transformations on DataFrames
- Aggregations and Joins

Lab

- Creating and transforming DataFrames
- Performing aggregations and joins

Spark SQL and DataFrames 3

Working with Structured Data

- Schemas and Encoders
- Reading and Writing Data
- Integrating with Hive

Lab

- Defining schemas and encoders
- Reading/writing data from various sources

Spark Streaming 1

Introduction to Spark Streaming

- What is Spark Streaming?
- Discretized Stream (DStream)
- Spark Streaming Architecture

Lab

- Setting up a streaming application
- Understanding DStream architecture

Spark Streaming 2

Working with DStreams

- Creating DStreams
- Transformations on DStreams
- Window Operations on DStreams

Lab

- Creating and transforming DStreams
- Performing window operations on DStreams
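
Window operations combine several consecutive micro-batches into one result. The sketch below imitates that idea in plain Python rather than with Spark Streaming: each inner list stands for one batch interval, and the hypothetical function `windowed_counts` takes a window length and a slide interval, both expressed in numbers of batches.

```python
from collections import Counter


def windowed_counts(batches, window_length, slide_interval):
    """Count items over a sliding window of micro-batches.

    A result is produced every `slide_interval` batches, covering at most
    the last `window_length` batches seen so far.
    """
    results = []
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        window = Counter()
        for batch in batches[start:end]:
            window.update(batch)
        results.append(dict(window))
    return results


if __name__ == "__main__":
    stream = [["a"], ["a", "b"], ["b"], ["c"]]
    for counts in windowed_counts(stream, window_length=2, slide_interval=1):
        print(counts)
```

With a slide interval smaller than the window length, consecutive windows overlap, so the same batch contributes to more than one result.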

Spark Performance Tuning and Best Practices 1

Spark Performance Tuning

- Optimization Techniques
- Memory Management
- Serialization
- Resource optimization
- Code optimization
- Handling skew

Lab

- Performance tuning and optimization
- Managing memory and serialization

Spark Performance Tuning and Best Practices 2

Best Practices in Spark

- Coding Best Practices
- Monitoring and Debugging Spark Applications

Lab

- Writing efficient Spark code
- Monitoring and debugging Spark applications

Kafka Fundamentals 1

Introduction to Kafka

- What is Apache Kafka?
- Kafka vs Other Messaging Systems
- Use Cases of Kafka

Kafka Fundamentals 2

Kafka Architecture

- Kafka Components (Broker, ZooKeeper, Producer, Consumer, Topic, Partition)
- Kafka Workflow
- Kafka Cluster Architecture

Lab

- Setting up a Kafka cluster
- Understanding Kafka components and workflow

Kafka Fundamentals 3

Kafka Installation

- Installing and Configuring Kafka
- Kafka CLI Commands

Lab

- Installation of Kafka on local machine or cloud instance
- Using Kafka CLI commands

Kafka Producers and Consumers 1

Kafka Producers

- Producing Messages to Kafka
- Producer API
- Advanced Producer Configurations

Lab

- Writing a simple producer
- Sending messages to Kafka topics

Kafka Producers and Consumers 2

Kafka Consumers

- Consuming Messages from Kafka
- Consumer API
- Advanced Consumer Configurations

Lab

- Writing a simple consumer
- Reading messages from Kafka topics

Kafka Producers and Consumers 3

Producer and Consumer Best Practices

- Producer Acknowledgments
- Consumer Offsets and Group Management
- Idempotent Producers and Exactly Once Semantics

Lab

- Implementing best practices for producers and consumers
- Managing offsets and ensuring message delivery

Kafka Streams and KSQL 1

Kafka Streams

- Introduction to Kafka Streams
- Stream Processing Concepts
- KStream and KTable

Lab

- Implementing stream processing applications using Kafka Streams

Kafka Streams and KSQL 2

Advanced Kafka Streams

- State Stores and Windowing
- Joins and Aggregations
- Error Handling and Reprocessing

Lab

- Building stateful stream processing applications
- Performing joins and aggregations on streams

Kafka Streams and KSQL 3

Kafka Streams Best Practices

- Scaling and Fault Tolerance
- Testing and Debugging
- Performance Tuning

Lab

- Implementing best practices for Kafka Streams applications
- Optimizing and scaling stream processing

Kafka Streams and KSQL 4

KSQL

- Introduction to KSQL
- Querying Kafka Topics with SQL
- Building Streaming Applications with KSQL

Lab

- Writing KSQL queries to process streams
- Building streaming applications using KSQL

Kafka Connect 1

Introduction to Kafka Connect

- What is Kafka Connect?
- Connectors and Tasks
- Use Cases of Kafka Connect

Kafka Connect 2

Setting Up Kafka Connect

- Installing and Configuring Kafka Connect
- Kafka Connect Properties and Workers

Lab

- Setting up Kafka Connect on a local machine or cluster
- Configuring workers and properties

Kafka Connect 3

Source Connectors

- Introduction to Source Connectors
- Common Source Connectors (JDBC, FileSource)
- Configuring and Running Source Connectors

Lab

- Setting up and using source connectors to ingest data into Kafka topics

Kafka Connect 4

Sink Connectors

- Introduction to Sink Connectors
- Common Sink Connectors (HDFS, Elasticsearch)
- Configuring and Running Sink Connectors

Lab

- Setting up and using sink connectors to export data from Kafka topics

Kafka Connect 5

Kafka Connect Transformations

- Single Message Transforms (SMTs)
- Custom Transformations
- Error Handling in Connectors

Lab

- Implementing single message transforms
- Writing custom transformations for connectors

Kafka Use Cases and Best Practices

Kafka Use Cases

- Common Kafka Use Cases (Messaging, Log Aggregation, Stream Processing)
- Real-World Kafka Deployments

Lab

- Discussing real-world Kafka use cases
- Designing and implementing a Kafka-based solution

Introduction to Sqoop

- What is Apache Sqoop?
- Use Cases of Sqoop
- Sqoop Architecture

Sqoop Installation and Setup

Sqoop Installation

- Installing and Configuring Sqoop
- Sqoop Command-Line Interface

Lab

- Installing Sqoop on Hadoop cluster or local machine
- Configuring Sqoop properties

Sqoop Import 1

Basic Sqoop Import

- Importing Data from RDBMS to HDFS
- Importing Data to Hive
- Importing Data to HBase

Lab

- Importing tables from MySQL to HDFS
- Importing data into Hive tables

Sqoop Import 2

Advanced Sqoop Import

- Incremental Imports
- Free-form Query Imports
- Importing Data to HBase

Lab

- Performing incremental imports
- Using free-form queries for import
- Importing data into HBase

Sqoop Export 1

Basic Sqoop Export

- Exporting Data from HDFS to RDBMS
- Exporting Data from Hive to RDBMS

Lab

- Exporting data from HDFS to MySQL
- Exporting Hive table data to PostgreSQL

Sqoop Export 2

Advanced Sqoop Export

- Exporting Data from HBase to RDBMS
- Performance Tuning for Export Operations

Lab

- Exporting data from HBase to MySQL
- Tuning export performance

Introduction to AWS

- Overview of AWS
- Key AWS Services
- AWS Global Infrastructure (Regions and Availability Zones)

S3 Part 1

Amazon S3 Basics

- Introduction to S3
- Buckets and Objects
- S3 Storage Classes

Lab

- Creating S3 buckets
- Uploading and managing objects in S3

S3 Part 2

S3 Security and Management

- Bucket Policies and IAM Policies
- Versioning and Lifecycle Policies
- Logging and Monitoring

Lab

- Configuring bucket policies
- Implementing versioning and lifecycle rules
- Enabling S3 logging and monitoring

S3 Part 3

Advanced S3 Features

- S3 Transfer Acceleration
- Cross-Region Replication
- S3 Select
- S3 Event Notifications

Lab

- Using S3 Transfer Acceleration
- Setting up cross-region replication
- Querying data with S3 Select

Amazon CloudWatch 1

CloudWatch Basics

- Introduction to CloudWatch
- CloudWatch Metrics
- CloudWatch Alarms

Lab

- Creating CloudWatch alarms
- Monitoring AWS resources using CloudWatch metrics

Amazon CloudWatch 2

Advanced CloudWatch

- CloudWatch Logs
- CloudWatch Events
- CloudWatch Dashboards

Lab

- Setting up CloudWatch Logs
- Creating CloudWatch Events
- Building CloudWatch Dashboards

Amazon Athena 1

Introduction to Athena

- What is Amazon Athena?
- Use Cases of Athena
- Querying Data with Athena

Lab

- Basic operations on Athena

Amazon Athena 2

Athena Querying

- Writing SQL Queries in Athena
- Using Athena with S3
- Performance Tuning

Lab

- Querying S3 data using Athena
- Optimizing Athena queries for performance

Amazon Redshift 1

Redshift Basics

- Introduction to Redshift
- Redshift Architecture
- Use Cases of Redshift

Amazon Redshift 2

Redshift Setup and Management

- Creating Redshift Clusters
- Managing Cluster Performance
- Redshift Security

Lab

- Creating and managing Redshift clusters
- Configuring security settings for Redshift

Amazon Redshift 3

Advanced Redshift

- Redshift Spectrum
- Data Loading and Unloading
- Performance Tuning

Lab

- Using Redshift Spectrum
- Loading and unloading data
- Optimizing Redshift performance

Amazon RDS 1

RDS Basics

- Introduction to RDS
- RDS Instances
- RDS Use Cases

Amazon RDS 2

RDS Setup and Management

- Creating RDS Instances
- Backup and Restore
- RDS Security

Lab

- Creating and managing RDS instances
- Configuring backup and restore options

Amazon RDS 3

Advanced RDS Features

- Reading data from RDS
- DBeaver Connection

Lab

- Querying data

AWS DynamoDB

- Introduction to NoSQL databases and DynamoDB
- Data modeling and schema design
- Provisioned vs. On-demand capacity

Lab

- Setting up DynamoDB tables and indexes
- Querying and scanning data in DynamoDB
- DynamoDB transactions and best practices
- Participants will also learn about DynamoDB Streams for real-time data processing

AWS Kinesis Family

- Overview of Kinesis Streams, Firehose, and Analytics
- Real-time data processing with Kinesis Streams
- Analytics and insights using Kinesis Analytics

Lab

- Setting up Kinesis streams and data ingestion
- Data transformation with Kinesis Firehose
- Integrating Kinesis with Lambda for real-time processing
- Participants will also explore Kinesis Data Analytics for real-time insights and anomaly detection

AWS SQS

- Introduction to message queues and SQS
- FIFO vs. Standard queues
- Scaling and performance considerations

Lab

- Creating SQS queues and managing message lifecycle
- Message visibility and handling
- Integrating SQS with Lambda for event-driven processing
- Participants will also practice implementing message-driven architectures using SQS

AWS SNS

- Overview of pub/sub messaging with SNS
- Topic-based vs. Direct messaging
- Integrating SNS with other AWS services

Lab

- Creating SNS topics and subscriptions
- Message filtering and delivery policies
- Handling message delivery failures
- Participants will explore SNS message attributes for fine-grained message filtering and dead-letter queues for handling failed messages

AWS Lambda

- Serverless computing concepts and introduction to Lambda
- Event sources and function invocation
- Managing Lambda functions and monitoring performance

Lab

- Creating Lambda functions and triggers
- Writing Lambda functions in different programming languages
- Integrating Lambda with other AWS services
- Participants will also learn about Lambda@Edge for extending Lambda to the edge of the AWS global network

AWS EMR

- Overview of EMR (Elastic MapReduce) and big data processing
- Setting up EMR clusters and configuring applications
- Running and monitoring EMR jobs

Lab

- Setting up EMR clusters with specific configurations
- Running sample big data processing jobs
- Monitoring job performance and scaling EMR clusters if needed
- Participants will explore EMR integration with Apache Spark and Apache Hadoop for distributed data processing

AWS Glue 1

AWS Glue

- Introduction to AWS Glue and its role in ETL processes
- Options available in Glue
- Parameters and libraries

Lab

- Setting up AWS Glue jobs and understanding ETL workflows
- Configuring Glue crawlers to discover and catalog data
- Writing Glue scripts for data transformation

AWS Glue 2

AWS Glue Connection

- Overview of AWS Glue Connection and its role in data source access

Lab

- Creating Glue connections to various data sources such as Amazon S3, RDS, Redshift, etc.
- Configuring connection properties and access permissions

AWS Glue 3

AWS Glue Crawler

- Understanding AWS Glue Crawler for automated data cataloging

Lab

- Configuring Glue crawlers to automatically discover and catalog data from various sources
- Defining crawler schedules and output formats

Git

Git and GitLab

- Git installation
- Git Commands
- Git Branches
- GitLab integration

Lab

- Setting up Git and pushing changes to a repository

CI/CD and DevOps

DevOps

- CI/CD process
- DBeaver
- PuTTY/SSH

Data Engineering
