Description
PySpark is the Python API for Apache Spark. It enables you to perform real-time, large-scale data processing in a distributed environment using Python. PySpark also lets you work with Spark's Resilient Distributed Datasets (RDDs) from the Python programming language.
Course Content
1-Operating System
- Intro to Operating System
- Important Unix Commands
2-Python
- Main constructs of any programming language: Sequence, Condition, Loop
- Working with Python packages: types of packages
- Importing and installing packages
- Searching for Python packages
- IDE Familiarity – Spyder/PyCharm/Jupyter Notebook
- Python operators, including bitwise operators
- Variables & Types
- Conditional statements – If else
- Loops
- Working with strings and arrays
- Functions
- Data Libraries (NumPy, pandas)
3-RDBMS
- Database Architecture
- Data modelling
- Relational Database concepts
- Database design and schema
- DDL – Create, Alter, Drop Databases
- DML – Load and Query Data
4-Data warehousing
- Overview of Data Warehousing
- Concepts and architecture of Data Warehouses
5-Big Data Concepts
- Introduction to Big Data
- Distributed computing and Hadoop Architecture
6-Storage
- Storing data on Hadoop – HDFS
7-PySpark
- Spark Architecture
- Spark Session
- Spark Language APIs
- DataFrames and Partitions
- Transformations & Actions
- Structured APIs (PySpark, the Python API)
- Schema
- Spark Types
- Structured API Execution
- Operations on DataFrames
- Working with Different Data Types
- Aggregations in Spark
- Joins in Spark
- RDD and RDD Operations, DAG
8-PySpark Streaming
- PySpark Streaming
- Structured Streaming