
Spark Internals

Nomenclature

  • Driver - Program launched on Spark job submission; it executes the user-defined logic (transformations & actions) on a set of executors and controls the life cycle of the entire Spark application.
  • Application - Comprises the driver program and the set of executors on the cluster.
  • Cluster - Collection of worker nodes managed by a cluster manager.
  • Deployment Mode - Decides where the driver program is launched: client mode launches the driver on the user's machine, while cluster mode launches it on one of the worker nodes.
  • Executors - Also called workers; launched by the driver program for distributed processing of data.
  • Cluster Manager - Maintains information about all cluster resources (executors/storage) in a centralized way (Standalone, YARN, Mesos or Kubernetes).
  • DAG - Directed Acyclic Graph; a logical representation of the transformations and actions in the Spark code.
  • Action - Function that computes a final result; it triggers execution of the chained transformations.
  • Transformation - Creates a new DataFrame/RDD by applying a function to the data.
  • RDD - Resilient Distributed Dataset; a collection of data distributed among nodes for parallel processing.
  • DataFrame - Spark SQL abstraction that holds data across different nodes for parallel processing.
  • Job - A Spark job is created for every action (together with its chain of preceding transformations) in a Spark application.
  • Stage - Each Spark job is further divided into stages that can be executed in parallel or in sequence.
  • Task - Each stage is further divided into tasks that are launched on the executors in parallel. Each task is independent and executes on a single partition.
  • Partitions - Slices of data spread across different nodes for parallel processing and fault tolerance.
  • DataSet - Strongly typed collection of objects with a schema; a DataFrame is a Dataset of Row objects.
  • Spark Session - Unified entry point for Spark applications to create DataFrames from different sources (see the sketch after this list).
  • Spark Context - Entry point to connect to a Spark cluster and create RDDs and broadcast variables.
  • DStreams - Spark Streaming abstraction that holds the data received during each batch interval.
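
A minimal PySpark sketch of the two entry points described above; the application name and the sample data are assumptions used only for illustration.

    from pyspark.sql import SparkSession

    # SparkSession - unified entry point; builds (or reuses) a session for the application
    spark = SparkSession.builder.appName("spark-internals-demo").getOrCreate()

    # DataFrames are created through the session, e.g. from an in-memory collection
    df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "name"])
    df.show()

    # SparkContext - lower-level entry point exposed by the session,
    # used for RDDs and broadcast variables
    sc = spark.sparkContext
    lookup = sc.broadcast({1: "alpha", 2: "beta"})   # broadcast variable
    rdd = sc.parallelize([1, 2, 3, 4])               # RDD from a parallelized collection
    print(rdd.map(lambda x: x * 2).collect())

    spark.stop()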
Spark Components

Spark Core, Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX & SparkR

RDD (Resilient Distributed Dataset)

  • It is the primitive, basic data structure provided to hold data in a distributed manner.
  • Types: Normal RDD & Pair RDD
  • Creation: 1. From parallelized collections 2. By applying transformations on an existing RDD 3. By reading external datasets/file types (see the sketch after this list).
  • Features: In-memory computation, Lazy evaluation, Fault tolerance, Immutability, Persistence, Partitioning
  • Operations: Transformations & Actions
  • Transformation Types: Narrow transformations & Wide transformations
  • Narrow Transformations: Operations that can be applied to a single partition without shuffling records.
  • Examples: map, flatMap, mapPartitions, filter, sample, union, etc.
  • Wide Transformations: Operations that require shuffling of records, because data in one partition may depend on or map to records in other partitions.
  • Examples: groupByKey, reduceByKey, join, repartition, coalesce, etc.
  • Limitations: No default optimization techniques (no Catalyst optimizer), no schema support, no support for handling complex/nested types.
  • Persisting & caching the RDD: Saving a required RDD in memory/secondary storage avoids recomputation and improves the performance of the Spark program; subsequent transformations/actions on the saved RDD will be faster (see the second sketch after this list).
  • cache() stores the RDD at storage level MEMORY_ONLY (the default).
  • persist() can store the RDD at MEMORY_ONLY (default), MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER or DISK_ONLY.
  • Unpersist: the rdd.unpersist() method removes the RDD from storage.
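
A small PySpark sketch of the creation routes and the narrow/wide transformation split listed above; the sample sentences and app name are illustrative, and getOrCreate() assumes a local or already-configured cluster.

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.appName("rdd-demo").getOrCreate().sparkContext

    # Creation 1: parallelized collection
    lines = sc.parallelize(["spark makes rdds", "rdds power spark"])

    # Creation 2: transformations on an existing RDD
    words = lines.flatMap(lambda line: line.split())   # narrow: no shuffle needed
    pairs = words.map(lambda w: (w, 1))                # narrow: record stays in its partition
    # Creation 3 would be reading an external dataset, e.g. sc.textFile(<path>)

    # Wide transformation: reduceByKey shuffles records so equal keys land together
    counts = pairs.reduceByKey(lambda a, b: a + b)

    # Action: triggers a job; the shuffle boundary splits it into two stages
    print(counts.collect())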
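
A sketch of caching vs. persisting with an explicit storage level, as described in the last bullets above; the squared-numbers RDD and the chosen storage levels are only illustrative.

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    sc = SparkSession.builder.appName("rdd-cache-demo").getOrCreate().sparkContext
    rdd = sc.parallelize(range(1, 1001)).map(lambda x: x * x)

    # cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
    rdd.cache()
    rdd.count()          # first action computes and stores the partitions

    # later actions reuse the stored partitions instead of recomputing the lineage
    print(rdd.take(3))

    # persist() accepts other storage levels; switch levels only after unpersisting
    rdd.unpersist()
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    rdd.count()

    # unpersist() removes the stored blocks once the RDD is no longer needed
    rdd.unpersist()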
DataFrames

DataSets

DStreams

  • Spark also supports stream processing along with batch processing.
  • DStreams, aka discretized streams, are the basic abstraction of Spark Streaming, used to process real-time/near-real-time data.
  • Data continuously flows from different streaming sources into the application as a series of batches at the given interval.
  • Sources: server logs, telemetry data, IoT device data or messaging systems.
  • DStream Operations: Stateless transformations & Stateful transformations
  • Stateless Transformations: The processing of a batch of streaming data does not depend on previous batches.
  • Stateful Transformations: The computation of a batch of streaming data depends on the values of previous batches over a window period.
  • Checkpoints: A data recovery method that stores metadata or data on reliable storage like HDFS/cloud storage (see the sketch after this list).
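
A minimal Spark Streaming sketch of a stateless per-batch word count and a stateful running count with a checkpoint directory; the socket source, port 9999 and /tmp checkpoint path are assumptions, and this presumes a Spark version that still ships the legacy DStream API.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", appName="dstream-demo")
    ssc = StreamingContext(sc, 5)                    # 5-second batch interval
    ssc.checkpoint("/tmp/spark-checkpoint")          # needed for stateful ops; use HDFS/cloud storage in production

    lines = ssc.socketTextStream("localhost", 9999)  # e.g. fed by `nc -lk 9999`
    pairs = lines.flatMap(lambda l: l.split()).map(lambda w: (w, 1))

    # Stateless: each batch is counted independently of earlier batches
    batch_counts = pairs.reduceByKey(lambda a, b: a + b)
    batch_counts.pprint()

    # Stateful: the running count carries state across batches
    def update(new_values, running):
        return sum(new_values) + (running or 0)

    running_counts = pairs.updateStateByKey(update)
    running_counts.pprint()

    ssc.start()
    ssc.awaitTermination()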