
Spark Internals

Nomenclature

  • Driver - Program launched on Spark job submission; it executes the user-defined logic (transformations & actions) on a set of executors and controls the life cycle of the entire Spark application.
  • Application - Comprises the driver program and the set of executors on the cluster.
  • Cluster - Collection of worker nodes managed by a cluster manager.
  • Deployment Mode - Decides where the driver program is launched: client mode launches the driver on the user's machine, while cluster mode launches it on one of the worker nodes.
  • Executors - Also called workers; launched by the driver program for distributed processing of data.
  • Cluster Manager - Maintains information about all cluster resources (executors/storage) in a centralized way (Standalone, YARN, Mesos or Kubernetes).
  • DAG - Directed Acyclic Graph; a logical representation of the transformations and actions in the Spark code.
  • Action - Function that computes a final result; it triggers execution of the chained transformations.
  • Transformation - Creates a new DataFrame/RDD by applying a function to the data.
  • RDD - Resilient Distributed Dataset; a collection of data distributed among nodes for parallel processing.
  • DataFrame - Spark SQL abstraction that holds data across different nodes for parallel processing.
  • Job - A Spark job is created for every action (together with its chain of preceding transformations) in a Spark application.
  • Stage - Each Spark job is further divided into stages that can be executed in parallel or in sequence.
  • Task - Each stage is further divided into tasks that are launched on the executors in parallel. Each task is independent and executes on a single partition.
  • Partitions - Slices of data spread across different nodes for parallel processing and fault tolerance.
  • DataSet - Strongly typed collection of objects with a schema; a DataFrame is a Dataset of Row objects.
  • Spark Session - Unified entry point for Spark applications to create DataFrames from different sources (see the sketch after this list).
  • Spark Context - Entry point to connect to a Spark cluster and create RDDs and broadcast variables.
  • DStreams - Spark Streaming abstraction that holds the data received during each batch interval.
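
A minimal PySpark sketch of the two entry points described above; the application name and the sample data are assumptions used only for illustration.

    from pyspark.sql import SparkSession

    # SparkSession - unified entry point; builds (or reuses) a session for the application
    spark = SparkSession.builder.appName("spark-internals-demo").getOrCreate()

    # DataFrames are created through the session, e.g. from an in-memory collection
    df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "name"])
    df.show()

    # SparkContext - lower-level entry point exposed by the session,
    # used for RDDs and broadcast variables
    sc = spark.sparkContext
    lookup = sc.broadcast({1: "alpha", 2: "beta"})   # broadcast variable
    rdd = sc.parallelize([1, 2, 3, 4])               # RDD from a parallelized collection
    print(rdd.map(lambda x: x * 2).collect())

    spark.stop()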
Spark Components

Spark Core, Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX & SparkR

RDD (Resilient Distributed Dataset)

  • It is the primitive, basic data structure provided to hold data in a distributed manner.
  • Types: Normal RDD & Pair RDD
  • Creation: 1. From parallelized collections 2. By applying transformations on an existing RDD 3. By reading external datasets/file types (see the sketch after this list).
  • Features: In-memory computation, Lazy evaluation, Fault tolerance, Immutability, Persistence, Partitioning
  • Operations: Transformations & Actions
  • Transformation Types: Narrow transformations & Wide transformations
  • Narrow Transformations: Operations that can be applied to a single partition without shuffling records.
  • Examples: map, flatMap, mapPartitions, filter, sample, union, etc.
  • Wide Transformations: Operations that require shuffling of records, because data in one partition may depend on or map to records in other partitions.
  • Examples: groupByKey, reduceByKey, join, repartition, coalesce, etc.
  • Limitations: No default optimization techniques (no Catalyst optimizer), no schema support, no support for handling complex/nested types.
  • Persisting & caching the RDD: Saving a required RDD in memory/secondary storage avoids recomputation and improves the performance of the Spark program; subsequent transformations/actions on the saved RDD will be faster (see the second sketch after this list).
  • cache() stores the RDD at storage level MEMORY_ONLY (the default).
  • persist() can store the RDD at MEMORY_ONLY (default), MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER or DISK_ONLY.
  • Unpersist: the rdd.unpersist() method removes the RDD from storage.
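
A small PySpark sketch of the creation routes and the narrow/wide transformation split listed above; the sample sentences and app name are illustrative, and getOrCreate() assumes a local or already-configured cluster.

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.appName("rdd-demo").getOrCreate().sparkContext

    # Creation 1: parallelized collection
    lines = sc.parallelize(["spark makes rdds", "rdds power spark"])

    # Creation 2: transformations on an existing RDD
    words = lines.flatMap(lambda line: line.split())   # narrow: no shuffle needed
    pairs = words.map(lambda w: (w, 1))                # narrow: record stays in its partition
    # Creation 3 would be reading an external dataset, e.g. sc.textFile(<path>)

    # Wide transformation: reduceByKey shuffles records so equal keys land together
    counts = pairs.reduceByKey(lambda a, b: a + b)

    # Action: triggers a job; the shuffle boundary splits it into two stages
    print(counts.collect())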
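
A sketch of caching vs. persisting with an explicit storage level, as described in the last bullets above; the squared-numbers RDD and the chosen storage levels are only illustrative.

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    sc = SparkSession.builder.appName("rdd-cache-demo").getOrCreate().sparkContext
    rdd = sc.parallelize(range(1, 1001)).map(lambda x: x * x)

    # cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
    rdd.cache()
    rdd.count()          # first action computes and stores the partitions

    # later actions reuse the stored partitions instead of recomputing the lineage
    print(rdd.take(3))

    # persist() accepts other storage levels; switch levels only after unpersisting
    rdd.unpersist()
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    rdd.count()

    # unpersist() removes the stored blocks once the RDD is no longer needed
    rdd.unpersist()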
DataFrames

DataSets

DStreams

  • Spark also supports stream processing along with batch processing.
  • DStreams, aka discretized streams, are the basic abstraction of Spark Streaming, used to process real-time/near-real-time data.
  • Data continuously flows from different streaming sources into the application as a series of batches at the given interval.
  • Sources: server logs, telemetry data, IoT device data or messaging systems.
  • DStream Operations: Stateless transformations & Stateful transformations
  • Stateless Transformations: The processing of a batch of streaming data does not depend on previous batches.
  • Stateful Transformations: The computation of a batch of streaming data depends on the values of previous batches over a window period.
  • Checkpoints: A data recovery method that stores metadata or data on reliable storage like HDFS/cloud storage (see the sketch after this list).
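
A minimal Spark Streaming sketch of a stateless per-batch word count and a stateful running count with a checkpoint directory; the socket source, port 9999 and /tmp checkpoint path are assumptions, and this presumes a Spark version that still ships the legacy DStream API.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", appName="dstream-demo")
    ssc = StreamingContext(sc, 5)                    # 5-second batch interval
    ssc.checkpoint("/tmp/spark-checkpoint")          # needed for stateful ops; use HDFS/cloud storage in production

    lines = ssc.socketTextStream("localhost", 9999)  # e.g. fed by `nc -lk 9999`
    pairs = lines.flatMap(lambda l: l.split()).map(lambda w: (w, 1))

    # Stateless: each batch is counted independently of earlier batches
    batch_counts = pairs.reduceByKey(lambda a, b: a + b)
    batch_counts.pprint()

    # Stateful: the running count carries state across batches
    def update(new_values, running):
        return sum(new_values) + (running or 0)

    running_counts = pairs.updateStateByKey(update)
    running_counts.pprint()

    ssc.start()
    ssc.awaitTermination()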