Spark Setup

How to set up Spark.

  • Use the Spark distribution packaged without the Hadoop & Scala components
  • Set up Spark against an externally provided Hadoop HDFS installation and Scala SDK (client-provided/remote/third-party)
  • Spark Version - 2.4.8
  • For Scala Spark: install and configure Scala SDK version 2.12.0
  • For PySpark: install Python with a version between 3.4 and 3.7
  • The pyenv tool (covered below) makes installing a specific Python version easy.


1. Install Scala 2.12.0 on Ubuntu
Download the Scala .deb installation file for Ubuntu (see the sketch below if you don't have it yet) and install it:
$ sudo dpkg -i /home/mkm/Downloads/scala-2.12.0.deb
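
If the .deb isn't already in ~/Downloads, it can be fetched first; the URL below follows the usual Lightbend download pattern for Scala 2.12.0, so verify it before relying on it:
$ wget -P ~/Downloads https://downloads.lightbend.com/scala/2.12.0/scala-2.12.0.deb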

2. Check the installation directory
$ whereis scala
scala: /usr/bin/scala /usr/share/scala /usr/share/man/man1/scala.1.gz

3. Update .bashrc for Scala
$ gedit ~/.bashrc
export PATH=$PATH:/usr/share/scala/bin
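
For the PATH change to take effect in the current terminal, reload the file:
$ source ~/.bashrc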

4. Check Scala version
$ scala -version
Scala code runner version 2.12.0

To enter the Scala console:
$ scala
If you get a NumberFormatException mentioning the input string 0x100
(a known terminal-detection issue in the Scala REPL),
relaunch the Scala shell after adding the property below to .bashrc:
export TERM=xterm-color

5. Download and extract spark-2.4.8 without Hadoop & Scala
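
A sketch of this step, assuming the Apache archive's usual naming for the "without hadoop" build (verify the URL on archive.apache.org) and renaming the extracted folder to match the SPARK_HOME used below:
$ wget https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-without-hadoop.tgz
$ tar -xzf spark-2.4.8-bin-without-hadoop.tgz -C /home/mkm/mm/softwares
$ mv /home/mkm/mm/softwares/spark-2.4.8-bin-without-hadoop /home/mkm/mm/softwares/spark-2.4.8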

6. Update .bashrc for Spark
export SPARK_HOME=/home/mkm/mm/softwares/spark-2.4.8
export PATH=$PATH:$SPARK_HOME/bin

7. Edit $SPARK_HOME/conf/spark-env.sh (if the file does not exist, copy it from the spark-env.sh.template shipped in the same directory)

# To overcome the console formatting exception (the same terminal issue noted in step 4)
export TERM=xterm-color

# To include the externally provided Hadoop installation
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_DIST_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath):"$HADOOP_HOME/share/hadoop/tools/lib/*"
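
The "without hadoop" build ships no Hadoop jars of its own, so SPARK_DIST_CLASSPATH must supply them at launch time; you can preview what the variable will expand to with:
$ $HADOOP_HOME/bin/hadoop classpath
This prints the colon-separated list of Hadoop configuration directories and jars that Spark will place on its classpath.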

8. Additional properties to set in .bashrc

export CLASSPATH=$CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*:.
export CLASSPATH=$CLASSPATH:$HIVE_HOME/lib/*:.
export CLASSPATH=$CLASSPATH:$SPARK_HOME/jars/*:.

9. Start the Scala Spark shell
$ spark-shell
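
As a quick smoke test (a sketch that simply pipes one Scala expression into the shell's stdin), confirm the REPL starts and reports the expected version:
$ echo 'println("Spark " + sc.version)' | spark-shell
This should print Spark 2.4.8 among the startup logs.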




PySpark Setup

1. Python support for Spark 2.4.8
min version - 3.4
max version - 3.7
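
Since PySpark in the 2.4.x line does not support Python 3.8+, check first whether the system Python already falls in this range:
$ python3 --version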

2. Install pyenv - it easily installs and manages different Python versions
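
A minimal installation sketch using pyenv's official installer script, assuming curl and git are already present:
$ curl https://pyenv.run | bash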

3. pyenv install 3.7.13
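
pyenv builds Python from source, so on a fresh Ubuntu machine the step above usually needs build dependencies installed first; the package list below is the commonly suggested set from pyenv's documentation and may need adjusting for your release:
$ sudo apt-get install -y build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev libffi-dev liblzma-dev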

4. Add the below properties to .bashrc
export PYENV_ROOT=~/.pyenv
export PATH=$PATH:$PYENV_ROOT/bin
export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python:/home/mkm/.pyenv/versions/3.7.13/lib/python3.7
export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip
export PYSPARK_PYTHON=python3
export PATH=$PATH:/home/mkm/.pyenv/versions/3.7.13/lib/python3.7
eval "$(pyenv init --path)"

5. Open a new terminal and start the PySpark shell
$ pyspark
Python 3.7.13 (default, Aug  5 2024, 10:41:15)
[GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
24/08/05 21:27:19 WARN util.Utils: Your hostname, mkm-virtual-machine resolves to a loopback address: 127.0.1.1; using 192.168.62.130 instead (on interface ens33)
24/08/05 21:27:19 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
24/08/05 21:27:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to Spark version 2.4.8
Using Python version 3.7.13 (default, Aug  5 2024 10:41:15)
SparkSession available as 'spark'.
>>>
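
If the loopback warning above matters in your setup, SPARK_LOCAL_IP can be exported (in .bashrc or the terminal) before launching to pin the bind address; the value below is only illustrative:
export SPARK_LOCAL_IP=192.168.62.130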


Compiled on TUESDAY, 06-AUGUST-2024, 12:29:05 AM IST
