How to set up PySpark on Windows?

Saumya Goyal

PySpark setup and Jupyter Notebook Integration

Apache Spark is a widely used engine for big data processing. But why do we need it? First, we have produced and consumed a huge amount of data over the past decade and a half. Second, we decided to use this data for decision-making and better predictions. As the amount of data grows, so does the need for infrastructure that can process it efficiently and quickly (oh, the impatient Homo sapiens!).

Apache Spark is an open-source engine and was released by the Apache Software Foundation in 2014 for handling and processing a humongous amount of data. Currently, Apache Spark provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. Spark also supports higher-level tools including Spark SQL for SQL and structured data processing, and MLlib for machine learning, to name a few.

Spark helps by partitioning the data across the nodes of a cluster and processing the partitions in parallel, which lets it work through gigabytes and terabytes of data with low latency. You can read further about the features and usage of Spark here.

But what is PySpark?

To put it in simple words, PySpark is the Python API for Spark. It exposes almost all Apache Spark features. Because of the simplicity of Python and the efficient processing of large datasets by Spark, PySpark became a hit among data science practitioners, who mostly like to work in Python.

What is wrong with “pip install pyspark”?

Well, we (Python coders) love Python partly because of its rich libraries and easy one-step installation. PySpark is a bit different: you can still use the above-mentioned command, but what you get is limited. With pip, you install only the PySpark package, which can be used to test your jobs locally or to run them on an existing cluster managed by YARN, Standalone, or Mesos. It does not contain the tools needed to set up your own cluster, which is a capability you want to have as a beginner.

If you want PySpark with all its features, including starting your own cluster, then follow this blog further…

PySpark Installation

Dependencies of PySpark for Windows system include:

  1. JAVA
  2. Python
  3. PySpark
  4. Winutils

1. Download and Install JAVA

Spark runs on the Java Virtual Machine, so it depends on Java. Install a Java version that your Spark release supports (Spark 3.2, used in this post, runs on Java 8 or 11; I use Java 8).

  • JAVA Download Link: here
  • Install JAVA by running the downloaded file (easy and traditional browse…next…next…finish installation)
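Once the installer finishes, you may want to confirm that Java is actually reachable from the command prompt. The snippet below is a minimal sketch using only the Python standard library; the function name `java_version_banner` is my own and not part of any tool mentioned in this post.

```python
import shutil
import subprocess


def java_version_banner():
    """Return the text printed by `java -version`, or None if java is not on PATH."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` writes its banner to stderr rather than stdout.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return (result.stderr or result.stdout).strip()


banner = java_version_banner()
print(banner if banner else "java was not found on PATH")
```

If it prints "java was not found on PATH", the installer probably did not add Java to PATH, or the command prompt was opened before the installation and needs to be restarted.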

2. Download and Install Python

If you are going to work on a data science related project, I recommend you download Python and Jupyter Notebook together with the Anaconda Navigator.

  • Anaconda Download Link: here
  • Follow the self-explanatory traditional installation steps (same as above)

Otherwise, you can also download Python and Jupyter Notebook separately

  • Python Download link: here
  • Run the downloaded file for installation, make sure to check the option that adds Python to PATH (labelled “Add Python to PATH” or similar), and install the recommended packages (including ‘pip’)

To see if Python was successfully installed and is in the PATH environment variable, go to the command prompt and type “python”. You should see the Python version banner followed by the >>> prompt (my Python version is 3.8.5; yours could be different).

In case you do not see that output, please follow this tutorial for help.

Next, you will need Jupyter Notebook installed to learn the integration with PySpark.

Install Jupyter Notebook by typing the following command on the command prompt: “pip install notebook”

3. Download and unzip PySpark

Finally, it is time to get PySpark. From the link provided below, download the .tgz file using bullet point 3 on the download page. You can choose the version from the drop-down menus. Then download 7-Zip or any other extractor and extract the downloaded PySpark file. Remember, you will have to extract the file twice: once to unpack the .tgz into a .tar file, and once more to unpack the .tar file into the actual folder.

Source: Apache Spark
  • PySpark Download Link: here
  • 7zip Download Link: here

Note: The location where I extracted PySpark is
“E:\PySpark\spark-3.2.1-bin-hadoop3.2” (we will need it later)
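If you would rather not install an extra extractor, Python's standard `tarfile` module can unpack a .tgz in a single step. This is only a sketch of that alternative, not the method used in this post; the function name `extract_tgz` and the download path in the usage comment are hypothetical.

```python
import tarfile
from pathlib import Path


def extract_tgz(archive, destination):
    """Extract a .tgz archive into destination and return its top-level names."""
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    # "r:gz" decompresses the gzip layer and reads the tar layer in one pass,
    # so there is no intermediate .tar file.
    with tarfile.open(archive, "r:gz") as tar:
        top_level = sorted({member.name.split("/")[0] for member in tar.getmembers()})
        tar.extractall(destination)
    return top_level


# Hypothetical usage; adjust both paths to your own machine:
# extract_tgz(r"C:\Users\me\Downloads\spark-3.2.1-bin-hadoop3.2.tgz", r"E:\PySpark")
```

Only extract archives you downloaded from a source you trust, since `extractall` writes whatever paths the archive contains.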

4. Download winutils.exe

To run Apache Spark locally on the Windows operating system, winutils.exe is required. This is because Spark relies on parts of the Hadoop codebase that expect a POSIX-style file system, which Windows does not provide. These Windows utilities (winutils) manage the POSIX (Portable Operating System Interface) file system permissions that HDFS (Hadoop Distributed File System) requires from the local (Windows) file system.

Too technical? Just download it. Make sure to select the correct Hadoop version.

  • winutils.exe Download Link: here
  • Create a folder structure hadoop\bin within the PySpark folder and put the downloaded winutils.exe file there.

Note: The location of my winutils.exe is
“E:\PySpark\spark-3.2.1-bin-hadoop3.2\hadoop\bin”
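Before moving on, it can help to check that the files ended up where the later steps expect them. The sketch below looks for `winutils.exe` under `hadoop\bin` and for `bin\pyspark.cmd`, the Windows launcher that Spark distributions ship; the function name `missing_spark_files` is my own.

```python
from pathlib import Path


def missing_spark_files(spark_home):
    """Return the expected files that are absent under spark_home."""
    spark_home = Path(spark_home)
    expected = [
        spark_home / "bin" / "pyspark.cmd",              # launcher shipped with Spark
        spark_home / "hadoop" / "bin" / "winutils.exe",  # copied in manually in this step
    ]
    return [str(path) for path in expected if not path.is_file()]


# Usage with my folder; replace it with your own location:
# print(missing_spark_files(r"E:\PySpark\spark-3.2.1-bin-hadoop3.2"))
```

An empty list means both files are in place.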

5. Set Environment variables

Now that we have downloaded everything we need, it is time to make it accessible through the command prompt by setting the environment variables.

Some Side Info: What are Environment variables?

Environment variables are named values that the operating system makes available to the processes running under it; they can be defined for the whole system or for a single user.

PATH is the most frequently used environment variable. It stores a list of directories to search for executable programs (.exe files). To reference a variable in Windows, you can use %varname%.

Some more side info: What does PATH do?

When you launch an executable program (with a file extension of ".exe", ".bat" or ".com") from the command prompt, Windows searches for it in the current working directory, followed by all the directories listed in the PATH environment variable. If the program is not found in any of these directories, you will get an error saying that the command is not recognized.
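The lookup described above can be sketched in a few lines of Python. This is a simplified illustration, not how Windows implements it: it skips the current-directory check and uses a fixed list of extensions, and the function name `find_on_path` is my own. In real scripts, the standard library's `shutil.which` does this job.

```python
import os


def find_on_path(program, path=None, extensions=(".exe", ".bat", ".com")):
    """Return the first matching file in the PATH directories, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    # PATH is one string; os.pathsep (";" on Windows) separates the directories.
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        # Try the name as typed first, then with each executable extension.
        for ext in ("",) + tuple(extensions):
            candidate = os.path.join(directory, program + ext)
            if os.path.isfile(candidate):
                return candidate
    return None


print(find_on_path("python"))
```

A result of `None` corresponds to the "not recognized" error: no directory in PATH contains a file with that name.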

Back to the PySpark installation. To set the environment variables:

  • Go to Windows search
  • Type “env”; it will show “Edit environment variables for your account”. Click on it
  • Click on “New” under the user variables and add the following variable names and values (the values depend on where you put the downloaded files)
    SPARK_HOME: the folder where PySpark was extracted (for me, E:\PySpark\spark-3.2.1-bin-hadoop3.2)
    HADOOP_HOME: the hadoop folder inside the PySpark folder, referenced through SPARK_HOME (%SPARK_HOME%\hadoop)
    PYSPARK_PYTHON (the variable Spark reads for this): the python.exe file for PySpark to use

Next, update the PATH variable with the \bin folders that contain the executable files of PySpark and Hadoop. This lets you run PySpark from the command prompt.

  • Click on the “Path” variable
  • Then add the following two values (we are using the previously defined environment variables here)
    %SPARK_HOME%\bin
    %HADOOP_HOME%\bin
Adding \bin directory to the PATH which contains the executables
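To see what the two PATH entries turn into once Windows substitutes the variables, here is a small sketch that mimics the %NAME% expansion in Python. The helper names are my own, and the values in `env` are simply my folder locations from earlier in this post; Windows itself treats variable names case-insensitively, which this sketch does not.

```python
import re


def expand_windows_vars(value, env):
    """Replace %NAME% references using env; unknown names are left untouched."""
    return re.sub(r"%([^%]+)%", lambda m: env.get(m.group(1), m.group(0)), value)


def spark_path_entries(env):
    """Return the two bin directories that should end up on PATH."""
    return [
        expand_windows_vars(r"%SPARK_HOME%\bin", env),
        expand_windows_vars(r"%HADOOP_HOME%\bin", env),
    ]


env = {
    "SPARK_HOME": r"E:\PySpark\spark-3.2.1-bin-hadoop3.2",
    "HADOOP_HOME": r"E:\PySpark\spark-3.2.1-bin-hadoop3.2\hadoop",
}
for entry in spark_path_entries(env):
    print(entry)
# E:\PySpark\spark-3.2.1-bin-hadoop3.2\bin
# E:\PySpark\spark-3.2.1-bin-hadoop3.2\hadoop\bin
```

The second line is the folder that holds winutils.exe, which is why HADOOP_HOME has to point at the hadoop folder and not at hadoop\bin.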

6. Let's fire PySpark!

Test if PySpark has been installed correctly and all the environment variables are set.

  • Go to command prompt
  • Type “pyspark”
PySpark running with my Python version 3.8.5

Great! You have now installed PySpark successfully and it seems to be running. To see the Spark web UI, go to “http://localhost:4040” in your browser without closing the command prompt and check for yourself.

PySpark is up and running on port 4040 of your localhost:127.0.0.1
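If the page does not load, a quick way to tell whether anything is listening on that port is a plain TCP connection attempt. The sketch below uses only the standard library; the function name `is_port_open` is my own. Note that Spark moves on to the next port (4041, 4042, and so on) when 4040 is already taken by another Spark session.

```python
import socket


def is_port_open(host="127.0.0.1", port=4040, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused or timed out: nothing is listening there.
        return False


print("Spark UI reachable:", is_port_open())
```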

7. Jupyter Notebook integration with PySpark

Now, once PySpark is running in the background, you could open a Jupyter Notebook and start working in it. But importing PySpark there will still throw an error, because the notebook's Python does not know where the Spark installation is. In that case, you will have to install the Python library “findspark” and use the following two commands before the PySpark import statements in the Jupyter Notebook.

import findspark
findspark.init()
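To make those two lines less magical: as far as I understand it, `findspark.init()` locates the Spark folder (for example through SPARK_HOME) and adds Spark's bundled Python packages to `sys.path` so that `import pyspark` works. The snippet below is my own rough sketch of that idea, not findspark's actual code, and the function name `add_pyspark_to_sys_path` is hypothetical.

```python
import glob
import os
import sys


def add_pyspark_to_sys_path(spark_home):
    """Put Spark's bundled Python packages on sys.path; return the entries added."""
    spark_python = os.path.join(spark_home, "python")
    # Spark ships py4j (the Python-to-JVM bridge) as a zip under python\lib.
    py4j_zips = glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip"))
    added = []
    for entry in [spark_python] + py4j_zips[:1]:
        if entry not in sys.path:
            sys.path.insert(0, entry)
            added.append(entry)
    return added


# Usage, assuming SPARK_HOME was set as described in step 5:
# add_pyspark_to_sys_path(os.environ["SPARK_HOME"])
# import pyspark
```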

But there is an alternative. You can configure PySpark to fire up a Jupyter Notebook that is already connected to the current Spark session by running just the command “pyspark” on the command prompt. You will not have to download any additional libraries for this, but you will need to add two more environment variables:

  • PYSPARK_DRIVER_PYTHON: tells the “pyspark” command to run Jupyter (typically set to jupyter)
  • PYSPARK_DRIVER_PYTHON_OPTS: the additional options, such as the port, with which Jupyter Notebook should run (typically starts with notebook)

Now, when you run the “pyspark” in the command prompt:

  • It will give information on how to open the Jupyter Notebook.
  • Just copy the URL (highlight and use CTRL+c) and paste it into the browser along with the token information — this will open Jupyter Notebook.
pyspark command fires up Jupyter Notebook instance

8. Running a sample code on the Jupyter Notebook

Let's make sure everything is working and that you are ready to use PySpark from your Jupyter Notebook.

  • Run Pyspark through the command prompt
  • Open Jupyter Notebook
  • Write the following commands and execute them
# Import libraries
from pyspark.sql import SparkSession

# Get the running Spark session, or create one if none exists
spark_session = SparkSession.builder.getOrCreate()

# Add data
data = [
    (1580, "John", "Doe", "Mars"),
    (5820, "Jane", "Doe", "Venus"),
    (2340, "Kid1", "Doe", "Jupyter"),
    (7860, "Kid2", "Doe", "Saturn"),
]

# Set up the DataFrame
user_data_df = spark_session.createDataFrame(data)

# Display the DataFrame
user_data_df.show()
Just a sample code
Spark displays the job carried out

My Version information

  • Python: 3.8.5
  • JAVA: 1.8.0_331
    Java™ SE Runtime Environment (build 1.8.0_331-b09)
    Java HotSpot™ 64-Bit Server VM (build 25.331-b09, mixed mode)
  • PySpark: 3.2.1 (spark-3.2.1-bin-hadoop3.2.tgz)
  • Hadoop winutils.exe: 3.2.1
  • Jupyter:
    IPython : 7.30.1
    ipykernel : 6.6.0
    jupyter_client : 7.0.6
    jupyter_core : 4.9.1
    notebook : 6.4.6

CONGRATULATIONS! You were able to set up the environment for PySpark on your Windows machine.

Please write in the comment section if you face any issues.

Written by Saumya Goyal

MLOps Engineer at BSH/Datamics | Writes about Tech and career| TUM Graduate