Run your first Spark program using PySpark and Jupyter notebook

I think almost everyone who works with Big Data will cross paths with Spark one way or another. I knew that one day I would have to go on a date with Spark, but somehow I kept postponing it for a long time. That day has come, and I am excited about this new journey.

We recently started supporting offline segmentation in our current project. Basically, segmentation shortlists users based on the conditions given in a segment definition, such as "the user visited page X and did event Y".

You may ask what we do with the segmented users, or why we need segmentation at all. Let me explain with an example from one of our company’s cool & super websites, CarDekho.

Our data team constantly searches for good offers related to cars. Offers differ from city to city and from car model to car model, and each offer has a validity period too. We want to share these with users. If we share every offer, it is literally spamming, so we need to share only the relevant offers with each user. Let’s say we got a good offer for the Maruti Swift in Hyderabad. We need to send a notification to the users who visited our website, mobile site or app last week, researched the Maruti Swift, and are from Hyderabad. So, in this case, we need to get the segmented users. This example segment contains 3 conditions.

We are using the Cassandra database because our application is write-heavy and the data is time series. So, every time, based on the conditions, we need to get the users that belong to the matching partitions and take the UNION for an OR condition or the INTERSECTION for an AND condition.
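The AND/OR combination can be sketched with plain Python sets. This is only an illustration, not our production code: the condition names and user IDs are made up, and in the real system each set would come from a Cassandra partition lookup.

```python
# Hypothetical per-condition results; in practice each set would be
# loaded from the Cassandra partitions that match one condition.
visited_last_week = {"u1", "u2", "u3", "u5"}
researched_swift = {"u2", "u3", "u4"}
from_hyderabad = {"u2", "u3", "u5"}


def combine(operator, user_sets):
    """Combine per-condition user sets: AND -> intersection, OR -> union."""
    if not user_sets:
        return set()
    if operator == "AND":
        return set.intersection(*user_sets)
    if operator == "OR":
        return set.union(*user_sets)
    raise ValueError("unknown operator: %s" % operator)


# The Maruti Swift offer needs all three conditions to hold.
segment = combine("AND", [visited_last_week, researched_swift, from_hyderabad])
print(sorted(segment))
```

Every set here lives in memory at once, which is exactly what stops working when a broad segment covers a large share of the users.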

More than 1 million users visit our company’s Indian car websites, mobile sites & apps every day. So if we want to share something important with any broad segment of users, our application runs out of memory for several reasons, such as RAM and large object space limits. We can do a couple of optimizations, but we know those are temporary fixes. I realized it’s time to meet my future love, Spark. Enough chit-chat, let’s start.


I am using a Mac, so the setup steps are for macOS. I hope you have Homebrew installed on your Mac; if not, follow this link.

Spark is written mostly in Scala, a functional programming language that runs on the JVM, and it uses Hadoop libraries for file access such as HDFS. So we first need to install Java. Run the command below to install Java.

brew cask install java

Right now this installs Java 9 by default. Spark 2.3 targets Java 8, so if you want to install Java 8 instead, run the commands below.

brew tap caskroom/versions
brew cask install java8

To check whether Java is installed correctly, run the command below.

java -version

You will get output like this

java version "1.8.0_102"
Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
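If you want to check the version from a script rather than by eye, here is a minimal sketch that pulls the major version out of that text. The function name java_major_version is my own, and the parsing assumes the `version "..."` format shown in the output above.

```python
import re


def java_major_version(version_output):
    """Return the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Java 8 and older report themselves as 1.x (for example 1.8.0_102).
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


sample = 'java version "1.8.0_102"'
print(java_major_version(sample))
```

Note that `java -version` prints to stderr, not stdout, so capture stderr if you feed this function from a subprocess.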

By default, you will already have Python on your Mac, so you don’t need to install it.

If you like working with the latest software, you will want to work in Python 3. For that, run the command below. If Python was not installed through Homebrew yet, use `brew install python` instead.

brew upgrade python

Previously we needed to download Spark from the Spark site, extract it and set everything up by hand. Now the pyspark package is available through pip, so there is no need to worry about all that. Run the command below to install pyspark.

#If you are using python2 then use `pip install pyspark`
pip3 install pyspark

You will get output like this

$ pip3 install pyspark
Collecting pyspark
Downloading https://files.pythonhosted.org/packages/ee/2f/709df6e8dc00624689aa0a11c7a4c06061a7d00037e370584b9f011df44c/pyspark-2.3.1.tar.gz (211.9MB)
100% |████████████████████████████████| 211.9MB 19kB/s
Collecting py4j==0.10.7 (from pyspark)
Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)
100% |████████████████████████████████| 204kB 158kB/s
Building wheels for collected packages: pyspark
Running setup.py bdist_wheel for pyspark ... done
Stored in directory: /Users/ashoktankala/Library/Caches/pip/wheels/37/48/54/f1b63f0dbb729e20c92f1bbcf1c53c03b300e0b93ca1781526
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.7 pyspark-2.3.1
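Before moving on, it can be worth confirming that the interpreter you plan to use can actually see the new packages. A small sketch using only the standard library; the helper name is_importable is my own and is not part of pyspark:

```python
import importlib.util


def is_importable(module_name):
    """Return True if this interpreter can find the given top-level module."""
    return importlib.util.find_spec(module_name) is not None


for name in ("pyspark", "py4j"):
    print(name, "found" if is_importable(name) else "NOT found")
```

If either line says NOT found, the most likely cause is that pip3 installed into a different Python than the one running this snippet.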

Almost there. One last thing. If you are going to use Spark, you will run a lot of operations and trials on data, so it makes sense to do those in a Jupyter notebook. Run the command below to install Jupyter.

#If you are using python2 then use `pip install jupyter`
pip3 install jupyter

First, we need to know where the pyspark package is installed, so run the command below to find out.

#If you are using python2 then use `pip show pyspark`
pip3 show pyspark

You will get output like this

Name: pyspark
Version: 2.3.1
Summary: Apache Spark Python API
Home-page: https://github.com/apache/spark/tree/master/python
Author: Spark Developers
Author-email: dev@spark.apache.org
License: http://www.apache.org/licenses/LICENSE-2.0
Location: /usr/local/lib/python3.7/site-packages
Requires: py4j
Required-by:
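The Location line is the one that matters for the next step. As a rough sketch, this is how the pyspark directory can be derived from that output in Python. The function name is hypothetical, and the sample text is just a shortened copy of the output above.

```python
import os


def spark_home_from_pip_show(pip_show_output):
    """Build the pyspark directory path from the text of `pip show pyspark`."""
    for line in pip_show_output.splitlines():
        if line.startswith("Location:"):
            # Keep everything after the first colon; paths may contain colons.
            location = line.split(":", 1)[1].strip()
            return os.path.join(location, "pyspark")
    return None


sample = """Name: pyspark
Version: 2.3.1
Location: /usr/local/lib/python3.7/site-packages
Requires: py4j"""
print(spark_home_from_pip_show(sample))
```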

This means pyspark is installed at /usr/local/lib/python3.7/site-packages, so SPARK_HOME will be /usr/local/lib/python3.7/site-packages/pyspark. Now we need to set the SPARK_HOME environment variable. Open .bash_profile using the command

vi ~/.bash_profile

Add below line

export SPARK_HOME=/usr/local/lib/python3.7/site-packages/pyspark

If you want to run the pyspark shell as well, add the line below too.

export PATH=$SPARK_HOME/bin:$PATH
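To confirm the variable is set the way we expect, a quick sanity check along these lines can help. It is only a sketch: check_spark_home is a name I made up, and the only thing it assumes about the layout is the bin directory used in the PATH line above.

```python
import os


def check_spark_home(environ=os.environ):
    """Return a list of problems with SPARK_HOME; an empty list means it looks fine."""
    problems = []
    spark_home = environ.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    elif not os.path.isdir(spark_home):
        problems.append("SPARK_HOME is not a directory: %s" % spark_home)
    elif not os.path.isdir(os.path.join(spark_home, "bin")):
        problems.append("no bin directory inside %s" % spark_home)
    return problems


print(check_spark_home() or "SPARK_HOME looks fine")
```

Run it from a terminal opened after editing .bash_profile, or after running `source ~/.bash_profile`, otherwise the new value is not visible yet.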

In our case, we want to run through Jupyter, and it has to find Spark based on our SPARK_HOME, so we need to install the findspark package. Install it using the command below.

#If you are using python2 then use `pip install findspark`
pip3 install findspark

It’s time to write our first program using pyspark in a Jupyter notebook. Run the command below to start a Jupyter notebook.

jupyter notebook

A new tab will then open automatically in the browser, showing the Jupyter file browser.

Now click on New and then click on Python 3. If you are using Python 2, you will see Python 2 instead of Python 3.

A new tab will open with a new notebook for our program.
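A useful first cell, before any Spark code, is one that shows which Python the notebook is really running. If it differs from the Python that pip3 installed pyspark into, the imports later on will fail. A minimal sketch; interpreter_info is just a helper name I picked:

```python
import sys


def interpreter_info():
    """Describe the Python interpreter that is running this notebook cell."""
    return {
        "executable": sys.executable,
        "version": "%d.%d.%d" % sys.version_info[:3],
    }


info = interpreter_info()
print(info["executable"])
print(info["version"])
```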

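Let’s write a small program that outputs the count of each word in a file. First, create a file named OneSentence.txt in the same folder as the notebook and add a sentence to it. The program looks up this relative file name in the notebook’s working directory. The code for this program is: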

# Locate the pyspark installation (uses SPARK_HOME)
import findspark
findspark.init()
# Creating Spark Context
from pyspark import SparkContext
sc = SparkContext("local", "first app")
# Calculating words count
text_file = sc.textFile("OneSentence.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
# Printing each word with its respective count
output = counts.collect()
for (word, count) in output:
    print("%s: %i" % (word, count))
# Stopping Spark Context
sc.stop()
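To see what each step of that pipeline is doing, here is a plain-Python mirror of the same word count, with no Spark involved. It is a sketch for understanding only: count_words is a name I made up, and it also shows where a relative file name like OneSentence.txt is looked up.

```python
import os
from collections import Counter


def count_words(path):
    """Count words in a text file the way the Spark job does, without Spark."""
    # A relative name is resolved against the current working directory,
    # which for a notebook is normally the folder the notebook lives in.
    full_path = os.path.abspath(path)
    if not os.path.exists(full_path):
        raise FileNotFoundError("Input path does not exist: %s" % full_path)

    counts = Counter()
    with open(full_path) as handle:
        for line in handle:
            # flatMap: one line becomes many words.
            words = line.rstrip("\n").split(" ")
            # map + reduceByKey: emit (word, 1) and sum the ones per word.
            for word in words:
                counts[word] += 1
    return dict(counts)


# Shows the absolute path that the relative file name resolves to.
print(os.path.abspath("OneSentence.txt"))
```

If the Spark program fails with "Input path does not exist", the printed path tells you where the file is expected to be.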

Congrats on your first program with PySpark using Jupyter notebook.

Peace. Happy Coding.

2 thoughts on “Run your first Spark program using PySpark and Jupyter notebook”

  1. Where do we need to place the OneSentence.txt file?

    I get the following error when I use the command
    "counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)"

    ERROR:
    An error occurred while calling o25.partitions.
    : org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/Chandu/OneSentence.txt
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:269)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:269)
    at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:61)
    at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:45)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.base/java.lang.Thread.run(Thread.java:834)

    1. Hi Chandu,

      I didn’t understand the context of your question.

      As far as I can tell, your program is unable to access the file at /Users/Chandu/OneSentence.txt. You should keep your file at this path.
