I think almost everyone who works with Big Data will cross paths with Spark one way or another. I knew I would have to go on a date with Spark someday, but somehow I kept postponing it for a long time. That day has come, and I am excited about this new journey.
We recently started supporting offline segmentation in our current project. Basically, segmentation shortlists users based on the conditions given in a segment definition, such as "user visited page X and did event Y".
You may ask what we do with the segmented users, or why we need segmentation at all. Let me explain with an example from one of our company’s cool and super websites, CarDekho.
Our data team constantly searches for good offers on cars. Offers differ from city to city and from car model to car model, and they have a validity period too. We want to share these offers with users, but sharing every offer with everyone is literally spamming, so we need to send each user only the relevant ones. Let’s say we get a good offer for the Maruti Swift in Hyderabad. We need to send a notification to the users who visited our website, mobile site, or app last week, researched the Maruti Swift, and are from Hyderabad. In other words, we need a user segment, and this example segment contains 3 conditions.
We use the Cassandra database because our application is write-heavy and the data is time-series data. So every time, for each condition, we need to fetch the users that belong to the matching partitions, then take the UNION for an OR condition and the INTERSECTION for an AND condition.
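The combining step can be sketched in plain Python with sets. This is only an illustration, not our production code: the user IDs, the three per-condition sets, and the `combine` helper are all made up for the example, and in the real system each set would be read from Cassandra.

```python
from functools import reduce

# Hypothetical per-condition results: each set holds the user IDs that one
# condition matched (in the real system these would come from Cassandra).
visited_last_week = {"u1", "u2", "u3", "u4"}
researched_swift = {"u2", "u3", "u5"}
from_hyderabad = {"u3", "u4", "u5"}


def combine(user_sets, operator):
    """Combine per-condition user sets: AND = intersection, OR = union."""
    if not user_sets:
        return set()
    if operator == "AND":
        return reduce(set.intersection, user_sets)
    if operator == "OR":
        return reduce(set.union, user_sets)
    raise ValueError("unknown operator: %s" % operator)


# The example segment joins its 3 conditions with AND.
segment = combine([visited_last_week, researched_swift, from_hyderabad], "AND")
print(sorted(segment))
```

With everything held in memory like this, a broad segment means very large sets on a single machine, which is exactly where the trouble starts.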
More than 1 million users visit our company’s Indian car websites, mobile sites, and apps every day. So if we want to share something important with a broad segment of users, our application runs out of memory for several reasons, such as limited RAM and the large object space limit. We can do a couple of optimizations, but we know those are temporary fixes. I realized it’s time to meet my future love, Spark. Enough chit-chat, let’s start.
I am using a Mac, so the setup steps are for macOS. I hope you have Homebrew installed on your Mac; if not, install it first.
Spark is written mostly in Scala, a functional programming language that runs on the JVM, and it can run on top of Hadoop/HDFS. So we first need to install Java. Run the command below to install Java.
brew cask install java
Right now Java 9 is installed by default. If you want to install Java 8 instead, run the commands below.
brew tap caskroom/versions
brew cask install java8
To check whether Java was installed correctly, run the command below.
java -version
You will get output like this
java version "1.8.0_102"
Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
By default, you will already have Python, so you don’t need to install it.
If you like working with the latest software, you will prefer Python 3. For that, run the command below.
brew upgrade python
Previously, we had to download Spark from the Spark site, extract it, and set everything up by hand. Now the pyspark package is available, so there is no need to worry about all that. Run the command below to install pyspark.
#If you are using python2 then use `pip install pyspark`
pip3 install pyspark
You will get output like this
$ pip3 install pyspark
Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/ee/2f/709df6e8dc00624689aa0a11c7a4c06061a7d00037e370584b9f011df44c/pyspark-2.3.1.tar.gz (211.9MB)
    100% |████████████████████████████████| 211.9MB 19kB/s
Collecting py4j==0.10.7 (from pyspark)
  Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)
    100% |████████████████████████████████| 204kB 158kB/s
Building wheels for collected packages: pyspark
  Running setup.py bdist_wheel for pyspark ... done
  Stored in directory: /Users/ashoktankala/Library/Caches/pip/wheels/37/48/54/f1b63f0dbb729e20c92f1bbcf1c53c03b300e0b93ca1781526
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.7 pyspark-2.3.1
Almost there. One last thing. If you are going to use Spark, you will run a lot of operations and trials on data, so it makes sense to do those in a Jupyter notebook. Run the command below to install Jupyter.
#If you are using python2 then use `pip install jupyter`
pip3 install jupyter
First, we need to know where the pyspark package is installed, so run the command below to find out.
#If you are using python2 then use `pip show pyspark`
pip3 show pyspark
You will get output like this
Name: pyspark
Version: 2.3.1
Summary: Apache Spark Python API
Home-page: https://github.com/apache/spark/tree/master/python
Author: Spark Developers
Author-email: email@example.com
License: http://www.apache.org/licenses/LICENSE-2.0
Location: /usr/local/lib/python3.7/site-packages
Requires: py4j
Required-by:
This means pyspark is installed at /usr/local/lib/python3.7/site-packages, so SPARK_HOME will be /usr/local/lib/python3.7/site-packages/pyspark. Now we need to set the SPARK_HOME environment variable. Open the .bash_profile file in your home directory with a text editor.
Add the line below.
export SPARK_HOME=/usr/local/lib/python3.7/site-packages/pyspark
If you want to run the pyspark shell, then add the line below too.
In our case, we want to run Spark through Jupyter, which has to find Spark based on our SPARK_HOME, so we need to install the findspark package. Install it using the command below.
#If you are using python2 then use `pip install findspark`
pip3 install findspark
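Since everything from here on depends on SPARK_HOME, a quick check from Python can save some head-scratching. The snippet below is just a sketch: `check_spark_home` is a hypothetical helper written for this post, not part of findspark or pyspark, and it only tests that the variable is set and points to an existing directory.

```python
import os


def check_spark_home(env=os.environ):
    """Return (ok, message) telling whether SPARK_HOME looks usable."""
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        return False, "SPARK_HOME is not set"
    if not os.path.isdir(spark_home):
        return False, "SPARK_HOME points to a missing directory: %s" % spark_home
    return True, "SPARK_HOME looks fine: %s" % spark_home


# Check the environment of the current process.
ok, message = check_spark_home()
print(message)
```

If it reports that the variable is not set, remember that changes to .bash_profile only apply to terminals opened after the edit.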
It’s time to write our first program using pyspark in a Jupyter notebook. Run the command below to start a Jupyter notebook.
jupyter notebook
A new tab then opens automatically in the browser, showing the Jupyter dashboard with the files of the current directory.
Now click on New and then click on Python 3. If you are using Python 2 then you will see Python instead of Python 3.
A new tab then opens with a fresh notebook for our program.
Let’s write a small program that outputs the count of each word in a file. First, create a file named OneSentence.txt in the directory where the notebook runs and add a sentence to it. The code for this program is:
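If you prefer to create the input file from the notebook itself, a small sketch like the following does the job. The `write_sample_file` helper and the sample sentence are made up for illustration; only the file name OneSentence.txt matters, because the word count program reads that name.

```python
from pathlib import Path


def write_sample_file(path="OneSentence.txt",
                      sentence="to be or not to be that is the question"):
    """Write a one-sentence text file and return its path."""
    target = Path(path)
    target.write_text(sentence + "\n", encoding="utf-8")
    return target


# Creates OneSentence.txt in the current working directory.
sample = write_sample_file()
print(sample.resolve())
```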
# Locate the pyspark installation
import findspark
findspark.init()

# Create the Spark context
from pyspark import SparkContext
sc = SparkContext("local", "first app")

# Count the words
text_file = sc.textFile("OneSentence.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

# Print each word with its respective count
output = counts.collect()
for (word, count) in output:
    print("%s: %i" % (word, count))

# Stop the Spark context
sc.stop()
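To sanity-check the numbers Spark prints, the same counting logic can be written in plain Python without Spark. This is only a cross-check sketch: the `word_count` function and the sample sentence are made up for the example. It splits each line on a single space, as the Spark program does, so the counts should match even if the words come out in a different order.

```python
from collections import Counter


def word_count(lines):
    """Count words the same way as the Spark job: split each line on a space."""
    counts = Counter()
    for line in lines:
        counts.update(line.split(" "))
    return dict(counts)


# Hypothetical sample input; use the sentence from your own file instead.
sample_lines = ["to be or not to be"]
for word, count in word_count(sample_lines).items():
    print("%s: %i" % (word, count))
```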
Congrats on your first program with PySpark using Jupyter notebook.
Peace. Happy Coding.