My Pandas Cheat Sheet for Data Science in Python

Recently I spent most of my time on the data preparation side of our model building, and I used pandas heavily. A handful of functions came up again and again to get my tasks done. Let me share them with you; I hope they help you in your data preparation journey.

If you came here to read this article, you probably already know a bit about pandas and why so many people use it to work with data, so I don’t think I need to give an introduction. But if you ask me for my opinion on pandas in one sentence, I will say this

pandas is a great software library/package for data manipulation and analysis.

Don’t forget to install the pandas package. To install it, run the command below

pip3 install pandas

You need to import pandas before you can use it, so let’s import it first

import pandas as pd

We added as pd to the import statement so that you can use pandas’ superpowers by simply typing pd instead of pandas.

Reading JSON data

Most of the time we don’t have all the data we want, so we scrape it and usually store the results as JSON. You can load a JSON file easily with pandas like this

df = pd.read_json('data.json')

Here data.json looks like this

[
  {
    "_id": "5e91936d617c639d56d6e40a",
    "index": 0,
    "guid": "7fe17eae-c86f-4b99-896a-018636d54f59",
    "isActive": true,
    "balance": "$1,814.13",
    "picture": "http://placehold.it/32x32",
    "age": 33,
    "eyeColor": "blue",
    "name": "Britney Cote",
    "gender": "female",
    "company": "PANZENT",
    "email": "britneycote@panzent.com",
    "phone": "+1 (907) 524-3223",
    "address": "266 Church Lane, Swartzville, Massachusetts, 7719",
    "about": "Dolore aliquip sint ut proident. Ullamco sit eiusmod laborum ullamco sint adipisicing culpa eu laborum adipisicing nisi consequat non culpa. Dolor est fugiat ad dolore minim irure quis. Minim nulla sit ut elit eiusmod dolor.\r\n",
    "registered": "2018-07-24T02:51:54 -06:-30",
    "latitude": 35.545005,
    "longitude": -10.315414,
    "tags": [
      "enim",
      "dolor",
      "fugiat",
      "velit",
      "excepteur",
      "culpa",
      "nisi"
    ],
    "friends": 127,
    "greeting": "Hello, Britney Cote! You have 5 unread messages.",
    "favoriteFruit": "apple"
  },
......
]

Sometimes we don’t store JSON data in proper JSON format. Let’s say we stored it like this

{"_id": "5e91991ba38b52e0fb47bb7f","index": 0,"guid": "554903ee-525d-47cb-b09d-bf3cc15aec3d"}
{"_id": "5e91991bfc6a941b23db09b0","index": 1,"guid": "9bdd26f6-d14c-4ab8-bcd3-d42b27599fbd"}
{"_id": "5e91991bfc4766a4b3ea78ea","index": 2,"guid": "a3fee0e1-7b15-4f8b-8901-eaef3ea466c8"}

Then the code below will save you

df = pd.read_json('data_not_in_proper_format.json', lines=True)

With lines=True, pandas reads the file line by line, treats each line as one JSON object, and loads them all into the data frame.
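A minimal, self-contained sketch (with made-up records) shows the lines=True behaviour:

```python
import io
import pandas as pd

# Three records stored as JSON Lines: one JSON object per line
raw = '\n'.join([
    '{"_id": "a1", "index": 0}',
    '{"_id": "a2", "index": 1}',
    '{"_id": "a3", "index": 2}',
])

# lines=True makes pandas parse each line as a separate JSON object
df = pd.read_json(io.StringIO(raw), lines=True)
print(df.shape)  # (3, 2)
```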

Data Display

If you want to see the loaded data, run the code below

df.head()
(Output: the first 5 rows of the data frame, one column per JSON field: _id, index, guid, isActive, balance, picture, age, eyeColor, name, gender, phone, address, about, registered, latitude, longitude, tags, friends, greeting, favoriteFruit.)

Here it displays the first 5 rows of data. If you want to see only one row, you need to run this

df.head(1)

If you want to see the last 2 rows, then

df.tail(2)

If there are many columns, as above, some of them won’t be shown and you will see “…” instead. To see all columns, the code below comes in handy

pd.set_option('display.max_columns', None)

Just like columns, the number of rows displayed is also restricted. If you want to see all rows, the code below is here to save you

pd.set_option('display.max_rows', None)
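If you raise these limits for a quick look, pd.reset_option restores the library default afterwards. A small sketch:

```python
import pandas as pd

pd.set_option('display.max_rows', None)   # show every row
print(pd.get_option('display.max_rows'))  # None

# Restore the library default when you are done
pd.reset_option('display.max_rows')
print(pd.get_option('display.max_rows'))
```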

Reading data from MySQL

If you need to read data from a MySQL database and load it into a data frame, then

import mysql.connector

mydb = mysql.connector.connect(
  host="localhost",
  user="root",
  passwd="root",
  database="test"
)

First, you need to create a MySQL connection; for that we provide the database configuration details.
I usually use the mysql-connector-python package. To run the code above, you first need to install that package with the command below

pip3 install mysql-connector-python

Once the connection is created, you just need to provide the query; the rest is handled by pandas for you

df = pd.read_sql("SELECT * FROM table_name", mydb)

Reading CSV Data

To read data from a CSV file

df = pd.read_csv('data.csv', sep=';', header=0)

By default pandas takes a comma as the separator; if you want it to use a different one, mention it like sep=';' as in the code above.
header=0 means the names of the columns are in the first row of the file

Writing Data to CSV file

If you want to write the data in the data frame to a CSV file, then

df.to_csv('data.csv', sep=';', index=False)

By default, whenever pandas writes a data frame to a CSV file, it writes the row index too. To avoid this, index=False is required.

Understanding the Data

To understand what columns you have, their types, and which columns have values in every record and which don’t

df.info()

You will get output like this

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   _id            19 non-null     object 
 1   index          19 non-null     int64  
 2   guid           19 non-null     object 
 3   isActive       19 non-null     bool   
 4   balance        19 non-null     object 
 5   picture        19 non-null     object 
 6   age            19 non-null     int64  
 7   eyeColor       16 non-null     object 
 8   name           19 non-null     object 
 9   gender         19 non-null     object 
 10  company        19 non-null     object 
 11  email          19 non-null     object 
 12  phone          19 non-null     object 
 13  address        19 non-null     object 
 14  registered     19 non-null     object 
 15  latitude       19 non-null     float64
 16  longitude      19 non-null     float64
 17  tags           19 non-null     object 
 18  friends        19 non-null     int64  
 19  greeting       19 non-null     object 
 20  favoriteFruit  19 non-null     object 
 21  about          17 non-null     object 
dtypes: bool(1), float64(2), int64(3), object(16)
memory usage: 3.3+ KB

If you want only the column names and nothing else, this small piece of code is here to help

for column in df.columns:
    print(column)

To know the unique values in a column

df['eyeColor'].unique()

Output:

array(['blue', nan, 'green', 'brown'], dtype=object)

To get the counts of unique values in a column (like GROUP BY)

df['eyeColor'].value_counts()

Output:

blue     10
green     3
brown     3
Name: eyeColor, dtype: int64

We have 19 rows, but if we total the counts above we get only 16, because by default it doesn’t count None values. To get the None (or NaN) count too, you need to add dropna=False

df['eyeColor'].value_counts(dropna=False)

Output:

blue     10
green     3
brown     3
NaN       3
Name: eyeColor, dtype: int64

To drop unnecessary columns

df = df.drop(['index','guid','picture','email','phone','latitude','longitude','address','registered','tags'], axis='columns')

Whenever we operate at the column level, axis='columns' or axis=1 needs to be added

Now the data frame looks like this

(Output: the first 5 rows with the remaining columns: _id, isActive, balance, age, eyeColor, name, gender, company, friends, greeting, favoriteFruit, about.)

Data imputation

Data imputation plays a major role in preparing data for a model. These code snippets helped me a lot on my data imputation journey

To change the value of a column based on a condition

df.loc[df['_id']=='5e91936d617c639d56d6e40a',['isActive']] = False

Output:

(Output: the same rows, with isActive now False in the first row.)
You can observe that the isActive value of the row with _id 5e91936d617c639d56d6e40a changed from True to False

Sometimes we get None values, which we usually fill with a typical value. For example, most people have brown coloured eyes, and some records have None in the eyeColor column. To fill it

df['eyeColor'].fillna('brown', inplace=True)

Output:

(Output: the same rows, with eyeColor now brown in the 2nd row.)
You can see the 2nd row’s previous value was NaN; now it is filled with brown.

We used inplace=True here. Usually, whatever operation we do on a data frame returns a new data frame. If instead you want to change the current data frame itself, you need to use inplace=True.
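If you prefer not to mutate in place, the equivalent is to assign the returned object back; a tiny sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'eyeColor': ['blue', None, 'green']})

# Without inplace=True, fillna returns a new Series; assign it back
df['eyeColor'] = df['eyeColor'].fillna('brown')
print(df['eyeColor'].tolist())  # ['blue', 'brown', 'green']
```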

We can do several operations specific to data types. For example, if you want to convert the values of a string column to uppercase, it’s pretty simple. The code for this is

df["gender"] = df["gender"].str.upper()

Output:

(Output: the same rows, with FEMALE/MALE in the gender column.)
You can observe that the gender values are now converted to uppercase

We can treat a data frame as a matrix, so we can do matrix operations too. For example, let’s say we want to create a new column with the ratio of friends count to age. To do this

df['friends_to_age'] = df['friends'] / df['age']

Output:

(Output: the same rows plus a new friends_to_age column, e.g. 127/33 ≈ 3.848485 in the first row.)

You can do all sorts of things with values, and you can write lambda functions to extract data from the values in a column. For example, the greeting message contains the number of unread messages. I don’t think the greeting has any significance except that count, so we can extract it into a new column named unread_messages_count
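The snippet that produced this column is not shown above; one possible way to do it, sketched on two sample greetings with a regex via str.extract (my assumption, not necessarily the original code):

```python
import pandas as pd

# Two sample greeting strings like the ones in the data
df = pd.DataFrame({'greeting': [
    'Hello, Britney Cote! You have 5 unread messages.',
    'Hello, Miranda Mathis! You have 3 unread messages.',
]})

# Capture the number before "unread" and convert it to int
# (regex extraction is an assumption; a lambda over the string works too)
df['unread_messages_count'] = (
    df['greeting'].str.extract(r'(\d+) unread', expand=False).astype(int)
)
print(df['unread_messages_count'].tolist())  # [5, 3]
```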

Output:

(Output: the same rows plus a new unread_messages_count column with values 5, 3, 6, 5, 6.)

Let’s say you want to extract pieces of a column’s value and put them in different columns; that is also possible. For example, let’s split name into first name and last name

def extract_first_last_name(row):
    name = row['name'].split()
    row['first_name'] = name[0]
    row['last_name'] = name[1]
    return row

df = df.apply(lambda row: extract_first_last_name(row), axis='columns')
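For a simple split like this, a vectorized alternative to apply is str.split with expand=True; a sketch with two sample names:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Britney Cote', 'Miranda Mathis']})

# expand=True returns one column per token; assign both at once
df[['first_name', 'last_name']] = df['name'].str.split(expand=True)
print(df['first_name'].tolist())  # ['Britney', 'Miranda']
```

This is usually faster than apply on large data frames, though it assumes every name has exactly two parts.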

Output:

(Output: the same rows plus new first_name and last_name columns, e.g. Britney and Cote in the first row.)

We did whatever data imputation we could, but for some columns we can’t do anything and they still don’t have values in some rows. There is no point in keeping them, because they are going to cause trouble during model training anyway, so we drop those columns and their data

df = df.dropna(axis='columns')

Output:

(Output: the same rows, minus the about column.)
You can observe here that the about column is gone.

Let’s drop some unnecessary columns

df = df.drop(['name','greeting'], axis='columns')

Dropping duplicate rows

To drop duplicate rows

unique_df = df.drop_duplicates()

If you want to see what unique_df has

unique_df.head(10)
(Output: 7 unique rows, indexed 0 to 6, with the remaining 13 columns.)
unique_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7 entries, 0 to 6
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   _id                    7 non-null      object 
 1   isActive               7 non-null      bool   
 2   balance                7 non-null      object 
 3   age                    7 non-null      int64  
 4   eyeColor               7 non-null      object 
 5   gender                 7 non-null      object 
 6   company                7 non-null      object 
 7   friends                7 non-null      int64  
 8   favoriteFruit          7 non-null      object 
 9   friends_to_age         7 non-null      float64
 10  unread_messages_count  7 non-null      object 
 11  first_name             7 non-null      object 
 12  last_name              7 non-null      object 
dtypes: bool(1), float64(1), int64(2), object(9)
memory usage: 735.0+ bytes

Now you can observe there are no duplicate rows

Renaming column names

unique_df = unique_df.rename(columns={"unread_messages_count": "unread_count","friends_to_age":"friends_by_age"})

Here we renamed the columns from unread_messages_count to unread_count and from friends_to_age to friends_by_age

unique_df.head()
(Output: the same rows, with the columns now named friends_by_age and unread_count.)

Data Filtering and sorting

We can apply all sorts of conditions to filter the data down to what we need. For example, to see the rows of users whose age is more than 30

unique_df.loc[unique_df['age'] > 30]

Output:

(Output: the 4 rows whose age is greater than 30, with ages 33, 35, 35, and 39.)

We can put multiple conditions on a single column too

unique_df.loc[unique_df['eyeColor'].isin(['brown','green'])]

Output:

(Output: the 3 rows whose eyeColor is brown or green.)

We can even use predicate functions specific to data types, like this

unique_df.loc[unique_df['eyeColor'].str.contains('b')]

Output:

(Output: the 6 rows whose eyeColor contains the letter b, i.e. blue or brown.)
The 4th row is gone because its eyeColor value is green, which doesn’t contain b

We can sort the data based on values of a column

unique_df.sort_values(by=['first_name'])

Output:

(Output: the rows sorted alphabetically by first_name: Britney, Carly, Chelsea, Kelly, Miranda, Patsy, Waters.)
Sorted (ORDER BY) on first name

Creating unique data set with specific fields

Sometimes we need to see only specific columns, and only unique values, and we can do that with group-by operations. We have duplicate rows in df; let’s say we want to know the age and balance per user

df.groupby('_id').agg({'age':'min', 'balance':'min'})

Output:

                          age    balance
_id
5e91936d0326151699779694   35  $3,610.18
5e91936d4eed9784f3515a46   24  $1,472.59
5e91936d617c639d56d6e40a   33  $1,814.13
5e91936dac51832738a8d241   35  $2,358.30
5e91936dbf8fbe6a1b99f888   39  $2,069.59
5e91936dc74a890c6f19d933   24  $2,457.23
5e91936df9d5b29dec2b1b63   20  $2,466.73

You can observe that age and balance in the header are at a different level from _id. To solve this problem, the reset_index function is here to help you

df.groupby('_id').agg({'age':'min', 'balance':'min'}).reset_index()
                        _id  age    balance
0  5e91936d0326151699779694   35  $3,610.18
1  5e91936d4eed9784f3515a46   24  $1,472.59
2  5e91936d617c639d56d6e40a   33  $1,814.13
3  5e91936dac51832738a8d241   35  $2,358.30
4  5e91936dbf8fbe6a1b99f888   39  $2,069.59
5  5e91936dc74a890c6f19d933   24  $2,457.23
6  5e91936df9d5b29dec2b1b63   20  $2,466.73

Merging 2 data frames

Sometimes we have data in a different table or a different CSV file. We put it in a separate data frame, and to merge it into another data frame we need to use the merge function.

To show how, let’s create a data frame that has _id and favourite food.

food_dict = [{"_id":"5e91936d0326151699779694", "favouriteFood":"Pasta"},
{"_id":"5e91936d4eed9784f3515a46", "favouriteFood":"Hamburgers"},
{"_id":"5e91936d617c639d56d6e40a", "favouriteFood":"Pizza"},
{"_id":"5e91936dac51832738a8d241", "favouriteFood":"Vegetables"},
{"_id":"5e91936dbf8fbe6a1b99f888", "favouriteFood":"Chicken"},
{"_id":"5e91936dc74a890c6f19d933", "favouriteFood":"Pasta"}
]
df_food = pd.DataFrame(food_dict)
df_food.head()
                        _id favouriteFood
0  5e91936d0326151699779694         Pasta
1  5e91936d4eed9784f3515a46    Hamburgers
2  5e91936d617c639d56d6e40a         Pizza
3  5e91936dac51832738a8d241    Vegetables
4  5e91936dbf8fbe6a1b99f888       Chicken

To merge df_food and unique_df let’s do an inner join

unique_df = unique_df.merge(df_food, on="_id", how = 'inner')
unique_df.head(10)
(Output: 6 rows with a new favouriteFood column. The row for Waters Farley is gone because his _id is not in df_food, and an inner join keeps only matching rows.)
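Since an inner join drops any row whose _id has no match in df_food, you can pass how='left' instead to keep unmatched rows (with NaN for the missing food); a small sketch with toy ids:

```python
import pandas as pd

left = pd.DataFrame({'_id': ['a', 'b', 'c'], 'age': [33, 35, 20]})
right = pd.DataFrame({'_id': ['a', 'b'], 'favouriteFood': ['Pizza', 'Pasta']})

inner = left.merge(right, on='_id', how='inner')      # 'c' is dropped
left_join = left.merge(right, on='_id', how='left')   # 'c' kept, food is NaN

print(len(inner), len(left_join))  # 2 3
```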

These are the snippets that helped me in my project. I hope the code and examples above help you too. This is my cheat sheet. I am putting all these code snippets in a notebook in my GitHub repository. You can find it here.

Peace. Happy Coding.
