
Recently I worked most of the time on the data creation part for our model creation at that time I used pandas a lot. I used some functions a lot to complete my task. Let me share them with you I hope these will help you in your Data preparation journey.
You came here to read this article means you already know a bit about pandas and why so many people use pandas for working on Data. So I think I don’t need to give any intro about pandas. But if you ask me to tell me your opinion on pandas in one sentence then anytime I will say this
pandas is a great software library/package for data manipulation and analysis.
Don’t forget to install the pandas package. To install the pandas package run below command
pip3 install pandas
You need to import pandas to use so let’s import pandas first
import pandas as pd
We added as pd to import statement of pandas because of that you can use pandas superpowers by simply typing pd instead of typing pandas.
Reading JSON data
Most of the time we don’t have all the data we want. So we do scrapping and store data mostly in JSON. You can load it easily using pandas like this
df = pd.read_json('data.json')
Here data.json looks like this
[
{
"_id": "5e91936d617c639d56d6e40a",
"index": 0,
"guid": "7fe17eae-c86f-4b99-896a-018636d54f59",
"isActive": true,
"balance": "$1,814.13",
"picture": "http://placehold.it/32x32",
"age": 33,
"eyeColor": "blue",
"name": "Britney Cote",
"gender": "female",
"company": "PANZENT",
"email": "britneycote@panzent.com",
"phone": "+1 (907) 524-3223",
"address": "266 Church Lane, Swartzville, Massachusetts, 7719",
"about": "Dolore aliquip sint ut proident. Ullamco sit eiusmod laborum ullamco sint adipisicing culpa eu laborum adipisicing nisi consequat non culpa. Dolor est fugiat ad dolore minim irure quis. Minim nulla sit ut elit eiusmod dolor.\r\n",
"registered": "2018-07-24T02:51:54 -06:-30",
"latitude": 35.545005,
"longitude": -10.315414,
"tags": [
"enim",
"dolor",
"fugiat",
"velit",
"excepteur",
"culpa",
"nisi"
],
"friends": 127,
"greeting": "Hello, Britney Cote! You have 5 unread messages.",
"favoriteFruit": "apple"
},
......
]
Some times we don’t store the JSON data in the proper JSON format. Let’s say we stored like this
{"_id": "5e91991ba38b52e0fb47bb7f","index": 0,"guid": "554903ee-525d-47cb-b09d-bf3cc15aec3d"}
{"_id": "5e91991bfc6a941b23db09b0","index": 1,"guid": "9bdd26f6-d14c-4ab8-bcd3-d42b27599fbd"}
{"_id": "5e91991bfc4766a4b3ea78ea","index": 2,"guid": "a3fee0e1-7b15-4f8b-8901-eaef3ea466c8"}
Then below code will save you
df=pd.read_json('data_not_in_proper_format.json', lines=True)
Here it takes line by line and treats it as one JSON object and loads.
Data Display
If you want to see the data loaded then run below code
df.head()
index | _id | index | guid | isActive | balance | picture | age | eyeColor | name | gender | … | phone | address | about | registered | latitude | longitude | tags | friends | greeting | favoriteFruit |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5e91936d617c639d56d6e40a | 0 | 7fe17eae-c86f-4b99-896a-018636d54f59 | True | $1,814.13 | http://placehold.it/32×32 | 33 | blue | Britney Cote | female | … | +1 (907) 524-3223 | 266 Church Lane, Swartzville, Massachusetts, 7719 | Dolore aliquip sint ut proident. Ullamco sit e… | 2018-07-24T02:51:54 -06:-30 | 35.545005 | -10.315414 | [enim, dolor, fugiat, velit, excepteur, culpa,… | 127 | Hello, Britney Cote! You have 5 unread messages. | apple |
1 | 5e91936d0326151699779694 | 1 | b8c92f7e-ca91-4b07-9e49-956ce1f6bebf | False | $3,610.18 | http://placehold.it/32×32 | 35 | brown | Miranda Mathis | male | … | +1 (952) 533-3545 | 271 Underhill Avenue, Watchtower, Palau, 7558 | Ut ea dolor cupidatat laboris nulla in anim ex… | 2014-03-27T08:55:19 -06:-30 | 16.061777 | 62.760309 | [occaecat, veniam, elit, consequat, ea, eiusmo… | 235 | Hello, Miranda Mathis! You have 3 unread messa… | apple |
2 | 5e91936dac51832738a8d241 | 2 | 71627ba1-c0bf-4582-9e16-11a5fa2b7889 | False | $2,358.30 | http://placehold.it/32×32 | 35 | blue | Carly Rodgers | female | … | +1 (974) 452-3699 | 116 Forbell Street, Kilbourne, American Samoa,… | Culpa tempor voluptate excepteur adipisicing i… | 2017-07-26T02:04:48 -06:-30 | 10.495423 | 139.383287 | [est, sunt, sunt, laborum, sit, pariatur, nisi] | 298 | Hello, Carly Rodgers! You have 6 unread messages. | apple |
3 | 5e91936dc74a890c6f19d933 | 3 | f24ce2af-1e61-47e4-93bb-d8024eb8e581 | True | $2,457.23 | http://placehold.it/32×32 | 24 | green | Patsy Mckee | female | … | +1 (800) 570-2994 | 275 Railroad Avenue, Cloverdale, Vermont, 7149 | Non duis incididunt laboris tempor aute aliqui… | 2015-03-05T07:38:38 -06:-30 | 89.744027 | -120.626924 | [veniam, sit, nulla, duis, elit, sunt, consect… | 191 | Hello, Patsy Mckee! You have 5 unread messages. | apple |
4 | 5e91936df9d5b29dec2b1b63 | 4 | 013e2904-9e7c-43fe-8bc4-2afc9b8f270c | False | $2,466.73 | http://placehold.it/32×32 | 20 | blue | Waters Farley | male | … | +1 (921) 580-2595 | 148 Village Road, Dyckesville, Minnesota, 2327 | Aute nulla aute proident ullamco in laborum. A… | 2017-01-05T01:40:22 -06:-30 | -78.364669 | 63.065719 | [laborum, ipsum, nulla, anim, laborum, minim, ex] | 253 | Hello, Waters Farley! You have 6 unread messages. | apple |
Here it is displaying the first 5 rows data. If you want to see only one row data then you need to run this
df.head(1)
If you want to see the last 2 rows data then
df.tail(2)
If there are so many columns like above some of the columns won’t be shown there you will see “…” only. To see all columns below code will come handy
pd.set_option('display.max_columns', None)
At the time of display just like columns displaying many rows also have restrictions If you want to see all rows then below code is here to save you
pd.set_option('display.max_rows', None)
Reading data from MySQL
If you need to read data from the MySQL database and load it in the dataframe then
import mysql.connector
mydb = mysql.connector.connect(
host="localhost",
user="root",
passwd="root",
database="test"
)
First, you need to create a MySQL connection for that we will provide database configuration details.
I use mysql-connector-python package usually. To create a MySQL connection you need to run above code to run it first you need to install this package for that you need to run below command
pip3 install mysql-connector-python
Once the connection is created you just need to provide query rest will be handled by pandas for you
df = pd.read_sql("SELECT * FROM table_name", mydb)
Reading CSV Data
To read data from CSV file
df=pd.read_csv('data.csv', sep=';', header=0)
By default it takes Comma as separator if you want pandas to consider some other separator then you need to mention like this sep=’;’ which you can find in above code.
header=0 means you have the names of columns in the first row in the file
Writing Data to CSV file
If you want to write the data we have in the data frame to a CSV file then
df.to_csv('data.csv', sep=';', index=False)
By default whenever pandas write data from data frame to CSV file they put row index too. To avoid this index=False is required.
Understanding the Data
To understand what columns have and their types and which have values in every record and which doesn’t
df.info()
You will get output like this
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 _id 19 non-null object
1 index 19 non-null int64
2 guid 19 non-null object
3 isActive 19 non-null bool
4 balance 19 non-null object
5 picture 19 non-null object
6 age 19 non-null int64
7 eyeColor 16 non-null object
8 name 19 non-null object
9 gender 19 non-null object
10 company 19 non-null object
11 email 19 non-null object
12 phone 19 non-null object
13 address 19 non-null object
14 registered 19 non-null object
15 latitude 19 non-null float64
16 longitude 19 non-null float64
17 tags 19 non-null object
18 friends 19 non-null int64
19 greeting 19 non-null object
20 favoriteFruit 19 non-null object
21 about 17 non-null object
dtypes: bool(1), float64(2), int64(3), object(16)
memory usage: 3.3+ KB
You want to have only column names nothing other than that then this small code is there to help you
for column in df.columns:
print(column)
To know unique values in a column
df['eyeColor'].unique()
Output:
array(['blue', nan, 'green', 'brown'], dtype=object)
To get counts of unique values of a column (Like Group By)
df['eyeColor'].value_counts()
Output:
blue 10
green 3
brown 3
Name: eyeColor, dtype: int64
We have 19 rows but if we total above counts then we will get 16 because by default it doesn’t count None values. To get None(or NaN) count also you need to add dropna=False
df['eyeColor'].value_counts(dropna=False)
Output:
blue 10
green 3
brown 3
NaN 3
Name: eyeColor, dtype: int64
To drop unnecessary columns
df = df.drop(['index','guid','picture','email','phone','latitude','longitude','address','registered','tags'], axis='columns')
Whenever we handle operation on column level then axis=’columns’ or axis=1 need to be added
Now the data frame looks like this
index | _id | isActive | balance | age | eyeColor | name | gender | company | friends | greeting | favoriteFruit | about |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5e91936d617c639d56d6e40a | True | $1,814.13 | 33 | blue | Britney Cote | female | PANZENT | 127 | Hello, Britney Cote! You have 5 unread messages. | apple | NaN |
1 | 5e91936d0326151699779694 | False | $3,610.18 | 35 | NaN | Miranda Mathis | male | HIVEDOM | 235 | Hello, Miranda Mathis! You have 3 unread messa… | apple | Ut ea dolor cupidatat laboris nulla in anim ex… |
2 | 5e91936dac51832738a8d241 | False | $2,358.30 | 35 | blue | Carly Rodgers | female | BUZZNESS | 298 | Hello, Carly Rodgers! You have 6 unread messages. | apple | Culpa tempor voluptate excepteur adipisicing i… |
3 | 5e91936dc74a890c6f19d933 | True | $2,457.23 | 24 | green | Patsy Mckee | female | VISALIA | 191 | Hello, Patsy Mckee! You have 5 unread messages. | apple | Non duis incididunt laboris tempor aute aliqui… |
4 | 5e91936df9d5b29dec2b1b63 | False | $2,466.73 | 20 | blue | Waters Farley | male | RODEOLOGY | 253 | Hello, Waters Farley! You have 6 unread messages. | apple | Aute nulla aute proident ullamco in laborum. A… |
Data imputation
Data imputation plays a major role to prepare data for the model. These code snippets helped me a lot in data imputation journey
To change the value of a column based on condition
df.loc[df['_id']=='5e91936d617c639d56d6e40a',['isActive']] = False
Output:
index | _id | isActive | balance | age | eyeColor | name | gender | company | friends | greeting | favoriteFruit | about |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5e91936d617c639d56d6e40a | False | $1,814.13 | 33 | blue | Britney Cote | female | PANZENT | 127 | Hello, Britney Cote! You have 5 unread messages. | apple | NaN |
1 | 5e91936d0326151699779694 | False | $3,610.18 | 35 | NaN | Miranda Mathis | male | HIVEDOM | 235 | Hello, Miranda Mathis! You have 3 unread messa… | apple | Ut ea dolor cupidatat laboris nulla in anim ex… |
2 | 5e91936dac51832738a8d241 | False | $2,358.30 | 35 | blue | Carly Rodgers | female | BUZZNESS | 298 | Hello, Carly Rodgers! You have 6 unread messages. | apple | Culpa tempor voluptate excepteur adipisicing i… |
3 | 5e91936dc74a890c6f19d933 | True | $2,457.23 | 24 | green | Patsy Mckee | female | VISALIA | 191 | Hello, Patsy Mckee! You have 5 unread messages. | apple | Non duis incididunt laboris tempor aute aliqui… |
4 | 5e91936df9d5b29dec2b1b63 | False | $2,466.73 | 20 | blue | Waters Farley | male | RODEOLOGY | 253 | Hello, Waters Farley! You have 6 unread messages. | apple | Aute nulla aute proident ullamco in laborum. A… |
Sometimes we get None values there we usually fill with data which is kind of normal. For example, most of the people have brown coloured eyes and some records have None in the eyeColor column. To fill it
df['eyeColor'].fillna('brown', inplace=True)
Output:
index | _id | isActive | balance | age | eyeColor | name | gender | company | friends | greeting | favoriteFruit | about |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5e91936d617c639d56d6e40a | False | $1,814.13 | 33 | blue | Britney Cote | female | PANZENT | 127 | Hello, Britney Cote! You have 5 unread messages. | apple | NaN |
1 | 5e91936d0326151699779694 | False | $3,610.18 | 35 | brown | Miranda Mathis | male | HIVEDOM | 235 | Hello, Miranda Mathis! You have 3 unread messa… | apple | Ut ea dolor cupidatat laboris nulla in anim ex… |
2 | 5e91936dac51832738a8d241 | False | $2,358.30 | 35 | blue | Carly Rodgers | female | BUZZNESS | 298 | Hello, Carly Rodgers! You have 6 unread messages. | apple | Culpa tempor voluptate excepteur adipisicing i… |
3 | 5e91936dc74a890c6f19d933 | True | $2,457.23 | 24 | green | Patsy Mckee | female | VISALIA | 191 | Hello, Patsy Mckee! You have 5 unread messages. | apple | Non duis incididunt laboris tempor aute aliqui… |
4 | 5e91936df9d5b29dec2b1b63 | False | $2,466.73 | 20 | blue | Waters Farley | male | RODEOLOGY | 253 | Hello, Waters Farley! You have 6 unread messages. | apple | Aute nulla aute proident ullamco in laborum. A… |
We used inplace=True extra here. Usually, whatever operations we do in data frame it returns a new data frame. Instead of that if you want to do change in current data frame only then we need to use inplace=True
We can do several operations specific to data types. For example, If you want to convert values to uppercase of a column which has string values then pretty simple. Code for this is
df["gender"] = df["gender"].str.upper()
Output:
index | _id | isActive | balance | age | eyeColor | name | gender | company | friends | greeting | favoriteFruit | about |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5e91936d617c639d56d6e40a | False | $1,814.13 | 33 | blue | Britney Cote | FEMALE | PANZENT | 127 | Hello, Britney Cote! You have 5 unread messages. | apple | NaN |
1 | 5e91936d0326151699779694 | False | $3,610.18 | 35 | brown | Miranda Mathis | MALE | HIVEDOM | 235 | Hello, Miranda Mathis! You have 3 unread messa… | apple | Ut ea dolor cupidatat laboris nulla in anim ex… |
2 | 5e91936dac51832738a8d241 | False | $2,358.30 | 35 | blue | Carly Rodgers | FEMALE | BUZZNESS | 298 | Hello, Carly Rodgers! You have 6 unread messages. | apple | Culpa tempor voluptate excepteur adipisicing i… |
3 | 5e91936dc74a890c6f19d933 | True | $2,457.23 | 24 | green | Patsy Mckee | FEMALE | VISALIA | 191 | Hello, Patsy Mckee! You have 5 unread messages. | apple | Non duis incididunt laboris tempor aute aliqui… |
4 | 5e91936df9d5b29dec2b1b63 | False | $2,466.73 | 20 | blue | Waters Farley | MALE | RODEOLOGY | 253 | Hello, Waters Farley! You have 6 unread messages. | apple | Aute nulla aute proident ullamco in laborum. A… |
We can consider a data frame as a matrix so we can do matrix operations too. For example, let’s say we want to create a new column where we want to calculate friends count with age. To do this
df['friends_to_age']=df['friends']/df['age']
Output:
index | _id | isActive | balance | age | eyeColor | name | gender | company | friends | greeting | favoriteFruit | about | friends_to_age |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5e91936d617c639d56d6e40a | False | $1,814.13 | 33 | blue | Britney Cote | FEMALE | PANZENT | 127 | Hello, Britney Cote! You have 5 unread messages. | apple | NaN | 3.848485 |
1 | 5e91936d0326151699779694 | False | $3,610.18 | 35 | brown | Miranda Mathis | MALE | HIVEDOM | 235 | Hello, Miranda Mathis! You have 3 unread messa… | apple | Ut ea dolor cupidatat laboris nulla in anim ex… | 6.714286 |
2 | 5e91936dac51832738a8d241 | False | $2,358.30 | 35 | blue | Carly Rodgers | FEMALE | BUZZNESS | 298 | Hello, Carly Rodgers! You have 6 unread messages. | apple | Culpa tempor voluptate excepteur adipisicing i… | 8.514286 |
3 | 5e91936dc74a890c6f19d933 | True | $2,457.23 | 24 | green | Patsy Mckee | FEMALE | VISALIA | 191 | Hello, Patsy Mckee! You have 5 unread messages. | apple | Non duis incididunt laboris tempor aute aliqui… | 7.958333 |
4 | 5e91936df9d5b29dec2b1b63 | False | $2,466.73 | 20 | blue | Waters Farley | MALE | RODEOLOGY | 253 | Hello, Waters Farley! You have 6 unread messages. | apple | Aute nulla aute proident ullamco in laborum. A… | 12.650000 |
You can do all sorts of things on values and write lambda functions too to extract data from values present in a column. For example, If you read the greeting message it contains the number of unread messages count. I don’t think greeting message has any significance except the unread messages count so we can extract and put it in a new column with name unread_messages_count
Output:
index | _id | isActive | balance | age | eyeColor | name | gender | company | friends | greeting | favoriteFruit | about | friends_to_age | unread_messages_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5e91936d617c639d56d6e40a | False | $1,814.13 | 33 | blue | Britney Cote | FEMALE | PANZENT | 127 | Hello, Britney Cote! You have 5 unread messages. | apple | NaN | 3.848485 | 5 |
1 | 5e91936d0326151699779694 | False | $3,610.18 | 35 | brown | Miranda Mathis | MALE | HIVEDOM | 235 | Hello, Miranda Mathis! You have 3 unread messa… | apple | Ut ea dolor cupidatat laboris nulla in anim ex… | 6.714286 | 3 |
2 | 5e91936dac51832738a8d241 | False | $2,358.30 | 35 | blue | Carly Rodgers | FEMALE | BUZZNESS | 298 | Hello, Carly Rodgers! You have 6 unread messages. | apple | Culpa tempor voluptate excepteur adipisicing i… | 8.514286 | 6 |
3 | 5e91936dc74a890c6f19d933 | True | $2,457.23 | 24 | green | Patsy Mckee | FEMALE | VISALIA | 191 | Hello, Patsy Mckee! You have 5 unread messages. | apple | Non duis incididunt laboris tempor aute aliqui… | 7.958333 | 5 |
4 | 5e91936df9d5b29dec2b1b63 | False | $2,466.73 | 20 | blue | Waters Farley | MALE | RODEOLOGY | 253 | Hello, Waters Farley! You have 6 unread messages. | apple | Aute nulla aute proident ullamco in laborum. A… | 12.650000 | 6 |
Let’s say you want to extract what you want from a column value and put it in different columns that also possible. For example, Let’s split name into first name and last name
def extract_first_last_name(row):
name = row['name'].split()
row['first_name'] = name[0]
row['last_name'] = name[1]
return row
df = df.apply(lambda row: extract_first_last_name(row), axis='columns')
Output:
index | _id | isActive | balance | age | eyeColor | name | gender | company | friends | greeting | favoriteFruit | about | friends_to_age | unread_messages_count | first_name | last_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5e91936d617c639d56d6e40a | False | $1,814.13 | 33 | blue | Britney Cote | FEMALE | PANZENT | 127 | Hello, Britney Cote! You have 5 unread messages. | apple | NaN | 3.848485 | 5 | Britney | Cote |
1 | 5e91936d0326151699779694 | False | $3,610.18 | 35 | brown | Miranda Mathis | MALE | HIVEDOM | 235 | Hello, Miranda Mathis! You have 3 unread messa… | apple | Ut ea dolor cupidatat laboris nulla in anim ex… | 6.714286 | 3 | Miranda | Mathis |
2 | 5e91936dac51832738a8d241 | False | $2,358.30 | 35 | blue | Carly Rodgers | FEMALE | BUZZNESS | 298 | Hello, Carly Rodgers! You have 6 unread messages. | apple | Culpa tempor voluptate excepteur adipisicing i… | 8.514286 | 6 | Carly | Rodgers |
3 | 5e91936dc74a890c6f19d933 | True | $2,457.23 | 24 | green | Patsy Mckee | FEMALE | VISALIA | 191 | Hello, Patsy Mckee! You have 5 unread messages. | apple | Non duis incididunt laboris tempor aute aliqui… | 7.958333 | 5 | Patsy | Mckee |
4 | 5e91936df9d5b29dec2b1b63 | False | $2,466.73 | 20 | blue | Waters Farley | MALE | RODEOLOGY | 253 | Hello, Waters Farley! You have 6 unread messages. | apple | Aute nulla aute proident ullamco in laborum. A… | 12.650000 | 6 | Waters | Farley |
We did whatever data imputation we need to do and for some columns, we can’t do anything and they don’t have values in some rows and there is no point of them to keep because it’s going to give anyway trouble at the time of model training so we want to drop those columns and it’s data then
df = df.dropna(axis='columns')
Output:
index | _id | isActive | balance | age | eyeColor | name | gender | company | friends | greeting | favoriteFruit | friends_to_age | unread_messages_count | first_name | last_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5e91936d617c639d56d6e40a | False | $1,814.13 | 33 | blue | Britney Cote | FEMALE | PANZENT | 127 | Hello, Britney Cote! You have 5 unread messages. | apple | 3.848485 | 5 | Britney | Cote |
1 | 5e91936d0326151699779694 | False | $3,610.18 | 35 | brown | Miranda Mathis | MALE | HIVEDOM | 235 | Hello, Miranda Mathis! You have 3 unread messa… | apple | 6.714286 | 3 | Miranda | Mathis |
2 | 5e91936dac51832738a8d241 | False | $2,358.30 | 35 | blue | Carly Rodgers | FEMALE | BUZZNESS | 298 | Hello, Carly Rodgers! You have 6 unread messages. | apple | 8.514286 | 6 | Carly | Rodgers |
3 | 5e91936dc74a890c6f19d933 | True | $2,457.23 | 24 | green | Patsy Mckee | FEMALE | VISALIA | 191 | Hello, Patsy Mckee! You have 5 unread messages. | apple | 7.958333 | 5 | Patsy | Mckee |
4 | 5e91936df9d5b29dec2b1b63 | False | $2,466.73 | 20 | blue | Waters Farley | MALE | RODEOLOGY | 253 | Hello, Waters Farley! You have 6 unread messages. | apple | 12.650000 | 6 | Waters | Farley |
Let’s drop some unnecessary columns
df = df.drop(['name','greeting'], axis='columns')
Dropping duplicate rows
To drop duplicate rows
unique_df = df.drop_duplicates()
If you want see what unique_df have
unique_df.head(10)
index | _id | isActive | balance | age | eyeColor | gender | company | friends | favoriteFruit | friends_to_age | unread_messages_count | first_name | last_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5e91936d617c639d56d6e40a | False | $1,814.13 | 33 | blue | FEMALE | PANZENT | 127 | apple | 3.848485 | 5 | Britney | Cote |
1 | 5e91936d0326151699779694 | False | $3,610.18 | 35 | brown | MALE | HIVEDOM | 235 | apple | 6.714286 | 3 | Miranda | Mathis |
2 | 5e91936dac51832738a8d241 | False | $2,358.30 | 35 | blue | FEMALE | BUZZNESS | 298 | apple | 8.514286 | 6 | Carly | Rodgers |
3 | 5e91936dc74a890c6f19d933 | True | $2,457.23 | 24 | green | FEMALE | VISALIA | 191 | apple | 7.958333 | 5 | Patsy | Mckee |
4 | 5e91936df9d5b29dec2b1b63 | False | $2,466.73 | 20 | blue | MALE | RODEOLOGY | 253 | apple | 12.650000 | 6 | Waters | Farley |
5 | 5e91936dbf8fbe6a1b99f888 | False | $2,069.59 | 39 | brown | FEMALE | QUAILCOM | 123 | banana | 3.153846 | 7 | Kelly | Monroe |
6 | 5e91936d4eed9784f3515a46 | False | $1,472.59 | 24 | blue | FEMALE | ENTROPIX | 167 | apple | 6.958333 | 8 | Chelsea | Gregory |
unique_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7 entries, 0 to 6
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 _id 7 non-null object
1 isActive 7 non-null bool
2 balance 7 non-null object
3 age 7 non-null int64
4 eyeColor 7 non-null object
5 gender 7 non-null object
6 company 7 non-null object
7 friends 7 non-null int64
8 favoriteFruit 7 non-null object
9 friends_to_age 7 non-null float64
10 unread_messages_count 7 non-null object
11 first_name 7 non-null object
12 last_name 7 non-null object
dtypes: bool(1), float64(1), int64(2), object(9)
memory usage: 735.0+ bytes
Now you can observe there are no duplicate rows
Renaming column names
unique_df = unique_df.rename(columns={"unread_messages_count": "unread_count","friends_to_age":"friends_by_age"})
Here we renamed column names from unread_messages_count to unread_count and from friends_to_age to friends_by_age
unique_df.head()
index | _id | isActive | balance | age | eyeColor | gender | company | friends | favoriteFruit | friends_by_age | unread_count | first_name | last_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5e91936d617c639d56d6e40a | False | $1,814.13 | 33 | blue | FEMALE | PANZENT | 127 | apple | 3.848485 | 5 | Britney | Cote |
1 | 5e91936d0326151699779694 | False | $3,610.18 | 35 | brown | MALE | HIVEDOM | 235 | apple | 6.714286 | 3 | Miranda | Mathis |
2 | 5e91936dac51832738a8d241 | False | $2,358.30 | 35 | blue | FEMALE | BUZZNESS | 298 | apple | 8.514286 | 6 | Carly | Rodgers |
3 | 5e91936dc74a890c6f19d933 | True | $2,457.23 | 24 | green | FEMALE | VISALIA | 191 | apple | 7.958333 | 5 | Patsy | Mckee |
4 | 5e91936df9d5b29dec2b1b63 | False | $2,466.73 | 20 | blue | MALE | RODEOLOGY | 253 | apple | 12.650000 | 6 | Waters | Farley |
Data Filtering and sorting
We can do all sorts of conditions and filter the data what we need too. For example, you want to see the rows which users have age more than 30
unique_df.loc[unique_df['age'] > 30]
Output:
index | _id | isActive | balance | age | eyeColor | gender | company | friends | favoriteFruit | friends_by_age | unread_count | first_name | last_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5e91936d617c639d56d6e40a | False | $1,814.13 | 33 | blue | FEMALE | PANZENT | 127 | apple | 3.848485 | 5 | Britney | Cote |
1 | 5e91936d0326151699779694 | False | $3,610.18 | 35 | brown | MALE | HIVEDOM | 235 | apple | 6.714286 | 3 | Miranda | Mathis |
2 | 5e91936dac51832738a8d241 | False | $2,358.30 | 35 | blue | FEMALE | BUZZNESS | 298 | apple | 8.514286 | 6 | Carly | Rodgers |
5 | 5e91936dbf8fbe6a1b99f888 | False | $2,069.59 | 39 | brown | FEMALE | QUAILCOM | 123 | banana | 3.153846 | 7 | Kelly | Monroe |
We can put multiple conditions on a single column too
unique_df.loc[unique_df['eyeColor'].isin(['brown','green'])]
Output:
index | _id | isActive | balance | age | eyeColor | gender | company | friends | favoriteFruit | friends_by_age | unread_count | first_name | last_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 5e91936d0326151699779694 | False | $3,610.18 | 35 | brown | MALE | HIVEDOM | 235 | apple | 6.714286 | 3 | Miranda | Mathis |
3 | 5e91936dc74a890c6f19d933 | True | $2,457.23 | 24 | green | FEMALE | VISALIA | 191 | apple | 7.958333 | 5 | Patsy | Mckee |
5 | 5e91936dbf8fbe6a1b99f888 | False | $2,069.59 | 39 | brown | FEMALE | QUAILCOM | 123 | banana | 3.153846 | 7 | Kelly | Monroe |
Even we can use predicate functions related to data types like this
unique_df.loc[unique_df['eyeColor'].str.contains('b')]
Output:
index | _id | isActive | balance | age | eyeColor | gender | company | friends | favoriteFruit | friends_by_age | unread_count | first_name | last_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5e91936d617c639d56d6e40a | False | $1,814.13 | 33 | blue | FEMALE | PANZENT | 127 | apple | 3.848485 | 5 | Britney | Cote |
1 | 5e91936d0326151699779694 | False | $3,610.18 | 35 | brown | MALE | HIVEDOM | 235 | apple | 6.714286 | 3 | Miranda | Mathis |
2 | 5e91936dac51832738a8d241 | False | $2,358.30 | 35 | blue | FEMALE | BUZZNESS | 298 | apple | 8.514286 | 6 | Carly | Rodgers |
4 | 5e91936df9d5b29dec2b1b63 | False | $2,466.73 | 20 | blue | MALE | RODEOLOGY | 253 | apple | 12.650000 | 6 | Waters | Farley |
5 | 5e91936dbf8fbe6a1b99f888 | False | $2,069.59 | 39 | brown | FEMALE | QUAILCOM | 123 | banana | 3.153846 | 7 | Kelly | Monroe |
6 | 5e91936d4eed9784f3515a46 | False | $1,472.59 | 24 | blue | FEMALE | ENTROPIX | 167 | apple | 6.958333 | 8 | Chelsea | Gregory |
We can sort the data based on values of a column
unique_df.sort_values(by=['first_name'])
Output:
index | _id | isActive | balance | age | eyeColor | gender | company | friends | favoriteFruit | friends_by_age | unread_count | first_name | last_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5e91936d617c639d56d6e40a | False | $1,814.13 | 33 | blue | FEMALE | PANZENT | 127 | apple | 3.848485 | 5 | Britney | Cote |
2 | 5e91936dac51832738a8d241 | False | $2,358.30 | 35 | blue | FEMALE | BUZZNESS | 298 | apple | 8.514286 | 6 | Carly | Rodgers |
6 | 5e91936d4eed9784f3515a46 | False | $1,472.59 | 24 | blue | FEMALE | ENTROPIX | 167 | apple | 6.958333 | 8 | Chelsea | Gregory |
5 | 5e91936dbf8fbe6a1b99f888 | False | $2,069.59 | 39 | brown | FEMALE | QUAILCOM | 123 | banana | 3.153846 | 7 | Kelly | Monroe |
1 | 5e91936d0326151699779694 | False | $3,610.18 | 35 | brown | MALE | HIVEDOM | 235 | apple | 6.714286 | 3 | Miranda | Mathis |
3 | 5e91936dc74a890c6f19d933 | True | $2,457.23 | 24 | green | FEMALE | VISALIA | 191 | apple | 7.958333 | 5 | Patsy | Mckee |
4 | 5e91936df9d5b29dec2b1b63 | False | $2,466.73 | 20 | blue | MALE | RODEOLOGY | 253 | apple | 12.650000 | 6 | Waters | Farley |
Creating unique data set with specific fields
Some times we need to see specific columns data only and those too unique ones so that we can do these stuff using group by operations.We have duplicate values in df and let’s say we want to know age and balance per user
df.groupby('_id').agg({'age':'min', 'balance':'min'})
Output:
age | balance | |
---|---|---|
_id | ||
5e91936d0326151699779694 | 35 | $3,610.18 |
5e91936d4eed9784f3515a46 | 24 | $1,472.59 |
5e91936d617c639d56d6e40a | 33 | $1,814.13 |
5e91936dac51832738a8d241 | 35 | $2,358.30 |
5e91936dbf8fbe6a1b99f888 | 39 | $2,069.59 |
5e91936dc74a890c6f19d933 | 24 | $2,457.23 |
5e91936df9d5b29dec2b1b63 | 20 | $2,466.73 |
You can observe that age and balance in the header at a different level. To solve this problem reset_index function is there to help you
df.groupby('_id').agg({'age':'min', 'balance':'min'}).reset_index()
index | _id | age | balance |
---|---|---|---|
0 | 5e91936d0326151699779694 | 35 | $3,610.18 |
1 | 5e91936d4eed9784f3515a46 | 24 | $1,472.59 |
2 | 5e91936d617c639d56d6e40a | 33 | $1,814.13 |
3 | 5e91936dac51832738a8d241 | 35 | $2,358.30 |
4 | 5e91936dbf8fbe6a1b99f888 | 39 | $2,069.59 |
5 | 5e91936dc74a890c6f19d933 | 24 | $2,457.23 |
6 | 5e91936df9d5b29dec2b1b63 | 20 | $2,466.73 |
Merging 2 data frames
Some times we have data in a different table or different CSV. So we put that in a different data frame, to merge this to another data frame we need to use merge function.
To explain how we can do let’s create a data frame. It has _id and favourite food.
food_dict = [{"_id":"5e91936d0326151699779694", "favouriteFood":"Pasta"},
{"_id":"5e91936d4eed9784f3515a46", "favouriteFood":"Hamburgers"},
{"_id":"5e91936d617c639d56d6e40a", "favouriteFood":"Pizza"},
{"_id":"5e91936dac51832738a8d241", "favouriteFood":"Vegetables"},
{"_id":"5e91936dbf8fbe6a1b99f888", "favouriteFood":"Chicken"},
{"_id":"5e91936dc74a890c6f19d933", "favouriteFood":"Pasta"}
]
df_food = pd.DataFrame(food_dict)
df_food.head()
index | _id | favouriteFood |
---|---|---|
0 | 5e91936d0326151699779694 | Pasta |
1 | 5e91936d4eed9784f3515a46 | Hamburgers |
2 | 5e91936d617c639d56d6e40a | Pizza |
3 | 5e91936dac51832738a8d241 | Vegetables |
4 | 5e91936dbf8fbe6a1b99f888 | Chicken |
To merge df_food and unique_df let’s do an inner join
unique_df = unique_df.merge(df_food, on="_id", how = 'inner')
unique_df.head(10)
index | _id | isActive | balance | age | eyeColor | gender | company | friends | favoriteFruit | friends_by_age | unread_count | first_name | last_name | favouriteFood |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5e91936d617c639d56d6e40a | False | $1,814.13 | 33 | blue | FEMALE | PANZENT | 127 | apple | 3.848485 | 5 | Britney | Cote | Pizza |
1 | 5e91936d0326151699779694 | False | $3,610.18 | 35 | brown | MALE | HIVEDOM | 235 | apple | 6.714286 | 3 | Miranda | Mathis | Pasta |
2 | 5e91936dac51832738a8d241 | False | $2,358.30 | 35 | blue | FEMALE | BUZZNESS | 298 | apple | 8.514286 | 6 | Carly | Rodgers | Vegetables |
3 | 5e91936dc74a890c6f19d933 | True | $2,457.23 | 24 | green | FEMALE | VISALIA | 191 | apple | 7.958333 | 5 | Patsy | Mckee | Pasta |
4 | 5e91936dbf8fbe6a1b99f888 | False | $2,069.59 | 39 | brown | FEMALE | QUAILCOM | 123 | banana | 3.153846 | 7 | Kelly | Monroe | Chicken |
5 | 5e91936d4eed9784f3515a46 | False | $1,472.59 | 24 | blue | FEMALE | ENTROPIX | 167 | apple | 6.958333 | 8 | Chelsea | Gregory | Hamburgers |
These are the ones helped me in my project. I hope the above code snippets and examples help you too. This is my cheat sheet. I am putting all these code snippets in a notebook and putting into my GitHub Repository. You can find it here.
Peace. Happy Coding.