Handling missing values is not an easy task. We usually think of the typical cases: NaN in a DataFrame or NULL in a database. But what about values like the following in a dataset?
A column encoded with 'Nill', '-', 'Empty', or ' null'.
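One way to handle such placeholders can be sketched with pandas: strip the surrounding whitespace, then turn a hand-picked list of tokens into NaN so that isna() can see them. The sample column and the MISSING_TOKENS list are assumptions made for illustration; a real dataset needs its own list.

```python
import pandas as pd

# Hypothetical sample column containing disguised missing values
df = pd.DataFrame({"city": ["Paris", "Nill", "-", "Empty", " null", "Rome"]})

# Assumed placeholder tokens; extend this list for your own dataset
MISSING_TOKENS = ["Nill", "-", "Empty", "null"]

# Strip whitespace first so ' null' matches 'null'
stripped = df["city"].str.strip()

# mask() replaces every value flagged True with NaN
cleaned = stripped.mask(stripped.isin(MISSING_TOKENS))

print(cleaned)
print("missing after cleaning:", cleaned.isna().sum())
```

Matching on an explicit list keeps the step predictable: a value is only treated as missing if it was named as a placeholder.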
Note: In data science, missing data is not limited to the values mentioned above; entire rows (or records) can be missing as well. What counts as missing depends on the dataset and the use case. It is a good idea to keep daily, weekly, or monthly visualizations that make such gaps easy to spot.
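As a rough sketch of how missing records could be detected, the snippet below compares the dates present in a table against the full calendar for the same period. The records frame and its column names are made up for this example, and it assumes one record is expected per day.

```python
import pandas as pd

# Hypothetical daily records; two days have no row at all
records = pd.DataFrame({
    "date": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-04", "2024-01-05", "2024-01-07"]
    ),
    "orders": [12, 9, 15, 11, 8],
})

# Full calendar between the first and last recorded day
expected = pd.date_range(records["date"].min(), records["date"].max(), freq="D")

# Days in the calendar that never appear in the data
missing_days = expected.difference(pd.DatetimeIndex(records["date"]))

print(missing_days.strftime("%Y-%m-%d").tolist())
```

For weekly or monthly data, the same idea applies with a different freq value in date_range.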
Then the question is: how do we reach a conclusion about the pattern of missing values?
One approach is to get the length of each value in every text column, then look at the most frequent lengths and the values behind them.
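The length check could look something like this in pandas. The cabin column below is invented for the example, and the cutoff of two characters is an arbitrary assumption; the point is only that unusually short values are worth a closer look.

```python
import pandas as pd

# Hypothetical text column; some short values are disguised gaps
df = pd.DataFrame({"cabin": ["C85", "C123", "-", "E46", "-", "G6", "NA", "B42"]})

# Length of each value, then how often each length occurs
lengths = df["cabin"].str.len()
length_counts = lengths.value_counts().sort_index()
print(length_counts)

# Look at the actual values behind the shortest lengths
# (the cutoff of 2 is an assumption for this sample)
suspects = df.loc[lengths <= 2, "cabin"].value_counts()
print(suspects)
```

Length alone does not prove a value is missing: in this sample 'G6' is short but legitimate, so the suspects still need to be inspected by eye.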
This post covers the following topics:
Impact of the missing values.
How to analyze/visualize the missing values.
How to fix the missing values.
Example code using pandas and PySpark.
Please download the data for this example from https://www.kaggle.com/competitions/titanic/data?select=train.csv
How missing values impact the analysis:
Analyzing and handling missing data correctly is important: the insights we visualize are only reliable, and the decisions based on them only sound, if the gaps are accounted for.
How to analyze NaN (missing) values:
First, here is a sample of missing data in the Age and Cabin columns.
df.info() shows the non-null count for each column; comparing it with the total number of rows gives the count of missing values.
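A minimal sketch of this check is shown below. It uses a small made-up frame in place of the Titanic data so that it runs on its own; isna().sum() is added here as a more direct way to read the missing counts than df.info().

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the Titanic data
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# Prints the non-null count and dtype of every column
df.info()

# Missing values per column, as a count and as a percentage of rows
missing_count = df.isna().sum()
missing_pct = df.isna().mean() * 100

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary.sort_values("missing", ascending=False))
```

Sorting by the missing count puts the columns that need the most attention at the top.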
Visualizing the missing values
Step one: Download the Titanic data for this example from https://www.kaggle.com/competitions/titanic/data?select=train.csv and save train.csv as titanic_train.csv, the file name used in the code.
Step two: Open a Jupyter Notebook, import the necessary libraries, and read the CSV using pandas
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
explore_titanic_train = pd.read_csv('titanic_train.csv')
Step three: Visualize the missing values using displot and a heatmap
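# displot is a figure-level function and creates its own figure,
# so the size is set with height/aspect instead of plt.figure(figsize=...)
sn.displot(
    data=explore_titanic_train.isna().melt(value_name="missing"),
    y="variable",
    hue="missing",
    multiple="fill",
    height=6,
    aspect=1.25
)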
Using a heatmap visualization
# yticklabels=False hides the row labels, which would overlap with hundreds of rows
sn.heatmap(explore_titanic_train.isnull(), yticklabels=False, cbar=False, cmap='viridis')
So it is crystal clear that Age and Cabin have a notable number of missing values.
How to fix the missing values, along with code examples using pandas and PySpark, will be covered in another post soon.