Data engineer – Sharing Knowledge and programming

Handling missing values is not an easy task. We usually think of the typical cases as NaN in a DataFrame or NULL in a database. But what about values like the following in a dataset?

Missing values encoded as 'Nill', '-', 'Empty', and ' null' in a column.
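As a minimal sketch of how such placeholders could be turned into real missing values with pandas: the toy column, the PLACEHOLDERS set, and the normalize_missing helper are all hypothetical names made up for illustration, and the token list has to be adapted to the dataset at hand.

```python
import pandas as pd

# Hypothetical toy column; the placeholder spellings are the ones listed above.
df = pd.DataFrame({"city": ["Oslo", "Nill", "-", "Empty", " null", "Paris"]})

# Tokens treated as missing, compared after trimming whitespace and lower-casing.
PLACEHOLDERS = {"nill", "-", "empty", "null", ""}

def normalize_missing(series: pd.Series) -> pd.Series:
    """Return a copy of the series with placeholder strings replaced by NaN."""
    cleaned = series.astype(str).str.strip().str.lower()
    # mask() replaces the values where the condition is True with NaN
    return series.mask(cleaned.isin(PLACEHOLDERS))

df["city_clean"] = normalize_missing(df["city"])
print(df["city_clean"].isna().sum())  # 4 placeholders are now NaN
```

Trimming and lower-casing before the comparison is what lets ' null' (with a leading space) match the same rule as 'null'.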

Note: In data science, missing data means more than the values mentioned above; entire rows (or records) can be missing as well. What counts as missing depends on the dataset and the use case. It is a good idea to have daily, weekly, or monthly visualizations, which help to spot such gaps easily.
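One possible way to spot missing records is to count rows per day and look for days with no rows at all. The sketch below assumes a hypothetical table with a date column; the column names and values are invented for illustration only.

```python
import pandas as pd

# Hypothetical daily records; 2024-01-03 and 2024-01-04 are deliberately absent.
records = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05", "2024-01-05"]),
    "amount": [10, 12, 9, 4],
})

# Count rows per day, then reindex over the full calendar range so that
# days without any record show up with a count of 0.
daily_counts = records.groupby("date").size()
full_range = pd.date_range(daily_counts.index.min(), daily_counts.index.max(), freq="D")
daily_counts = daily_counts.reindex(full_range, fill_value=0)

missing_days = daily_counts[daily_counts == 0].index
print(missing_days.strftime("%Y-%m-%d").tolist())  # ['2024-01-03', '2024-01-04']
```

Plotting daily_counts (for example with daily_counts.plot()) gives the kind of daily view in which such gaps stand out.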

Then the question is: how do we reach a conclusion about the pattern of missing values?

Get the length of each value in every text column, then look at the most frequently repeated lengths to identify the pattern. Placeholder values tend to stand out because they share the same short length.
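The length check described above could be sketched in pandas as follows. The toy data and the length_profile helper are hypothetical and exist only to illustrate the idea.

```python
import pandas as pd

# Hypothetical toy data with placeholders mixed into two text columns.
df = pd.DataFrame({
    "cabin": ["C85", "-", "E46", "Nill", "-", "B28", "-"],
    "ticket": ["A/5 21171", "PC 17599", " null", "113803", "373450", " null", "330877"],
})

def length_profile(frame: pd.DataFrame) -> dict:
    """Map each text column to the counts of its value lengths, most frequent first."""
    text_cols = [c for c in frame.columns if pd.api.types.is_string_dtype(frame[c])]
    return {col: frame[col].str.len().value_counts() for col in text_cols}

profile = length_profile(df)
for col, counts in profile.items():
    print(col)
    print(counts)

# Inspect the actual values behind a suspicious length, e.g. length 1 in "cabin".
print(df.loc[df["cabin"].str.len() == 1, "cabin"].unique())
```

A length that repeats often but does not fit the usual shape of the column is a hint to look at the underlying values, as the last line does.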

This post focuses on the following topics:

Impact of the missing values.

How to analyze/visualize the missing values.

How to fix the missing values.

Code examples using pandas and PySpark.

Please download the data for this example from https://www.kaggle.com/competitions/titanic/data?select=train.csv

How missing values impact the analysis:

Analyzing and handling missing data in the right way is important, because insights that are visualized correctly lead to better decision-making.

How to analyze NaN (missing) values, typically:

First, look at a sample of the missing data in the Age and Cabin columns.

df.info() shows the non-null count for each column; comparing it with the total number of rows gives the count of missing values.
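To get the missing counts directly rather than deriving them from df.info(), isna() can be combined with sum() and mean(). The sketch below uses a small hypothetical frame that only borrows three Titanic column names; it is not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame reusing three Titanic column names; values are made up.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isna() marks missing cells; sum() counts them, mean() gives their share.
summary = pd.DataFrame({
    "missing": df.isna().sum(),
    "percent": (df.isna().mean() * 100).round(1),
}).sort_values("missing", ascending=False)

print(summary)
```

Sorting by the missing count puts the most affected columns at the top, which is handy on wider datasets.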

Visualizing the missing values

Step one: Download the Titanic data for this example from https://www.kaggle.com/competitions/titanic/data?select=train.csv

Step two: Work in a Jupyter Notebook.

Step three: Import the necessary libraries and read the CSV using pandas.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline

# Kaggle's train.csv, assumed here to be saved as titanic_train.csv
explore_titanic_train = pd.read_csv('titanic_train.csv')

Step four: Visualize using "displot" and a heatmap.

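# displot is a figure-level function and creates its own figure,
# so the size is set with height/aspect instead of plt.figure(figsize=...)
sn.displot(
    data=explore_titanic_train.isna().melt(value_name="missing"),
    y="variable",
    hue="missing",
    multiple="fill",
    height=6,
    aspect=1.25
)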

Using a heatmap visualization

sn.heatmap(explore_titanic_train.isnull(), yticklabels=True, cbar=False, cmap='viridis')

So it is crystal clear that Age and Cabin have a notable number of missing values.

How to fix the missing values, along with code examples using pandas and PySpark, will be covered in another post soon.

As I am used to working on Ubuntu, these are interesting steps to share.

1. First, install Docker Desktop: https://docs.docker.com/desktop/install/windows-install/

2. Make sure to install WSL (I used Ubuntu's latest LTS).

3. Open Docker Desktop and pull jupyter/pyspark-notebook.

4. Run PowerShell with admin privileges and then run: docker run -p 8888:8888 jupyter/pyspark-notebook

Playing with Docker and WSL using Notebook:

Open your WSL command prompt and then try the following.

1. How to access Windows files from WSL (the C: drive is mounted under /mnt/c):

azeem@DESKTOP-VGSDP7F:~$ cp /mnt/c/Users/User/Downloads/Py_DS_ML_Bootcamp-master/Refactored_Py_DS_ML_Bootcamp-master/04-Pandas-Exercises/Salaries.csv .

2. How to log in to the container shell

docker exec -it intelligent_benz bash # replace intelligent_benz with your own container name (see docker ps)

3. How to copy a local Windows file into a Docker container

docker cp /mnt/c/Users/User/Downloads/Py_DS_ML_Bootcamp-master/Refactored_Py_DS_ML_Bootcamp-master/04-Pandas-Exercises/Salaries.csv intelligent_benz:tmp/
Successfully copied 16.1MB to intelligent_benz:tmp/

4. How to copy a file from the container to the local host

sudo docker cp container-id:/path/filename.txt ~/Desktop/xyz.txt

