Data Preprocessing

mli · June 9, 2020, 7:48pm

http://d2l.ai/chapter_preliminaries/pandas.html

Kunxia_Wang · February 23, 2021, 5:37am

Hi, in the exercise I used data.isna() together with max() to find which column has the most missing values. Does anyone know an easier way to do this?

Kjmtz · June 8, 2021, 8:21am

In my case, I use data.isnull().sum() to count the NaN values in each column and max() to find the largest count, then I use dropna() to delete the column I want.

j101 · October 25, 2021, 3:25am

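I defined a function to find the column with the most NaN values:

def get_max_col_name(data):
    # Count the NaN values per column once, then return the first
    # column whose count equals the maximum.
    na_counts = data.isnull().sum()
    max_na = na_counts.max()  # avoid shadowing the built-in max()
    for name in data.columns:
        if na_counts[name] == max_na:
            return name

Then I used the function in DataFrame.drop (columns= already implies axis=1):
data_dropna = data.drop(columns=[get_max_col_name(data)])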

Den_G · November 29, 2021, 7:42am

inputs.isnull().sum().index[inputs.isnull().sum().values == inputs.isnull().sum().max()]

This might work: it returns the labels of all columns tied for the most missing values.
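A shorter variant of the approaches in this thread, as a minimal sketch: `Series.idxmax()` returns the label of the first entry with the largest value, so it can replace the explicit loop or mask. The small `data` frame below is made up purely for illustration and is not the dataset from the chapter.

```python
import numpy as np
import pandas as pd

# Made-up frame for illustration only.
data = pd.DataFrame({
    "NumRooms": [np.nan, 2.0, 4.0, np.nan],
    "Alley": ["Pave", np.nan, np.nan, np.nan],
    "Price": [127500, 106000, 178100, 140000],
})

# Number of missing values per column.
na_counts = data.isna().sum()

# Label of the column with the most missing values
# (the first such column if several are tied).
worst_col = na_counts.idxmax()

# Drop that column by name.
data_dropped = data.drop(columns=worst_col)
print(worst_col)
print(data_dropped)
```

Note that on ties `idxmax()` keeps only the first column, whereas the boolean-mask one-liner returns every tied column.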

gagskumar · January 17, 2025, 8:08am

Exercises

Try loading datasets, e.g., Abalone from the UCI Machine Learning Repository and inspect their properties. What fraction of them has missing values? What fraction of the variables is numerical, categorical, or text?

# Ex-1: load the Abalone dataset and inspect it for missing values.
import os

import pandas as pd

# abalone.data is a headerless CSV file; this path assumes it was saved under ../data/abalone.
data_file = os.path.join('..', 'data', 'abalone', 'abalone.data')
# abalone.names has the attribute information - column names copied from there.
abalone = pd.read_csv(
    data_file,
    header=None,
    names=["Sex", "Length", "Diameter", "Height", "Whole weight",
           "Shucked weight", "Viscera weight", "Shell weight", "Rings"],
)
# 1. Check for missing values using isnull()
missing_values = abalone.isnull()
print("Boolean DataFrame indicating missing values:\n", missing_values)

# 2. Count missing values in each column
missing_count = abalone.isnull().sum()
print("\nNumber of missing values in each column:\n", missing_count)

# 3. Find rows with at least one missing value
rows_with_missing = abalone[abalone.isnull().any(axis=1)]
print("\nRows with at least one missing value:\n", rows_with_missing)

Output

Boolean DataFrame indicating missing values:
         Sex  Length  Diameter  Height  Whole weight  Shucked weight  \
0     False   False     False   False         False           False   
1     False   False     False   False         False           False   
2     False   False     False   False         False           False   
...     ...     ...       ...     ...           ...             ...   
4172  False   False     False   False         False           False   
4176  False   False     False   False         False           False   

      Viscera weight  Shell weight  Rings  
0              False         False  False  
1              False         False  False  
2              False         False  False  
...              ...           ...    ...  
4176           False         False  False  

[4177 rows x 9 columns]

Number of missing values in each column:
 Sex               0
Length            0
Diameter          0
Height            0
Whole weight      0
Shucked weight    0
Viscera weight    0
Shell weight      0
Rings             0
dtype: int64

Rows with at least one missing value:
 Empty DataFrame
Columns: [Sex, Length, Diameter, Height, Whole weight, Shucked weight, Viscera weight, Shell weight, Rings]
Index: []
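The code above covers the missing-value part of the exercise but not the question about variable types. One possible way to answer it, as a rough sketch: use `select_dtypes` to split the columns into numeric and non-numeric ones. The `abalone_sample` frame below is a small hand-typed stand-in with a subset of the columns, used only so the snippet runs without the data file; with the real data, `abalone` would take its place.

```python
import pandas as pd

# Hand-typed stand-in for the real DataFrame (illustrative values only).
abalone_sample = pd.DataFrame({
    "Sex": ["M", "M", "F"],
    "Length": [0.455, 0.350, 0.530],
    "Height": [0.095, 0.090, 0.135],
    "Rings": [15, 7, 9],
})

# Fraction of all cells that are missing.
missing_fraction = abalone_sample.isna().to_numpy().mean()

# Split the columns by dtype.
numeric_cols = abalone_sample.select_dtypes(include="number").columns
non_numeric_cols = abalone_sample.select_dtypes(exclude="number").columns
numeric_fraction = len(numeric_cols) / abalone_sample.shape[1]

print("Fraction of missing cells:", missing_fraction)
print("Fraction of numeric columns:", numeric_fraction)

# Heuristic (an assumption, not a rule): a non-numeric column with few
# distinct values is probably categorical rather than free text.
print(abalone_sample[non_numeric_cols].nunique())
```

On the full dataset the only non-numeric column should be Sex, which abalone.names describes as taking the values M, F and I, so it is categorical rather than text.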

Try indexing and selecting data columns by name rather than by column number. The pandas documentation on indexing has further details on how to do this.

# Ex-2: select columns by name rather than by column number.

abalone[["Sex","Height"]]

Output

     Sex  Height
0      M   0.095
1      M   0.090
2      F   0.135
...  ...     ...
4175   F   0.150
4176   M   0.195

4177 rows × 2 columns
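For comparison, a small sketch of label-based selection with `.loc` next to position-based selection with `.iloc`. The `df` frame is a hand-typed stand-in with three columns, assumed here only so the snippet is self-contained.

```python
import pandas as pd

# Hand-typed stand-in frame (illustrative values only).
df = pd.DataFrame({
    "Sex": ["M", "M", "F"],
    "Height": [0.095, 0.090, 0.135],
    "Rings": [15, 7, 9],
})

# Select columns by name ...
by_name = df.loc[:, ["Sex", "Height"]]
# ... versus by integer position.
by_position = df.iloc[:, [0, 1]]

# .loc also combines a row filter with a column label.
tall_rings = df.loc[df["Height"] > 0.09, "Rings"]

print(by_name.equals(by_position))
print(tall_rings)
```

Selecting by name keeps working if the column order changes, which is the main reason to prefer it over positions.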

Oliz888 · June 7, 2025, 6:30am

Hi, I have a question about why two dummy columns are created when a single one could represent the same information. Also, won't this cause a multicollinearity problem?
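A possible illustration of the point, assuming the question refers to `pd.get_dummies(..., dummy_na=True)` producing one indicator per category plus one for NaN. The two indicators always sum to 1, so one is redundant, and `drop_first=True` keeps a single column. The `inputs` frame below is made up for the example.

```python
import numpy as np
import pandas as pd

# Made-up frame for illustration only.
inputs = pd.DataFrame({
    "NumRooms": [3.0, 2.0, 4.0, 3.0],
    "Alley": ["Pave", np.nan, np.nan, np.nan],
})

# One indicator per category, plus one for missing values.
both = pd.get_dummies(inputs, dummy_na=True)

# Drop the first level so that only one indicator remains.
one = pd.get_dummies(inputs, dummy_na=True, drop_first=True)

print(both.columns.tolist())
print(one.columns.tolist())

# The two indicators are perfectly dependent: they sum to 1 in every row.
print((both["Alley_Pave"].astype(int) + both["Alley_nan"].astype(int)).tolist())
```

Whether the redundancy matters depends on the model: it makes an unregularized linear regression with an intercept ill-posed, while regularized models and neural networks usually tolerate it.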