Hi, in the exercise, I use data.isna and ‘max’ function to find which column has the most missing value, do you guys have any other easy way to do?
Well, in my case, I use data.isnull().sum() to find the number of NaN in each columns and find the most one by using max(), then I use dropna() to delete to the column that I want.
I defined a function to find the column with the max na:
def get_max_col_name(data): max = data.isnull().sum().max() for i in range(len(data.columns)): if data.iloc[:,i].isnull().sum() == max: return data.iloc[:,i].name
Then used the function in dataframe.drop:
data_dropna = data.drop(columns=[get_max_col_name(data)], axis=1)
inputs.isnull().sum().index[inputs.isnull().sum().values == inputs.isnull().sum().max()]
this might work
Exercises
- Try loading datasets, e.g., Abalone from the UCI Machine Learning Repository and inspect their properties. What fraction of them has missing values? What fraction of the variables is numerical, categorical, or text?
#Ex-1 Try loading datasets, e.g., Abalone from the UCI Machine Learning Repository and inspect their properties. What fraction of them has missing values? What fraction of the variables is numerical, categorical, or text?
import pandas as pd
data_file = os.path.join('..', 'data\\abalone', 'abalone.data') #This is a CSV file
# abalone.names has the attribute information - column names copied from there.
abalone = pd.read_csv(data_file, header=None,
names = ["Sex","Length","Diameter","Height","Whole weight","Shucked weight","Viscera weight","Shell weight","Rings"])
# 1. Check for missing values using isnull()
missing_values = abalone.isnull()
print("Boolean DataFrame indicating missing values:\n", missing_values)
# 2. Count missing values in each column
missing_count = abalone.isnull().sum()
print("\nNumber of missing values in each column:\n", missing_count)
# 3. Find rows with at least one missing value
rows_with_missing = abalone[abalone.isnull().any(axis=1)]
print("\nRows with at least one missing value:\n", rows_with_missing)
#print(data)
Output
Boolean DataFrame indicating missing values:
Sex Length Diameter Height Whole weight Shucked weight \
0 False False False False False False
1 False False False False False False
2 False False False False False False
... ... ... ... ... ... ...
4172 False False False False False False
4176 False False False False False False
Viscera weight Shell weight Rings
0 False False False
1 False False False
2 False False False
... ... ... ...
4176 False False False
[4177 rows x 9 columns]
Number of missing values in each column:
Sex 0
Length 0
Diameter 0
Height 0
Whole weight 0
Shucked weight 0
Viscera weight 0
Shell weight 0
Rings 0
dtype: int64
Rows with at least one missing value:
Empty DataFrame
Columns: [Sex, Length, Diameter, Height, Whole weight, Shucked weight, Viscera weight, Shell weight, Rings]
Index: []
- Try indexing and selecting data columns by name rather than by column number. The pandas documentation on indexing has further details on how to do this.
#Ex-2 Try indexing and selecting data columns by name rather than by column number. The pandas documentation on indexing has further details on how to do this.
abalone[["Sex","Height"]]
Output
Sex Height
0 M 0.095
1 M 0.090
2 F 0.135
... ... ...
4175 F 0.150
4176 M 0.195
4177 rows × 2 columns