数据预处理

https://zh.d2l.ai/chapter_preliminaries/pandas.html

问题1:data2 = data2.iloc[:, data2.isna().sum().values < data2.isna().sum().max()],不知道还有什么更简单的方法吗

data2 = data2.drop(data.isna().sum().idxmax(),axis=1)

问题1 data = data.drop(data.count().idxmin(),axis=1)

上面的方法都可以,试着分析了下原理:

代码:
#删除缺失值最多的列
data2 = data
print(“data2 Addr:”,id(data2))
print(data2)

print("\ndata2.isna():")
print(type(data2.isna()))
print(data2.isna())

print("\ndata2.isna().sum():")
print(type(data2.isna().sum()))
print(data2.isna().sum())

print("\ndata2.isna().sum().values:")
print(type(data2.isna().sum().values))
print(data2.isna().sum().values)

print(“delete column by iloc:”)
data2 = data2.iloc[:, data2.isna().sum().values < data2.isna().sum().max()]
print(“data2 Addr:”,id(data2))
print(data2)

print("\n\n\nanother way:")
print(“data3 Addr:”,id(data3))
data3 = data

print("\ndata3.isna().sum().idxmax():")
print(type(data3.isna().sum().idxmax()))
print(data3.isna().sum().idxmax())

print("\ndelete by drop:")
print(type(data))
data3 = data3.drop(data3.isna().sum().idxmax(),axis=1)
print(data3)
print(id(data3))

输出:
data2 Addr: 2383387722992
NumRooms Alley Price
0 NaN Pave 127500
1 2.0 NaN 106000
2 4.0 NaN 178100
3 NaN NaN 140000

data2.isna():
<class ‘pandas.core.frame.DataFrame’>
NumRooms Alley Price
0 True False False
1 False True False
2 False True False
3 True True False

data2.isna().sum():
<class ‘pandas.core.series.Series’>
NumRooms 2
Alley 3
Price 0
dtype: int64

data2.isna().sum().values:
<class ‘numpy.ndarray’>
[2 3 0]
delete column by iloc:
data2 Addr: 2381973028288
NumRooms Price
0 NaN 127500
1 2.0 106000
2 4.0 178100
3 NaN 140000

another way:
data3 Addr: 2381823401168

data3.isna().sum().idxmax():
<class ‘str’>
Alley

delete by drop:
<class ‘pandas.core.frame.DataFrame’>
NumRooms Price
0 NaN 127500
1 2.0 106000
2 4.0 178100
3 NaN 140000
2383388074912

可以看到,iloc和drop两种方法都会产生新的引用而不是原地更新。
在此基础上,iloc的方法中除了要判断max之外还要遍历一次作判断,而drop只需要判断一次max,略微简单一些。

习题2:
代码:
#处理后的数据转换为张量格式
import tensorflow as tf
Z = tf.constant(data2.values)
Z

输出:
<tf.Tensor: shape=(4, 2), dtype=float64, numpy=
array([[ nan, 1.275e+05],
[2.000e+00, 1.060e+05],
[4.000e+00, 1.781e+05],
[ nan, 1.400e+05]])>

会删缺失值最多的列了,那么缺失值最多的行怎么删除呢?

#删除缺失值最多的行
data3 = data
data3 = data3.drop(data3.isna().sum(axis=1).idxmax())
data3

inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]
inputs = inputs.fillna(inputs.mean())
print(inputs)
The above code will occur error:can only concatenate str (not “int”) to str :boom:
so we can fix code as : inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]
numeric_data = inputs.drop(columns=[‘Alley’])
inputs = inputs.fillna(numeric_data.mean())
print(inputs)

1 Like

Thx, this problem occurred to me too. I wonder if there is a way to categorize all the numeric columns.