Data Preprocessing

https://zh.d2l.ai/chapter_preliminaries/pandas.html

Question 1: data2 = data2.iloc[:, data2.isna().sum().values < data2.isna().sum().max()]. Is there a simpler way to do this?

data2 = data2.drop(data2.isna().sum().idxmax(), axis=1)

Question 1: data = data.drop(data.count().idxmin(), axis=1)

All of the approaches above work. I tried to analyze how they work:

Code:
# Drop the column with the most missing values
data2 = data
print("data2 Addr:", id(data2))
print(data2)

print("\ndata2.isna():")
print(type(data2.isna()))
print(data2.isna())

print("\ndata2.isna().sum():")
print(type(data2.isna().sum()))
print(data2.isna().sum())

print("\ndata2.isna().sum().values:")
print(type(data2.isna().sum().values))
print(data2.isna().sum().values)

print("delete column by iloc:")
data2 = data2.iloc[:, data2.isna().sum().values < data2.isna().sum().max()]
print("data2 Addr:", id(data2))
print(data2)

print("\n\n\nanother way:")
data3 = data  # assign first; printing id(data3) before this line fails in a fresh session
print("data3 Addr:", id(data3))

print("\ndata3.isna().sum().idxmax():")
print(type(data3.isna().sum().idxmax()))
print(data3.isna().sum().idxmax())

print("\ndelete by drop:")
print(type(data))
data3 = data3.drop(data3.isna().sum().idxmax(),axis=1)
print(data3)
print(id(data3))

Output:
data2 Addr: 2383387722992
   NumRooms Alley   Price
0       NaN  Pave  127500
1       2.0   NaN  106000
2       4.0   NaN  178100
3       NaN   NaN  140000

data2.isna():
<class 'pandas.core.frame.DataFrame'>
   NumRooms  Alley  Price
0      True  False  False
1     False   True  False
2     False   True  False
3      True   True  False

data2.isna().sum():
<class 'pandas.core.series.Series'>
NumRooms    2
Alley       3
Price       0
dtype: int64

data2.isna().sum().values:
<class 'numpy.ndarray'>
[2 3 0]
delete column by iloc:
data2 Addr: 2381973028288
   NumRooms   Price
0       NaN  127500
1       2.0  106000
2       4.0  178100
3       NaN  140000

another way:
data3 Addr: 2381823401168

data3.isna().sum().idxmax():
<class 'str'>
Alley

delete by drop:
<class 'pandas.core.frame.DataFrame'>
   NumRooms   Price
0       NaN  127500
1       2.0  106000
2       4.0  178100
3       NaN  140000
2383388074912

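As the output shows, both the iloc and the drop approach return a new object rather than updating the DataFrame in place.
Beyond that, the iloc approach has to compute the maximum and then make a second pass comparing every column's count against it, while drop only needs to locate the maximum once, so it is slightly simpler.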
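One behavioural difference between the two approaches is worth keeping in mind: when several columns tie for the most missing values, the boolean mask removes all of them, whereas idxmax returns only the first such label. The sketch below illustrates this on a small hypothetical frame named tied (not part of the thread's data); it also checks that count().idxmin() picks the same column as isna().sum().idxmax().

```python
import numpy as np
import pandas as pd

# Hypothetical frame: columns "a" and "b" each have two missing values.
tied = pd.DataFrame({
    "a": [np.nan, np.nan, 1.0],
    "b": [np.nan, 2.0, np.nan],
    "c": [1.0, 2.0, 3.0],
})

na_counts = tied.isna().sum()

# Boolean mask: keep only columns strictly below the maximum count,
# so every column tied for the maximum is removed.
by_mask = tied.loc[:, na_counts < na_counts.max()]

# idxmax returns the label of the first maximum, so only one column is dropped.
by_drop = tied.drop(columns=na_counts.idxmax())

# Fewest non-missing values is the same criterion seen from the other side.
by_count = tied.count().idxmin()

print(list(by_mask.columns))
print(list(by_drop.columns))
print(by_count)
```

Which behaviour is preferable depends on the exercise: if the goal is to drop exactly one column, the drop version is the safer choice.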

Exercise 2:
Code:
# Convert the processed data to tensor format
import tensorflow as tf
Z = tf.constant(data2.values)
Z

Output:
<tf.Tensor: shape=(4, 2), dtype=float64, numpy=
array([[      nan, 1.275e+05],
       [2.000e+00, 1.060e+05],
       [4.000e+00, 1.781e+05],
       [      nan, 1.400e+05]])>

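Now that we can drop the column with the most missing values, how do we drop the row with the most missing values?

# Drop the row with the most missing values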
data3 = data
data3 = data3.drop(data3.isna().sum(axis=1).idxmax())
data3
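To make the row case easier to follow, here is a self-contained sketch that rebuilds the four-row table shown earlier in the thread from an in-memory CSV string (the csv_text variable is an assumption made for illustration) and prints the intermediate per-row counts.

```python
import io
import pandas as pd

# Same values as the table printed earlier; "NA" is parsed as missing by read_csv.
csv_text = """NumRooms,Alley,Price
NA,Pave,127500
2,NA,106000
4,NA,178100
NA,NA,140000
"""
data = pd.read_csv(io.StringIO(csv_text))

# axis=1 sums across the columns, giving one missing-value count per row.
row_na = data.isna().sum(axis=1)

# idxmax returns the index label (not the position) of the first maximum.
worst_row = row_na.idxmax()

# drop removes rows by label when index= is used and returns a new frame.
trimmed = data.drop(index=worst_row)

print(row_na.tolist())
print(worst_row)
print(trimmed)
```

Because drop works on labels, the result keeps the original index values; calling reset_index(drop=True) afterwards renumbers the rows if contiguous labels are needed.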

inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]
inputs = inputs.fillna(inputs.mean())
print(inputs)
The code above raises an error, because inputs.mean() also tries to process the non-numeric Alley column: can only concatenate str (not "int") to str
We can fix the code as follows:
inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]
numeric_data = inputs.drop(columns=['Alley'])
inputs = inputs.fillna(numeric_data.mean())
print(inputs)


Thanks, I ran into this problem too. I wonder if there is a way to identify all the numeric columns automatically.
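One possible answer is DataFrame.select_dtypes, which filters columns by dtype so that column names such as 'Alley' do not need to be hard-coded. The sketch below is a minimal illustration on a hand-built inputs frame (an assumption that mirrors the first two columns of the thread's data), not the only way to do it.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the first two columns of the thread's data.
inputs = pd.DataFrame({
    "NumRooms": [np.nan, 2.0, 4.0, np.nan],
    "Alley": ["Pave", np.nan, np.nan, np.nan],
})

# Pick out every numeric column without naming any of them.
numeric_cols = inputs.select_dtypes(include="number").columns

# Fill missing values only in the numeric columns, using each column's mean.
filled = inputs.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())

print(list(numeric_cols))
print(filled)
```

A shorter alternative with the same effect here is inputs.fillna(inputs.mean(numeric_only=True)), since fillna with a Series only touches the columns named in that Series.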

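I did my homework by adding three code cells at the end of the notebook. Along the way I fixed a few errors that were probably caused by newer package versions.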

import os
import pandas as pd
import tensorflow as tf

# Preprocessing
os.makedirs(os.path.join('..', 'data'), exist_ok=True)
exercise_file = os.path.join('..', 'data', 'pandas_exercise.csv')
with open(exercise_file, 'w') as f:
    # Column names
    f.write('NumRooms,Alley,Bathrooms,Bedrooms,Price\n')

    # More rows of sample data
    data = [
        'NA,Pave,2,3,127500\n',
        '2,Pave,1,2,106000\n',
        '4,NA,3,4,178100\n',
        'NA,Pave,2,NA,140000\n',
        '3,Pave,NA,3,150000\n',
        '2,Pave,1,2,120000\n',
        '3,Pave,NA,3,175000\n',
        '4,Pave,NA,4,190000\n',
        '2,NA,1,2,110000\n',
        '3,Pave,NA,3,160000\n',
        '4,NA,3,4,185000\n',
        'NA,Pave,NA,3,145000\n',
    ]

    # Write each row to the file
    for line in data:
        f.write(line)
# Exercise 1
exercise_data = pd.read_csv(exercise_file)
print(exercise_data)

max_missing_col = exercise_data.isnull().sum().idxmax()
exercise_data = exercise_data.drop(max_missing_col, axis=1)     # axis=1 drops a column; axis=0 would drop a row in the same way
print(exercise_data)
# Exercise 2
exercise_inputs, exercise_outputs = exercise_data.iloc[:, 0:3], exercise_data.iloc[:, -1]
exercise_inputs = exercise_inputs.fillna(exercise_inputs.mean(numeric_only=True))
exercise_inputs = pd.get_dummies(exercise_inputs, dummy_na=True)
P, q = tf.constant(exercise_inputs.astype(float).values), tf.constant(exercise_outputs.values)
print(P, q, sep='\n')
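The astype(float) call above matters because recent pandas releases return boolean indicator columns from get_dummies by default. As a hedged alternative, get_dummies also accepts a dtype argument, so the conversion can happen in the same call. The tiny alley frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up single-column frame with one missing category.
alley = pd.DataFrame({"Alley": ["Pave", np.nan, "Pave"]})

# dtype=float asks for numeric indicator columns directly,
# and dummy_na=True adds an extra indicator column for missing values.
encoded = pd.get_dummies(alley, dummy_na=True, dtype=float)

print(encoded)
print(encoded.dtypes)
```

Either form yields a purely numeric frame whose .values can be passed on to a tensor constructor.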