Exercise 1: data2 = data2.iloc[:, data2.isna().sum().values < data2.isna().sum().max()]. Is there a simpler way to do this?
data2 = data2.drop(data2.isna().sum().idxmax(), axis=1)
Exercise 1: data = data.drop(data.count().idxmin(), axis=1)
All of the approaches above work. I tried to analyze how they work:
Code:
# Drop the column with the most missing values
data2 = data
print("data2 Addr:", id(data2))
print(data2)
print("\ndata2.isna():")
print(type(data2.isna()))
print(data2.isna())
print("\ndata2.isna().sum():")
print(type(data2.isna().sum()))
print(data2.isna().sum())
print("\ndata2.isna().sum().values:")
print(type(data2.isna().sum().values))
print(data2.isna().sum().values)
print("delete column by iloc:")
data2 = data2.iloc[:, data2.isna().sum().values < data2.isna().sum().max()]
print("data2 Addr:", id(data2))
print(data2)
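print("\n\n\nanother way:")
# Note: data3 must already exist here (e.g., from an earlier notebook cell),
# otherwise the next line raises NameError; the id it prints belongs to that older object.
print("data3 Addr:", id(data3))
data3 = data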
print("\ndata3.isna().sum().idxmax():")
print(type(data3.isna().sum().idxmax()))
print(data3.isna().sum().idxmax())
print("\ndelete by drop:")
print(type(data))
data3 = data3.drop(data3.isna().sum().idxmax(),axis=1)
print(data3)
print(id(data3))
Output:
data2 Addr: 2383387722992
NumRooms Alley Price
0 NaN Pave 127500
1 2.0 NaN 106000
2 4.0 NaN 178100
3 NaN NaN 140000
data2.isna():
<class 'pandas.core.frame.DataFrame'>
NumRooms Alley Price
0 True False False
1 False True False
2 False True False
3 True True False
data2.isna().sum():
<class 'pandas.core.series.Series'>
NumRooms 2
Alley 3
Price 0
dtype: int64
data2.isna().sum().values:
<class 'numpy.ndarray'>
[2 3 0]
delete column by iloc:
data2 Addr: 2381973028288
NumRooms Price
0 NaN 127500
1 2.0 106000
2 4.0 178100
3 NaN 140000
another way:
data3 Addr: 2381823401168
data3.isna().sum().idxmax():
<class 'str'>
Alley
delete by drop:
<class 'pandas.core.frame.DataFrame'>
NumRooms Price
0 NaN 127500
1 2.0 106000
2 4.0 178100
3 NaN 140000
2383388074912
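As the changing ids show, both the iloc and the drop approaches return a new object rather than updating the DataFrame in place.
Beyond that, the iloc approach has to find the maximum and then make one more pass to compare every column's count against it, while drop only needs to find the maximum once, so drop is slightly simpler.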
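The comparison above can be reproduced with a small self-contained sketch. The DataFrame is rebuilt by hand from the printed output (an assumption; the original reads it from a CSV file), and `tied` is a hypothetical extra frame, added only to show that the two approaches differ when two columns tie for the most missing values.

```python
import numpy as np
import pandas as pd

# Rebuilt by hand from the output printed above (assumed equivalent to the CSV data).
data = pd.DataFrame({
    "NumRooms": [np.nan, 2.0, 4.0, np.nan],
    "Alley": ["Pave", np.nan, np.nan, np.nan],
    "Price": [127500, 106000, 178100, 140000],
})

na_counts = data.isna().sum()  # Series: number of missing values per column

# Approach 1: a boolean mask keeps every column whose count is below the maximum.
by_mask = data.iloc[:, (na_counts < na_counts.max()).values]

# Approach 2: drop the single column label returned by idxmax().
by_drop = data.drop(columns=na_counts.idxmax())

print(by_mask.columns.tolist())
print(by_drop.columns.tolist())
print(by_mask is data, by_drop is data)  # neither result is the original object

# Hypothetical frame with a tie: "a" and "b" both have two missing values.
tied = pd.DataFrame({
    "a": [np.nan, np.nan, 1.0],
    "b": [np.nan, 2.0, np.nan],
    "c": [1, 2, 3],
})
tied_counts = tied.isna().sum()
tied_mask = tied.iloc[:, (tied_counts < tied_counts.max()).values]  # removes "a" and "b"
tied_drop = tied.drop(columns=tied_counts.idxmax())                 # removes only "a"
print(tied_mask.columns.tolist(), tied_drop.columns.tolist())
```

So besides the extra pass, the mask version removes every column that ties for the maximum, whereas `idxmax()` returns only the first such label. Which behavior is preferable depends on what the exercise intends.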
Exercise 2:
Code:
# Convert the preprocessed data to tensor format
import tensorflow as tf
Z = tf.constant(data2.values)
Z
Output:
<tf.Tensor: shape=(4, 2), dtype=float64, numpy=
array([[ nan, 1.275e+05],
[2.000e+00, 1.060e+05],
[4.000e+00, 1.781e+05],
[ nan, 1.400e+05]])>
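Note that the tensor above still contains nan, because dropping a column does not fill the remaining gaps. A minimal sketch of one way to impute before converting, assuming mean imputation is acceptable here; the frame is rebuilt by hand from the output above, and a NumPy array stands in for the tensor so the example runs without TensorFlow:

```python
import numpy as np
import pandas as pd

# Rebuilt by hand from the data2 output above (assumed equivalent).
data2 = pd.DataFrame({
    "NumRooms": [np.nan, 2.0, 4.0, np.nan],
    "Price": [127500, 106000, 178100, 140000],
})

# Replace each missing value with its column mean, then convert to a float array.
filled = data2.fillna(data2.mean())
values = filled.to_numpy(dtype=float)
print(values)
```

The resulting array could then be passed to `tf.constant` in the same way as `data2.values` above.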
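Now that we can drop the column with the most missing values, how do we drop the row with the most missing values?
# Drop the row with the most missing values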
data3 = data
data3 = data3.drop(data3.isna().sum(axis=1).idxmax())
data3
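For reference, here is a self-contained sketch of the row version. The DataFrame is rebuilt by hand from the output shown earlier (an assumption), and the intermediate names `row_na`, `worst_row`, and `trimmed` are introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Rebuilt by hand from the output shown earlier (assumed equivalent to the CSV data).
data = pd.DataFrame({
    "NumRooms": [np.nan, 2.0, 4.0, np.nan],
    "Alley": ["Pave", np.nan, np.nan, np.nan],
    "Price": [127500, 106000, 178100, 140000],
})

row_na = data.isna().sum(axis=1)      # axis=1 counts missing values across each row
worst_row = row_na.idxmax()           # index label of the first row with the most NaNs
trimmed = data.drop(index=worst_row)  # drop() removes rows by default (axis=0)

print(row_na.tolist())
print(worst_row)
print(trimmed)
```

As with columns, `idxmax()` returns only the first label when several rows tie for the maximum.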
inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]
inputs = inputs.fillna(inputs.mean())
print(inputs)
The code above raises an error, because inputs.mean() also tries to average the string column Alley: can only concatenate str (not "int") to str
So we can fix the code as follows:
inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]
numeric_data = inputs.drop(columns=['Alley'])
inputs = inputs.fillna(numeric_data.mean())
print(inputs)
Thanks, I ran into this problem too. I wonder if there is a way to pick out all the numeric columns automatically.
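One possible answer is `DataFrame.select_dtypes`, which selects columns by dtype so the column names do not have to be listed by hand. A minimal sketch, with the `inputs` frame rebuilt by hand from the data earlier in the thread (an assumption) and `numeric_cols` and `filled` as illustrative names:

```python
import numpy as np
import pandas as pd

# Rebuilt by hand from the first two columns of the data shown earlier.
inputs = pd.DataFrame({
    "NumRooms": [np.nan, 2.0, 4.0, np.nan],
    "Alley": ["Pave", np.nan, np.nan, np.nan],
})

# Select every numeric column without naming the non-numeric ones explicitly.
numeric_cols = inputs.select_dtypes(include="number").columns

# Fill only the numeric columns with their means; string columns stay untouched.
filled = inputs.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())
print(filled)
```

Here the string column `Alley` keeps its missing values, which can then be handled separately, for example with `pd.get_dummies(..., dummy_na=True)`.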
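I completed the exercises by adding three code cells at the end of the notebook, working around a few errors that were probably caused by newer package versions.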
# Preprocessing
import os
import pandas as pd
import tensorflow as tf
os.makedirs(os.path.join('..', 'data'), exist_ok=True)
exercise_file = os.path.join('..', 'data', 'pandas_exercise.csv')
with open(exercise_file, 'w') as f:
    # Column names
    f.write('NumRooms,Alley,Bathrooms,Bedrooms,Price\n')
    # Generate more rows of data
    data = [
        'NA,Pave,2,3,127500\n',
        '2,Pave,1,2,106000\n',
        '4,NA,3,4,178100\n',
        'NA,Pave,2,NA,140000\n',
        '3,Pave,NA,3,150000\n',
        '2,Pave,1,2,120000\n',
        '3,Pave,NA,3,175000\n',
        '4,Pave,NA,4,190000\n',
        '2,NA,1,2,110000\n',
        '3,Pave,NA,3,160000\n',
        '4,NA,3,4,185000\n',
        'NA,Pave,NA,3,145000\n',
    ]
    # Write each row of data
    for line in data:
        f.write(line)
# Exercise 1
exercise_data = pd.read_csv(exercise_file)
print(exercise_data)
max_missing_col = exercise_data.isnull().sum().idxmax()
exercise_data = exercise_data.drop(max_missing_col, axis=1)  # axis=1 drops a column; similarly, axis=0 drops a row
print(exercise_data)
# Exercise 2
exercise_inputs, exercise_outputs = exercise_data.iloc[:, 0:3], exercise_data.iloc[:, -1]
exercise_inputs = exercise_inputs.fillna(exercise_inputs.mean(numeric_only=True))
exercise_inputs = pd.get_dummies(exercise_inputs, dummy_na=True)
P, q = tf.constant(exercise_inputs.astype(float).values), tf.constant(exercise_outputs.values)
print(P, q, sep='\n')
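The `astype(float)` call is one of the version-related adjustments: in recent pandas releases `get_dummies` returns boolean indicator columns by default, whereas older releases returned small integers, so an explicit cast gives a uniform float array either way. A minimal sketch on a hypothetical three-row frame (not the exercise data), using NumPy in place of TensorFlow so it runs on its own:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature input, for illustration only.
inputs = pd.DataFrame({
    "NumRooms": [3.0, 2.0, 4.0],
    "Alley": ["Pave", np.nan, "Pave"],
})

# dummy_na=True adds an extra indicator column for missing categories.
encoded = pd.get_dummies(inputs, dummy_na=True)
print(encoded.dtypes)  # the indicator dtype depends on the installed pandas version

# Cast everything to float so the array has a single numeric dtype.
as_float = encoded.astype(float)
array = as_float.to_numpy()
print(array.dtype, array.shape)
```

Similarly, `mean(numeric_only=True)` restricts the mean to numeric columns, so the string column `Alley` does not trigger the error discussed earlier in the thread.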