I spent about an hour searching for related material and trying to understand the pandas functions involved.
import os
data_file = '../data/results11.csv'
import pandas as pd
data = pd.read_csv(data_file)
print(data.head())
# Largest number of NaN values found in any single column
m = max(data.isnull().sum(axis = 0))
data_dropmaxnan = data.dropna(axis = 1, thresh = len(data)+1-m)
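thresh: with axis = 1, keep only the columns that have at least len(data)+1-m non-NA values.
The column with the most NaN values has len(data)-m non-NA values.
We don't want to keep it, so the threshold is one higher than that. Note that every column tied for the maximum is dropped.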
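To check the threshold arithmetic, here is a minimal sketch on a toy DataFrame. The column names and values are made up for illustration and stand in for the CSV file; only the isnull/sum/dropna calls come from the post above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for results11.csv
data = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 3.0, np.nan],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# NaN count per column: a has 1, b has 3, c has 0
nan_counts = data.isnull().sum(axis=0)
m = int(nan_counts.max())

# Column "b" has len(data) - m = 1 non-NA value, so requiring
# one more than that (2) drops "b" and keeps the others.
data_dropmaxnan = data.dropna(axis=1, thresh=len(data) + 1 - m)
print(list(data_dropmaxnan.columns))
```

Here len(data) is 4 and m is 3, so thresh is 2: columns "a" (3 non-NA values) and "c" (4) pass, while "b" (1) does not.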
How could I compute this myself?
Writing a loop over a DataFrame's columns is hard...
How can I read the source code of pandas functions?
For example: dataframe.isnull().sum
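One possible way to look at the source, assuming pandas is installed as ordinary Python files, is the standard-library inspect module. This is only a sketch; in Jupyter or IPython, typing pd.DataFrame.isnull?? should show much the same thing.

```python
import inspect

import pandas as pd

# Path of the file where DataFrame.isnull is defined
print(inspect.getsourcefile(pd.DataFrame.isnull))

# The Python source of the method itself
src = inspect.getsource(pd.DataFrame.isnull)
print(src)
```

This only works for code written in Python. Parts of pandas and NumPy that are compiled extensions have no Python source to print, so for those you would need to browse the project's repository instead.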
I want to figure out how to loop over a DataFrame's columns.
I also read about the 'apply' function at https://towardsdatascience.com/how-to-make-your-pandas-loop-71-803-times-faster-805030df4f06.
Maybe I will try it next time.
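For the column loop, a minimal sketch might look like the following. The toy DataFrame is hypothetical; the loop counts NaN values by hand, and the apply call is shown only for comparison with data.isnull().sum().

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real CSV
data = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 3.0, np.nan],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# data.items() yields (column label, column as a Series) pairs
nan_counts = {}
for col, series in data.items():
    # Count missing values one element at a time
    nan_counts[col] = sum(1 for value in series if pd.isna(value))

# Label of the column with the most NaN values
worst_col = max(nan_counts, key=nan_counts.get)

# The same per-column count written with apply
via_apply = data.apply(lambda s: s.isnull().sum())

print(nan_counts, worst_col)
```

The explicit loop is mostly useful for understanding what happens; the vectorized data.isnull().sum() gives the same counts and is usually faster on real data.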
0. I have already done this with PyTorch's API.
1. What is the best way to read PyTorch's source code? Please give me some tips.
2. How do I loop over a DataFrame's columns? I'm trying to use a loop to calculate data.isnull().sum().
Thanks.
I will read them later.
Now I have an issue with d2l/torch.py.
I renamed it to imtorch.py so that its name does not clash with the torch package.
$ /usr/bin/env python "d:\onedrive\文档\read\d2l\d2l\imtorch.py"
Traceback (most recent call last):
  File "d:\onedrive\文档\read\d2l\d2l\imtorch.py", line 22, in <module>
    import torch
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\__init__.py", line 81, in <module>
    ctypes.CDLL(dll)
  File "C:\ProgramData\Anaconda3\lib\ctypes\__init__.py", line 364, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: [WinError 126] The specified module could not be found
After running pip install -U d2l -f https://d2l.ai/whl.html,
I can run from d2l import torch as d2l directly.
Thanks.
But I'm still confused about the error I get when I run imtorch.py (renamed from d2l/torch.py) directly. It is the same traceback as before: the import torch on line 22 fails inside ctypes.CDLL(dll) with OSError: [WinError 126] The specified module could not be found.
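import pandas as pd
# Assumption: data_file is the CSV path used earlier in this thread
data_file = '../data/results11.csv'
data = pd.read_csv(data_file)
print(data)
# Largest number of missing values in any single column
Thresh = max(data.isnull().sum(axis=0))
print(Thresh)
# Keep only the columns with more non-NA values than the worst column has
pro_data = data.dropna(axis=1, thresh=data.shape[0] - Thresh + 1)
print(pro_data)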
PS: If you want to delete the ROW with the most missing values instead, change the two lines as follows:
Thresh=max(data.isnull().sum(axis=1))
pro_data=data.dropna(axis=0,thresh=data.shape[1]-Thresh+1)
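As a quick check of the row variant, here is a small sketch on made-up data (the DataFrame below is hypothetical). With axis=1 in the sum, each row gets its own NaN count, and the threshold is measured against the number of columns, data.shape[1].

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 has two missing values, more than any other row
data = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 3.0, np.nan],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# NaN count per row: 1, 2, 0, 1
Thresh = int(data.isnull().sum(axis=1).max())

# A row needs at least 3 - 2 + 1 = 2 non-NA values to survive
pro_data = data.dropna(axis=0, thresh=data.shape[1] - Thresh + 1)
print(pro_data)
```

Only the row with index 1 is removed here. If several rows tie for the largest number of missing values, all of them are dropped.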
In section 2.2.3, Conversion to the Tensor Format, the code uses the .values attribute, but I believe (at least according to the pandas documentation) that the .to_numpy() method is now preferred.
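A minimal sketch of the difference, using a made-up inputs DataFrame rather than the book's actual data. Both calls return a NumPy array; to_numpy additionally lets you state the dtype you want.

```python
import numpy as np
import pandas as pd

# Hypothetical preprocessed inputs; column names are for illustration only
inputs = pd.DataFrame({
    "x": [3.0, 2.0, 4.0, 3.0],
    "flag": [1, 0, 0, 0],
})

arr_values = inputs.values                # older attribute-style access
arr_numpy = inputs.to_numpy(dtype=float)  # recommended method, explicit dtype

print(arr_numpy.dtype, arr_numpy.shape)
# Either array could then be handed to a tensor constructor
# such as torch.tensor(arr_numpy), if PyTorch is installed.
```

For a DataFrame like this one the two arrays hold the same numbers, so switching to to_numpy() should not change the results of the section, only make the conversion more explicit.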