Hi @StevenJokes, I am confused by your question. Could you describe it in three sentences?
0. I have already done it with PyTorch's API.
1. What is the best way to read PyTorch's source code? Please give me some tips.
2. How do I loop over a DataFrame's columns? I'm trying to use a loop to calculate data.isnull().sum().
Hi @StevenJokes,
1. What is the best way to read PyTorch's source code? Please give me some tips.
Here are some official API documents that may be helpful.
https://pytorch.org/tutorials/beginner/ptcheat.html
https://pytorch.org/docs/stable/index.html#
2. How do I loop over a DataFrame's columns? I'm trying to use a loop to calculate data.isnull().sum().
There are a vast number of tutorials for pandas, and you can find them by searching online. Here is the official guide.
https://pandas.pydata.org/docs/user_guide/index.html#user-guide
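For the looping question itself, here is a minimal sketch, assuming a small toy DataFrame (the frame below is made up for illustration and is not the book's file). It counts missing values column by column and compares the result with the vectorized call.

```python
import pandas as pd

# Hypothetical toy frame; None marks a missing value.
data = pd.DataFrame({
    "NumRooms": [None, 2.0, 4.0, None],
    "Alley": ["Pave", None, None, None],
    "Price": [127500, 106000, 178100, 140000],
})

# Explicit loop: count the missing values one column at a time.
missing_counts = {}
for column in data.columns:
    missing_counts[column] = int(data[column].isnull().sum())

print(missing_counts)

# The vectorized call returns the same counts as a Series indexed by column name.
vectorized = data.isnull().sum()
print(vectorized)
```

The loop is mostly useful for understanding what happens; in practice the vectorized call is shorter and usually faster.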
Thanks.
I will read them later.
Now I have some issues with d2l/pytorch.py.
I have renamed it to impytorch.py to avoid a name clash with the package.
$ /usr/bin/env python "d:\onedrive\文档\read\d2l\d2l\imtorch.py"
Traceback (most recent call last):
  File "d:\onedrive\文档\read\d2l\d2l\imtorch.py", line 22, in <module>
    import torch
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\__init__.py", line 81, in <module>
    ctypes.CDLL(dll)
  File "C:\ProgramData\Anaconda3\lib\ctypes\__init__.py", line 364, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: [WinError 126] The specified module could not be found
my conda info
active environment : pytorch
active env location : C:\Users\a8679\.conda\envs\pytorch
shell level : 1
user config file : C:\Users\a8679\.condarc
populated config files : C:\Users\a8679\.condarc
conda version : 4.8.3
conda-build version : 3.18.9
python version : 3.7.4.final.0
virtual packages :
base environment : C:\ProgramData\Anaconda3 (read only)
channel URLs : https://conda.anaconda.org/conda-forge/win-64
https://conda.anaconda.org/conda-forge/noarch
https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/menpo/win-64
https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/menpo/noarch
https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda/win-64
https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda/noarch
https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/msys2/win-64
https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/msys2/noarch
https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/win-64
https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/noarch
https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud//pytorch/win-64
https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud//pytorch/noarch
https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/win-64
https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/noarch
https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/win-64
https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/noarch
https://repo.anaconda.com/pkgs/main/win-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/r/win-64
https://repo.anaconda.com/pkgs/r/noarch
https://repo.anaconda.com/pkgs/msys2/win-64
https://repo.anaconda.com/pkgs/msys2/noarch
package cache : C:\ProgramData\Anaconda3\pkgs
C:\Users\a8679\.conda\pkgs
C:\Users\a8679\AppData\Local\conda\conda\pkgs
envs directories : C:\Users\a8679\.conda\envs
C:\ProgramData\Anaconda3\envs
C:\Users\a8679\AppData\Local\conda\conda\envs
platform : win-64
user-agent : conda/4.8.3 requests/2.22.0 CPython/3.7.4 Windows/10 Windows/10.0.18362
administrator : False
netrc file : None
offline mode : False
Then I tried "conda install python" and "conda update --all".
I'm still waiting for the update to finish.
I hope everything goes well.
After running pip install -U d2l -f https://d2l.ai/whl.html, I can directly run from d2l import torch as d2l.
Thanks
But I'm confused about the error I get when I run imtorch.py directly (renamed from d2l/torch.py):
$ /usr/bin/env python "d:\onedrive\文档\read\d2l\d2l\imtorch.py"
Traceback (most recent call last):
  File "d:\onedrive\文档\read\d2l\d2l\imtorch.py", line 22, in <module>
    import torch
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\__init__.py", line 81, in <module>
    ctypes.CDLL(dll)
  File "C:\ProgramData\Anaconda3\lib\ctypes\__init__.py", line 364, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: [WinError 126] The specified module could not be found
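One thing that stands out in the traceback: torch is loaded from C:\ProgramData\Anaconda3\lib\site-packages, which is the base environment, while conda info reports the active environment as C:\Users\a8679\.conda\envs\pytorch. A possible explanation (an assumption, not something the logs confirm) is that the script is being run by the base interpreter rather than the one in the pytorch environment. A small check like the following, using only the standard library, may help confirm which interpreter and which torch installation are picked up:

```python
import importlib.util
import sys

# Which interpreter is actually running this script?
print("interpreter:", sys.executable)
print("environment prefix:", sys.prefix)

# Locate torch on disk without importing it, so a broken DLL cannot raise here.
spec = importlib.util.find_spec("torch")
torch_location = spec.origin if spec is not None else None
print("torch found at:", torch_location)
```

If the printed paths point at the base installation, running the script with the Python executable from the pytorch environment would be the next thing to try.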
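import pandas as pd

# data_file is assumed to be defined already (the path to the CSV file)
data = pd.read_csv(data_file)
print(data)
# Largest number of missing values found in any single column
Thresh = max(data.isnull().sum(axis=0))
print(Thresh)
# Keep only the columns that have more non-NA values than the worst column
pro_data = data.dropna(axis=1, thresh=data.shape[0] - Thresh + 1)
print(pro_data)
PS: If you want to delete the ROW with the most missing values instead, make the following changes:
Thresh = max(data.isnull().sum(axis=1))
pro_data = data.dropna(axis=0, thresh=data.shape[1] - Thresh + 1)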
Thanks for your answers which resolved my questions.
result_data = data.dropna(axis=1, thresh=min(data.count(axis=0))+1)
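The two thresholds used so far, data.shape[0] - Thresh + 1 and min(data.count(axis=0)) + 1, are the same number, because the non-NA count of a column is the row count minus its NA count. A minimal sketch on a made-up toy frame (not the book's file) to check this:

```python
import pandas as pd

# Hypothetical toy frame; None marks a missing value.
data = pd.DataFrame({
    "NumRooms": [None, 2.0, 4.0, None],
    "Alley": ["Pave", None, None, None],
    "Price": [127500, 106000, 178100, 140000],
})

n_rows = data.shape[0]
max_missing = int(data.isnull().sum(axis=0).max())  # NA count of the worst column
min_present = int(data.count(axis=0).min())         # non-NA count of the worst column

# thresh is the minimum number of non-NA values a column needs in order to be kept.
thresh_a = n_rows - max_missing + 1
thresh_b = min_present + 1

dropped_a = data.dropna(axis=1, thresh=thresh_a)
dropped_b = data.dropna(axis=1, thresh=thresh_b)
print(thresh_a, thresh_b)
print(dropped_a)
```

Both calls drop every column that has no more non-NA values than the worst column, so columns tied for the most missing values are all removed.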
- Delete the column with the most missing values.
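data = data.dropna(axis=1, thresh=len(data) - max(data.isnull().sum(axis=0)) + 1)  # how= removed: it must be a string such as 'any', and recent pandas versions do not allow it together with thresh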
- Convert the preprocessed dataset to the tensor format.
inputs, outputs = data.iloc[:, 0:-1], data.iloc[:, -1]
inputs = inputs.fillna(inputs.mean())
X, y = torch.tensor(inputs.values), torch.tensor(outputs.values)
Here is another answer. I only process the inputs, because we can't delete the outputs anyway.
c = inputs.isna().sum().idxmax()
del inputs[c]
data2 = data2.iloc[:, data2.isna().sum().values < data2.isna().sum().max()]
So I have a question regarding the input data preprocessing.
If I had two biological sequences instead of NumRooms and Alley (as input, and no missing values), how would I convert them to tensors?
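One common option for sequence inputs is one-hot encoding. The sketch below is only an illustration under several assumptions: the sequences are DNA strings over the alphabet ACGT, both have the same length, and the names ALPHABET, CHAR_TO_INDEX, and one_hot_encode are made up for this example. It builds a NumPy array, which could then be passed to torch.from_numpy.

```python
import numpy as np

# Assumed alphabet: DNA bases. Protein sequences would need a larger alphabet.
ALPHABET = "ACGT"
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def one_hot_encode(sequence):
    """Return a (len(sequence), len(ALPHABET)) float32 array of one-hot rows."""
    encoded = np.zeros((len(sequence), len(ALPHABET)), dtype=np.float32)
    for position, ch in enumerate(sequence):
        encoded[position, CHAR_TO_INDEX[ch]] = 1.0
    return encoded


# Two hypothetical sequences of equal length.
seq_a, seq_b = "ACGT", "GGCA"

# Stack into one array of shape (num_sequences, sequence_length, alphabet_size).
batch = np.stack([one_hot_encode(seq_a), one_hot_encode(seq_b)])
print(batch.shape)
```

Sequences of different lengths would need padding or truncation before stacking, and integer index encoding followed by an embedding layer is another option.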
data.drop(data.columns[data.isnull().sum(axis=0).argmax()], axis=1) # delete the column with largest number of missing values
data.drop(data.index[data.isnull().sum(axis=1).argmax()], axis=0) # delete the row with largest number of missing values
Both of the commands above delete only one column or one row, even when several columns or rows are tied for the largest number of NAs.
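To make the tie behaviour concrete, here is a small sketch on a made-up frame in which two columns share the largest NA count. It contrasts dropping only the first such column with dropping all tied columns; the frame and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical frame: columns A and B both have two missing values.
data = pd.DataFrame({
    "A": [None, None, 1.0],
    "B": [None, 2.0, None],
    "C": [1.0, 2.0, 3.0],
})

na_per_column = data.isnull().sum(axis=0)

# idxmax returns only the first label with the maximum count.
first_only = data.drop(columns=na_per_column.idxmax())

# Select every label whose count equals the maximum, then drop them all.
tied_columns = na_per_column[na_per_column == na_per_column.max()].index
all_tied = data.drop(columns=tied_columns)

print(list(first_only.columns))
print(list(all_tied.columns))
```

Which behaviour is wanted depends on how the exercise is read; the same pattern works for rows with axis=1 in the sum and index= in the drop.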
data.drop([pd.isnull(data).sum().idxmax()],axis=1)
In section 2.2.3, Conversion to the Tensor Format, the code uses the .values attribute, but I believe (at least according to the pandas documentation) that the .to_numpy() method is now preferred.
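A minimal sketch of the difference, assuming an already numeric DataFrame (the column names here are made up): both calls return the same data, but to_numpy also accepts options such as dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical, fully numeric inputs.
inputs = pd.DataFrame({
    "NumRooms": [3.0, 2.0, 4.0, 3.0],
    "Alley_Pave": [1.0, 0.0, 0.0, 0.0],
})

via_values = inputs.values                         # older attribute
via_to_numpy = inputs.to_numpy(dtype=np.float32)   # method with an explicit dtype

print(via_values.dtype, via_to_numpy.dtype)
```

Either array can be passed to torch.tensor; choosing float32 up front avoids a later cast if the model expects that dtype.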
The author should consider updating the code to use the pathlib API:
import pathlib

dir_out = pathlib.Path.cwd() / 'data'
dir_out.mkdir(parents=True, exist_ok=True)
file_new = dir_out / 'tiny.csv'
list_rows = [
    'NumRooms,Alley,Price',  # Column names / Header
    'NA,Pave,127500',  # Each row represents a data example
    '2,NA,106000',
    '4,NA,178100',
    'NA,NA,140000',
]
with file_new.open(mode="w") as file:
    for row in list_rows:
        file.write(row + '\n')
If the author wants to suggest pandas, then they should make more use of the pandas API:
inputs = data[['NumRooms', 'Alley']] # dataframe
outputs = data['Price'] # series
Also, calling mean on a whole DataFrame triggers a FutureWarning unless you specify that the operation applies only to numeric data:
inputs.mean(numeric_only=True)
It doesn’t affect these examples, but readers should be aware of this.
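As a small sketch of how this fits into the imputation step, assuming a made-up frame with one numeric and one string column: numeric_only=True restricts the mean to the numeric column, and fillna with the resulting Series fills only the columns it names.

```python
import pandas as pd

# Hypothetical inputs with a numeric column and a string column.
inputs = pd.DataFrame({
    "NumRooms": [None, 2.0, 4.0, None],
    "Alley": ["Pave", None, None, None],
})

# Mean of numeric columns only; Alley is skipped.
numeric_means = inputs.mean(numeric_only=True)

# fillna with a Series fills each column named in its index and leaves the rest alone.
filled = inputs.fillna(numeric_means)
print(filled)
```

The string column still contains missing values afterwards, so it needs separate handling before conversion to a tensor.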
You can select a column directly as a Series and then convert it to a NumPy array.
# convert column to tensor
array = data[column_name].to_numpy()
tensor = torch.tensor(array)
Assuming we only want to drop input columns:
data = pd.read_csv(data_file)
inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]
nas = inputs.isna().astype(int)
column_index = nas.sum(axis = 0).argmax()
inputs = inputs.drop(inputs.columns[column_index], axis=1)
inputs