Data Preprocessing

StevenJokes · June 10, 2020, 12:26pm

I spent an hour searching for related material and trying to understand the pandas functions.

import pandas as pd

data_file = '../data/results11.csv'
data = pd.read_csv(data_file)
print(data.head())
# Largest number of NaN values found in any single column
m = max(data.isnull().sum(axis=0))
# Drop every column that has fewer than len(data)+1-m non-NA values
data_dropmaxnan = data.dropna(axis=1, thresh=len(data) + 1 - m)

thresh: With axis=1, keep only the columns that have at least len(data)+1-m non-NA values.
The column with the most NaN values has len(data)-m non-NA values.
We don’t want to keep it, so we add 1.
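To make the thresh arithmetic concrete, here is a minimal sketch on a toy table with the same rows as the tiny.csv example posted later in this thread. The inline CSV and the variable names are only illustrative stand-ins for the real file.

```python
import io

import pandas as pd

# Toy stand-in for the real CSV file; "NA" is parsed as NaN by default.
csv_text = """NumRooms,Alley,Price
NA,Pave,127500
2,NA,106000
4,NA,178100
NA,NA,140000
"""
data = pd.read_csv(io.StringIO(csv_text))

# NaN count per column: NumRooms has 2, Alley has 3, Price has 0.
nan_per_column = data.isnull().sum(axis=0)
m = int(nan_per_column.max())

# The worst column has len(data) - m non-NA values, so require one more.
thresh = len(data) + 1 - m
kept = data.dropna(axis=1, thresh=thresh)
print(kept.columns.tolist())
```

Here Alley has 3 NaN values out of 4 rows, so thresh becomes 2, and Alley, with only 1 non-NA value, is the only column dropped.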

For more:

dropna:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html?highlight=dropna#pandas.DataFrame.dropna

How could I work this out by myself?
Writing a loop over a dataframe’s columns is hard…
How do I read the source code of pandas functions?
For example: dataframe.isnull().sum
I want to figure out how to loop over a dataframe’s columns.
I also heard about the ‘apply’ function from https://towardsdatascience.com/how-to-make-your-pandas-loop-71-803-times-faster-805030df4f06.
Maybe I will try it next time.
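For anyone with the same questions, here is a rough sketch of three ways to go through a dataframe's columns and count missing values, plus one way to print the source of a pandas method with the standard inspect module. The toy table and all variable names are assumptions for illustration, not part of the original exercise.

```python
import inspect
import io

import pandas as pd

# Toy stand-in for the real CSV file; "NA" is parsed as NaN by default.
csv_text = """NumRooms,Alley,Price
NA,Pave,127500
2,NA,106000
4,NA,178100
NA,NA,140000
"""
data = pd.read_csv(io.StringIO(csv_text))

# Option 1: loop over column names and count missing entries by hand.
nan_counts = {}
for name in data.columns:
    nan_counts[name] = sum(1 for value in data[name] if pd.isnull(value))

# Option 2: DataFrame.items() yields (column name, Series) pairs.
nan_counts_items = {
    name: int(column.isnull().sum()) for name, column in data.items()
}

# Option 3: apply() calls the function once per column (axis=0 is the default).
nan_counts_apply = data.apply(lambda column: column.isnull().sum())

# Vectorized one-liner for comparison.
nan_counts_builtin = data.isnull().sum()

# Print the Python source of a pandas method.
source = inspect.getsource(pd.DataFrame.isnull)
print(source)
```

The explicit loops are mainly useful for understanding; the vectorized data.isnull().sum() gives the same counts and is usually the better choice in practice.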

StevenJokes · June 10, 2020, 12:30pm

I’m a Chinese student, so please forgive any mistakes in my English.
New users can only add two links per post, so I am posting this one separately.
Check my answer to question 2:

github.com

StevenJokes/D2L_enread/blob/master/Chapter2/2-2.md




goldpiggy · June 10, 2020, 4:25pm

Hi @StevenJokes, I am confused by your question. Could you use three sentences to describe it?

StevenJokes · June 10, 2020, 4:42pm

0. I have already done it with PyTorch’s API.
1. What is the best way to read PyTorch’s source code? Please give me some tips.
2. How do I loop over a dataframe’s columns? I’m trying to use a loop to calculate data.isnull().sum().

goldpiggy · June 11, 2020, 5:10pm

Hi @StevenJokes,

1. What is the best way to read PyTorch’s source code? Please give me some tips.

Here are some official API documents that may be helpful.
https://pytorch.org/tutorials/beginner/ptcheat.html
https://pytorch.org/docs/stable/index.html#

2. How do I loop over a dataframe’s columns? I’m trying to use a loop to calculate data.isnull().sum().

There is a vast number of pandas tutorials; you can just search online. Here is the official guide.
https://pandas.pydata.org/docs/user_guide/index.html#user-guide

StevenJokes · June 11, 2020, 6:07pm

Thanks.
I will read them later.
Now I have an issue with d2l/torch.py.
I renamed it to imtorch.py to avoid a name clash with the torch package.

$ /usr/bin/env python "d:\onedrive\文档\read\d2l\d2l\imtorch.py"
Traceback (most recent call last):
  File "d:\onedrive\文档\read\d2l\d2l\imtorch.py", line 22, in <module>
    import torch
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\__init__.py", line 81, in <module>
    ctypes.CDLL(dll)
  File "C:\ProgramData\Anaconda3\lib\ctypes\__init__.py", line 364, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: [WinError 126] The specified module could not be found

my conda info

     active environment : pytorch
    active env location : C:\Users\a8679\.conda\envs\pytorch
            shell level : 1
       user config file : C:\Users\a8679\.condarc
 populated config files : C:\Users\a8679\.condarc
          conda version : 4.8.3
    conda-build version : 3.18.9
         python version : 3.7.4.final.0
       virtual packages :
       base environment : C:\ProgramData\Anaconda3  (read only)        
           channel URLs : https://conda.anaconda.org/conda-forge/win-64
                          https://conda.anaconda.org/conda-forge/noarch
                          https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/menpo/win-64
                          https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/menpo/noarch
                          https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda/win-64
                          https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda/noarch
                          https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/msys2/win-64
                          https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/msys2/noarch
                          https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/win-64
                          https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/noarch
                          https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud//pytorch/win-64
                          https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud//pytorch/noarch
                          https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/win-64
                          https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/noarch
                          https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/win-64
                          https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/noarch
                          https://repo.anaconda.com/pkgs/main/win-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/win-64
                          https://repo.anaconda.com/pkgs/r/noarch
                          https://repo.anaconda.com/pkgs/msys2/win-64
                          https://repo.anaconda.com/pkgs/msys2/noarch
          package cache : C:\ProgramData\Anaconda3\pkgs
                          C:\Users\a8679\.conda\pkgs
                          C:\Users\a8679\AppData\Local\conda\conda\pkgs
       envs directories : C:\Users\a8679\.conda\envs
                          C:\ProgramData\Anaconda3\envs
                          C:\Users\a8679\AppData\Local\conda\conda\envs
               platform : win-64
             user-agent : conda/4.8.3 requests/2.22.0 CPython/3.7.4 Windows/10 Windows/10.0.18362
          administrator : False
             netrc file : None
           offline mode : False

Then I tried “conda install python” and “conda update --all”.
I’m still waiting for the update to finish.
I hope everything goes well.

StevenJokes · June 12, 2020, 3:54am

After running pip install -U d2l -f https://d2l.ai/whl.html,
I can directly run from d2l import torch as d2l.

Thanks.
But I’m still confused about the error I get when I directly run imtorch.py (renamed from d2l/torch.py):

$ /usr/bin/env python "d:\onedrive\文档\read\d2l\d2l\imtorch.py"
Traceback (most recent call last):
  File "d:\onedrive\文档\read\d2l\d2l\imtorch.py", line 22, in <module>
    import torch
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\__init__.py", line 81, in <module>
    ctypes.CDLL(dll)
  File "C:\ProgramData\Anaconda3\lib\ctypes\__init__.py", line 364, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: [WinError 126] The specified module could not be found

anirudh · June 12, 2020, 1:56pm

Please refer to my reply to your question 3 here.

Gkkkkkkkkk · September 7, 2020, 12:00pm

import pandas as pd
data = pd.read_csv(data_file)
print(data)
Thresh = max(data.isnull().sum(axis=0))
print(Thresh)
pro_data = data.dropna(axis=1, thresh=data.shape[0] - Thresh + 1)
print(pro_data)

PS: If you want to delete the ROW with the most missing values, make the following changes:
Thresh = max(data.isnull().sum(axis=1))
pro_data = data.dropna(axis=0, thresh=data.shape[1] - Thresh + 1)
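As a quick check of the row variant, here is a small sketch on a toy table with the same rows as the tiny.csv example later in this thread. The inline CSV and the lower-case variable names are my own assumptions.

```python
import io

import pandas as pd

# Toy stand-in for the real CSV file; "NA" is parsed as NaN by default.
csv_text = """NumRooms,Alley,Price
NA,Pave,127500
2,NA,106000
4,NA,178100
NA,NA,140000
"""
data = pd.read_csv(io.StringIO(csv_text))

# NaN count per row: rows 0, 1 and 2 have one NaN each, row 3 has two.
nan_per_row = data.isnull().sum(axis=1)
row_thresh = data.shape[1] - int(nan_per_row.max()) + 1

# Keep only rows with at least row_thresh non-NA values.
pro_data = data.dropna(axis=0, thresh=row_thresh)
print(pro_data)
```

With 3 columns and at most 2 NaN values in a row, the threshold is 2, so only the last row (1 non-NA value) is removed.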

Alvin · October 27, 2020, 10:18am

Thanks for your answers which resolved my questions.

Anatoly · December 7, 2020, 8:19pm

result_data = data.dropna(axis=1, thresh=min(data.count(axis=0))+1)
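This one-liner should give the same threshold as the earlier answers, because the smallest non-NA count equals the number of rows minus the largest NaN count. A small sketch to check that on a toy table (the inline CSV and the variable names are assumptions for illustration):

```python
import io

import pandas as pd

# Toy stand-in for the real CSV file; "NA" is parsed as NaN by default.
csv_text = """NumRooms,Alley,Price
NA,Pave,127500
2,NA,106000
4,NA,178100
NA,NA,140000
"""
data = pd.read_csv(io.StringIO(csv_text))

# count() returns the number of non-NA values per column.
thresh_from_count = int(data.count(axis=0).min()) + 1

# The earlier formulation, based on the NaN counts.
thresh_from_isnull = len(data) - int(data.isnull().sum(axis=0).max()) + 1

result_data = data.dropna(axis=1, thresh=thresh_from_count)
print(thresh_from_count, thresh_from_isnull, result_data.columns.tolist())
```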

ufs · February 4, 2021, 9:13am

Delete the column with the most missing values.
data = data.dropna(axis=1, thresh=len(data) - max(data.isnull().sum(axis=0)) + 1)
Convert the preprocessed dataset to the tensor format.
inputs, outputs = data.iloc[:, 0:-1], data.iloc[:, -1]
inputs = inputs.fillna(inputs.mean())
X, y = torch.tensor(inputs.values), torch.tensor(outputs.values)

anyinlover · April 3, 2021, 12:31pm

Here is another answer. I only deal with the inputs, because we can’t delete the outputs anyway.

c = inputs.isna().sum().idxmax()
del inputs[c]

11110 · April 22, 2021, 8:47am

data2 = data2.iloc[:, data2.isna().sum().values < data2.isna().sum().max()]

abdnahid · April 25, 2021, 10:58am

So I have a question regarding the input data preprocessing.

If I had two biological sequences instead of NumRooms and Alley (as input, and no missing values), how would I convert them to tensors?

jioyoung · May 9, 2021, 10:04pm

data.drop(data.columns[data.isnull().sum(axis=0).argmax()], axis=1) # delete the column with largest number of missing values

data.drop(data.index[data.isnull().sum(axis=1).argmax()], axis=0) # delete the row with largest number of missing values

Both of the commands above delete only one column or one row, even when several columns or rows are tied for the largest number of NAs.
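To illustrate the tie case, here is a minimal sketch with a made-up dataframe in which two columns share the largest NaN count. It compares dropping a single column via idxmax with a boolean mask, similar to the mask posted earlier in this thread, that drops every tied column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: columns 'a' and 'b' both have two NaN values.
df = pd.DataFrame({
    'a': [np.nan, np.nan, 1.0],
    'b': [np.nan, 2.0, np.nan],
    'c': [1.0, 2.0, 3.0],
})
nan_counts = df.isnull().sum()

# idxmax returns only the first label with the maximum, so 'b' survives.
one_dropped = df.drop(columns=nan_counts.idxmax())

# Keeping columns strictly below the maximum removes all tied columns.
all_dropped = df.loc[:, nan_counts < nan_counts.max()]

print(one_dropped.columns.tolist(), all_dropped.columns.tolist())
```

Which behaviour is wanted depends on how the exercise is read; the mask version also drops every column when all columns have the same NaN count.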

CE_I · July 18, 2021, 11:35am

data.drop([pd.isnull(data).sum().idxmax()],axis=1)

Dan_Wallace · September 15, 2021, 7:05am

In section 2.2.3, Conversion to the Tensor Format, the code uses the .values attribute, but I believe (at least according to the pandas documentation) that the .to_numpy() method is now preferred.
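A small sketch of the difference, using a made-up, already-numeric inputs dataframe (the column names and values are assumptions). Both calls return a NumPy array, which could then be passed to torch.tensor as in the section; to_numpy additionally lets you request a dtype explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical preprocessed inputs: every column is already numeric.
inputs = pd.DataFrame({
    'NumRooms': [3.0, 2.0, 4.0, 3.0],
    'Alley_Pave': [1, 0, 0, 0],
})

# Older style: the .values attribute.
arr_values = inputs.values

# Preferred style: the to_numpy() method, here with an explicit dtype.
arr_numpy = inputs.to_numpy(dtype=float)

print(arr_numpy.dtype, np.array_equal(arr_values, arr_numpy))
```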

dhern023 · October 24, 2021, 3:03pm

The author should consider updating the code to use the pathlib API:

import pathlib

dir_out = pathlib.Path.cwd() / 'data'
dir_out.mkdir(parents=True, exist_ok=True)
file_new = dir_out / 'tiny.csv'

list_rows = [
    'NumRooms,Alley,Price',  # Column names / Header
    'NA,Pave,127500',  # Each row represents a data example
    '2,NA,106000',
    '4,NA,178100',
    'NA,NA,140000',
    ]

with file_new.open(mode="w") as file:
    for row in list_rows:
        file.write(row + '\n')

dhern023 · October 24, 2021, 3:23pm

If the author wants to suggest pandas, then they should use more of the pandas API:

inputs = data[['NumRooms', 'Alley']] # dataframe
outputs = data['Price'] # series

Also, calling mean on a whole dataframe will raise a FutureWarning unless you specify that the operation applies to numeric data only:

inputs.mean(numeric_only=True)

It doesn’t affect these examples, but readers should be aware of this.
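For completeness, a minimal sketch of how the two suggestions fit together on a toy table with the same rows as tiny.csv above (the inline CSV is an assumed stand-in for the file). I believe that in pandas 2.0 and later, calling mean without numeric_only on a dataframe with a string column raises an error rather than a warning, so passing it explicitly seems the safer habit.

```python
import io

import pandas as pd

# Toy stand-in for the real CSV file; "NA" is parsed as NaN by default.
csv_text = """NumRooms,Alley,Price
NA,Pave,127500
2,NA,106000
4,NA,178100
NA,NA,140000
"""
data = pd.read_csv(io.StringIO(csv_text))

inputs = data[['NumRooms', 'Alley']]  # dataframe
outputs = data['Price']               # series

# Only numeric columns get a mean, so only NumRooms is filled (with 3.0).
column_means = inputs.mean(numeric_only=True)
filled = inputs.fillna(column_means)
print(filled)
```

The Alley column is left untouched here because it has no numeric mean; it still needs separate handling, for example an encoding step.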