Data Preprocessing

After running pip install -U d2l -f https://d2l.ai/whl.html,
I can directly run from d2l import torch as d2l :relaxed:

Thanks :shushing_face:
But I'm confused about the bug I get when I directly run imtorch.py (renamed from d2l/torch.py):

$ /usr/bin/env python "d:\onedrive\文档\read\d2l\d2l\imtorch.py"
Traceback (most recent call last):
  File "d:\onedrive\文档\read\d2l\d2l\imtorch.py", line 22, in <module>
    import torch
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\__init__.py", line 81, in <module>
    ctypes.CDLL(dll)
  File "C:\ProgramData\Anaconda3\lib\ctypes\__init__.py", line 364, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: [WinError 126] The specified module could not be found

Please refer to my reply to your question 3 here.

import pandas as pd

data = pd.read_csv(data_file)
print(data)
# Largest number of missing values in any single column
thresh = max(data.isnull().sum(axis=0))
print(thresh)
# dropna keeps columns that have at least `thresh` non-NA values,
# so this drops exactly the column(s) with the most missing values
pro_data = data.dropna(axis=1, thresh=data.shape[0] - thresh + 1)
print(pro_data)

PS: If you want to delete the ROW with the most missing values, change these lines:

thresh = max(data.isnull().sum(axis=1))
pro_data = data.dropna(axis=0, thresh=data.shape[1] - thresh + 1)


Thanks for your answers, which resolved my questions. :fu:t2: :fu:t2:

result_data = data.dropna(axis=1, thresh=min(data.count(axis=0)) + 1)  # keep only columns with more non-NA values than the emptiest column

  • Delete the column with the most missing values.
    data = data.dropna(axis=1, thresh=len(data) - max(data.isnull().sum(axis=0)) + 1)
  • Convert the preprocessed dataset to the tensor format.
    inputs, outputs = data.iloc[:, 0:-1], data.iloc[:, -1]
    inputs = inputs.fillna(inputs.mean())
    X, y = torch.tensor(inputs.values), torch.tensor(outputs.values)

Here is another answer. I only deal with the inputs, because we can't delete the outputs anyway.

# Find the input column with the most NAs and delete it
c = inputs.isna().sum().idxmax()
del inputs[c]

# Keep only the columns whose NA count is strictly below the maximum
data2 = data2.iloc[:, data2.isna().sum().values < data2.isna().sum().max()]

So I have a question regarding the input data preprocessing.

If I had two biological sequences instead of NumRooms and Alley (as input, and no missing values), how would I convert them to tensors?
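One possibility is to one-hot encode each sequence position. A minimal sketch, assuming two fixed-length DNA columns (the frame, column names, and vocabulary below are hypothetical):

import pandas as pd
import torch
import torch.nn.functional as F

# Hypothetical frame with two fixed-length DNA sequences as inputs
df = pd.DataFrame({'seq1': ['ACGT', 'TTAC'], 'seq2': ['GGCA', 'ACCA']})
vocab = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def seq_to_tensor(seq):
    # Integer-encode each base, then one-hot encode the whole sequence
    idx = torch.tensor([vocab[ch] for ch in seq])
    return F.one_hot(idx, num_classes=len(vocab))

X1 = torch.stack([seq_to_tensor(s) for s in df['seq1']])  # shape (2, 4, 4)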

data.drop(data.columns[data.isnull().sum(axis=0).argmax()], axis=1) # delete the column with largest number of missing values

data.drop(data.index[data.isnull().sum(axis=1).argmax()], axis=0) # delete the row with largest number of missing values

Note that both of the commands above delete only a single column or row, even when several columns or rows tie for the largest number of NAs. A sketch for dropping every tied column follows below.
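To drop all columns that tie for the most missing values (a sketch, reusing data from above):

na_counts = data.isnull().sum(axis=0)
data = data.drop(columns=na_counts[na_counts == na_counts.max()].index)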

data.drop([pd.isnull(data).sum().idxmax()], axis=1)

In Section 2.2.3, Conversion to the Tensor Format, the code uses the .values attribute, but I believe (at least according to the pandas documentation) that the .to_numpy() method is now preferred.
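For example (a sketch with a stand-in frame; the section's real inputs would work the same way):

import pandas as pd
import torch

inputs = pd.DataFrame({'NumRooms': [3.0, 2.0, 4.0]})  # stand-in for the section's frame
X = torch.tensor(inputs.to_numpy(dtype=float))  # preferred over inputs.values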

The author should consider updating the code to use the pathlib API:

import pathlib

dir_out = pathlib.Path.cwd() / 'data'
dir_out.mkdir(parents=True, exist_ok=True)
file_new = dir_out / 'tiny.csv'

list_rows = [
    'NumRooms,Alley,Price',  # Column names / Header
    'NA,Pave,127500',  # Each row represents a data example
    '2,NA,106000',
    '4,NA,178100',
    'NA,NA,140000',
    ]

with file_new.open(mode="w") as file:
    for row in list_rows:
        file.write(row + '\n')

If the author wants to suggest pandas, then they should invoke more of the pandas API:

inputs = data[['NumRooms', 'Alley']] # dataframe
outputs = data['Price'] # series

Also, calling mean on a whole dataframe will raise a FutureWarning unless you specify that the operation is on numeric data only:

inputs.mean(numeric_only=True)

It doesn’t affect these examples, but readers should be aware of this.
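In the section's imputation step, that looks like this (reusing the inputs frame defined above):

inputs = inputs.fillna(inputs.mean(numeric_only=True))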

You can also pull a single column out as a Series, convert it to a NumPy array, and build the tensor from that:

# convert one (numeric) column to a tensor; column_name is a placeholder
array = data[column_name].to_numpy()
tensor = torch.tensor(array)

Assuming we only want to drop input columns:

data = pd.read_csv(data_file)
inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]
# Count NAs per input column and locate the worst offender
nas = inputs.isna().astype(int)
column_index = nas.sum(axis=0).argmax()
inputs = inputs.drop(inputs.columns[column_index], axis=1)
inputs

Two-line version:

missing_most_column_index = data.count().argmin()  # count() ignores NAs, so argmin finds the column with the most
data.drop(columns=data.columns[missing_most_column_index])

1. Try loading datasets, e.g., Abalone from the UCI Machine Learning Repository and inspect their properties. What fraction of them has missing values? What fraction of the variables is numerical, categorical, or text?

abalone_data = pd.read_csv("../data/chap_2/abalone.data", 
                           names = [
                               "sex", "length", "diameter", "height", 
                               "whole_weight", "shucked_weight",
                               "viscera_weight", "shell_weight",
                               "rings"
                           ]
                          )

abalone_data.describe(include="all")

|        | sex  | length      | diameter    | height      | whole_weight | shucked_weight | viscera_weight | shell_weight | rings       |
|--------|------|-------------|-------------|-------------|--------------|----------------|----------------|--------------|-------------|
| count  | 4177 | 4177.000000 | 4177.000000 | 4177.000000 | 4177.000000  | 4177.000000    | 4177.000000    | 4177.000000  | 4177.000000 |
| unique | 3    | NaN         | NaN         | NaN         | NaN          | NaN            | NaN            | NaN          | NaN         |
| top    | M    | NaN         | NaN         | NaN         | NaN          | NaN            | NaN            | NaN          | NaN         |
| freq   | 1528 | NaN         | NaN         | NaN         | NaN          | NaN            | NaN            | NaN          | NaN         |
| mean   | NaN  | 0.523992    | 0.407881    | 0.139516    | 0.828742     | 0.359367       | 0.180594       | 0.238831     | 9.933684    |
| std    | NaN  | 0.120093    | 0.099240    | 0.041827    | 0.490389     | 0.221963       | 0.109614       | 0.139203     | 3.224169    |
| min    | NaN  | 0.075000    | 0.055000    | 0.000000    | 0.002000     | 0.001000       | 0.000500       | 0.001500     | 1.000000    |
| 25%    | NaN  | 0.450000    | 0.350000    | 0.115000    | 0.441500     | 0.186000       | 0.093500       | 0.130000     | 8.000000    |
| 50%    | NaN  | 0.545000    | 0.425000    | 0.140000    | 0.799500     | 0.336000       | 0.171000       | 0.234000     | 9.000000    |
| 75%    | NaN  | 0.615000    | 0.480000    | 0.165000    | 1.153000     | 0.502000       | 0.253000       | 0.329000     | 11.000000   |
| max    | NaN  | 0.815000    | 0.650000    | 1.130000    | 2.825500     | 1.488000       | 0.760000       | 1.005000     | 29.000000   |
  • There are no items with missing values (verified by the check below).
  • 8 of the 9 attributes are numerical; sex is categorical (object dtype).
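A quick way to compute the fractions the exercise asks about (reusing abalone_data from above):

# Fraction of examples with at least one missing value
print(abalone_data.isna().any(axis=1).mean())

# Fraction of numerical columns; the remainder are categorical/text
n_numeric = abalone_data.select_dtypes(include='number').shape[1]
print(n_numeric / abalone_data.shape[1])  # 8/9 numerical, 1/9 categorical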

2. Try out indexing and selecting data columns by name rather than by column number. The pandas documentation on indexing has further details on how to do this.

abalone_data[["sex", "rings", "length"]][:20]
|    | sex | rings | length |
|----|-----|-------|--------|
| 0  | M   | 15    | 0.455  |
| 1  | M   | 7     | 0.350  |
| 2  | F   | 9     | 0.530  |
| 3  | M   | 10    | 0.440  |
| 4  | I   | 7     | 0.330  |
| 5  | I   | 8     | 0.425  |
| 6  | F   | 20    | 0.530  |
| 7  | F   | 16    | 0.545  |
| 8  | M   | 9     | 0.475  |
| 9  | F   | 19    | 0.550  |
| 10 | F   | 14    | 0.525  |
| 11 | M   | 10    | 0.430  |
| 12 | M   | 11    | 0.490  |
| 13 | F   | 10    | 0.535  |
| 14 | F   | 10    | 0.470  |
| 15 | M   | 12    | 0.500  |
| 16 | I   | 7     | 0.355  |
| 17 | F   | 10    | 0.440  |
| 18 | M   | 7     | 0.365  |
| 19 | M   | 9     | 0.450  |

3. How large a dataset do you think you could load this way? What might be the limitations? Hint: consider the time to read the data, representation, processing, and memory footprint. Try this out on your laptop. What changes if you try it out on a server?

  1. How large?
    • It depends on how much RAM your system has; mine starts struggling at around 800,000 records of text data. See the footprint sketch below.
  2. What changes on a server?
    • If you were using an HDD on your machine and an SSD on your cloud machine instance/server, you might notice better load times.
    • Or you might see significantly worse performance if you're on a free tier that has barely more RAM than your machine and uses an HDD to boot :stuck_out_tongue:
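One way to put a number on the memory footprint (a sketch with synthetic data, since actual sizes depend on the file):

import numpy as np
import pandas as pd

# A million-row stand-in frame; deep=True also counts the Python string payloads
df = pd.DataFrame({'x': np.random.rand(1_000_000),
                   'label': np.random.choice(['a', 'b'], 1_000_000)})
print(df.memory_usage(deep=True).sum() / 1e6, 'MB')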

4. How would you deal with data that has a very large number of categories? What if the category labels are all unique? Should you include the latter?

  • If there are too many categories, try to manually find categories that are similar and group them as one. If they are all far too different from each other, you are most likely out of luck, or you can take the information hit and still merge categories to the extent possible (see the sketch after this list).

  • If the categories are all unique, meaning the number of categories equals the number of samples, just drop the column: it carries no useful information, exactly like a column that holds only one value. Whether the values are all different (all unique) or all the same, they do not vary with the rest of the attributes, so there is no pattern to be found.
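A common way to do the grouping is to collapse rare categories into a shared bucket (a sketch; the threshold of 2 and the 'other' label are arbitrary choices):

import pandas as pd

s = pd.Series(['red', 'blue', 'red', 'teal', 'mauve', 'blue'])
counts = s.value_counts()
# Collapse categories seen fewer than 2 times into a shared 'other' bucket
s_grouped = s.where(s.map(counts) >= 2, 'other')
print(s_grouped.tolist())  # ['red', 'blue', 'red', 'other', 'other', 'blue']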

5. What alternatives to pandas can you think of? How about loading NumPy tensors from a file? Check out Pillow, the Python Imaging Library.

Just one word: dask :smile:
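And since the exercise also mentions loading NumPy tensors from a file, a minimal sketch of NumPy's own binary format:

import numpy as np

arr = np.arange(12).reshape(3, 4)
np.save('arr.npy', arr)      # binary .npy format, no CSV parsing involved
loaded = np.load('arr.npy')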

Create a raw dataset with more rows and columns:

import numpy as np

a = [str(x) + ',NA' for x in list(np.random.randint(0, 4, 1000))]
b = [str(y) for y in list(np.random.randint(0, 178000, 1000))]
z = [x + ',' + y + '\n' for (x, y) in zip(a, b)]
with open(data_file, 'w') as f:
    f.write('NumRooms,Alley,Price\n')  # Column names
    for x in z:
        f.write(x)

  1. Delete the column with the most missing values.
    d_ict = dict(data.isnull().sum())
    max_value = max(d_ict, key=d_ict.get)
    data = data.drop(max_value, axis=1)  # assign the result back, or the drop has no effect

  2. Convert the preprocessed dataset to the tensor format.
    inputs, outputs = data.loc[:, ['NumRooms']], data.loc[:, 'Price']  # keep the target out of the inputs
    X, y = torch.tensor(inputs.values), torch.tensor(outputs.values)
    X, y

column_names = ["sex", "length", "diameter", "height", "whole weight", "shucked weight", "viscera weight", "shell weight", "rings"]
df = pd.read_csv('abalone.data', names=column_names)
print("Number of samples: %d" % len(df))

df.isna().sum()  # Missing values

# Categorical and numerical types
df_numerical = df.select_dtypes(exclude='object')
df_categorical = df.select_dtypes(include='object')

df_numerical_cols = df_numerical.columns.tolist()
df_categorical_cols = df_categorical.columns.tolist()

Indexing can be done positionally:
df.iloc[:,2:][:20]
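Since the exercise asks for selection by name rather than by column number, the label-based equivalent (using the column names above) is:

df.loc[:, 'diameter':][:20]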