https://d2l.ai/chapter_linear-regression/synthetic-regression-data.html
My thoughts on the exercises:
ex.1
I ran ?torch.utils.data.DataLoader,
found the parameter “drop_last”,
and set it to True in the definition of d2l.DataModule.get_tensorloader(), like this:
@d2l.add_to_class(d2l.DataModule)  #@save
def get_tensorloader(self, tensors, train, indices=slice(0, None)):
    tensors = tuple(a[indices] for a in tensors)
    dataset = torch.utils.data.TensorDataset(*tensors)
    return torch.utils.data.DataLoader(dataset, self.batch_size,
                                       shuffle=train, drop_last=True)

@d2l.add_to_class(SyntheticRegressionData)  #@save
def get_dataloader(self, train):
    i = slice(0, self.num_train) if train else slice(self.num_train, None)
    return self.get_tensorloader((self.X, self.y), train, i)
and test with
len(data.train_dataloader())
the result changed from 32 to 31
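The same effect can be checked without d2l. This is only a minimal sketch: the sizes (1000 examples, batch size 32) are my assumption, chosen because they would reproduce the 32 vs. 31 batch counts reported above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Assumed sizes: 1000 examples with batch size 32, so 1000 % 32 == 8 leftovers.
X = torch.randn(1000, 2)
y = torch.randn(1000, 1)
dataset = TensorDataset(X, y)

keep_last = DataLoader(dataset, batch_size=32, shuffle=True)
drop_last = DataLoader(dataset, batch_size=32, shuffle=True, drop_last=True)

num_batches_keep = len(keep_last)  # ceil(1000 / 32): includes the short batch
num_batches_drop = len(drop_last)  # floor(1000 / 32): short batch is skipped
batch_sizes_drop = {xb.shape[0] for xb, _ in drop_last}
print(num_batches_keep, num_batches_drop, batch_sizes_drop)
```

With drop_last=True every batch has the full batch size, at the cost of not seeing the 8 leftover examples in that epoch.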
ex.2
This one is too much for me now.
ex.3
class SyntheticRegressionData_onTheFly(d2l.HyperParameters):
    def __init__(self, w, b, noise=0.01, batch_size=8):
        self.save_hyperparameters()
        self.w = self.w.reshape((-1, 1))

    def get_dataloader(self):
        # Generate a fresh batch on every call instead of storing a dataset.
        X = torch.randn(self.batch_size, len(self.w))
        noise_tmp = torch.randn(self.batch_size, 1) * self.noise
        y = torch.matmul(X, self.w) + self.b + noise_tmp
        return X, y

test = SyntheticRegressionData_onTheFly(w=torch.tensor([1., -2.]), b=3.)
# Note: get_dataloader() is called twice here, so the printed X and y
# come from two different batches.
print(test.get_dataloader()[0],'\n',test.get_dataloader()[1])
result:
tensor([[ 0.7405, -0.8744],
[-1.6136, 0.6811],
[ 0.3348, -1.2086],
[-0.6661, 0.9301],
[ 0.8505, -0.2203],
[ 0.9009, -0.3271],
[ 0.7607, -0.2932],
[ 0.1139, -0.7248]])
tensor([[ 1.2106],
[ 1.9595],
[ 2.4213],
[-0.6816],
[ 5.0581],
[ 5.3575],
[ 0.1735],
[-1.9671]])
ex.4
Set num_train to any number and num_val = batch_size, so that the validation loader returns the same batch each time:
data = SyntheticRegressionData(w=torch.tensor([2, -3.4]), b=4.2, num_train=1, num_val=8, batch_size=8)
X, y = next(iter(data.val_dataloader()))
print(X)
print(y)
X_, y_ = next(iter(data.val_dataloader()))
print(X_)
print(y_)
ex.4
Add “torch.manual_seed(2)” before
“class SyntheticRegressionData(d2l.DataModule): #@save”
Thanks a lot to @DReidiano, I added “torch.manual_seed()” for ex.4,
and the new code is:
class SyntheticRegressionData_onTheFly(d2l.HyperParameters):
    def __init__(self, w, b, noise=0.01, batch_size=8):
        self.save_hyperparameters()
        self.w = self.w.reshape((-1, 1))

    def get_dataloader(self, seed):
        torch.manual_seed(seed)
        X = torch.randn(self.batch_size, len(self.w))
        noise_tmp = torch.randn(self.batch_size, 1) * self.noise
        y = torch.matmul(X, self.w) + self.b + noise_tmp
        return X, y

data = SyntheticRegressionData_onTheFly(w=torch.tensor([2, -3.4]), b=4.2, batch_size=8)
print()
for i in range(2):
    X, y = data.get_dataloader(seed=1)
    print(torch.cat((X, y), 1))
The result is:
tensor([[-1.5256, -0.7502, 3.6893],
[-0.6540, -1.6095, 8.3587],
[-0.1002, -0.6092, 6.0620],
[-0.9798, -1.6091, 7.7108],
[-0.7121, 0.3037, 1.7411],
[-0.7773, -0.2515, 3.4907],
[-0.2223, 1.6871, -1.9765],
[ 0.2284, 0.4676, 3.0696]])
tensor([[-1.5256, -0.7502, 3.6893],
[-0.6540, -1.6095, 8.3587],
[-0.1002, -0.6092, 6.0620],
[-0.9798, -1.6091, 7.7108],
[-0.7121, 0.3037, 1.7411],
[-0.7773, -0.2515, 3.4907],
[-0.2223, 1.6871, -1.9765],
[ 0.2284, 0.4676, 3.0696]])
On the page, in 3.32:
X, y = next(iter(data.get_dataloader()))
But I think it is missing the parameter ‘train’.
My suggestion:
X, y = next(iter(data.get_dataloader(train=True)))
or
def get_dataloader(self, train=True)
I think it should be fixed.
Why is the data loader in 3.3.3 considered more efficient? self.X and self.y are in memory, just as with the previous data loader.
Can anybody clarify what is needed in Exercise 3.3.5.2.2? You cannot shuffle a dataset stored on a disk unless you open the file. When the file is in memory you can use torch.utils.data.DataLoader with shuffle=True.
I do not know how this can be, but the get_dataloader that is added to the SyntheticRegressionData class in this chapter is not the method that is called when we run X, y = next(iter(data.train_dataloader())).
The train_dataloader method calls get_dataloader, but the get_dataloader that runs is the one that is already in d2l.SyntheticRegressionData, and it calls get_tensorloader.
So when we try to add get_dataloader, the new definition does not replace the one that is already there.
You can check it by running d2l.SyntheticRegressionData?? after attempted addition.
When we call next(iter(data.train_dataloader())), it calls train_dataloader in SyntheticRegressionData, which is inherited from DataModule. This train_dataloader then calls the get_dataloader we defined, not the one from DataModule. When we add this method to SyntheticRegressionData, it replaces the original one.
You can try it by modifying the method, e.g. putting a print call in it, and calling it again.
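To see the lookup order in isolation, here is a small sketch in plain Python. The add_to_class below is a hypothetical stand-in written for this example (it is not copied from d2l), and the class names are made up.

```python
def add_to_class(cls):
    """Hypothetical stand-in for d2l.add_to_class: attach a function to cls."""
    def wrapper(func):
        setattr(cls, func.__name__, func)
        return func
    return wrapper


class DataModule:
    def get_dataloader(self, train):
        return "DataModule.get_dataloader"

    def train_dataloader(self):
        # self.get_dataloader is resolved on the instance's class at call time.
        return self.get_dataloader(train=True)


class SyntheticData(DataModule):
    pass


data = SyntheticData()
before = data.train_dataloader()  # falls back to the inherited method


@add_to_class(SyntheticData)
def get_dataloader(self, train):
    return "SyntheticData.get_dataloader"


after = data.train_dataloader()  # now finds the method added to the subclass
base_unchanged = DataModule().train_dataloader()
print(before, after, base_unchanged)
```

Only the class passed to the decorator is changed. If the notebook defines its own SyntheticRegressionData, that is presumably a different object from d2l.SyntheticRegressionData, which might explain why inspecting the library class still shows the old method.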
In 3.3.3, get_tensorloader: we can actually avoid creating a tuple for tensors, which leads to unnecessary memory allocation. We can just keep the generator expression, and the unpack operator (*tensors) will do the rest when we create the TensorDataset.
@d2l.torch.add_to_class(d2l.torch.DataModule)
def get_tensorloader(self, tensors, train, indices=slice(0, None)):
    tensors = (a[indices] for a in tensors)
    dataset = torch.utils.data.TensorDataset(*tensors)
    return torch.utils.data.DataLoader(dataset, self.batch_size, shuffle=train)
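A quick check of the generator version, as a sketch with made-up tensor sizes: both forms should build the same dataset, and as far as I can tell basic slicing returns views, so neither form copies the tensor data itself.

```python
import torch
from torch.utils.data import TensorDataset

# Made-up example tensors and slice.
X = torch.randn(10, 2)
y = torch.randn(10, 1)
indices = slice(0, 8)

from_tuple = TensorDataset(*tuple(a[indices] for a in (X, y)))
from_gen = TensorDataset(*(a[indices] for a in (X, y)))

same_length = len(from_tuple) == len(from_gen)
same_values = all(
    torch.equal(s, t) for s, t in zip(from_tuple.tensors, from_gen.tensors)
)
# A slice starting at 0 is a view that points at the same memory as X.
shares_storage = from_gen.tensors[0].data_ptr() == X.data_ptr()
print(same_length, same_values, shares_storage)
```

So the saving from dropping tuple(...) is probably limited to the small tuple object, not the tensors.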
- If num_examples % batch_size != 0, your final batch will be smaller than batch_size. To ignore this last batch and move on to the next epoch, one can set drop_last=True when instantiating their torch.utils.data.DataLoader.
- If we cannot hold all the data in memory, one has to lazily load/evaluate it on the fly in the dataset’s __getitem__ method (maybe with chunked/queued reads from disk for efficiency). As the hint suggests, one could use a pseudorandom permutation generator to generate new indices on the fly without storing a full permutation table - you’d only have to store a new seed for each epoch, and find a strategy for reading from disk based on these indices.
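One way the seed-per-epoch idea could look, as a rough sketch only. I am swapping in a simple affine map i -> (a*i + b) % n, which is a valid permutation when a is coprime to n but a much weaker shuffle than a real pseudorandom permutation; the function name affine_permutation is made up for this example.

```python
import math
import random


def affine_permutation(n, seed):
    """Yield 0..n-1 in a seed-dependent order without storing a table."""
    if n < 1:
        return
    rng = random.Random(seed)
    # Pick a multiplier coprime to n so that i -> (a*i + b) % n is a bijection.
    a = rng.randrange(1, n) if n > 1 else 1
    while math.gcd(a, n) != 1:
        a = rng.randrange(1, n)
    b = rng.randrange(n)
    for i in range(n):
        yield (a * i + b) % n


# Only the seed needs to be stored per epoch; the order is recomputed lazily.
epoch0 = list(affine_permutation(10, seed=0))
epoch0_again = list(affine_permutation(10, seed=0))
epoch1 = list(affine_permutation(10, seed=1))
print(epoch0, epoch1)
```

Each yielded index would then be passed to whatever disk read the dataset’s __getitem__ performs.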
For question 3:
def lazy_dataset():
    while True:
        yield torch.randn(32, 2), torch.randn(32)
For question 4:
def same_dataset():
    data = (torch.randn(32, 2), torch.randn(32))
    while True:
        yield data
My ex 2 solution:
- It’ll raise an error: RuntimeError: [enforce fail at alloc_cpu.cpp:117] err == 0. DefaultCPUAllocator: can’t allocate memory: you tried to allocate 80000000000000 bytes. Error code 12 (Cannot allocate memory)
- Using a pseudorandom permutation generator, we can generate an order for the data and use that order to index it. (E.g., where the original code gets data[i], we now use data[order[i]].)
and my Ex 3 solution:
class MySyntheticOnFlyDataloader(d2l.DataModule):  #@save
    """Synthetic data for linear regression."""
    def __init__(self, w, b, noise=0.01, batch_size=32):
        super().__init__()
        self.save_hyperparameters()

    def get_dataloader(self, train):
        # Generator: yields a freshly sampled batch on every next() call.
        while True:
            X = torch.randn(self.batch_size, len(self.w))
            noise = self.noise * torch.randn(self.batch_size, 1)
            y = torch.matmul(X, self.w.reshape((-1, 1))) + self.b + noise
            yield X, y

data = MySyntheticOnFlyDataloader(w=torch.tensor([2, -3.4]), b=4.2)
dataloader = data.train_dataloader()
X, y = next(dataloader)
print(X[0], y[0])
ex 4
Just need to fix the random seed.
import random

def random_generator(seed=42):
    # Reseeding before every draw makes each call return the same value.
    random.seed(seed)
    return random.random()

random_generator()
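For the torch version of the same idea, a local torch.Generator can be seeded instead of the global state. This is just a sketch: make_batch and its default arguments are names I made up, and the values of w and b are borrowed from the earlier posts.

```python
import torch


def make_batch(w, b, noise=0.01, batch_size=8, seed=0):
    """Hypothetical helper: the same seed always gives the same (X, y)."""
    g = torch.Generator().manual_seed(seed)  # local RNG, global state untouched
    X = torch.randn(batch_size, len(w), generator=g)
    eps = torch.randn(batch_size, 1, generator=g) * noise
    y = X @ w.reshape(-1, 1) + b + eps
    return X, y


w = torch.tensor([2.0, -3.4])
X1, y1 = make_batch(w, 4.2, seed=1)
X2, y2 = make_batch(w, 4.2, seed=1)
print(torch.equal(X1, X2), torch.equal(y1, y2))
```

Compared with calling torch.manual_seed inside the loader, this should leave any other random code in the notebook unaffected.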