https://d2l.ai/chapter_linear-regression/synthetic-regression-data.html
My thoughts on the exercises:
ex.1
I ran ?torch.utils.data.DataLoader and found the parameter “drop_last”.
I set it to True in the definition of d2l.DataModule.get_tensorloader(), like this:
@d2l.add_to_class(d2l.DataModule)  #@save
def get_tensorloader(self, tensors, train, indices=slice(0, None)):
    tensors = tuple(a[indices] for a in tensors)
    dataset = torch.utils.data.TensorDataset(*tensors)
    return torch.utils.data.DataLoader(dataset, self.batch_size,
                                       shuffle=train, drop_last=True)

@d2l.add_to_class(SyntheticRegressionData)  #@save
def get_dataloader(self, train):
    i = slice(0, self.num_train) if train else slice(self.num_train, None)
    return self.get_tensorloader((self.X, self.y), train, i)
Then I tested it with
len(data.train_dataloader())
and the result changed from 32 to 31.
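The drop from 32 to 31 matches simple arithmetic. A minimal sketch, assuming the book's defaults of num_train=1000 and batch_size=32 (the post does not state them, so treat these values as an assumption); `num_batches` is a hypothetical helper, not part of d2l or PyTorch:

```python
import math

def num_batches(num_examples, batch_size, drop_last):
    """Number of minibatches a loader yields over one pass of the data."""
    if drop_last:
        return num_examples // batch_size        # incomplete final batch is discarded
    return math.ceil(num_examples / batch_size)  # final batch may be smaller

num_train, batch_size = 1000, 32  # assumed defaults, not stated in the post
print(num_batches(num_train, batch_size, drop_last=False))  # 32
print(num_batches(num_train, batch_size, drop_last=True))   # 31
print(num_train % batch_size)  # 8 rows would sit in the dropped final batch
```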
ex.2
This one is too much for me now.
ex.3
class SyntheticRegressionData_onTheFly(d2l.HyperParameters):
    def __init__(self, w, b, noise=0.01, batch_size=8):
        self.save_hyperparameters()
        self.w = self.w.reshape((-1, 1))

    def get_dataloader(self):
        # Draw a fresh minibatch of features and labels on every call.
        X = torch.randn(self.batch_size, len(self.w))
        noise_tmp = torch.randn(self.batch_size, 1) * self.noise
        y = torch.matmul(X, self.w) + self.b + noise_tmp
        return X, y

test = SyntheticRegressionData_onTheFly(w=torch.tensor([1., -2.]), b=3.)
# Note: the two calls below draw two different batches, so the printed y
# does not belong to the printed X.
print(test.get_dataloader()[0],'\n',test.get_dataloader()[1])
result:
tensor([[ 0.7405, -0.8744],
[-1.6136, 0.6811],
[ 0.3348, -1.2086],
[-0.6661, 0.9301],
[ 0.8505, -0.2203],
[ 0.9009, -0.3271],
[ 0.7607, -0.2932],
[ 0.1139, -0.7248]])
tensor([[ 1.2106],
[ 1.9595],
[ 2.4213],
[-0.6816],
[ 5.0581],
[ 5.3575],
[ 0.1735],
[-1.9671]])
ex.4
Set num_train to any number and num_val = batch_size. Since the validation loader is not shuffled, it returns the same batch each time:
data = SyntheticRegressionData(w=torch.tensor([2, -3.4]), b=4.2, num_train=1, num_val=8, batch_size=8)
X, y = next(iter(data.val_dataloader()))
print(X)
print(y)
X_, y_ = next(iter(data.val_dataloader()))
print(X_)
print(y_)
ex.4
Add “torch.manual_seed(2)” before “class SyntheticRegressionData(d2l.DataModule): #@save”.
Thanks a lot to @DReidiano, I added “torch.manual_seed()” for ex.4, and the new code is:
class SyntheticRegressionData_onTheFly(d2l.HyperParameters):
    def __init__(self, w, b, noise=0.01, batch_size=8):
        self.save_hyperparameters()
        self.w = self.w.reshape((-1, 1))

    def get_dataloader(self, seed):
        torch.manual_seed(seed)
        X = torch.randn(self.batch_size, len(self.w))
        noise_tmp = torch.randn(self.batch_size, 1) * self.noise
        y = torch.matmul(X, self.w) + self.b + noise_tmp
        return X, y

data = SyntheticRegressionData_onTheFly(w=torch.tensor([2, -3.4]), b=4.2, batch_size=8)
print()
for i in range(2):
    X, y = data.get_dataloader(seed=1)
    print(torch.cat((X, y), 1))
The result is:
tensor([[-1.5256, -0.7502, 3.6893],
[-0.6540, -1.6095, 8.3587],
[-0.1002, -0.6092, 6.0620],
[-0.9798, -1.6091, 7.7108],
[-0.7121, 0.3037, 1.7411],
[-0.7773, -0.2515, 3.4907],
[-0.2223, 1.6871, -1.9765],
[ 0.2284, 0.4676, 3.0696]])
tensor([[-1.5256, -0.7502, 3.6893],
[-0.6540, -1.6095, 8.3587],
[-0.1002, -0.6092, 6.0620],
[-0.9798, -1.6091, 7.7108],
[-0.7121, 0.3037, 1.7411],
[-0.7773, -0.2515, 3.4907],
[-0.2223, 1.6871, -1.9765],
[ 0.2284, 0.4676, 3.0696]])
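Calling torch.manual_seed inside get_dataloader resets the global random state as a side effect. A possible variant, sketched here under the assumption that a private torch.Generator is acceptable, keeps the seeding local; `make_batch` is a hypothetical stand-alone function, not part of d2l:

```python
import torch

def make_batch(w, b, noise=0.01, batch_size=8, seed=0):
    # A private generator leaves the global RNG state untouched.
    g = torch.Generator().manual_seed(seed)
    w = w.reshape(-1, 1)
    X = torch.randn(batch_size, w.shape[0], generator=g)
    eps = torch.randn(batch_size, 1, generator=g) * noise
    y = X @ w + b + eps
    return X, y

w = torch.tensor([2.0, -3.4])
X1, y1 = make_batch(w, 4.2, seed=1)
X2, y2 = make_batch(w, 4.2, seed=1)
print(torch.equal(X1, X2), torch.equal(y1, y2))  # True True
```

The same seed reproduces the same batch, while other code that draws from the global generator is unaffected.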
On the page for 3.32:
X, y = next(iter(data.get_dataloader()))
But I think it is missing the parameter ‘train’.
My suggestion:
X, y = next(iter(data.get_dataloader(train=True)))
or
def get_dataloader(self, train=True)
I think it should be changed.
Why is the data loader in 3.3.3 considered more efficient? self.X and self.y are held in memory, just as with the previous data loader.
Can anybody clarify what is needed in Exercise 3.3.5.2.2? You cannot shuffle a dataset stored on disk unless you open the file. When the data is in memory, you can use torch.utils.data.DataLoader with shuffle=True.
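One possible reading of the exercise is that only the indices need to be shuffled in memory, while the records themselves stay on disk and are fetched by seeking. A minimal sketch under that assumption; the fixed-size binary record layout and the helper names (`write_records`, `shuffled_batches`) are hypothetical:

```python
import os
import random
import struct
import tempfile

RECORD = struct.Struct("<3f")  # hypothetical layout: two features and one label

def write_records(path, rows):
    with open(path, "wb") as f:
        for row in rows:
            f.write(RECORD.pack(*row))

def shuffled_batches(path, batch_size, seed=0):
    """Yield minibatches in random order, reading one record at a time."""
    num_records = os.path.getsize(path) // RECORD.size
    order = list(range(num_records))      # only the indices live in memory
    random.Random(seed).shuffle(order)
    with open(path, "rb") as f:
        for start in range(0, num_records, batch_size):
            batch = []
            for i in order[start:start + batch_size]:
                f.seek(i * RECORD.size)   # jump straight to record i
                batch.append(RECORD.unpack(f.read(RECORD.size)))
            yield batch

rows = [(float(i), float(-i), float(2 * i)) for i in range(10)]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    write_records(path, rows)
    batches = list(shuffled_batches(path, batch_size=4))

print([len(b) for b in batches])  # [4, 4, 2]
```

The file is still opened, but it is never loaded whole; memory use grows with the number of indices rather than with the size of the data.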
I do not know how this can be, but the get_dataloader that is added to the SyntheticRegressionData class in this chapter is not the method that is called when we run X, y = next(iter(data.train_dataloader())).
The train_dataloader method calls get_dataloader, but the get_dataloader that runs is the one already in d2l.SyntheticRegressionData, and it calls get_tensorloader.
So when we try to add get_dataloader, the new definition does not replace the one that is already there.
You can check this by running d2l.SyntheticRegressionData?? after the attempted addition.
When we call next(iter(data.train_dataloader())), it calls train_dataloader in SyntheticRegressionData, which is inherited from DataModule. This train_dataloader then calls the get_dataloader we defined, not the one from DataModule. When we add this method to SyntheticRegressionData, it replaces the original one.
You can try it by modifying the method, for example by putting a print call in it, and calling it again.
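The replacement behaviour can be checked without d2l. The sketch below re-implements the decorator as a plain setattr on the class, which is an assumption about how d2l.add_to_class works, and uses hypothetical stand-in classes:

```python
def add_to_class(Class):
    """Register the decorated function as a method of Class (sketch of the d2l helper)."""
    def wrapper(obj):
        setattr(Class, obj.__name__, obj)
    return wrapper

class DataModule:
    def get_dataloader(self, train):
        raise NotImplementedError

    def train_dataloader(self):
        # The method is looked up on the instance's class at call time.
        return self.get_dataloader(train=True)

class SyntheticData(DataModule):
    def get_dataloader(self, train):
        return "original"

data = SyntheticData()
print(data.train_dataloader())  # original

@add_to_class(SyntheticData)
def get_dataloader(self, train):
    return "replaced"

print(data.train_dataloader())  # replaced
```

Note that the patch only affects the class object it is applied to. If the decorator targets a locally defined SyntheticRegressionData while the instance was created from d2l.SyntheticRegressionData (or the other way round), those are two different classes, which may explain why the new definition appeared not to take effect.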
In 3.3.3 get_tensorloader, we can actually avoid creating a tuple for tensors, which leads to unnecessary memory allocation. We can just keep the generator expression, and the unpacking operator (*tensors) will do the rest when we create the TensorDataset.
@d2l.torch.add_to_class(d2l.torch.DataModule)
def get_tensorloader(self, tensors, train, indices=slice(0, None)):
    tensors = (a[indices] for a in tensors)
    dataset = torch.utils.data.TensorDataset(*tensors)
    return torch.utils.data.DataLoader(dataset, self.batch_size, shuffle=train)
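A small check, using plain lists in place of tensors and a hypothetical `collect` function in place of TensorDataset, shows that star-unpacking a generator passes the same arguments as star-unpacking a tuple:

```python
def collect(*args):
    # Star-unpacking always hands the callee a tuple of arguments.
    return args

data = [list(range(10)), list(range(10, 20))]
indices = slice(0, 3)

from_tuple = collect(*tuple(a[indices] for a in data))
from_generator = collect(*(a[indices] for a in data))

print(from_tuple == from_generator)   # True
print(type(from_generator).__name__)  # tuple
```

Since the callee receives a tuple either way, and that tuple only holds references to the sliced objects, the saving is probably small in practice; the generator form mainly removes one intermediate container.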
My exercise solutions.
Question 1
On its final iteration, the DataLoader would return values with fewer rows. To prevent this, you can pass in drop_last=True.
Question 2
- You’ll get an out of memory error.
- I would (as the book suggests) likely use a pseudorandom permutation generator to index into the data. A simple Fisher-Yates shuffle would work.
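A minimal sketch of the Fisher-Yates idea mentioned above: shuffle a list of indices in place, then slice it into minibatches of indices that can be used to fetch examples. The function names are hypothetical, and the sketch assumes the index list itself fits in memory:

```python
import random

def fisher_yates(n, seed=None):
    """Return a uniformly random permutation of range(n)."""
    rng = random.Random(seed)
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randint(0, i)  # randint is inclusive on both ends
        perm[i], perm[j] = perm[j], perm[i]
    return perm

def index_batches(n, batch_size, seed=None):
    """Yield lists of example indices in shuffled order."""
    perm = fisher_yates(n, seed)
    for start in range(0, n, batch_size):
        yield perm[start:start + batch_size]

batches = list(index_batches(10, 4, seed=0))
print([len(b) for b in batches])              # [4, 4, 2]
print(sorted(i for b in batches for i in b))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Every index appears exactly once per pass, so each epoch still visits the whole dataset.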
Question 3
Maybe I’m misinterpreting this question? Naively:
def generator():
    while True:
        yield torch.randn(2)
Question 4
Again, I may be misinterpreting. I’m assuming this question is asking for each generator instance to have random data, but for a given instance to return the same data each time.
def generator():
    a = torch.randn(2)
    while True:
        yield a
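Another possible reading of Question 4 is that the generator should produce many distinct examples on the fly, yet return the same example whenever the same index is requested. A sketch of that interpretation in plain Python, where seeding each example with base_seed + i is a hypothetical scheme chosen for illustration:

```python
import random

class ReproducibleData:
    """Generates example i on demand; the same i always gives the same values."""
    def __init__(self, w, b, noise=0.01, base_seed=0):
        self.w, self.b, self.noise, self.base_seed = w, b, noise, base_seed

    def __getitem__(self, i):
        rng = random.Random(self.base_seed + i)  # hypothetical per-example seed
        x = [rng.gauss(0, 1) for _ in self.w]
        y = sum(wj * xj for wj, xj in zip(self.w, x)) + self.b + rng.gauss(0, self.noise)
        return x, y

data = ReproducibleData(w=[2.0, -3.4], b=4.2)
print(data[5] == data[5])  # True
```

Nothing is stored between calls, so memory use stays constant no matter how many examples are requested.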