Synthetic Regression Data

https://d2l.ai/chapter_linear-regression/synthetic-regression-data.html

My thoughts on the exercises
ex.1
I ran ?torch.utils.data.DataLoader,
found the parameter “drop_last”,
and set it to True in the definition of d2l.DataModule.get_tensorloader(), like:

@d2l.add_to_class(d2l.DataModule)  #@save
def get_tensorloader(self, tensors, train, indices=slice(0, None)):
    tensors = tuple(a[indices] for a in tensors)
    dataset = torch.utils.data.TensorDataset(*tensors)
    return torch.utils.data.DataLoader(dataset, self.batch_size,
                                       shuffle=train, drop_last=True)
@d2l.add_to_class(SyntheticRegressionData)  #@save
def get_dataloader(self, train):
    i = slice(0, self.num_train) if train else slice(self.num_train, None)
    return self.get_tensorloader((self.X, self.y), train, i)

and tested it with

len(data.train_dataloader())

The result changed from 32 to 31 (with the default num_train=1000 and batch_size=32, the incomplete last batch is now dropped, leaving 31 full batches).

ex.2
This one is too much for me now.

ex.3

class SyntheticRegressionData_onTheFly(d2l.HyperParameters):
    def __init__(self, w, b, noise=0.01, batch_size=8):
        self.save_hyperparameters()
        self.w = self.w.reshape((-1, 1))
    def get_dataloader(self):
        # draw a fresh random minibatch on every call instead of storing a fixed dataset
        X = torch.randn(self.batch_size, len(self.w))
        noise_tmp = torch.randn(self.batch_size, 1) * self.noise
        y = torch.matmul(X, self.w) + self.b + noise_tmp
        return X, y

test = SyntheticRegressionData_onTheFly(w=torch.tensor([1., -2.]), b=3.)
print(test.get_dataloader()[0],'\n',test.get_dataloader()[1])

result:
tensor([[ 0.7405, -0.8744],
[-1.6136, 0.6811],
[ 0.3348, -1.2086],
[-0.6661, 0.9301],
[ 0.8505, -0.2203],
[ 0.9009, -0.3271],
[ 0.7607, -0.2932],
[ 0.1139, -0.7248]])
tensor([[ 1.2106],
[ 1.9595],
[ 2.4213],
[-0.6816],
[ 5.0581],
[ 5.3575],
[ 0.1735],
[-1.9671]])

ex.4
Set num_train to any number and num_val = batch_size, so each pass over the validation loader yields the same single batch:

data = SyntheticRegressionData(w=torch.tensor([2, -3.4]), b=4.2, num_train=1, num_val=8, batch_size = 8)
X, y = next(iter(data.val_dataloader()))
print(X)
print(y)
X_, y_ = next(iter(data.val_dataloader()))
print(X_)
print(y_)

ex.4
Add “torch.manual_seed(2)” before
“class SyntheticRegressionData(d2l.DataModule): #@save”
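For concreteness, a minimal sketch of that placement (the class body is elided; it stays exactly as in the chapter):

import torch
from d2l import torch as d2l

torch.manual_seed(2)   # fix the global RNG before the synthetic X and noise are drawn

class SyntheticRegressionData(d2l.DataModule):  #@save
    ...  # constructor and methods exactly as in the chapter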


Thanks a lot to @DReidiano. I added “torch.manual_seed()” for ex.4,
and the new code is:

class SyntheticRegressionData_onTheFly(d2l.HyperParameters):
    def __init__(self, w, b, noise=0.01, batch_size=8):
        self.save_hyperparameters()
        self.w = self.w.reshape((-1, 1))
    def get_dataloader(self, seed):
        torch.manual_seed(seed)   # re-seeding with the same value makes every call return the identical batch
        X = torch.randn(self.batch_size, len(self.w))
        noise_tmp = torch.randn(self.batch_size, 1) * self.noise
        y = torch.matmul(X, self.w) + self.b + noise_tmp
        return X, y
data = SyntheticRegressionData_onTheFly(w=torch.tensor([2, -3.4]), b=4.2, batch_size = 8)
print()
for i in range(2):
    X, y = data.get_dataloader(seed = 1)
    print(torch.cat((X,y),1))

The result is:

tensor([[-1.5256, -0.7502, 3.6893],
[-0.6540, -1.6095, 8.3587],
[-0.1002, -0.6092, 6.0620],
[-0.9798, -1.6091, 7.7108],
[-0.7121, 0.3037, 1.7411],
[-0.7773, -0.2515, 3.4907],
[-0.2223, 1.6871, -1.9765],
[ 0.2284, 0.4676, 3.0696]])
tensor([[-1.5256, -0.7502, 3.6893],
[-0.6540, -1.6095, 8.3587],
[-0.1002, -0.6092, 6.0620],
[-0.9798, -1.6091, 7.7108],
[-0.7121, 0.3037, 1.7411],
[-0.7773, -0.2515, 3.4907],
[-0.2223, 1.6871, -1.9765],
[ 0.2284, 0.4676, 3.0696]])

In Section 3.3.2:

X, y = next(iter(data.get_dataloader()))

But I think it is missing the parameter ‘train’.

My suggestion:
X, y = next(iter(data.get_dataloader(train=True)))
or
def get_dataloader(self, train=True)

I think it should be modified
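For concreteness, the second option could look like this (a sketch, reusing the get_dataloader body quoted earlier in this thread):

@d2l.add_to_class(SyntheticRegressionData)
def get_dataloader(self, train=True):   # default value, so data.get_dataloader() also works
    i = slice(0, self.num_train) if train else slice(self.num_train, None)
    return self.get_tensorloader((self.X, self.y), train, i)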

Why is the data loader in 3.3.3 considered to be more efficient? self.X and self.y are in memory just as with the previous data loader.


Can anybody clarify what is needed in Exercise 3.3.5.2.2? You cannot shuffle a dataset stored on disk unless you open the file. When the file is in memory, you can use torch.utils.data.DataLoader with shuffle=True.

I do not know how this can be, but the get_dataloader that is added to the SyntheticRegressionData class in this chapter is not the method that is called when we run X, y = next(iter(data.train_dataloader())).

The train_dataloader method calls get_dataloader, but the get_dataloader that runs is the one already in d2l.SyntheticRegressionData, and it calls get_tensorloader.

So when we try to add get_dataloader, the new definition does not replace the one that is already there.

You can check it by running d2l.SyntheticRegressionData?? after attempted addition.

When we call next(iter(data.train_dataloader())), it calls train_dataloader in SyntheticRegressionData, which is inherited from DataModule. This train_dataloader then calls the get_dataloader we defined, not the one from DataModule. When we add this method to SyntheticRegressionData, it replaces the original one.

You can try it by modifying the method, putting in a print call or something, and calling it again.
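One quick way to check (a sketch, assuming the saved d2l.torch classes from the book): patch the class with a marker print and see which method runs.

import torch
from d2l import torch as d2l

@d2l.add_to_class(d2l.SyntheticRegressionData)
def get_dataloader(self, train):
    print('patched get_dataloader was called')   # marker
    i = slice(0, self.num_train) if train else slice(self.num_train, None)
    return self.get_tensorloader((self.X, self.y), train, i)

data = d2l.SyntheticRegressionData(w=torch.tensor([2, -3.4]), b=4.2)
X, y = next(iter(data.train_dataloader()))   # the marker should be printed once

If the marker appears, the patched method did replace the original one.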

In the 3.3.3 get_tensorloader we can actually avoid creating a tuple for tensors, which causes an unnecessary memory allocation. We can just keep the generator expression; the unpacking operator (*tensors) will do the rest when we create the TensorDataset.

@d2l.torch.add_to_class(d2l.torch.DataModule)
def get_tensorloader(self, tensors, train, indices=slice(0, None)):
    tensors = (a[indices] for a in tensors)
    dataset = torch.utils.data.TensorDataset(*tensors)
    return torch.utils.data.DataLoader(dataset, self.batch_size, shuffle=train)
  1. If num_examples % batch_size != 0, your final batch will be smaller than batch_size. To ignore this last batch and move on to the next epoch, one can set drop_last=True when instantiating the torch.utils.data.DataLoader.
  2. If we cannot hold all the data in memory, one has to lazily load/evaluate it on the fly in the dataset’s __getitem__ method (maybe with chunked/queued reads from disk for efficiency). As the hint suggests, one could use a pseudorandom permutation generator to generate new indices on the fly without storing a full permutation table: you’d only have to store a new seed for each epoch and find a strategy for reading from disk based on these indices (a rough sketch follows below).
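A rough sketch of point 2, assuming the examples live in one binary file of fixed-size float32 rows (features followed by the label); the file layout and the simple affine permutation are illustrative assumptions, not the book's implementation:

import math
import numpy as np
import torch

class LazyDiskDataset(torch.utils.data.Dataset):
    """Loads one example at a time from disk; nothing is preloaded into memory."""
    def __init__(self, path, num_examples, num_features, seed=0):
        self.n, self.d = num_examples, num_features
        # Memory-map the file: the OS pages in only the rows we actually touch.
        self.mm = np.memmap(path, dtype=np.float32, mode='r',
                            shape=(self.n, self.d + 1))
        # Toy pseudorandom permutation i -> (a*i + b) mod n. Only the seed-derived
        # pair (a, b) is stored, never a full permutation table; gcd(a, n) == 1
        # guarantees the mapping is a bijection.
        rng = np.random.default_rng(seed)
        a = int(rng.integers(1, self.n))
        while math.gcd(a, self.n) != 1:
            a = int(rng.integers(1, self.n))
        self.a, self.b = a, int(rng.integers(self.n))

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        j = (self.a * i + self.b) % self.n     # shuffled position, computed on the fly
        row = np.array(self.mm[j])             # one small read from disk
        return torch.from_numpy(row[:-1]), torch.tensor(row[-1])

A DataLoader built on such a dataset can then walk the indices 0, …, n-1 in order while the examples still come out pseudo-randomly shuffled; drawing a fresh (a, b) pair reshuffles the next epoch.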

For question 3:

def lazy_dataset():
    # yields a freshly drawn random minibatch on every request
    while True:
        yield torch.randn(32, 2), torch.randn(32)

For question 4:

def same_dataset():
    # draws one random minibatch once and serves the same batch forever
    data = (torch.randn(32, 2), torch.randn(32))
    while True:
        yield data

My ex 2 solution:

  1. It’ll raise an error: RuntimeError: [enforce fail at alloc_cpu.cpp:117] err == 0. DefaultCPUAllocator: can’t allocate memory: you tried to allocate 80000000000000 bytes. Error code 12 (Cannot allocate memory)
  2. Using a pseudorandom permutation generator, we can generate an order for the data and use that order to index into it. (E.g., where the original code reads data[i], we now read data[order[i]]; see the short sketch below.)
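As a tiny illustration of point 2 (torch.randperm stands in for the pseudorandom permutation here, purely for the example):

import torch

data = torch.arange(10) * 10         # stand-in for the stored examples
order = torch.randperm(len(data))    # pseudorandom order of the indices
for i in range(len(data)):
    x = data[order[i]]               # where the original code read data[i], read data[order[i]]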

and my Ex 3 solution:

class MySyntheticOnFlyDataloader(d2l.DataModule):  #@save
    """Synthetic data for linear regression."""
    def __init__(self, w, b, noise=0.01, batch_size=32):
        super().__init__()
        self.save_hyperparameters()
    def get_dataloader(self, train):
        while True:
            X = torch.randn(self.batch_size, len(self.w))
            noise = self.noise * torch.randn(self.batch_size, 1)
            y = torch.matmul(X, self.w.reshape((-1, 1))) + self.b + noise
            yield X, y

data = MySyntheticOnFlyDataloader(w=torch.tensor([2, -3.4]), b=4.2)
dataloader = data.train_dataloader()
X, y = next(dataloader)
print(X[0], y[0])

ex 4
Just need to fix the random seed.

import random
def random_generator(seed=42):
    random.seed(seed)          # re-seeding on every call means every call returns the same value
    return random.random()
random_generator()

Ex 2:
To tackle the challenge of efficiently loading globally shuffled data from disk, especially for machine learning applications, I devised an innovative algorithm I call the “Pseudo-random Sorted-Read” method.

The Core Principle of My Algorithm:

The core of my algorithm is an attempt to merge two goals: achieving global randomness and optimizing disk I/O reads. Here is how I designed it to work:

  1. Batch Processing: First, I logically divide the dataset’s natural indices, 0, 1, ..., N-1, into batches, processing one batch of m indices at a time.
  2. Global Mapping: For each batch, I use a pseudo-random permutation function to map these m natural indices to their globally unique, pseudo-random target indices within the full range of the dataset [0, N-1].
  3. Sort to Optimize Reads: Here lies the essence of my algorithm. I don’t use these unordered, random target indices to read from the disk directly. Instead, I first sort these m target indices in ascending order.
  4. Efficient Reading: Then, I read the data from the source file according to this sorted list of indices. With this step, I successfully transformed what would have been random I/O, causing frequent disk head movements, into a much more disk-friendly, near-sequential I/O pattern.
  5. In-Memory Reordering and Output: Finally, I take the block of data read into memory, reorder it back to its original pseudo-random sequence, and then output it sequentially as a fully shuffled minibatch.

I believe that my algorithm not only correctly achieves a global shuffle but also holds significant theoretical value in terms of I/O efficiency and engineering, as it requires no temporary files and is naturally suited for parallel processing.
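For illustration, a rough sketch of the core loop (the file name, on-disk layout, and the stand-in permutation below are assumptions, not part of the original description):

import numpy as np

def sorted_read_minibatch(mm, target_idx):
    """mm: np.memmap of shape (N, d); target_idx: pseudo-random global indices for one batch."""
    sort_order = np.argsort(target_idx)              # step 3: sort the m target indices
    block = np.array(mm[target_idx[sort_order]])     # step 4: near-sequential read from disk
    return block[np.argsort(sort_order)]             # step 5: restore the pseudo-random order

# Usage sketch: one minibatch of m indices taken from a global permutation of N examples.
N, d, m = 1_000_000, 3, 32
mm = np.memmap('data.bin', dtype=np.float32, mode='r', shape=(N, d))   # hypothetical file
perm = np.random.default_rng(0).permutation(N)       # stand-in for the pseudo-random permutation
batch = sorted_read_minibatch(mm, perm[:m])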

Although I was confident in my algorithm’s theoretical I/O advantages, I failed to observe a noticeable performance improvement in the practical tests I conducted using a Python script. Upon reflection, I realized this is primarily due to two key reasons, which together shifted the real performance bottleneck away from the disk I/O I was trying to optimize:

  1. Python Interpreter Performance and Overhead:
    I recognized that as a high-level interpreted language, Python’s execution speed is far slower than compiled languages like C++. In the DataLoader workflow, substantial CPU time is consumed by Python’s own logic, such as inter-process communication, loop dispatching, and function call overhead. The time cost of these operations is on the order of milliseconds. In contrast, the disk seek time I saved with my algorithm is likely on the order of microseconds. Therefore, the massive overhead of the Python interpreter itself became the dominant performance bottleneck, completely masking the optimization I had achieved at the I/O level.
  2. Dataset Size and Operating System Caching:
    I also discovered that modern operating systems have very intelligent file system caching. During my tests, because the dataset I used was only a few dozen megabytes, the OS likely cached it entirely in RAM after the first read. On subsequent runs, all data read requests were served directly from high-speed memory, meaning no slow, physical disk I/O occurred at all. Within RAM, there is a negligible performance difference between random and sequential access. Consequently, my test was not actually measuring “disk I/O performance” but rather “memory access performance,” which naturally couldn’t reveal the benefits of my disk-focused algorithm. I understood that the value of my algorithm would only truly become apparent with datasets far larger than the available system memory.

My conclusion is: The algorithm I proposed is theoretically sound and highly efficient. The fact that I couldn’t see its advantages in my Python test is not a flaw in the algorithm itself, but a consequence of my specific test environment, where the true bottlenecks were Python’s execution overhead and system caching. To validate its full potential, I would need to implement its core logic in a high-performance language like C++ and test it on a dataset that significantly exceeds memory capacity.