Jun '20
Others (like error rate) are difficult to optimize directly, owing to non-differentiability or other complications. In these cases, it is common to optimize a surrogate objective
It’s not quite clear to me from reading what exactly is meant by “error rate”. I think it would be great if an example could be given.
Jun '20
▶ manuel-arno-korfmann
Jun '20
▶ goldpiggy
I am still unable to understand the error rate.
“How much mistake to model makes” is not clear enough. Did you mean how many mistakes ‘the’ model makes, i.e., the L1 distance between $y$ and $\hat{y}$?
Also, can you please explain what a surrogate objective is?
Jun '20
I’m having a difficult time understanding this:
Hence, the loss $L$ incurred by eating the mushroom is $L(a=\text{eat} \mid x) = 0.2 \times \infty + 0.8 \times 0 = \infty$, whereas the cost of discarding it is $L(a=\text{discard} \mid x) = 0.2 \times 0 + 0.8 \times 1 = 0.8$.
Is it possible to explain it in more depth via 1 or 2 paragraphs?
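For what it’s worth, here is how I read the arithmetic (a small sketch; the 0.2/0.8 probabilities and the per-outcome losses come from the quoted example):

```python
import math

# P(poisonous | x) = 0.2 and P(edible | x) = 0.8, as in the quoted example.
p_poison, p_edible = 0.2, 0.8

# Per-outcome losses from the example: eating a poisonous mushroom is catastrophic (infinite loss),
# eating an edible one costs nothing; discarding a poisonous one costs nothing,
# while discarding an edible one wastes it (loss 1).
loss_eat = p_poison * math.inf + p_edible * 0   # = inf
loss_discard = p_poison * 0 + p_edible * 1      # = 0.8

print(loss_eat, loss_discard)  # inf 0.8 -> discarding has the lower expected loss
```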
Jun '20
▶ goldpiggy
Ok, so a person in the reading group explained that the error rate is the accumulated loss for all examples, is that correct?
Jun '20
▶ syedmech47
Hey @syedmech47, sorry for the typo here. Yes, you got the idea: the error rate measures the distance between $y$ (the truth) and $\hat{y}$ (the estimate). However, the metric used to measure the error is not limited to the L1 distance; it can also be accuracy, precision, recall, F1, etc.
A surrogate is a function that approximates an objective function. Many measurement metrics (like F1) are not differentiable, hence we need some other function (i.e., the loss function) to approximate the objective function.
Let me know if this is clear enough!
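As a rough illustration (toy numbers, nothing from the book): the error rate just counts hard mistakes, while a differentiable surrogate such as cross-entropy works on the predicted probabilities.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])            # ground-truth labels
y_prob = np.array([0.9, 0.2, 0.4, 0.8, 0.3])  # predicted P(y = 1)
y_hat = (y_prob >= 0.5).astype(int)           # hard predictions at a 0.5 threshold

# Error rate: fraction of examples the model gets wrong (not differentiable in the model parameters).
error_rate = np.mean(y_hat != y_true)         # 0.2 here (one mistake out of five)

# Cross-entropy: a differentiable surrogate defined directly on the probabilities.
cross_entropy = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(error_rate, cross_entropy)
```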
Jun '20
▶ manuel-arno-korfmann
It can be the accumulated loss or the average loss. It doesn’t make much difference here for optimization.
Jun '20
Thanks a lot. It totally made sense.
Side Note: I just want to thank each and every person’s effort in making this wonderful resource open for all and also providing such wonderful support through discussion forums.
Jun '20
▶ syedmech47
Fantastic! It’s our pleasure to enable more talent to learn, apply, and benefit from deep learning!
Aug '20
The last exercise mentions “the end-to-end learning approach”, but nowhere in the section is “end-to-end learning” explained.
Aug '20
▶ manuel-arno-korfmann
Aug '20
▶ goldpiggy
Hi @goldpiggy, I’m reading this thread and I’d like to clarify further. My understanding is:
- there is a difference between the error rate and the cost/loss function
- the error rate is the number of errors in predictions for a given set of inputs X. Perhaps it’s the total number of errors divided by the total number of examples in the input data?
- the loss function is a quantification of the “distance” between right and wrong predictions, which is different from the error rate, which is the percentage of errors described in the previous point?
Thank you for these resources and your guidance.
Aug '20
▶ zeuslawyer
@zeuslawyer
My understanding is as follows:
- Loss function and cost function (in this context) mean the same thing
- Loss functions are a family of functions that could be relevant for a problem. For classification problems, the error rate is one such loss function, but it isn’t differentiable, so cross-entropy is used as a surrogate loss function (see the sketch below)
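A minimal sketch of why that matters (one toy example with made-up numbers): nudging a weight barely changes anything the 0-1 error can see, while the cross-entropy and its gradient respond smoothly.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

y, x = 1, 2.0            # one toy example: label 1, a single feature
for w in (0.10, 0.11):   # nudge the weight slightly
    p = sigmoid(w * x)                               # predicted P(y = 1)
    error = int((p >= 0.5) != y)                     # 0-1 error: piecewise constant, zero gradient almost everywhere
    ce = -(y * np.log(p) + (1 - y) * np.log(1 - p))  # cross-entropy: changes smoothly with w
    grad = (p - y) * x                               # d(ce)/dw, usable for gradient descent
    print(w, error, round(ce, 4), round(grad, 4))
```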
Ashok
Aug '20
Tailors have developed a small number of parameters that describe human body shape fairly accurately for the purpose of fitting clothes. These problems are referred to as subspace estimation problems. If the dependence is linear, it is called principal component analysis.
In the above sentence, what is the word “dependence” referring to? Dependence between what and what?
Ashok
Aug '20
Yes.
Yes, we also refer to it as the average error rate.
Sometimes the two can be the same, if the error rate function is differentiable. Most of the time, however, it is not: for example, most classification error rate functions are not differentiable, so we use a surrogate loss function instead. Check more details here.
Aug '20
▶ meetashok
Hi @meetashok, great question! The dependence between the principal components and the original data. PCA transforms the data to a new coordinate system (ordered by the principal components) via an orthogonal linear transformation. It tries to capture the structure of the data and recreate it from a few new features.
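If it helps, here is a tiny sketch of that orthogonal linear transformation (synthetic data, not from the book): the principal components form an orthonormal basis, and projecting onto the first few of them gives the low-dimensional description that the subspace-estimation framing refers to.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))              # 200 samples, 5 raw measurements
X = X @ rng.normal(size=(5, 5))            # introduce linear dependence between the features
Xc = X - X.mean(axis=0)                    # center the data

# PCA via SVD: the rows of Vt are the principal components (an orthonormal basis).
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T[:, :2]                       # project onto the first 2 components (the "subspace")
X_approx = Z @ Vt[:2, :] + X.mean(axis=0)  # linear reconstruction from just 2 numbers per sample

print(np.allclose(Vt @ Vt.T, np.eye(5)))   # True: the transformation is orthogonal
```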
Feb '21
This text mentions “from first principles” a few times; what does that really mean here?
Apr '21
"As we will see later, this loss corresponds to the assumption that our data were corrupted by Gaussian noise ". This sentence appears in the Regression subsection. However, to my knowledge, the Gaussian noise assumption is not necessary for the least-square method for linear regression. The Gaussian assumption is very useful for statistical inference but not necessary for parameter estimation.
May '21
▶ rzwck
Jun '21
This may be a nitpick. Regarding the drain repair contractor example, when it says “if some of the variance owes to a few factors besides your two features”: here we have only one feature, the hours worked by the contractor, and two parameters, the hourly rate and the bias, i.e. y = ax + b. Since the hourly rate is $100 and the contractor charges $50 to show up, a = 100 and b = 50.
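To make it concrete (the job durations are made up):

```python
# Contractor pricing from the example: y = a * x + b with a = 100 ($ per hour) and b = 50 ($ call-out fee).
a, b = 100, 50
for hours in (1, 3):             # hypothetical job durations
    print(hours, a * hours + b)  # 1 -> 150, 3 -> 350
```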
Jun '21
▶ roncato
Sep '21
In Section 1.5, you write:
It is evident that random-access memory has not kept pace with the growth in data. At the same time, the increase in computational power has outpaced that of the data available. This means that statistical models need to become more memory efficient (this is typically achieved by adding nonlinearities) while simultaneously being able to spend more time on optimizing these parameters, due to an increased computational budget.
Can you kindly explain how statistical models can become memory efficient by “adding non-linearities”? Is memory efficiency the reason for adding non-linearities or a good side effect of adding non-linearity?
Jul '22
The following sentence is missing its citation.
The breakthrough deep Q-network that beat humans at Atari games using only the visual input .
It seems you intended to cite, but maybe it didn’t compile properly?
Sep '22
▶ manuel-arno-korfmann
The error rate is the probability your model makes a mistake and is used to measure how good your model is on a classification problem. Here’s a concrete example:
Suppose you trained a model to predict if an image is a cat or a dog. Suppose you take 100 images with known labels of cat and dog that your model never trained on. Let’s take the first image (suppose it’s a cat) and your model predicts dog. The error would be 1. Suppose the next image is a dog and your model predicts a dog. The error would be 0. You keep doing this and sum up the errors. Finally, divide by the total number of examples you predicted on (100 in this case). That is the error rate. It’s a number between 0 and 1. Suppose the error rate is 0.25 or 25%. It means there’s a 25% chance your model will make a mistake if you were to randomly select an image from your test dataset.
A few more things to know. For your model to make a decision, you have to pick a threshold. Neural networks for classification problems produce a probability that the image is a cat (say 30%). But this is just a probability! Next, you need to pick a threshold. So you might pick a threshold of, say, 50% and say that if the model emits a probability of 50% or greater, I will call the image a cat. At this point you might wonder: is there a way to measure how good my model is without having to pick a threshold? In fact there is! That’s what ROC AUC is all about. You can read more about it here: Understanding ROC AUC (part 1/2). Introduction | by matt johnson | Building Ibotta | Medium
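If it helps, that whole recipe fits in a few lines (toy arrays, not tied to any particular dataset):

```python
import numpy as np

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=100)    # 100 held-out labels: 1 = cat, 0 = dog
y_prob = rng.uniform(0, 1, size=100)     # the model's predicted P(cat) for each image

y_pred = (y_prob >= 0.5).astype(int)     # pick a threshold (0.5 here) to get hard decisions
errors = (y_pred != y_true).astype(int)  # 1 whenever a prediction is wrong, else 0
error_rate = errors.sum() / len(y_true)  # fraction of mistakes, between 0 and 1

print(error_rate)                        # e.g. 0.25 would mean a 25% chance of a mistake on a random test image
```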
Nov '22
Data is fundamental, computation runs on it, and elegant algorithms can help it run better.
Mar '23
▶ matt_johnson
This is such a brilliant explanation. Thanks!!
Oct '23
This is amazing! Thank you so much! I did the exercises and it’s been very enjoyable.
Nov '23
Hey folks, just wondering if anyone’s come across a resource for exercise solutions for each chapter in ‘Dive into Deep Learning’? Any leads would be awesome!
Jan '24
Hi Dan, did you get any answers? Please share; I am also looking for resources.
May '24
Are we going to learn high-level tools?
Will reading this book enable me to do research and become a scientist in the field? Will I be able to read research papers?
Jun '24
My answers to the exercises:
- I recently created a Discord bot to prevent spam using simple heuristics – Levenshtein distance from suspicious words, frequent messaging, and the like. Learning from examples of manually annotated spam could be used for more intelligent content-based prevention.
- Implementing software based on a high-level description. This may take a few years, but seems possible.
- Computation is used to apply algorithms to data. As compute becomes cheaper, simpler algorithms that can more effectively leverage greater amounts of compute will outperform other approaches (as described by Richard Sutton).
- Possibly in spaced repetition systems – Anki uses the SuperMemo 2 algorithm, but leveraging e2e training might allow for repetition which more appropriately adapts to a given user and the content they’re studying.