Jun '20

manuel-arno-korfmann

Others (like error rate) are difficult to optimize directly, owing to non-differentiability or other complications. In these cases, it is common to optimize a surrogate objective

It’s not quite clear to me from reading what exactly is meant with “error rate”. I think it would be great if an example could be given.

2 replies
Jun '20 ▶ manuel-arno-korfmann

goldpiggy

Hi @manuel-arno-korfmann, "error rate" means how many mistakes the model makes. Is that clearer?

3 replies
Jun '20 ▶ goldpiggy

syedmech47

I am still unable to understand the error rate.
"How many mistakes the model makes" is not clear enough. Do you mean the distance between the truth and the prediction, i.e., the L1 distance |y - ŷ|?
Also, can you please explain what a surrogate objective is?

1 reply
Jun '20

manuel-arno-korfmann

I’m having a difficult time understanding

Hence, the loss $L$ incurred by eating the mushroom is $L(a=\text{eat} \mid x) = 0.2 \times \infty + 0.8 \times 0 = \infty$, whereas the cost of discarding it is $L(a=\text{discard} \mid x) = 0.2 \times 0 + 0.8 \times 1 = 0.8$.

Is it possible to explain it in more depth via 1 or 2 paragraphs?

Jun '20 ▶ goldpiggy

manuel-arno-korfmann

Ok, so a person in the reading group explained that the error rate is the accumulated loss for all examples, is that correct?

1 reply
Jun '20 ▶ syedmech47

goldpiggy

Hey @syedmech47, sorry for the confusion earlier. Yes, you got the idea: the error rate measures the distance between $y$ (the truth) and $\hat{y}$ (the estimate). However, the metric that measures the error is not limited to the L1 distance; it can also be accuracy, precision, recall, F1, etc.

A surrogate is a function that approximates an objective function. Many of these metrics are not differentiable (F1, for example), so we need some other function (i.e., the loss function) to approximate the objective and optimize that instead.
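Here is a rough sketch of the idea in code (not from the book; the tensors and numbers are made up for illustration):

```python
import torch

# Toy binary-classification example (made-up logits and labels).
logits = torch.tensor([2.0, -1.0, 0.5, -0.3], requires_grad=True)
labels = torch.tensor([1.0, 0.0, 0.0, 1.0])

# Error rate: fraction of wrong predictions. The thresholding step makes it
# piecewise constant, so its gradient is zero almost everywhere and cannot
# drive gradient-based training.
preds = (torch.sigmoid(logits) > 0.5).float()
error_rate = (preds != labels).float().mean()

# Surrogate objective: a differentiable loss (binary cross-entropy) that we
# optimize in place of the error rate.
surrogate = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)
surrogate.backward()  # gradients flow through the surrogate, not the error rate

print(error_rate.item(), surrogate.item())
```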

Let me know if this is clear enough!

Jun '20 ▶ manuel-arno-korfmann

goldpiggy

It can be the accumulated loss or the average loss. It doesn't make much difference here for optimization.
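For example (with made-up per-example losses), the average is just the accumulated loss rescaled by a constant, so minimizing either one gives the same parameters:

```python
losses = [0.3, 0.9, 0.1, 0.5]        # made-up per-example losses
accumulated = sum(losses)            # 1.8
average = sum(losses) / len(losses)  # 0.45, i.e., the accumulated loss scaled by 1/n
# Scaling an objective by a positive constant does not move its minimum.
```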

Jun '20

syedmech47

Thanks a lot. It totally made sense.

Side Note: I just want to thank each and every person’s effort in making this wonderful resource open for all and also providing such wonderful support through discussion forums.

1 reply
Jun '20 ▶ syedmech47

goldpiggy

Fantastic! It's our pleasure to enable more people to learn, apply, and benefit from deep learning!

Aug '20

manuel-arno-korfmann

The last exercise mentions "the end-to-end learning approach", but the section never explains what "end-to-end learning" is.

1 reply
Aug '20 ▶ manuel-arno-korfmann

goldpiggy

Great call @manuel-arno-korfmann. I suspect it refers to "Fig. 1.1.2: A typical training process".

Aug '20 ▶ goldpiggy

zeuslawyer

Hi @goldpiggy, I'm reading this thread and I'd like to clarify further. My understanding is:

  1. there is a difference between the error rate and the cost/loss function;

  2. the error rate is the number of prediction errors on a given set of inputs X, perhaps the total number of errors divided by the total number of examples in the input data;

  3. the loss function is a quantification of the "distance" between right and wrong predictions, which is different from the error rate, i.e., the percentage of errors described in 2 above?

Thank you for these resources and your guidance.

2 replies
Aug '20 ▶ zeuslawyer

meetashok

@zeuslawyer

My understanding is as follows -

Ashok

Aug '20

meetashok

Tailors have developed a small number of parameters that describe human body shape fairly accurately for the purpose of fitting clothes. These problems are referred to as subspace estimation problems. If the dependence is linear, it is called principal component analysis.

In the above sentence, what is the word "dependence" referring to? Dependence between what and what?

Ashok

1 reply
Aug '20

goldpiggy

Yes.

Yes, we also refer to it as the average error rate.

Sometimes the two can be the same, if the error rate function is differentiable. Most of the time it is not: for example, most classification error rates are not differentiable, so we use a loss function instead. Check more details here.

Aug '20 ▶ meetashok

goldpiggy

Hi @meetashok, great question! The dependence between the original data and the principal components. PCA transforms the data into a new coordinate system (ordered by the principal components) via an orthogonal linear transformation; it tries to capture the data's main directions of variation and recreate the data from those new features.
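Roughly, in code (an illustrative sketch using NumPy's SVD, not the book's implementation; the data is random):

```python
import numpy as np

# Made-up data: 100 samples, 5 original features.
X = np.random.randn(100, 5)

# Center the data, then take the SVD; the rows of Vt are the principal
# directions (an orthogonal basis), ordered by explained variance.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project onto the top two principal components: an orthogonal linear map of
# the original features into a lower-dimensional subspace.
Z = X_centered @ Vt[:2].T               # shape (100, 2)

# Linear reconstruction of the data from just those two components.
X_approx = Z @ Vt[:2] + X.mean(axis=0)  # shape (100, 5)
```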

Feb '21

rzwck

The text mentions "from first principles" a few times. What does that really mean here?

2 replies
Apr '21

jioyoung

"As we will see later, this loss corresponds to the assumption that our data were corrupted by Gaussian noise ". This sentence appears in the Regression subsection. However, to my knowledge, the Gaussian noise assumption is not necessary for the least-square method for linear regression. The Gaussian assumption is very useful for statistical inference but not necessary for parameter estimation.

May '21 ▶ rzwck

bravi

First-principles thinking is the act of boiling a process down to the fundamental parts that you know are true and building up from there.

First Principles: Elon Musk on the Power of Thinking for Yourself (jamesclear.com)

Jun '21

roncato

This may be a nitpick. Regarding the drain repair contractor example, when it says "if some of the variance owes to a few factors besides your two features": here we have only one feature, the hours worked by the contractor, and two parameters, the hourly rate and the bias, i.e., y = ax + b. Since the hourly rate is $100 and the contractor charges $50 to show up, a = 100 and b = 50.
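As a toy sketch (the function name and call are just for illustration):

```python
# Contractor cost model from the example: one feature (hours worked) and
# two parameters, the hourly rate a and the show-up fee b.
def contractor_cost(hours, rate=100.0, fee=50.0):
    return rate * hours + fee

print(contractor_cost(3))  # 100 * 3 + 50 = 350
```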

1 reply
Jun '21 ▶ roncato

goldpiggy

Great catch @roncato! Would you like to be a contributor for us? Many thanks!


Sep '21

kiranrtn

In Section 1.5, you write:

It is evident that random-access memory has not kept pace with the growth in data. At the same time, the increase in computational power has outpaced that of the data available. This means that statistical models need to become more memory efficient (this is typically achieved by adding nonlinearities) while simultaneously being able to spend more time on optimizing these parameters, due to an increased computational budget.

Can you kindly explain how statistical models can become memory efficient by “adding non-linearities”? Is memory efficiency the reason for adding non-linearities or a good side effect of adding non-linearity?

Jul '22

MahdiYousefi

The following sentence is missing its citation.

The breakthrough deep Q-network that beat humans at Atari games using only the visual input .

It seems you intended to cite, but maybe it didn’t compile properly?

Sep '22 ▶ manuel-arno-korfmann

matt_johnson

The error rate is the probability your model makes a mistake and is used to measure how good your model is on a classification problem. Here’s a concrete example:

Suppose you trained a model to predict whether an image is a cat or a dog. Take 100 images with known cat/dog labels that your model was never trained on. Let's take the first image (suppose it's a cat) and your model predicts dog. The error would be 1. Suppose the next image is a dog and your model predicts a dog. The error would be 0. You keep doing this and sum up the errors. Finally, divide by the total number of examples you predicted on (100 in this case). That is the error rate. It's a number between 0 and 1. Suppose the error rate is 0.25, or 25%. It means there's a 25% chance your model will make a mistake if you randomly select an image from your test dataset.

A few more things to know. For your model to make a decision, you had to pick a threshold. A neural network for a classification problem will produce a probability that the image is a cat (say 30%). But this is just a probability! Next, you need to pick a threshold. So you might pick a threshold of, say, 50% and decide that if the model emits a probability of 50% or greater, you will call the image a cat. At this point you might wonder: is there a way to measure how good my model is without having to pick a threshold? In fact there is! That's what ROC AUC is all about, which you can read more about here: Understanding ROC AUC (part 1/2). Introduction | by matt johnson | Building Ibotta | Medium
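For instance, a minimal sketch of that calculation (the probabilities and labels below are made up):

```python
import numpy as np

# Made-up test set: predicted probabilities of "cat" and true labels (1 = cat, 0 = dog).
probs  = np.array([0.30, 0.80, 0.65, 0.10, 0.55])
labels = np.array([1,    1,    0,    0,    1])

# Pick a threshold to turn probabilities into hard predictions.
threshold = 0.5
preds = (probs >= threshold).astype(int)

# Error rate: number of mistakes divided by the number of examples.
error_rate = (preds != labels).mean()
print(error_rate)  # 2 of the 5 predictions are wrong -> 0.4
```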

1 reply
Nov '22

Shawn_Xiao

Data is the basis, computation runs on it, and elegant algorithms can help it run better.

Mar '23 ▶ matt_johnson

R8a

This is such a brilliant explanation. Thanks!!

Oct '23

Gabriel_Fernandez

This is amazing! Thank you so much! I did the exercises and it's been very enjoyable.

Nov '23

DanDan-Xia

Hey folks, just wondering if anyone’s come across a resource for exercise solutions for each chapter in ‘Dive into Deep Learning’? Any leads would be awesome!

Jan '24

surtakur.career

Hi Dan, did you get any answers? If so, please share; I am also looking for resources.

May '24

Mohamed_Ahmed_Naji

Are we going to learn high-level tools?
Will reading this book enable me to do research and become a scientist in the field? Will I be able to read research papers?

Jun '24

filipv

My answers to the exercises:

  1. I recently created a Discord bot to prevent spam using simple heuristics – Levenshtein distance from suspicious words, frequent messaging, and the like. Learning from examples of manually annotated spam could be used for more intelligent content-based prevention.
  2. Implementing software based on a high-level description. This may take a few years, but seems possible.
  3. Computation is used to apply algorithms to data. As compute becomes cheaper, simpler algorithms that can more effectively leverage greater amounts of compute will outperform other approaches (as described by Richard Sutton).
  4. Possibly in spaced repetition systems – Anki uses the SuperMemo 2 algorithm, but leveraging end-to-end training might allow for repetition that more appropriately adapts to a given user and the content they're studying.