Jun '20

manuel-arno-korfmann

Others (like error rate) are difficult to optimize directly, owing to non-differentiability or other complications. In these cases, it is common to optimize a surrogate objective

It’s not quite clear to me from reading what exactly is meant with “error rate”. I think it would be great if an example could be given.

2 replies
Jun '20 ▶ manuel-arno-korfmann

goldpiggy

Hi @manuel-arno-korfmann, "error rate" means how many mistakes the model makes. Is that clearer?

3 replies
Jun '20 ▶ goldpiggy

syedmech47

I am still unable to understand the error rate.
"How many mistakes the model makes" is not clear enough. Do you mean the distance between the truth and the prediction, i.e., the L1 distance |y - ŷ|?
Also, can you please explain what a surrogate objective is?

1 reply
Jun '20

manuel-arno-korfmann

I’m having a difficult time understanding

Hence, the loss $L$ incurred by eating the mushroom is $L(a=\text{eat} \mid x) = 0.2 \times \infty + 0.8 \times 0 = \infty$, whereas the cost of discarding it is $L(a=\text{discard} \mid x) = 0.2 \times 0 + 0.8 \times 1 = 0.8$.

Is it possible to explain it in more depth via 1 or 2 paragraphs?

Jun '20 ▶ goldpiggy

manuel-arno-korfmann

Ok, so a person in the reading group explained that the error rate is the accumulated loss for all examples, is that correct?

1 reply
Jun '20 ▶ syedmech47

goldpiggy

Hey @syedmech47, sorry for the confusion earlier. Yes, you got the idea: the error rate measures the distance between $y$ (the truth) and $\hat{y}$ (the estimate). However, the metric that measures the error is not limited to the L1 distance; it can also be accuracy, precision, recall, F1, etc.

A surrogate is a function that approximates an objective function. Many of these metrics are not differentiable (F1, for example), so we need some other function (i.e., the loss function) to approximate the objective and optimize that instead.
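Here is a rough sketch of the idea in code (not from the book; the tensors and numbers are made up for illustration):

```python
import torch

# Toy binary-classification example (made-up logits and labels).
logits = torch.tensor([2.0, -1.0, 0.5, -0.3], requires_grad=True)
labels = torch.tensor([1.0, 0.0, 0.0, 1.0])

# Error rate: fraction of wrong predictions. The thresholding step makes it
# piecewise constant, so its gradient is zero almost everywhere and cannot
# drive gradient-based training.
preds = (torch.sigmoid(logits) > 0.5).float()
error_rate = (preds != labels).float().mean()

# Surrogate objective: a differentiable loss (binary cross-entropy) that we
# optimize in place of the error rate.
surrogate = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)
surrogate.backward()  # gradients flow through the surrogate, not the error rate

print(error_rate.item(), surrogate.item())
```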

Let me know if this is clear enough!

Jun '20 ▶ manuel-arno-korfmann

goldpiggy

It can be the accumulated loss or the average loss. It doesn't make much difference here for optimization.
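For example (with made-up per-example losses), the average is just the accumulated loss rescaled by a constant, so minimizing either one gives the same parameters:

```python
losses = [0.3, 0.9, 0.1, 0.5]        # made-up per-example losses
accumulated = sum(losses)            # 1.8
average = sum(losses) / len(losses)  # 0.45, i.e., the accumulated loss scaled by 1/n
# Scaling an objective by a positive constant does not move its minimum.
```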

Jun '20

syedmech47

Thanks a lot. It totally made sense.

Side Note: I just want to thank each and every person’s effort in making this wonderful resource open for all and also providing such wonderful support through discussion forums.

1 reply
Jun '20 ▶ syedmech47

goldpiggy

Fantastic! It's our pleasure to enable more people to learn, apply, and benefit from deep learning!

Aug '20

manuel-arno-korfmann

The last exercise mentions "the end-to-end learning approach", but the section never explains what "end-to-end learning" is.

1 reply
Aug '20 ▶ manuel-arno-korfmann

goldpiggy

Great call @manuel-arno-korfmann. I suspect it refers to "Fig. 1.1.2: A typical training process".

Aug '20 ▶ goldpiggy

zeuslawyer

Hi @goldpiggy, I'm reading this thread and I'd like to clarify further. My understanding is:

  1. there is a difference between the error rate and the cost/loss function;

  2. the error rate is the number of prediction errors on a given set of inputs X, perhaps the total number of errors divided by the total number of examples in the input data;

  3. the loss function is a quantification of the "distance" between right and wrong predictions, which is different from the error rate, i.e., the percentage of errors described in 2 above?

Thank you for these resources and your guidance.

2 replies
Aug '20 ▶ zeuslawyer

meetashok

@zeuslawyer

My understanding is as follows -

Ashok

Aug '20

meetashok

Tailors have developed a small number of parameters that describe human body shape fairly accurately for the purpose of fitting clothes. These problems are referred to as subspace estimation problems. If the dependence is linear, it is called principal component analysis.

In the above sentence, what is the word "dependence" referring to? Dependence between what and what?

Ashok

1 reply
Aug '20

goldpiggy

Yes.

Yes, we also refer to it as the average error rate.

Sometimes the two can be the same, if the error rate function is differentiable. Most of the time it is not: for example, most classification error rates are not differentiable, so we use a loss function instead. Check more details here.

Aug '20 ▶ meetashok

goldpiggy

Hi @meetashok, great question! The dependence between the original data and the principal components. PCA transforms the data into a new coordinate system (ordered by the principal components) via an orthogonal linear transformation; it tries to capture the data's main directions of variation and recreate the data from those new features.
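Roughly, in code (an illustrative sketch using NumPy's SVD, not the book's implementation; the data is random):

```python
import numpy as np

# Made-up data: 100 samples, 5 original features.
X = np.random.randn(100, 5)

# Center the data, then take the SVD; the rows of Vt are the principal
# directions (an orthogonal basis), ordered by explained variance.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project onto the top two principal components: an orthogonal linear map of
# the original features into a lower-dimensional subspace.
Z = X_centered @ Vt[:2].T               # shape (100, 2)

# Linear reconstruction of the data from just those two components.
X_approx = Z @ Vt[:2] + X.mean(axis=0)  # shape (100, 5)
```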

Feb '21

rzwck

The text mentions "from first principles" a few times. What does that really mean here?

2 replies
Apr '21

jioyoung

"As we will see later, this loss corresponds to the assumption that our data were corrupted by Gaussian noise ". This sentence appears in the Regression subsection. However, to my knowledge, the Gaussian noise assumption is not necessary for the least-square method for linear regression. The Gaussian assumption is very useful for statistical inference but not necessary for parameter estimation.

May '21 ▶ rzwck

bravi

First-principles thinking is the act of boiling a process down to the fundamental parts that you know are true and building up from there.

First Principles: Elon Musk on the Power of Thinking for Yourself (jamesclear.com)

Jun '21

roncato

This may be a nitpick. Regarding the drain repair contractor example, when it says "if some of the variance owes to a few factors besides your two features": here we have only one feature, the hours worked by the contractor, and two parameters, the hourly rate and the bias, i.e., y = ax + b. Since the hourly rate is $100 and the contractor charges $50 to show up, a = 100 and b = 50.
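As a toy sketch (the function name and call are just for illustration):

```python
# Contractor cost model from the example: one feature (hours worked) and
# two parameters, the hourly rate a and the show-up fee b.
def contractor_cost(hours, rate=100.0, fee=50.0):
    return rate * hours + fee

print(contractor_cost(3))  # 100 * 3 + 50 = 350
```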

1 reply
Jun '21 ▶ roncato

goldpiggy

Great catch @roncato! Would you like to be a contributor for us? Many thanks!


Sep '21

kiranrtn

In Section 1.5, you write:

It is evident that random-access memory has not kept pace with the growth in data. At the same time, the increase in computational power has outpaced that of the data available. This means that statistical models need to become more memory efficient (this is typically achieved by adding nonlinearities) while simultaneously being able to spend more time on optimizing these parameters, due to an increased computational budget.

Can you kindly explain how statistical models can become memory efficient by “adding non-linearities”? Is memory efficiency the reason for adding non-linearities or a good side effect of adding non-linearity?

Jul '22

MahdiYousefi

The following sentence is missing its citation.

The breakthrough deep Q-network that beat humans at Atari games using only the visual input .

It seems you intended to cite, but maybe it didn’t compile properly?

Sep '22 ▶ manuel-arno-korfmann

matt_johnson

The error rate is the probability your model makes a mistake and is used to measure how good your model is on a classification problem. Here’s a concrete example:

Suppose you trained a model to predict whether an image is a cat or a dog. Take 100 images with known cat/dog labels that your model was never trained on. Let's take the first image (suppose it's a cat) and your model predicts dog. The error would be 1. Suppose the next image is a dog and your model predicts a dog. The error would be 0. You keep doing this and sum up the errors. Finally, divide by the total number of examples you predicted on (100 in this case). That is the error rate. It's a number between 0 and 1. Suppose the error rate is 0.25, or 25%. It means there's a 25% chance your model will make a mistake if you randomly select an image from your test dataset.

A few more things to know. For your model to make a decision, you had to pick a threshold. A neural network for a classification problem will produce a probability that the image is a cat (say 30%). But this is just a probability! Next, you need to pick a threshold. So you might pick a threshold of, say, 50% and decide that if the model emits a probability of 50% or greater, you will call the image a cat. At this point you might wonder: is there a way to measure how good my model is without having to pick a threshold? In fact there is! That's what ROC AUC is all about, which you can read more about here: Understanding ROC AUC (part 1/2). Introduction | by matt johnson | Building Ibotta | Medium
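For instance, a minimal sketch of that calculation (the probabilities and labels below are made up):

```python
import numpy as np

# Made-up test set: predicted probabilities of "cat" and true labels (1 = cat, 0 = dog).
probs  = np.array([0.30, 0.80, 0.65, 0.10, 0.55])
labels = np.array([1,    1,    0,    0,    1])

# Pick a threshold to turn probabilities into hard predictions.
threshold = 0.5
preds = (probs >= threshold).astype(int)

# Error rate: number of mistakes divided by the number of examples.
error_rate = (preds != labels).mean()
print(error_rate)  # 2 of the 5 predictions are wrong -> 0.4
```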

1 reply
Nov '22

Shawn_Xiao

Data is the basis, computation runs on it, and elegant algorithms can help it run better.

Mar '23 ▶ matt_johnson

R8a

This is such a brilliant explanation. Thanks!!

Oct '23

Gabriel_Fernandez

This is amazing! Thank you so much! I did the exercises and it's been very enjoyable.

Nov '23

DanDan-Xia

Hey folks, just wondering if anyone’s come across a resource for exercise solutions for each chapter in ‘Dive into Deep Learning’? Any leads would be awesome!

Jan '24

surtakur.career

Hi Dan, did you get any answers? If so, please share; I am also looking for resources.

May '24

Mohamed_Ahmed_Naji

Are we going to learn high-level tools?
Will reading this book enable me to do research and become a scientist in the field? Will I be able to read research papers?

Jun '24

filipv

My answers to the exercises:

  1. I recently created a Discord bot to prevent spam using simple heuristics – Levenshtein distance from suspicious words, frequent messaging, and the like. Learning from examples of manually annotated spam could be used for more intelligent content-based prevention.
  2. Implementing software based on a high-level description. This may take a few years, but seems possible.
  3. Computation is used to apply algorithms to data. As compute becomes cheaper, simpler algorithms that can more effectively leverage greater amounts of compute will outperform other approaches (as described by Richard Sutton).
  4. Possibly in spaced repetition systems – Anki uses the SuperMemo 2 algorithm, but leveraging end-to-end training might allow for repetition that more appropriately adapts to a given user and the content they're studying.