Long Short-Term Memory (LSTM)

astonzhang · September 17, 2020, 4:42am

https://d2l.ai/chapter_recurrent-modern/lstm.html

thainq · February 22, 2022, 2:19am

I have a question: why do LSTMs and GRUs overcome the vanishing gradient problem?

sanjaradylov · February 22, 2022, 2:58pm

@thainq I think the intro of 9.1 answers your question:

… long products of matrices can lead to vanishing or exploding gradients. Let us briefly think about what such gradient anomalies mean in practice:

We might encounter a situation where an early observation is highly significant for predicting all future observations. Consider the somewhat contrived case where the first observation contains a checksum and the goal is to discern whether the checksum is correct at the end of the sequence. In this case, the influence of the first token is vital. We would like to have some mechanisms for storing vital early information in a memory cell. Without such a mechanism, we will have to assign a very large gradient to this observation, since it affects all the subsequent observations.

We might encounter situations where some tokens carry no pertinent observation. For instance, when parsing a web page there might be auxiliary HTML code that is irrelevant for the purpose of assessing the sentiment conveyed on the page. We would like to have some mechanism for skipping such tokens in the latent state representation.

We might encounter situations where there is a logical break between parts of a sequence. For instance, there might be a transition between chapters in a book, or a transition between a bear and a bull market for securities. In this case it would be nice to have a means of resetting our internal state representation.

thainq · February 23, 2022, 1:32am

Yes, but it would be better to have a mathematical explanation of them.
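
One way to make this mathematical is to follow the gradient along the memory cell. The block below is a rough sketch, not a full proof: it uses the memory cell update from the section, writes lowercase vectors for a single example, and assumes the gates can be treated as constants with respect to the previous cell state (that is, it ignores the indirect path through H_{t-1}). The symbols z_j and φ for the plain RNN are notation introduced here for the comparison.

```latex
% Memory cell update from the section (\odot is the elementwise product):
\mathbf{C}_t = \mathbf{F}_t \odot \mathbf{C}_{t-1} + \mathbf{I}_t \odot \tilde{\mathbf{C}}_t

% Assumption: gates treated as constants w.r.t. the previous cell state.
% For a single example (row vector c_t):
\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} \approx \operatorname{diag}(\mathbf{f}_t)
\quad\Longrightarrow\quad
\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_k} \approx \prod_{j=k+1}^{t} \operatorname{diag}(\mathbf{f}_j)

% Plain RNN for comparison, with pre-activation
% z_j = x_j W_{xh} + h_{j-1} W_{hh} + b_h and h_j = \phi(z_j).
% The factors are ordered from j = t down to j = k+1:
\frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_k}
= \prod_{j=k+1}^{t} \operatorname{diag}\!\big(\phi'(\mathbf{z}_j)\big)\, \mathbf{W}_{hh}^\top
```

In the plain RNN the same matrix W_hh is multiplied in at every step, so the product tends to shrink or blow up with the sequence length. Along the cell state the factors are the forget gate values, which lie in (0, 1) and are computed per time step: when the forget gate stays close to 1 the gradient is passed back almost unchanged, and when it is close to 0 the gradient is cut off where the model chose to forget. This eases vanishing gradients but does not rule them out, and it does not address exploding gradients.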

iAmKankan · February 24, 2022, 3:15pm

What does “short-term input skipping in latent variable models has existed for a long time” mean? Could someone please explain? Thanks in advance.

Konne · July 18, 2022, 12:48pm

Are the dimensions of the bias vectors, e.g. b_f (1×h), correct? They don’t seem to match the resulting F_t matrix (n×h).
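
A quick shape check may help with the bias question. The sketch below uses NumPy rather than the book's framework, and the sizes n, d, h and the random values are arbitrary assumptions chosen only for illustration. It suggests that a (1, h) bias is compatible with an (n, h) result because broadcasting adds the same bias row to every example in the minibatch.

```python
import numpy as np

# Arbitrary sizes for illustration: batch size, input features, hidden units.
n, d, h = 4, 3, 5
rng = np.random.default_rng(0)

X_t = rng.normal(size=(n, d))      # minibatch input
H_prev = rng.normal(size=(n, h))   # previous hidden state
W_xf = rng.normal(size=(d, h))
W_hf = rng.normal(size=(h, h))
b_f = rng.normal(size=(1, h))      # one bias row shared by all examples


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


pre_activation = X_t @ W_xf + H_prev @ W_hf   # shape (n, h)

# Broadcasting stretches the (1, h) bias across the n rows.
F_t = sigmoid(pre_activation + b_f)

# The same computation with the bias copied explicitly to shape (n, h).
F_t_tiled = sigmoid(pre_activation + np.tile(b_f, (n, 1)))

print(F_t.shape, np.allclose(F_t, F_t_tiled))
```

So the bias keeps its (1, h) shape as a parameter, and the (n, h) shape of F_t comes from the minibatch dimension of the inputs.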

IamTC · August 14, 2022, 2:01pm

Adding to this good piece of writing on LSTM

Good video for intuition: Illustrated Guide to LSTM's and GRU's: A step by step explanation - YouTube
The video led me to this piece of writing by Chris Olah: Understanding LSTM Networks -- colah's blog

pandalabme · September 5, 2023, 9:03am

My solutions to the exercises: 10.1

WMY-AAA · October 7, 2023, 10:58am

“@d2l.add_to_class(LSTMScratch)
def forward(self, inputs, H_C=None):
    …
    return outputs, (H, C)”
I got an error here. Should it be “return outputs, H_C”?
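
Without the elided body it is hard to say what caused the error, but the difference between the two return values can be illustrated with a toy. The function below is hypothetical and is not the book's LSTM: `toy_forward` and its scalar updates are stand-ins, and it assumes the common pattern in which `H_C` is only the state passed in, while `H` and `C` are updated inside the loop. It returns `H_C` as a third value purely so the two can be compared.

```python
def toy_forward(inputs, H_C=None):
    """Toy scalar recurrence (hypothetical); only the state handling matters."""
    if H_C is None:
        H, C = 0.0, 0.0          # fresh initial state
    else:
        H, C = H_C               # unpack the state passed by the caller
    outputs = []
    for x in inputs:
        C = 0.5 * C + x          # stand-in for the memory cell update
        H = 0.1 * C              # stand-in for the hidden state update
        outputs.append(H)
    # (H, C) is the updated state; H_C is still whatever was passed in.
    return outputs, (H, C), H_C


outputs, new_state, passed_in = toy_forward([1.0, 2.0])
print(new_state, passed_in)      # updated state vs. the untouched argument

# Feeding the updated state back in continues the sequence.
outputs2, state2, passed_in2 = toy_forward([3.0], new_state)
print(state2, passed_in2)
```

Under that assumption, returning `H_C` would hand back `None` on the first call and the stale input state afterwards, so `(H, C)` is the value that lets the next call continue from where this one stopped.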

qqfox · May 11, 2025, 6:31am

The figures mention concatenation in, for example, X_t W_xi + H_{t-1} W_hi. However, judging by the dimensions of the results, the + symbol should be element-wise addition, right?
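
The two readings can be reconciled numerically. The sketch below uses NumPy with arbitrary sizes and random values (all assumptions for illustration, not taken from the book's code). It compares the sum of the two products, where + adds two (n, h) matrices entry by entry, with a single product of the concatenated inputs and vertically stacked weights.

```python
import numpy as np

# Arbitrary sizes for illustration: batch size, input features, hidden units.
n, d, h = 4, 3, 5
rng = np.random.default_rng(0)

X_t = rng.normal(size=(n, d))
H_prev = rng.normal(size=(n, h))
W_xi = rng.normal(size=(d, h))
W_hi = rng.normal(size=(h, h))

# Reading 1: two matrix products, added entry by entry. Both terms are (n, h).
sum_form = X_t @ W_xi + H_prev @ W_hi

# Reading 2: concatenate inputs along features and stack weights along rows.
XH = np.concatenate([X_t, H_prev], axis=1)         # shape (n, d + h)
W_stacked = np.concatenate([W_xi, W_hi], axis=0)   # shape (d + h, h)
concat_form = XH @ W_stacked                       # shape (n, h)

print(sum_form.shape, np.allclose(sum_form, concat_form))
```

Both forms give the same (n, h) matrix, so a figure that shows concatenation and an equation that shows a sum of products can describe the same computation.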