I have a question: why do LSTMs and GRUs overcome the vanishing gradient problem?
… long products of matrices can lead to vanishing or exploding gradients. Let us briefly think about what such gradient anomalies mean in practice:
- We might encounter a situation where an early observation is highly significant for predicting all future observations. Consider the somewhat contrived case where the first observation contains a checksum and the goal is to discern whether the checksum is correct at the end of the sequence. In this case, the influence of the first token is vital. We would like to have some mechanisms for storing vital early information in a memory cell. Without such a mechanism, we will have to assign a very large gradient to this observation, since it affects all the subsequent observations.
- We might encounter situations where some tokens carry no pertinent observation. For instance, when parsing a web page there might be auxiliary HTML code that is irrelevant for the purpose of assessing the sentiment conveyed on the page. We would like to have some mechanism for skipping such tokens in the latent state representation.
- We might encounter situations where there is a logical break between parts of a sequence. For instance, there might be a transition between chapters in a book, or a transition between a bear and a bull market for securities. In this case it would be nice to have a means of resetting our internal state representation.
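To make the three cases above concrete, here is a minimal NumPy sketch of a GRU-style update. It is only an illustration, not the book's implementation: the function name `gru_like_step`, the weight shapes, and the hand-set gate values `z` (update gate) and `r` (reset gate) are all assumptions. In a real GRU the gates are computed from the input and the previous state rather than fixed by hand.

```python
import numpy as np

def gru_like_step(h_prev, x, W_xh, W_hh, z, r):
    """One GRU-style step with hand-set gates (hypothetical helper)."""
    # Candidate state: the reset gate r scales how much of the old state
    # is allowed to influence the new proposal.
    candidate = np.tanh(x @ W_xh + (r * h_prev) @ W_hh)
    # The update gate z interpolates between the old state and the candidate.
    return z * h_prev + (1.0 - z) * candidate

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(3, 2))   # input-to-hidden weights (assumed shapes)
W_hh = rng.normal(size=(2, 2))   # hidden-to-hidden weights
h_prev = np.array([0.9, -0.4])   # state holding some early information
x = rng.normal(size=3)           # a new token

# z = 1: the token is skipped and the stored state is carried forward unchanged.
skipped = gru_like_step(h_prev, x, W_xh, W_hh, z=1.0, r=1.0)

# z = 0 and r = 0: the old state is discarded, as at a chapter break.
reset = gru_like_step(h_prev, x, W_xh, W_hh, z=0.0, r=0.0)

# What a state computed from the current input alone would look like.
no_memory = np.tanh(x @ W_xh)
```

With `z = 1` the result equals `h_prev` exactly, which covers both storing early information and skipping irrelevant tokens. With `z = 0` and `r = 0` the result equals `no_memory`, which is the reset case.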
Yes, but it would be better to have a mathematical explanation of these mechanisms.
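As a rough sketch of the kind of argument being asked for, compare how the gradient flows back through time in a plain RNN and in an LSTM. The notation here is assumed, following the usual conventions: $\mathbf{h}_t$ is the hidden state, $\mathbf{c}_t$ the memory cell, $\mathbf{f}_t$ and $\mathbf{i}_t$ the forget and input gates, and $\tilde{\mathbf{c}}_t$ the candidate cell. For a plain RNN with $\mathbf{h}_t = \tanh(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{xh}\mathbf{x}_t)$:

$$
\frac{\partial \mathbf{h}_T}{\partial \mathbf{h}_1}
= \prod_{t=2}^{T} \operatorname{diag}\!\left(1 - \mathbf{h}_t^2\right)\mathbf{W}_{hh}
$$

This is the long product of matrices mentioned above: the same $\mathbf{W}_{hh}$ appears in every factor, so the product tends to shrink or blow up as $T$ grows. For an LSTM the cell is updated additively, $\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$. If we look only at the direct path through the cell and ignore the indirect dependence of the gates on $\mathbf{h}_{t-1}$:

$$
\frac{\partial \mathbf{c}_T}{\partial \mathbf{c}_1}
\approx \prod_{t=2}^{T} \operatorname{diag}(\mathbf{f}_t)
$$

No repeated weight matrix appears on this path, and when the network learns forget gates close to 1 the product stays close to the identity, so the gradient from an early cell can survive many steps. The GRU update $\mathbf{h}_t = \mathbf{z}_t \odot \mathbf{h}_{t-1} + (1 - \mathbf{z}_t) \odot \tilde{\mathbf{h}}_t$ behaves similarly when $\mathbf{z}_t$ is close to 1. Note that this mitigates vanishing gradients rather than eliminating them: if the gates stay well below 1, the product still decays.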
What does “short-term input skipping in latent variable models has existed for a long time” mean? Could someone please explain? Thanks in advance.