Submitted by fromnighttilldawn t3_y11a7r in MachineLearning
_Arsenie_Boca_ t1_irvjdtn wrote
I don't have the papers that investigate this on hand, but here are two things that don't make me proud of being part of this field.
Are transformers really architecturally better than LSTMs or is their success mainly due to the huge amount of compute and data we throw at them? More generally, papers tend to make many changes to a system and credit the improvement to the thing they are most proud of without a fair comparison.
Closed-source models like GPT-3 don't make their training datasets public. People evaluate performance on benchmarks, but nobody can say for sure whether the benchmark data was in the training data. ML used to be very cautious about data leakage, but this is simply ignored in most cases when it's about those models.
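For readers wondering what such a leakage check could look like when the training corpus is available, here is a minimal sketch based on word n-gram overlap. The function names, the whitespace tokenization, and the default `n=8` are illustrative assumptions, not the procedure of any particular lab or paper.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in text (naive whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_examples, training_docs, n=8):
    """Fraction of benchmark examples sharing at least one n-gram with the
    training documents, plus the flagged examples themselves."""
    if not benchmark_examples:
        return 0.0, []
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    flagged = [ex for ex in benchmark_examples if ngrams(ex, n) & train_grams]
    return len(flagged) / len(benchmark_examples), flagged


# Toy data, purely hypothetical.
training_docs = ["the quick brown fox jumps over the lazy dog"]
benchmark = ["a quick brown fox appears", "completely unrelated sentence here now"]
rate, flagged = contamination_rate(benchmark, training_docs, n=3)
print(rate, flagged)
```

Without access to the training data, a check like this cannot be run at all, which is the point of the complaint above.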
harharveryfunny t1_irvssm1 wrote
It seems transformers really have two fundamental advantages over LSTMs:
- By design (specifically to improve over the shortcomings of recurrent models), they are much more efficient to train since samples can be presented in parallel. Also, positional encoding allows transformers to more accurately deal with positional structure which is critical for language.
- Transformers scale up very successfully. Per Rich Sutton's "Bitter Lesson", generally dumb methods that scale up in terms of ability to usefully absorb compute and data do better than more highly engineered "smart" methods. I wouldn't argue that transformers are any simpler in architecture than LSTMs, but as GPT-3 proved they do scale very successfully - increasing performance while still being relatively easy to train.
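As a small illustration of the positional encoding point: the comment does not say which scheme it has in mind, so the sketch below assumes the fixed sinusoidal encoding from the original Transformer paper. The function name is made up for this example.

```python
import math


def sinusoidal_position(pos, d_model):
    """Sinusoidal encoding vector for one position (assumed scheme):
    even dimensions use sin, odd dimensions use cos, and each pair of
    dimensions shares a wavelength that grows geometrically."""
    vec = []
    for k in range(d_model):
        i = k // 2  # index of the sin/cos pair
        angle = pos / (10000 ** (2 * i / d_model))
        vec.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return vec


# Every position can be encoded independently, so a whole sequence can be
# prepared and fed to the model at once instead of step by step.
encodings = [sinusoidal_position(p, d_model=4) for p in range(3)]
print(encodings[0])
```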
The context of your criticism is still valid though. Not sure whether it's fair or not, but I tend to look at DeepMind's recent matrix multiplication paper like that - they are touting it as a success of "AI" and RL, when really it's not at all apparent what RL is adding here. Surely the tensor factorization space could equally well have been explored by other techniques such as evolution or even just MCTS.
sambiak t1_irwzqdv wrote
> Surely the tensor factorization space could equally well been explored by other techniques such as evolution or even just MCTS.
I think you're underestimating the difficulty of exploring an enormous state space. The state space of this problem is bigger than the one in Go or chess.
Reinforcement Learning specializes in finding good solutions when only a small subset of state space can be explored. You're quite right that Monte Carlo Tree Search would work here because that's exactly what they used ^ ^
> Similarly to AlphaZero, AlphaTensor uses a deep neural network to guide a Monte Carlo tree search (MCTS) planning procedure.
That said, you do need a good way to guide this MCTS, and a neural network is a great solution to evaluate how good a given state is. But then you've got a new problem: how do you train this neural network? And so on. It's not trivial, and frankly even the best tools have quite some weaknesses.
But no, evolutionary algorithms would not be easier, because you still need a fitness function. Once again you can use neural networks to approximate it, but then you run into the same training issues. As far as I know, evolutionary algorithms are just worse than MCTS at the moment, until someone figures out a better way to approximate fitness functions.
csreid t1_irxfue3 wrote
Imo, transformers are significantly less simple and more "hand-crafted" than LSTMs.
The point of the bitter lesson, I think, is that trying to be clever ends up biting you, and eventually compute will reach a point where you can just learn it. Cross attention and all this special architecture to help a model capture intraseries information is definitely being clever when compared to LSTMs (or RNNs in general), which just give the network a way to keep some information around when presented with things in series.
harharveryfunny t1_irxuxr9 wrote
Yes, I agree about the relative complexity (not that an LSTM doesn't also have a fair bit of structure), but the bitter lesson requires an approach that above all else will scale, which transformers do.
I think many people, myself included, were surprised with the emergent capabilities of GPT-3 and derivatives such as OpenAI Codex ... of course it makes sense how much domain knowledge (about fairy tales, programming, etc, etc) is needed to be REALLY REALLY good at "predict next word", but not at all obvious that something as relatively simple as a transformer was sufficient to learn that.
At the end of the day any future architecture capable of learning intelligent behavior will have to have some amount of structure - it needs to be a learning machine, and that machine needs some cogs. Is the transformer more complex than necessary for what it is capable of learning? I'm not sure - it's certainly conceptually pretty minimal.
elbiot t1_irwyleo wrote
The fact that you can throw a bunch of compute at transformers is part of their superiority. Even if it's the only factor, it's really important.
_Arsenie_Boca_ t1_irx1ubl wrote
That's definitely a fair point (although you can do that with recurrent models as well, see the reddit link in my other comment). Anyway, the more general point about multiple changes stands; maybe I chose a bad example.
nickkon1 t1_irxid6a wrote
> ML used to be very cautious about data leakage, but this is simply ignored in most cases when its about those models.
I work on economic stuff. Either I am super unlucky or the number of papers that have data leakage is incredibly high. A decent chunk of papers that try to predict some macro-economic data one quarter ahead don't leave a gap of one quarter between their training date and the prediction. Their backtest is awesome, the error is small, nice, a new paper! But it can't be used in production, since how can I train a model on the 01.09.2022 if I need the data from 1st Oct to 31st Dec for my target value?
It is incredibly frustrating. There have been papers, master's theses and even a dissertation that did this. I have stopped trusting anything without code/data.
scarynut t1_irxshd1 wrote
I noticed this on a lot of YouTube stock prediction tutorials. Made me conclude that people are idiots. Shocking that this mistake makes its way into papers..
popcornn1 t1_is03bja wrote
Sorry, but I cannot understand your comment. What do you mean by "don't leave a gap"? How do they make the forecast, then? Training data from January 2021 to December 2021 and then a forecast for October 2021 to December 2021????
nickkon1 t1_is09o1x wrote
A lot of papers, articles, and YouTube videos on time series start from this premise:
Our data is dependent on time. Not only does new data come in regularly, it might also happen that the coefficients of our model change over time and important features in 2020 (e.g. the number of people who are ill with covid) are less relevant now in 2022. To combat that, you retrain your model in regular intervals. Let us retrain our model daily.
That is totally fine and a sensible approach.
The key is: How far into the future do you want to predict something?
A lot of posts on Medium, Towards Data Science, and plenty of other blogs do this: let us try to predict the 7-day return of a stock.
To train a new model today at t_{n}, I would need data from the next week. But since I can't see into the future and do not know the future 7-day return of my stock, I don't have my y variable. The same holds for time step t_{n-1} and so on, until I reach time step t_{n-prediction window}. Only there can I calculate the 7-day return of my stock with the information available today.
This means that the last data point of my training data is always lagging by 7 days from my evaluation date.
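A minimal sketch of that lag, assuming one price per day and a 7-day horizon. The prices are invented and `forward_returns` is a hypothetical helper, not something from a library.

```python
def forward_returns(prices, horizon):
    """horizon-step-ahead return for every index, or None where the
    future price is not yet known."""
    labels = []
    for i in range(len(prices)):
        j = i + horizon
        if j < len(prices):
            labels.append(prices[j] / prices[i] - 1.0)
        else:
            labels.append(None)  # the target still lies in the future
    return labels


# Ten days of made-up prices; the last entry is "today".
prices = [100.0, 101.0, 99.0, 102.0, 104.0, 103.0, 105.0, 107.0, 106.0, 110.0]
labels = forward_returns(prices, horizon=7)
usable_rows = [i for i, y in enumerate(labels) if y is not None]
print(usable_rows)  # only rows at least 7 days old have a target
```

The last 7 rows have features but no label, so they cannot be part of a training set built today.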
The catch: live, this only bites at your most recent data points (specifically the last #{prediction window} data points). If you are writing a blog post or publishing a paper, who cares? You don't really use that model daily for your business anyway. But in a backtest nothing stops you from skipping the gap, because the historical dataset already contains every target: you iterate through each time step t_{i}, take the last 2 years of training data up until t_{i}, and make your prediction.
Your backtest is suddenly a lot better, your error becomes smaller, BAM 80% accuracy on a stock prediction! You beat the live tested performance of your competition! It is a great achievement and let us write a paper about it! But the reality is: Your model is actually unusable in a live setting and the errors you reported from your backtest are wrong. The reason is a subtle way of giving your model info about the future by accident. Throughout the whole backtest you have retrained your model's parameters at time t_{i} with data about your target variable at t_{i+1} to t_{i+prediction_window-1}. You need a gap between your training data and validation/test data.
Specifically in numbers (YYYY-MM-DD):
Wrong:
Training: 2020-10-10 - 2022-10-10
You try to retrain your model on 2022-10-10 and make a prediction on that date.
Correct:
Training: 2020-10-03 - 2022-10-03
You retrain your model on 2022-10-10 and make a prediction on that date. Notice that the last data point of your training data is not today, but today - #{prediction window}
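The same numbers as a small sketch, assuming daily data, a 7-day prediction window and a 730-day (2-year) lookback. Both helper functions are hypothetical names introduced for illustration.

```python
from datetime import date, timedelta


def training_window(eval_date, horizon_days, lookback_days, leak=False):
    """(first, last) training dates for a model retrained on eval_date.

    With leak=False the last training row is eval_date - horizon_days,
    the most recent row whose target is already known on eval_date.
    leak=True reproduces the flawed backtest for comparison.
    """
    last = eval_date if leak else eval_date - timedelta(days=horizon_days)
    first = last - timedelta(days=lookback_days)
    return first, last


def uses_future_targets(train_last, eval_date, horizon_days):
    """True if the target of the last training row is only realised
    after eval_date, i.e. the model was fitted on future information."""
    return train_last + timedelta(days=horizon_days) > eval_date


today = date(2022, 10, 10)
wrong = training_window(today, horizon_days=7, lookback_days=730, leak=True)
correct = training_window(today, horizon_days=7, lookback_days=730)
print(wrong, uses_future_targets(wrong[1], today, 7))
print(correct, uses_future_targets(correct[1], today, 7))
```

In a walk-forward backtest the same check would apply at every t_{i}, not just at the final date.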
CommunismDoesntWork t1_irwxgxk wrote
>Are transformers really architecturally better than LSTMs or is their success mainly due to the huge amount of compute and data we throw at them?
That's like asking if B-trees are actually better than red-black trees, or if modern CPUs and their large caches just happen to lead to better performance. It doesn't matter. If one algorithm works theoretically but doesn't scale, then it might as well not work. It's the same reason no one uses fully connected networks even though they're universal function approximators.
_Arsenie_Boca_ t1_irwzk3j wrote
The point is that you cannot confirm the superiority of an architecture (or whatever component) when you change multiple things. And yes, it does matter where an improvement comes from; isolating it is the only scientifically sound method to improve. Otherwise we might as well try random things until we find something that works.
To come back to LSTMs vs Transformers: I'm not saying LSTMs are better or anything. I'm just saying that if LSTMs would have received the amount of engineering attention that went into making transformers better and faster, who knows if they might be similarly successful?
_Arsenie_Boca_ t1_irx1c80 wrote
In fact, here is a post of someone who apparently found pretty positive results about scaling up recurrent models to billions of parameters https://www.reddit.com/r/MachineLearning/comments/xfup9f/r_rwkv4_scaling_rnn_to_7b_params_and_beyond_with/?utm_source=share&utm_medium=android_app&utm_name=androidcss&utm_term=1&utm_content=share_button
visarga t1_irzdrho wrote
> if LSTMs would have received the amount of engineering attention that went into making transformers better and faster
There was a short period when people were trying to improve LSTMs using genetic algorithms or RL.
- An Empirical Exploration of Recurrent Network Architectures (2015, Sutskever)
- LSTM: A Search Space Odyssey (2015, Schmidhuber)
- Neural Architecture Search with Reinforcement Learning (2016, Quoc Le)
The conclusion was that the LSTM cell is somewhat arbitrary and many other architectures work just as well, but none much better. So people stuck with classic LSTMs.
CommunismDoesntWork t1_is0wj7j wrote
If an architecture is more scalable, then it's the superior architecture.
SleekEagle t1_irx6j3n wrote
I think it's more about the parallelizability of Transformers than anything. For all intents and purposes that makes them better than LSTMs and any recurrent model in general imo.