_Arsenie_Boca_ t1_ityccjh wrote
Reply to comment by AuspiciousApple in [D] What's the best open source model for GPT3-like text-to-text generation on local hardware? by AuspiciousApple
I don't think there is a fundamental difference between CV and NLP. However, we expect language models to be much more generalist than any vision model (have you ever seen a vision model that performs well on discriminative and generative tasks across domains without finetuning?). I believe this is where scale is the enabling factor.
_Arsenie_Boca_ t1_ityby3b wrote
Reply to comment by deeceeo in [D] What's the best open source model for GPT3-like text-to-text generation on local hardware? by AuspiciousApple
True, I forgot about this one. Although getting a 20B model (NeoX-20B or UL2 20B) to run on an RTX GPU is probably a big stretch.
_Arsenie_Boca_ t1_itwt4ls wrote
Reply to [D] What's the best open source model for GPT3-like text-to-text generation on local hardware? by AuspiciousApple
I don't think any model you can run on a single commodity GPU will be on par with GPT-3. GPT-J, OPT-6.7B/13B and GPT-NeoX-20B are perhaps the best alternatives. Some might need significant engineering (e.g. DeepSpeed offloading or reduced precision) to work with limited VRAM.
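A minimal sketch of what that could look like with Hugging Face transformers (the model name and generation settings are just examples; fp16 plus accelerate's device_map offloading is one of the simpler options before reaching for DeepSpeed):

```python
# Rough sketch: loading GPT-J 6B on a single GPU in half precision,
# letting accelerate offload layers to CPU if VRAM runs out.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# fp16 roughly halves the memory footprint compared to fp32
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

inputs = tokenizer("The meaning of life is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```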
_Arsenie_Boca_ t1_isnytx5 wrote
Reply to comment by No_Slide_1942 in Testing Accuracy higher than Training Accuracy by redditnit21
If the performance on the train set after training is better than during training, it's simply because the train accuracy is usually calculated while the model is still learning.
_Arsenie_Boca_ t1_isnyhtb wrote
You should test whether this happens only during training or also when evaluating on the train set afterwards. As others have mentioned, dropout could be a possible factor. But you should also consider that the train accuracy is calculated during the training process, while the model is still learning. That is, the final weights are not reflected in the averaged train accuracy.
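A minimal sketch of what I mean, assuming a standard PyTorch setup with model, train_loader and test_loader already defined:

```python
# Re-evaluate on the training set *after* training, with dropout disabled,
# and compare against the running accuracy logged during training.
import torch

@torch.no_grad()
def evaluate_accuracy(model, loader, device="cuda"):
    model.eval()  # disables dropout / uses running batchnorm stats
    correct, total = 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        preds = model(x).argmax(dim=-1)
        correct += (preds == y).sum().item()
        total += y.numel()
    return correct / total

# train_acc_after = evaluate_accuracy(model, train_loader)
# test_acc = evaluate_accuracy(model, test_loader)
```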
_Arsenie_Boca_ t1_isbgyhs wrote
If the hardware is optimized for it, there probably isn't a huge difference in speed, but the accuracy gain from 64-bit is probably negligible too.
The real reason people don't use 64-bit is mainly memory usage. When you train a large model, you can fit much bigger 32-bit/16-bit batches into memory and thereby speed up training.
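A quick illustration of the memory argument in PyTorch (the batch shape is arbitrary; speed differences are hardware-dependent and not shown):

```python
# The same batch stored in float64, float32 and float16.
import torch

batch = torch.randn(256, 3, 224, 224)  # a typical image batch
for dtype in (torch.float64, torch.float32, torch.float16):
    t = batch.to(dtype)
    mb = t.element_size() * t.nelement() / 1024**2
    print(f"{dtype}: {mb:.1f} MiB")
# float64 ~ 294 MiB, float32 ~ 147 MiB, float16 ~ 73.5 MiB
```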
_Arsenie_Boca_ t1_irxaphg wrote
Reply to comment by freezelikeastatue in [D] Looking for some critiques on recent development of machine learning by fromnighttilldawn
While I am optimistic about the openness of AI, I am much more pessimistic regarding its capabilities. I don't believe AI could replace a team of software engineers anytime soon.
_Arsenie_Boca_ t1_irx86or wrote
Reply to comment by freezelikeastatue in [D] Looking for some critiques on recent development of machine learning by fromnighttilldawn
I see your point, but I wouldn't view it too pessimistically. If anything, the not-so-open policy of OpenAI has led to many initiatives that aim to democratize AI. If those decide to go commercial as well, others will take their place.
_Arsenie_Boca_ t1_irx6bvg wrote
Reply to comment by freezelikeastatue in [D] Looking for some critiques on recent development of machine learning by fromnighttilldawn
I guess this is part of the bitter lesson: sacrificing some quality for quantity seems to pay off in many cases.
_Arsenie_Boca_ t1_irx1ubl wrote
Reply to comment by elbiot in [D] Looking for some critiques on recent development of machine learning by fromnighttilldawn
That's definitely a fair point (although you can do that with recurrent models as well, see the Reddit link in my other comment). Anyway, the more general point about changing multiple things at once stands; maybe I chose a bad example.
_Arsenie_Boca_ t1_irx1c80 wrote
Reply to comment by _Arsenie_Boca_ in [D] Looking for some critiques on recent development of machine learning by fromnighttilldawn
In fact, here is a post by someone who apparently found pretty positive results scaling recurrent models to billions of parameters: https://www.reddit.com/r/MachineLearning/comments/xfup9f/r_rwkv4_scaling_rnn_to_7b_params_and_beyond_with/?utm_source=share&utm_medium=android_app&utm_name=androidcss&utm_term=1&utm_content=share_button
_Arsenie_Boca_ t1_irwzk3j wrote
Reply to comment by CommunismDoesntWork in [D] Looking for some critiques on recent development of machine learning by fromnighttilldawn
The point is that you cannot confirm the superiority of an architecture (or whatever component) when you change multiple things at once. And yes, it does matter where an improvement comes from; controlled comparison is the only scientifically sound way to improve. Otherwise we might as well try random things until we find something that works.
To come back to LSTMs vs Transformers: I'm not saying LSTMs are better or anything. I'm just saying that if LSTMs had received the amount of engineering attention that went into making transformers better and faster, who knows whether they might have been similarly successful?
_Arsenie_Boca_ t1_irwat46 wrote
Reply to comment by MohamedRashad in [D] Reversing Image-to-text models to get the prompt by MohamedRashad
That's a fair point; you would have a fixed length for the prompt.
Not sure if this makes sense, but you could use an LSTM with an arbitrary constant input to generate a variable-length sequence of embeddings and optimize the LSTM parameters rather than the embeddings directly.
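Something along these lines (a rough sketch; the module names, shapes and the feedback scheme are all my own assumptions):

```python
# An LSTM unrolled from a learned constant input produces a sequence of soft
# prompt embeddings; we optimize the LSTM's weights instead of the embeddings
# themselves, so the sequence length can be varied freely.
import torch
import torch.nn as nn

class PromptGenerator(nn.Module):
    def __init__(self, embed_dim=768, hidden_dim=512):
        super().__init__()
        self.start = nn.Parameter(torch.randn(1, 1, embed_dim))  # constant input
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, embed_dim)

    def forward(self, length: int):
        embeddings, state = [], None
        x = self.start
        for _ in range(length):
            out, state = self.lstm(x, state)
            emb = self.proj(out)               # (1, 1, embed_dim)
            embeddings.append(emb)
            x = emb                            # feed the embedding back in
        return torch.cat(embeddings, dim=1)    # (1, length, embed_dim)

gen = PromptGenerator()
soft_prompt = gen(length=12)  # change length without changing parameter count
```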
_Arsenie_Boca_ t1_irw1ti7 wrote
Reply to comment by MohamedRashad in [D] Reversing Image-to-text models to get the prompt by MohamedRashad
No, I believe you are right to think that an arbitrary image captioning model cannot accurately generate prompts that actually lead to a very similar image. After all, the prompts are very model-dependent.
Maybe you could use something similar to prompt tuning. Use a number of randomly initialized prompt embeddings, generate an image and backprop the distance between your target image and the generated image. After convergence, you can perform a nearest neighbor search to find the words closest to the embeddings.
Not sure whether this has been done, but I think it should work reasonably well. A rough sketch of what I mean follows below.
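Sketch only: the generator below is a stand-in for a differentiable text-to-image model, and the vocabulary table, loss and hyperparameters are all placeholders of my own choosing.

```python
# Prompt-tuning-style inversion: optimize free-floating prompt embeddings
# against a target image, then snap them to the nearest vocabulary tokens.
import torch
import torch.nn as nn

embed_dim, prompt_len, vocab_size = 768, 8, 50000
vocab_embeddings = torch.randn(vocab_size, embed_dim)  # stand-in for the model's token embedding table
generator = nn.Sequential(nn.Linear(prompt_len * embed_dim, 3 * 64 * 64))  # placeholder for the image model
target_image = torch.randn(3 * 64 * 64)                # the image you want to invert

# 1) optimize the "soft prompt" embeddings against the target image
soft_prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)
opt = torch.optim.Adam([soft_prompt], lr=1e-2)
for step in range(200):
    opt.zero_grad()
    image = generator(soft_prompt.flatten())
    loss = torch.nn.functional.mse_loss(image, target_image)  # or a perceptual loss
    loss.backward()
    opt.step()

# 2) map each optimized embedding to its nearest vocabulary token
with torch.no_grad():
    dists = torch.cdist(soft_prompt, vocab_embeddings)  # (prompt_len, vocab_size)
    token_ids = dists.argmin(dim=-1)                    # closest word per position
print(token_ids)  # decode with the model's tokenizer to get a textual prompt
```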
_Arsenie_Boca_ t1_irvjdtn wrote
Reply to [D] Looking for some critiques on recent development of machine learning by fromnighttilldawn
I don't have the papers on hand that investigate this, but here are two things that don't make me proud of being part of this field.
Are transformers really architecturally better than LSTMs, or is their success mainly due to the huge amount of compute and data we throw at them? More generally, papers tend to make many changes to a system and credit the improvement to the thing they are most proud of, without a fair comparison.
Closed-source models like GPT-3 don't make their training dataset public. People evaluate performance on benchmarks, but nobody can say for sure whether the benchmark data was in the training data. ML used to be very cautious about data leakage, but this is simply ignored in most cases when it comes to these models.
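For reference, the kind of contamination check I mean is conceptually simple: flag benchmark examples whose n-grams also occur in the training corpus. A toy sketch with placeholder data (without access to the corpus it obviously cannot be run, which is exactly the problem):

```python
# Flag benchmark examples that share an n-gram with the training corpus.
def ngrams(text, n=8):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated(benchmark_examples, training_corpus, n=8):
    train_ngrams = set()
    for doc in training_corpus:
        train_ngrams |= ngrams(doc, n)
    return [ex for ex in benchmark_examples if ngrams(ex, n) & train_ngrams]

# overlap = contaminated(benchmark, corpus)  # without the corpus, nobody outside can run this
```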
_Arsenie_Boca_ t1_irrcmwh wrote
Reply to comment by Constant-Cranberry29 in how to find out the problem when want to do testing the model? by Constant-Cranberry29
If all you want is to see the two curves close to each other, I guess you could size up the model so that it overfits terribly. But is that really desirable?
If my assumption that you predict the whole graph autoregressively is correct, then I believe your model works just fine. You should check the forecast horizon and think about what you actually want to achieve in the end.
_Arsenie_Boca_ t1_irr9bb4 wrote
Reply to comment by Constant-Cranberry29 in how to find out the problem when want to do testing the model? by Constant-Cranberry29
No, this is not a modelling issue; it actually isn't a real issue at all. Predicting a very long trajectory is simply very hard: at each timestep a slight error occurs, and these errors compound, even if the error per step is marginal. Imagine being asked to predict a certain stock price. Given some expertise and current information, you might be able to do it for tomorrow, but can you do it precisely for a year from now?
_Arsenie_Boca_ t1_irr84f2 wrote
I guess this is time-series forecasting. You should think about the lookahead: probably, during training, the model only has to predict the next point (with the true history as input), while during testing it has to predict many values autoregressively, feeding its own predictions back in.
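To illustrate the difference (model and data below are toy placeholders, not your setup):

```python
# One-step prediction with ground-truth inputs vs. an autoregressive rollout
# that feeds the model's own predictions back in, where errors accumulate.
import torch

model = lambda window: window.mean()  # dummy stand-in for your network
series = torch.sin(torch.linspace(0, 20, 300)) + 0.05 * torch.randn(300)

def one_step_eval(series, window=24):
    # every prediction is conditioned on the true history (as during training)
    return torch.stack([model(series[t - window:t]) for t in range(window, len(series))])

def autoregressive_rollout(history, horizon=100, window=24):
    context = list(history[-window:])
    preds = []
    for _ in range(horizon):
        y = model(torch.stack(context[-window:]))
        preds.append(y)
        context.append(y)  # the model's own prediction becomes part of the input
    return torch.stack(preds)

# one_step_eval tracks the curve closely; autoregressive_rollout drifts,
# because every small error is fed back into the next prediction.
```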
_Arsenie_Boca_ t1_iqzeu3o wrote
Reply to comment by _Arsenie_Boca_ in [D] Why restrict to using a linear function to represent neurons? by MLNoober
There are a few papers researching this (the effect of high dimensionality on SGD), but I can't seem to find any right now. Maybe someone can help me out :)
_Arsenie_Boca_ t1_iqze3a1 wrote
As many others have mentioned, the decision boundaries of piecewise linear models are actually quite smooth in the end, given a sufficient number of layers.
But to get to the core of your question: why would you prefer many simple neurons over a few smart ones? I believe there is a relatively straightforward explanation for why the former works better. More complex neurons would mean that the computational cost goes up while the number of parameters stays the same, i.e. with the same compute you can train bigger models (in parameter count) if the neurons are simple. A high number of parameters is important for optimization, as the extra dimensions can help in getting out of local minima. This has not been fully explained, but it is part of the reason why pruning works so well: we wouldn't need that many parameters to represent a good fit, but a good fit is much easier to find in high dimensions, from where we can prune down to much smaller models (often around 5% of the parameters with almost the same performance).
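As a small illustration of that last point, magnitude pruning in PyTorch looks roughly like this (a toy MLP; the 90% sparsity level is only an example, not a claim about any particular paper):

```python
# Remove the smallest-magnitude weights from each Linear layer after training.
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(784, 1024), nn.ReLU(), nn.Linear(1024, 10))

for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.9)
        prune.remove(module, "weight")  # make the sparsification permanent

remaining = sum((p != 0).sum().item() for p in model.parameters())
total = sum(p.numel() for p in model.parameters())
print(f"{remaining / total:.1%} of parameters remain non-zero")
```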
_Arsenie_Boca_ t1_iqr4l61 wrote
what?
_Arsenie_Boca_ t1_iu7yz56 wrote
Reply to [R] Open source inference acceleration library - voltaML by harishprab
Looks promising. A comparison with competitors (HF Accelerate, Neural Magic, Nebullvm, ...) would be great.