Recent comments in /f/MachineLearning
FallUpJV t1_jclpydo wrote
Reply to comment by xEdwin23x in [R] RWKV 14B ctx8192 is a zero-shot instruction-follower without finetuning, 23 token/s on 3090 after latest optimization (16G VRAM is enough, and you can stream layers to save more VRAM) by bo_peng
Yes, it's definitely not small. I meant compared to the models people have been paying the most attention to over the last few years, I guess.
The astronaut-pointing-a-gun meme is a good analogy, almost a scary one. I wonder how much we could improve existing models with simply better data.
Franck_Dernoncourt t1_jclpll3 wrote
Thanks for sharing! How does it compare against other models (e.g., Alpaca or GPT-3.5/4)?
MysteryInc152 t1_jclpjzi wrote
Reply to comment by FallUpJV in [R] RWKV 14B ctx8192 is a zero-shot instruction-follower without finetuning, 23 token/s on 3090 after latest optimization (16G VRAM is enough, and you can stream layers to save more VRAM) by bo_peng
It's predicting language. As long as the architecture properly allows it to learn to predict language, you're good to go.
KerfuffleV2 t1_jclo0oh wrote
Reply to comment by felheartx in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
I'm not an ML person, but it seems like that paper is just teaching the LLM to simulate a Turing machine. Actually making it respond normally while doing practical stuff like answering user queries would be a different thing.
Also, suppose the LLM has access to external memory. First, you have to teach it how to interact with that external memory (via special command sequences in its tokens, most likely). Then you have to teach it, or take steps to make it, appropriately note which things are important or not and store/retrieve them as necessary. All of this requires tokens for input/output, so it will increase processing time even when used perfectly, and these tokens will also consume the existing context window.
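Just to make that overhead concrete, here's roughly how I'd picture the loop (a toy Python sketch; the `MEM_WRITE`/`MEM_READ` markers and the `generate` function are made up for illustration, not any real API):

```python
# Toy sketch of an LLM driving an external key-value memory via
# special command sequences in its output. Every name here is
# hypothetical; a real system would need trained special tokens.
import re

memory = {}  # the external store the model reads from and writes to

def run_with_memory(generate, prompt, max_rounds=8):
    """Let the model emit MEM_WRITE/MEM_READ commands, execute them,
    and feed results back in. `generate` stands in for an LLM call."""
    context = prompt
    for _ in range(max_rounds):
        out = generate(context)  # hypothetical LLM call
        write = re.search(r"MEM_WRITE\[(.+?)=(.+?)\]", out)
        read = re.search(r"MEM_READ\[(.+?)\]", out)
        if write:
            memory[write.group(1)] = write.group(2)
        elif read:
            # the retrieved value gets appended, consuming context tokens
            out += f"\nMEM_RESULT[{memory.get(read.group(1), '')}]"
        else:
            return out  # no memory command: treat as the final answer
        context += out
    return context
```

Every round trip there costs prompt and output tokens, which is exactly the overhead I mean.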
One really big thing with LLMs now is that they don't seem to (and maybe can't) know what they know or don't know. They just predict tokens; they can't really do introspection. Of course, they can be trained to respond that they don't know certain things, but getting the LLM to decide it needs to use the external memory doesn't seem like the simplest thing.
I mean, take humans as an example: Are you effective at taking notes, organizing them in a way that lets you easily recall them in the future, etc? It's not even an easy skill for humans to develop, and we're relatively good at knowing what we don't know.
Another thing is that the paper you linked says it set the temperature to 0 to make the responses deterministic. Generally this makes them a lot less creative as well. If you turn up the temperature, you potentially increase the chances that the LLM generates malformed queries for the external memory or stuff like that.
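For reference, temperature just rescales the logits before sampling, so 0 collapses to picking the argmax (a minimal sketch, not any particular model's sampler):

```python
import numpy as np

def sample_token(logits, temperature=1.0, rng=np.random.default_rng()):
    """Temperature-scaled sampling: T -> 0 approaches greedy argmax,
    higher T flattens the distribution and raises the odds of
    malformed output like a broken memory query."""
    if temperature == 0:
        return int(np.argmax(logits))  # fully deterministic
    scaled = np.asarray(logits) / temperature
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```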
Anyway, I don't know much about the technical side of increasing the context window, but as far as I know, once the context window is bigger the model can just use it. Taking advantage of some sort of external memory system seems like a very, very complicated thing to solve reliably.
Again, note this is coming from someone that doesn't really know much about ML, LLMs, etc. I'm just a normal developer, so take all this with a grain of salt.
lmericle t1_jcln487 wrote
Reply to comment by felheartx in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
You will find that in hype circles such as NLP there's a lot of thought-terminating cliches passed around by people who are not so deep in the weeds. Someone says something with confidence, another person doesn't know how to vet it and so just blindly passes it on, and all of a sudden a hack becomes a rumor becomes dogma. It seems to me to be this way with context vs memory.
Put another way: it's the kind of attitude that says "No, Mr. Ford, what we wanted was faster horses".
LappenX t1_jcln1pc wrote
You don't want to use jax without jit.
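A minimal illustration (the actual speedup depends on the function and hardware):

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sin(x) @ jnp.cos(x).T  # some array math worth fusing

x = jnp.ones((1000, 1000))

f(x)                # dispatched op by op: slow
f_jit = jax.jit(f)
f_jit(x)            # first call compiles with XLA; later calls are fast
```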
Batteredcode t1_jcllc74 wrote
Reply to comment by LeN3rd in [D] Simple Questions Thread by AutoModerator
Thank you, this is really helpful. I think you're right that the CycleGAN is the way to go!
kreuzguy t1_jcliuwx wrote
Reply to comment by kittenkrazy in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
Someone should definitely look into this!
felheartx t1_jclij93 wrote
Reply to comment by hfnuser0000 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
I have no idea how many other ways there are, but this looks extremely promising: https://arxiv.org/abs/2301.04589#
So there's at least one :P
felheartx t1_jcli6si wrote
Reply to comment by -Rizhiy- in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
You said working with external memory is not as straightforward. Can you explain that?
I've read this: https://arxiv.org/abs/2301.04589# and even though I'm not super familiar with the details, to my untrained eye it seems like attaching external memory is easier than extending the context size.
Just from reading posts on this subreddit, I get the feeling that getting larger and larger context sizes is very difficult, whereas simply attaching this sort of "dictionary" thing seems pretty easy to do.
Paarthri t1_jcli5zq wrote
Reply to comment by bogdantudorache in ML models for User Recognition using Keystroke Dynamics [P] by bogdantudorache
Did you measure any performance metrics? It's standard to do that when writing about an ML project.
bogdantudorache OP t1_jclfnmt wrote
Reply to comment by Paarthri in ML models for User Recognition using Keystroke Dynamics [P] by bogdantudorache
You can always start one 😁
Sonicxc t1_jcleo4j wrote
Reply to comment by LeN3rd in [D] Simple Questions Thread by AutoModerator
Hey man, thanks for the input. I will look into what you have mentioned
fastinguy11 t1_jcle8cn wrote
Reply to comment by Nhabls in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
When will the GPT-4 32k API be available?
Paarthri t1_jcle34m wrote
I don't see any discussion of the models' performance…
limpbizkit4prez t1_jcld70o wrote
Reply to comment by r_linux_mod_isahoe in [N] Jumpy 1.0 has now been released by the Farama Foundation by jkterry1
Yeah, but if I have a code base written in numpy and want to use jax, wouldn't I need to do the same amount of refactoring to integrate this as I would with regular jax? Are there a lot of functions in numpy that don't exist in jax.numpy?
r_linux_mod_isahoe t1_jclc6rp wrote
Reply to comment by limpbizkit4prez in [N] Jumpy 1.0 has now been released by the Farama Foundation by jkterry1
Porting an existing codebase to JAX? Using any existing algorithm that's implemented in NumPy, but on a JAX backend? The opportunities are massive.
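The point is dispatch: one function body that runs on whichever backend the input lives on. A hand-rolled sketch of the idea (illustrative only, not Jumpy's actual implementation):

```python
import numpy as onp
import jax.numpy as jnp

def dispatching_sum(x):
    """Pick the backend from the input's type, so one code path serves
    plain NumPy arrays and JAX arrays alike. (Hypothetical helper,
    just to show the dispatch pattern.)"""
    xp = jnp if isinstance(x, jnp.ndarray) else onp
    return xp.sum(x ** 2)

dispatching_sum(onp.arange(4))   # runs on NumPy
dispatching_sum(jnp.arange(4))   # runs on JAX, jit/grad-friendly
```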
limpbizkit4prez t1_jclabza wrote
What's the value in using this instead of "import jax.numpy as np"?
Competitive-Rub-1958 t1_jcl97q0 wrote
Reply to comment by No-Belt7582 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
> Either autograd is disabled (using torch.inference_mode or torch.no_grad) or no tensor argument requires_grad
> training is disabled (using .eval())
What's the point of FlashAttention if you can't use it during training? 🤔
https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html
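For what it's worth, the fused kernel is also exposed directly as F.scaled_dot_product_attention, which does work under autograd. A minimal sketch (assumes a CUDA device; whether the flash kernel actually gets picked depends on device, dtype, and shapes):

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim), half precision on GPU
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16,
                requires_grad=True)
k, v = torch.randn_like(q), torch.randn_like(q)

# PyTorch 2.0 routes this to FlashAttention when the inputs qualify.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
out.sum().backward()  # gradients flow, so it is usable in training
```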
dats_ah_numba_wang t1_jcl7k6w wrote
Didn't know I needed this.
BungaBunga6767 t1_jcl6vf9 wrote
Reply to comment by harharveryfunny in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
Longformer does it, but not with FlashAttention.
lucidraisin t1_jcl6ecd wrote
Reply to comment by super_deap in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
no worries, thanks for running the experiments and sharing your results 🙏
bubudumbdumb t1_jclromf wrote
Reply to comment by Paarthri in ML models for User Recognition using Keystroke Dynamics [P] by bogdantudorache
They don't seem to claim "we have a dataset of typing logs from DIFFERENT people working on DIFFERENT tasks while typing on DIFFERENT keyboards".
If they had it, I would be more concerned about the ethics of the data collection than about the model's accuracy.