Recent comments in /f/MachineLearning
[deleted] OP t1_jcko3jr wrote
Reply to comment by yumiko14 in [D] GPT-4 is really dumb by [deleted]
[deleted]
[deleted] OP t1_jcknwkx wrote
Reply to comment by yumiko14 in [D] GPT-4 is really dumb by [deleted]
[deleted]
NotARedditUser3 t1_jckne7y wrote
Reply to comment by Available_Lion_652 in [D] GPT-4 is really dumb by [deleted]
He wasn't talking to you, dingus
JaCraig t1_jckmll4 wrote
Reply to comment by Available_Lion_652 in [D] GPT-4 is really dumb by [deleted]
My point is more that it's the wrong tool for the job. Something designed for calculations, like Wolfram Alpha and its API, is probably better suited:
https://www.wolframalpha.com/input?i=%28x%5E3%29%2B%28y%5E3%29%2B%28z%5E3%29+%3D+1024
BUT I did ask ChatGPT (so 3.5) to write an app to do it in a couple of languages, and it gave me a working app on the first try in each. It's not a very good app, as I could optimize it a lot more, but it works. GPT-4 gave a slightly better app in each instance.
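For flavor, a brute-force search like the ones it generated might look like this in Python (an illustrative sketch, not the actual generated code):

```python
# Illustrative sketch: brute-force integer search for
# x^3 + y^3 + z^3 = 1024 within a small bound.
def find_solutions(target=1024, bound=20):
    solutions = []
    for x in range(-bound, bound + 1):
        for y in range(-bound, bound + 1):
            for z in range(-bound, bound + 1):
                if x**3 + y**3 + z**3 == target:
                    solutions.append((x, y, z))
    return solutions

print(find_solutions())  # finds solutions such as (8, 8, 0) and its permutations
```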
mrpogiface t1_jckmi7d wrote
Reply to comment by kittenkrazy in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
Definitely, but you'd need to further fine-tune the model to "teach" it to make use of the additional context
farmingvillein t1_jckm5r2 wrote
Reply to comment by 2muchnet42day in [D] What is the best way to fine tune a LLM with your own data and build a custom text classifier? by pgalgali
Although note that OP does say that his data isn't labeled...and you of course need to label it for RoBERTa. So you're going to need to bootstrap that process via manual labeling or--ideally, if able--via an LLM labeling process.
If you go through the effort to set up an LLM labeling pipeline, you might just find that it is easier to use the LLM as a classifier, instead of fine-tuning yet another model (depending on cost, quality, etc. concerns).
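For a concrete picture, a minimal LLM labeling step could look like this (a sketch; the label set and the `call_llm` helper are hypothetical stand-ins for whatever completion API you use):

```python
# Hypothetical sketch of an LLM labeling pass over unlabeled text.
LABELS = ["positive", "negative", "neutral"]  # example label set

def label_example(text, call_llm):
    prompt = (
        f"Classify the following text as one of: {', '.join(LABELS)}.\n\n"
        f"Text: {text}\nLabel:"
    )
    answer = call_llm(prompt).strip().lower()
    return answer if answer in LABELS else "unknown"
```

The same function doubles as the classifier itself if you skip fine-tuning entirely.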
harharveryfunny t1_jckltrp wrote
> I think it should be possible to replicate even GPT-4 with open source tools something like Bloom + FlashAttention & fine-tune on 32k tokens.
So you mean build a model with a 32K attention window, but somehow initialize it with weights from BLOOM (2K window) and then finetune? Are you aware of any attempts to do this sort of thing?
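For what it's worth, one plausible recipe for models with learned absolute position embeddings is to keep all pretrained weights and just grow the position table, then finetune on long sequences (a sketch of that idea only; note BLOOM actually uses ALiBi rather than learned positions, which may extrapolate more gracefully):

```python
import torch

# Sketch: grow a learned position-embedding table from 2K to 32K slots
# by linear interpolation; every other pretrained weight is kept as-is.
old_pos = torch.randn(2048, 1024)  # stand-in for pretrained embeddings

new_pos = torch.nn.functional.interpolate(
    old_pos.T.unsqueeze(0),            # (1, dim, 2048)
    size=32768, mode="linear", align_corners=False,
).squeeze(0).T                         # (32768, dim)
# Fine-tuning on long sequences then teaches the model to use the new slots.
```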
[deleted] t1_jckkb5t wrote
Reply to comment by RobbinDeBank in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
[deleted]
2muchnet42day t1_jckjy9i wrote
Reply to comment by farmingvillein in [D] What is the best way to fine tune a LLM with your own data and build a custom text classifier? by pgalgali
Thank you
farmingvillein t1_jckjsyr wrote
Reply to comment by 2muchnet42day in [D] What is the best way to fine tune a LLM with your own data and build a custom text classifier? by pgalgali
- Much more off-the-shelf right now (although that is changing rapidly)
- No/minimal IP issues/concerns (although maybe OP doesn't care about that)
pitrucha t1_jckiv1q wrote
Any plans to quantize it? I saw that someone managed to do so with the 65B LLaMA and push it from 120 GB down to 30 GB.
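The core trick behind shrinks like that is low-bit weight quantization; here's a toy round-to-nearest sketch of the idea (not the actual method used for that LLaMA port):

```python
import torch

# Toy sketch: symmetric round-to-nearest int8 quantization of one weight
# tensor. A 4-bit variant of the same idea gives roughly the 4x shrink
# (fp16 -> 4-bit) mentioned above.
def quantize_int8(w):
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

w = torch.randn(4096, 4096)
q, s = quantize_int8(w)
print((w - dequantize(q, s)).abs().max())  # small reconstruction error
```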
RobbinDeBank t1_jcki2vl wrote
Reply to comment by No-Belt7582 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
What are its biggest improvements over PyTorch 1?
MirrorBredda t1_jckho6w wrote
Reply to [D] Simple Questions Thread by AutoModerator
Subject: Template to create a new library in the scikit-learn fit/predict API style
Hi every1ne,
I have seen so many packages reusing the fit/predict API style that scikit-learn came up with, which is the most popular nowadays.
I was wondering whether there is a sort of Python GitHub template project to fork and start from? The goal is to create a new library based on that fit/predict style, but as the lone researcher on the project, I am trying to plan development sprints optimally and avoid losing time reinventing the wheel.
Best wishes,
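In case it helps, the convention itself is tiny; a minimal estimator in that style might look like this (a sketch, with a deliberately trivial model):

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class MajorityClassifier(BaseEstimator, ClassifierMixin):
    """Toy estimator showing the fit/predict convention: learn state in
    fit(), store it in trailing-underscore attributes, and return self."""

    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self  # returning self allows clf.fit(X, y).predict(X)

    def predict(self, X):
        return np.full(len(X), self.majority_)
```

scikit-learn's own "Developing scikit-learn estimators" documentation covers the remaining conventions (get_params/set_params come free from BaseEstimator, check_estimator validates compliance, etc.).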
mikljohansson t1_jckedf9 wrote
Reply to [R] RWKV 14B ctx8192 is a zero-shot instruction-follower without finetuning, 23 token/s on 3090 after latest optimization (16G VRAM is enough, and you can stream layers to save more VRAM) by bo_peng
Very interesting work! I've been following this project for a while now
Can I ask a few questions?
- What's the difference between RWKV-LM and ChatRWKV? E.g. is ChatRWKV mainly RWKV-LM but streamlined for inference and ease of use, or are there more differences?
- Are you planning to fine-tune on the Stanford Alpaca dataset (like was recently done for LLaMA and GPT-J to create instruct versions of them), or a similar GPT-generated instruction dataset? I'd love to see an instruct-tuned version of RWKV-LM 14B with an 8k+ context length!
[deleted] t1_jckd4tg wrote
Reply to comment by CleanThroughMyJorts in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
[removed]
yaosio t1_jckchbe wrote
Reply to [R] RWKV 14B ctx8192 is a zero-shot instruction-follower without finetuning, 23 token/s on 3090 after latest optimization (16G VRAM is enough, and you can stream layers to save more VRAM) by bo_peng
I like its plan to make money. Did it learn from wallstreetbets?
2muchnet42day t1_jckb1es wrote
Reply to comment by abstract000 in [D] Neat project that would "fit" onto a 4090? by lifesthateasy
Why not LLaMA/Alpaca?
schwagggg t1_jckar2a wrote
Reply to comment by bo_peng in [R] RWKV 14B ctx8192 is a zero-shot instruction-follower without finetuning, 23 token/s on 3090 after latest optimization (16G VRAM is enough, and you can stream layers to save more VRAM) by bo_peng
i thought it was
“r - dub - kay - vi”
which is a little long but unique
Nhabls t1_jck9a4c wrote
Reply to comment by kittenkrazy in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
Yeah just need enough training time and data to be able to train those 32k context layers effectively........................
tripple13 t1_jck9593 wrote
Does anyone know why they didn't add the flash attention directly into the MultiheadAttention modules? Edit: Seems to be integrated, awesome!
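For reference, PyTorch 2.0 exposes the fused kernel directly as torch.nn.functional.scaled_dot_product_attention, which the built-in modules can dispatch to under the right conditions; a minimal usage sketch:

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim); fp16 on GPU lets the fused
# flash-attention kernel be selected when it's supported.
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```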
super_deap OP t1_jck82rd wrote
Reply to comment by Spiritual-Reply5896 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
Nuance is proportional to context.
Imagine we want to ask the language model to improve a certain module in the Linux kernel.
If I understood them correctly, memory-augmented transformers won't be able to fit all the pieces together to understand what needs to be improved and how, because they need to make repeated calls to memory and search/summarize those calls to get a basic understanding, and thus miss out on important details.
Compare that to a huge context window: everything the model needs is already in context, and there is no loss of detail (in the case of full attention).
Hyper1on t1_jck7qjx wrote
Reply to comment by VelveteenAmbush in [D] What do people think about OpenAI not releasing its research but benefiting from others’ research? Should google meta enforce its patents against them? by [deleted]
DM already hoards their secrets; there are successful projects there which are not published. What they show you is what they decide needs to be public to get good PR.
CleanThroughMyJorts t1_jck7114 wrote
Reply to comment by Spiritual-Reply5896 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
I don't think the two are mutually exclusive.
The problem with retrieval, though (at least in current implementations), is that the model can't attend to memory globally the way it does with context memory; you're bottlenecked by the retrieval process having to bring things into context through a local search.
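Concretely, that local search is usually just top-k similarity over chunk embeddings; only what it returns ever reaches the model's attention (a sketch; producing the embeddings is left out):

```python
import numpy as np

# Sketch: only the k nearest chunks are pulled into context; the model
# never attends to the rest of the memory.
def retrieve(query_vec, chunk_vecs, chunks, k=3):
    sims = chunk_vecs @ query_vec  # cosine similarity if rows are unit-norm
    top = np.argsort(-sims)[:k]
    return [chunks[i] for i in top]
```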
-Rizhiy- t1_jck6j55 wrote
Reply to comment by Spiritual-Reply5896 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
Increasing the context window is a simple albeit costly method of increasing the amount of addressable information. Working with external memory is not as straightforward.
NotARedditUser3 t1_jckof25 wrote
Reply to comment by yumiko14 in [D] GPT-4 is really dumb by [deleted]
https://vgel.me/posts/tools-not-needed/