Recent comments in /f/MachineLearning
Art10001 t1_jcnv4te wrote
Reply to comment by hfnuser0000 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
https://github.com/LagPixelLOL/ChatGPTCLIBot
There are other similar projects I found while trying to track this one down, which may also be of interest. You can find them by searching "chatgpt embeddings memory github".
Art10001 t1_jcnv4kf wrote
Reply to comment by KerfuffleV2 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
https://github.com/LagPixelLOL/ChatGPTCLIBot
There are other similar projects I found while trying to track this one down, which may also be of interest. You can find them by searching "chatgpt embeddings memory github".
hfnuser0000 t1_jcnspad wrote
Reply to comment by Art10001 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
Hi there! It sounds really interesting! Could you please share the name of the project or provide a link to it? I would love to check it out. Thank you!
kross00 t1_jcnr2yo wrote
Reply to [P] ControlNetInpaint: No extra training and you can use 📝text +🌌image + 😷mask to generate new images. by mikonvergence
How is it different from the inpainting that's already built into ControlNet?
thekevsh0w t1_jcnkybt wrote
Reply to comment by gopher9 in [D] Neat project that would "fit" onto a 4090? by lifesthateasy
>RWKV
How about a 3080 Ti? I'm guessing 12 GB vs. 24 GB of VRAM is going to be rather lacking :(
royalemate357 t1_jcnjaeo wrote
Reply to comment by mike94025 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
Oh cool, thanks for the clarification. Nice that you folks made it more backend-independent. It would be interesting to try it out on AMD/MPS devices; I wonder if those requirements are met on those devices, though.
imaginethezmell t1_jcnj7ia wrote
Reply to comment by mikonvergence in [P] ControlNetInpaint: No extra training and you can use 📝text +🌌image + 😷mask to generate new images. by mikonvergence
Everyone is interested.
Are you kidding? lol
sqweeeeeeeeeeeeeeeps t1_jcngpzd wrote
If you were going to use GPT-3.5 Turbo, just wait for GPT-4 before you spend $600 on compute costs.
KerfuffleV2 t1_jcncad2 wrote
Reply to comment by Art10001 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
You'd have to link me what you're talking about for me to say anything. I doubt it works as straightforwardly as "infinite memory" though.
Art10001 t1_jcnakzz wrote
Reply to comment by KerfuffleV2 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
There is a GitHub project that uses embeddings with GPT-3.5 to create effectively infinite memory, as long as you have the disk space. The database grows and grows the more you talk.
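Roughly the idea, as a minimal sketch (not the actual project's code; the helper names, the use of text-embedding-ada-002, and plain cosine-similarity retrieval are my assumptions):

```python
# Sketch of embedding-based "memory" for a chatbot: store everything said,
# retrieve the most similar past snippets, and prepend them to the prompt.
import numpy as np
import openai

memory = []  # grows forever: list of (text, embedding) pairs

def embed(text: str) -> np.ndarray:
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=[text])
    return np.array(resp["data"][0]["embedding"])

def remember(text: str) -> None:
    memory.append((text, embed(text)))

def recall(query: str, k: int = 5) -> list[str]:
    q = embed(query)
    # cosine similarity against everything stored so far
    scored = [(float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e))), t)
              for t, e in memory]
    return [t for _, t in sorted(scored, reverse=True)[:k]]

def chat(user_msg: str) -> str:
    context = "\n".join(recall(user_msg))
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": f"Relevant past conversation:\n{context}"},
            {"role": "user", "content": user_msg},
        ],
    )
    reply = resp["choices"][0]["message"]["content"]
    remember(f"User: {user_msg}\nAssistant: {reply}")
    return reply
```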
Competitive-Rub-1958 t1_jcn8bti wrote
Reply to comment by mike94025 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
cool. I just wanted to make it explicit to make sure I'm running `FlashAttention`. Perhaps there's an easy way to check that?
alterframe t1_jcn87ue wrote
Do you have any explanation for why, in Figure 9, the training loss decreases more slowly with early dropout? The previous sections are all about how reducing variance in the mini-batch gradients allows us to travel a longer distance in the hyperparameter space (Figure 1 from the post). It seems that this is not reflected in the value of the loss.
Any idea why? It catches up very quickly after the dropout is turned off, but I'm still curious about this behavior.
mike94025 t1_jcn7ksu wrote
Reply to comment by royalemate357 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
Works for all. You need a compiler backend that can codegen for your target, and a frontend for the optimizer that can process the IR.
Alternatively, you need a backend for Triton (or another already-supported optimizer) that can codegen for your target architecture.
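From the user side it's just a backend choice; a minimal sketch (backend names other than the default "inductor" are illustrative and depend on what's installed in your environment):

```python
import torch
import torch.nn as nn

model = nn.TransformerEncoderLayer(d_model=512, nhead=8)

# Default backend: TorchInductor, which generates Triton kernels for GPU
# targets (and C++ for CPU). Any other registered backend that can consume
# the captured IR and codegen for your hardware plugs in the same way.
compiled = torch.compile(model, backend="inductor")
out = compiled(torch.randn(10, 32, 512))  # compilation happens on first call

print(torch._dynamo.list_backends())  # backends registered in this install
```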
Educational_Ice151 t1_jcn7818 wrote
Reply to [P] Web Stable Diffusion by crowwork
Wow! 🤯 this is so useful
Shared to r/aipromptprogramming
KingsmanVince t1_jcn734b wrote
On the positive side, maybe LLaMA is better than Alpaca if you do so.
On the negative side, maybe it responds too closely to ChatGPT.
kalakau t1_jcn6q2v wrote
Reply to [P] Web Stable Diffusion by crowwork
absolutely amazing, thanks for doing this!
f-d-t777 t1_jcn2oye wrote
Reply to comment by ggf31416 in [D] Simple Questions Thread by AutoModerator
Interesting, how would you alter my project idea then?
[deleted] t1_jcmtgb4 wrote
Reply to comment by mike94025 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
[removed]
ggf31416 t1_jcmql4c wrote
Reply to comment by f-d-t777 in [D] Simple Questions Thread by AutoModerator
At the speeds these things move, by the time you see them coming it's already too late to do any corrective maneuver. It's the same reason you don't use your eyeballs to detect aircraft 100 km away. See https://en.wikipedia.org/wiki/Space_debris#Tracking_and_measurement and "Algorithms to Antenna: Tracking Space Debris with a Radar Network"; radar and lidar are used.
Screye t1_jcmpd5i wrote
Reply to comment by VarietyElderberry in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
This is more derived from extensive personal experience with prompt engineering / fine tuning over the last 2 years.
Simply put:
- The model learns what it sees. Or: throw enough data of a certain type at it, and emergent properties relating to that data will show up given enough data & compute.
- If it has never seen data past 8k tokens in the past (due to context window limitations), the model won't need to learn to reason over more than 8k tokens.
- The source data (humans) has limitations on the complexity of thoughts that can be captured within 8k tokens vs. 32k tokens.
- That's not to say that the model doesn't reason over longer windows using latent knowledge, which makes its implicit 'reasoning window' much larger than just 8k tokens. But that is fundamentally different from explicitly reasoning over a 32k window.
- The model today can only assemble a chain-of-thought prompt of 8k tokens. If there is never any human feedback or loss-landscape-optimization for when it fails to reason past 8k tokens, then any ability the model gains there will be purely incidental.
- On the other hand, when you have chain-of-thought prompt chains that are 32k tokens long, we can naturally expect it to contain more axioms, postulates and relationships between those postulates/axioms.
- Those completions will get evaluated against human feedback as well as self-supervised scenarios, which should explicitly optimize the loss landscape to reason over far more complex logical statements.
Idk if that makes sense. Our field keeps moving away from math, and as embarrassing as it is to anthropomorphize the model, it does make it easier to get the point across.
mike94025 t1_jcmlddm wrote
Reply to comment by cthorrez in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
Better Transformer supports both, today. Some optimizations are still inference-only (in particular, support for variable-sequence-length Nested Tensor), and the inference fastpath is a bit siloed, but nothing that a future PyTorch update could not fix.
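For reference, a minimal sketch of hitting the inference fastpath (shapes and hyperparameters are arbitrary; the fastpath only kicks in under eval/no-grad and with supported layer configurations, and falls back to the regular path otherwise):

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
# enable_nested_tensor lets padded batches be converted to a
# variable-sequence-length Nested Tensor on the inference fastpath.
encoder = nn.TransformerEncoder(layer, num_layers=6, enable_nested_tensor=True)
encoder.eval()

src = torch.randn(2, 100, 512)                        # (batch, seq, d_model)
padding_mask = torch.zeros(2, 100, dtype=torch.bool)  # True marks padding
padding_mask[1, 50:] = True                           # second sequence is shorter

with torch.inference_mode():
    out = encoder(src, src_key_padding_mask=padding_mask)  # fastpath-eligible
```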
Necessary-Meringue-1 t1_jcmjqhm wrote
Reply to comment by Alimbiquated in Modern language models refute Chomsky’s approach to language [R] by No_Draft4778
I don't understand why it's so hard for people to acknowledge that LLMs deliver extremely impressive results, but that this does not mean they have human-like intelligence or language understanding.
Alimbiquated t1_jcmi1fd wrote
Reply to comment by Necessary-Meringue-1 in Modern language models refute Chomsky’s approach to language [R] by No_Draft4778
Right, it makes no sense.
mike94025 t1_jcmho8t wrote
Reply to comment by Competitive-Rub-1958 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
Don't call flash_sdp directly. That way you're locked into particular hardware and create non-portable models. You can either use F.scaled_dot_product_attention(), or you can use nn.MultiheadAttention. In either case it will pick the right implementation based on the hardware you have and the constraints. Ideally, the constraints will be weakened in the future, and/or new kernels might support other operating points in an optimized manner, and then the kernel picker can dispatch to that implementation.
See the kernel-picker logic that dispatches based on input characteristics in the source code, and/or the SDPA tutorial here => https://pytorch.org/tutorials/intermediate/scaled_dot_product_attention_tutorial.html
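If you just want a quick sanity check that the flash kernel can actually run for your inputs, here's a diagnostic sketch (assumes a CUDA device and fp16 tensors; this restriction is for debugging only, don't ship a model with the picker pinned like this):

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq, head_dim) in fp16 on a CUDA device
q = k = v = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)

# Temporarily restrict the kernel picker to the flash implementation only.
# If flash attention can't be used for these inputs / this hardware,
# scaled_dot_product_attention raises an error explaining why.
with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=False, enable_mem_efficient=False
):
    out = F.scaled_dot_product_attention(q, k, v)
print("flash attention kernel ran")
```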
remghoost7 t1_jcnvpqh wrote
Reply to [P] Web Stable Diffusion by crowwork
Very interesting....
Reminds me of how some VR apps can run natively in browsers, using hardware acceleration I believe. I'm guessing this is something sort of similar to that....? Could be entirely wrong though.
Cool stuff though. Would be neat to make an extension of this for A1111... Not to diminish the work you've done, but it would probably get more exposure that way (since it's the most used Stable Diffusion frontend out there).