Recent comments in /f/MachineLearning

remghoost7 t1_jcnvpqh wrote

Very interesting....

Reminds me of how some VR apps can run natively in browsers, using hardware acceleration I believe. I'm guessing this is something sort of similar to that....? Could be entirely wrong though.

Cool stuff though. Would be neat to make an extension of this for A1111... Not to diminish the work you've done, but it would probably get more exposure that way (since it's the most used Stable Diffusion front end out there).

3

alterframe t1_jcn87ue wrote

Do you have any explanation for why, in Figure 9, the training loss decreases more slowly with early dropout? The previous sections are all about how reducing the variance of the mini-batch gradients allows us to travel a longer distance in the hyperparameter space (Figure 1 from the post). It seems that this is not reflected in the value of the loss.

Any idea why? It catches up very quickly after the dropout is turned off, but I'm still curious about this behavior.
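
(For concreteness, this is roughly the schedule I have in mind when I say the dropout is "turned off". A toy sketch with made-up numbers, not the paper's code.)

```python
import torch
import torch.nn as nn

# Toy sketch of an early-dropout schedule: dropout is active only for the
# first few epochs, then set to 0 for the rest of training. The model, the
# dummy data, and the cutoff epoch are all made up for illustration.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Dropout(p=0.1), nn.Linear(64, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

early_epochs, num_epochs = 5, 20
for epoch in range(num_epochs):
    drop_p = 0.1 if epoch < early_epochs else 0.0   # turn dropout off after the early phase
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.p = drop_p
    x, y = torch.randn(128, 32), torch.randint(0, 10, (128,))  # dummy mini-batch
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```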

1

ggf31416 t1_jcmql4c wrote

At the speeds these things move, by the time you see them coming it's already too late for any corrective maneuver. It's the same reason you don't use your eyeballs to detect aircraft 100km away. See https://en.wikipedia.org/wiki/Space_debris#Tracking_and_measurement and "Algorithms to Antenna: Tracking Space Debris with a Radar Network"; radar and lidar are used.

2

Screye t1_jcmpd5i wrote

This comes mostly from extensive personal experience with prompt engineering / fine-tuning over the last 2 years.

Simply put:

  • The model learns what it sees. Put differently: throw enough data of a certain type at it and, given enough data & compute, emergent properties relating to that data will show up.
  • If the model has never seen data past 8k tokens (due to context-window limitations), it never needs to learn to reason over more than 8k tokens.
  • The source data (humans) is limited in the complexity of thought that can be captured within 8k tokens vs 32k tokens.
  • That's not to say the model doesn't reason over longer windows using latent knowledge, which makes its implicit 'reasoning window' much larger than just 8k tokens. But that is fundamentally different from explicitly reasoning over a 32k window.
  • The model today can only assemble a chain-of-thought prompt of 8k tokens (see the toy illustration after this list). If there is never any human feedback or loss-landscape optimization for when it fails to reason past 8k tokens, then any ability the model gains there will be purely incidental.
  • On the other hand, when you have chain-of-thought prompt chains that are 32k tokens long, we can naturally expect them to contain more axioms, postulates and relationships between those postulates/axioms.
  • Those completions will get evaluated against human feedback & self-supervised scenarios, which should explicitly optimize the loss landscape for reasoning over far more complex logical statements.
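
A toy illustration of the 8k point (made-up numbers, a placeholder token list rather than a real tokenizer): an 8k-context model simply never conditions on the full 32k-token chain, so there is no loss signal for reasoning across it.

```python
# Toy illustration: an 8k-context model never sees the head of a 32k-token
# chain-of-thought, so nothing in training optimizes for reasoning over the
# full chain. The numbers and the "tokens" list are made up.
context_window = 8_192
long_cot = ["tok"] * 32_768                # stand-in for a 32k-token CoT chain

visible = long_cot[-context_window:]       # what the 8k model actually conditions on
dropped = len(long_cot) - len(visible)     # reasoning steps it never trains on
print(len(visible), dropped)               # 8192 24576
```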

Idk if that makes sense. Our field keeps moving away from math, and as embarrassing as it is to anthropomorphize the model, it does make it easier to get the point across.

2

mike94025 t1_jcmho8t wrote

Don't call flash_sdp directly. That locks you into particular hardware and creates non-portable models. You can either use F.scaled_dot_product_attention() or nn.MultiheadAttention; in either case it will pick the right implementation based on the hardware you have and the constraints. Ideally, the constraints would be weakened in the future, and/or new kernels might support other operating points in an optimized manner, and then the kernel picker can dispatch to that implementation.
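
For example, a minimal sketch assuming PyTorch 2.0+ and a CUDA device (shapes are arbitrary):

```python
import torch
import torch.nn.functional as F

# Arbitrary (batch, heads, seq_len, head_dim) tensors; fp16 on CUDA makes the
# fused kernels eligible, but the same call also works on CPU / fp32.
q = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)

# Portable call: the dispatcher picks flash / memory-efficient / math
# attention based on hardware, dtype, and input constraints.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Optional, for debugging only: restrict which backends the picker may use.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    out_flash_only = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```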

See the kernel-picker logic that dispatches based on input characteristics in the source code, and/or the SDPA tutorial here => https://pytorch.org/tutorials/intermediate/scaled_dot_product_attention_tutorial.html

2