Recent comments in /f/MachineLearning

remghoost7 t1_jcnvpqh wrote

Very interesting....

Reminds me of how some VR apps can run natively in browsers, using hardware acceleration I believe. I'm guessing this is something sort of similar to that....? Could be entirely wrong though.

Cool stuff though. Would be neat to make an extension of this for A1111... Not to diminish the work you've done, but it would probably get more exposure that way (since it's the most used Stable Diffusion front end out there).

3

alterframe t1_jcn87ue wrote

Do you have any explanation for why, in Figure 9, the training loss decreases more slowly with early dropout? The previous sections are all about how reducing the variance of the mini-batch gradients allows us to travel a longer distance in the hyperparameter space (Figure 1 from the post). It seems that this is not reflected in the value of the loss.

Any idea why? It catches up very quickly after the dropout is turned off, but I'm still curious about this behavior.
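
(For concreteness, this is roughly the schedule I have in mind when I say the dropout is "turned off". A toy sketch with made-up numbers, not the paper's code.)

```python
import torch
import torch.nn as nn

# Toy sketch of an early-dropout schedule: dropout is active only for the
# first few epochs, then set to 0 for the rest of training. The model, the
# dummy data, and the cutoff epoch are all made up for illustration.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Dropout(p=0.1), nn.Linear(64, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

early_epochs, num_epochs = 5, 20
for epoch in range(num_epochs):
    drop_p = 0.1 if epoch < early_epochs else 0.0   # turn dropout off after the early phase
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.p = drop_p
    x, y = torch.randn(128, 32), torch.randint(0, 10, (128,))  # dummy mini-batch
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```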

1

ggf31416 t1_jcmql4c wrote

At the speeds these things move, by the time you see them coming it's already too late for any corrective maneuver. It's the same reason you don't use your eyeballs to detect aircraft 100km away. See https://en.wikipedia.org/wiki/Space_debris#Tracking_and_measurement and "Algorithms to Antenna: Tracking Space Debris with a Radar Network"; radar and lidar are used.

2

Screye t1_jcmpd5i wrote

This comes mostly from extensive personal experience with prompt engineering / fine-tuning over the last 2 years.

Simply put:

  • The model learns what it sees. Put differently: throw enough data of a certain type at it and, given enough data & compute, emergent properties relating to that data will show up.
  • If the model has never seen data past 8k tokens (due to context-window limitations), it never needs to learn to reason over more than 8k tokens.
  • The source data (humans) is limited in the complexity of thought that can be captured within 8k tokens vs 32k tokens.
  • That's not to say the model doesn't reason over longer windows using latent knowledge, which makes its implicit 'reasoning window' much larger than just 8k tokens. But that is fundamentally different from explicitly reasoning over a 32k window.
  • The model today can only assemble a chain-of-thought prompt of 8k tokens (see the toy illustration after this list). If there is never any human feedback or loss-landscape optimization for when it fails to reason past 8k tokens, then any ability the model gains there will be purely incidental.
  • On the other hand, when you have chain-of-thought prompt chains that are 32k tokens long, we can naturally expect them to contain more axioms, postulates and relationships between those postulates/axioms.
  • Those completions will get evaluated against human feedback & self-supervised scenarios, which should explicitly optimize the loss landscape for reasoning over far more complex logical statements.
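
A toy illustration of the 8k point (made-up numbers, a placeholder token list rather than a real tokenizer): an 8k-context model simply never conditions on the full 32k-token chain, so there is no loss signal for reasoning across it.

```python
# Toy illustration: an 8k-context model never sees the head of a 32k-token
# chain-of-thought, so nothing in training optimizes for reasoning over the
# full chain. The numbers and the "tokens" list are made up.
context_window = 8_192
long_cot = ["tok"] * 32_768                # stand-in for a 32k-token CoT chain

visible = long_cot[-context_window:]       # what the 8k model actually conditions on
dropped = len(long_cot) - len(visible)     # reasoning steps it never trains on
print(len(visible), dropped)               # 8192 24576
```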

Idk if that makes sense. Our field keeps moving away from math, and as embarrassing as it is to anthropomorphize the model, it does make it easier to get the point across.

2

mike94025 t1_jcmho8t wrote

Don't call flash_sdp directly. That locks you into particular hardware and creates non-portable models. You can either use F.scaled_dot_product_attention() or nn.MultiheadAttention; in either case it will pick the right implementation based on the hardware you have and the constraints. Ideally, the constraints would be weakened in the future, and/or new kernels might support other operating points in an optimized manner, and then the kernel picker can dispatch to that implementation.
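
For example, a minimal sketch assuming PyTorch 2.0+ and a CUDA device (shapes are arbitrary):

```python
import torch
import torch.nn.functional as F

# Arbitrary (batch, heads, seq_len, head_dim) tensors; fp16 on CUDA makes the
# fused kernels eligible, but the same call also works on CPU / fp32.
q = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)

# Portable call: the dispatcher picks flash / memory-efficient / math
# attention based on hardware, dtype, and input constraints.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Optional, for debugging only: restrict which backends the picker may use.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    out_flash_only = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```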

See the kernel-picker logic that dispatches based on input characteristics in the source code, and/or the SDPA tutorial here => https://pytorch.org/tutorials/intermediate/scaled_dot_product_attention_tutorial.html

2