Recent comments in /f/deeplearning

chatterbox272 t1_j6myph4 wrote

If the goal is to keep all predictions above a floor, the easiest way is to change the activation to floor + (1 - floor * num_logits) * softmax(logits). The outputs still sum to 1, so this has no material impact on the model, but it does impose the floor.
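
A minimal PyTorch sketch of that rescaled activation (the floor value and tensor shapes below are just for illustration):

```python
import torch
import torch.nn.functional as F

def floored_softmax(logits: torch.Tensor, floor: float = 0.01) -> torch.Tensor:
    """Softmax rescaled so every class probability is at least `floor`.

    The outputs still sum to 1: num_logits * floor + (1 - floor * num_logits) * 1 = 1.
    """
    num_logits = logits.size(-1)
    assert floor * num_logits < 1, "floor is too large for this many classes"
    return floor + (1 - floor * num_logits) * F.softmax(logits, dim=-1)

# Example: 4-class logits; every probability ends up >= 0.01 and each row sums to 1
probs = floored_softmax(torch.randn(2, 4), floor=0.01)
print(probs.min(), probs.sum(dim=-1))
```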

If the goal is to actually change something about how the predictions are made, though, then adding a floor isn't going to be the solution. You could modify the activation function some other way (e.g. by scaling or normalising the logits), or you could impose a loss penalty on the differences between the logits or between the final predictions.

1

FastestLearner t1_j6mhjd2 wrote

Use a composite loss, i.e. add extra terms to the loss function so that the optimizer forces the logits to stay within a fixed range.

For example, if current min logit = m and allowed minimum = u, current max logit = n and allowed maximum = v, then the following loss function should help:

Overall loss = CrossEntropy loss + lambda1 * max(u - m, 0) + lambda2 * max(n - v, 0)

The max terms ensure that no loss is added when the logits are all within the allowed range. Use lambda1 and lambda2 to scale each term so that they roughly match the CE loss in strength.
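
A rough PyTorch sketch of this composite loss (the bounds, lambda values, and tensor shapes are placeholders):

```python
import torch
import torch.nn.functional as F

def range_penalized_loss(logits, targets, u=-10.0, v=10.0, lambda1=1.0, lambda2=1.0):
    """Cross-entropy plus hinge penalties that push the logits into [u, v]."""
    ce = F.cross_entropy(logits, targets)
    m = logits.min()                             # current minimum logit
    n = logits.max()                             # current maximum logit
    lower_penalty = torch.clamp(u - m, min=0.0)  # non-zero only if m < u
    upper_penalty = torch.clamp(n - v, min=0.0)  # non-zero only if n > v
    return ce + lambda1 * lower_penalty + lambda2 * upper_penalty

# Example usage with made-up logits, some of which fall outside [-10, 10]
logits = 20 * torch.randn(8, 10)
targets = torch.randint(0, 10, (8,))
print(range_penalized_loss(logits, targets))
```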

5

hugio55 OP t1_j6jhyb3 wrote

Hey LiquidDinosaur - thanks for this info. I have been hearing about OpenAI quite a bit and will dive deep into what they have to offer. I will say that anything C++ (or C, or A through Z for that matter) will be beyond my breadth, but that's OK. I still enjoy watching it happen from some of the pros on YouTube.

1

suflaj t1_j6hfkdj wrote

> BN is used to reduce covariate shift, it just happened to regularize.

The first part was hypothesized, but not proven. It is a popular belief, like all the other hypotheses about why BN works so well.

> Dropout as a regularizing technique didn't become big before ResNet (2014 vs. 2015).

What does becoming big mean? Dropout was introduced in 2012 and has been used ever since. It was never big in the sense that you would always use it.

It is certainly false that Dropout was adopted for CNNs because of ResNets or immediately after them, as the first paper demonstrating a benefit from using Dropout in convolutional layers appeared in 2017: https://link.springer.com/chapter/10.1007/978-3-319-54184-6_12

> I doubt what you're saying is true, that they're effectively the same.

I never said that.

0

florisjuh t1_j6h1ffh wrote

Probably good to accompany it with a more practical book (or courses) though, such as Sebastian Raschka's Machine Learning with PyTorch and Scikit-Learn or Francois Chollet's Deep Learning with Python (Keras/TensorFlow). I also found Dive into Deep Learning (https://d2l.ai) to be a pretty nice resource for learning about more SOTA deep learning models and techniques.

2

XecutionStyle t1_j6ggq37 wrote

BN is used to reduce covariate shift, it just happened to regularize. Dropout as a regularizing technique didn't become big before ResNet (2014 vs. 2015).

I doubt what you're saying is true, that they're effectively the same. Try putting one right after the other and see the effect. Two drop-out layers or two BN layers, in contrast, have no problem co-existing.
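
For instance, a minimal PyTorch sketch of that kind of experiment (the layer sizes and dropout rate are arbitrary):

```python
import torch
import torch.nn as nn

# A block with BatchNorm immediately followed by Dropout, to compare against
# stacking two Dropout layers or two BatchNorm layers.
block = nn.Sequential(
    nn.Linear(128, 128),
    nn.BatchNorm1d(128),  # normalizes activation statistics per feature
    nn.Dropout(p=0.5),    # randomly zeroes activations during training
    nn.ReLU(),
)

x = torch.randn(32, 128)
block.train()
print(block(x).std())  # training-mode statistics (dropout active)
block.eval()
print(block(x).std())  # inference-mode statistics for comparison
```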

edit: sorry, what I mean is that the variants of drop-out that work with CNNs (without detrimental effects) didn't exist back then.

1