Recent comments in /f/deeplearning

jobeta t1_iw6zxwa wrote

I don’t have much experience with that specific problem, but I would tend to think it’s hard to generalize like this to “models that hit the bottom” without knowing what the validation loss actually looked like and what the new data looks like. Chances are this data is not perfectly sampled from the same distribution as the first dataset, and its features have some idiosyncratic/new statistical properties. In that case, once you feed it to your pre-trained model, the loss is mechanically no longer at the minimum it supposedly reached in the first training run.

1

scitech_boom t1_iw6zpa9 wrote

There are multiple reasons. The main issue has to do with validation error. It usually follows a U curve, with a minimum at some epoch. That minimum is the point at which we usually stop training (`early stopping`). Any further training, with or without new data, is only going to make performance worse (I don't have a paper to cite for that).

I also started with the best model and that did not work. But when I took the model from 2 epochs before the best one, it worked well. In my case (speech recognition), it was a nice balance between improvement and training time.
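
To make that concrete, here is a minimal Keras sketch of the idea (the model, data and checkpoint paths are placeholders): keep a checkpoint for every epoch during the first training run so you can later reload the weights from a couple of epochs before the best one, rather than relying only on the single early-stopped model.

```python
import os
import numpy as np
import tensorflow as tf

# Toy stand-ins for the real training and validation sets.
x_train, y_train = np.random.rand(1000, 20), np.random.rand(1000, 1)
x_val, y_val = np.random.rand(200, 20), np.random.rand(200, 1)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

os.makedirs("ckpts", exist_ok=True)
callbacks = [
    # Stop once validation loss has clearly passed the bottom of the U curve.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5),
    # Save weights every epoch so a pre-best checkpoint is available later.
    tf.keras.callbacks.ModelCheckpoint(
        filepath="ckpts/epoch_{epoch:02d}.weights.h5",
        save_weights_only=True,
    ),
]

history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                    epochs=50, callbacks=callbacks, verbose=0)

# Roll back to 2 epochs before the best validation loss as a starting point.
best_epoch = int(np.argmin(history.history["val_loss"])) + 1  # filenames are 1-indexed
model.load_weights(f"ckpts/epoch_{max(best_epoch - 2, 1):02d}.weights.h5")
```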

1

Thijs-vW OP t1_iw6ryeq wrote

Thanks for the advice. Unfortunately I do not think transfer learning is the best thing for me to do, considering:

>if you train only on the new data, that's all it will know how to predict.

Anyhow,

>If retraining the entire model on the complete data set is possible with nominal cost in less than a few days, do that.

This is indeed the case. However, if I retrain my entire model, it is very likely that the new model will make entirely different predictions due to its weight matrix not being identical. This is the problem I would like to avoid. Do you have any advice on that?

1

BugSlayerJohn t1_iw5kxgs wrote

If retraining the entire model on the complete data set is possible with nominal cost in less than a few days, do that. If not, it's worth trying transfer learning: https://www.tensorflow.org/guide/keras/transfer_learning

Note that transfer learning is a shortcut: you are almost certainly sacrificing some accuracy to avoid a prohibitive amount of retraining. You'll also still need to train the new layers against a data set that completely represents the results you want, i.e. if you train only on the new data, that's all it will know how to predict.

If you don't have the original data set, but do have abundant training resources and time, you could try a Siamese-like approach, where a suitable percentage of the training data fed to the new network is generated data with target values provided based on predictions from the current network, and the remaining data is the new data you would like the network to learn. This will probably work better when the new data is entirely novel.
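
If it helps, here is a rough sketch of that mixing idea (the architecture, data and the 50/50 ratio are all assumptions made for illustration): generated inputs are labelled with the current network's predictions and combined with the genuinely new labelled data before training the new network.

```python
import numpy as np
import tensorflow as tf

# Stand-in for the already-trained "current" network; in practice, load your saved model.
old_model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1),
])
old_model.compile(optimizer="adam", loss="mse")

# New labelled data the network should learn (stand-in values).
x_new, y_new = np.random.rand(500, 20), np.random.rand(500, 1)

# Generated inputs covering the old input domain, with targets taken from the
# current network's predictions, so the new network is pulled toward the old behaviour.
x_gen = np.random.rand(500, 20)
y_gen = old_model.predict(x_gen, verbose=0)

# Mix the two sources; the 50/50 split here is purely illustrative and tunable.
x_mix = np.concatenate([x_gen, x_new])
y_mix = np.concatenate([y_gen, y_new])

new_model = tf.keras.models.clone_model(old_model)  # same architecture, fresh weights
new_model.compile(optimizer="adam", loss="mse")
new_model.fit(x_mix, y_mix, epochs=10, shuffle=True, verbose=0)
```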

3

scitech_boom t1_iw4ck9z wrote

>Concatenate old and new data and train one epoch.

This is what I did in the past and it worked reasonably well for my cases. But is that the best? I don't know.

Anyhow, you cannot do this:

>Simultaneously, I do want to use this model as starting point,

Instead, pick the weights from 2 or 3 epochs before the best-performing epoch of the previous training run. That should be the starting point.

Training on top of something that has already hit the bottom won't help, even if we add more data.
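
A compressed sketch of that recipe (the model, data and checkpoint path are stand-ins): restore the weights saved a couple of epochs before the best one, concatenate old and new data, and continue training briefly.

```python
import numpy as np
import tensorflow as tf

# Stand-ins for the old/new datasets and the original architecture.
x_old, y_old = np.random.rand(1000, 20), np.random.rand(1000, 1)
x_new, y_new = np.random.rand(200, 20), np.random.rand(200, 1)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")

# Start from the checkpoint saved 2-3 epochs before the best epoch (placeholder path),
# not from the fully converged "best" weights.
model.load_weights("ckpts/epoch_pre_best.weights.h5")

x_all = np.concatenate([x_old, x_new])
y_all = np.concatenate([y_old, y_new])
model.fit(x_all, y_all, epochs=1, shuffle=True, verbose=0)
```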

5

RichardBJ1 t1_iw43b43 wrote

Well, transfer learning would be the thing I would expect people to say: freeze the top and bottom layers, re-load the old model weights and continue training… but for me the best thing has always been to throw the old weights away, mix up the old and new training data sets, and start again… Sorry!!
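
For what it's worth, a minimal Keras sketch of the freeze-and-continue route (the architecture, the placeholder weights path, and the choice to freeze everything but the last layer are all illustrative, not a prescription):

```python
import numpy as np
import tensorflow as tf

# Rebuild the same architecture as the old model (stand-in layers here).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])

# Re-load the old weights (placeholder path), then freeze all but the last layer.
model.load_weights("old_model.weights.h5")
for layer in model.layers[:-1]:
    layer.trainable = False

# Compile *after* changing the trainable flags, then continue on the new data.
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")
x_new, y_new = np.random.rand(500, 20), np.random.rand(500, 1)  # stand-in for new data
model.fit(x_new, y_new, epochs=5, verbose=0)
```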

3

artsybashev t1_iw29zh1 wrote

A lot of deep learning has been the modern equivalent of witchcraft: just ideas that might make sense getting squashed together.

Hyperparameter tuning is one of the most obscure and hard-to-learn parts of neural network training, since it is hard to do multiple runs for models that take more than a few weeks or thousands of dollars to train. Most researchers have just learned some good initial guesses and might run the model with a handful of hyperparameter sets, from which the best result is chosen.

Some of the hyperparameter tuning can also be done on a smaller model, and the amount of tuning can be reduced while growing the model to the target size.
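
As a toy illustration of that last point (the search space, the epoch budget and the width scaling are made up): run a cheap random search on a narrow model, then reuse the winning settings on the full-size one.

```python
import numpy as np
import tensorflow as tf

x, y = np.random.rand(1000, 20), np.random.rand(1000, 1)  # stand-in data

def build_model(width, lr):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(width, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(lr), loss="mse")
    return model

# Cheap random search over the learning rate on a small model...
best_lr, best_loss = None, float("inf")
for lr in 10 ** np.random.uniform(-5, -2, size=10):
    hist = build_model(width=32, lr=float(lr)).fit(x, y, validation_split=0.2,
                                                   epochs=5, verbose=0)
    val = min(hist.history["val_loss"])
    if val < best_loss:
        best_lr, best_loss = float(lr), val

# ...then reuse (and re-check) the winning settings at the target size.
big_model = build_model(width=512, lr=best_lr)
```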

1

BrotherAmazing t1_iw18e2b wrote

Usually, if they share their dataset and problem with you, and you spend just a few hours on it with extensive experience designing and training deep NNs from scratch, you can find something incredibly simple (just normal learning-rate decay) as an alternative to gradient clipping that works just as well, showing the clipping was only “crucial” for their setup, not “crucial” in general.

Often you can analyze the dataset to see which mini-batches had gradients exceeding various thresholds, understand which training examples led to large gradients and why, and pre-process the data to remove the need for clipping. And since the whole thing is nonlinear, that might completely invalidate their other hyperparameters once the training set is “cleaned up”.
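
A rough sketch of that kind of audit (the model, data and threshold are stand-ins): compute the gradient norm per mini-batch and record which batches, and hence which examples, push it past a threshold.

```python
import numpy as np
import tensorflow as tf

x = np.random.rand(1000, 20).astype("float32")  # stand-in features
y = np.random.rand(1000, 1).astype("float32")   # stand-in targets

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1),
])
loss_fn = tf.keras.losses.MeanSquaredError()
threshold = 10.0  # arbitrary; pick it from the gradient-norm distribution you observe

suspect_batches = []
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(32)
for i, (xb, yb) in enumerate(dataset):
    with tf.GradientTape() as tape:
        loss = loss_fn(yb, model(xb, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    norm = tf.linalg.global_norm(grads).numpy()
    if norm > threshold:
        # These are the batches/examples to inspect for outliers, bad labels,
        # or unscaled features before deciding clipping is "crucial".
        suspect_batches.append((i, norm))

print(suspect_batches)
```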

Not saying this is what is going on here with this research group, but you’d be amazed how often this is the case and some complex trial-and-error is being done just to avoid debugging and understanding why the simpler approach that should have worked didn’t.

3

arhetorical t1_iw16x4q wrote

It looks like a lot but there's nothing especially weird in there. If you spend some time tuning your model you'll probably end up with something like that too.

Adam - standard.

Linear warmup and decay - warmup and decay are very common. The exact shape might vary, but cosine decay is often used.

Decreasing the update frequency - probably something you'd come up with after inspecting the training curve and trying to get a little more performance out of it.

Clipping the gradients - pretty common solution for "why isn't my model training properly". Maybe a bit hacky but if it works, it works.

The numbers themselves are usually just a matter of hand tuning and/or hyperparameter search.
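
To make that concrete, a hedged sketch of what such a recipe can look like in Keras (every number here is a placeholder, not anything from the paper): Adam, linear warmup into cosine decay, and global-norm gradient clipping.

```python
import math
import tensorflow as tf

class WarmupCosine(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Linear warmup to peak_lr, then cosine decay toward zero."""
    def __init__(self, peak_lr, warmup_steps, total_steps):
        super().__init__()
        self.peak_lr = peak_lr
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        warmup = self.peak_lr * step / self.warmup_steps
        progress = (step - self.warmup_steps) / (self.total_steps - self.warmup_steps)
        progress = tf.clip_by_value(progress, 0.0, 1.0)
        cosine = 0.5 * self.peak_lr * (1.0 + tf.cos(math.pi * progress))
        return tf.where(step < self.warmup_steps, warmup, cosine)

schedule = WarmupCosine(peak_lr=3e-4, warmup_steps=1_000, total_steps=100_000)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule,
                                     global_clipnorm=1.0)  # clip gradients by global norm
```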

5

vk6flab t1_ivxzjs5 wrote

It depends on who's paying.

If it works, it's the idea that the head of marketing came up with over lunch and he'll let everyone know about how insightful and brilliant he is.

If it doesn't work, it's the boat anchor devised by the idiot consultant hired by the former head of marketing, who is now sadly no longer with the company due to family reasons.

In actuality, likely the intern did it.

Source: I work in IT.

−13

atlvet t1_ivwrghj wrote

Not sure how they’re doing it, but software like Chorus.ai can log in to Zoom meetings and transcribe them. I don’t know if they’re doing it by somehow identifying which attendee’s feed is speaking, or if they just get a straight video/audio feed and pick out the different speakers from it.

1

suflaj t1_ivumz3h wrote

It has not been marketed as such because it's built on top of ASR. Hence, you search for ASR and then look at its features. The same way you look for object detection and, if you need segmentation, you check whether it has a detector that does segmentation. A layman looking for a solution does not search for specific terms, and marketers know this.

Be that as it may, the answer remains the same - Google offers the most advanced and performant solution. It markets it as ASR, or as they call it, speech-to-text, with this so-called diarization being one feature of it.
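
For reference, this is roughly how that feature is exposed in Google's Python client, as far as I recall; treat the exact field names as an assumption and check the current docs, and the bucket URI is a placeholder.

```python
from google.cloud import speech  # pip install google-cloud-speech

client = speech.SpeechClient()

diarization = speech.SpeakerDiarizationConfig(
    enable_speaker_diarization=True,
    min_speaker_count=2,
    max_speaker_count=6,
)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    diarization_config=diarization,
)
audio = speech.RecognitionAudio(uri="gs://your-bucket/meeting.wav")  # placeholder URI

response = client.recognize(config=config, audio=audio)

# Each recognized word carries a speaker_tag saying which speaker it belongs to.
for word in response.results[-1].alternatives[0].words:
    print(word.word, word.speaker_tag)
```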

2

Snickersman6 t1_ivum0nx wrote

You mentioned automatic speech recognition, which is not what I was really asking about; I was asking about speaker diarization. The link below goes over the differences. Diarization may be a part of ASR, but I don't know if ASR does it on its own as part of the speech recognition.

https://deepgram.com/blog/what-is-speaker-diarization/

1