Recent comments in /f/deeplearning
scitech_boom t1_iw6zpa9 wrote
Reply to comment by Thijs-vW in Update an already trained neural network on new data by Thijs-vW
There are multiple reasons. The main issue has to do with validation error. It usually follows a U curve, with a minimum at some epoch. That minimum is the point at which we usually stop training (`early stopping`). Any further training, with or without new data, is only going to make performance worse (I don't have a paper to cite for that).
I also started with the best model and that did not work. But when I took the checkpoint from 2 epochs before the best model, it worked well. In my case (speech recognition), it was a nice balance between improvement and training time.
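Roughly, picking the earlier checkpoint looks like this in Keras (a minimal sketch; `model` and the data variables are assumed to already exist, and the checkpoint filename pattern is hypothetical):

```python
import numpy as np
import tensorflow as tf

# Checkpoint every epoch so we can later load weights from a couple of
# epochs *before* the best validation score, not the best one itself.
ckpt = tf.keras.callbacks.ModelCheckpoint(
    "ckpt_{epoch:02d}.h5", save_weights_only=True)
stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    epochs=100, callbacks=[ckpt, stop])

# Epoch with the lowest validation loss (1-indexed), then step back 2.
best_epoch = int(np.argmin(history.history["val_loss"])) + 1
model.load_weights(f"ckpt_{max(best_epoch - 2, 1):02d}.h5")
```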
Thijs-vW OP t1_iw6ryeq wrote
Reply to comment by BugSlayerJohn in Update an already trained neural network on new data by Thijs-vW
Thanks for the advice. Unfortunately I do not think transfer learning is the best thing for me to do, considering:
>if you train only on the new data, that's all it will know how to predict.
Anyhow,
>If retraining the entire model on the complete data set is possible with nominal cost in less than a few days, do that.
This is indeed the case. However, if I retrain my entire model, it is very likely that the new model will make entirely different predictions, since its weight matrix will not be identical. This is the problem I would like to avoid. Do you have any advice on that?
Thijs-vW OP t1_iw6rmvh wrote
Reply to comment by scitech_boom in Update an already trained neural network on new data by Thijs-vW
>Anyhow, you cannot do this:
I do not understand why I cannot train my already trained model on new data. Could you elaborate?
Conscious_Amount1339 t1_iw5wkmw wrote
Lol
BugSlayerJohn t1_iw5kxgs wrote
If retraining the entire model on the complete data set is possible with nominal cost in less than a few days, do that. If not, it's worth trying transfer learning: https://www.tensorflow.org/guide/keras/transfer_learning
Note that transfer learning is a shortcut: you are almost certainly sacrificing some accuracy to avoid a prohibitive amount of retraining. You'll also still need to train the new layers against a data set that completely represents the results you want, i.e. if you train only on the new data, that's all it will know how to predict.
If you don't have the original data set, but do have abundant training resources and time, you could try a Siamese-like approach: a suitable percentage of the training data fed to the new network is generated data, with target values based on predictions from the current network, and the rest is the new data you would like the network to learn. This will probably work better when the new data is entirely novel.
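A rough sketch of that pseudo-labeling idea in Keras, assuming a regression setup; `old_model`, `new_x`/`new_y`, and `sample_inputs` (something that draws inputs from the original domain) are all hypothetical names:

```python
import numpy as np
import tensorflow as tf

# Surrogate data: inputs drawn from the original domain, with targets
# taken from the current network's predictions ("teacher" labels).
surrogate_x = sample_inputs(50_000)           # hypothetical domain sampler
surrogate_y = old_model.predict(surrogate_x)

# Mix surrogate data with the genuinely new labeled data.
mix_x = np.concatenate([surrogate_x, new_x])
mix_y = np.concatenate([surrogate_y, new_y])

# Train a copy so the original network stays available as the teacher.
new_model = tf.keras.models.clone_model(old_model)
new_model.set_weights(old_model.get_weights())
new_model.compile(optimizer="adam", loss="mse")
new_model.fit(mix_x, mix_y, epochs=5, validation_split=0.1, shuffle=True)
```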
scitech_boom t1_iw4ck9z wrote
>Concatenate old and new data and train one epoch.
This is what I did in the past and it worked reasonably well for my cases. But is that the best? I don't know.
Anyhow, you cannot do this:
>Simultaneously, I do want to use this model as starting point,
Instead, pick the weights from 2 or 3 epochs before the best-performing one in the previous training. That should be the starting point.
Training on top of something that has already hit the bottom won't help, even if we add more data.
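As a minimal sketch (the checkpoint path and data names are hypothetical, and `model` is assumed to be compiled):

```python
import numpy as np

# Start from the weights saved 2-3 epochs before the best epoch,
# then fine-tune on old + new data together for a single epoch.
model.load_weights("ckpt_before_best.h5")   # hypothetical checkpoint
x = np.concatenate([old_x, new_x])
y = np.concatenate([old_y, new_y])
model.fit(x, y, epochs=1, validation_data=(val_x, val_y), shuffle=True)
```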
MyHomeworkAteMyDog t1_iw48fy1 wrote
How about you mix old and new samples together, and only back-propagate the error on new samples while tracking the error on old samples? Observe whether training on new samples hurts performance on old samples.
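One way to sketch that in Keras is with `sample_weight` — zero-weighted old samples contribute no gradient but still get mixed into the batches (data names are hypothetical):

```python
import numpy as np

# Old samples get weight 0 (no gradient), new samples weight 1.
x = np.concatenate([old_x, new_x])
y = np.concatenate([old_y, new_y])
w = np.concatenate([np.zeros(len(old_x)), np.ones(len(new_x))])

model.fit(x, y, sample_weight=w, epochs=3, shuffle=True)

# Check whether training on new samples hurt old-sample performance.
print("loss on old samples:", model.evaluate(old_x, old_y, verbose=0))
```

Note that zero-weighted samples still go through the forward pass, so things like BatchNorm statistics will still see them.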
RichardBJ1 t1_iw43b43 wrote
Well, transfer learning would be the thing I'd expect people to say: freeze the top and bottom layers, reload the old model weights, and continue training… but for me the best thing to do has always been to throw the old weights away, mix up the old and new training data sets, and start again… Sorry!!
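For reference, the reload-and-freeze variant might look like this (a sketch; the path and the number of frozen layers are made up):

```python
import tensorflow as tf

# Reload the previously trained model and freeze all but the last
# two layers, then continue training on the new data.
model = tf.keras.models.load_model("old_model.h5")   # hypothetical path
for layer in model.layers[:-2]:
    layer.trainable = False

# Re-compile after changing trainability, with a small learning rate.
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")
model.fit(new_x, new_y, epochs=5, validation_split=0.1)
```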
alcome1614 t1_iw35t9r wrote
The first thing is to keep a copy of the neural network you already trained. That way you can try whatever you want.
artsybashev t1_iw29zh1 wrote
A lot of deep learning has been the modern equivalent of witchcraft: ideas that might make sense, squashed together.
Hyperparameter tuning is one of the most obscure and hardest-to-learn parts of neural network training, since it is hard to do multiple runs for models that take more than a few weeks, or thousands of dollars, to train. Most researchers have just learned some good initial guesses, and run the model with a few sets of hyperparameters from which the best result is chosen.
Some of the hyperparameter tuning can also be done on a smaller model, and the amount of tuning reduced while growing the model to the target size.
ConsiderationCivil74 t1_iw1vtqd wrote
Like the words of the villain in Agents of S.H.I.E.L.D.: discovery requires experimentation.
dipthinker t1_iw1oz77 wrote
It's an art
BrotherAmazing t1_iw18e2b wrote
Usually, if they share their dataset and problem with you, and you spend just a few hours on it (with extensive experience designing and training deep NNs from scratch), you can find something incredibly simple, like plain learning-rate decay, that works just as well as gradient clipping, showing the clipping was only “crucial” for their setup but not “crucial” in general.
Often you can analyze the dataset to see which mini-batches had gradients exceeding various thresholds, understand which training examples led to large gradients and why, and pre-process the data to remove the need for clipping (a sketch of that gradient-norm logging is below). And since the whole thing is nonlinear, that might completely invalidate their other hyperparameters once the training set is “cleaned up”.
Not saying this is what is going on here with this research group, but you'd be amazed how often it is, and how often complex trial-and-error is done just to avoid debugging and understanding why the simpler approach that should have worked didn't.
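For what it's worth, logging per-batch gradient norms takes only a few lines in TF (a sketch; `model`, `dataset`, and the threshold are placeholders):

```python
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

# Log the global gradient norm per mini-batch to find the examples
# producing the spikes that make clipping look "crucial".
for step, (xb, yb) in enumerate(dataset):
    with tf.GradientTape() as tape:
        loss = loss_fn(yb, model(xb, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    grads = [g for g in grads if g is not None]
    gnorm = float(tf.linalg.global_norm(grads))
    if gnorm > 10.0:   # arbitrary threshold
        print(f"step {step}: grad norm {gnorm:.1f} -- inspect this batch")
```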
arhetorical t1_iw16x4q wrote
It looks like a lot but there's nothing especially weird in there. If you spend some time tuning your model you'll probably end up with something like that too.
Adam - standard.
Linear warmup and decay - warmup and decay are very common. The exact shape might vary, but cosine decay is often used.
Decreasing the update frequency - probably something you'd come up with after inspecting the training curve and trying to get a little more performance out of it.
Clipping the gradients - pretty common solution for "why isn't my model training properly". Maybe a bit hacky but if it works, it works.
The numbers themselves are usually just a matter of hand tuning and/or hyperparameter search.
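As a hedged illustration, the whole recipe fits in a few lines of Keras (all numbers are placeholders, and the schedule is a hand-rolled warmup + cosine decay):

```python
import math
import tensorflow as tf

class WarmupCosine(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    def __init__(self, peak_lr, warmup_steps, total_steps):
        super().__init__()
        self.peak_lr = peak_lr
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        warm = self.peak_lr * step / self.warmup_steps
        prog = (step - self.warmup_steps) / (self.total_steps - self.warmup_steps)
        cosine = 0.5 * self.peak_lr * (1.0 + tf.cos(math.pi * prog))
        return tf.where(step < self.warmup_steps, warm, cosine)

# Adam + warmup/cosine schedule + gradient clipping.
opt = tf.keras.optimizers.Adam(
    learning_rate=WarmupCosine(3e-4, warmup_steps=1_000, total_steps=100_000),
    global_clipnorm=1.0,   # clip the global gradient norm
)
```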
chengstark t1_iw0xw0w wrote
Some trial and error and some common techniques. Warmup and LR scheduling are not hard to think of.
lucidrage t1_iw0fvn4 wrote
It's mostly trial and error and cobbling together training methods used in whatever paper the devs most recently read.
kroust2020 t1_ivzrn9e wrote
They overfit :)
neuralbeans t1_ivyhjxo wrote
Usually it's whatever the experimenter likes using, together with a little tuning of the numbers.
vk6flab t1_ivxzjs5 wrote
It depends on who's paying.
If it works, it's the idea that the head of marketing came up with over lunch and he'll let everyone know about how insightful and brilliant he is.
If it doesn't work, it's the boat anchor devised by the idiot consultant, hired by the former head of marketing who now is sadly no longer with the company, due to family reasons.
In actuality, likely the intern did it.
Source: I work in IT.
atlvet t1_ivwrghj wrote
Reply to would it be possible to train something that processes a video and outputs a text script like the following? Teacher: That is the topic we will be covering today. Student 1: What about the part of the lesson we didnt go over yesterday. by [deleted]
Not sure how they're doing it, but software like Chorus.ai can log in to Zoom meetings and transcribe them. I don't know if they identify which attendee's feed is speaking somehow, or if they just get a straight video/audio feed and pick out the different speakers.
Garbage-Shoddy t1_ivvrgu4 wrote
Reply to comment by Prestigious_Boat_386 in would it be possible to train something that processes a video and outputs a text script like the following? Teacher: That is the topic we will be covering today. Student 1: What about the part of the lesson we didnt go over yesterday. by [deleted]
I don’t think machine transcription is such a niche application
suflaj t1_ivumz3h wrote
Reply to comment by Snickersman6 in would it be possible to train something that processes a video and outputs a text script like the following? Teacher: That is the topic we will be covering today. Student 1: What about the part of the lesson we didnt go over yesterday. by [deleted]
It has not been marketed as such because it's built on top of ASR. Hence, you search for ASR and then look through its features. The same way you look for object detection, and if you need segmentation, you check whether it has a detector that does segmentation. A layman looking for a solution does not search for specific terms, and marketers know this.
Be that as it may, the answer remains the same: Google offers the most advanced and performant solution. It markets it as ASR, or as they call it, Speech-to-Text, with this so-called diarization being one feature of it.
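For example, with the Google Cloud Speech-to-Text Python client, diarization is just a config flag (a sketch; the bucket path is made up and field names follow the v1 API, so double-check the current docs):

```python
from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    diarization_config=speech.SpeakerDiarizationConfig(
        enable_speaker_diarization=True,
        min_speaker_count=2,
        max_speaker_count=6,
    ),
)
audio = speech.RecognitionAudio(uri="gs://my-bucket/lesson.wav")  # hypothetical

response = client.recognize(config=config, audio=audio)

# With diarization enabled, the final result aggregates per-word
# speaker tags for the whole audio.
for word in response.results[-1].alternatives[0].words:
    print(f"speaker {word.speaker_tag}: {word.word}")
```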
Snickersman6 t1_ivum0nx wrote
Reply to comment by suflaj in would it be possible to train something that processes a video and outputs a text script like the following? Teacher: That is the topic we will be covering today. Student 1: What about the part of the lesson we didnt go over yesterday. by [deleted]
You mentioned automatic speech recognition, which is not what I was really asking about; I was asking about speaker diarization. The link below goes over the differences. It may be a part of ASR, but I don't know if it does that on its own as part of the speech recognition.
suflaj t1_ivuj62a wrote
Reply to comment by Snickersman6 in would it be possible to train something that processes a video and outputs a text script like the following? Teacher: That is the topic we will be covering today. Student 1: What about the part of the lesson we didnt go over yesterday. by [deleted]
Yeah, as said previously, Google is a master of it - e.g. look at the Pixel 7's ASR.
I believe it's still called ASR.
jobeta t1_iw6zxwa wrote
Reply to comment by scitech_boom in Update an already trained neural network on new data by Thijs-vW
I don't have much experience with that specific problem, but I would tend to think it's hard to generalize like this to “models that hit the bottom” without knowing what the validation loss actually looked like and what the new data looks like. Chances are, this data is not just perfectly sampled from the first dataset, and its features have some idiosyncratic/new statistical properties. In that case, by feeding it in some way to your pre-trained model, the model's loss is mechanically no longer in the minimum it supposedly reached in the first training run.