VirtualHat

VirtualHat t1_izfu724 wrote

I use three scripts.

train.py (which trains my model)

worker.py (which picks up the next job and runs it using train.py)

runner.py (which is basically a list of jobs and code to display what's happening).

I then have multiple machines running multiple instances of worker.py. When a new job is created, the workers see it and start processing it. Work is broken into 5-epoch blocks, and at the end of each block, a new job from the priority queue is selected.

This way I can simply add a new job and within 30 minutes or so one of the workers will finish its current block and pick it up. Also, because of the chunking, I get early results on all the jobs rather than having to wait for them to finish. This is important, as I often know early on whether a job is worth finishing or not.
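For a concrete picture, here is a minimal sketch of what a worker loop like this could look like. It assumes jobs are JSON files in a shared folder and that train.py accepts --job/--start-epoch/--epochs flags; none of those names come from the actual scripts, they're just illustrative.

```python
# worker.py -- hypothetical sketch of the polling loop described above.
# Assumes each job is a JSON file in a shared "jobs/" folder with
# "name", "priority", "epochs_done", and "epochs_total" fields.
# NOTE: a real setup would need file locking so two workers don't grab
# the same job at once; that is omitted here for brevity.
import json
import subprocess
import time
from pathlib import Path

JOBS_DIR = Path("jobs")
BLOCK_EPOCHS = 5  # work is chunked into 5-epoch blocks


def next_job():
    """Return the path of the highest-priority unfinished job, or None."""
    candidates = []
    for path in JOBS_DIR.glob("*.json"):
        job = json.loads(path.read_text())
        if job["epochs_done"] < job["epochs_total"]:
            candidates.append((job["priority"], path))
    return min(candidates)[1] if candidates else None


while True:
    path = next_job()
    if path is None:
        time.sleep(60)  # no pending work; poll again later
        continue
    job = json.loads(path.read_text())
    # Run one 5-epoch block with train.py, resuming from where the job left off.
    subprocess.run([
        "python", "train.py",
        "--job", job["name"],
        "--start-epoch", str(job["epochs_done"]),
        "--epochs", str(BLOCK_EPOCHS),
    ], check=True)
    job["epochs_done"] += BLOCK_EPOCHS
    path.write_text(json.dumps(job))  # record progress and release the job
```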

I evaluate the results in a Jupyter notebook using the logs that each job creates.
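For example, if each job wrote a CSV log with epoch and val_accuracy columns (the paths and column names here are assumptions, not the actual log format), a notebook cell might look like:

```python
# Hypothetical notebook cell: plot validation accuracy for every job's log.
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

for log in Path("logs").glob("*.csv"):
    df = pd.read_csv(log)
    plt.plot(df["epoch"], df["val_accuracy"], label=log.stem)

plt.xlabel("epoch")
plt.ylabel("validation accuracy")
plt.legend()
plt.show()
```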

edit: fixed links.

5

VirtualHat t1_iz2qj72 wrote

Yes, a massive number of epochs with an overparameterized model. As mentioned, I wouldn't recommend it, though. It's just interesting that some of the intuition about how long to train for is changing from "too much is bad" to "too much is good".

If you are interested in this subject, I'd highly recommend https://openai.com/blog/deep-double-descent/ (which is about overparameterization), as well as the paper mentioned above (which is about over-training). Again - I wouldn't recommend this for your problem. It's just interesting.

It's also worth remembering that there will be a natural error rate for your problem (i.e., does X actually tell us what y is?). So it is possible that 70-75% test accuracy is the best you can do on your problem.

1

VirtualHat t1_iyz96uh wrote

If you make your model large enough, you will get to 100%. In fact, not only can you get to 100% accuracy, you can also drive the training loss effectively to 0. The paper I linked above discusses how this was previously considered a very bad idea but, if done carefully, can actually improve generalization.

Probably the best bet, though, is to just stick to the "stop when validation loss goes up" rule.
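A minimal sketch of that rule, using a toy PyTorch model and random data purely for illustration (nothing here comes from the paper or the thread):

```python
# Early stopping: keep the best checkpoint, stop once validation loss
# has not improved for `patience` epochs.
import torch
import torch.nn as nn

torch.manual_seed(0)
X_train, y_train = torch.randn(512, 20), torch.randint(0, 2, (512,))
X_val, y_val = torch.randn(128, 20), torch.randint(0, 2, (128,))

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(1000):
    model.train()
    opt.zero_grad()
    loss = loss_fn(model(X_train), y_train)
    loss.backward()
    opt.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()

    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best.pt")  # keep the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # validation has been going up: stop
            print(f"early stop at epoch {epoch}, best val loss {best_val:.3f}")
            break
```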

1

VirtualHat t1_itkajqr wrote

I use PyTorch every day and haven't gone back to TF in years. That being said, there are lots of old projects still on TF, many of them on the older 1.x versions from before most of the rough edges were fixed.

I'm glad they're working on XLA and JAX though.

3

VirtualHat t1_isullw4 wrote

Sometimes I get the feeling that the reasons people give for rejecting a candidate don't align with the real reason. It could be as simple as "we're already hiring a friend of one of our co-workers", but rather than tell you that, they make up a reason that is (legally) defensible but obviously not correct.

This happens a bit in certain companies where an internal promotion has already been decided on, but for 'fairness' they need to interview external applicants just to reject them.

1

VirtualHat t1_ir3tiey wrote

Here are some options

  1. Tune a smaller network, then apply the hyperparameters to the larger one and 'hope for the best'.
  2. As others have said, train for less time, for example 10 epochs rather than 100. I typically find this produces the wrong result, though (the best performer is often poor early on).
  3. For low-dimensional searches (e.g. 2D), perform a very coarse grid search (space samples an order of magnitude apart, maybe two), then just use the best model. This is often the best method, as you don't want to over-tune the hyperparameters.
  4. For high-dimensional searches, just use random search, then marginalize over all but one parameter using the mean of the best 5 runs (see the sketch after this list). This works really well.
  5. The goal is often to compare two methods rather than to maximize the score, in which case you can use other people's hyperparameters.
  6. Bayesian optimization is usually not worth the time. In low dimensions do a grid search; in high dimensions do a random search.
  7. If you have the resources, train your models in parallel. This is a really easy way to make use of multiple GPUs if you have them.
  8. In some cases you can early-stop runs that are clearly not working. I try not to do this, though.
  9. When I do a hyperparameter search, I do it on a different dataset from my main one, which helps make things quicker. I'm doing RL though, so it's a bit different I guess.
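To make option 4 concrete, here is a rough sketch of random search followed by marginalization over the best 5 runs. The search space, parameter names, and evaluate() function are all made up for illustration; evaluate() stands in for an actual training run.

```python
# Random search, then marginalize over all but one hyperparameter
# using the mean score of the best 5 runs for each value.
import random
from statistics import mean

SPACE = {
    "lr": [1e-4, 3e-4, 1e-3, 3e-3],
    "batch_size": [32, 64, 128, 256],
    "weight_decay": [0.0, 1e-5, 1e-4],
}


def evaluate(config):
    # Placeholder objective; replace with a real training run that
    # returns, e.g., validation accuracy.
    return -abs(config["lr"] - 1e-3) - abs(config["weight_decay"] - 1e-5)


def sample():
    return {k: random.choice(v) for k, v in SPACE.items()}


runs = []
for _ in range(50):
    cfg = sample()
    runs.append((cfg, evaluate(cfg)))

# Marginalize over everything except "lr": for each lr value, average
# the scores of the best 5 runs that used it.
for lr in SPACE["lr"]:
    scores = sorted((s for cfg, s in runs if cfg["lr"] == lr), reverse=True)
    if scores:
        print(f"lr={lr:g}: mean of best 5 = {mean(scores[:5]):.4f}")
```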

1

VirtualHat t1_iql9ewk wrote

I haven't heard this talked about much, but I think reading groups are by far the best way to dip your toes into a research field. It's a chance to read some papers you might not normally read, as well as get to know some interesting people in the field.

In my experience, reading groups are typically very open to outsiders, especially if you have an interest in the field.

4

VirtualHat t1_iql9639 wrote

I'm heading to NeurIPS this year (with a paper), and I see it as an opportunity to network as well as promote my research. This is my first time, so I'm also asking myself the same question of whether it's worth it or not.

If you end up going, I'd be really interested to hear your thoughts in two months' time about your experience and whether it ended up being worthwhile or not.

2