nateharada
nateharada OP t1_j4tojyh wrote
Reply to comment by lorenzo1384 in [P] A small tool that shuts down your machine when GPU utilization drops too low. by nateharada
Yeah, it should work if you use the API (and if you have a GPU in your Colab). I don't think it'll work with TPUs just yet.
nateharada OP t1_j4ou7qz wrote
Reply to comment by MuonManLaserJab in [P] A small tool that shuts down your machine when GPU utilization drops too low. by nateharada
./popquiz_hotshot.sh
nateharada OP t1_j4otocf wrote
Reply to comment by Fit_Schedule5951 in [P] A small tool that shuts down your machine when GPU utilization drops too low. by nateharada
This tool doesn't look at memory right now, just compute utilization. Loading your model usually eats up close to the max memory until training is done, even if compute usage is very low.
If your training is hanging but still burning GPU cycles, that'd be harder to detect, I think.
nateharada OP t1_j4ngy65 wrote
Reply to comment by Zealousideal_Low1287 in [P] A small tool that shuts down your machine when GPU utilization drops too low. by nateharada
It's actually almost entirely ready now, I just need to alter a few things. I'll go ahead and push it soon! Need to do some final tests.
EDIT: The above code should work! See the README on GitHub for a complete example.
nateharada OP t1_j4ne979 wrote
Reply to comment by Zealousideal_Low1287 in [P] A small tool that shuts down your machine when GPU utilization drops too low. by nateharada
Nice! Right now you can use the end_process trigger to just return 0 from the process when the trigger is hit, but it should be fairly straightforward to expose a bit more of the API. That would let you do something like this in your script:
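import time

from gpu_sentinel import Sentinel, get_gpu_usage

def my_callback_fn():
    # Placeholder handler (signature assumed): trigger an alert, etc.
    print("GPU usage dropped too low")

sentinel = Sentinel(
    arm_duration=10,
    arm_threshold=0.7,
    kill_duration=60,
    kill_threshold=0.7,
    kill_fn=my_callback_fn,
)

while True:
    # Sample GPU usage once per second and feed it to the sentinel.
    gpu_usage = get_gpu_usage(device_ids=[0, 1, 2, 3])
    sentinel.tick(gpu_usage)
    time.sleep(1)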
Is that something that would be useful? You can define the callback function yourself so maybe you trigger an alert, etc.
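In case it helps to picture how the arm/kill parameters could interact, here's a rough sketch of the logic. This is not the actual gpu_sentinel code: the SketchSentinel name is made up, and the semantics are my assumptions (durations counted in ticks, usage averaged across devices, arming after usage stays at or above arm_threshold for arm_duration ticks, then calling kill_fn once usage stays below kill_threshold for kill_duration ticks).

```python
class SketchSentinel:
    """Hypothetical stand-in for the Sentinel above; not the real gpu_sentinel class."""

    def __init__(self, arm_duration, arm_threshold, kill_duration, kill_threshold, kill_fn):
        self.arm_duration = arm_duration
        self.arm_threshold = arm_threshold
        self.kill_duration = kill_duration
        self.kill_threshold = kill_threshold
        self.kill_fn = kill_fn
        self.armed = False
        self.fired = False
        self._streak = 0  # consecutive ticks meeting the current condition

    def tick(self, gpu_usage):
        """Feed one sample: a list of per-device utilizations in [0, 1]."""
        if self.fired or not gpu_usage:
            return
        mean_usage = sum(gpu_usage) / len(gpu_usage)
        if not self.armed:
            # Assumed: arm only after usage has stayed high long enough.
            self._streak = self._streak + 1 if mean_usage >= self.arm_threshold else 0
            if self._streak >= self.arm_duration:
                self.armed = True
                self._streak = 0
        else:
            # Assumed: fire once usage has stayed low long enough; a spike resets the count.
            self._streak = self._streak + 1 if mean_usage < self.kill_threshold else 0
            if self._streak >= self.kill_duration:
                self.fired = True
                self.kill_fn()


# Usage: two busy ticks arm the sentinel, three idle ticks trigger the callback.
events = []
sentinel = SketchSentinel(2, 0.7, 3, 0.7, lambda: events.append("killed"))
for usage in [0.9, 0.9, 0.1, 0.1, 0.1]:
    sentinel.tick([usage])
```

The arming step is there so the callback doesn't fire while the job is still starting up and the GPU hasn't been touched yet.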
Submitted by nateharada t3_10do40p in MachineLearning
nateharada t1_jeh5bir wrote
Reply to comment by lacker in [D][N] LAION Launches Petition to Establish an International Publicly Funded Supercomputing Facility for Open Source Large-scale AI Research and its Safety by stringShuffle
I personally feel we need large-scale collaboration, not each lab getting a small increase. Something like a James Webb telescope or a CERN. If they build a large cluster that's just time-shared between labs, that's not as useful IMO as allowing many universities to collaborate on a truly public LLM that competes with the biggest private AI organizations.