Recent comments in /f/MachineLearning

remghoost7 t1_jcq8emm wrote

Ah, it was made with unreal.....? I didn't see that.

I always love adaptations of video game engines. One of the reasons I've been a huge fan of Unity for years. It's essentially just a wrapper for C# code with a pretty interface.

1

londons_explorer t1_jcpzan9 wrote

I would make 'fake' data which isn't HIPAA-protected and do most of your work on that.

Then do a final fine-tune on the HIPAA data on some rented servers. Your HIPAA data probably isn't more than a few hundred billion words anyway, so a few full passes over the dataset should be quick and cheap.
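A minimal sketch of the "fake data" half of this idea: generate synthetic stand-in records that share the shape of clinical notes but contain no real PHI. The field pools and record template below are hypothetical, purely for illustration.

```python
import random

# Hypothetical value pools -- none of these come from real records.
FIRST_NAMES = ["Alex", "Sam", "Jordan", "Casey", "Riley"]
CONDITIONS = ["hypertension", "type 2 diabetes", "asthma", "migraine"]

def synthetic_record(rng):
    """Generate one fake patient note containing no real PHI."""
    name = rng.choice(FIRST_NAMES)
    age = rng.randint(18, 90)
    condition = rng.choice(CONDITIONS)
    return f"Patient {name}, age {age}, presents with {condition}."

rng = random.Random(0)  # seeded for reproducibility
corpus = [synthetic_record(rng) for _ in range(1000)]
```

You'd pre-train/experiment on a corpus like this (or on public medical text), and only touch the real protected data for the final fine-tuning pass.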

1

Sad-Comedian-711 t1_jcpvahm wrote

So there is flash attention and then there is block sparse flash attention.

Flash attention by itself only got them to 16k tokens on an A100 for their model; to go further they needed windowed attention. You could already reach 16k with windowed attention before this paper without much issue.

The special thing about this windowed attention is that it works in blocks that fit into SRAM. From what I can tell, PyTorch's implementation of Flash Attention doesn't support block-sparse flash attention:

https://github.com/pytorch/pytorch/blob/eb32bb2ca6811ea21002699f4be884d3012dc362/aten/src/ATen/native/transformers/cuda/flash_attn/fmha_fprop_kernel_1xN.h

While Triton's looks like it does: https://github.com/openai/triton/blob/c9740f0870f6ae2480acd2a76a5fb4c920bc5ce5/python/triton/ops/flash_attention.py

I think the windowing must be done in blocks that align with the SRAM grid, so it kinda has to be part of the Flash Attention implementation. You might be able to throw normal BigBird block-sparse attention on top...

You also may be able to call out to triton's implementation:
https://github.com/violethaze74/pumpkin-py/blob/d9250933bec045e6add61b3930ff3dbbe08f6501/aten/src/ATen/native/transformers/attention.cpp#L726
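To make the windowed-attention pattern concrete, here's a rough sketch in plain PyTorch: a causal sliding-window mask fed to `scaled_dot_product_attention`. Note this materializes the full mask, so it has none of the memory savings of a fused block-sparse kernel — it only shows which positions a windowed pattern keeps. Sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

def sliding_window_mask(seq_len, window):
    """Boolean mask: position i may attend to j iff j <= i and i - j < window."""
    idx = torch.arange(seq_len)
    i, j = idx.unsqueeze(1), idx.unsqueeze(0)
    return (j <= i) & (i - j < window)

seq_len, window, dim = 16, 4, 8
q = torch.randn(1, 1, seq_len, dim)  # (batch, heads, seq, head_dim)
k = torch.randn(1, 1, seq_len, dim)
v = torch.randn(1, 1, seq_len, dim)

mask = sliding_window_mask(seq_len, window)  # True = attend
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```

A fused implementation (like Triton's) instead skips whole key/value blocks whose window entries are all masked, which is where the long-context memory win comes from.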

3

MysteryInc152 OP t1_jcputc0 wrote

It uses relative positional encoding, so long context is possible in theory, but because it was trained with a 2048-token context, performance gradually declines beyond that. Fine-tuning for a longer context wouldn't be impossible, though.

You can run it with FP16 (13 GB RAM), 8-bit (10 GB), or 4-bit (6 GB) quantization.
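Those figures are roughly what weight-only arithmetic predicts for a ~7B-parameter model (an assumption on my part), plus runtime overhead — activations, KV cache, and layers kept in higher precision push real usage above the bare weight size:

```python
def weight_memory_gib(n_params, bits_per_param):
    """Rough memory for the model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

n = 7e9  # assumed ~7B-parameter model
for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{weight_memory_gib(n, bits):.1f} GiB weights")
```

So FP16 lands near 13 GiB for weights alone, while the 8-bit and 4-bit totals quoted above sit above the raw 6.5/3.3 GiB weight sizes because of that overhead.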

36

josejo9423 t1_jcpu2pe wrote

I would go with 1, but I wouldn't tune early stopping, just the number of estimators. XGBoost has the option of stopping iterations (early stopping) when there is no improvement in the metric. If you plot the eval metric and see that the model could have been stopped earlier, set the number of estimators to the value it reached before overfitting.

1

Oswald_Hydrabot t1_jcpqshf wrote

Those are bad managers. I certainly have had these conversations and I left companies over their response until I found one that listened.

You have to try harder. You have to stop accepting short-sighted near-term profit as "just how it is", or assuming that financial malpractice at scale is "good business". Because if you accept it and stop trying, failure is inevitable — and so are the corruption and corporate bailouts that take our tax revenue and cost us layoffs to pay for those mistakes. Stop being complacent if you cannot accept putting in the effort to make what you know is right a reality.

I have been involved in those conversations at the highest levels in some of the largest companies in the world. More often than not I told them to either listen to the consulting that they PAID me for, or I would take my business somewhere else — and I did. If you don't suck at what you do, then firing bad clients will not hurt you; in fact, it is critical to your own growth in your career. You need to treat your employer as a client.

1