Recent comments in /f/MachineLearning

brownmamba94 t1_jd6zyd5 wrote

I totally agree, and I really wonder how the landscape will look in 10 years when it comes to ML model architectures, training strategies, optimization techniques, etc. It'll be very interesting. Although plasticity-based learning, spiking neural networks, and other neuromorphic algorithms that use local learning rules don't get the same kind of attention as gradient-based learning, I do believe mimicking the neural activity of the brain by emulating spiking neural networks could one day be a good solution for inference (in terms of cost and power efficiency). For now, though, implementing spike-based learning and training has proven to be a challenge. But hey, one thing these approaches have in common is that sparsity is a key enabler for this type of hardware.

3

brownmamba94 t1_jd6xm1n wrote

Yes, that's right, it's usually the other way around, mainly because pre-training an LLM from scratch is computationally expensive for the average researcher. So they typically take existing pre-trained LLM checkpoints and fine-tune them on a domain-specific task. Pre-training requires several orders of magnitude more FLOPs than fine-tuning.
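As a rough back-of-the-envelope comparison (illustrative numbers I'm assuming here, not figures from our paper), using the common ≈6 × params × tokens estimate for training FLOPs:

```python
# Rough FLOPs comparison using the common ~6 * params * tokens estimate.
# All numbers below are illustrative assumptions, not from the paper.
params = 1.3e9              # a GPT-3 XL sized model
pretrain_tokens = 300e9     # a GPT-3 style pre-training token budget (assumed)
finetune_tokens = 100e6     # a typical downstream fine-tuning set (assumed)

train_flops = lambda tokens: 6 * params * tokens
print(f"pre-train FLOPs: {train_flops(pretrain_tokens):.2e}")  # ~2.3e21
print(f"fine-tune FLOPs: {train_flops(finetune_tokens):.2e}")  # ~7.8e17
print(f"ratio: {pretrain_tokens / finetune_tokens:.0f}x")      # ~3000x
```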

In this work, like you said, we're aiming to show that thanks to the Cerebras CS-2, we can achieve faster pre-training with unstructured weight sparsity, and then fine-tune dense to recover performance on the downstream task. The ability to do faster pre-training opens up a lot of potential for new directions in LLM research. Note that an interesting extension of our work would be to do sparse pre-training followed by parameter-efficient fine-tuning using techniques like LoRA from Microsoft.
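To make the sparse pre-training followed by dense fine-tuning idea concrete, here is a minimal PyTorch-style sketch (purely illustrative, not our actual training setup; the model, masking scheme, sparsity level, and loss are all placeholders):

```python
import torch
import torch.nn as nn

def make_unstructured_masks(model, sparsity=0.75):
    """Random static binary masks over the 2-D weight matrices (illustrative)."""
    return {name: (torch.rand_like(p) > sparsity).float()
            for name, p in model.named_parameters() if p.dim() == 2}

def apply_masks(model, masks):
    """Re-zero pruned weights so the network stays sparse after each update."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name])

# Stand-in for a GPT block; a placeholder loss replaces the LM objective.
model = nn.TransformerEncoderLayer(d_model=512, nhead=8)
masks = make_unstructured_masks(model, sparsity=0.75)
apply_masks(model, masks)  # start from a sparse initialization
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# --- Sparse pre-training: masks are fixed and re-applied every step ---
for step in range(100):
    x = torch.randn(16, 8, 512)
    loss = model(x).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    apply_masks(model, masks)

# --- Dense fine-tuning: simply stop masking, so all weights can update ---
for step in range(100):
    x = torch.randn(16, 8, 512)
    loss = model(x).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

On a GPU this only simulates sparsity (the zeros are still stored and multiplied); the point of the CS-2 is that the masked-out weights become actual skipped work.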

There are actually a couple of really nice blog posts from Sean Lie, our Co-founder and Chief Hardware Architect, discussing how the Cerebras CS-2 can translate unstructured sparsity into realized gains, unlike traditional GPUs. All the experiments in our paper were done on the CS-2, including the 1.3B GPT-3 XL; there was no GPU training here. I encourage you to check out these blog posts:

- Harnessing the Power of Sparsity for Large GPT AI Models
- Cerebras Architecture Deep Dive: First Look Inside the HW/SW Co-Design for Deep Learning

4

osdd_alt_123 t1_jd6ufjz wrote

Nvidia has 2:4 structured sparsity in the Ampere architecture, and in one or two generations below it as well, if memory serves.

So in a block of 4, you have to have 2 dropped and 2 retained. It's how they claim their 2x throughput at the hardware level.

You can, however, emulate sparsity in a variety of other ways above the hardware level. Hope this helps.
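If it helps to see it concretely, here's a tiny PyTorch sketch of what the 2:4 pattern means for a weight tensor. This just emulates the pattern in software (it is not NVIDIA's actual sparse tensor core API), keeping the 2 largest-magnitude values per group of 4:

```python
import torch

def prune_2_to_4(weight: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude values in every contiguous group of 4."""
    assert weight.shape[-1] % 4 == 0
    groups = weight.reshape(-1, 4)
    # indices of the 2 smallest-magnitude entries in each group -> dropped
    drop_idx = groups.abs().topk(2, dim=1, largest=False).indices
    mask = torch.ones_like(groups)
    mask.scatter_(1, drop_idx, 0.0)
    return (groups * mask).reshape(weight.shape)

w = torch.randn(8, 16)
w_24 = prune_2_to_4(w)
# exactly 2 of every 4 consecutive entries survive -> 50% sparsity
assert (w_24.reshape(-1, 4) != 0).sum(dim=1).eq(2).all()
```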

5

_Arsenie_Boca_ t1_jd6u2my wrote

First time I've heard of sparse pretraining and dense finetuning. Usually it's the other way around, right? So that you get faster inference speeds. Is it correct that you are aiming for faster pretraining through sparsity here, while keeping normal dense inference speeds?

Also, could you provide an intuition on how Cerebras is able to translate unstructured sparsity into speedups? Since you pretrained a 1.3B model, I assume it runs on GPUs, unlike DeepSparse?

3

maizeq t1_jd6kpnj wrote

> The Cerebras CS-2 is designed to accelerate unstructured sparsity, whereas GPUs are not.

Don't modern NVIDIA GPUs (2000-series and up) have strong support for sparsity (maximum theoretical FLOPs are doubled when doing sparse computation)? From their documentation, the type of sparsity they support is also unstructured (e.g., randomly pruned values in tensors). Does the Cerebras chip have higher sparse FLOPs, or does the comparison not make sense?

5

fnbr t1_jd6j7gh wrote

Right now, the tech isn't there to train on a single GPU; you'd end up training a language model for ~1 month to do so. It is slightly more efficient, though.

Lots of people are looking at running models locally. In addition to everything people have said, there are a bunch of companies that will soon be releasing models that can just barely fit on an A100, based on rumours I've heard from employees.

1

brownmamba94 t1_jd6j6pt wrote

Hi, this is the first author on the paper. You asked a great question, and it's something we are pursuing internally. In this study we kept things simple and switched from sparse to completely dense during fine-tuning. But as for future work, you're right, we can certainly vary the amount of "redensification" as well (e.g., 25%, 50%, or possibly some schedule). This is a very interesting research direction, because the full dense capacity of the model may not be needed to recover performance on the downstream task.
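Purely as an illustration of what a redensification schedule could look like (a sketch of the idea, not code from our paper), assuming a binary mask left over from sparse pre-training:

```python
import torch

def redensify(mask: torch.Tensor, fraction: float) -> torch.Tensor:
    """Re-enable a random `fraction` of the currently pruned weights."""
    pruned = (mask == 0).nonzero(as_tuple=False)          # (K, 2) indices of zeros
    n_revive = int(fraction * pruned.shape[0])
    revive = pruned[torch.randperm(pruned.shape[0])[:n_revive]]
    new_mask = mask.clone()
    new_mask[revive[:, 0], revive[:, 1]] = 1.0
    return new_mask

mask = (torch.rand(512, 512) > 0.75).float()   # ~75% sparse mask from pre-training
mask_half = redensify(mask, 0.50)              # "50% redensification"
print(1 - mask.mean().item(), 1 - mask_half.mean().item())  # ~0.75 -> ~0.375 sparsity
```

One could also revive weights gradually over fine-tuning steps instead of all at once; that's the "possibly some schedule" part.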

9