brownmamba94 t1_jd6zyd5 wrote on March 22, 2023 at 8:26 AM

Reply to comment by Carrasco_Santo in [R] SPDF - Sparse Pre-training and Dense Fine-tuning for Large Language Models by CS-fan-101

I totally agree, and really wonder how the landscape will look in 10 years when it comes to ML model architectures, training strategies, optimization techniques, etc...it'll be very interesting.Although plasticity-based learning, spiking neural networks, and other neuromorphic algorithms that use local learning rules don't get the same kind of attention as gradient based learning, I do believe mimicing the neural activity of the brain through emulating spiking neural networks could potentially one day be a good solution for inference (in terms of cost and power efficiency). Though, currently, implementing spike-based learning and training has still proven to be a challenge. But hey, one thing is common is that sparsity is a key enabler for these types of hardware.

Fabulous-Possible758 t1_jd6z1rh wrote on March 22, 2023 at 8:13 AM

Reply to comment by Suitable_Goose3637 in [P] Anyone interested in starting a Startup? by [deleted]

Best of luck to you. I’d save this post and if you ever get the chance to say “I told you so…”. Well, I’m not saying I wouldn’t.

brownmamba94 t1_jd6xm1n wrote on March 22, 2023 at 7:52 AM

Reply to comment by _Arsenie_Boca_ in [R] SPDF - Sparse Pre-training and Dense Fine-tuning for Large Language Models by CS-fan-101

Yes, that's right, usually it's the other way around and that's usually because for the average researcher its computationally expensive to pre-train the LLM from scratch. So, they often typically take existing pre-trained LLM checkpoints and perform fine-tuning on them on a domain specific task. The FLOPs required for pre-training is several orders of magnitude more FLOPs than fine-tuning.

In this work, like you said, we're aiming to show that thanks to the Cerebras CS-2, we can achieve faster pre-training with unstructured weight sparsity, and fine-tune dense to recover the performance on the downstream task. The ability to do faster pre-training opens up a lot of potential for new directions in LLM research. Note that an interesting extension of our work is to do sparse pre-training followed by parameter efficient fine-tuning using techniques like LoRA from Microsoft.

There's actually a couple really nice blogs from Sean Lie, our Co-founder and Chief Hardware Architect, discussing how the Cerebras CS-2 can translate unstructured sparsity to realized gains unlike traditional GPUs. All the experiments in our paper were done on the CS-2, including the 1.3B GPT-3 XL. There was no GPU training here. I encourage you to check out these blogs:

Harnessing the Power of Sparsity for Large GPT AI Models Cerebras Architecture Deep Dive: First Look Inside the HW/SW Co-Design for Deep Learning

juliensalinas OP t1_jd6uju4 wrote on March 22, 2023 at 7:09 AM

Reply to comment by No_Combination_6429 in [D] An Instruct Version Of GPT-J Using Stanford Alpaca's Dataset by juliensalinas

Thx!

kalakau t1_jd6uha4 wrote on March 22, 2023 at 7:08 AM

Reply to comment by Trolann in [P] searchGPT - a bing-like LLM-based Grounded Search Engine (with Demo, github) by michaelthwan_ai

that's got to be my favorite disclaimer i've ever read

"it's unpredictable and could do anything"

osdd_alt_123 t1_jd6ufjz wrote on March 22, 2023 at 7:07 AM

Reply to comment by maizeq in [R] SPDF - Sparse Pre-training and Dense Fine-tuning for Large Language Models by CS-fan-101

Nvidia has 2:4 structured sparsity in the Ampere architecture and one or two below as well, if memory serves.

So in a block of 4, you have to have 2 dropped and 2 retained. It's how they claim their 2x throughput at the hardware level.

You can, however, emulate sparsity in a variety of other ways that are higher than the hardware level. Hope this helps.

maizeq t1_jd6u4kb wrote on March 22, 2023 at 7:03 AM

Reply to comment by artsybashev in [R] SPDF - Sparse Pre-training and Dense Fine-tuning for Large Language Models by CS-fan-101

The sparsity they describe in this link entails randomly pruning weights (i.e. not specific channels like depthwise convolutions), which is what Graphcore is calling "unstructured".

_Arsenie_Boca_ t1_jd6u2my wrote on March 22, 2023 at 7:02 AM

Reply to [R] SPDF - Sparse Pre-training and Dense Fine-tuning for Large Language Models by CS-fan-101

First time I hear sparse pretraining and dense finetuning. Usually its the other way around right? So that you get faster inference speeds. Is it correct that you are aiming for faster pretraining through sparsity here, while having normal dense inference speeds?

Also, could you provide an intuition on how cerebras is able to translate unstructured sparsity to speedups? Since you pretrained a 1.3B model, I assume it runs on GPU, unlike DeepSparse?

atheist-projector t1_jd6trwc wrote on March 22, 2023 at 6:58 AM

Reply to [D] Running an LLM on "low" compute power machines? by Qwillbehr

I am considering doing algotrading with something like this. Nit sure if i will or not.

Gatensio t1_jd6rixk wrote on March 22, 2023 at 6:27 AM

Reply to comment by KerfuffleV2 in [D] Running an LLM on "low" compute power machines? by Qwillbehr

Doesn't 7B parameters require like 12-26GB of RAM depending on precision? How do you run the 30B?

SeaMeasurement9 t1_jd6qgmi wrote on March 22, 2023 at 6:13 AM

Reply to comment by Traditional-Ad-8715 in [Project] AI Voice Narrated Audiobooks by Traditional-Ad-8715

How did you get to do the entire book? I thought it’s limited by the number of characters

frownyface t1_jd6q1qi wrote on March 22, 2023 at 6:08 AM

Reply to comment by currentscurrents in [Project] Alpaca-30B: Facebook's 30b parameter LLaMa fine-tuned on the Alpaca dataset by imgonnarelph

There was an insane age of PC gaming where hardware was moving so fast that game developers were releasing games with max-settings that didn't run on any current hardware to try to future proof themselves from having a game suddenly feeling obsolete shortly after launch.

yahma t1_jd6ptt5 wrote on March 22, 2023 at 6:05 AM

Reply to comment by RedditLovingSun in [P] OpenAssistant is now live on reddit (Open Source ChatGPT alternative) by pixiegirl417

Good point, I forgot about this.

[deleted] t1_jd6pe94 wrote on March 22, 2023 at 5:59 AM

Reply to comment by ReasonablyBadass in [D] Running an LLM on "low" compute power machines? by Qwillbehr

[deleted]

VelvetyPenus t1_jd6ok63 wrote on March 22, 2023 at 5:48 AM

Reply to [Project] AI Voice Narrated Audiobooks by Traditional-Ad-8715

Cat in the Hat by Jeffrey Dahmer.

gootecks t1_jd6o1uo wrote on March 22, 2023 at 5:42 AM

Reply to [Project] Machine Learning for Audio: A library for audio analysis, feature extraction, etc by Leo_D517

Really interesting project! Do you think it could be used to detect sound effects in games? For example, you press a button in the game which triggers an attack that makes a sound when it connects.

ReasonablyBadass t1_jd6lnmk wrote on March 22, 2023 at 5:13 AM

Reply to comment by not_particulary in [D] Running an LLM on "low" compute power machines? by Qwillbehr

Note, Alpace isn't fully Open source. It's legal situation is kinda murky.

artsybashev t1_jd6l85h wrote on March 22, 2023 at 5:07 AM

Reply to comment by maizeq in [R] SPDF - Sparse Pre-training and Dense Fine-tuning for Large Language Models by CS-fan-101

nvidia has structured sparsity

[deleted] t1_jd6l1j9 wrote on March 22, 2023 at 5:05 AM

Reply to [Project] AI Voice Narrated Audiobooks by Traditional-Ad-8715

[removed]

maizeq t1_jd6kpnj wrote on March 22, 2023 at 5:01 AM

Reply to comment by CS-fan-101 in [R] SPDF - Sparse Pre-training and Dense Fine-tuning for Large Language Models by CS-fan-101

> The Cerebras CS-2 is designed to accelerate unstructured sparsity, whereas GPUs are not.

Don’t modern NVIDIA GPUs (2000s+) have strong support for sparsity (maximum theoretical flops are doubled when doing sparse computation?). From their documentation the type of sparsity they support is also unstructured (e.g randomly pruned values in tensors). Does the Cerebras chip have higher sparse flops, or does the comparison not make sense?

fnbr t1_jd6jsb6 wrote on March 22, 2023 at 4:51 AM

Reply to [D] What's the Time and Space Complexity of Transformer Models Inference? by Smooth-Earth-9897

The best analysis on this is from this blog:

https://kipp.ly/blog/transformer-inference-arithmetic/

fnbr t1_jd6j7gh wrote on March 22, 2023 at 4:45 AM

Reply to [D] Running an LLM on "low" compute power machines? by Qwillbehr

Right now, the tech isn't there to train on a single GPU. You're gonna end up training a language model for ~1 month to do so. It is slightly more efficient, though.

Lots of people looking at running locally. In addition to everything that people have said, there's a bunch of companies that will be releasing models that can just barely fit on an A100 soon that I've heard rumours about from employees.

brownmamba94 t1_jd6j6pt wrote on March 22, 2023 at 4:44 AM

Reply to comment by kilow4tt in [R] SPDF - Sparse Pre-training and Dense Fine-tuning for Large Language Models by CS-fan-101

Hi, this is the first author on the paper. You asked a great question and it’s something we are pursuing internally. In this study we kept things simple and switched from sparse to completely dense during finetuning. But as for future work, you’re right, we can certainly vary the amount of “redensification” as well (e,g., 25%, 50%, or possibly some schedule). This is a very interesting research direction, because the full dense capacity of the model may not be needed to recover performance on the downstream task.

[deleted] t1_jd6irlq wrote on March 22, 2023 at 4:40 AM

Reply to [R] SPDF - Sparse Pre-training and Dense Fine-tuning for Large Language Models by CS-fan-101

[deleted]

VestPresto t1_jd6iiw1 wrote on March 22, 2023 at 4:37 AM

Reply to comment by KerfuffleV2 in [D] Running an LLM on "low" compute power machines? by Qwillbehr

Sounds faster and less laborious than googling and scanning a few articles

Recent comments in /f/MachineLearning