
currentscurrents t1_j573tug wrote

Nvidia announced upscaling support in Chrome at CES 2023.

>The new feature will work within the Chrome and Edge browsers, and also requires an Nvidia RTX 30-series or 40-series GPU to function. Nvidia didn't specify what exactly is required from those two GPU generations to get the new upscaling feature working, nor if there's any sort of performance impact, but at least this isn't a 40-series only feature.

Interesting though that it's working with your GTX 1660 Ti. Maybe Chrome is implementing a simpler upscaler as a fallback for older GPUs?

Check your chrome://flags for anything that looks related.

30

currentscurrents t1_j525hto wrote

Retrieval language models do have some downsides. Keeping a copy of the training data around is suboptimal for a couple reasons:

  • Training data is huge. Retro's retrieval database is 1.75 trillion tokens. This isn't a very efficient way of storing knowledge, since a lot of the text is irrelevant or redundant.

  • Training data is still a mix of knowledge and language. You haven't achieved separation of the two types of information, so it doesn't help you perform logic on ideas and concepts.

  • Most training data is copyrighted. It's currently legal to train a model on copyrighted data, but distributing a copy of the training data with the model puts you on much less firm ground.

Ideally I think you want to condense the knowledge from the training data down into a structured representation, perhaps a knowledge graph. Knowledge graphs are easy to perform logic on and can be human-editable. There's also already an entire sub-field studying them.
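
To make that concrete, here's a toy sketch of the kind of structure I mean: facts stored as (subject, relation, object) triples with one simple inference rule on top. The entities and relations are made up for illustration, and it assumes the "is_a" hierarchy has no cycles.

```python
# Toy knowledge graph: facts stored as (subject, relation, object) triples.
facts = {
    ("penguin", "is_a", "bird"),
    ("bird", "has", "feathers"),
    ("bird", "is_a", "animal"),
}

def related(subject, relation):
    """Objects directly linked to `subject` by `relation`."""
    return {o for (s, r, o) in facts if s == subject and r == relation}

def inherits(subject, relation):
    """One simple logic rule: facts about parent classes apply to children."""
    results = set(related(subject, relation))
    for parent in related(subject, "is_a"):
        results |= inherits(parent, relation)
    return results

print(inherits("penguin", "has"))   # {'feathers'}, inherited via bird
```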

19

currentscurrents t1_j4rcc3e wrote

Interesting! I haven't heard of RWKV before.

Getting rid of attention seems like a good way to speed up training (full attention is quadratic in sequence length, which gets expensive), but how can it work so well without it?

Also aren't RNNs usually slower than transformers because they can't be parallelized?
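
Here's a rough PyTorch sketch of what I mean by the parallelism question (toy sizes, not RWKV's actual formulation): the recurrent update has to run step by step, while attention mixes the whole sequence in a couple of matrix multiplies.

```python
import torch

T, d = 128, 64                      # sequence length, hidden size
x = torch.randn(T, d)

# RNN-style recurrence: step t depends on the hidden state from step t-1,
# so the T updates have to run one after another, even during training.
W = torch.randn(d, d)
h = torch.zeros(d)
hs = []
for t in range(T):
    h = torch.tanh(x[t] @ W + h)    # can't start step t before t-1 finishes
    hs.append(h)
hs = torch.stack(hs)                # (T, d)

# Attention-style mixing: every position attends to every other position in
# one batched matrix multiply, so the whole sequence is processed in parallel.
attn = torch.softmax(x @ x.T / d**0.5, dim=-1)   # (T, T)
out = attn @ x                                   # (T, d), no sequential dependency
```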

10

currentscurrents t1_j4ijlqv wrote

You can fine-tune image generator models and some smaller language models.

You can also do tasks that don't require super large models, like image recognition.

>that's beyond just some toy experiment?

Don't knock toy experiments too much! I'm having a lot of fun trying to build a differentiable neural computer or memory-augmented network in pytorch.
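
For anyone curious, this is roughly the kind of toy I'm tinkering with: a minimal memory-augmented controller in PyTorch that reads from an external memory using content-based addressing. It's only a sketch of the idea, not a real DNC (no write head, no usage or temporal linkage), and all the sizes are arbitrary.

```python
import torch
import torch.nn as nn

class TinyMemoryNet(nn.Module):
    """Minimal memory-augmented controller: an LSTM cell that reads from an
    external memory via content-based (cosine-similarity) addressing."""

    def __init__(self, in_dim=16, hid=64, slots=32, width=16):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(slots, width) * 0.1)  # learned, read-only memory
        self.controller = nn.LSTMCell(in_dim + width, hid)
        self.key = nn.Linear(hid, width)             # read key emitted by the controller
        self.out = nn.Linear(hid + width, in_dim)

    def forward(self, x_seq):                        # x_seq: (steps, batch, in_dim)
        batch = x_seq.size(1)
        h = torch.zeros(batch, self.controller.hidden_size)
        c = torch.zeros(batch, self.controller.hidden_size)
        read = torch.zeros(batch, self.memory.size(1))
        outputs = []
        for x in x_seq:
            h, c = self.controller(torch.cat([x, read], dim=-1), (h, c))
            key = self.key(h)                                        # (batch, width)
            sim = torch.cosine_similarity(key.unsqueeze(1), self.memory.unsqueeze(0), dim=-1)
            weights = torch.softmax(sim, dim=-1)                     # (batch, slots)
            read = weights @ self.memory                             # content-based read
            outputs.append(self.out(torch.cat([h, read], dim=-1)))
        return torch.stack(outputs)                  # (steps, batch, in_dim)

model = TinyMemoryNet()
y = model(torch.randn(10, 4, 16))                    # 10 steps, batch of 4
```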

3

currentscurrents t1_j4a2las wrote

>Specifically, 1) we design an expert system to generate a melody by developing musical elements from motifs to phrases then to sections with repetitions and variations according to pre-given musical form; 2) considering the generated melody is lack of musical richness, we design a Transformer based refinement model to improve the melody without changing its musical form. MeloForm enjoys the advantages of precise musical form control by expert systems and musical richness learning via neural models.

16

currentscurrents t1_j49u28o wrote

I don't think it's that simple - whether or not generative AI is considered "transformative" has not yet been tested by the courts.

Until somebody actually gets sued over this and it goes to court, we don't know how the legal system is going to handle it. There is currently a lawsuit against Github Copilot, so we will probably know in the next couple years.

5

currentscurrents t1_j490rvn wrote

It's meaningful right now because there's a threshold where LLMs become awesome, but getting there requires expensive specialized GPUs.

I'm hoping in a few years consumer GPUs will have 80GB of VRAM or whatever and we'll be able to run them locally. While datacenters will still have more compute, it won't matter as much since there's a limit where larger models would require more training data than exists.
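
Back-of-the-envelope for what 80GB would buy you, ignoring activations and KV cache:

```python
vram = 80e9                  # bytes of VRAM
print(vram / 2 / 1e9)        # ~40B parameters at fp16 (2 bytes each)
print(vram / 0.5 / 1e9)      # ~160B parameters with 4-bit quantization
```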

3

currentscurrents t1_j48csbo wrote

Reply to comment by RandomCandor in [D] Bitter lesson 2.0? by Tea_Pearce

If it is true that performance scales infinitely with compute power - and I kinda hope it is, since that would make superhuman AI achievable - datacenters will always be smarter than PCs.

That said, I'm not sure that it does scale infinitely. You need not just more compute but also more data, and there's only so much data out there. GPT-4 reportedly won't be any bigger than GPT-3 because even terabytes of scraped internet data isn't enough to train a larger model.

6

currentscurrents t1_j4716tp wrote

Reply to comment by ml-research in [D] Bitter lesson 2.0? by Tea_Pearce

Try to figure out systems that can generalize from smaller amounts of data? It's the big problem we all need to solve anyway.

There's a bunch of promising ideas that need more research (a toy sketch of the neurosymbolic flavor follows the list):

  • Neurosymbolic computing
  • Expert systems built out of neural networks
  • Memory augmented neural networks
  • Differentiable neural computers
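
To give a flavor of the neurosymbolic idea, here's a toy sketch: an untrained, random neural module scores a couple of predicates, and a hand-written symbolic rule reasons over those scores. Everything here is made up for illustration.

```python
import torch
import torch.nn as nn

# Neural part: an untrained toy "perception" module that scores two predicates.
perception = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2), nn.Sigmoid())

def predicates(x):
    """Map raw features to probabilities for made-up symbolic predicates."""
    has_wings, lays_eggs = perception(x)
    return {"has_wings": has_wings.item(), "lays_eggs": lays_eggs.item()}

# Symbolic part: a hand-written rule that reasons over the neural outputs.
def probably_bird(p, threshold=0.5):
    return p["has_wings"] > threshold and p["lays_eggs"] > threshold

x = torch.randn(8)
p = predicates(x)
print(p, probably_bird(p))
```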

8

currentscurrents t1_j4702g0 wrote

Reply to comment by mugbrushteeth in [D] Bitter lesson 2.0? by Tea_Pearce

Compute is going to get cheaper over time though. My phone today has the FLOPs of a supercomputer from 1999.

Also if LLMs become the next big thing you can expect GPU manufacturers to include more VRAM and more hardware acceleration directed at them.

9

currentscurrents t1_j44pu0u wrote

Is it though? These days it seems like even a lot of research papers are just "we stuck together a bunch of pytorch components like lego blocks" or "we fed a transformer model a bunch of data".
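
To be fair, the "lego blocks" workflow looks something like this (toy example, arbitrary layer sizes):

```python
import torch
import torch.nn as nn

# The "lego blocks" workflow: snap together off-the-shelf modules and train.
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.LazyLinear(10),          # 10-class toy classifier head
)
logits = model(torch.rand(1, 3, 32, 32))   # (1, 10)
```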

Math is important if you want to invent new kinds of neural networks, but for end users it doesn't seem very important.

7

currentscurrents OP t1_j44nngb wrote

The paper does talk about this and calls transformers "first generation compositional systems" - but limited ones.

>Transformers, on the other hand, use graphs, which in principle can encode general, abstract structure, including webs of inter-related concepts and facts.

> However, in Transformers, a layer’s graph is defined by its data flow, yet this data flow cannot be accessed by the rest of the network—once a given layer’s data-flow graph has been used by that layer, the graph disappears. For the graph to be a bona fide encoding, carrying information to the rest of the network, it would need to be represented with an activation vector that encodes the graph’s abstract, compositionally-structured internal information.

>The technique we introduce next—NECST computing—provides exactly this type of activation vector.

They then talk about a more advanced variant called NECSTransformers, which they consider a 2nd generation compositional system. But I haven't heard of this system before and I'm not clear if it actually performs better.

10

currentscurrents OP t1_j43s8ki wrote

In the paper they talk about "first generation compositional systems" and I believe they would include differentiable programming in that category. It has some compositional structure, but the structure is created by the programmer.

Ideally the system would be able to create its own arbitrarily complex structures and systems to understand abstract ideas, like humans can.

4

currentscurrents t1_j3eo4uc wrote

There's plenty of work to be done in researching language models that train more efficiently or run on smaller machines.

ChatGPT is great, but it needed 600GB of training data and megawatts of power. It must be possible to do better; the average human brain runs on 12W and has seen maybe a million words tops.

2

currentscurrents t1_j3emas4 wrote

>I hate to break your bubble, but the task is also achievable even with GPT2

Is it? I would love to know how. I can run GPT2 locally, and that would be a fantastic level of zero-shot learning to play around with.

I have no doubt you can fine-tune GPT2 or T5 to achieve this, but in my experience they aren't nearly as promptable as GPT3/ChatGPT.
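
For reference, this is how I'd try it locally with Hugging Face transformers; the prompt is just an example, and in my experience GPT2 tends to continue the text rather than follow the instruction:

```python
from transformers import pipeline

# "gpt2" is the 124M-parameter checkpoint, small enough to run on a laptop.
generator = pipeline("text-generation", model="gpt2")

prompt = "Rewrite the following sentence to be more formal: hey, the meeting got moved to friday.\n\nFormal version:"
result = generator(prompt, max_new_tokens=30, do_sample=False)
print(result[0]["generated_text"])
```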

>Specifically the task you gave it is likely implicitly present in the dataset, in the sense that the dataset allowed the model to learn the connections between the words you gave it

I'm not sure what you're getting at here. Of course it has learned the connections and meanings between words - that's what a language model does.

But it still followed my instructions, and it can follow a wide variety of other detailed instructions you give it. These tasks are too specific to have been in the training data; it is successfully generalizing zero-shot to new NLP tasks.

1