Recent comments in /f/MachineLearning

drinkingsomuchcoffee t1_jdpg1cb wrote

The problem is that learned features aren't factored nicely into a minimal set of parameters. For example, identifying whether an image is a cat may take thousands of parameters spread over n layers when it could actually be expressed in 10 parameters over fewer layers. A small model does this automatically, since it's physically constrained; a large model has no such constraint, so it is wasteful. There are probably many ways to get the best of both worlds at training time, but it's by no means an easy problem, and the current distillation or retraining methods feel clunky. We actually want the big model to use all its parameters efficiently rather than waste them, and it's likely wasting them if much more compact models can get similar results. Needing an order of magnitude more size for a few percentage points of improvement is probably extremely wasteful. Compare that to biological entities, where an order-of-magnitude size increase results in huge cognitive improvements.
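For concreteness, the distillation objective people usually mean here (soft targets from the big model mixed in with the hard labels) fits in a few lines; a minimal numpy sketch, with the temperature, mixing weight, and shapes all made up:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Hinton-style distillation: cross-entropy against the teacher's
    temperature-softened distribution, mixed with ordinary cross-entropy
    against the hard labels."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T))
    soft = -(p_teacher * log_p_student).sum(axis=-1).mean() * T * T
    hard = -np.log(softmax(student_logits)[np.arange(len(labels)), labels]).mean()
    return alpha * soft + (1 - alpha) * hard
```

The student is trained to match the big model's full output distribution, not just its argmax, which is exactly the "clunky retraining" step being complained about.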

3

agent_zoso t1_jdpbe5o wrote

It sounds like you're pretty hard-set on there being no ghost in the shell, and pretty judgmental of anyone who thinks otherwise. I'm just saying you're far too certain you have the answers, as my example demonstrates. I also never said I believe a living being jumps into existence because of some successful Turing test. I'm actually agnostic on that, and think it's a waste of time trying to answer something that will never be answered. It's always going to come down to bickering over definitions and goalpost-shifting ("It can't be sentient if it doesn't model glial cells/recurrent cortical stacks/neurotransmitter densities/electrochemistry/quantum computational effects inside microtubules/the gut microbiome/the embedding within the environment/the entire rest of the universe like us"). I'd much rather play it safe and treat it as though it is conscious.

Maybe I'm misunderstanding you, but it sounds like you're now also being far too dismissive of the representational power that tensors/linear operations, and especially eigendecompositions, can have (I could probably blow your mind with the examples I've seen), and of statistics as a loss function. After all, we as humans are literally no more than statistical-mechanical partition functions of von Neumann density matrices, so what would you even use for a loss function instead? MSE, cross-entropy (perplexity), KL, and L1/L2 are all statistical, and they're used to build every neural net you've heard about. The only difference between us and, say, a Turing-complete (nonlinear ReLU + attentional) Kalman filter for text, which is what you're making GPT out to be, is how the hyperparameters are adjusted. A Kalman filter uses Bayesian inference with either Laplace's method or maximum-likelihood rules, whereas we (and ChatGPT) are genetically rewarded for minimizing both cost (resp. perplexity) and nonlinear human feedback. Keep boiling things down and you'll find you're surrounded by philosophical zombies.
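For what it's worth, the losses named above really are each just a line or two of statistics; a toy numpy sketch with made-up target and model distributions:

```python
import numpy as np

y = np.array([0.7, 0.2, 0.1])            # target distribution
p = np.array([0.6, 0.3, 0.1])            # model distribution
x_hat, x = np.array([1.0, 2.0]), np.array([1.5, 1.0])  # regression pair

mse = np.mean((x_hat - x) ** 2)          # mean squared error
l1 = np.abs(x_hat - x).sum()             # L1 distance
l2 = np.sqrt(((x_hat - x) ** 2).sum())   # L2 distance
cross_entropy = -(y * np.log(p)).sum()   # perplexity = exp(cross_entropy)
kl = (y * np.log(y / p)).sum()           # KL(y || p) = cross_entropy - H(y)
```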

Edit: Didn't see the second paragraph you added. I'm not sure what ML orthodoxy you're from, but Chalmers' result is pretty well accepted in CogSci. The setup you're describing, the Chinese Room, is an appeal to common sense, but a lot of what motivates scientific development is trying to understand paradoxes and counter-intuitive results. Sure, it sounds absurd, but so do Schrödinger's cat and black holes, both of which failed to disprove the underlying phenomena. Chalmers' 1995 result came after the Chinese Room thought experiment (by about 15 years, in fact) and updated the consensus on the Chinese Room by stressing the importance of universality. Since your example has humans performing the computation, I would say it could be alive (depending on the complexity of the equations; are they reproducing the action potentials of a human brain?), and as a case in point, I think the internet, even before ChatGPT, is the most likely and best-known contender for a mass of human scribbles being conscious.

1

YoloSwaggedBased t1_jdp9cge wrote

I can't find it now, but I've read a paper that essentially proposed this, at least for inference. You essentially have a model output and a task loss after every n layers of the model. At training time, you produce outputs all the way to the end of the architecture, and then at inference time you use some heuristic to decide how much accuracy you're willing to sacrifice for layer-wise model reduction.
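That matches what's usually called "early exit": attach a classifier head every few layers, and at inference stop as soon as a head is confident enough. A toy numpy sketch, where the layer count, confidence threshold, and random weights are all placeholders rather than anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_classes, n_layers = 16, 4, 6
Ws = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_layers)]
heads = [rng.normal(size=(d, n_classes)) for _ in range(n_layers)]  # one exit head per layer

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def early_exit_forward(x, conf_threshold=0.9):
    """Run layers until some head is confident enough, then stop."""
    for i, (W, head) in enumerate(zip(Ws, heads)):
        x = np.tanh(x @ W)
        p = softmax(x @ head)
        if p.max() >= conf_threshold:  # heuristic accuracy/compute trade-off
            return p, i + 1            # exited after i+1 layers
    return p, n_layers                 # fell through to the full model

probs, layers_used = early_exit_forward(rng.normal(size=d))
```

Lowering the threshold trades accuracy for fewer layers executed, which is the knob the comment describes.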

2

LacedDecal t1_jdp6z6y wrote

If one is trying to model something where the “correct” answer for a given set of features is inherently probabilistic, for example the outcome of a baseball plate appearance, how should you tell a neural network to grade its accuracy?

For those who aren’t familiar with baseball: the most likely outcome of any plate appearance, even the league’s best batter against the league’s worst pitcher, is some kind of out. Generally an out is the outcome somewhere on the order of 60-75% of the time. So I’m realizing that the most “accurate” set of predictions against literally any dataset of at-bats would be to predict “out” for every one.

What I’m realizing is that the “correct” answer I’m looking for is a set of probabilities. But how does one apply, say, a loss function like categorical cross-entropy in any kind of meaningful way? Is there even a way to do supervised learning when a data point’s “label” isn’t the actual probability distribution, but rather one collapsed event drawn from each “true” probability distribution?

Am I even making sense?

Edit: I know I need something like softmax, but when I start training it quickly spirals into a case of exploding gradients no matter what I do. I think it’s because the “labels” I’m using aren’t the true probabilities each outcome had, but rather a single hard-max, real-life outcome that actually occurred (home run, out, double, etc.).
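For what it's worth, the setup described is sound in principle: categorical cross-entropy against one-hot sampled outcomes is minimized exactly at the empirical outcome frequencies, so the softmax outputs still converge to probability estimates even though no label is ever a distribution. A toy sketch with made-up outcome probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = np.array([0.70, 0.20, 0.08, 0.02])    # made-up outcome probabilities
labels = rng.choice(4, size=20_000, p=true_p)  # one collapsed outcome per plate appearance

# Fit 4 logits by gradient descent on categorical cross-entropy vs. one-hot
# labels; the average of the one-hot vectors is the empirical frequency.
freq = np.eye(4)[labels].mean(axis=0)
logits = np.zeros(4)
for _ in range(2000):
    p = np.exp(logits - logits.max())
    p /= p.sum()
    logits -= 0.5 * (p - freq)                 # average CE gradient over the dataset

print(np.round(p, 3))                          # ≈ freq, which ≈ true_p
```

If training still explodes, a common culprit is taking `log` of a softmax that has underflowed to zero; computing cross-entropy directly from the logits with a log-sum-exp (as most frameworks' built-in losses do) is the usual fix.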

1

cyborgsnowflake t1_jdoz0ia wrote

In a very general sense, neurons and NNs are the same in that they are both networks, but the brain, from what we know, is structured very differently from GPT, which is more or less a linear circuit for processing tensors. I'm not sure what the reasoning is for jumping to the conclusion that a living being pops into existence when you run GPT just because the output 'looks human'. You could just conclude that 'knowledge tasks can, to a certain degree, be approximated statistically', as anyone who watched horny men get fooled by chatbots in the 90s should know.

If you believe the former, then logically, if you replaced the computer circuits with humans, even people writing equations on paper together should, if there were enough of them, theoretically also cause these 'calculation beings', with minds independent of the humans themselves, to pop into existence. Maybe you can argue for that under certain philosophies, but that's veering off into territory far from orthodox computer science.

2

ganzzahl t1_jdovu3h wrote

I'm also very interested in this – does anyone have papers similar to Chinchilla, but without the training FLOPs restriction, and instead comparing identical dataset sizes?

An aside: I feel like I remember some older MT papers where LSTMs outperformed Transformers for some low-resource languages, but I think that's outdated; using transfer learning, multilingual models, and synthetic data, I'm fairly certain Transformers always outperform nowadays.

1

ganzzahl t1_jdouip7 wrote

You're definitely missing the entire T5 (encoder-decoder) family of models. From the UL2 paper, it seems encoder-decoder models are more powerful than decoder-only models (such as the GPT family), especially if you're most interested in inference latency.

I do very much wonder whether OpenAI has tested equally sized T5 models, and whether there's some secret reason they've found for why we should stick with GPT models, or if they're just doubling down on "their" idea, even if it is slightly inferior. Or maybe there are newer papers I don't know about.

10

yaosio t1_jdomvtr wrote

I think they give GPT-4 a task, GPT-4 attempts to complete it and is told whether it worked, then GPT-4 looks at what happened, determines why it failed, and tries again with this new knowledge. This is all done through natural-language prompts; the model itself isn't being changed.

I saw somebody else in either this sub or /r/openai using a very similar method to get GPT-4 to write and deploy a webpage that could accept valid email addresses. Of course, I can't find it, and neither can Bing Chat, so maybe I dreamed it. I distinctly remember asking if it could do QA, and then the person asked what I meant, and I said have it check for bugs. I post a lot so I can't find it in my post history.

I remember the way it worked was that they gave it the task, then GPT-4 would write out what it was going to do and what it predicted would happen, write the code, and then check whether what it did worked. If it didn't work, it would write out why, plan again, then act again. So it went plan->predict->act->check->plan. This successfully went from nothing to a working, deployed webpage without any human intervention other than setting the task.
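That loop can be sketched as plain Python; `ask_model` and `run_task` here are hypothetical stand-ins for the prompting and execution steps, not anyone's actual API:

```python
def plan_act_check(task, ask_model, run_task, max_tries=5):
    """plan -> predict -> act -> check loop, all via natural-language
    prompts; the model's weights never change, only the transcript."""
    feedback = ""
    for attempt in range(max_tries):
        plan = ask_model(f"Task: {task}\nPrevious feedback: {feedback}\n"
                         "Plan and predict the result.")
        action = ask_model(f"Plan: {plan}\nWrite the code/steps to execute it.")
        ok, result = run_task(action)  # e.g. deploy the page, run the checks
        if ok:
            return action, attempt + 1
        # check step: ask the model to explain the failure for the next round
        feedback = ask_model(f"It failed with: {result}\n"
                             "Explain why and what to change.")
    return None, max_tries
```

The point is that all the "learning" lives in the growing transcript fed back into the prompts; the model itself is frozen.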

2

Difficult_Bid_9828 t1_jdom9pj wrote

I tried something like this by converting random images of games I found (Doom and Mario 64) into ASCII art and giving them to Bing with the instruction to guess which game each was from. It did not work.
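For reference, the image-to-ASCII step is usually done by mapping luminance onto a character ramp; a minimal sketch run on a synthetic gradient instead of an actual game screenshot:

```python
import numpy as np

RAMP = " .:-=+*#%@"  # dark-to-light character ramp

def to_ascii(gray, width=40):
    """gray: 2-D array of floats in [0, 1]; returns an ASCII-art string."""
    h, w = gray.shape
    step = max(1, w // width)
    small = gray[::2 * step, ::step]  # chars are ~2x taller than wide
    idx = np.rint(small * (len(RAMP) - 1)).astype(int)
    return "\n".join("".join(RAMP[i] for i in row) for row in idx)

gray = np.tile(np.linspace(0, 1, 80), (40, 1))  # synthetic left-to-right gradient
print(to_ascii(gray))
```

The downsampling throws away most of the detail, which is probably part of why a chat model struggles to recognize a whole game screenshot this way.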

2