Recent comments in /f/deeplearning

BrotherAmazing t1_iyazrs4 wrote

Again, they can approximate any function or algorithm. This is proven mathematically.

Just because people are confounded by examples of DNNs that don’t seem to do what they want them to do, and just because people do not yet understand how to construct the DNNs that could indeed do these things, does not mean DNNs are “dumb” or limited.

Perhaps you are constructing them wrong. Perhaps the engineers are the dumb ones? 🤷🏼

Sometimes people literally argue, in plain English rather than mathematics, that basic mathematically proven concepts are not true.

If you had a mathematical proof that showed DNNs were equivalent to decision trees, or incapable of performing certain tasks, neat! But if you argue DNNs can’t perform tasks that can be reduced to functions or algorithms, and do it in mere language without a mathematical proof, I’m not impressed yet!

2

Difficult-Race-1188 OP t1_iyaxhbe wrote

The argument goes much further: NNs are not exactly learning the data distribution. If they were, the affine transformation problem would already have been taken care of and there would be no need for data augmentation by rotating or flipping. Also, approximating any algorithm doesn't necessarily mean the underlying data follows a distribution generated by any known algorithm. And neural networks struggle even to learn simple mathematical functions; all the approximation really does is stitch together piecewise fits of the underlying algorithm.

Here's a review of the grokking paper, which reported that a NN couldn't generalize on this equation:

x³ + xy² + y (mod 97)

Article: https://medium.com/p/9dbbec1055ae

Original paper: https://arxiv.org/abs/2201.02177
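
For anyone who wants to poke at it, here's a rough sketch (mine, not from the paper) of the kind of modular-arithmetic dataset those experiments train on:

# Enumerate the full table for f(x, y) = x^3 + x*y^2 + y (mod 97); the
# grokking experiments train on a random subset of these pairs and test
# generalization on the held-out remainder.
P = 97

def f(x, y):
    return (x**3 + x * y**2 + y) % P

dataset = [((x, y), f(x, y)) for x in range(P) for y in range(P)]
print(len(dataset))  # 9409 (x, y) pairs in total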

1

BrotherAmazing t1_iyaux7r wrote

A deep neural network can approximate any function.

A deep recurrent neural network can approximate any algorithm.

These are mathematically proven facts. Can the same be said about “a bunch of decision trees in hyperspace”? If so, then I would say “a bunch of decision trees in hyperspace” is pretty darn powerful, as are deep neural networks. If not, then I would say the author has made a logical error somewhere along the way in his very qualitative reasoning. Plenty of thought experiments in language with “bulletproof” arguments have led to “contradictions” in the past, only for a subtle logical error to be unveiled once we stop using language and start using mathematics.
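
As a toy illustration (my own sketch, a demonstration rather than a proof), even a single hidden layer will fit a simple nonlinear function:

# One hidden layer approximating sin(x) on [-pi, pi].
import numpy as np
from tensorflow import keras

x = np.linspace(-np.pi, np.pi, 1000).reshape(-1, 1)
y = np.sin(x)

model = keras.Sequential([
    keras.Input(shape=(1,)),
    keras.layers.Dense(64, activation="tanh"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=200, verbose=0)
print(model.evaluate(x, y, verbose=0))  # MSE ends up small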

2

freaky1310 t1_iy91uxy wrote

TL;DR: Each model tries to solve the problems that affect the current state-of-the-art model.

Theoretically, yes. Practically, definitely not.

I’ll try to explain myself, please let me know if something I say is not clear. The whole point of training NNs is to find an approximator that could provide correct answers to our questions, given our data. The different architectures that have been designed through the years address different problems.

Namely, CNNs addressed the curse of dimensionality: MLPs and similar architectures wouldn’t scale to “large” images (large meaning larger than 64x64) because the number of weights in a fully connected layer grows with the product of the input size and the layer width. Convolution turned out to provide a nice way of aggregating pixels into local descriptors (called “features” from now on), and CNNs were born.
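
A back-of-the-envelope sketch of that scaling argument (rough numbers of my own, just for illustration):

# Weight counts for one layer on a 224x224 RGB image.
inputs = 224 * 224 * 3                # 150,528 input values
dense_weights = inputs * 1024         # fully connected layer with 1024 units
conv_weights = 3 * 3 * 3 * 64         # 3x3 convolution with 64 filters

print(f"dense: {dense_weights:,} weights")  # 154,140,672
print(f"conv:  {conv_weights:,} weights")   # 1,728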

After that, expressiveness became a problem: for example, stacking too many convolutions erases too much information on one side, and significantly increases inference time on the other. To address this, researchers found recurrent units useful for retaining information that would otherwise be lost and propagating it through the network. Et voilà, RNNs were born.

Long story short: each new type of architecture was born to solve the problems of another kind of model, while introducing new issues and limitations of its own.

So, to go back to your first question: can NNs approximate everything? Not everything everything, but a “wide variety of interesting functions”. In practice, they can try to approximate pretty much anything you will need, even though some limitations will always remain.

3

pornthrowaway42069l t1_iy8srkr wrote

Ah, I see. During training, the loss and metrics you see are actually running averages over the batches seen so far, not the exact losses/metrics at the end of that epoch. I can't find the documentation right now, but I know I've seen it before. What this means is that losses/metrics reported during training won't be a "good" gauge to compare against, since they mix in values computed earlier with older weights.
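
A quick way to see this yourself (a sketch, assuming you already have a compiled Keras model and its training data at hand):

# The loss printed by fit() is averaged over the batches seen so far,
# each computed with the weights of that moment; evaluate() recomputes
# it on the same data with the final weights, so the numbers differ.
history = model.fit(x_train, y_train, epochs=1, verbose=1)
exact = model.evaluate(x_train, y_train, verbose=0)  # loss (plus metrics, if compiled)
print(history.history["loss"][-1], exact)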

1

Sadness24_7 OP t1_iy8qtwh wrote

I did write my own metric based on examples from Keras. But since I have to do it using callbacks and their backend, it works only on one output at a time, meaning both the predictions and the true values are vectors.
What I meant by that is that when I call model.fit(....) it tells me at each epoch something like this:

Epoch 1/500

63/63 [==============================] - 1s 6ms/step - loss: 4171.9570 - root_mean_squared_error: 42.4592 - val_loss: 2544.3647 - val_root_mean_squared_error: 44.4907

where root_mean_squared_error is a custom metric defined as follows:

from tensorflow.keras import backend as K

def root_mean_squared_error(y_true, y_pred):
    # RMSE averaged over every element passed in
    return K.sqrt(K.mean(K.square(y_pred - y_true)))

which, when called directly, wants the data in the form of a vector, meaning this function has to be called for each output separately.

In order to better optimize my model I need to understand how the losses/metrics are calculated so that they result in one number (as shown above during training).
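
Concretely (sketching with dummy data, the shapes are just hypothetical stand-ins for my case), I'm not sure which of these two reductions is happening under the hood:

import numpy as np
from tensorflow.keras import backend as K

y_true = np.random.rand(63, 2).astype("float32")  # 63 samples, 2 outputs
y_pred = np.random.rand(63, 2).astype("float32")

# One number mixing every output together (K.mean with no axis argument
# averages over all elements):
rmse_all = K.sqrt(K.mean(K.square(y_pred - y_true)))
# One number per output, if computed separately:
rmse_per_output = K.sqrt(K.mean(K.square(y_pred - y_true), axis=0))
print(float(rmse_all), rmse_per_output.numpy())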

1

hp2304 t1_iy8lyyf wrote

Any ML classifier or regressor is basically a function approximator.

The function space isn't continuous but discrete, sampled at the dataset points. Hence, increasing the size of the dataset can help increase overall accuracy; this is loosely analogous to the Nyquist criterion: with less data, it's more likely our approximation is wrong. Given the dimensionality of the input space and the range of each input variable, any realistic dataset covers almost none of it. E.g. for a 224x224 RGB input image, the input space has 256^(224x224x3) possible values, an unimaginably large number; mapping each one to a correct class label (out of 1000 classes) is very difficult for any approximator. Hence, one can never get 100% accuracy.
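
To put a number on that (my own quick arithmetic, just for scale):

# 256^(224*224*3) possible RGB images: far too big to print, so count digits.
import math

num_channels = 224 * 224 * 3                 # 150,528 pixel channels
digits = num_channels * math.log10(256)      # log10 of 256^150528
print(f"about 10^{digits:,.0f} possible images")  # roughly 10^362,508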

2

RichardBJ1 t1_iy7xvbr wrote

I was interested when I first heard about this concept. People seemed to respond either by thinking it was ground-shaking… or alternatively that it stood to reason that, given enough splits, it would be the case! Do you think, though, that from a practical usage perspective this doesn’t help much, because there are so many decisions? The article has a lot more than just that though, and a nice provocative title.

3

freaky1310 t1_iy7ielr wrote

Thanks for pointing out the article, it’s going to be useful for a lot of people.

Anyway, when we refer to the “black box” nature of DNNs we don’t mean “we don’t know what’s going on”, but rather “we know exactly what’s going on in theory, but there are so many simple calculations that it’s impossible for a human being to keep track of them”. Just think of a classic image-classification ConvNet like AlexNet: it has ~62M parameters, meaning that the simple calculations (gradient updates and whatnot) are performed A LOT of times in a single backward pass.

Also, DNNs often work with a latent representation, which adds another layer of abstraction for the user: the “reasoning” part happens in a latent space that we don’t know much about beyond some of its properties (and again, if we did all the calculations we would know exactly what it is; it’s just infeasible to do them).
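
For what it's worth, a small sketch of how one typically peeks at that latent space in Keras (the model, layer name, and input batch here are hypothetical):

from tensorflow import keras

# Re-wire an already-trained `model` so it outputs an intermediate layer's
# activations; "dense_penultimate" is a made-up layer name.
feature_extractor = keras.Model(
    inputs=model.input,
    outputs=model.get_layer("dense_penultimate").output,
)
latent = feature_extractor.predict(x_batch)  # shape: (batch, latent_dim)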

To address these points, several research projects have focused on network interpretability, that is, finding ways of making sense of NNs’ reasoning process. Here’s a review written in 2021 regarding this.

11