
tdgros t1_j41f1nz wrote

You'll still pay the full price at train time, right? Early decoding works by attaching decoders to earlier layers at train time. Conversely, if you want to spend more compute on some tokens, you will need more layers at train time, so at some point you will hit your memory/complexity limits.
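A minimal sketch (PyTorch, with toy sizes and module names of my own) of that point: with early-exit decoders, every intermediate head is trained, so the full depth is still paid for during training even if inference can stop early.

```python
import torch
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    def __init__(self, dim=256, depth=6, n_classes=10):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(depth)
        )
        # one small decoder ("exit head") per block
        self.heads = nn.ModuleList(nn.Linear(dim, n_classes) for _ in range(depth))

    def forward(self, x):
        logits = []
        for block, head in zip(self.blocks, self.heads):
            x = block(x)
            logits.append(head(x.mean(dim=1)))  # pool tokens, decode at this depth
        return logits  # inference can stop early; training computes them all

model = EarlyExitEncoder()
tokens = torch.randn(2, 16, 256)            # (batch, tokens, dim)
labels = torch.randint(0, 10, (2,))
loss = sum(nn.functional.cross_entropy(l, labels) for l in model(tokens))
loss.backward()                              # gradients flow through every layer
```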

4

tdgros t1_j2nfzj6 wrote

"Hypernetwork" is the term used when one network outputs the coefficients (the weights) of another network.
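A minimal hypernetwork sketch (PyTorch); the sizes and the conditioning vector are made up, purely for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

in_dim, out_dim, ctx_dim = 8, 4, 16           # target layer sizes, conditioning size

# the hypernetwork maps a conditioning vector to the target layer's parameters
hyper = nn.Sequential(
    nn.Linear(ctx_dim, 64), nn.ReLU(),
    nn.Linear(64, out_dim * in_dim + out_dim)  # flattened weight + bias
)

def target_layer(x, params):
    w = params[: out_dim * in_dim].view(out_dim, in_dim)
    b = params[out_dim * in_dim:]
    return F.linear(x, w, b)                   # the "other network", with generated coefficients

context = torch.randn(ctx_dim)                 # e.g. a task or sensor descriptor
x = torch.randn(5, in_dim)
y = target_layer(x, hyper(context))            # (5, out_dim)
```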

Sensor fusion is typically used with low-level sensors that are noisy, biased, limited in their dynamics... but can complement each other, be "fused". For UAV navigation, we fuse accelerometers, gyros, pressure sensors, GPS and vision...
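A toy complementary filter (NumPy) as a sketch of that kind of fusion, with made-up gains and noise levels: the gyro is biased but smooth, the accelerometer-derived angle is noisy but drift-free, and blending the two gives a usable estimate.

```python
import numpy as np

dt, alpha = 0.01, 0.98                         # sample period, fusion gain
pitch = 0.0
rng = np.random.default_rng(0)

for k in range(1000):
    true_pitch = 0.1 * k * dt                  # deg, pretend slow motion
    gyro_rate = 0.1 + 0.05                     # deg/s, true rate plus a constant bias
    accel_pitch = true_pitch + rng.normal(0, 2.0)  # noisy but unbiased absolute estimate

    # fuse: trust the integrated gyro short-term, the accelerometer long-term
    pitch = alpha * (pitch + gyro_rate * dt) + (1 - alpha) * accel_pitch
```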

2

tdgros t1_j2m63e5 wrote

I can't say for sure, but there isn't necessarily any online training. You can imagine some hypernetwork regressing good parameters for a low level task such as controlling the shoes' motors. It could also be a combination of good old school sensor fusion and a nice marketing speech ;)

3

tdgros t1_j28rhcc wrote

Ah I see what you mean, you're right, my way of seeing it is the one that is not standard. My point is that transformers don't really care about the original modality or the order or spatial arrangement of their tokens; ViTs are just transformers over sequences of "patches of pixels" (note: the channels are flattened together!). On top of this, there is work to forcefully bring back locality biases (position embeddings, Swin transformers...), which explains why I don't tend to break tokens into different dimensions. You can recompose the sequence into an (H/16)x(W/16)xNdims image, whose channels can be visualized separately if you want. More often, it's the attention maps themselves that are used for visualization or interpretation, head per head (i.e. the number of channels here really is the number of heads).
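A sketch (PyTorch, with assumed shapes and random data) of the two visualizations mentioned above: recomposing the token sequence into an (H/16)x(W/16)xNdims grid, and looking at per-head attention maps.

```python
import torch

H, W, P, Ndims, heads = 224, 224, 16, 384, 6
tokens = torch.randn(H // P * (W // P), Ndims)        # (Nwords, Ndims), no CLS token here

grid = tokens.view(H // P, W // P, Ndims)             # (14, 14, Ndims): channel k is grid[..., k]

attn = torch.softmax(torch.randn(heads, tokens.shape[0], tokens.shape[0]), dim=-1)
# attention paid to every token by, say, the token at grid position (0, 0), per head:
maps = attn[:, 0, :].view(heads, H // P, W // P)      # (heads, 14, 14), one map per head
```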

2

tdgros t1_j28murr wrote

"An image is worth 16x16 words" means you can cut up an image into Nwords patches that are 16x16 spatially. Depending on the size of the image, that gets you a different Nwords. Each of those words is originally 16x16x3 for RGB images, and is projected linearly to Ndims dimensions (usually ~1000). So you get Nwords words of dimension Ndims! whre Nwords depends on the patch size and the image size, and Ndims is arbitrary.

I don't know if your post is a typo, but you're using the same n twice for the number of words and number of channels/dimensions, which doesn't make sense to me. It might be just a different perspective...

2

tdgros t1_j28bwj3 wrote

I don't see tokens as corresponding to spatial locations. Before you add or concat a spatial embedding, there is nothing spatial at all, since transformers are permutation invariant! It's only when you add a spatial embedding that the tokens get a relation to their position back.

Maybe you'd prefer concatenating the spatial embeddings as opposed to adding them, so you can (mentally) consider the first channels as content-only and the rest as "spatial-related stuff". That's not strictly true after the first transformer layer, and it doesn't change a lot in practice: concat would arguably be the more principled default, but adding just makes for smaller tokens and it works fine.
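A small sketch (PyTorch, made-up sizes) of the two options: adding a learned position embedding keeps the token size, concatenating grows it.

```python
import torch
import torch.nn as nn

N, D, Dpos = 196, 384, 64
tokens = torch.randn(1, N, D)

pos_add = nn.Parameter(torch.zeros(1, N, D))            # learned, same size as the tokens
tokens_added = tokens + pos_add                          # (1, N, D)

pos_cat = nn.Parameter(torch.zeros(1, N, Dpos))          # learned, smaller "spatial part"
tokens_cat = torch.cat([tokens, pos_cat], dim=-1)        # (1, N, D + Dpos)
```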

2

tdgros t1_izwrdlr wrote

I am designing a vision transformer for image classification. What do you feel about inserting 1x1 convolutions in between transformer blocks to reduce the dimensionality? would you feel offended if I gradually did that throughout my backbone?

As a large language model trained by OpenAI, I don't have feelings and I am not capable of being offended. I am also not able to browse the internet, so I am unable to provide specific information about using 1x1 convolutions in a vision transformer for image classification. However, in general, using 1x1 convolutions can be a useful technique for reducing the dimensionality of an input in a convolutional neural network, which can help to improve the computational efficiency of the network and reduce the number of parameters it has. Whether or not this technique is appropriate for your specific use case will depend on the details of your model and the data you are working with. It's always a good idea to experiment with different architectures and techniques to see what works best for your particular application.

1

tdgros t1_izwppx0 wrote

You can either take all the existing tokens, average them and decode the result into logits. But if you can do that, you can also do it with one single token after all.

Or you can append a special learned token at some point, which will have its own decoder; I believe that's what you're describing. You can find this approach in BERT, where a CLS token is inserted before any sentence. One final, similar approach is Perceiver IO's, where the decoder is a transformer whose query is a learned array.
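A sketch (PyTorch, illustrative sizes) of that "special learned token" option: prepend a learned CLS token, run the encoder, and decode only that token.

```python
import torch
import torch.nn as nn

D, n_classes = 256, 10
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True), num_layers=4
)
cls_token = nn.Parameter(torch.zeros(1, 1, D))
head = nn.Linear(D, n_classes)

x = torch.randn(2, 196, D)                                     # (batch, tokens, dim)
x = torch.cat([cls_token.expand(x.shape[0], 1, D), x], dim=1)  # prepend CLS
logits = head(encoder(x)[:, 0])                                # decode the CLS token only
```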

You can change the dimensionality with 1x1 convolutions in between transformer blocks; you wouldn't lose meaning, but rather expressivity or capacity. I'm not sure it's recommended, but it's not immoral or illegal either.
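A sketch (PyTorch, sizes of my own choosing) of what that looks like: on a token sequence, a 1x1 convolution is just a per-token linear layer between blocks.

```python
import torch
import torch.nn as nn

block1 = nn.TransformerEncoderLayer(d_model=384, nhead=6, batch_first=True)
reduce = nn.Linear(384, 192)     # equivalent to a 1x1 conv over the token "image"
block2 = nn.TransformerEncoderLayer(d_model=192, nhead=6, batch_first=True)

x = torch.randn(2, 196, 384)
x = block2(reduce(block1(x)))    # (2, 196, 192): fewer channels, same number of tokens
```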

1

tdgros t1_izwkav3 wrote

ViTs keep the same dimension because of the residual connections in the transformer blocks.

At the very end, you want to sum up the information if you want to do classification, but because all tokens are equivalent, you just average them before further decoding, i.e. if you concatenated all the tokens before a linear layer, it'd end up looking like a global pooling.
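A sketch (PyTorch, illustrative sizes) of that averaging: a global average over the tokens, followed by a linear classifier.

```python
import torch
import torch.nn as nn

D, n_classes = 384, 1000
head = nn.Linear(D, n_classes)

tokens = torch.randn(2, 196, D)       # encoder output, same dim as the input tokens
logits = head(tokens.mean(dim=1))     # global pooling over tokens, then decode
```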

2

tdgros t1_iyei5h7 wrote

Interesting, can you point me to the "types" of tetrachromats? The wiki page does not mention them: https://en.wikipedia.org/wiki/Tetrachromacy#Humans .

Note that the cornea and eye lens block most UV light anyway so your colleague seems very special. She probably did not have aphakia if she used to be a fighter pilot.

16

tdgros t1_iwg6php wrote

Yes. When collecting a "natural" dataset, the variety of camera settings just reflects how images are taken in the wild: sometimes in daytime, sometimes at night. In some cases, you would even want a variety of cameras as well, since they handle different conditions differently. If your task is camera-agnostic, then you want to marginalize out the camera settings.

1

tdgros t1_iwg1xic wrote

When talking about image restoration: denoising, deblurring, super-resolution... those settings matter a lot, obviously: ISO determines the noise, exposure time participates in the blur and the f-number affects the sharpness of the image. So they are very useful inputs, including in good lighting conditions by the way.
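A sketch (NumPy) of why ISO in particular is such a useful input for denoising: under a common simplified shot + read noise model, the noise variance scales with the analog gain. The coefficients below are made up, not calibrated values.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(clean, iso, shot=4e-4, read=1e-3):
    gain = iso / 100.0                              # ISO 100 as the reference
    var = shot * gain * clean + (read * gain) ** 2  # signal-dependent + signal-independent
    return clean + rng.normal(0.0, np.sqrt(var))

clean = np.clip(rng.random((64, 64)), 0.05, 1.0)    # toy linear-domain image
noisy_iso100 = add_noise(clean, iso=100)
noisy_iso3200 = add_noise(clean, iso=3200)          # much noisier: the denoiser should know
```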

1

tdgros t1_iuvovu1 wrote

that's expert as in "specialized to a certain task", and in this case, each denoiser is specialized to a step of the diffusion process, either to the noise level or to the kind of images seen (pure noise at the beginning, natural images with a little noise at the end), depending on your point of view.
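A sketch (PyTorch, toy modules) of "expert" denoisers in that sense: different denoisers handle different ranges of diffusion steps, i.e. different noise levels. The split points and the tiny networks are made up for illustration.

```python
import torch
import torch.nn as nn

experts = nn.ModuleList(nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))
                        for _ in range(3))
boundaries = [700, 300]          # t in [1000, 700): expert 0, [700, 300): expert 1, rest: expert 2

def denoise(x_t, t):
    idx = sum(t < b for b in boundaries)   # pick the expert for this noise level
    return experts[idx](x_t)

x_t = torch.randn(8, 64)
eps_hat = denoise(x_t, t=850)    # a very noisy step goes to the first expert
```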

11

tdgros t1_iuedlw7 wrote

You can test it here: https://nuclearsecrecy.com/nukemap/ . You can use the actually tested Tsar Bomba at 50 MT, or the 100 MT version! With that last one set off over San Francisco, there is "moderate blast damage" up to Santa Cruz, so a radius a bit below 60 miles, but you do get 3rd-degree burns out to 50 miles! So it is very, very far from covering all of California, but still gigantic.
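A quick back-of-the-envelope check (Python) of the "far from all of California" claim, taking California's area as roughly 164,000 square miles:

```python
import math

blast_radius_mi = 60                          # "moderate blast damage" radius above
blast_area = math.pi * blast_radius_mi ** 2   # ~11,300 sq mi
print(blast_area / 164_000)                   # ~0.07, i.e. about 7% of California
```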

4

tdgros t1_isg5iy8 wrote

In OP's setting, imho you can use the term you want: inpainting because it's a large missing area, SR because some people see SR as filling in new rows and columns (I don't, I prefer to see it as inverting the lens degradation) and interpolation because it just means "adding things between other things", at least in my native language. I'm not sure what usual methods you are referring to, but you could suggest them to OP!

0