
tdgros t1_j41f1nz wrote

You'll still pay the full price at train time, right? Early decoding works by attaching decoders to earlier layers at train time. Conversely, if you want to spend more compute on some tokens, you will need more layers at train time, so at some point you will hit your memory/complexity limits.
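A minimal sketch (PyTorch, with toy sizes and module names of my own) of that point: with early-exit decoders, every intermediate head is trained, so the full depth is still paid for during training even if inference can stop early.

```python
import torch
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    def __init__(self, dim=256, depth=6, n_classes=10):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(depth)
        )
        # one small decoder ("exit head") per block
        self.heads = nn.ModuleList(nn.Linear(dim, n_classes) for _ in range(depth))

    def forward(self, x):
        logits = []
        for block, head in zip(self.blocks, self.heads):
            x = block(x)
            logits.append(head(x.mean(dim=1)))  # pool tokens, decode at this depth
        return logits  # inference can stop early; training computes them all

model = EarlyExitEncoder()
tokens = torch.randn(2, 16, 256)            # (batch, tokens, dim)
labels = torch.randint(0, 10, (2,))
loss = sum(nn.functional.cross_entropy(l, labels) for l in model(tokens))
loss.backward()                              # gradients flow through every layer
```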

4

tdgros t1_j2nfzj6 wrote

"Hypernetwork" is the term used when one network outputs the coefficients (the weights) of another network.
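A minimal hypernetwork sketch (PyTorch); the sizes and the conditioning vector are made up, purely for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

in_dim, out_dim, ctx_dim = 8, 4, 16           # target layer sizes, conditioning size

# the hypernetwork maps a conditioning vector to the target layer's parameters
hyper = nn.Sequential(
    nn.Linear(ctx_dim, 64), nn.ReLU(),
    nn.Linear(64, out_dim * in_dim + out_dim)  # flattened weight + bias
)

def target_layer(x, params):
    w = params[: out_dim * in_dim].view(out_dim, in_dim)
    b = params[out_dim * in_dim:]
    return F.linear(x, w, b)                   # the "other network", with generated coefficients

context = torch.randn(ctx_dim)                 # e.g. a task or sensor descriptor
x = torch.randn(5, in_dim)
y = target_layer(x, hyper(context))            # (5, out_dim)
```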

Sensor fusion is typically used with low-level sensors that are noisy, biased, limited in their dynamics... but can complement each other, be "fused". For UAV navigation, we fuse accelerometers, gyros, pressure sensors, GPS and vision...
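A toy complementary filter (NumPy) as a sketch of that kind of fusion, with made-up gains and noise levels: the gyro is biased but smooth, the accelerometer-derived angle is noisy but drift-free, and blending the two gives a usable estimate.

```python
import numpy as np

dt, alpha = 0.01, 0.98                         # sample period, fusion gain
pitch = 0.0
rng = np.random.default_rng(0)

for k in range(1000):
    true_pitch = 0.1 * k * dt                  # deg, pretend slow motion
    gyro_rate = 0.1 + 0.05                     # deg/s, true rate plus a constant bias
    accel_pitch = true_pitch + rng.normal(0, 2.0)  # noisy but unbiased absolute estimate

    # fuse: trust the integrated gyro short-term, the accelerometer long-term
    pitch = alpha * (pitch + gyro_rate * dt) + (1 - alpha) * accel_pitch
```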

2

tdgros t1_j2m63e5 wrote

I can't say for sure, but there isn't necessarily any online training. You can imagine some hypernetwork regressing good parameters for a low level task such as controlling the shoes' motors. It could also be a combination of good old school sensor fusion and a nice marketing speech ;)

3

tdgros t1_j28rhcc wrote

Ah I see what you mean, you're right, my way of seeing it is the one that is not standard. My point is that transformers don't really care about the original modality or the order or spatial arrangement of their tokens; ViTs are just transformers over sequences of "patches of pixels" (note: the channels are flattened together!). On top of this, there is work to forcefully bring back locality biases (position embeddings, Swin transformers...), which explains why I don't tend to break tokens into different dimensions. You can recompose the sequence into an (H/16)x(W/16)xNdims image, whose channels can be visualized separately if you want. More often, it's the attention maps themselves that are used for visualization or interpretation, head per head (i.e. the number of channels here really is the number of heads).
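A sketch (PyTorch, with assumed shapes and random data) of the two visualizations mentioned above: recomposing the token sequence into an (H/16)x(W/16)xNdims grid, and looking at per-head attention maps.

```python
import torch

H, W, P, Ndims, heads = 224, 224, 16, 384, 6
tokens = torch.randn(H // P * (W // P), Ndims)        # (Nwords, Ndims), no CLS token here

grid = tokens.view(H // P, W // P, Ndims)             # (14, 14, Ndims): channel k is grid[..., k]

attn = torch.softmax(torch.randn(heads, tokens.shape[0], tokens.shape[0]), dim=-1)
# attention paid to every token by, say, the token at grid position (0, 0), per head:
maps = attn[:, 0, :].view(heads, H // P, W // P)      # (heads, 14, 14), one map per head
```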

2

tdgros t1_j28murr wrote

"An image is worth 16x16 words" means you can cut up an image into Nwords patches that are 16x16 spatially. Depending on the size of the image, that gets you a different Nwords. Each of those words is originally 16x16x3 for RGB images, and is projected linearly to Ndims dimensions (usually ~1000). So you get Nwords words of dimension Ndims! whre Nwords depends on the patch size and the image size, and Ndims is arbitrary.

I don't know if your post is a typo, but you're using the same n twice for the number of words and number of channels/dimensions, which doesn't make sense to me. It might be just a different perspective...

2

tdgros t1_j28bwj3 wrote

I don't see tokens as corresponding to spatial locations. Before you add or concat a spatial embedding, there is nothing spatial at all, since transformers are permutation invariant! It's only when you add a spatial embedding that the tokens get a relation to their position back.

Maybe you'd prefer concatenating the spatial embeddings as opposed to adding them, so you can (mentally) consider the first channels as content-only and the rest as "spatial-related stuff". That's not strictly true after the first transformer layer, and it doesn't change a lot in practice: concat would arguably be the more principled default, but adding just makes for smaller tokens and it works fine.
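A small sketch (PyTorch, made-up sizes) of the two options: adding a learned position embedding keeps the token size, concatenating grows it.

```python
import torch
import torch.nn as nn

N, D, Dpos = 196, 384, 64
tokens = torch.randn(1, N, D)

pos_add = nn.Parameter(torch.zeros(1, N, D))            # learned, same size as the tokens
tokens_added = tokens + pos_add                          # (1, N, D)

pos_cat = nn.Parameter(torch.zeros(1, N, Dpos))          # learned, smaller "spatial part"
tokens_cat = torch.cat([tokens, pos_cat], dim=-1)        # (1, N, D + Dpos)
```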

2

tdgros t1_izwrdlr wrote

I am designing a vision transformer for image classification. What do you feel about inserting 1x1 convolutions in between transformer blocks to reduce the dimensionality? would you feel offended if I gradually did that throughout my backbone?

As a large language model trained by OpenAI, I don't have feelings and I am not capable of being offended. I am also not able to browse the internet, so I am unable to provide specific information about using 1x1 convolutions in a vision transformer for image classification. However, in general, using 1x1 convolutions can be a useful technique for reducing the dimensionality of an input in a convolutional neural network, which can help to improve the computational efficiency of the network and reduce the number of parameters it has. Whether or not this technique is appropriate for your specific use case will depend on the details of your model and the data you are working with. It's always a good idea to experiment with different architectures and techniques to see what works best for your particular application.

1

tdgros t1_izwppx0 wrote

You can either take all the existing tokens, average them and decode the result into logits. But if you can do that, you can also do it with one single token after all.

Or you can append a special learned token at some point, which will have its own decoder; I believe that's what you're describing. You can find this approach in BERT, where a CLS token is inserted before any sentence. One final, similar approach is Perceiver IO's, where the decoder is a transformer whose query is a learned array.
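A sketch (PyTorch, illustrative sizes) of that "special learned token" option: prepend a learned CLS token, run the encoder, and decode only that token.

```python
import torch
import torch.nn as nn

D, n_classes = 256, 10
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True), num_layers=4
)
cls_token = nn.Parameter(torch.zeros(1, 1, D))
head = nn.Linear(D, n_classes)

x = torch.randn(2, 196, D)                                     # (batch, tokens, dim)
x = torch.cat([cls_token.expand(x.shape[0], 1, D), x], dim=1)  # prepend CLS
logits = head(encoder(x)[:, 0])                                # decode the CLS token only
```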

You can change the dimensionality with 1x1 convolutions in between transformer blocks; you wouldn't lose meaning, but rather expressivity or capacity. I'm not sure it's recommended, but it's not immoral or illegal either.
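A sketch (PyTorch, sizes of my own choosing) of what that looks like: on a token sequence, a 1x1 convolution is just a per-token linear layer between blocks.

```python
import torch
import torch.nn as nn

block1 = nn.TransformerEncoderLayer(d_model=384, nhead=6, batch_first=True)
reduce = nn.Linear(384, 192)     # equivalent to a 1x1 conv over the token "image"
block2 = nn.TransformerEncoderLayer(d_model=192, nhead=6, batch_first=True)

x = torch.randn(2, 196, 384)
x = block2(reduce(block1(x)))    # (2, 196, 192): fewer channels, same number of tokens
```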

1

tdgros t1_izwkav3 wrote

ViTs keep the same dimension because of the residual connections in the transformer blocks.

At the very end, you want to sum up the information if you want to do classification, but because all tokens are equivalent, you just average them before further decoding, i.e. if you concatenated all the tokens before a linear layer, it'd end up looking like a global pooling.
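A sketch (PyTorch, illustrative sizes) of that averaging: a global average over the tokens, followed by a linear classifier.

```python
import torch
import torch.nn as nn

D, n_classes = 384, 1000
head = nn.Linear(D, n_classes)

tokens = torch.randn(2, 196, D)       # encoder output, same dim as the input tokens
logits = head(tokens.mean(dim=1))     # global pooling over tokens, then decode
```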

2

tdgros t1_iyei5h7 wrote

Interesting, can you point me to the "types" of tetrachromats? The wiki page does not mention them: https://en.wikipedia.org/wiki/Tetrachromacy#Humans .

Note that the cornea and eye lens block most UV light anyway so your colleague seems very special. She probably did not have aphakia if she used to be a fighter pilot.

16

tdgros t1_iwg6php wrote

Yes. When collecting a "natural" dataset, the variety of camera settings just reflects how images are taken in the wild: sometimes in daytime, sometimes at night. In some cases, you would even want a variety of cameras as well, since they handle different conditions differently. If your task is camera-agnostic, then you want to marginalize out the camera settings.

1

tdgros t1_iwg1xic wrote

When talking about image restoration: denoising, deblurring, super-resolution... those settings matter a lot, obviously: ISO determines the noise, exposure time participates in the blur and the f-number affects the sharpness of the image. So they are very useful inputs, including in good lighting conditions by the way.
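A sketch (NumPy) of why ISO in particular is such a useful input for denoising: under a common simplified shot + read noise model, the noise variance scales with the analog gain. The coefficients below are made up, not calibrated values.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(clean, iso, shot=4e-4, read=1e-3):
    gain = iso / 100.0                              # ISO 100 as the reference
    var = shot * gain * clean + (read * gain) ** 2  # signal-dependent + signal-independent
    return clean + rng.normal(0.0, np.sqrt(var))

clean = np.clip(rng.random((64, 64)), 0.05, 1.0)    # toy linear-domain image
noisy_iso100 = add_noise(clean, iso=100)
noisy_iso3200 = add_noise(clean, iso=3200)          # much noisier: the denoiser should know
```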

1

tdgros t1_iuvovu1 wrote

that's expert as in "specialized to a certain task", and in this case, each denoiser is specialized to a step of the diffusion process, either to the noise level or to the kind of images seen (pure noise at the beginning, natural images with a little noise at the end), depending on your point of view.
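A sketch (PyTorch, toy modules) of "expert" denoisers in that sense: different denoisers handle different ranges of diffusion steps, i.e. different noise levels. The split points and the tiny networks are made up for illustration.

```python
import torch
import torch.nn as nn

experts = nn.ModuleList(nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))
                        for _ in range(3))
boundaries = [700, 300]          # t in [1000, 700): expert 0, [700, 300): expert 1, rest: expert 2

def denoise(x_t, t):
    idx = sum(t < b for b in boundaries)   # pick the expert for this noise level
    return experts[idx](x_t)

x_t = torch.randn(8, 64)
eps_hat = denoise(x_t, t=850)    # a very noisy step goes to the first expert
```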

11

tdgros t1_iuedlw7 wrote

You can test it here: https://nuclearsecrecy.com/nukemap/ . You can use the actually tested Tsar Bomba at 50 MT, or the 100 MT version! With that last one set off over San Francisco, there is "moderate blast damage" up to Santa Cruz, so a radius a bit below 60 miles, but you do get 3rd-degree burns out to 50 miles! So it is very, very far from covering all of California, but still gigantic.
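A quick back-of-the-envelope check (Python) of the "far from all of California" claim, taking California's area as roughly 164,000 square miles:

```python
import math

blast_radius_mi = 60                          # "moderate blast damage" radius above
blast_area = math.pi * blast_radius_mi ** 2   # ~11,300 sq mi
print(blast_area / 164_000)                   # ~0.07, i.e. about 7% of California
```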

4

tdgros t1_isg5iy8 wrote

In OP's setting, imho you can use the term you want: inpainting because it's a large missing area, SR because some people see SR as filling in new rows and columns (I don't, I prefer to see it as inverting the lens degradation) and interpolation because it just means "adding things between other things", at least in my native language. I'm not sure what usual methods you are referring to, but you could suggest them to OP!

0