tdgros
tdgros t1_j3coj8r wrote
Reply to comment by sidney_lumet in [Discussion] Is there any alternative of deep learning ? by sidney_lumet
that's what random forests are...
tdgros t1_j2nfzj6 wrote
Reply to comment by waiting4omscs in [D] Simple Questions Thread by AutoModerator
a hypernetwork is a term used when a network outputs the coefficients (weights) of another network.
Sensor fusion is typically used with low-level sensors that are noisy, biased, limited in their dynamics... but that complement each other and can be "fused". For UAV navigation, we fuse accelerometers, gyros, pressure sensors, GPS and vision...
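If it helps to picture it, here is a minimal PyTorch sketch of the hypernetwork idea (everything here, sizes included, is made up for illustration): a small "hyper" MLP regresses the weights and bias of a target linear layer from a conditioning vector.

```python
import torch
import torch.nn as nn

class HyperLinear(nn.Module):
    """A linear layer whose weights are produced by a hypernetwork."""
    def __init__(self, cond_dim=8, in_dim=16, out_dim=4):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        # the hypernetwork outputs out_dim*in_dim weights plus out_dim biases
        self.hyper = nn.Sequential(
            nn.Linear(cond_dim, 64), nn.ReLU(),
            nn.Linear(64, out_dim * in_dim + out_dim),
        )

    def forward(self, x, cond):
        params = self.hyper(cond)                      # (B, out*in + out)
        w = params[:, : self.out_dim * self.in_dim]
        b = params[:, self.out_dim * self.in_dim:]
        w = w.view(-1, self.out_dim, self.in_dim)      # (B, out, in)
        # per-sample linear layer: y = W x + b, with W regressed from cond
        return torch.einsum("boi,bi->bo", w, x) + b

layer = HyperLinear()
y = layer(torch.randn(2, 16), torch.randn(2, 8))       # y has shape (2, 4)
```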
tdgros t1_j2m63e5 wrote
Reply to comment by waiting4omscs in [D] Simple Questions Thread by AutoModerator
I can't say for sure, but there isn't necessarily any online training. You can imagine some hypernetwork regressing good parameters for a low-level task such as controlling the shoes' motors. It could also be a combination of good old-school sensor fusion and a nice marketing speech ;)
tdgros t1_j28rhcc wrote
Reply to comment by stecas in [D] In vision transformers, why do tokens correspond to spatial locations and not channels? by stecas
Ah I see what you mean, you're right, my way of seeing it is the one that is not standard. My point is that transformers don't really care about the original modality, or about the order or spatial arrangement of their tokens: ViTs are just transformers over sequences of "patches of pixels" (note: the channels are flattened together!). On top of this, there is work to forcefully bring back locality biases (position embeddings, Swin transformers...), which explains why I don't tend to break tokens into different dimensions. You can recompose the sequence into an (H/16)x(W/16)xNdims image, the channels of which can be visualized separately if you want. More often, it's the attention maps themselves that are used for visualization or interpretation, head per head (i.e. the number of channels here really is the number of heads)
tdgros t1_j28murr wrote
Reply to comment by stecas in [D] In vision transformers, why do tokens correspond to spatial locations and not channels? by stecas
"An image is worth 16x16 words" means you can cut up an image into Nwords patches that are 16x16 spatially. Depending on the size of the image, that gets you a different Nwords. Each of those words is originally 16x16x3 for RGB images, and is projected linearly to Ndims dimensions (usually ~1000). So you get Nwords words of dimension Ndims! whre Nwords depends on the patch size and the image size, and Ndims is arbitrary.
I don't know if your post is a typo, but you're using the same n twice for the number of words and number of channels/dimensions, which doesn't make sense to me. It might be just a different perspective...
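To make the shapes concrete, here is a small sketch with assumed sizes (a 224x224 RGB image, 16x16 patches, Ndims = 768):

```python
import torch
import torch.nn as nn

H, W, P, Ndims = 224, 224, 16, 768
Nwords = (H // P) * (W // P)                  # 14 * 14 = 196 "words"

img = torch.randn(1, 3, H, W)
# cut into non-overlapping 16x16 patches and flatten each one
# (channels flattened together): 16*16*3 = 768 values per patch
patches = img.unfold(2, P, P).unfold(3, P, P)            # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, Nwords, 3 * P * P)
tokens = nn.Linear(3 * P * P, Ndims)(patches)            # (1, 196, 768)
```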
tdgros t1_j28bwj3 wrote
Reply to [D] In vision transformers, why do tokens correspond to spatial locations and not channels? by stecas
I don't see tokens as corresponding to spatial locations? Before you add or concat a spatial embedding, there is nothing spatial at all, since transformers are permutation invariant! It's only when you add a spatial embedding that the tokens get a relation to their position back.
Maybe you'd prefer concatenating the spatial embeddings as opposed to adding them, so you can (mentally) consider the first channels as content-only and the rest as "spatial-related stuff". It's not strictly true after the first transformer layer, though, and it doesn't change a lot: concat should arguably be the default operation, but adding just makes for smaller tokens and it works fine.
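Here is a tiny sketch of the two options, with sizes assumed for illustration (196 tokens of dimension 768, and 64 extra "position" channels for the concat variant):

```python
import torch
import torch.nn as nn

B, Nwords, Ndims, Npos = 4, 196, 768, 64
tokens = torch.randn(B, Nwords, Ndims)

# option 1: add a learned position embedding of the same size (the usual ViT choice)
pos_add = nn.Parameter(torch.zeros(1, Nwords, Ndims))
tokens_added = tokens + pos_add                          # still (B, 196, 768)

# option 2: concatenate a smaller learned position embedding, mentally keeping
# "content" channels and "position" channels separate (until the first
# transformer layer mixes them)
pos_cat = nn.Parameter(torch.zeros(1, Nwords, Npos))
tokens_cat = torch.cat([tokens, pos_cat.expand(B, -1, -1)], dim=-1)  # (B, 196, 832)
```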
tdgros t1_j1uai60 wrote
Reply to comment by jens-2420 in 'The Serpent' serial killer Charles Sobhraj returns to France by randy88moss
Just in case: there was a Netflix series about him a few years ago, with Tahar Rahim playing him...
tdgros t1_izwrdlr wrote
Reply to comment by DeepGamingAI in [D] Global average pooling wrt channel dimensions by Ananth_A_007
I am designing a vision transformer for image classification. What do you feel about inserting 1x1 convolutions in between transformer blocks to reduce the dimensionality? would you feel offended if I gradually did that throughout my backbone?
As a large language model trained by OpenAI, I don't have feelings and I am not capable of being offended. I am also not able to browse the internet, so I am unable to provide specific information about using 1x1 convolutions in a vision transformer for image classification. However, in general, using 1x1 convolutions can be a useful technique for reducing the dimensionality of an input in a convolutional neural network, which can help to improve the computational efficiency of the network and reduce the number of parameters it has. Whether or not this technique is appropriate for your specific use case will depend on the details of your model and the data you are working with. It's always a good idea to experiment with different architectures and techniques to see what works best for your particular application.
tdgros t1_izwppx0 wrote
Reply to comment by DeepGamingAI in [D] Global average pooling wrt channel dimensions by Ananth_A_007
You can either take all the existing tokens, average them, and decode the result into logits. But if you can do that, you can also do it with one single token after all.
Or you can append a special learned token at some point, which will have its own decoder; I believe that's what you're describing. You can find this approach in BERT, where a CLS token is inserted before any sentence. One final, similar approach is Perceiver IO's, where the decoder is a transformer whose query is a learned array.
You can change the dimensionality with 1x1 convolutions in between transformer blocks; you wouldn't lose meaning, but rather expressivity or capacity. I'm not sure that's recommended, but it's not immoral or illegal.
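A rough sketch of the two decoding options (the sizes and the single encoder layer are assumptions, just for illustration):

```python
import torch
import torch.nn as nn

B, N, D, n_classes = 2, 196, 768, 1000
tokens = torch.randn(B, N, D)
head = nn.Linear(D, n_classes)

# option 1: average all tokens, then decode the result into logits
logits_avg = head(tokens.mean(dim=1))                    # (B, 1000)

# option 2: prepend a special learned token (a la BERT's CLS), run the
# transformer, and decode only that token
cls = nn.Parameter(torch.zeros(1, 1, D)).expand(B, -1, -1)
seq = torch.cat([cls, tokens], dim=1)                    # (B, 197, D)
encoder = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
logits_cls = head(encoder(seq)[:, 0])                    # (B, 1000)
```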
tdgros t1_izwkav3 wrote
Reply to comment by DeepGamingAI in [D] Global average pooling wrt channel dimensions by Ananth_A_007
ViTs keep the same dimension because of the residual connections in the transformer blocks.
At the very end, you want to sum up the information if you want to do classification, but because all tokens are equivalent, you just average them before further decoding, i.e. if you concatenated all the tokens before a linear layer, it'd end up looking like a global pooling.
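A tiny numerical check of that last claim, with small made-up sizes: a linear layer applied to the concatenation of all tokens, with the same weight block repeated for every token (since tokens are equivalent), is just global average pooling followed by a linear layer.

```python
import torch

B, N, D, C = 2, 4, 8, 3                       # batch, tokens, dims, classes
tokens = torch.randn(B, N, D)
W = torch.randn(C, D)                         # the shared weight block

# concatenate all tokens, apply a linear layer whose weight is [W W ... W]
concat_then_linear = tokens.reshape(B, N * D) @ W.repeat(1, N).T / N
# global average pooling, then the same linear layer
pool_then_linear = tokens.mean(dim=1) @ W.T

print(torch.allclose(concat_then_linear, pool_then_linear, atol=1e-6))  # True
```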
tdgros t1_iyei5h7 wrote
Reply to comment by starfyredragon in Do people with tetrachromacy or colorblindness experience seasonal affective disorder at the same rate and intensity as people with trichromacy? by Millennial_Glacier
interesting, can you point me to the "types" of tetrachromats? The wiki page does not know about them: https://en.wikipedia.org/wiki/Tetrachromacy#Humans .
Note that the cornea and eye lens block most UV light anyway, so your colleague seems very special. She probably did not have aphakia if she used to be a fighter pilot.
tdgros t1_iye5wx8 wrote
Reply to comment by [deleted] in Do people with tetrachromacy or colorblindness experience seasonal affective disorder at the same rate and intensity as people with trichromacy? by Millennial_Glacier
can I ask how you discovered you were a tetrachromat and how you tested for it?
tdgros t1_iwg6php wrote
Reply to comment by ToTa_12 in [D] Camera settings for dataset collection by ToTa_12
Yes. When collecting a "natural" dataset, the variety of camera settings just reflects how images are taken in the wild: sometimes in daytime, sometimes at night. In some cases, you would even want a variety of cameras as well, as they handle different conditions differently. If your task is camera-agnostic, then you want to marginalize over the camera settings.
tdgros t1_iwg1xic wrote
Reply to [D] Camera settings for dataset collection by ToTa_12
When talking about image restoration (denoising, deblurring, super-resolution...), those settings matter a lot, obviously: ISO determines the noise, exposure time contributes to the blur, and the f-number affects the sharpness of the image. So they are very useful inputs, including in good lighting conditions by the way.
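One simple way (an assumption on my side, not the only option) to feed those settings to a restoration network is to normalize them and concatenate them as constant extra channels to the input image. A toy sketch:

```python
import torch
import torch.nn as nn

def add_camera_settings(img, iso, exposure_s, f_number):
    """img: (B, 3, H, W); iso, exposure_s, f_number: (B,) tensors."""
    B, _, H, W = img.shape
    # rough log-scale normalizations, chosen arbitrarily for illustration
    feats = torch.stack([
        torch.log2(iso / 100.0),
        torch.log2(exposure_s * 1000.0),      # exposure time in ms
        torch.log2(f_number),
    ], dim=1)                                 # (B, 3)
    maps = feats[:, :, None, None].expand(B, 3, H, W)
    return torch.cat([img, maps], dim=1)      # (B, 6, H, W)

denoiser = nn.Sequential(                     # toy restoration network
    nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)
x = add_camera_settings(torch.randn(2, 3, 64, 64),
                        iso=torch.tensor([800.0, 3200.0]),
                        exposure_s=torch.tensor([0.01, 0.05]),
                        f_number=torch.tensor([1.8, 2.8]))
out = denoiser(x)                             # (2, 3, 64, 64)
```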
tdgros t1_iuvovu1 wrote
Reply to comment by Classic-Rise4742 in [N] eDiffi: Text-to-Image Diffusion Models with Ensemble of Expert Denoisers by jd_3d
that's expert as in "specialized to a certain task", and in this case, each denoiser is specialized to a certain step of the diffusion process, either to the noise level or to the kind of images seen (pure noise at the beginning, natural images with a little noise at the end), depending on your point of view.
tdgros t1_iuerkbf wrote
Reply to comment by destuctir in Comparison of Nuclear Explosions by MiamiDevSecOps
There are similar stories about the very first atomic bomb test. There was never any real worry though. Do you have a source for this Tsar Bomba claim?
tdgros t1_iuedlw7 wrote
Reply to comment by jordzkie05 in Comparison of Nuclear Explosions by MiamiDevSecOps
you can test it here: https://nuclearsecrecy.com/nukemap/ You can use the actual tested Tsar Bomba at 50 MT, or the 100 MT one! With that last one set off over San Francisco, there is "moderate blast damage" up to Santa Cruz, so a radius a bit below 60 miles, but you do get 3rd degree burns out to 50 miles! So it is very, very far from covering all of California, but still gigantic.
tdgros t1_isg5iy8 wrote
Reply to comment by Red-Portal in [D] Interpolation in medical imaging? by Delacroid
In OP's setting, imho you can use the term you want: inpainting because it's a large missing area, SR because some people see SR as filling in new rows and columns (I don't, I prefer to see it as inverting the lens degradation), and interpolation because it just means "adding things between other things", at least in my native language. I'm not sure what usual methods you are referring to, but you could suggest them to OP!
tdgros t1_isg3vwi wrote
Reply to comment by Red-Portal in [D] Interpolation in medical imaging? by Delacroid
You are welcome to call it what you want, I'm pretty sure you see the similarities and why I suggested maskGIT.
tdgros t1_isezwr3 wrote
Reply to comment by Delacroid in [D] Interpolation in medical imaging? by Delacroid
this is a weird autocorrection, right? :)
tdgros t1_isez24z wrote
Reply to [D] Interpolation in medical imaging? by Delacroid
Filling in a missing slice could be called an "inpainting problem".
There is this line of work that should fit your description: https://arxiv.org/pdf/2202.04200.pdf (there are older similar approaches as well). There are approaches using GANs as well. I can't say if they're popular for medical imaging data, but they're quite general.
tdgros t1_isey85z wrote
Reply to comment by eigenham in [D] Interpolation in medical imaging? by Delacroid
This recent paper (https://arxiv.org/pdf/2209.07162.pdf) released a dataset of 100k brain MRI images generated with a diffusion model. So things are moving a bit...
tdgros t1_ir9hdy2 wrote
Reply to comment by ThePerson654321 in [R] Google announces Imagen Video, a model that generates videos from text by Erosis
Phenaki already shows the generation of 2-minute videos (using lots of prompts): https://phenaki.video/#interactive It's not that far-fetched to imagine that working on longer prompts and videos...
tdgros t1_ir1914n wrote
Reply to comment by pjabrony in TIL that the construction of Fort Boyard took so long that by the time of completion it was largely obsolete. Years later it found new purpose as a filming location for game shows. by SilasMarner77
Building started in 1801 and was completed in 1857
But the invention of TV took so long that the game show only started in 1990
tdgros t1_j41f1nz wrote
Reply to comment by Chemont in [R] Is there any research on allowing Transformers to spent more compute on more difficult to predict tokens? by Chemont
You'll still pay the full price at train time, right? Early decoding works by attaching decoders to earlier layers at train time. Conversely, if you want to spend more compute on some tokens, you will need more layers at train time, so at some point you will hit your memory/complexity limits.
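For illustration, a minimal sketch of early decoding with per-layer heads (an assumed setup, not any specific paper's method): every layer gets its own output head at train time, which is why you still pay the full price there, and at inference you can stop at an earlier layer once its head is confident enough.

```python
import torch
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    def __init__(self, d_model=256, n_layers=6, vocab=1000):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(n_layers))
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(n_layers))

    def forward(self, x):
        # train time: all layers and all heads are evaluated (the "full price")
        logits = []
        for layer, head in zip(self.layers, self.heads):
            x = layer(x)
            logits.append(head(x))
        return logits                          # one prediction per depth, all supervised

    @torch.no_grad()
    def infer(self, x, threshold=0.9):
        # inference: exit as soon as the current head is confident enough
        for layer, head in zip(self.layers, self.heads):
            x = layer(x)
            probs = head(x).softmax(dim=-1)
            if probs.max() > threshold:
                break
        return probs

model = EarlyExitEncoder()
outs = model(torch.randn(2, 10, 256))          # 6 sets of logits at train time
```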