Recent comments in /f/MachineLearning
cthorrez t1_jcmhg8m wrote
Reply to comment by mike94025 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
Because the algorithm seems to only work on inference. Probably due to memory management of the cached activations or something. (Idk the actual technical reasons)
mike94025 t1_jcmepd6 wrote
Reply to comment by cthorrez in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
With Better Transformer, why not both?
HateRedditCantQuitit t1_jcmdot7 wrote
Reply to comment by Spiritual-Reply5896 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
I think of context as an end-to-end connected version of retrieval. You can backprop from the loss to the retrieved info, but you also want to backprop from the loss to the non-retrieved info, which would basically be equivalent to having it all in context (in a handwavy way). Which is to say that just having more context is a simple solution.
I think everyone knows increasing context length is not 100% sufficient, but it sure is a simple convenient solution.
[deleted] t1_jcmb52b wrote
bo_peng OP t1_jcmajpx wrote
Reply to comment by mikljohansson in [R] RWKV 14B ctx8192 is a zero-shot instruction-follower without finetuning, 23 token/s on 3090 after latest optimization (16G VRAM is enough, and you can stream layers to save more VRAM) by bo_peng
- RWKV-LM is now mainly for training, while ChatRWKV is for optimal inference.
- Someone in RWKV Discord tried it using LoRA (https://github.com/Blealtan/RWKV-LM-LoRA) and the result is quite nice. Join RWKV Discord for latest updates :)
Available_Lion_652 t1_jcm8ub5 wrote
Some heroes don't wear a cape
Necessary-Meringue-1 t1_jcm6j79 wrote
Reply to comment by Alimbiquated in Modern language models refute Chomsky’s approach to language [R] by No_Draft4778
>There is a general tendency to assume that if something seems intelligent, it must be like a human brain. It's like assuming that because it's fast, a car must have legs like a horse and eat oats.
Ironic, because that is literally what that article is doing.
Necessary-Meringue-1 t1_jcm5x7g wrote
Reply to comment by currentscurrents in Modern language models refute Chomsky’s approach to language [R] by No_Draft4778
Just because it's "natural" does not mean it's unstructured or lacks any logic. Can you be any more disingenuous than to rely on some etymology-based semantics?
Like programmers invented structure
Necessary-Meringue-1 t1_jcm5mye wrote
Reply to comment by harharveryfunny in Modern language models refute Chomsky’s approach to language [R] by No_Draft4778
>These models are learning vastly more than language alone
A child growing up does too.
>These models are learning in an extraordinarily difficult way with *only* "predict next word" feedback and nothing else
That's literally the point: LLMs do not learn language like humans at all. Unless you're trying to say that you and I are pure Skinner-type behaviorist learners.
Competitive-Rub-1958 t1_jcm5ahk wrote
Reply to comment by mike94025 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
cool! So I just need to enable `flash_sdp`, then ensure I'm basically computing self-attention and have `batch_first=True`. Would that be correct?
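Something like this is what I have in mind (just a sketch based on my reading of the 2.0 docs, so the exact kernel-selection flags may be off):

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) in fp16 on GPU, which is what the flash kernel expects
q = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Restrict the SDPA dispatcher to the flash kernel only
# (equivalent in spirit to torch.backends.cuda.enable_flash_sdp(True))
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```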
Necessary-Meringue-1 t1_jcm4o9d wrote
Reply to comment by harharveryfunny in Modern language models refute Chomsky’s approach to language [R] by No_Draft4778
>the Transformer is proof by demonstration that you don't need a language-specific architecture to learn language, and also that you can learn language via prediction feedback, which is highly likely how our brain does it too.
where to even start, how about this:
The fact that a transformer can appear to learn language on a non-specific architecture does not at all mean that humans work the same way.
Did you ingest billions of tokens of English growing up? How did you manage to have decent proficiency at the age of 6? Did you read the entire common crawl corpus by age 10?
This kind of argument is on paper stilts. LLMs are extremely impressive, but that does not mean they tell you much about how humans do language.
VarietyElderberry t1_jcm4ghk wrote
Reply to comment by Screye in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
Could you explain what you mean with a logical-hop and how it is dependent on a certain number of tokens? If you are referring to a paper, a link would be appreciated.
Necessary-Meringue-1 t1_jcm4bbu wrote
Reply to comment by sam__izdat in Modern language models refute Chomsky’s approach to language [R] by No_Draft4778
thanks for this linguistically and ML informed take-down!
yehiaserag t1_jcm31zk wrote
Reply to comment by [deleted] in [R] RWKV 14B ctx8192 is a zero-shot instruction-follower without finetuning, 23 token/s on 3090 after latest optimization (16G VRAM is enough, and you can stream layers to save more VRAM) by bo_peng
We say RWKV for short; the rest of the name refers to a specific version.
acertainmoment t1_jcm22en wrote
Reply to [R] RWKV 14B ctx8192 is a zero-shot instruction-follower without finetuning, 23 token/s on 3090 after latest optimization (16G VRAM is enough, and you can stream layers to save more VRAM) by bo_peng
tried a small modification to one of the examples on huggingface :)
jkterry1 OP t1_jcm0qqj wrote
Reply to comment by LappenX in [N] Jumpy 1.0 has now been released by the Farama Foundation by jkterry1
What do you mean? This allows you to use it if you want.
royalemate357 t1_jclz4t0 wrote
Reply to comment by MoistYogurtcloset400 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
hmm, I am not too sure but their blogpost says this:
>TorchInductor uses a pythonic define-by-run loop level IR to automatically map PyTorch models into generated Triton code on GPUs and C++/OpenMP on CPUs.
so it seems like they support CPU. I also tried it briefly on a Google Colab CPU-only runtime and it seems to work (I didn't benchmark speed, though). I doubt it supports non-CUDA GPUs, but then again support for those isn't very good even in the general case.
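For reference, roughly what I tried was a quick smoke test along these lines (not the exact code):

```python
import torch

def f(x):
    return torch.sin(x) + torch.cos(x) ** 2

# On a CPU-only machine TorchInductor should generate C++/OpenMP code instead of Triton
compiled_f = torch.compile(f)

x = torch.randn(10_000)
print(torch.allclose(compiled_f(x), f(x)))  # same result as eager mode
```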
ironmagnesiumzinc t1_jclz4ml wrote
Really interesting idea
mike94025 t1_jcly3mi wrote
Reply to comment by Competitive-Rub-1958 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
Documentation was not updated. Yes, you can use flash attention for training.
The first version included only forward() as we were resolving some issues with backward(). Docstring will be updated.
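As a rough illustration (my own sketch, not from the docs), gradients should now flow through the fused kernel:

```python
import torch
import torch.nn.functional as F

# fp16 tensors on GPU so the dispatcher can pick the flash kernel
q = torch.randn(2, 8, 256, 64, device="cuda", dtype=torch.float16, requires_grad=True)
k = torch.randn(2, 8, 256, 64, device="cuda", dtype=torch.float16, requires_grad=True)
v = torch.randn(2, 8, 256, 64, device="cuda", dtype=torch.float16, requires_grad=True)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
loss = out.float().pow(2).mean()
loss.backward()  # backward() now runs through the fused attention kernel
print(q.grad.shape)
```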
[deleted] t1_jclws3b wrote
cthorrez t1_jclwi3d wrote
Reply to comment by Competitive-Rub-1958 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
Fast inference is also important to anyone who wants to deploy these things.
gonomon t1_jclwgtg wrote
Reply to [D] Simple Questions Thread by AutoModerator
Subject: Generating Synthetic Data for Human Action Recognition
Hello,
In my master's thesis, I generated a realistic dataset (using the Unity engine) that can be used for human action recognition. The dataset contains 2D and 3D pose information and RGB videos. I wanted to test the effect of this dataset on real-world action detection (directly on YouTube videos) when the classifier is trained with synthetic data in addition to real data (NTU 120).
I want to use a skeleton-based action recognition methodology (since it outperforms RGB-only methods on NTU 120). To achieve this, I applied a pose estimator to the YouTube videos, our synthetic dataset, and NTU 120, and trained on the estimated poses. My reasoning is that instead of directly using the sterile ground-truth pose information from our synthetic dataset, I can run the same pose estimator on everything and use those poses, rather than worrying about domain adaptation strategies.
My question is: should I have directly used the ground-truth pose information of our synthetic data when training together with real data, or does my approach make sense? If there is any prior work that uses pose estimators as a domain adaptation method, I would be extremely happy if you could share the papers in your comments.
Best,
juliensalinas OP t1_jclvlim wrote
Reply to comment by Franck_Dernoncourt in [D] An Instruct Version Of GPT-J Using Stanford Alpaca's Dataset by juliensalinas
Clearly it is below Alpaca (based on what I can see from their web demo) and GPT Davinci.
But still this is a very interesting improvement compared to the base GPT-J.
MoistYogurtcloset400 t1_jclv6r1 wrote
Reply to comment by royalemate357 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
Is torch.compile only compatible with CUDA devices?
myself991 t1_jcmhn3k wrote
Reply to [D] Simple Questions Thread by AutoModerator
Hi everybody,
I forgot to submit my file for a conference, but the cmt3 submission section was still open about 45 minutes past the deadline, so I was able to upload it there.
I was wondering if anybody had any experience with submitting supplementary material to cmt3 for a conference an hour after the deadline? Are they going to remove the paper even though they kept the upload section open?
Also, do conferences normally keep the cmt3 deadline open a little past the announced deadline?
Thanks,