Recent comments in /f/MachineLearning
hfnuser0000 t1_jcl5akd wrote
Reply to comment by Spiritual-Reply5896 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
How many ways are there to implement memory retrieval? Can someone explain the intuition behind it? Thank you in advance!
Screye t1_jcl549n wrote
Reply to comment by Spiritual-Reply5896 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
Context length is also a hard limit on how many logical-hops the model can make.
If each back-and-forth takes roughly 500 tokens, then the model can only reason over 16 hops within 8k tokens. With 32k tokens, it can reason over 64 hops. This might allow emergent behavior on tasks that were previously deemed impossible because they require a minimum number of reasoning hops.
For what it's worth, I think memory retrieval will work just fine for 90% of scenarios and will stay relevant even at 32k tokens, especially if the wiki you are retrieving from is millions of lines long.
super_deap OP t1_jcl3whl wrote
Reply to comment by lucidraisin in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
That is understandable. I am working with that assumption as well. (I have failed too many such experiments to have blind faith 🙈)
petitponeyrose t1_jcl3hdz wrote
!RemindMe 5 days
lucidraisin t1_jcl2rkh wrote
Reply to comment by super_deap in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
yea, literature is scant and all over the place in the efficient attention field. in this paper, i believe they claim it is query-key dimension (d_dot), but i think it should depend on the number of heads too. i don't know of any other papers that explore this topic. i just don't want people to be surprised if they fine tune to greater context lengths and things don't work as well as gpt4
super_deap OP t1_jcl1omd wrote
Reply to comment by lucidraisin in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
Thanks for that paper; I came across it a while ago but have not read it yet. Is the limit due to the number of model parameters or the size of the embedding? I suspect the embedding size is the biggest factor limiting how large the context can be.
lucidraisin t1_jcl0y16 wrote
it is important for everyone to know that there may be a capacity limit to the context length, as explored by this paper. gpt4 may not have this limit, but smaller variants like llama may. it also depends on the task you are trying to solve. you cannot just get 'infinite context', as some would sell you that their network can do. more research needed... hopefully pytorch 2.0 leads to that
Taenk t1_jckzuxm wrote
Reply to comment by londons_explorer in [R] RWKV 14B ctx8192 is a zero-shot instruction-follower without finetuning, 23 token/s on 3090 after latest optimization (16G VRAM is enough, and you can stream layers to save more VRAM) by bo_peng
Sorry, I am not an expert, just an enthusiast, so this is a stupid question: Where can I see a list of these few hundred tests and is there some page where I can see comparisons between different models?
juliensalinas OP t1_jcky8ok wrote
Reply to comment by Necessary_Ad_9800 in [D] An Instruct Version Of GPT-J Using Stanford Alpaca's Dataset by juliensalinas
You're welcome.
A token is a unit of text that can be a short word, part of a word, or a punctuation mark.
On average, 1 token is made up of 4 characters, and 100 tokens are roughly equivalent to 75 words.
Natural Language Processing models need to turn your text into tokens in order to process it.
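To make that concrete, here is a rough tokenization sketch using the Hugging Face tokenizer (the GPT-J checkpoint name and example sentence are just assumptions for the demo; exact splits and counts vary by model and text):

```python
# Rough sketch of how text becomes tokens before a model sees it.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
text = "Tokenization splits text into subword units."
tokens = tokenizer.tokenize(text)   # subword pieces, e.g. ['Token', 'ization', ...]
ids = tokenizer.encode(text)        # the integer IDs the model actually consumes
print(tokens)
print(f"{len(ids)} tokens for {len(text)} characters")
```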
Necessary_Ad_9800 t1_jckxh1h wrote
Reply to comment by juliensalinas in [D] An Instruct Version Of GPT-J Using Stanford Alpaca's Dataset by juliensalinas
Thanks for the answer. Is 1 letter equal to 1 token?
juliensalinas OP t1_jckwtdj wrote
Reply to comment by Necessary_Ad_9800 in [D] An Instruct Version Of GPT-J Using Stanford Alpaca's Dataset by juliensalinas
No. If you want such a model to "remember" previous prompts, you will need to prepend them to each request you make.
The output can be up to 2048 tokens, but on a Tesla T4 you might not have enough VRAM, so you may be limited to 1024 tokens because the GPU will run out of memory above that.
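As a rough sketch of what that prepending looks like in practice (the model name and generation settings below are placeholder assumptions, not the NLP Cloud setup):

```python
# Minimal sketch of "remembering" by prepending previous turns to each new request.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

history = []  # earlier prompts and answers, as plain strings

def ask(prompt, max_new_tokens=256):
    context = "\n".join(history + [prompt])          # prepend past turns
    inputs = tokenizer(context, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Keep only the newly generated tail, not the echoed context.
    answer = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)
    history.extend([prompt, answer])
    return answer
```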
Necessary_Ad_9800 t1_jckvyg7 wrote
Does it remember previous prompts? And how long can its outputs be?
super_deap OP t1_jcktps3 wrote
Reply to comment by mrpogiface in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
This
juliensalinas OP t1_jcktk8o wrote
Reply to comment by pitrucha in [D] An Instruct Version Of GPT-J Using Stanford Alpaca's Dataset by juliensalinas
This is maybe something I'll focus on in the future. But for the moment I find this fp16 version well suited for small budgets as it runs on a 16GB GPU while the native fp32 version of GPT-J requires at least 24GB of VRAM.
Also, with the bitsandbytes integration in HF Transformers you can use the model in 8 bits: https://huggingface.co/blog/hf-bitsandbytes-integration
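For reference, a minimal sketch of that 8-bit loading path (this assumes a CUDA GPU with the bitsandbytes and accelerate packages installed; not a tuned production setup):

```python
# Load GPT-J with int8 weights via the bitsandbytes integration in
# Transformers, roughly halving VRAM versus fp16.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    device_map="auto",    # place layers automatically on available devices
    load_in_8bit=True,    # quantize weights to int8 at load time
)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
```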
Mindless-Ad8595 t1_jckrmcz wrote
Please keep posting your progress, this is very interesting!
Available_Lion_652 t1_jckrfwd wrote
Reply to comment by NotARedditUser3 in [D] GPT-4 is really dumb by [deleted]
I don't understand why you insulted me. I really tried to write a post about a case where GPT-4 hallucinates, with all good intentions, but I guess you have to be a smartass.
royalemate357 t1_jckqgsr wrote
Reply to comment by RobbinDeBank in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
Pretty sure the main improvement is "torch.compile", which can optimize your code with a nice easy one-liner. There are some other nice quality-of-life improvements like the built-in flash attention OP is using, and I think some distributed training stuff. But it's fully backwards compatible, which is great (looking at you, TensorFlow). https://pytorch.org/get-started/pytorch-2.0/#pytorch-2x-faster-more-pythonic-and-as-dynamic-as-ever
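A minimal sketch of both features (the toy module and tensor shapes are illustrative assumptions):

```python
# PyTorch 2.0: one-line compilation, plus the built-in scaled dot-product
# attention that dispatches to FlashAttention-style kernels when the
# hardware and dtypes allow it.
import torch
import torch.nn.functional as F

layer = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8)
compiled_layer = torch.compile(layer)   # graph capture + kernel fusion in one call

q = torch.randn(2, 8, 1024, 64)         # (batch, heads, seq_len, head_dim)
k = torch.randn(2, 8, 1024, 64)
v = torch.randn(2, 8, 1024, 64)
out = F.scaled_dot_product_attention(q, k, v)
```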
Spiritual-Reply5896 t1_jckq519 wrote
Reply to comment by super_deap in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
Let's say the Linux kernel manual is embedded as memories. If we can get an accurate semantic representation of the question, then we should be able to find the relevant context in that memory and use just enough of it to answer the question in far fewer tokens than providing the whole Linux manual as context. If we assume that computing the attention is as fast as vector search, then it's a no-brainer that retrieving only the relevant context from memory is a better approach than using the whole manual. It's of course a trade-off between accuracy and speed/scalability, but I'd argue it's a good trade-off, as text often isn't that information-dense.
The ability to produce semantically coherent embeddings from text is the bread and butter of LLMs, so why would it be any harder to retrieve these memories from an external / infinite database than from the context window?
I'm just hypothesizing with my limited knowledge; please correct me if I make stupid assumptions :)
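To make the idea concrete, a toy retrieval sketch might look something like this (the embedding model and the manual chunks are placeholder assumptions):

```python
# Embed chunks of a manual, score them against the question, and keep
# only the top matches as context for the LLM.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

chunks = ["chunk 1 of the manual ...", "chunk 2 ...", "chunk 3 ..."]
chunk_vecs = encoder.encode(chunks, normalize_embeddings=True)

def retrieve(question, k=2):
    q_vec = encoder.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q_vec          # cosine similarity (unit vectors)
    top = np.argsort(-scores)[:k]
    return [chunks[i] for i in top]

context = "\n".join(retrieve("How does the scheduler pick the next task?"))
# `context` is then prepended to the prompt instead of the whole manual.
```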
f-d-t777 t1_jckpovo wrote
Reply to [D] Simple Questions Thread by AutoModerator
Subject: Spacecraft image analysis using computer vision
Hi guys,
I'm looking to develop a system that uses computer vision algorithms to analyze images captured by spacecraft cameras and identify potential safety hazards or security threats. For example, the system could detect debris or other objects in orbit that could pose a risk to spacecraft.
I am looking to do this using all AWS tools. I am pretty new to this and am developing a technology architecture project around this topic to present for a program I'm doing.
How would I go about approaching this? I am looking to find/create my own mock datasets as well as present the algorithm/code I used to train my model. More specifically, I am focusing on these aspects for my project:
Preprocess the images: Preprocess the images to improve their quality and prepare them for analysis. This could include cropping, resizing, and adjusting the brightness and contrast of the images.
Train the computer vision algorithms: Train the computer vision algorithms using the dataset of images. There are various computer vision techniques that could be used, such as object detection, segmentation, or classification. The specific technique will depend on the requirements of the system.
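For reference, a rough sketch of the preprocessing step described above might look like this (OpenCV, with placeholder paths and illustrative values to tune for the actual camera data):

```python
# Center-crop, resize, and adjust brightness/contrast for one image.
import cv2

def preprocess(path, size=(512, 512)):
    img = cv2.imread(path)
    h, w = img.shape[:2]
    side = min(h, w)
    img = img[(h - side) // 2:(h + side) // 2,
              (w - side) // 2:(w + side) // 2]          # center crop to a square
    img = cv2.resize(img, size)                          # match the model's input size
    img = cv2.convertScaleAbs(img, alpha=1.2, beta=10)   # contrast x1.2, brightness +10
    return img
```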
In addition, it would be cool to have some sort of hardware/interactive portion that actually uses a camera to detect objects in space and can be integrated into the system. Once the computer vision algorithms have been trained and evaluated, implement the system. This could involve integrating the algorithms into a larger software system that processes images captured by spacecraft cameras in real time.
Thank you
super_deap OP t1_jckpoey wrote
Reply to comment by harharveryfunny in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
I think one just needs to duplicate the positional embeddings and we are good to go. Of course, there needs to be a more comprehensive empirical analysis of this, and I have not come across any such attempts. I did a basic experiment and it seems to work, but we will have to wait and see.
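For what it's worth, the "duplicate the positional embeddings" idea might look roughly like this for learned absolute positions (sizes and names are illustrative; as noted above, whether this preserves quality still needs empirical validation):

```python
# Tile a pretrained 2048-position embedding table to cover 8192 positions.
import torch
import torch.nn as nn

old_pos_emb = nn.Embedding(2048, 768)    # pretrained positional table
new_pos_emb = nn.Embedding(8192, 768)    # extended context window

with torch.no_grad():
    # Repeat the learned table 4x along the position axis.
    new_pos_emb.weight.copy_(old_pos_emb.weight.repeat(4, 1))
```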
LetterRip t1_jcl6axl wrote
Reply to comment by bo_peng in [R] RWKV 14B ctx8192 is a zero-shot instruction-follower without finetuning, 23 token/s on 3090 after latest optimization (16G VRAM is enough, and you can stream layers to save more VRAM) by bo_peng
Rocky