Recent comments in /f/MachineLearning

radi-cho t1_jdz40zp wrote

Last week I released a CLI that can do this at scale: https://github.com/radi-cho/datasetGPT. Later today I'll use personal funds to generate a fairly large task-oriented dataset with gpt-3.5 or gpt-4. I'll open-source it, along with a way for people to contribute their own datasets so we can collect bigger ones. That would be helpful both for analyzing how LLMs work and for fine-tuning downstream models (Alpaca-like).
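Roughly, the idea is something like the sketch below. To be clear, this is just an illustration using the plain OpenAI Python client (pre-1.0 API), not datasetGPT's actual interface; the prompt, model name, and output file are placeholders.

```python
# Illustrative sketch of LLM-based dataset generation (NOT datasetGPT's CLI;
# the prompt, model name, and output file are placeholders).
# Uses the pre-1.0 openai Python client.
import json
import openai

TASK_PROMPT = (
    "Write one short task-oriented dialogue as JSON with the keys "
    "'instruction' and 'response'."
)

examples = []
for _ in range(10):  # scale this count up for a larger dataset
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": TASK_PROMPT}],
        temperature=1.0,  # higher temperature -> more varied examples
    )
    examples.append(completion.choices[0].message["content"])

with open("generated_dataset.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps({"raw": example}) + "\n")
```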

1

Professional-Gap-243 t1_jdz27jp wrote

The way I think about this is how I think about operating systems. Yes, you can build your own OS from scratch, but more often than not you just use Windows or Linux. And if you need something custom, it is often sufficient to set up your own Linux distro.

To me, LLMs are in a similar situation. Most of the time it doesn't really make sense to build your own LLM from scratch, just as it wouldn't to build your own OS. That doesn't mean there is no room for building new LLMs, though.

In this analogy, GPT is like Windows (closed, controlled by a corporation), and I think the ML community now needs to focus on building an open-source alternative that can stand toe to toe with it.

Otherwise the space becomes monopolistic/oligopolistic with large corps running the show (just like before Linux came around).

15

Nezarah t1_jdz1zqc wrote

Specifically for personal use and research, not commercial? LLaMA is a good place to start, and/or Alpaca 7B. They're small scale (can run on most hardware locally) and can be LoRA-trained and fine-tuned. They also have a reasonably high token limit (I think it's 2,000 or so?).

Outputs can be comparable to GPT-3, and they can be further enhanced with pre-context prompting (supplying relevant context ahead of the query).

You can add branching functionality through the LangChain library. A rough LoRA setup sketch is below.
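If it helps anyone getting started, the LoRA part might look roughly like this with Hugging Face's peft library. This is a minimal sketch; the checkpoint path and hyperparameters are placeholders, not recommendations.

```python
# Minimal LoRA setup sketch with Hugging Face transformers + peft.
# The checkpoint path and hyperparameters are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "path/to/llama-7b-hf"  # locally converted LLaMA weights
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```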

6

sdmat t1_jdyzb37 wrote

Not a movie, but it's definitely SF:

> "Far Centaurus" (1944) by A. E. van Vogt: This classic science fiction story tells the tale of a group of colonists who embark on a centuries-long voyage to the distant star system Centaurus. Upon arrival, they discover that Earth has developed faster-than-light travel during their journey, and a thriving human civilization already exists in Centaurus. > > "The Songs of Distant Earth" (1986) by Arthur C. Clarke: The novel features the crew of a slower-than-light colony ship, Magellan, who arrive at their destination planet Thalassa, only to discover that faster-than-light ships have already colonized other planets in the meantime. The story explores the consequences of different levels of technology and adaptation for the human settlers. > > "Tau Zero" (1970) by Poul Anderson: In this novel, a group of colonists aboard the starship Leonora Christine set out to explore a distant star system. During their journey, they encounter a series of technical malfunctions that cause their ship to accelerate uncontrollably. As a result, they experience time dilation, and the rest of the universe rapidly advances around them. They must navigate their own obsolescence and search for a new home as other expeditions overtake them.

Being able to find anything from a few vague words about its content is one of my favourite GPT-4 capabilities!

28

Username2upTo20chars t1_jdyypnk wrote

Okay, I get the response:

>Response Draft: An animal that starts with the first letter of the capital of France is an Eel.
>
>Draft Reflection: The answer draft is correct. Paris is the capital of France and the first letter is "P." However, the answer could be more concise by simply stating "Eel" as the answer.
>
>Actual Response: The answer is Eel. The capital of France is Paris and the first letter is "P". Therefore, an animal that starts with the first letter of the capital of France is an Eel.

Given the following instructions:

>For this interaction, we are going to use the following structure.
>
>User (me): [I will ask a topic or question]
>
>You will provide a Response Draft: [Brief or simplified answer to the topic or question]
>
>Then you will undergo Draft Reflection: [You will provide a critique or review of the answer draft, highlighting the limitations, inaccuracies or areas that need improvement, correction or expansion, while providing guidance on how to address these issues in the revised response. Important: think step by step while reviewing or criticizing the hypothetical response.]
>
>Then you will provide an Actual Response: [The natural and contextually appropriate answer to the topic or question, as generated by the advanced language model, which incorporates the suggestions and improvements from the draft reflection for a more comprehensive and accurate response. This also can include step-by-step reasoning.]
>
>You will in general act like the world's best experts in the respective domain or field of the question.
>
>Do you understand?

-_-

Okay, this update to the instructions gives a correct response:

>Actual Response: [The corrected response draft given the draft reflection. This also can include step-by-step reasoning.]
>
>You will in general act like the world's best experts in the respective domain or field of the question.
>
>Do you understand?
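If anyone wants to reuse the pattern programmatically rather than pasting the instructions by hand, a rough sketch of packing them into a system prompt could look like this. The prompt is paraphrased from the instructions above, the model name is a placeholder, and this uses the pre-1.0 openai client.

```python
# Sketch of the Response Draft / Draft Reflection / Actual Response pattern
# packed into a system prompt (paraphrased from the instructions above).
# Uses the pre-1.0 openai Python client; the model name is a placeholder.
import openai

SYSTEM_PROMPT = (
    "For every question, answer with three labelled sections.\n"
    "Response Draft: a brief or simplified first answer.\n"
    "Draft Reflection: a step-by-step critique of the draft, pointing out "
    "inaccuracies and how to correct them.\n"
    "Actual Response: the corrected response draft given the draft "
    "reflection, with step-by-step reasoning where useful.\n"
    "Act like the world's best expert in the relevant domain."
)

def ask(question: str) -> str:
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return completion.choices[0].message["content"]

print(ask("Name an animal that starts with the first letter of the capital of France."))
```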

2

Spziokles t1_jdyyies wrote

Came to say this. Compare yourself with someone who enters the field in two years, or two months. Heck, we are all witnessing what a difference even two weeks currently makes.

Will they find a job? Will they have a hard time? If your worries are justified, it should be even harder for them, which means you have an advantage in having this head start.

I guess we can also safely expect the demand for all skill levels around ML to increase as it impacts our societies and economies more. Yes, we might need fewer people for a single task, but the number of tasks will grow faster. I'm not worried for either new or old folks.

7

abnormal_human t1_jdyxteq wrote

Model weights are not currently considered to be copyrightable, and there is no DMCA/RIAA/MPAA machinery providing additional consequences for "pirating" them. At least for the moment, it's not a big risk to use LLaMA/Alpaca models commercially so long as you have not made an agreement with Facebook not to do so.

The OpenAI policy is about competing models and comes from the TOS of their API. Stanford agreed to that TOS, then released the text (which, again, is not copyrightable). Random people downloading that dataset aren't party to that agreement or bound by it.

I'm sure that Google, Facebook, Amazon, Netflix, etc. will be cautious here, but for a random smaller org this is a risk/benefit tradeoff, not an absolute.

A person who takes a torrented LLaMA and fine-tunes it using the Stanford dataset hasn't necessarily entered into any contract prohibiting that.

The original leaker of LLaMA weights broke the rules. That's about it. Tsk tsk.

2