Recent comments in /f/deeplearning

vk6flab t1_jax2hd7 wrote

I'm guessing that someone got hold of the thing that you needed to submit a Google Form for and decided that they could instead distribute it via a torrent. They updated the source code as a community service to show the torrent to the next poor sod who went looking through the source to get rid of the Google Form.

But I'm not the developer and I don't know what actually happened; it just seems plausible.

3

Jaffa6 t1_javl6ef wrote

No problem.

I believe that if you're using a BERT-esque model, you do indeed need to do "full" tokenisation (part of which is creating the attention mask and padding) because BERT expects its input to be a list of token indices. For example, given the token mapping {"a": 1, "cow": 2, "cat": 3, "dog": 4}, tokenisation would turn "a cat" into [1, 3], which is the form BERT expects.
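Roughly like this, as a toy sketch (the four-token vocabulary is made up, and real BERT tokenisation also does subword splitting and adds special tokens):

```python
# Hypothetical toy vocabulary, just to illustrate the mapping step.
vocab = {"a": 1, "cow": 2, "cat": 3, "dog": 4}

def tokenise(text):
    # Map each whitespace-separated token to its index in the vocabulary.
    return [vocab[token] for token in text.split()]

print(tokenise("a cat"))  # [1, 3]
```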

And since BERT comes with a token mapping (due to pre-training), if you're just putting in your own features (say, number of likes and number of retweets), they'll quite possibly just get interpreted as random tokens if their numbers match up with known token indices.

If your features are already the right kind (tokenised text, with the resultant indices matching the correct BERT token indices), I suppose you could do truncation/padding yourself and feed that input directly to BERT.

But it'll probably end up simpler and less error-prone to let the BERT tokeniser do it for you (e.g. via HuggingFace's `AutoTokenizer.from_pretrained('bert-base-uncased')`).
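Something like this minimal sketch (the checkpoint name and example texts are just placeholders, assuming the HuggingFace `transformers` library):

```python
from transformers import AutoTokenizer

# Load the tokeniser that matches the pre-trained BERT checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

texts = ["a cat", "a cow and a dog"]
encoded = tokenizer(
    texts,
    padding=True,         # pad the shorter sequence up to the batch maximum
    truncation=True,      # cut anything longer than the model's max length
    return_tensors="pt",  # PyTorch tensors, ready to feed into BERT
)

print(encoded["input_ids"])       # token indices, including [CLS]/[SEP]
print(encoded["attention_mask"])  # 1 for real tokens, 0 for padding
```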

2

fundamental_entropy t1_jasqy64 wrote

Flan models are trained on almost every open dataset available for generic English tasks. Recent research suggests that models trained to perform multiple tasks (in fact, even the ratios of the different tasks matter; see the Flan 2022 paper) are better than models trained on only a single task. Flan-T5 beats T5 on almost every task, and Flan-T5 XXL sometimes matches GPT-3-style prompted generation.
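For a sense of what that prompted generation looks like, here's a rough sketch with HuggingFace Transformers (the prompt is illustrative; XXL is huge, so swap in "google/flan-t5-small" to try it locally):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Zero-shot prompting with an instruction-tuned Flan-T5 checkpoint.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xxl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xxl")

prompt = "Answer the following question. What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```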

3