dlrace t1_jdh8ra8 wrote on March 24, 2023 at 11:45 AM

#2,333,312

The new plugins can be/ are created by just documenting the api and feeding it to gpt4 aren't they? no actual coding . So it seems at least plausible that the other approach would be as you say, let it interpret the ui visually.

BinarySplit t1_jdh9zu6 wrote on March 24, 2023 at 11:57 AM

#2,333,436

GPT-4 is potentially missing a vital feature to take this one step further: Visual Grounding - the ability to say where inside an image a specific element is, e.g. if the model wants to click a button, what X,Y position on the screen does that translate to?

Other MLLMs have it though, e.g. One-For-All. I guess it's only a matter of time before we can get MLLMs to provide a layer of automation over desktop applications...

[deleted] t1_jdhfke2 wrote on March 24, 2023 at 12:46 PM

#2,334,100

[removed]

reditum t1_jdhfxfa wrote on March 24, 2023 at 12:49 PM

#2,334,145

Check ACT-1 and WebGPT

banmeyoucoward t1_jdhg7kt wrote on March 24, 2023 at 12:52 PM

#2,334,181

I'd bet that screen recordings + mouse clicks + keyboard inputs made their way into the training data too.

[deleted] t1_jdhgar2 wrote on March 24, 2023 at 12:52 PM

#2,334,189

[deleted]

Deep-Station-1746 t1_jdhhbbg wrote on March 24, 2023 at 1:00 PM

#2,334,330

Nope. Ability to input something doesn't mean being able to use it reliably. For example, take this post - your eyes have an ability to input all the info on the screen, but as a contribution, this post is pretty worthless. And, you are a lot smarter than GPT-4, I think.

Edit: spelling

regular-jackoff t1_jdhikdt wrote on March 24, 2023 at 1:10 PM

#2,334,490

Replying to Deep-Station-1746 (#2,334,330)

Damn that’s rough

ObiWanCanShowMe t1_jdhil2v wrote on March 24, 2023 at 1:11 PM

#2,334,496

Replying to Deep-Station-1746 (#2,334,330)

We are smarter locally, meaning to our experience and our capability, we are not "smarter" in the grand scheme.

SkinnyJoshPeck t1_jdhis65 wrote on March 24, 2023 at 1:12 PM

#2,334,522

Replying to BinarySplit (#2,333,436)

i imagine you could interpolate, given access to more info about the image post-GPT analysis. i.e. i’d like to think it has some boundary defined for the objects it identifies in the image as part of metadata or something in the API.

[deleted] t1_jdhj5k5 wrote on March 24, 2023 at 1:15 PM

#2,334,573

Replying to BinarySplit (#2,333,436)

[deleted]

[deleted] t1_jdhje54 wrote on March 24, 2023 at 1:17 PM

#2,334,609

Replying to banmeyoucoward (#2,334,181)

[removed]

Balance- OP t1_jdhjm0f wrote on March 24, 2023 at 1:19 PM

#2,334,647

Replying to Deep-Station-1746 (#2,334,330)

It doesn’t have to use it yet for actions on its own, but it could be very useful context when prompting questions.

eliminating_coasts t1_jdhkkw3 wrote on March 24, 2023 at 1:26 PM

#2,334,769

Replying to BinarySplit (#2,333,436)

You could in principle send them four images, that align at a corner where the cursor is, if it can work out how images fit together.

harharveryfunny t1_jdhkn99 wrote on March 24, 2023 at 1:26 PM

#2,334,775

> GPT-4 with image input can interpret any computer screen

Not necessarily - it depends how they've implemented it. If it's just dense object and text detection, then that's all you're going to get.

For the model to be able to actually "see" the image they would need to feed it into the model at the level of neural net representation, not post-detection object description.

For example, if you wanted the model to guage whether two photos of someone not in it's training set are the same person, then it'd need face embeddings to do that (to gauge distance). They could special case all sorts of cases like this in addition to object detection, but you could always find something they missed.

The back-of-a-napkin hand-drawn website sketch demo is promising, but could have been done via object detection.

In the announcement of GPT-4, OpenAI said they're working with another company on the image/vision tech, and gave a link to an assistive vision company... for that type of use maybe dense labelling is enough.

acutelychronicpanic t1_jdhksvy wrote on March 24, 2023 at 1:28 PM

#2,334,797

Replying to BinarySplit (#2,333,436)

Let it move a "mouse" and loop the next screen at some time interval. Probably not the best way to do it, but that seems to be how humans do it.

nmkd t1_jdhmgpm wrote on March 24, 2023 at 1:40 PM

#2,334,987

Replying to banmeyoucoward (#2,334,181)

Nope, it's multimodal in terms of understanding language and images. It wasn't trained on mouse movement because that's neither language nor imagery.

dankaiv t1_jdhn2at wrote on March 24, 2023 at 1:44 PM

#2,335,052

... and computer interfaces (i.e. GUIs) have extremely low noise to signal ratio compared to image data from the real world. I believe soon AI will be better at using computers than most humans.

3_Thumbs_Up t1_jdhp6zj wrote on March 24, 2023 at 1:59 PM

#2,335,284

Replying to Deep-Station-1746 (#2,334,330)

Unnecessarily insulting people on the internet make you seem really smart. OP, unlike you, at least contributed something of value.

ThatInternetGuy t1_jdhpq8y wrote on March 24, 2023 at 2:03 PM

#2,335,335

Replying to BinarySplit (#2,333,436)

It's getting there.

reditum t1_jdhqy0b wrote on March 24, 2023 at 2:12 PM

#2,335,480

Replying to Deep-Station-1746 (#2,334,330)

/r/iamverysmart

loopuleasa t1_jdhrwkv wrote on March 24, 2023 at 2:18 PM

#2,335,582

GPT4 is not publicly multimodal though

Single_Blueberry t1_jdhtc58 wrote on March 24, 2023 at 2:28 PM

#2,335,746

Replying to BinarySplit (#2,333,436)

What would keep us from just telling it the screen resolution and origin and asking for coordinates?

Or asking for coordinates in fractional image dimensions.

farmingvillein t1_jdhua51 wrote on March 24, 2023 at 2:34 PM

#2,335,864

Replying to loopuleasa (#2,335,582)

Hmm, what do you mean by "publicly"? OpenAI has publicly stated that GPT-4 is multi-modal, and that they simply haven't exposed the image API yet.

The image API isn't publicly available yet, but it is clearly coming.

loopuleasa t1_jdhuit0 wrote on March 24, 2023 at 2:36 PM

#2,335,890

Replying to farmingvillein (#2,335,864)

talking about consumer access to the image API

is tricky, as the system is swamped already with text

they mentioned an image takes 30 seconds to "comprehend" by the model...

ThirdMover t1_jdhvx8i wrote on March 24, 2023 at 2:45 PM

#2,336,051

Replying to BinarySplit (#2,333,436)

>GPT-4 is potentially missing a vital feature to take this one step further: Visual Grounding - the ability to say where inside an image a specific element is, e.g. if the model wants to click a button, what X,Y position on the screen does that translate to?

You could just ask it to move a cursor around until it's on the specified element. I'd be shocked if GPT-4 couldn't do that.

ingeniare t1_jdhxcds wrote on March 24, 2023 at 2:54 PM

#2,336,234

Replying to BinarySplit (#2,333,436)

I would think image segmentation for UI to identify clickable elements and the like is a very solvable task

entitledypipo t1_jdi1p2x wrote on March 24, 2023 at 3:22 PM

#2,336,751

Good by human TSA

mycall t1_jdi3cko wrote on March 24, 2023 at 3:33 PM

#2,336,900

Can it detect object in the photo? Maybe drive an RC car with it? :)

Disastrous_Elk_6375 t1_jdi6779 wrote on March 24, 2023 at 3:51 PM

#2,337,205

Replying to dankaiv (#2,335,052)

nanoSingularity goes brrrrr

TikiTDO t1_jdi8ims wrote on March 24, 2023 at 4:06 PM

#2,337,473

Replying to harharveryfunny (#2,334,775)

The embeddings are still just a representation of information. They are extremely dense, effectively continuous representations, true, but in theory you could represent that information using other formats. It would just take far more space and require more processing.

Obviously having the visual system provide data that the model can use directly is going to be far more effective, but nothing about dense object detection and description is going to be fundamentally incompatible with any level of detail you could extract into an embedding vectror. I'm not saying it would be a smart or effective solution, but it could be done.

In fact, going to another level, LLMs aren't restricted to working with just words. You could train an LLM to receive a serialized embedding as text input, and then train it to interpret those. After all, it's effectively just a list of numbers. I'm not sure why you'd do that if you could just feed it in directly, but maybe it's more convenient to not have to train in on different types of inputs or something.

ginger_beer_m t1_jdi9e5j wrote on March 24, 2023 at 4:11 PM

#2,337,560

Carry this to the conclusion. Maybe not GPT4, but future LLM could interpret what's on the screen and drive the interaction with the computer themselves. This would potentially displace millions of human out of job as they get automated by the model.

CommunismDoesntWork t1_jdia6kb wrote on March 24, 2023 at 4:16 PM

#2,337,646

Replying to BinarySplit (#2,333,436)

It can do this just fine

harharveryfunny t1_jdic1s3 wrote on March 24, 2023 at 4:28 PM

#2,337,846

Replying to TikiTDO (#2,337,473)

>Obviously having the visual system provide data that the model can use directly is going to be far more effective, but nothing about dense object detection and description is going to be fundamentally incompatible with any level of detail you could extract into an embedding vectror. I'm not saying it would be a smart or effective solution, but it could be done.

I can't see how that could work for something like my face example. You could individually detect facial features, subclassified into hundreds of different eye/mouth/hair/etc/etc variants, and still fail to capture the subtle differences that differentiate one individual from another.

BullockHouse t1_jdidje0 wrote on March 24, 2023 at 4:37 PM

#2,338,011

I'm curious if it can be instructed to play minecraft in a keyboard only mode simply by connecting a sequence of images to key stroke outputs.

RustaceanNation t1_jdiekiv wrote on March 24, 2023 at 4:44 PM

#2,338,122

Replying to BinarySplit (#2,333,436)

Google's Spotlight paper is intended for this use case.

nixed9 t1_jdifhni wrote on March 24, 2023 at 4:49 PM

#2,338,219

Replying to ginger_beer_m (#2,337,560)

This is quite literally what we hope for/deeply fear at /r/singularity. It's going to be able to interact with computer systems itself. Give it read/write memory access and access to it's own API, or the ability to just simply visually process the screen output... and then.... what?

Several years ago, as recently as 2017 or so, this seemed extremely far-fetched and the "estimation" of a technological singularity of 2045 seemed wildly optimistic.

Right now it seems like it's more like than not to happen by 2030.

MjrK t1_jdiflsw wrote on March 24, 2023 at 4:50 PM

#2,338,239

Replying to ThirdMover (#2,336,051)

I'm confident that someone can fine-tune an end-to-end vision-tranformer that can extract user interface elements from photos and enumerate interaction options.

Seems like such an obviously-useful tool and Vit-22B should be able to handle it, or many other Computer Vision tools on Hugging Face... I would've assumed some grad student somewhere is already hacking away at that.

But then also, compute costs are a b**** but generating training data set should be somewhat easy.

Free research paper idea, I guess.

Art10001 t1_jdihrod wrote on March 24, 2023 at 5:04 PM

#2,338,461

Replying to BullockHouse (#2,338,011)

Probably. And if not, certainly someday.

morebikesthanbrains t1_jdii4y7 wrote on March 24, 2023 at 5:06 PM

#2,338,501

Replying to BinarySplit (#2,333,436)

But what about the black box. Just feed it enough data, train it, and it should figure out what to do?

TikiTDO t1_jdiirji wrote on March 24, 2023 at 5:10 PM

#2,338,558

Replying to harharveryfunny (#2,337,846)

For a computer words are just bits of information. If you wanted a system that used text to communicate this info, it would just assign some values to particular words, and you'd probably end up with ultra long strings of descriptions relating things to each other using god knows what terminology. It probably wouldn't really make sense to you if you were reading it because it would just be a text-encoded representation of an embedding vector describing finer relations that would only make sense to AIs.

byteuser t1_jdiirr7 wrote on March 24, 2023 at 5:10 PM

#2,338,561

Replying to reditum (#2,334,145)

Are they still doing development in ACT-1? Last update seems September last year

yashdes t1_jdij1tl wrote on March 24, 2023 at 5:12 PM

#2,338,597

Replying to loopuleasa (#2,335,890)

these models are very sparse, meaning very few of the actual calculations actually effect the output. My guess is trimming the model is how they got gpt3.5-turbo and I wouldn't be surprised if gpt4-turbo is coming.

reditum t1_jdij2tu wrote on March 24, 2023 at 5:12 PM

#2,338,602

Replying to byteuser (#2,338,561)

I honestly don’t know. I also think their approach wasn’t great either. Maybe (hopefully) they ditched it for something better.

wyrdwulf t1_jdikhuo wrote on March 24, 2023 at 5:21 PM

#2,338,768

Replying to BullockHouse (#2,338,011)

They had another model do that already.

OpenAI: We trained a neural network to play Minecraft by Video PreTraining (VPT) on a massive unlabeled video dataset of human Minecraft play

wind_dude t1_jdikwc4 wrote on March 24, 2023 at 5:24 PM

#2,338,821

I'm also curious about this, I reached out for developer access to try and test this on web screenshots for information extraction.

BullockHouse t1_jdil2ok wrote on March 24, 2023 at 5:25 PM

#2,338,844

Replying to wyrdwulf (#2,338,768)

I'm familiar! I'm curious though if it can generalize well enough to play semi-competently without specialized training. Has implications for multi-modal models and robotics.

stephbu t1_jdilpb1 wrote on March 24, 2023 at 5:29 PM

#2,338,914

Replying to nixed9 (#2,338,219)

Amara's Law...

[deleted] t1_jdimp7a wrote on March 24, 2023 at 5:35 PM

#2,339,046

Replying to dankaiv (#2,335,052)

[deleted]

[deleted] t1_jdimvgm wrote on March 24, 2023 at 5:36 PM

#2,339,070

Replying to ginger_beer_m (#2,337,560)

[deleted]

[deleted] t1_jdin627 wrote on March 24, 2023 at 5:38 PM

#2,339,096

Replying to Deep-Station-1746 (#2,334,330)

[deleted]

[deleted] t1_jdinibf wrote on March 24, 2023 at 5:40 PM

#2,339,134

Replying to wind_dude (#2,338,821)

[deleted]

wind_dude t1_jdip38g wrote on March 24, 2023 at 5:50 PM

#2,339,309

Replying to [deleted] (#2,339,134)

access to GPT-4 with multimodel

[deleted] t1_jditxnr wrote on March 24, 2023 at 6:21 PM

#2,339,847

[removed]

LanchestersLaw t1_jdiw7op wrote on March 24, 2023 at 6:36 PM

#2,340,088

Replying to mycall (#2,336,900)

The example data does demonstrate object detection

WokeAssBaller t1_jdixm43 wrote on March 24, 2023 at 6:45 PM

#2,340,258

Replying to reditum (#2,334,145)

So WebGPT doesn’t quite do this, it uses a JavaScript library to simplify web pages to basic text

CollectionLeather292 t1_jdj0jsl wrote on March 24, 2023 at 7:04 PM

#2,340,570

How do I try it out? I can't find a way to ad an image input to the chat...

itsnotlupus t1_jdj2xpr wrote on March 24, 2023 at 7:19 PM

#2,340,814

Meh. We see a few demos and all of the demos work all of the time, but that could easily be an optical illusion.

Yes, GPT-4 is probably hooked to subsystems that can parse an image, be it some revision of CLIP or whatever else, and yes it's going to work well enough some of the time, maybe even most of the time.

But maybe wait until actual non-corpo people have their hands on it and can assess how well it actually works, how often it fails, and whether anyone can actually trust it to do those things consistently.

harharveryfunny t1_jdj5mom wrote on March 24, 2023 at 7:37 PM

#2,341,067

Replying to TikiTDO (#2,338,558)

>it would just be a text-encoded representation of an embedding vector

One you've decided to input image embeddings into the model, you may as well enter them directly, not converted into text.

In any case, embeddings, whether represented as text or not, are not the same as object recognition labels.

TikiTDO t1_jdj6dum wrote on March 24, 2023 at 7:42 PM

#2,341,153

Replying to harharveryfunny (#2,341,067)

I'm not saying it's a good solution, I'm just saying if you want to hack it together for whatever reason, I see no reason why it couldn't work. It's sort of like the idea of building a computer using the game of life. It's probably not something you'd want to run your code on... But you could.

MysteryInc152 t1_jdj8x5e wrote on March 24, 2023 at 7:59 PM

#2,341,417

Replying to loopuleasa (#2,335,890)

>they mentioned an image takes 30 seconds to "comprehend" by the model...

wait really ? Cn you link source or something. There's no reason a native implementation should take that long.

Now i'm wondering if they're just doing something like this -https://github.com/microsoft/MM-REACT

harharveryfunny t1_jdj9if0 wrote on March 24, 2023 at 8:02 PM

#2,341,464

Replying to TikiTDO (#2,341,153)

I'm not sure what your point is.

I started by pointing out that there are some use cases (giving face comparison as an example) where you need access to the neural representation of the image (e.g. embeddings), not just object recognition labels.

You seem to want to argue and say that text labels are all you need, but now you've come full circle back to agree with me and say that the model needs that neural representation (embeddings)!

As I said, embeddings are not the same as object labels. An embedding is a point in n-dimensional space. A label is an object name like "cat" or "nose". Encoding an embedding as text (simple enough - just a vector of numbers) doesn't turn it into an object label.

plocco-tocco t1_jdj9is4 wrote on March 24, 2023 at 8:02 PM

#2,341,467

Replying to ThirdMover (#2,336,051)

It woulde be quite expensive to do tho. You have to do inference very fast with multiple images of your screen, don't know if it is even feasible.

farmingvillein t1_jdj9w98 wrote on March 24, 2023 at 8:05 PM

#2,341,497

Replying to yashdes (#2,338,597)

> these models are very sparse

Hmm, do you have any sources for this assertion?

It isn't entirely unreasonable, but 1) GPU speed-ups for sparsity aren't that high (unless OpenAI is doing something crazy secret/special...possible?), so this isn't actually that big of an upswing (unless we're including MoE?) and 2) openai hasn't released architecture details (beyond the original gpt3 paper--which did not indicate that the model was "very" sparse).

reditum t1_jdja3tw wrote on March 24, 2023 at 8:06 PM

#2,341,515

Replying to WokeAssBaller (#2,340,258)

Oh well, that’s what I get for not reading the paper.

Jean-Porte t1_jdjagqg wrote on March 24, 2023 at 8:09 PM

#2,341,551

Replying to nmkd (#2,334,987)

> use 2 images
> movement
> boom

[deleted] t1_jdjans6 wrote on March 24, 2023 at 8:10 PM

#2,341,573

Replying to reditum (#2,338,602)

[removed]

Spziokles t1_jdjc4wn wrote on March 24, 2023 at 8:20 PM

#2,341,713

So when playing League of Legends, it could tell you which enemy champion disappeared from their lane, and in how many seconds you should retreat to stay safe?

Curious how this will impact E-Sports and wether it will be treated like doping in some form.

LizardWizard444 t1_jdjewdy wrote on March 24, 2023 at 8:38 PM

#2,342,008

That's concerning

ThirdMover t1_jdjf69i wrote on March 24, 2023 at 8:40 PM

#2,342,042

Replying to plocco-tocco (#2,341,467)

I am not sure. Exactly how does inference scale with the complexity of the input? The output would be very short, just enough tokens for the "move cursor to" command.

frequenttimetraveler t1_jdjhb4b wrote on March 24, 2023 at 8:54 PM

#2,342,298

Replying to itsnotlupus (#2,340,814)

Automatic tech support will be huge. Print screen, then 'computer, fix this problem'.

frequenttimetraveler t1_jdjhhz8 wrote on March 24, 2023 at 8:55 PM

#2,342,323

Replying to ginger_beer_m (#2,337,560)

It will also render chatGpt plugins obsolete. The chat will replace them by simply using the browser.

frequenttimetraveler t1_jdjho9s wrote on March 24, 2023 at 8:56 PM

#2,342,346

Replying to nixed9 (#2,338,219)

I believe the full gpt4 can already do that https://mobile.twitter.com/gdb/status/1638971232443076609?s=20

But wait until they hook up robot arms to it

TikiTDO t1_jdjibnv wrote on March 24, 2023 at 9:01 PM

#2,342,429

Replying to harharveryfunny (#2,341,464)

My point was that you could pass all the information contained in an embedding as a text prompts into a model, rather than using it directly as an input vector, and an LLM could probably figure out how to use it even if the way you chose to deliver those embeddings was doing a numpy.savetxt and then sending the resulting string is as a prompt. I also pointed out that you could if your really wanted to write a network to convert an embedding to some sort of semantically meaningful word soup that stores the same amount of information. It's basically a pointless bit of trivia which illustrates a fun idea.

I'm not particularly interested in arguing whatever you think I want to argue. I made a pedantic aside that technically you can represent the same information in different formats, including representing embedding as text, and that a transformer based architecture would be able to find patterns it it all the same. I don't see anything to argue here, it's just a "you could also do it this way, isn't that neat." It's sort of the nature of a public forum; you made a post that made me think something, so I hit reply and wrote down my thoughts, nothing more.

[deleted] t1_jdjk1iy wrote on March 24, 2023 at 9:12 PM

#2,342,627

Replying to MjrK (#2,338,239)

[removed]

[deleted] t1_jdjm918 wrote on March 24, 2023 at 9:28 PM

#2,342,879

Replying to nixed9 (#2,338,219)

[removed]

H0lzm1ch3l t1_jdjntxm wrote on March 24, 2023 at 9:39 PM

#2,343,065

Yes but why let the AI use a GUI when we can just give it an API …

simmol t1_jdjq815 wrote on March 24, 2023 at 9:55 PM

#2,343,320

I think for this to be truly effective, the LLM would need to take in huge amounts of computer screen images in its training set, and I am not sure if that was done for the pre-trained model for GPT-4. But once this is done for all possible computer screen image combinations that one can think of, then it would probably be akin to the self-driving car type of algorithm where you can navigate accordingly based on the images.

But this type of multi-modality would be useful if you have the person actually sitting in front of the computer working side-by-side with the AI, right? Because if you want to eliminate the human from the loop, then I am not sure if this is an efficient way of training the LLM since these type of computer screen images are what helps a human navigate the computer, and not necessarily optimal for the LLM.

simmol t1_jdjsvuh wrote on March 24, 2023 at 10:15 PM

#2,343,649

Replying to frequenttimetraveler (#2,342,298)

Wouldn't it be more like the tech support is constantly monitoring your computer screen so you don't even have to print screen?

Puzzleheaded_Acadia1 t1_jdjx3oq wrote on March 24, 2023 at 10:46 PM

#2,344,180

I have questions can I fine-tune the gpt-neo-x 125m parameters on chat dataset to give me a decent answer like human because when I run it give me random characters

plocco-tocco t1_jdjx7qz wrote on March 24, 2023 at 10:47 PM

#2,344,197

Replying to ThirdMover (#2,342,042)

The complexity of the input wouldn't change in this case since it's just a screen grab of the display. Just that you'd need to do inference at a certain frame rate to be able to detect the cursor, which isn't that cheap with GPT-4. Now, I'm not sure what the latency or cost would be, I'd need to get access to the API to answer it.

DisasterEquivalent t1_jdk10wf wrote on March 24, 2023 at 11:15 PM

#2,344,649

Replying to BinarySplit (#2,333,436)

I mean, most apps have accessibility tags for all objects you can interact with (it is standard in UIKit) - The accessibility tags have hooks in them you can use for automation. so you should be able just have it find the correct element there without much searching.

MyPetGoat t1_jdk8b7t wrote on March 25, 2023 at 12:09 AM

#2,345,469

Replying to Puzzleheaded_Acadia1 (#2,344,180)

How big is the training set? I’ve found small ones can generate gibberish

MyPetGoat t1_jdk8icb wrote on March 25, 2023 at 12:10 AM

#2,345,494

Replying to simmol (#2,343,320)

You’d need the model to be running all the time observing what you’re doing on the computer. Could be done

simmol t1_jdkd4pf wrote on March 25, 2023 at 12:45 AM

#2,345,975

Replying to MyPetGoat (#2,345,494)

Seems quite inefficient though. Can't GPT just access the HTML or other type of codes associated with the website and just access the websites via the text as opposed to image?

SatoshiNotMe t1_jdkd8l5 wrote on March 25, 2023 at 12:46 AM

#2,345,990

Replying to farmingvillein (#2,335,864)

I’m curious about this as well. I see it’s multimodal but how do I use it with images? The ChatGPTplus interface clearly does not handle images. Does the API handle image?

farmingvillein t1_jdkdjye wrote on March 25, 2023 at 12:48 AM

#2,346,018

Replying to SatoshiNotMe (#2,345,990)

> I see it’s multimodal but how do I use it with images?

You unfortunately can't right now--the image handling is not publicly available, although supposedly the model is capable.

rePAN6517 t1_jdkinrg wrote on March 25, 2023 at 1:28 AM

#2,346,513

Replying to nixed9 (#2,338,219)

> This is quite literally what we hope for/deeply fear at /r/singularity

That sub is a cesspool of unthinking starry-eyed singularity fanbois that worship it like a religion.

Runthescript t1_jdknxkl wrote on March 25, 2023 at 2:11 AM

#2,347,037

Replying to BinarySplit (#2,333,436)

Are you trying to break captcha? Cause this is definitely how we break captcha

[deleted] t1_jdknyfe wrote on March 25, 2023 at 2:11 AM

#2,347,039

Replying to rePAN6517 (#2,346,513)

[removed]

ExcidianGuard t1_jdkrsnj wrote on March 25, 2023 at 2:43 AM

#2,347,417

Replying to rePAN6517 (#2,346,513)

Apocalyptic cults have been around for a long time, this one just has more basis in reality than usual

modcowboy t1_jdkz6of wrote on March 25, 2023 at 3:49 AM

#2,348,085

Replying to MjrK (#2,338,239)

Probably would be easier for the LLM to interact with the website directly through the inspect tool vs machine vision training.

k3iter t1_jdkz7ey wrote on March 25, 2023 at 3:49 AM

#2,348,089

Wholly

MassiveIndependence8 t1_jdl9oq9 wrote on March 25, 2023 at 5:41 AM

#2,349,096

Replying to ThirdMover (#2,336,051)

You’re actually suggesting putting every single frame into gpt-4? It’ll cost you a fortune after 5 seconds of running it. Plus the latency is super high, it might takes you an hour to process a “5 seconds” worth of images.

MassiveIndependence8 t1_jdl9s3u wrote on March 25, 2023 at 5:42 AM

#2,349,107

Replying to Single_Blueberry (#2,335,746)

The problem is that it can’t do math and spatial reasoning that well

MassiveIndependence8 t1_jdla8px wrote on March 25, 2023 at 5:48 AM

#2,349,134

Replying to H0lzm1ch3l (#2,343,065)

Not all api are public and LLM aren’t fined tune to process api

ThirdMover t1_jdlabwm wrote on March 25, 2023 at 5:49 AM

#2,349,143

Replying to MassiveIndependence8 (#2,349,096)

What do you mean by "frame"? How many images do you think GPT-4 would need to get a cursor where it needs to go? I'd estimate four or five should be plenty.

skaag t1_jdli80o wrote on March 25, 2023 at 7:38 AM

#2,349,750

I have not seen a way in the GPT-4 UI by OpenAI to submit an image? How do you do it?

fiftyfourseventeen t1_jdlm1n7 wrote on March 25, 2023 at 8:35 AM

#2,350,068

Replying to rePAN6517 (#2,346,513)

Lmao it seems everyone used chatGPT for a grand total of 20 minutes and threw their hands up saying "this is the end!". I have always wondered how the public would react once this tech finally became good enough for the public to notice, can't say this was too far from what I envisioned. "What if it's conscious and we don't even know it!" Cmon give me a break

thePaddyMK t1_jdlqyng wrote on March 25, 2023 at 9:49 AM

#2,350,458

Replying to dankaiv (#2,335,052)

I think so, too. IMO this will open new ways for software development. There has already been work looking towards RL to find bugs in games. Like climbing walls that you should not. With a multimodal model there might be interesting new ways to debug and develop UIs.

thePaddyMK t1_jdlr6bp wrote on March 25, 2023 at 9:52 AM

#2,350,470

Replying to plocco-tocco (#2,341,467)

There is a paper that operates a website to generate traces of data to sidestep tools like Selenium: https://mediatum.ub.tum.de/doc/1701445/1701445.pdf

It's only a simple NN, though, no LLM behind it.

SeymourBits t1_jdlwrgi wrote on March 25, 2023 at 11:09 AM

#2,350,900

Replying to itsnotlupus (#2,340,814)

This is the most accurate comment I've come across. The entire system is only as good and granular as the CLIP text description that's passed into GPT-4 which then has to "imagine" the described image, often with varying degrees of hallucinations. I've used it and can confirm it is currently not possible to operate anything close to a GUI with the current approach.

MjrK t1_jdm4ola wrote on March 25, 2023 at 12:37 PM

#2,351,743

Replying to modcowboy (#2,348,085)

For many (perhaps these days, most) use cases, absolutely! The advantage of vision in some others might be interacting more directly with the browser itself, as well as other applications, and multi-tasking... perhaps similar to the way we use PCs and mobile devices to accomplish more complex tasks

alexmin93 t1_jdmocbw wrote on March 25, 2023 at 3:18 PM

#2,353,937

The problem is that LLMs aren't capable to make decisions. While GPT-4 can chat almost like a sentient being, it's not sentient at all. It's not able to coprehend the limitations of it's knowledge and capabilities. It's extremely hard to make it call an API to ask for more context. There's no way it will be good at using a computer like a user. It can predict what wappens if you do something but it won't be able to take some action. It's a dataset limitation mostly, it's relatively easy to train language models as there's almost infinite ammount of text on the Internet. But are there any condition-action kind of datasets? You'd need to observe human behavior for millenias (or install some tracker software on thousands of workstations and observe users behavior for years)

Professional_Price89 t1_jdn6lhn wrote on March 25, 2023 at 5:28 PM

#2,356,023

Lets integrate winSpy with it

nixed9 t1_jdnm1qx wrote on March 25, 2023 at 7:17 PM

#2,357,576

Replying to fiftyfourseventeen (#2,350,068)

> Sparks of Artificial General Intelligence: Early experiments with GPT-4

https://arxiv.org/pdf/2303.12712.pdf

fiftyfourseventeen t1_jdnqlqc wrote on March 25, 2023 at 7:50 PM

#2,358,092

Replying to nixed9 (#2,357,576)

That's really cool, but I mean, it's published by Microsoft which is working with openAI, and it's a commerical closed source product. It's in their best interest to brag about it's capabilities as much as possible.

There are maybe sparks of AGI, but there are a lot of problems that are going to be very difficult to solve that people have been trying to solve for decades.

Single_Blueberry t1_jdnyc2d wrote on March 25, 2023 at 8:46 PM

#2,358,961

Replying to MassiveIndependence8 (#2,349,107)

Hmm I don't know. It's pretty bad at getting dead-on accurate results, but in many cases the relative error of the result is pretty low.

snylekkie t1_jdo5afc wrote on March 25, 2023 at 9:38 PM

#2,359,649

Replying to Jean-Porte (#2,341,551)

Absolutely mental

signed7 t1_jdoy969 wrote on March 26, 2023 at 1:19 AM

#2,362,635

Replying to MassiveIndependence8 (#2,349,134)

> LLM aren’t fined tune to process API

GPT-4 isn't. If plugins becomes a success, I reckon GPT-5 will be.

[deleted] t1_jdp4ndn wrote on March 26, 2023 at 2:07 AM

#2,363,327

Replying to rePAN6517 (#2,346,513)

[removed]

zaidbhat t1_jdrriab wrote on March 26, 2023 at 5:59 PM

#2,374,177

But far-fetched

Qzx1 t1_jdv429m wrote on March 27, 2023 at 12:49 PM

#2,389,847

Replying to RustaceanNation (#2,338,122)

Source?

Suspicious-Box- t1_jdzj7wr wrote on March 28, 2023 at 10:16 AM

#2,412,659

Replying to BinarySplit (#2,333,436)

Just need training for that. Its amazing but what could it do with camera vision into the world and a robot body. Would it need specific training or could it brute force its way to moving a limb. The model would need to be able to improve itself real time though.

shitasspetfuckers t1_je1v7pf wrote on March 28, 2023 at 8:21 PM

#2,426,977

Replying to reditum (#2,338,602)

Can you please clarify what specifically about their approach wasn't great?

reditum t1_je29q9y wrote on March 28, 2023 at 9:54 PM

#2,429,484

Replying to shitasspetfuckers (#2,426,977)

From a comment on HackerNews they made a Chrome extension, gathering all the training data from it, and it runs super slowly as well.

emissaryo t1_je4jvzt wrote on March 29, 2023 at 11:19 AM

#2,443,973

Now I'm even more concerned about privacy. Governments will use it for surveillance and the more modalities we add, the more surveillance there will be.

shitasspetfuckers t1_jed796l wrote on March 31, 2023 at 3:52 AM

#2,514,970

Replying to Qzx1 (#2,389,847)

> Google's Spotlight paper

https://ai.googleblog.com/2023/02/a-vision-language-approach-for.html

shitasspetfuckers t1_jed7vuu wrote on March 31, 2023 at 3:58 AM

#2,515,150

Replying to SeymourBits (#2,350,900)

Can you please clarify what specifically you have tried, and what was the outcome?

Comments