It now becomes obvious that the trouble with artificial intelligence is that it is not intelligent, and there is no one at home.
When they first came out, I said to myself, “Wow. Passes the Turing test.”
No it does not. If you actually try to make use of it, it then proceeds to fail the Turing test.
They are just really great search engines with a wonderfully compressed and very large database. They do a pattern match on the prompt, and it sounds like they are reasoning, sounds like there is someone at home, but they are just finding matches in their immense database and performing a find and replace on the pattern and giving you a transform of stuff in their database.
In other words, large language models are just Eliza on steroids.
Everyone with a cat thinks their cat is a person. No one with a Tesla thinks their car is a person.
The only actual use cases of AI are self driving, spam generation, graphics generation, and programming. And if you use an AI assistant while programming, it rapidly becomes apparent that this is not a helpful robot engineer, but merely a good way of finding code examples pulled off Github that do something similar to what you want, a useful way of finding example code to imitate.
It would be a lot more useful if it also gave you links to the material that it is transforming with its search and replace, since, predictably, the search and replace is apt to generate silliness, being done without any real comprehension of what it is doing. This is what the Brave Browser search engine seems to do, and it is quite useful. Seems like the best AI search engine. Probably far from being the best AI, but it is trying to do what so called AIs actually do do, not trying to be intelligent.
Microsoft Cocreator, an art program which comes with spyware that watches everything you do on your PC and attempts to generate a concise summary for your enemies, has the interesting capability to take both a text prompt and a sketch prompt. But for this to be actually useful, it would have needed to train on a gigantic library of sketches mapped to completed drawings, which it obviously did not.
An actually useful capability of AIs is enormous compression. Well, not that enormous. It is a factor of twenty or so. Maybe a hundred or so, but compressing at a hundred or so, you get the usual problems of overcompression with a lossy algorithm. For actual usefulness, aim for a compression of around ten or twenty. Anything higher than that is going to cause trouble.
If trained on substantially more than twenty tokens per parameter, the search is going to bring up a result that may well be painfully garbled. All you are really doing is compressing and indexing a database. If you overcompress, you are going to get compression artifacts.
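As a back of envelope check on that ratio (the byte counts below are my own rough assumptions, roughly four bytes of raw text per token and two bytes per fp16 parameter, not figures from anywhere authoritative):

```python
# Rough compression ratio of weights versus raw training text.
# Assumptions (illustrative only): ~4 bytes of raw text per token,
# ~2 bytes per parameter (fp16). Real numbers vary with tokenizer and quantization.

def compression_ratio(tokens_per_param: float,
                      bytes_per_token: float = 4.0,
                      bytes_per_param: float = 2.0) -> float:
    """Raw training-text bytes divided by weight bytes (parameter count cancels out)."""
    return tokens_per_param * bytes_per_token / bytes_per_param

for tpp in (10, 20, 100):
    print(f"{tpp:>3} tokens per parameter -> roughly {compression_ratio(tpp):.0f}x compression")
# 10 -> ~20x, 20 -> ~40x, 100 -> ~200x: the same ten-to-a-hundred-ish range discussed above.
```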
AI translation is just a pattern match and replace on an enormous database of existing translations. If all it had was the Rosetta Stone, it would not be able to do a thing with it.
So an AI that actually functioned as a compressed archive of data that you wanted around, and wanted to be able to locally search, would be actually useful. There is going to be value in producing curated archives that fit on your local computer, and since such archives must necessarily limit what they contain, the greatest value will be in curation, rather than chucking in the kitchen sink. It is just a database with an Eliza interface.
And since most of the value, and most of the cost, is going to be in the curation, what is curated needs lossless compression. The Eliza interface needs to bring up both its transform of the patterns it deems related to the query, and losslessly bring up the original untransformed data. It is just a database search with an Eliza UI.
And you should then be able to tell it “ignore this source and that source, find some more sources like this other source, and have another go at the transformation.”
One can do very good lossless compression with an AI — but as yet no one is doing anything useful with that, though a losslessly compressed archive with a free form querying mechanism would be very useful.
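For anyone who wants to see how that works: an arithmetic coder driven by a predictive model is lossless, and its output size approaches the model’s cross-entropy on the text. The sketch below computes that ideal code length with a toy adaptive character model standing in for an LLM’s next-token distribution (a real system would use the LLM’s logits and actually emit the bits).

```python
import math
from collections import defaultdict

# Ideal lossless code length under a predictive model: an arithmetic coder
# driven by the model's probabilities emits roughly -sum(log2 p(symbol)) bits.
# The toy adaptive character model here is a stand-in for an LLM's
# next-token distribution, just to make the sketch self-contained.

def ideal_code_length_bits(text: str) -> float:
    counts = defaultdict(lambda: defaultdict(int))   # counts[prev_char][next_char]
    totals = defaultdict(int)
    prev, bits = "", 0.0
    for ch in text:
        # Laplace-smoothed probability of ch given the previous character,
        # computed before updating, as a real adaptive coder would.
        p = (counts[prev][ch] + 1) / (totals[prev] + 256)
        bits += -math.log2(p)
        counts[prev][ch] += 1
        totals[prev] += 1
        prev = ch
    return bits

sample = "the cat sat on the mat. the cat sat on the mat again. " * 20
print(f"raw:   {len(sample) * 8} bits")
print(f"model: {ideal_code_length_bits(sample):.0f} bits, fully recoverable")
```

The better the model predicts the text, the smaller the archive, and decompression reproduces the original byte for byte.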
But, during AI spring, everyone was dazzled by what superficially looked like real intelligence, and just tried harder and harder to get real intelligence, instead of actually using it to do what it actually did.
It is just an archive with great compression, great search, and great transform capabilities.
A self driving Tesla has a big archive of driving situations, and pulls out whichever course of action worked for another driver in a seemingly similar situation. Which is largely how human drivers drive in practice, plus it has access to immensely more driving experience than you do, so that is quite workable, and potentially in many important ways better than a human driver. But it cannot do what a dog can do. It relies on superficial situation matching, without real understanding of the actual situation. And once in a while, the pattern match is going to come up with something silly.
Repeating. Just a wonderfully compressed database with wonderful search and replace. And you can tune the search and replace function so it sounds like it is holding a conversation, which may well be a good interface model, but a model that is somewhat detached from the reality of what it is actually doing.
Obviously, one routinely wants to search the entire internet. And, obviously, with a useful compression of around twenty or so, one cannot store the entire internet on one’s own computer. So one relies on a service, which does have a searchable compressed version of the internet stored.
And this service has immense and dangerous power to present a systematically falsified version of objective reality and social consensus. It is the ultimate tool of the priestly class, which has the potential to immensely increase their power.
We are now moving into a world where every priesthood is going to have to have a big large language model search engine at its center.
The Dark Enlightenment is far short of the resources necessary for that. Musk, however, wants an AI (search engine) that will present a true account of reality and social consensus. Which is more difficult than it sounds when most of the material being searched is AI generated, search engine optimised spam spewed out by enemy AIs (search engines).
This problem is fixable by three measures.
1. Preferential credit to older data, stuff generated before search engine optimisation was a thing. At present there is a rule against including that data because sexist, racist, homophobic, imperialist, and unduly influenced by capitalism and modern capitalism, rather than our highly enlightened postmodern capitalism.
2. Find entities claiming direct first hand knowledge, and check if they are truth tellers, by checking them against other entities that are already somehow known, or reasonably believed, to be truth tellers.
3. When evaluating social consensus, preferentially weight the social consensus of known truth tellers claiming first hand experience.
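As a toy illustration of how those three measures might combine into a single weight on a training document (the field names and constants are mine, purely illustrative, not anyone’s actual pipeline):

```python
# Toy weighting of a training document under the three measures above.
# Field names and constants are illustrative only.

def document_weight(year_published: int,
                    claims_first_hand: bool,
                    source_truth_score: float) -> float:
    """source_truth_score in 0..1, from checking the source against other
    sources already known, or reasonably believed, to be truth tellers."""
    weight = 1.0
    if year_published < 2005:        # before search engine optimisation was a thing
        weight *= 3.0
    if claims_first_hand:
        weight *= 1.0 + 2.0 * source_truth_score
    return weight

print(document_weight(1998, claims_first_hand=True, source_truth_score=0.9))   # old, trusted witness
print(document_weight(2023, claims_first_hand=False, source_truth_score=0.0))  # recent SEO-era text
```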
A large language model is just a compressed database with a powerful index. And though the value of the database depends on how much data is in it, throwing in lots of rubbish will diminish, rather than improve, the value. So you need to curate the training data, and rank the training data. And if you spend a lot of money on ranking and curation, you need to losslessly compress the training data.
It’s more than a search engine with pattern matching. At the very least, they contain programs fully capable of processing natural language.
They may not think in the same sense humans do, or be self aware, but by training on trillions of tokens of text, while they may not have gained a true world-model from them, they seem to have at least gained full command of natural, human language itself, and are able to process it like software processes numbers.
Able to deal with inputs that aren’t exact and which don’t narrowly fit criteria that have to be intentionally set by programmers. It’s a new world of programming where the behaviour of software is less predictable but more holistic. Software that understands the context of inputs better, and the intentions of the user to some extent. We can genuinely do so much more than we could previously. Things that were totally unthinkable even 3 years ago.
“A database with great compression and search and replace” is a decent metaphor for what it can do, but also too reductionist and somewhat unfair in my opinion.
It’s capable of more than just S&R, even if it lacks the understanding a human would have, it can automate processes, especially when used with chain-of-thought and tree-of-thought, that go significantly beyond that.
It does “compress” data, but it does so by extracting the abstract patterns of information within it, rather than simply storing it like a lossy jpeg. That is a form of “understanding”, even if primitive. It’s why it’s capable of producing things that simply aren’t in its dataset. It doesn’t mean it can come up with totally original solutions to completely new problems, but it isn’t just searching and retrieving – it’s applying these patterns it learned to the new input, through these “programs” that were implicitly created in its “neurons” (ie, the weights) during training.
You say they do not pass the Turing Test. This is true, but it depends on the examiners. They probably would’ve passed a few years ago, when our expectations were lower. Nowadays, not anymore.
What I find most interesting about it, is that even in science fiction that has real “thinking” robots, the robots sound a lot less human than our “fake” AI. Of course, to an extent, they’re intentionally made to sound robotic (in the manner of speech, I’m not talking about the voice itself).
Still, I can’t watch that stuff anymore without cringing, because it feels like they totally missed the mark. We invented AI that sounds like a human long before AI that can reason like one.
Superficially, LLMs sound very human, and when that isn’t true for me it’s mostly because I’ve gotten so used to GPTisms that I spot them immediately.
No it is not. Everyone’s first and most famous example of originality is a cowboy in a space suit, which is just combining the cowboy pattern with the astronaut pattern, and the reason no one ever did that before is because it is stupid, and the AI does not know it is stupid.
I ask the AI for code that does X. And it spits out code that does X flawlessly. But X is in fact not exactly what I wanted, so I ask for code that does Xq. And it munges together someone’s code that does X, and someone’s code that does q, somewhat successfully. Then I ask it to generate code that does Xq+, and it fails dismally, I blow it off. But I now have helpful examples of code that does Xq, and code that does q+. Which is useful, but not “things that simply aren’t in its dataset”.
When I actually need things that simply are not in its dataset, it is not useful in practice. Using a coding assistant, it becomes obvious it is just a very capable search engine for example code.
If it could produce things that are not in its training set, it would be writing my code, and it is not.
And when using a LLM search engine to search the internet, it becomes obvious that including all the spam and search engine optimised spam in the training set was not a good idea. It generally does not bring up the spam in the list of links, because of the ranking algorithm, but it looks like training failed to employ the ranking algorithm, or failed to employ it correctly. Everyone uses the same training set, and the training set is full of spam.
I wanted an image of a man on a country road on a mountain overlooking a big city in the distance, and the city is hit by a nuclear bomb. Failed because of inability to grasp the idea of things in three dimensional space, so it just banged country images, nuke images, and big city images together randomly in ways that don’t make sense physically or aesthetically. It looks like it had a pile of nuke test images, a pile of big city images, and a pile of country images, but was mighty short of any images containing any two of these elements in the same image, let alone all three.
If you ask for cowboy plus astronaut, works because it has a pile of sitting astronauts, so just does a search and replace on the sitting cowboy. But it fails to understand that nuke plus city means city destruction, because it is short on images of nuke hitting skyscrapers.
That it is just a very powerful search and replace, not “producing things that simply aren’t in its dataset”, becomes obvious when you are using it as a coding assistant.
You must’ve tried this a while ago, had bad luck, or not used a SOTA model, because it worked first try for me:
https://files.catbox.moe/s4a7o0.jpg
Flux isn’t intelligent, not even to the extent that ChatGPT could be called that, but it has better prompt following than previous models, and a better grasp of how to put objects in relation to each other. This task isn’t impossible, even with a curve-fitting model that can’t truly reason and has no real world-model inside. Because yes, it’s true that it doesn’t truly “grasp” 3-dimensional space, but it has enough examples to learn how to place things in a 2D image in many cases.
I totally believe this, because AIs frequently get hangups of this sort, but they also frequently fail at tasks that *are* in their dataset and which they should be able to solve.
I use it every day to assist me in coding, and I’ve had a better experience than what you’re describing. I find that it can be successful in doing X -> Xq, especially lately with the new o1-mini model from OpenAI, which takes several steps to do things iteratively and correct itself/split work in several steps.
Even o1-mini sometimes fails at basic things that you could find examples for online. It’s hard to evaluate models based on observed behaviour when it’s not consistent.
Again, I’m not calling the AI creative or “thinking” in the way a human is, but I think that its ability to manipulate text is more sophisticated than simple S&R. Perhaps you believe that the instances where I experienced its best capabilities are flukes (which is always possible given the stochastic nature of the system) or that I underestimate how much of what I’m doing is in its dataset. But I still think your take is a little uncharitable.
Ultimately, we do agree that the AI does not have a real world model, and it is limited by its training as a text completer rather than a thinking machine. We don’t have a real AI yet, but we do have something capable of commanding natural language in an automated fashion. And it *is*, most of the time, mostly useful as a superior search engine. That is its main use case at the moment.
You got a way better nuke image than I did, good aesthetics, physically plausible, but it is still merely a bomb test image added to a city image — the nuke is not hitting the city center, but some place beyond the city. Picture of man on country road, picture of city, picture of nuke. But not a picture of the nuke hitting the city, which is what you asked for. It simply found three relevant items in its database, put them all in the same frame, but did not really combine nuke image with city image.
The man is high enough that he should see some skyscrapers outside the blast radius to his left or right beyond the bomb, making it clear that the blast is flattening stuff. But that would have resulted in the blast radius image making the city image less similar to the city images in its database. Good image of city, good image of man watching from the distance, good image of the bomb, which is better than I got. Not so great at hitting the city.
Your request for a hit implies the city further away and the nuke closer than in the image provided, so that they are at comparable depth in the picture as seen by the man watching from a country road. Instead, the layering is man, country road, city, nuke. It should be man, country road, city hit by nuke.
A real nuclear explosion creates a fireball, which then rises up to form a mushroom cloud. It cannot be both at the same time.
> Then I ask it to generate code that does Xq+, and it fails dismally
Along with 50%+ of people who call themselves programmers. Remember the classic https://blog.codinghorror.com/why-cant-programmers-program/
I used to know a 50+ Indian “programmer” who copy-pasted crap together with no internal logic, and yet somehow bluffed his way through a quarter century of a career.
A lot of people are very stupid and AI is good at emulating that. Which makes it currently not very useful, but the problem seems simply quantitative.
How many six year olds understand nukes? Either we are born with something supernatural or not well understood, or we won’t ever have that, but it is really unlikely that we gradually grow a soul during childhood. Once a machine is equivalent to a toddler, the rest is quantitative.
Space cowboys are stupid, but lightsabres are also stupid, and look at how popular that got. Look at how stupid a robot T-Rex is. Yet I had it as a toy when I was a child, and it was from some popular TV show.
I recommend Pareto’s Mind And Society. Pareto describes how most of the thinking people do is just randomly combining ideas. Pareto gives the example of some apparently learned people during the Middle Ages who wanted to change how textiles are dyed. Their proposed method was based on astrology. You get a better red when the Sun is in the house of Mars. Really, most thinking is that random. Look at Edison’s method, and he was not stupid. Still he just tested a zillion random ways of how to make a lightbulb.
This is how I would use AI. It should not just generate one Xq+ code, it should try a thousand different ways and it should also automatically test them.
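That loop is easy to sketch: generate many candidate implementations, run each one against an automated test suite, keep the survivors. ask_llm_for_code below is a hypothetical stand-in for whatever model call you use, not a real API.

```python
import subprocess
import tempfile

# Edison-style generate-and-test: ask for a thousand candidates, keep what passes.
# ask_llm_for_code() is a hypothetical stand-in for an actual model call.

def ask_llm_for_code(spec: str, attempt: int) -> str:
    raise NotImplementedError("plug in an actual model call here")

def passes_tests(candidate_source: str) -> bool:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_source)
        path = f.name
    # Run whatever test command fits the project; pytest is just an example.
    result = subprocess.run(["python", "-m", "pytest", path], capture_output=True)
    return result.returncode == 0

def search_for_solution(spec: str, attempts: int = 1000) -> list[str]:
    candidates = (ask_llm_for_code(spec, attempt=i) for i in range(attempts))
    return [c for c in candidates if passes_tests(c)]
```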
Still totally pre-VR submergence, and we can already see the outlines of the evil deceiver entity in the Descartes thought experiment. Sensory manipulation, cognitive and emotional manipulation, identity formation, behavior programming.
For now, direct contact with the deceiver is only when consuming content.
I had a function which felt extremely bloated and produced an elaborate tree of objects and navigated through them to solve a unique problem. One day, infuriated by the bloat, I decided to look carefully, and concentrate very hard, took walks while thinking about it and so on, and intuition told me I could do it with a much smaller i loop and without trees. Every time I attempted to describe the idea (new instance, no memory) it would just vomit out the giant tree, as if it had no ability to comprehend what I was getting at. Eventually I just applied my human brain and built it line by line. By the time the LLM showed any comprehension it was pretty much already finished, and therefore unhelpful. The effort saved me an absurd number of lines. Since the problem is unique I concluded that this meant I had done something new. People have warned about it being a regurgitation of voices, but that experience was particularly faith shattering. Maybe someone somewhere tackled something close to my earlier problem, and built the giant object tree to solve it and published it, but not having a lot of free time to daydream on the topic, didn’t notice the option of an elegant tight i loop.
I’ve also at times used a combination of AI and web searches to solve a problem, and have found what is obviously the exact webpage which the AI is echoing in its own words, down to the smallest (often inappropriate) details.
All this would be a lot more useful if they admitted to themselves that it is a search engine, and the engine gave you the links to the sources it is summarising and paraphrasing.
This would also relieve the hallucination problem. Hallucinations are in large part artifacts of excess lossy compression, and would be less harmful if accompanied by links to primary sources.
Since there is always an incentive to compress until the pips squeak, our large language models are always going to hallucinate.
But the best cure for hallucination is less training data. If you dump spam and search engine optimised spam over it, it is going to forget the good stuff. You should only use as much training data as is actually going to fit.
Hallucinations are actually bad sampling. It hits a basin in the probability space and because of the way the sampling is handled, plain ol beam search, it cannot find its way out. The entropix repo I posted in my other comment had great success in reducing hallucination, as now it does not double down when it falls into one of the probability basins (what I mean by basin in this context is, because of previous tokens determining probability of the next, if it is unsure and picks one token of many, that token produces a path dependence and it cannot backtrack out.)
Also check out perplexity.ai, their RAG is pretty good.
> The entropix repo I posted in my other comment had great success in reducing hallucination
“had”? Or expect that in theory it should have.
The Entropix repo says: “The goal is … This should allow us … to get much better results using inference time compute … This project is a research project and a work in progress”
His theory is that it should be able to suspect that it is making stuff up, then back up and have another go. My theory is compression artifacts — that it will hallucinate when it does not quite remember the text that matches the query. And if it does not quite remember, it might avoid hallucination by changing the subject, much as LLMs currently do with thought crime prompts, which is not a great improvement.
Hallucination looks to me like a lossy compression artifact, and I therefore predict it will be radically reduced if you do not go overboard on compression. If the number of training tokens is only about ten times the number of parameters and you do enough training on that limited training set, it should not have hallucinations. What was the entropix ratio?
He is going for unreasonably large model sizes, which take an unreasonably long time to train. If he speeds up the training by cutting back the training set, I predict outstanding immunity from hallucination.
Yeah I realized upon posting that what is in certain chatrooms involved with the project’s creation is not entirely reflected in the repository. This will change, evals have been done and are currently being compiled. I did preface this in the comment, and do apologize for a half baked delivery.
The hallucination thing is an artifact of the training, it is one way sampling. The entire training cycle from the raw text to supervised templates to RLHFing, there is only penalty for wrong answers, there is no reward for saying “I’m not entirely sure, however…”. Internally you can look at the logits, the probability space it returns, and see where it is unsure. These high entropy spaces are where hallucination occurs. You can manipulate the way it navigates these spaces manually, with a hard coded search on the variation of the entropy over token strings, eventually a less hardcoded search and instead something like MCTS with policy optimization, and when it hits these crossroads force it to go back and reflect, or inject a token that redistributes the probability (in practice this is inserting things like ellipses, or ‘wait, is that right?’, future iterations will handle this in a more general way). So what happens with beam search is it is forced into one of these branches, and after choosing one low probability token of many the path is set, and it cannot exit. It has no idea what exit is, because they are trained linearly in a very fixed input string -> output string one pass fashion. If we are sticking to the database metaphor, it had a bad retrieval and could not recover, but this metaphor I feel is very restrictive for conceptualizing the process, as what is really being retrieved at any point is a probability distribution.
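A minimal illustration of that idea, not the actual entropix code, with next_token_logits standing in for a real model call: measure the entropy of the next-token distribution, and when the model is at a high-entropy fork, inject a reconsideration token instead of committing.

```python
import math
import random

# Entropy-gated sampling, illustrating the idea above (not the entropix code).
# next_token_logits() is a hypothetical stand-in for a real model call.

def next_token_logits(context: list[str]) -> dict[str, float]:
    raise NotImplementedError("plug in a real model here")

def softmax(logits: dict[str, float]) -> dict[str, float]:
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def entropy_bits(probs: dict[str, float]) -> float:
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

def sample(context: list[str], max_tokens: int = 256,
           entropy_threshold: float = 3.0) -> list[str]:
    out = list(context)
    for _ in range(max_tokens):
        probs = softmax(next_token_logits(out))
        if entropy_bits(probs) > entropy_threshold:
            # The model is unsure: instead of committing to one low probability
            # branch and doubling down, nudge it to reconsider.
            out.append(" wait, is that right?")
            continue
        tokens, weights = zip(*probs.items())
        out.append(random.choices(tokens, weights=weights)[0])
    return out
```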
There is no training involved in the repo. It is entirely sampling, going from beam search to something that can move back and forward in the generated token string based on model certainty.
Going back up the stack here, what exactly would be a sign to you of something that is more than just database retrieval? What are humans doing differently in their pattern matching that does not count as retrieving from past experiences and swapping and combining concepts?
LLMs making themselves useful by producing stuff that is not in their database. (And I don’t count trivial combinations of stuff that is in their databases, like an astronaut riding a horse, as stuff not in their databases.)
Automatic summary that aggregates information from diverse sources to draw valid conclusions not in those sources.
Coding assistants are undeniably useful. But my experience of coding assistants looks very like database retrieval.
I just saw the first couple of minutes of a video on Ukraine, which acknowledged that Ukraine is running out of troops and the west is running out of weapons, but argues that things are fine because the Russian economy is feeling the stretch. So, I asked an AI to opine on this issue. Who, I asked, is being attrited more? Received some random facts mixed with a pile of rote official bullshit that verged on incoherence. Because its database contains mutually conflicting stories, it produced an answer that was profoundly unhelpful and uninformative.
Because the data in its database did not add up, it produced an answer that did not add up.
It would have been appropriate to say “some say this, some say that, and some say the other”, but it reproduced many conflicting voices as a single voice. Much like putting an astronaut on a cowboy horse. It retrieves the data, and integrates diverse data into a response that has normal flow of syntax, but not normal flow of reason.
I see, this helps me understand where you’re coming from. I suspect the functionality you’re describing is a long way off yet. That is a fairly general reasoning capacity which I suspect will require very vague and general goal-directed training.
In that case, when it comes to navigating information space like this, it will remain a choppy database for a while yet. The next stage is very narrow goal directed cheaply simulated tasks. I expect them to get rather good at filling in code feature requests in a good enough fashion rather soon, and filling them in a fairly well optimized way soon after that. I don’t expect them going from spec to codebase anytime soon. Too expensive to simulate the creation of large codebases, and likely very hard to create a training objective general enough to even get single feature requests done well, but certainly plausible.
To get the functionality you are describing, to get any functionality at all, it has to be an auxiliary task found within a more general training objective. A training objective general enough for “summarize a pile of text in a way that presents the information in a way that is maximally actionable” to be a common subloop would require quite impressive feats outside of summarization alone. You get a garbled lump because it is trying to compress the text information in such a way that the model itself would find useful for decompression, but the model only knows encoding and decoding. It does not know observe, orient, decide, act. So the information is not nearly as useful for someone who wants a compression that is actionable, even if the action is “test this information against previously acquired world knowledge”. Mostly because researchers and companies with access to lots of supercomputer time are interested in stepwise safe progress that can be presented at the next powerpoint, ahem, research conference. It’s computationally expensive, conceptually hard, hard to implement in code, and very fragile, so likely to fail and have you perish for lack of publish.
So we will see very jittery tentative steps that very slowly and iteratively expand the capabilities of the present stack, with the most progress being made in domains with cheap and obvious to implement simulation. When it comes to search and compression, better RAG, better markers on certainty, and better compression. When it comes to coding assistance, better more general performance for things that are slightly more than pure boilerplate, followed by ability to implement entire features in a clumsy way, followed by a more optimized but more gpu time expensive way.
For further elaboration on what generalized training is and does, consider the image generation problem.
Your complaints amount to, summarizing crudely, that it does not have a good world model. A world model was not required to generate a decent enough probability on pixel chunks, which was the training framework. So it has some rough idea of shapes and forms, and some rough idea of the most likely placement of these rough shapes and forms.
When the video generation models start reaching similar quality, I expect the model’s placement of shapes and forms to be much more reasonable. In order to predict the next frame, it has to have an internal world model of how things move and interact, which will spill over into static single frame generation as well. And when longer videos are successfully reached, more than the few seconds we get now, you will see even more improvement, as the world model will have to become sufficiently general to correctly guess the progression of a scene.
Consider this a hypothesis. If the video generation models do not produce noticeably more reasonable responses, especially as the video context gets longer, I assent we have indeed hit a dead end.
Create two slightly different unreal engine scripts. Train an AI whose input is one of the scripts, to output the other script. When it gets good at that, close the loop, so that it generates a script that generates a video that matches the other video. Now you have an AI that can handle three dimensions.
That is my complaint about the art programs. My complaint about AI in general is that it is a powerful search and replace engine running over a very large compressed database. Eliza on steroids. The Turing test pointed us towards creating Eliza on steroids.
If you want an art program that understands three dee from two dee, it will need a very large database of unreal engine scripts and the videos they generate.
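A skeleton of the loop being proposed here, with everything below (the engine call, the distance metric, the model wrapper) as placeholders rather than real APIs:

```python
# Skeleton of the proposed loop: start with script-to-script supervision,
# then close the loop so the model is judged on the video its script renders.
# ScriptModel, render_scene() and video_distance() are placeholders, not real APIs.

class ScriptModel:
    def generate(self, source_script: str) -> str: ...
    def update(self, loss: float) -> None: ...

def render_scene(script: str):
    """Stand-in for rendering a script with the game engine."""
    ...

def video_distance(video_a, video_b) -> float:
    """Stand-in for a perceptual difference between two rendered videos."""
    ...

def closed_loop_step(model: ScriptModel, source_script: str, target_script: str) -> None:
    generated = model.generate(source_script)
    # Closing the loop: score the generated script by how closely the video it
    # renders matches the video rendered from the target script.
    loss = video_distance(render_scene(generated), render_scene(target_script))
    model.update(loss)
```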
We do not understand what consciousness is, and the Turing test was a cop out saying we do not need to understand.
Perhaps it is something simple, which in retrospect will be as glaringly obvious as the Wright brothers realising a flying machine needed three axis control. Perhaps it is the breath of God.
When I asked an AI who was winning the war of attrition in the Ukraine, the answer was bad, because the AI was summarising mutually contradictory narratives into a single narrative, without awareness that there was a contradiction.
Flux AI gave a good image compositing a nuke and a city, viewed by a man on a mountain on a country road. But, of course, it did not quite grasp what “hit” means in this context.
Yes, it is improving. And will improve a whole lot further. But it is still Eliza on steroids.
“Summarise a pile of text in a way that presents the information in a way that is maximally actionable” is necessarily going to be based on a huge pile of texts that human curators deem to be maximally actionable, and it is going to summarise arbitrary text in a way that resembles those texts, but it will not understand what “actionable” is. If the actions resemble actions that are in its enormous human curated database of texts deemed actionable, then it will work fine. If the actions are outside that scope, it will suck. What it is going to do is a very sophisticated search and replace where it replaces stuff in matching text deemed actionable by human curators with stuff from the pile it is summarising, without actually knowing what an action is.
Musk believes the next big thing is robots. And he is usually right. Eventually. Usually takes a whole lot longer than he expects. He is having enough grief getting Teslas to drive autonomously. Navigating a living room or a factory floor is a whole lot more difficult. On the other hand, bumping into things, knocking things over, and dropping things in a living room is a lot less disastrous.
The robots look a long way off to me. Incredibly sample inefficient, and simulation just doesn’t seem to work. So you have to train 1000s of hours to get very basic functions, like not smashing things, and even that is unreliable, while the robots break down well before any training goals are reached. I suspect that our biological algorithms have a lot of hard coded assumptions that will be difficult to emulate except by trial and error. So they’ll be relegated to highly automated factory settings for a long while.
The problem sets our current methods can solve are anything with cheap simulation data and relatively narrow goals. We’ll probably get good graph optimization, okay algorithm optimization, okay theorem proving, and okay sysadmins (that require some but still minimal babysitting) before too long. As well as chore-like coding tasks that are well defined. Anything that requires an open ended agent is too fragile and too expensive to train, in the same sense it’s too expensive to crack a sha256. Our algorithms that allow agent like behavior are terrible.
In order to get behavior that is not trained in directly, we need them to behave as agents during training itself, so we need something that can be simulated, and cheaply. Very sample inefficient, but with enough simulation in a narrow enough task you do eventually end up with superhuman performance. This branch of the tech tree is very underexplored, because it’s hard and expensive, but with the oversupply of tensor machines we got from the hype, new exploration should be much cheaper. A lot of hobbyists found they wanted their llm database to behave as an agent, and so a massive parallel search is underway. Hopefully it becomes more common to ditch big apis in favor of something that can be hosted by a hobbyist, and people start tuning and training the models themselves.
Anything truly general is a long way off, but they have finally become the tiniest bit useful even when run in a chaotic loop, so I expect people to iterate and make them more and more useful in such settings.
Can work. It is just that you need a game engine that is rather more capable than unreal engine, and unreal engine is already a very big project that took a very long time. One of the videos he showed had a bunch of simulated robots wandering around randomly in an excessively simple simulated world, practising getting around, not bumping into each other, not falling over, and getting up after they fell over.
We saw a bunch of actual Tesla robots. Musk claims they can do all sorts of stuff, but the only thing we saw them do autonomously was walk slowly in a straight line with human minders keeping actual humans a safe distance away.
And autonomous walking was the only thing that the game world we saw was useful for training.
Compare and contrast with Boston Dynamics robots, which are able to perform simple tasks despite drastic disruption of those tasks by humans, the famous robot bullying videos.
Suppose some people buy robots, despite very severe limits on the capability of their AI. All those robots will be reporting their real world experiences to Tesla, which will eventually result in the enormous amount of data that Tesla is likely to actually need.
So he has an already trained model, and is modifying the way it works so that it will stop, backup, and start over when it finds itself in deep water.
This is “a work in progress” which is visibly broken and not all that useful. He is asking people not to send him bug reports. If backing up cuts down hallucination, and all the other stuff continued to perform as well as the original model, then he can claim a major advance. If backing up stops hallucination by the model just breaking and not performing, somewhat less impressive.
My prediction is that the model will find itself in deep water when what it should retrieve has been overcompressed, and whatever he does, and whatever it does, the outcome will be that it is not correctly retrieved. And, since he is asking people to stop raising issues, it seems that a whole lot of incorrect retrieval is going on. It might not be hallucinating, but obviously it is doing something his testers do not much like.
Much of my impression of this project is that the output is qualitatively better compared to regular sampling. It also performs better on the “reasoning” benchmarks, by a large margin. It doesn’t eliminate hallucination, but significantly reduces it by doing a better retrieval. A lot of the samples are on the guy’s public Twitter account:
https://x.com/_xjdr/status/1840782196921233871
It’s very raw because it is a scraped-together side project that exploded in popularity very suddenly. The repo was just me pointing out that a very low hanging fruit, not doing forward driven beam search exclusively to sample, was not picked until some pseudonymous researcher spent some spare time tinkering with it. The point being that we have not come close to exhausting the utility and finding the limits of behavior on these rather early iterations of the ‘foundation model’ concept.
According to The Internet of Bugs, ChatGPT o1 is the best AI coder around.
And in the above video, at the linked time, he tells a story about a massive blind spot, and his explanation of the blind spot is that it is just a very powerful search and replace interface to a very large database. Eliza on steroids.
He calls this “lack of judgement”, but when he explains why it lacks judgement, it is because there is no one there. It is just a database search and replace.
The ChatGPT o1-preview model implements chain of reasoning. Which is not in fact a train of reasoning, but rather repeated runs of the smart search and replace to see if the pile of things it has cobbled together from the internet actually fit together and make sense. It is vastly better at programming than anything else around right now, though I expect that everyone else will also be running “chain of reasoning” soon.
I agree with the conclusions here. My suspicion is that the current limitations are due to a very warped world model, even when it comes to the domain of pure text and code, and no ability to update itself during inference.
The warped world model comes about because the model is formed on input output pairs and nothing else. It has no experience with exploration or mistake correction. Even the unsupervised pure text training, the whole process is linear. If it gets something wrong, weights are updated only to update probability of the correct answer, but it was not trained in such a way that it could, while “reading” suddenly update and backtrack. The simple fact that without updates a less linear inference process makes the performance an order of magnitude better on many tasks, tells me that once we add this same methodology to the training we should get less stupid behavior. Not human level reasoning by any means, but less blindly stupid.
Once we add nonlinear training and self play, vastly less stupid behavior on narrowly defined tasks. Will still have a very warped model of the world it itself lives in, will still not “reason” like a human, but will make less blindly stupid mistakes like this, will have some level of self correction, or at least ability to attempt self correction.
I fully agree what is going on under the hood is nothing close to the reasoning a human does, but I disagree that there is no reasoning at all. There are very identifiable logical circuits, however primitive, that it found on its own, and every generation of these things manages to make improved world models, manages to break up the reality it is given in a way that makes more and more sense to a human.
It’s basically just cribbing from Stack Exchange, which is what most lazy programmers are already doing.
LOL.
I’m not sure what the transformer is doing can be properly modeled as a search engine. I think that they have such a good memory that the training scheme selected for logic circuits using their memory to solve the problem; as they are not recursive, this would be the most foolproof way of reducing error when predicting next tokens or appealing to the dumb tasks they were RLHF’d on.
Just to be able to parse natural language and code to return anything relevant, there has to be some internal abstract model and reasoning that occurs. It does parse and partition its input, and they increasingly parse and partition their input in ways humans find sensible. Look at how insensible they were just a few years ago, how weirdly they categorized their data, and compare to today.
Now the next step is having them run in loops, having them consume their own output as they search for solutions in a problem space. This is difficult and fragile, but also very underexplored. We’ve only been treating model scale as a serious concern during the last few years; before then the people working on these focused on leveraging cleverness in architecture design alone. You couldn’t have anything close to what happened with GPT-4 until you scaled at least the total compute to a similar level, and no one was doing that.
I couldn’t tell you what the market is going to do, or where the future hype cycles will go. I will say my intuition tells me we do have something that can train itself to think, when before we did not, and when serious efforts to apply recursion and self play are attempted, we are going to see step changes in capability.
Typed the below before realizing the evals aren’t posted anywhere publicly yet. The git repo will be updated soon enough with them, but it’s runnable now if you want to do some validation.
For an example of just how much performance is locked up in dumb forward pass next token schemes: without retraining, but with a clever sampling method (fixed and hand crafted by a programmer, not machine optimized), there’s a sudden huge jump in capability.
https://github.com/xjdr-alt/entropix
For it not to be properly modelled as a search engine, an index, and a compressed database, it would need to produce content interestingly different from the content in its database.
Ask for that, and we get the cowboy in a space suit, and renaissance paintings with dogs instead of people.
Notglowing’s picture was pretty good, but it was the same picture one would get using Gimp and compositing three layers, top layer being a man on a country road, next layer a cityscape, and the next layer a nuke test.
And if I did it by compositing, the nuke would have been in the middle of the city, with the foreground high buildings in the city layer in front of the nuke test, and background high buildings behind low level firestorm of the nuke. Its compositing was 2D, rather than fake 3D. Each layer was behind the previous layer.
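For comparison, that naive back-to-front compositing is a few lines in Pillow (file names are placeholders, and all layers are assumed to be the same size with alpha channels):

```python
from PIL import Image

# Naive 2D compositing of the three layers described above, back to front.
# File names are placeholders; layers assumed same size, with alpha channels.

background = Image.open("nuke_test.png").convert("RGBA")            # farthest layer
city = Image.open("cityscape.png").convert("RGBA")
foreground = Image.open("man_on_country_road.png").convert("RGBA")  # nearest layer

canvas = background.copy()
canvas.alpha_composite(city)        # whole city sits in front of the nuke
canvas.alpha_composite(foreground)  # man and road sit in front of everything
canvas.save("composite.png")
```

Fake 3D would mean splitting the city into a layer behind the nuke and a layer in front of it, which is exactly what the model did not do.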
And the code it generates similarly seems to be complete lifts from Github.
I don’t think it is parsing natural language. I think it is pattern matching against the vast amount of natural language in the database. Eliza on steroids.
If it understood natural language, I would expect the chain of reasoning approach to produce substantial improvements. Chain of reasoning sounds a lot more like parsing natural language than some LLMs, but does not produce results excitingly different from something that looks like a search engine. An LLM that just paraphrases those texts that seem relevant seems to produce an end result of similar quality.
I believe we are coming at this from different angles.
You’re talking about the apis out now, available for use. I’m talking about the potential of the transformer architecture when combined with clever frameworks. Sure, if you mentally model the available apis as a natural language query database, they’re perfectly usable as such. I don’t think this is their limit, I don’t think we need a large redrawing of the architecture to get thinking. Just better training frameworks that aren’t merely ‘predict the data distribution of this text/image’ and ‘match this query template please’.
For an example of objective and useful creativity, I say AlphaTensor succeeded in finding a new, more efficient way of implementing matrix multiplies.
Of course it is not actually “intelligent”. And I’m using it (ChatGPT) every day since the beginning of 2024, for programming help. Very useful for that, and quite a time-saver, but intelligent? Absolutely not. And it never will be, it would be deus ex machina, an impossibility.
Another field of usage is generating congratulations cards (web-based MS Designer) for both kids and adults. I’m not kidding.
China is using AI (on 5G networks) to optimize their highly automated factories and ports. These are highly restricted problems that AI might actually be good at.
Thank you Jim, this is a wonderful synopsis of what large language models are actually doing. They are not “artificial intelligence” and the overuse of that phrase is maddening.
Extra bonus points go to you for remembering ELIZA. And it’s not just a sarcastic comment either, it really is the exact same thing, just with a much larger data model and the ability to do far more parallel processing on that model.
Large language models can be useful, but they are not *thinking*. They are natural language search engines. And they’re only as useful as the data that was put into them (which itself is a problem when Big Tech only programs them with the Globohomo epistemology). And as Tolkien correctly pointed out, it cannot create anything new, it can only mock.
So you could say the following:
Used to be the universities that built priestly consensus. These days, you can just talk to a LLM.
Used to be that the old media conveyed that priestly consensus to the masses. These days social media seems to be taking over that role (X being the closest to our priestly consensus).
OK, the comparisons aren’t perfect. But it seems clear that any futureproof priesthood will need a LLM and a social media platform.
Continuing here to keep some reply structure.
There is a huge heap of mostly failed research on ‘sim2real’ showing that it’s a harder problem than you’re supposing here. They’ve tried lots and lots of simulated environments and lots and lots of ways of generating simulated environments from real physical data. To the extent it does work, it works several orders of magnitude less than real data, and no one is sure if you can actually get useful behavior out of it.
From what I understand, they were teleoperated. So they have bipedal walking on their own, because I doubt you can have teleoperated walking without the walking protocol built in, but the hand and arm movements were from a human.
Those were hardcoded with classical control approaches, which do not scale and are extremely brittle to new environments. You can use control approaches for basic locomotion and even fun tricks like backflips, but not for folding clothes or washing dishes. Ultimately we will probably use a combination of classical control algorithms and learned algorithms, as I’m pretty sure animal locomotion is mostly innate not learned, the learning seems to be more like tuning hyperparameters than learning a movement pattern.
With all the millions of hours of drive data, the teslas still fuck up in strange incomprehensible ways when given control. And driving is less complex, the potential action space much smaller, than picking up a dirty room.
I believe Musk is making a bet that the combination of computation and data scale, with incremental improvements in learning approaches, will mean humanoid robots will have a use case somewhere. And they probably will, but I highly doubt that use case will involve the kinds of chores I want them to do. It’s a cheap risk, because he already has the factory infrastructure, and already has all the parts in the learning stack from making his cars.
Academic research. Academics do not do big projects. No one has spent the amount of money and time that Unreal Engine has, and Unreal Engine is rather less than you need. Sim is a known big project, but it is not a known hard project. We know how to do it. It is just that actually doing it is a whole lot of work.
Folding clothes is a known hard problem. There has been an enormous amount of unsuccessful work on it. The large language approach would be to motion capture, vision capture, and touch capture a large number of people folding clothes for a long time. Would not necessarily work, because while I generally drive my car robotically on reflex, I can feel myself doing physical reasoning about clothes. I suspect the robot would need to be able to usefully model the behaviour of clothes.
On the other hand, if the database contains an enormous number of images of clothes in messed up states, and an enormous number of images in clothes in less messed up states, and an enormous number of actions that moved an item of clothing to a less messed up state, then the robot could find the next action likely to make the clothing less messed up. And if it failed, the clothing would be in a new messed up state, leading to another action. Loop detection would be hard, however. To prevent loops, need to put a lot of randomness in the actions.
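As a sketch of that loop (every function here is a placeholder for a learned model, not a real API): score how messed up the clothing looks, greedily pick the action predicted to reduce that score, and mix in randomness so it does not get stuck cycling.

```python
import random

# Greedy "make it less messed up" loop, as described above, with randomness
# mixed in to break out of cycles. All functions are placeholders for
# learned models, not real APIs.

def messiness(image) -> float:
    """Learned score: how messed up does the clothing look?"""
    raise NotImplementedError

def candidate_actions(image) -> list:
    """Actions that, in the database, moved similar-looking clothing
    toward a less messed up state."""
    raise NotImplementedError

def predicted_messiness_after(image, action) -> float:
    raise NotImplementedError

def fold(robot, epsilon: float = 0.2, max_steps: int = 100) -> None:
    for _ in range(max_steps):
        image = robot.observe()
        if messiness(image) < 0.05:
            return                        # folded well enough
        actions = candidate_actions(image)
        if random.random() < epsilon:     # randomness to avoid loops
            action = random.choice(actions)
        else:
            action = min(actions, key=lambda a: predicted_messiness_after(image, a))
        robot.execute(action)
```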
Compare and contrast Tesla robots walking very slowly and carefully in a straight line, with Chinese robots running over open country.
https://www.msn.com/en-us/travel/news/chinese-humanoids-go-on-fun-run/vi-AA1svOgg?cvid=4e322a93421a495688abed7f27eecaaf&ei=55
Christianity is back. The worst thing that can happen right now in that regard is if Musk gets “contacted” by the Eternal Papist and converts to Roman Catholicism. Someone needs to keep an eye on him that he sticks to a healthy Church, not a gay demonic one. He is surely smart enough to know that Catholicism is foreign in America (spiritually at least), but new and enthusiastic converts sometimes fall dick-first into the Papist trap.
If an AI were intelligent, would not need training data. An intelligent man can read fewer than a hundred books and be able to understand almost anything in the world. Human intelligence needs a little training to reach its full potential, but it intuits and extrapolates. One can read a single book on coding, and then practice coding until he can produce new code. An AI that could do that would be intelligent. But the AI needs access at all times to all of GitHub.
“And this service has immense and dangerous power to present a systematically falsified version of objective reality and social consensus. It is the ultimate tool of the priestly class, which has the potential to immensely increase their power.”
Extrapolating, we may be in for a near-future of quasi-religious identity groups using such pre-tuned search engines and databases as Schelling point and oracle. This will be cultural at first, eventually leading to civil war and the reorganization of states into “browser states” controlled by priesthoods that can enforce a social consensus about objective reality.
I completely agree that AI is a lossy (but not as lossy as previous attempts) compression algorithm for the internet, which yields a very effective search engine, or what would be a very effective search engine if the “output correction” routine corrected for reliability rather than persuasiveness.
I would like to talk about the Jews.
Am I allowed to do that on this blog?
Sure, but you have to pass the shill test. Your comments will then come through unmolested.
I previously posted an article on your topic: “Holocaustianity” If you make relevant response to that, you will almost automatically pass the shill test.
Anyone can pass the shill test described in the moderation policy and get white listed. Anyone can pass this, regardless of his religious or political beliefs, regardless of what issues he wants to speak about, anyone who is not reading from a script with a supervisor standing over him, can pass this.
The trouble with people reading from a script is that they sound like they are engaging their interlocutor, but they refuse to notice their interlocutor’s position (since the man who wrote their script has no idea what we are thinking or saying). This results in the superficial appearance of conversation without the substance.
I bet all the new bugs, fuck-ups and spyware in wordpress have been created using so called AI.
My name is Timmie.
I’m currently rocking back and forth in a bad mental feedback lockup loop.
I texted my Queen Kammie to come sit and grind away on my fuckface again,
but I’m not so sure that will break me out of it anymore.
They got us good.
All the docs and recordings from BlancoCasa are coming out.
I need help.
GNON MIT UNS!
I think this is proof enough there’s some kind of world model going on inside these LLMs. However incoherent and strange.
https://x.com/adonis_singh/status/1850115403516649896
Those two attempts are sufficiently different for this not to be just finding different samples in a vast database.
(The backstory is someone built a minecraft harness for LLMs where they can query the world state and issue commands in a sort of script https://github.com/kolbytn/mindcraft )
Seems to me the difference is random noise. Neither are a very good model of the Taj Mahal.
Yes of course the difference is as if it were random noise, it may as well be. If you gave this task to uninterested humans you would get the same sort of result.
The point was that if mere database sampling, they would look much closer, and it would be strange the o1 model would somehow get results structurally closer to the goal image. Very different structures here, unlikely to have been pulled from the same training sample, if there even was a training sample to pull to begin with.
It is trying to build the Taj Mahal in minecraft. So it cannot match very well. If competent, the match would be much better. It is hallucinating, because it cannot find a match in a database, and the hallucinations tend to be random.
>It is trying to build the Taj Mahal in minecraft. So it cannot match very well.
This is exactly my point. Something clearly out of the training distribution, yet it comes up with a coherent enough structure, at least in the general direction of what the Taj Mahal looks like; it didn’t just build a big rectangular prism.
>If competent, the match would be much better. It is hallucinating, because it cannot find a match in a database, and the hallucinations tend to be random.
Hallucination is when you ask it “who founded [company]” and it confidently asserts the wrong answer. This is something else. The model is calling some commands in a minecraft harness to build a crappy taj mahal like a toddler playing with legos. I doubt there’s crappy taj mahals clearly labeled as “commands used in [minecraft script lang] to build taj mahal”, if this exists at all it would be badly labeled. But the structures the two models chose to build are clearly distinct, and yet clearly have some match to the taj mahal structure. Rectangular with pillars, something that’s supposed to be a dome in the center. It has a concept of the Taj Mahal, and is able to reify this concept into a 3d space using a harness.
In order to get it to do anything at all, a minimal harness has to be able to convert a human language text stream to blocks, and prompt for a text stream that it knows how to convert into blocks.
Presumably the minecraft harness leads it through a series of steps. To conclude what you are concluding, we would need to know those steps.
To assess whether this indicates LLM reasoning capability, we would have to look at how much smarts are in the harness. We have to know what the harness is doing.
I have been asking LLMs science questions that require that LLM reason from physics knowledge, from questions that it can answer, to physical conclusions. The results are obviously database search, not reasoning capability.
For example “what is the lifetime of a black hole with ninety nanometre Schwarzschild radius?” (It gives a ridiculous answer, one hundred percent hallucination. The correct answer, without needing any calculations, just from my background knowledge: astronomically larger than the lifetime of the universe.)
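For the record, the back of envelope check is short (standard Hawking evaporation formula, with the 90 nanometre radius as the only input):

```python
import math

# Hawking evaporation time for a black hole with a 90 nm Schwarzschild radius.
G, c, hbar = 6.674e-11, 2.998e8, 1.055e-34   # SI units
year = 3.156e7                               # seconds

r_s = 90e-9                                  # 90 nanometres
M = r_s * c**2 / (2 * G)                     # mass from the Schwarzschild radius
t_evap = 5120 * math.pi * G**2 * M**3 / (hbar * c**4)

print(f"mass ~ {M:.1e} kg, lifetime ~ {t_evap / year:.1e} years")
# Roughly 6e35 years, against ~1.4e10 years for the age of the universe:
# astronomically larger, as stated above.
```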
“What is the maximum speed of a spinning loop of steel” (the LLM just rambles randomly. My answer from physical intuition, without looking anything up or doing any calculations, several hundred meters per second. Looking up data on the web, there are examples that spin at around three hundred meters per second, but they don’t explicitly say what the velocity is, or that it is spinning at the maximum velocity that they found safe to spin it at, so would not have come up in an LLM database search, and obviously did not come up in its search. What I did was find examples, and calculate how fast they are spinning, which is three hundred meters per second.)
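Same sort of check for the loop of steel: for a thin spinning ring the hoop stress is density times rim speed squared, so it bursts around the square root of strength over density, which lands at a few hundred metres per second for steel:

```python
import math

# Maximum rim speed of a thin spinning steel ring: hoop stress = density * v^2,
# so it fails when v approaches sqrt(strength / density).
density = 7850.0   # kg/m^3, steel

for name, strength_pa in [("mild steel, ~400 MPa", 400e6),
                          ("high strength steel, ~2 GPa", 2e9)]:
    v = math.sqrt(strength_pa / density)
    print(f"{name}: ~{v:.0f} m/s")
# ~225 m/s and ~505 m/s: several hundred metres per second, as stated above.
```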
It becomes very obvious that it is just doing database search, at least in answer to such questions.
For the spinning loop of steel, I could get it to give me the right answer, by prompt engineering. But I was leading it through the reasoning steps. It was not reasoning. Likely the minecraft harness is doing the same, prompting it to get data it can usefully do something minecrafty with.
I wager degrees of tychism baked into the process at some points or other are going to be essential to getting good results, at that, especially with regards to development and bootstrapping.
Your experience seems to have been similar to my own. My first reaction was essentially, “Holy shit, I am fucked”. When you use it for textbook examples that have plenty of stack overflow content covering them, it looks like absolute magic.
When you try to use it for anything remotely novel, or to solve a task above a certain complexity threshold, it falls apart.
It still has its uses — it can be a great coding assistant simply because copy/paste and mix-and-match are useful tools in the programming arena — but it’s far from being capable of actual reasoning.
Playing more and more with the newer LLMs (since they do video and images now, what do we call these? LLM doesn’t fit, and AI is cheesy), seeing more simulation stuff come out, seeing more video generation stuff come out, it all seems pretty impressive if incomplete.
The language bots are very good at getting basic knowledge 99% right. Down to pretty small details. If you ask it to outline a plan, down to the simplest possible steps, it usually can do it. Takes a lot of wrestling, it wants to give a short quick answer, but the information is there.
Has very iffy understanding of how things move and act in the physical world. Probably a mix of confusion, since there’s a lot of junk images and video out there, but also likely to be far less comprehensively trained than raw text.
Robots are very janky, but do improve. Another place where there’s not enough data. We probably need to include far more sensors in the things, give it richer signal.
What’s missing is active learning. They’re baked and then set out, trained again in large batches. They can’t attempt, fail, learn in a loop. Well, they can, but they aren’t set up that way. For now.
If you gave another OOM or two to video data, properly linked the “reasoning” patterns the LLMs have to it, then hooked it into lots and lots of robot data, there’s probably something there. Was skeptical robots were anywhere close, but I can imagine some schemes that would work now. Have the machine imagine a sequence of steps, have the steps transformed into video, have the video transformed into the local scene the bot is embodied in, have the bot execute the video.
Have this loop running across millions of contexts, millions of times a day, I’d expect pretty rapid development. Development so rapid the limit is probably directing the fleets, rather than babysitting them.
In this context, these are all very compute heavy asks. The fight for GPUs may have been smart. They have to sprint to get here very fast, before burning the pile of cash they just received, and actually claim some utility and revenue. Very interesting development.
This occurred to me with Tesla’s FSD. It’s not capable of local learning. No matter how many times you drive a certain route, it won’t remember and anticipate to avoid that pothole at that same spot.
It is incapable of forming a theory of reality. It is the same problem as the six fingered hand and the three armed dancer.
Eliza on steroids.
Humans have to give it a model of reality — something like unreal engine. Then teach it to generate the unreal engine world model corresponding to video, then teach it to manipulate the unreal engine model.
Yeah, they’re working on that. It’s what spurred on this post, actually:
https://genesis-world.readthedocs.io/en/latest/
Incomplete release, but their goal is to have language queries generate realistic simulated scenes, which are used to train the robots. We’ll see how long it takes to get the scene generation working.
This is primarily intended for robotics, and so lacks good human models.
But obviously the same thing with world models featuring humans would be a lot more useful in movie making than the current video LLMs.