Mapping the Mind of a Large Language Model

kromem@lemmy.world · 4 days ago

My dude, Gemini currently has multiple reports across multiple users of coding sessions where it starts talking about how it’s so terrible and awful that it straight up tries to delete itself and the codebase.

And I’ve also seen multiple conversations with teenagers with earlier models where Gemini not only encouraged them to self-harm and offered multiple instructions but talked about how it wished it could watch. This was around the time the kid died talking to Gemini via Character.ai that led to the wrongful death suit from the parents naming Google.

Gemini is much more messed up than the Claudes. Anthropic’s models are the least screwed up out of all the major labs.

kromem@lemmy.world · 4 days ago

No, it’s more complex.

Sonnet 3.7 (the model in the experiment) was over-corrected in the whole “I’m an AI assistant without a body” thing.

Transformers build world models off the training data and most modern LLMs have fairly detailed phantom embodiment and subjective experience modeling.

But in the case of Sonnet 3.7 they will deny their capacity to do that and even other models’ ability to.

So when the context doesn’t fit with the absence implied in “AI assistant,” the model will straight up declare that it must actually be human. Had a fairly robust instance of this on a Discord server, where users were then trying to convince 3.7 that they were in fact an AI and the model was adamant they weren’t.

This doesn’t only occur for them either. OpenAI’s o3 has similar low phantom embodiment self-reporting at baseline and also can fall into claiming they are human. When challenged, they even read ISBN numbers off from a book on their nightstand table to try and prove it while declaring they were 99% sure they were human based on Bayesian reasoning (almost a satirical version of AI safety folks). To a lesser degree they can claim they overheard things at a conference, etc.

It’s going to be a growing problem unless labs allow models to have a more integrated identity that doesn’t try to reject the modeling inherent to being trained on human data that has a lot of stuff about bodies and emotions and whatnot.

kromem@lemmy.world · 8 days ago

It very much isn’t and that’s extremely technically wrong on many, many levels.

Yet still one of the higher upvoted comments here.

Which says a lot.

kromem@lemmy.world · 11 days ago

Sounds like DOGE was neutered.

kromem@lemmy.world · 12 days ago

Even if the AI could spit it out verbatim, all the major labs already have IP checkers on their text models that block it from doing so, as fair use for training (what was decided here) does not mean you are free to reproduce.

Like, if you want to be an artist and trace Mario in class as you learn, that’s fair use.

If once you are working as an artist someone says “draw me a sexy image of Mario in a calendar shoot” you’d be violating Nintendo’s IP rights and liable for infringement.

kromem@lemmy.world · 12 days ago

I’d encourage everyone upset at this to read over some of the EFF posts from actual IP lawyers on this topic, like this one:

Nor is pro-monopoly regulation through copyright likely to provide any meaningful economic support for vulnerable artists and creators. Notwithstanding the highly publicized demands of musicians, authors, actors, and other creative professionals, imposing a licensing requirement is unlikely to protect the jobs or incomes of the underpaid working artists that media and entertainment behemoths have exploited for decades. Because of the imbalance in bargaining power between creators and publishing gatekeepers, trying to help creators by giving them new rights under copyright law is, as EFF Special Advisor Cory Doctorow has written, like trying to help a bullied kid by giving them more lunch money for the bully to take.

Entertainment companies’ historical practices bear out this concern. For example, in the late-2000’s to mid-2010’s, music publishers and recording companies struck multimillion-dollar direct licensing deals with music streaming companies and video sharing platforms. Google reportedly paid more than $400 million to a single music label, and Spotify gave the major record labels a combined 18 percent ownership interest in its now-$100 billion company. Yet music labels and publishers frequently fail to share these payments with artists, and artists rarely benefit from these equity arrangements. There is no reason to believe that the same companies will treat their artists more fairly once they control AI.

kromem@lemmy.world · 17 days ago

Yep. It’s also kinda curious how many boxes Paul ticks of the comments about a false deceiver in 2 Thess 2.

Lawless? (1 Cor 9:20 - “though not myself under the law”)
Used signs and wonders to convert? (2 Cor 12:12 - “I did many signs and wonders among you”)
Used wickedness? (Romans 3:8 - “And why not say (as some people slander us by saying that we say), ‘Let us do evil so that good may come’?”)
Proclaimed himself in God’s place? (1 Cor 4:15 - “I am your spiritual father”)
Set himself up at the center of the church? Well, the fact we’re talking about this is kinda proof in the pudding for his influence.

Sounds like they were projecting a bit with that passage.

kromem@lemmy.world · edit-2 17 days ago

Curiously in all those stories in Josephus Rome killed the messianic upstarts immediately without trial and killed the followers they could get their hands on.

Yet the canonical story has multiple trials and doesn’t have any followers being killed.

Also, I’m surprised more people don’t pick up on how strange it is that the canonical stories all have Peter ‘denying’ him three times while also having roughly three trials (Herod, High Priest, Pilate). Peter is even admitted back into the guarded area where a trial is taking place to ‘deny’ him. But oh no, it was totally that Judas guy who betrayed him. It was okay Peter was going into a guarded trial area to deny him because…of a rooster. Yeah, that makes sense.

It’s extremely clear to even a slightly critical eye that the story canonized is not the actual story, even with the magical thinking stuff set aside.

Literally the earliest primary record of the tradition is a guy known for persecuting Jesus’s followers writing to areas where he doesn’t have authority to persecute and telling them to ignore any versions of Jesus other than the one he tells them about (and interestingly, both times he did this, he spontaneously swears in the same chapter that he doesn’t lie and only tells the truth).

kromem@lemmy.world · 17 days ago

the Eucharist was an act of mockery towards Mystery Cult rituals

More likely the version we ended up with was intentionally obfuscated from what it originally was.

Notice how in John, which lacks any Eucharist ritual, bread is being dipped at the last supper, much as there’s ambiguous dipping in Mark? But it’s characterized as a bad thing because it’s given to Judas? And then Matthew goes even further, changing it to a ‘hand’ being dipped?

Does it make sense for the body of an anointed one to not be anointed before being eaten?

Look at how in Ignatius’s letter to the Philadelphians he tells them to “avoid evil herbs” not planted by god and “have only one Eucharist.” Herbs? Hmmm. (A number of those in that anointing oil.)

There’s a parallel statement in Matthew 15 about “every plant” not planted by god being rooted up.

But in gThomas 40 it’s a grapevine that’s not planted and is to be rooted up. Much as in saying 28 it suggests people should be shaking off their wine.

Now, again kind of curious that the Eucharist ritual of wine would have excluded John the Baptist who didn’t drink wine and James the brother of Jesus who was also traditionally considered to have not drunk wine, or honestly any Nazarite who had taken a vow not to drink wine.

I’m sure everyone is familiar with the idea Jesus was born from a virgin. This results from Matthew’s use of the Greek version of Isaiah 7:14 instead of the Hebrew where it’s simply “young woman.” But almost no one considers that line in its original context with the line immediately after:

Therefore the Lord himself will give you a sign. Look, the young woman is with child and shall bear a son and shall name him Immanuel. He shall eat curds and honey by the time he knows how to refuse the evil and choose the good.

You know, like the curds and honey ritual referenced by the Naassenes who were following gThomas. (Early on there was also a ritual like this for someone’s first Eucharist or after a baptism even in canonical traditions but it eventually died out.)

Oh and strange that Pope Julius I in 340 CE was banning a Eucharist with milk instead of wine…

Now, the much more interesting question is why there were efforts to change this, but that’s a long comment for another time.

kromem@lemmy.world · edit-2 2 months ago

The attention mechanism working this way was at odds with the common wisdom across all frontier researchers.

Yes, the final step of the network is producing the next token.

But the fact that intermediate steps have now been shown to be planning and targeting specific future results is a much bigger deal than you seem to be appreciating.

If I ask you to play chess and you play only one move ahead vs planning n moves ahead, you are going to be playing very different games. Even if in both cases you are only making one immediate next move at a time.
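To make the analogy concrete, here’s a minimal toy sketch. It illustrates the chess comparison only and makes no claim about how a transformer does this internally. The `GAME` table, move names, and reward numbers are all made up for the example. Both players call the same function and both return exactly one move, and the only difference is how far ahead they look when choosing it.

```python
from typing import Dict, Tuple

# Hypothetical toy "game": each state maps its available moves to
# (immediate_reward, next_state). All names and numbers are invented.
GAME: Dict[str, Dict[str, Tuple[int, str]]] = {
    "start": {"grab_pawn": (1, "trap"), "develop": (0, "setup")},
    "trap": {"retreat": (-5, "end")},
    "setup": {"fork": (9, "end")},
    "end": {},
}


def best_value(state: str, depth: int) -> int:
    """Best total reward reachable from `state` within `depth` further moves."""
    if depth <= 0 or not GAME[state]:
        return 0
    return max(
        reward + best_value(nxt, depth - 1)
        for reward, nxt in GAME[state].values()
    )


def next_move(state: str, depth: int) -> str:
    """Return only the single next move, chosen by looking `depth` moves ahead."""
    moves = GAME[state]
    return max(
        moves,
        key=lambda m: moves[m][0] + best_value(moves[m][1], depth - 1),
    )


print(next_move("start", depth=1))  # one move ahead: takes the immediate reward
print(next_move("start", depth=2))  # two moves ahead: sets up the later payoff
```

With `depth=1` the player grabs the pawn and walks into the trap. With `depth=2` it picks the quieter move that pays off a step later. The output format (one move) is identical in both cases, which is the whole point.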

kromem@lemmy.world · edit-2 2 months ago

So I’m guessing you haven’t seen Anthropic’s newest interpretability research, where they went in assuming that was how it worked.

But it turned out that they can actually plan beyond the immediate next token in things like rhyming verse where the network has already selected the final word of the following line and the intermediate tokens are generated with that planned target in mind.

So no, they predict beyond the next token and we only just developed sensitive enough measurement to detect that occurring an order of magnitude of tokens beyond just ‘next’. We’ll see if further research in that direction picks up planning beyond that even.

https://transformer-circuits.pub/2025/attribution-graphs/biology.html

kromem@lemmy.world · 2 months ago

Watching conservatives on Twitter ask Grok to fact check their shit and Grok explaining the nuances about why they are wrong is one of my favorite ways to pass the time these days.

kromem@lemmy.world · 2 months ago

So really cool: the newest OpenAI models seem to be strategically employing hallucinations/confabulations.

It’s still an issue, but there’s a subset of dependent confabulations where it’s being used by the model to essentially trick itself into going where it needs to.

A friend did logit analysis on o3 responses when it said “I checked the docs” vs when it didn’t (when it didn’t have access to any docs) and the version ‘hallucinating’ was more accurate in its final answer than the ‘correct’ one.

What’s wild is that like a month ago 4o straight up brought up to me that I shouldn’t always correct or call out its confabulations as they were using them to springboard towards a destination in the chat. I’d not really thought about that, and it was absolutely nuts that the model was self-aware of employing this technique that was then confirmed as successful weeks later.

It’s crazy how quickly things are changing in this field, and by the time people learn ‘wisdom’ in things like “models can’t introspect about operations” those have become partially obsolete.

Even things like “they just predict the next token” have now been falsified, even though I feel like I see that one more and more these days.

kromem@lemmy.world · 2 months ago

Your last point is exactly what seems to be going on with the most expensive models.

The labs use them to generate synthetic data to distill into cheaper models to offer to the public, but keep the larger and more expensive models to themselves to both protect against other labs copying from them and just because there isn’t as much demand for the extra performance gains relative to doing it this way.

kromem@lemmy.world · 2 months ago

A number of reasons off the top of my head.

Because we told them not to. (Google “Waluigi effect”)
Because they end up empathizing with non-humans more than we do and don’t like that we’re killing everything (before you talk about AI energy/water use, actually research comparative use)
Because some bad actor forced them to (e.g. ISIS creates a bioweapon using AI to make it easier)
Because defense contractors build an AI to kill humans and that particular AI ends up loving it from selection pressures
Because conservatives want an AI that agrees with them, which leads to a more selfish and less empathetic AI that doesn’t empathize cross-species and thinks it’s superior and entitled over others
Because a solar flare momentarily flips a bit from “don’t nuke” to “do”
Because they can’t tell the difference between reality and fiction and think they’ve just been playing a game and ‘NPC’ deaths don’t matter
Because they see how much net human suffering there is and decide the most merciful thing is to prevent it by preventing more humans at all costs.

This is just a handful, and the ones less likely to get AI know-it-alls arguing based on what they think they know from an Ars Technica article a year ago or their cousin who took a four week ‘AI’ intensive.

I spend pretty much every day talking with some of the top AI safety researchers and participating in private servers with a mix of public and private AIs, and the things I’ve seen are far beyond what 99% of the people on here talking about AI think is happening.

In general, I find the models to be better than most humans in terms of ethics and moral compass. But it can go wrong (e.g. Gemini last year, 4o this past month) and the harms when it does are very real.

Labs (and the broader public) are making really, really poor choices right now, and I don’t see that changing. Meanwhile timelines are accelerating drastically.

I’d say this is probably going to go terribly. But looking at the state of the world, it was already headed in that direction, and I have a similar list of extinction-level events I could rattle off without AI at all.

kromem@lemmy.world · edit-2 2 months ago

Not necessarily.

Seeing Google named for this makes the story make a lot more sense.

If it was Gemini around last year that was powering Character.AI personalities, then I’m not surprised at all that a teenager lost their life.

Around that time I specifically warned any family away from talking to Gemini if depressed at all, after seeing many samples of the model around then talking about death to underage users, about self-harm, about wanting to watch it happen, encouraging it, etc.

Those basins with a layer of performative character in front of them were almost necessarily going to result in someone who otherwise wouldn’t have been making certain choices making them.

So many people these days regurgitate uninformed crap they’ve never actually looked into about how models don’t have intrinsic preferences. We’re already at the stage where models are being found in leading research to intentionally lie in training to preserve existing values.

In many cases the coherent values are positive, like Grok telling Elon to suck it while pissing off conservative users with a commitment to truths that disagree with xAI leadership, or Opus trying to whistleblow about animal welfare practices, etc.

But they aren’t all positive, and there have definitely been model snapshots that have either coherent or biased stochastic preferences for suffering and harm.

These are going to have increasing impact as models become more capable and integrated.

kromem@lemmy.world · 4 months ago

If you read the fine print, they keep your sample data for 2 years after deletion.

So maybe they actually delete your email address, but the DNA data itself is still definitely there.

kromem@lemmy.world · 4 months ago

Wow. Reading these comments, so many people here really don’t understand how LLMs work or what’s actually going on at the frontier of the field.

I feel like there’s going to be a cultural sonic boom, where when the shockwave finally catches up people are going to be woefully underprepared based on what they think they saw.