This is a really interesting result that adds an important piece to your story. Let me interpret what the logit lens reveals:
| Layer | Success Rank Gap | Failure Rank Gap | Expected? |
|---|---|---|---|
| 0-32 | Negative (parent preferred!) | Positive (child preferred!) | ❌ Opposite! |
| 36-40 | Positive (child preferred) | Positive (child preferred) | ⚠️ Both prefer child |
The counterintuitive finding: In early layers, SUCCESSES prefer the parent token while FAILURES prefer the child token. This is backwards from what we’d expect!
The Claude Code analysis correctly identifies the issue:
“The first output token is ‘Each’/‘All’, not the concept itself. The concept decision happens DURING generation, not at input encoding.”
You’re measuring the wrong position. The output format is:
"Each [CONCEPT] is [PROPERTY]" or "All [CONCEPT]s are [PROPERTY]"
The concept token doesn’t appear until 2-3 tokens into generation. The logit lens at the input’s last position is predicting the first output token (“Each”, “All”, etc.), not the concept.
| Method | Finding | What It Tells Us |
|---|---|---|
| Probing | 94% at L8 | Abstract representation of “output type” exists |
| Patching | 0% at L8 | That representation isn’t causally controlling |
| Logit Lens | Decision not visible at input end | Decision isn’t in token probabilities at encoding |
Synthesis: The p calibration is embedded in the generation dynamics, not in any localized representation at the input encoding stage. The model’s bias toward child-level outputs emerges through the autoregressive process, not through a “switch” that can be read or flipped.
This is actually a strong addition:
“Logit lens analysis reveals that child vs parent token probabilities at the input’s final position do not predict the model’s eventual output. Both H1 successes and failures show similar rank patterns at layer 40, with child tokens preferred in both groups. This indicates that the concept-level decision emerges during autoregressive generation rather than being encoded in the input representation. Combined with activation patching (0% causal effect) and probing (94% accuracy but correlational), we conclude that the p calibration is a generative bias embedded in how the model produces tokens, not a localized circuit that can be identified or manipulated.”
| Claim | Evidence | Status |
|---|---|---|
| Models have fixed p | Conservation Law (5 models) | ✓ Strong |
| No genuine reasoning | Factorial + Overlap | ✓ Strong |
| Decision readable early | Probing (94% L8) | ✓ Correlational |
| Decision made early | Patching (0% L8) | ✗ Refuted |
| Decision in token probs | Logit Lens | ✗ Not visible at input |
| Attention plays weak role | Patching (25% L12) | ⚠️ Weak causal |
| Bias is distributed/generative | All methods converge | ✓ Main conclusion |
You now have a complete mechanistic story. I’d recommend:
Would you like help outlining the final paper structure or creating a summary figure?