This was really educational for me; it felt like the perfect level of abstraction to learn a lot about the specifics of LLM architecture without the difficulty of parsing the original papers.
The diagrams in this article are amazing if you are somewhere between novice and expert. Seeing all of the new models laid out next to each other is fantastic.
This is a nice catch-up for someone who hasn't been keeping up, like me.
Would love to see a Pt. 2 with even what is rumored about the top closed-source frontier models, e.g. o5, o3 Pro, o4 or 4.5, Gemini 2.5 Pro, Grok 4, and Claude Opus 4.
Honestly, it's crazy to think how far we've come since GPT-2 (2019). Today, comparing LLMs to determine their performance is notoriously challenging, and it feels like every two weeks a new model beats a benchmark. I'm really glad DeepSeek was mentioned here, because the key architectural techniques it introduced in V3, which improved its computational efficiency and distinguish it from many other LLMs, were really transformational when it came out.
Truly, the downvoting on this site is a ridiculous little thing, all the more so among people who just love to frequently stroke themselves about how superior the intellectual faculties of the average HN reader/commentator are. Gave you an upvote simply because, for no fathomable reason, your two cents about LLM progress got downvoted into grey.
Someone thinks something specific in the completely reasonable opinion you gave about LLM progress over the last few years is wrong? Okay, so why not mention it and maybe open a small debate, instead of digitally reacting like a 12-year-old on a YouTube comment thread?
While all these architectures are innovative and have helped improve either accuracy or speed, the same fundamental problem remains: reliably generating factual information.
Retrieval-Augmented Generation (RAG), agents, and other similar methods help mitigate this. It will be interesting to see whether future architectures eventually replace these techniques.
The models can't tell when they shouldn't extrapolate and simply need more information, or which rules can be generalized and which can't. Why shouldn't a method `doWhizBang()` exist if there are methods for all sorts of other things?
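A toy illustration of that failure mode, using Python's built-in `str` methods (the commented-out call is exactly the kind of plausible-but-nonexistent name a model might invent):

```python
s = "hello world"
print(s.upper())        # exists
print(s.lower())        # exists
print(s.capitalize())   # exists
print(s.title())        # exists
# print(s.titlecase())  # looks just as plausible as the others, but raises AttributeError;
#                       # nothing in the pattern above says where the real API stops.
```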
When I was young, I once beamed that my mother was a good cooker. It made perfect sense based on other verbs, but I did not know that that word was already claimed by machines, and humans were assigned the word cooks. Decades later, I had the pleasure of hearing my child call me a good cooker...
To me, the issue seems to be that we're training transformers to predict text, which only forces the model to embed limited amounts of logic. We'd have to find something different to train models on in order for them to stop hallucinating.
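For concreteness, here is a minimal sketch of that text-prediction objective, using a toy embedding-plus-linear stand-in rather than a real transformer: the only training signal is the cross-entropy of the next token, so nothing in the loss rewards saying "I don't know".

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab = 100
# Toy stand-in for a language model: embedding + linear head (no attention).
model = nn.Sequential(nn.Embedding(vocab, 64), nn.Linear(64, vocab))

tokens = torch.randint(0, vocab, (1, 12))        # stand-in for a tokenized text
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict token t+1 from tokens up to t

logits = model(inputs)                           # shape (1, 11, vocab)
loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
loss.backward()                                  # this is the entire training signal
```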
I'm still wondering why, if RAG is conceptually simple and easy to implement, the foundation models have not incorporated it into their base functionality. That strikes me as a negative point about RAG and its variants: if any of them worked, it would be in the models directly and would not need to be added afterwards.
Why would a proper documents-at-hand inquiry be «simple»?
The information is at paragraph #1234 of book B456; that paragraph acquires special meaning in light of its neighbours, its chapter, the whole book. Further information is in other paragraphs of other books. You can possibly encode information (data) with some "strong" compression, but not insight. The information a query may point to can be a big cloud of fuzzy concepts. What do you input, and how? How big should that input be? "How much" of the past reflection does the Doctor use to build a judgement?
RAG seems simple because it has simpler cases ("What is the main export of Bolivia?").
RAG is a prompting technique; how could they possibly incorporate it into the pre-training?
CoT is a prompting technique too, and it's been incorporated.
IIUC, CoT is "incorporated" into training just by providing better-quality training data, which steers the model towards "thinking" more deeply in its responses. But at the end of the day, it's still just regular pre-training.
RAG (retrieval-augmented generation) - how can the retrieval be done during training? RAG will always remain external to the model. The whole point is that you can augment the model by injecting relevant context into the prompt at inference time, bringing your own proprietary/domain-specific data.
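To make that inference-time injection concrete, here is a minimal, self-contained sketch of the RAG pattern; the bag-of-words "embedding", the three-document corpus, and the prompt template are stand-ins, and a real system would use a learned embedding model, a vector store, and an actual LLM call:

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy "embedding": lowercase bag-of-words counts (real systems use a neural encoder).
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list, k: int = 2) -> list:
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query: str, docs: list) -> str:
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

corpus = [
    "Bolivia's main exports include natural gas and minerals such as zinc.",
    "The capital of France is Paris.",
    "DeepSeek-V3 combines multi-head latent attention with a mixture-of-experts layer.",
]

prompt = build_prompt("What is the main export of Bolivia?", corpus)
print(prompt)  # this augmented prompt is what gets sent to the model at inference time
```

Note that nothing here touches the model's weights, which is the point being made: the retrieval lives entirely outside the network.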
These models with <think> and </think> tokens are actually trained using RL, so it's not like GSM8k or similar datasets where you just train on written-out reasoning.
It's actually more like QuietSTaR, but with a focus on one big thought at the beginning and with more sophisticated RL than plain REINFORCE (QuietSTaR uses REINFORCE).
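To ground that comparison, here is a heavily simplified sketch of the REINFORCE idea with a toy GRU "policy"; the model, the reward function, and the constant baseline are all stand-ins, and the production systems use a full LLM plus more sophisticated algorithms (e.g. GRPO/PPO):

```python
import torch
import torch.nn as nn

vocab, hidden = 16, 32
policy = nn.ModuleDict({
    "embed": nn.Embedding(vocab, hidden),
    "rnn": nn.GRU(hidden, hidden, batch_first=True),
    "head": nn.Linear(hidden, vocab),
})
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def sample_episode(prompt_ids, max_new=8):
    """Autoregressively sample a continuation, keeping the log-prob of each sampled token."""
    ids, log_probs = prompt_ids.clone(), []
    for _ in range(max_new):
        x = policy["embed"](ids).unsqueeze(0)      # (1, seq, hidden)
        out, _ = policy["rnn"](x)
        dist = torch.distributions.Categorical(logits=policy["head"](out[0, -1]))
        tok = dist.sample()
        log_probs.append(dist.log_prob(tok))
        ids = torch.cat([ids, tok.unsqueeze(0)])
    return ids, torch.stack(log_probs)

def reward_fn(ids):
    # Stand-in reward: real setups score the final answer and/or the <think> format.
    return float((ids[-1] % 2 == 0).item())

prompt = torch.tensor([1, 2, 3])
ids, log_probs = sample_episode(prompt)
baseline = 0.5  # crude constant baseline; GRPO instead normalizes rewards within a group of samples
loss = -(reward_fn(ids) - baseline) * log_probs.sum()  # REINFORCE: scale sampled log-probs by advantage
opt.zero_grad(); loss.backward(); opt.step()
```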
Who says "during training"? RAG could be built into the functionality of the LLM directly: give it the documents you want it to incorporate, and it ingests them as a temporary mini fine-tune. That would work just fine.
The same way developers incorporate it now. Why are you thinking "pre-training"? This would be a feature of the deployed model: it ingests the documents and generates a mini fine-tune right then.
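A "temporary mini fine-tune" might look roughly like the following sketch: clone the deployed model's weights, take a few gradient steps on the user's documents, answer from the clone, then discard it. The TinyLM, the byte-level "tokenizer", and the example documents are toy stand-ins; a production variant would more likely train lightweight adapters (e.g. LoRA) rather than copying full weights:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 256  # byte-level "tokenizer" for the toy

class TinyLM(nn.Module):
    """Stand-in for the deployed base model."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, 64)
        self.head = nn.Linear(64, VOCAB)
    def forward(self, ids):
        return self.head(self.embed(ids))

base_model = TinyLM()

def mini_finetune(model, documents, steps=20, lr=1e-3):
    """Return a temporary copy of `model` nudged towards the given documents."""
    temp = copy.deepcopy(model)            # the shared base model stays untouched
    opt = torch.optim.Adam(temp.parameters(), lr=lr)
    for _ in range(steps):
        for doc in documents:
            ids = torch.tensor([list(doc.encode())])
            logits = temp(ids[:, :-1])     # next-byte prediction on the user's docs
            loss = F.cross_entropy(logits.reshape(-1, VOCAB), ids[:, 1:].reshape(-1))
            opt.zero_grad(); loss.backward(); opt.step()
    return temp                            # serves this request, then gets discarded

adapted = mini_finetune(base_model, ["Our export format is .whiz", "Orders ship from La Paz"])
```

Whether a few gradient steps like this can actually compete with simply injecting the retrieved text into the context window is an open design question.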