Our latest Reading Group Session took a deep dive into DeepSeek's latest advancements—what they mean for LLM training, inference efficiency, and the evolving MLOps landscape. From scaling strategies to open-weight alternatives, the discussion was packed with insights.
🌍 📖 Missed it? Catch the recap here: https://lnkd.in/g7pPwZVj
🔍 We're doing this again on the 20th. This time, we’re unpacking "Which Economic Tasks are Performed with AI? Evidence from Millions of Claude Conversations." If you’re interested in how AI is reshaping economic tasks at scale, this is a must-read.
📜 Paper: https://lnkd.in/eZ6dW6it
📅 Date: 03/20
⏲️ Time: 11 am - 12 pm ET
🔗 Register now: https://lu.ma/26yqobcv
Read ahead if you can—excited to hear everyone’s insights!
It makes you wonder, what tooling do they need to have in place to do that effectively? The degree of collaboration, the degree of false starts and different directions, even preparing the data and iterating on it before they ever try to train a model. They don't say anything about this in the paper at all; I think they focus entirely on methodology and outcomes. But it makes you wonder what it looks like, because while most people aren't going to do this, most businesses that want to do some kind of AI aren't going to try to train their own reasoning model, but they might be doing something of similar complexity. So that's interesting to think about.

Yeah, I think in the niches people are fine-tuning, so the process looks kind of similar there. But you're right, it takes a lot of MLOps to actually get to this level of quality. It's also mentioned a bit later in the paper, and I'm not sure if someone else will bring it up, that they had a couple of unsuccessful attempts. So what you're saying, that they obviously didn't get there straight away, is actually highlighted in the paper. It's very interesting to see what they tried that didn't fit the pattern.

Can I ask something? It's in the chat as well. From a high level, as I interpret what you've said, looking at the big picture, what it seems DeepSeek has done is, instead of using billions of parameters, they've synthetically generated parameters from the chain of thought and substituted them back into the model. Would you say that's the basic notion of what's been done here?

Yeah, I think so. I mean, that's what they were doing with RL as well. But for R1, the only difference is that it's not just synthetic data; some of it is also datasets that already existed, which they used. So it's a combination of the two. But there's a lot of synthetic data being used to get to that level of quality and accuracy.

Yeah, so basically they use the output of the chain of thought as parameters, in my mind, as I'm understanding what you said. That's cool.

And so this is what I was interjecting, Adam. This was my takeaway: long chain of thought is all you need to make amazing reasoning, at least as of today. And all of that shows up at inference time, which is all the rage about, hey, can we get more money on that side. On the experimentation side, there were some things that are useful, just because I run a lot of evals, and I think a lot of the community here would be interested to know how they evaluated things. Well, the first thing is: how do you even evaluate thinking? That's not straightforward. So they shared some of what they were doing and how that can be useful for evaluating thinking. The first thing they did was turn the maximum output token length all the way up, so the model can think longer and generate a lot more tokens, and that captures all the detailed thinking as well. The other thing they tried was greedy decoding, where you just take whatever the most likely next token is and evaluate that, but that didn't work out well. So what they did instead was generate multiple outputs and take the average of the correctness across all of them. And while generating those multiple outputs, they didn't use temperature zero.
Of course they wanted variety in it, so that the LLM can do different thinking and can then figure out, okay, this is the right answer versus that, and how often it is able to think correctly. That's roughly how they evaluated it, with k being how many responses they generate per query. They did it across everything. In the end, R1-Zero was again mostly coding and math, but R1 they wanted to be more general purpose. So here they took all the different coding benchmarks; there are so many, and I'm not familiar with a lot of them either. But the overall gist is that they evaluated against different things, and the quality is pretty good; in almost all cases this process got them to be better than the base model, which was V3. Right? And so that was already a win: with very little data, and adding this long inference-time generation of tokens, you can surpass the accuracy of what you could previously produce for complex queries. So that was super cool to see. And it was all the rage because it was competitive with o1's latest version while being open source and all that. So that was the other thing.
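To make that evaluation recipe concrete, here is a minimal sketch of the averaged-correctness idea discussed above: sample several responses per query at a non-zero temperature with a large token budget, grade each one, and average. The `sample_responses` and `is_correct` helpers, and the specific k, temperature, and token-budget values, are illustrative assumptions rather than the paper's exact setup.

```python
import statistics

def avg_correctness(prompts, sample_responses, is_correct,
                    k=16, temperature=0.7, max_new_tokens=32768):
    """Average correctness over k sampled responses per prompt.

    sample_responses(prompt, n, temperature, max_new_tokens) -> list[str]
    and is_correct(prompt, response) -> bool are assumed helpers.
    """
    scores = []
    for prompt in prompts:
        # Sample k responses at a non-zero temperature so the chains of
        # thought differ, with a large token budget so long reasoning
        # is not cut off mid-thought.
        responses = sample_responses(prompt, k, temperature, max_new_tokens)
        correct = sum(is_correct(prompt, r) for r in responses)
        scores.append(correct / len(responses))
    # Mean per-prompt correctness across the whole benchmark.
    return statistics.mean(scores)
```

Averaging over k samples like this is effectively a sampled estimate of single-attempt accuracy, which is why a single greedy decode isn't needed.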
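And going back to the earlier exchange about synthetic data: below is a small, hypothetical sketch of what "using chain-of-thought outputs as training data and mixing them with existing datasets" could look like in practice. The `generate_trace` and `answer_is_correct` helpers are assumptions for illustration, not functions from the paper.

```python
import random

def build_sft_mix(prompts, generate_trace, answer_is_correct, existing_examples):
    """Combine synthetic chain-of-thought samples with pre-existing SFT data.

    generate_trace(prompt) -> (thinking, final_answer) and
    answer_is_correct(prompt, final_answer) -> bool are assumed helpers.
    """
    synthetic = []
    for prompt in prompts:
        thinking, final_answer = generate_trace(prompt)
        # Keep only traces whose final answer checks out (simple rejection
        # sampling), so low-quality reasoning stays out of the training mix.
        if answer_is_correct(prompt, final_answer):
            synthetic.append({"prompt": prompt,
                              "response": f"{thinking}\n{final_answer}"})
    # Mix the synthetic reasoning data with datasets that already existed.
    mix = synthetic + list(existing_examples)
    random.shuffle(mix)
    return mix
```

The resulting mix would then be used to fine-tune a base model, matching the "combination of the two" described in the conversation.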