Abstract
This article, which takes a specific example in order to look at how concepts shape developments in AI models, claims that the deployment of the concept of distillation is facilitating a particular reframing of reasoning within artificial intelligence. The concept of distillation is circulating through AI development cultures and has taken a certain shape in recent developments. The launch of the DeepSeek-R1 reasoning model represents a pivotal and potentially disruptive moment in the direction of AI. This article explores how the concept of distillation has been central to its framing and impact, and how that concept will continue to shape AI into the future. Through a close reading of the research paper that accompanied the launch of DeepSeek-R1, this article looks at how distillation, an existing idea within AI research, is reanimated as a concept. The article explores the way in which distillation is involved in the advancement of what we call here power without scale. We then look directly at how distillation is applied to reframe reasoning, before examining how measures and benchmarks of reasoning are established in order to validate the transformative effect of distillation in AI models. The article deals with the role of the concept of distillation in shaping perceptions, infrastructures, planning, and expectations of AI.
The DeepSeek-R1 reasoning model was released on the 20th of January 2025 with the promise of a ‘Bonus: Open-Source Distilled Models!’ (DeepSeek API Docs, 2025; emoji in the original). What could have been a routine tech industry launch of yet another artificial intelligence product instead brought with it a seemingly radical disruption to what was becoming a nascent yet established order of AI. By reshaping expectations and reconfiguring the futures projected onto AI, DeepSeek's launch of R1 rapidly came to represent a notorious moment. As will be explored in this article, at stake is what we call a reframing of reasoning. Perceptions of the future of artificial intelligence are shaped by the deployment and animation of concepts that frame their form and direction; the DeepSeek case represents one such moment. These shifts materialized most directly during the early part of 2025 in the sudden adjustment of value attributed to a host of AI-related companies, coupled with a change in the planned direction of AI development, including the underpinning infrastructures, chips, capacities, data availability, and training requirements. Unfathomable scales of lost value were reported in the immediate period. The shares of US-based chip maker NVIDIA, for instance, famously fell 17%, the equivalent of $589 billion in market capitalization, following the release of DeepSeek. This was considered the largest one-day drop in company value in history (Saul, 2025). Whatever the valuations that occurred after that day, the launch of the new model clearly represented a shock.
The release of DeepSeek's reasoning model also generated significant coverage and opinion, quickly coming to be regarded, at least in impression, as a pivotal moment. It created seeming ‘bewilderment’ amongst those engaging in a so-called ‘AI race’ (Taylor, 2024: 10). The proclamations that followed were indicative of the presence of international competition in the sector (see Bloom, 2025). US President Donald J. Trump called it a ‘wake-up call’ for US tech companies, reinforcing the geopolitical narrative of an AI race between the US and China (Koromi, 2025). Furthering a similar line of argument, Silicon Valley venture capitalist Marc Andreessen pronounced that ‘DeepSeek-R1 is AI's Sputnik moment,’ in reference to the Soviet Union's launch of the first satellite into orbit and the start of the space race in the 1950s. Following former US President Joe Biden's chip and hardware export restrictions on China in 2022, as Karen Hao has put it, DeepSeek ‘innovated because of, not in spite of, its constraints’ (cited in Hawkins, 2025). It seemingly managed to do as much with far less than US-based companies such as OpenAI. However, as the dust settles, Paul Taylor (2024: 11) has concluded that, from the picture so far, it remains very ‘hard to know exactly how innovative DeepSeek is,’ a point which was also echoed by the CEO of Google DeepMind, Demis Hassabis (Kharpal, 2025). Yet its importance, we suggest, is connected to the conceptual framing of the DeepSeek-R1 system itself and, more specifically, to the use of ‘distillation’ as a conceptual and computational framing device. We take this as an example of the way that concepts frame and envision the conditions and forms of AI models.
While there has already been much social science scholarship that has examined the politico-economic conditions of algorithmic systems (Cheney-Lippold, 2024; Crawford, 2021; Luitse, 2024), and in particular LLMs (Luitse and Denkena, 2021), in this article we do not unpick the political economy of the DeepSeek-R1 model. Nor, despite their clear importance, do we focus on the relative efficacy of the model or on the circulating narratives and ‘cultural politics of artificial intelligence’ (Cai, 2025) associated with that moment. This is not about the specific functioning of the model itself. Rather, we look at the underpinning concepts that are acting to reframe and redraw the whole direction, understanding and imaginary around AI. We specifically look at how DeepSeek has used conceptions of distillation to actively pursue a reframing of AI reasoning.
Published on the 22nd of January 2025, the release of the R1 model was accompanied by an explanatory research paper entitled ‘DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning’. That research paper sought to narrate, explain and then justify a set of claims regarding this new R1 model. As such, by looking closely at its content, we can see how those narratives, explanations, and claims are built around certain conceptual anchors that themselves are of importance for the elaboration of that model and for setting up a wider imperative of transformation in AI. Instead of seeing concepts like intelligence, reasoning or distillation as stable or fixed reference points for artificial intelligence development, we want to examine them ‘as in construction, as being organized into particular narratives and methods’ through an exploration of how the DeepSeek-R1 model ‘involves particular ways of organizing or ordering’ (Ahmed, 1998: 12). In other words, it is in those conceptual explanations for this particular model that we can locate the ongoing reframing of reasoning that will continue to shape AI. This is not just limited to the technical aspects of reasoning models, with their step-by-step formats (Caballar and Stryker, 2025); as we will explore here, we are thinking more of the way that these models are informed by particular forms or modes of reasoning. Indeed, rather than exploring the technical aspects of models in order to understand them better, the approach of this article is to think about how concepts, and their connotations and purposes, come to shape and be shaped by AI models. This is part of an attempt to think about the relations between conceptual formulations and AI architectures. As such, this article seeks to explore the tensions between concepts and their technical applications. Reasoning is an example of where this occurs, as is distillation.
We therefore explore how these concepts are defined and emerge within this particular instance, and how this fits into wider conceptualizations of AI.
In this article, we closely read the DeepSeek research paper and the ideas around distillation that it operationalises, looking at how that concept is deployed and developed as a means or mechanism for altering how AI models are understood. We consider how the concept of distillation is deployed, and how it may continue to influence and shape the direction of AI beyond that particular R1 model. As such, we do not start here with a definition of distillation; instead we explore how the term comes to be defined and redefined in its usage and deployment. This is an exercise in reading computer science, treating the paper as a text that can reveal how meaning-making is unfolding within machine learning. In a recent piece on reading computer science texts from within the humanities and social sciences, Amoore et al. (2023: 2) have argued that: ‘researching machine learning involves reading the texts of computer science and, in the act of reading them, necessarily and inescapably engaging with the world-making that is taking place. Beyond the matter of what one reads, there is the question of how reading takes place and what kind of reading could offer ways to engage machine learning as it makes meaning in the world.’
We begin by looking at how distillation is associated with a move from large to more concentrated models. We refer to this pursuit of density as an attempt to perceive a form of power without scale. By this we do not mean power with no scale, but power without the necessity for scaling-up. This is power produced at a reduced and reducing scale of architecture - a scaling down with the retention of power. From this wider shift depicted in DeepSeek's accounts, we then focus on how these processes are understood by looking directly at the distilling of processes and the reframing of reasoning itself. Finally, we move to an examination of how distillation is rendered comparable as an approach with alternative models, and the utilization of the benchmarking of power and potential to assess such systems. In that section, we consider the way that reasoning itself is treated as something to be measured, and how those measures facilitate a contrasting of the capacity and capabilities of different models. Informed by a benchmarking of capabilities, notions of distillation are used to emphasize the concentration of power. By applying specific metric properties, the measurement of reasoning inevitably alters how reasoning is defined and conceptualized, and how its parameters are set. Overall, accepting that there are other possible readings and focal points to be found within it, we argue that the DeepSeek-R1 moment represents an attempt to reconfigure and reframe AI around a core concept of distillation, one which nonetheless stretches beyond that single model. We do not regard the DeepSeek case as novel in that regard, but as a particular reanimation of a concept that is taken to perform a role in envisioning these systems.
The impact of the DeepSeek model is connected directly with this underpinning concept of distillation, which in turn will bring to the fore its future influence on the field as well as in defining how AI models are to be understood, conceptualized and framed. This, we argue, is where we can see an active reframing of reasoning taking place.
Power without scale: From large to dense models
In the 2025 research paper, DeepSeek explained how they found that a prior model trained on large-scale reinforcement learning (or RL) ‘naturally emerges with numerous powerful and intriguing reasoning behaviors’. Yet, because of the challenges still encountered, the DeepSeek researchers sought to develop DeepSeek-R1, ‘which incorporates multi-stage training and cold-start data before RL’ (DeepSeek-AI, 2025: 3). A change in the staging or steps of processing led, they claim, to breakthroughs in reasoning with the R1 model. This was to add stages prior to reinforcement learning, and to increase training using ‘cold-start data’. Cold-start data is a term used widely to refer to a ‘lack of knowledge about attributes and features’ for new data (Patel and Patel, 2020: 4.1; see also Camacho and Alves-Souza, 2018). Particularly prominent in DeepSeek's account of the shift being made here is the notion of ‘dense models’ (DeepSeek-AI, 2025: 3). The reduction in size of larger models does not simply create a smaller model but, as they claim, the perceived concentration creates a model that is denser. Density is equated with the retention of processing power and performance levels at the smaller scale. We use the term density here not in its directly technical application but in terms of the conceptual and rhetorical framings of the models being offered in DeepSeek's account. It is a term, we note, that is suggested by the presence of distillation processes. This points to a wider issue with exploring the relations between concepts and AI models, which is that these concepts can have specific technical applications whilst also carrying connotations that envision these systems in particular ways. In this article, as we are thinking more in terms of the sociological analysis of the conceptualization of AI development, we aim to allow for the concepts to envision as well as to be descriptive of certain technical features of architectures.
The notion of density in AI often refers specifically to the arrangement of connections within the neurons and layers in neural networks. The more connected a node, the greater the density (with similarities to conceptions of density in social network analysis, see Crossley, 2008). A ‘dense model’ refers to a neural network algorithm where all the neurons in one layer are connected to all the neurons in the next layer, or where every input in one layer is connected to every output in the next layer (Corbett, 2022). These are also known as ‘fully connected’ neural networks. In the case of DeepSeek, this notion of ‘dense models’ appears frequently as a way of framing the consequence of their more pronounced focus on distillation. So, for instance, they describe how they: ‘further explore distillation from DeepSeek-R1 to smaller dense models. Using Qwen2.5-32B (Qwen, 2024b) as the base model, direct distillation from DeepSeek-R1 outperforms applying RL on it. This demonstrates that the reasoning patterns discovered by larger base models are crucial for improving reasoning capabilities. We open-source the distilled Qwen and Llama (Dubey et al., 2024) series. Notably, our distilled 14B model outperforms state-of-the-art open-source QwQ-32B-Preview (Qwen, 2024a) by a large margin, and the distilled 32B and 70B models set a new record on the reasoning benchmarks among dense models’ (DeepSeek-AI, 2025: 3).
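The idea of a fully connected layer described above can be sketched minimally in code. This is purely illustrative, assuming arbitrary layer sizes and random weights; it does not reproduce any of the architectures discussed in this article:

```python
import numpy as np

# A "dense" (fully connected) layer: every input unit is connected to
# every output unit, so the layer holds one weight per input-output pair.
rng = np.random.default_rng(0)

def dense_layer(x, weights, bias):
    # Each output is a weighted sum over all inputs: y = xW + b.
    return x @ weights + bias

n_in, n_out = 4, 3                  # 4 inputs, 3 outputs: 4 * 3 = 12 connections
W = rng.normal(size=(n_in, n_out))  # full weight matrix, no missing connections
b = np.zeros(n_out)

x = np.ones(n_in)
y = dense_layer(x, W, b)
assert y.shape == (n_out,)          # every output depends on every input
```

Stacking such layers, where each depends on all the outputs of the previous one, is what makes a network ‘fully connected’ in the sense described above.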
This focus on density and distillation resonates with, but also seems to run counter to, a broader emphasis on scale within deep learning and AI. As Turing Award winners Bengio et al. (2021: 63) have suggested, ‘the performance of deep learning systems can often be dramatically improved by simply scaling them up,’ adding that ‘with a lot more data and a lot more computation, they generally work a lot better.’ Also, according to the 2024 Artificial Intelligence Index Report from the Human-Centered Artificial Intelligence unit at Stanford University, the trends in both ‘parameter count’ (the learned numerical values that facilitate interpretation and prediction) and ‘training compute’ (the allocated computational and hardware resources) in AI models have risen exponentially in the last 20 years, and have been particularly ‘pronounced in the last five years’ (HAI, 2024: 49–51). The idea (or assumption), in short, is that AI models ‘keep improving as they get bigger’ (Bengio et al., 2021: 63). This ‘scalability thinking’ (Pfotenhauer et al., 2022) or imperative for scale has long dominated the AI community. A notable instantiation of the principle that bigger is better is the 2020 release of OpenAI's GPT-3, which had 175 billion parameters, compared to GPT-2 released in 2019, which only had 1.5 billion parameters. Referring to these specific models from OpenAI, researchers from Stanford claimed that this rise in scale made possible ‘in-context learning, in which the language model can be adapted to a downstream task simply by providing it with a prompt… an emergent property that was neither specifically trained for nor anticipated to arise’ (Bommasani et al., 2022: 5).
The problem of scale, as explained by one IBM-based writer, is that the: ‘top performing models for a given task are often too large, slow or expensive for most practical use cases - but often have unique qualities that emerge from a combination of their size and their capacity for pre-training on a massive quantity of training data…Conversely, small models are faster and less computationally demanding, but lack the accuracy, refinement and knowledge capacity of a large model with far more parameters’ (Bergmann, 2024).
Making reference to benchmarks for comparison of models and to identify the potential of distillation, DeepSeek goes on to argue that: ‘Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks. DeepSeek-R1-Distill-Qwen-7B achieves 55.5% on AIME 2024, surpassing QwQ-32B-Preview. Additionally, DeepSeek-R1-Distill-Qwen-32B scores 72.6% on AIME 2024, 94.3% on MATH-500, and 57.2% on LiveCodeBench. These results significantly outperform previous open-source models and are comparable to o1-mini. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on Qwen2.5 and Llama3 series to the community’ (DeepSeek-AI, 2025: 4).
DeepSeek's overview of their approach is stated simply in terms of attempts to ‘distill the reasoning capability from DeepSeek-R1 to small dense models’ (DeepSeek-AI, 2025: 5). The notion of distillation then carries significant weight in the envisioning of AI and in demarcating a particular change in direction that this development is intended to represent. The imperative to ‘scale up’ in AI - with more data, more compute, bigger infrastructures, and so on - is fundamentally expansive and extractive (Sadowski, 2019). The problem is that there are inevitable limitations to ‘finite media’ (Cubitt, 2017). Distillation, on the other hand, is understood to work with and through these limitations. The references to distillation project promises of reduction, of concentration, and of compression, turning them into productive processes; that is, of a move from larger to smaller yet powerful models. This generates a vision of power without scale. It is a vision that, from the outset, seems to problematize scalability thinking. The imperative to scale-up or to expand, that permeates much of the contemporary AI tech landscape, is brought into question. Yet power without scale does not signal simply a move away from the big and voluminous. Instead, it articulates a different kind of relationship between the small and the large in the context of algorithms. It encapsulates the possibilities of the cyclical, circulatory, as well as the recursive in training and fine-tuning AI models (Hui, 2019). This represents an iterative movement between and across different scales. In a sense, distillation displaces this question of scale, and what becomes important is rather what is retained, what is being distilled, and what new models are produced - and can be produced - in the process of distillation. Moreover, in the conclusion of the DeepSeek-R1 paper, the authors point to the importance of distillation.
They indicate the need for further distillation in the future, and greater levels of concentration, rather than it being a fixed or one-off process. There is something recursive in this, with models feeding into models with the aim of greater distillation and therefore ever more concentrated models, a move which can be seen as symptomatic of a recursive society (see Beer, 2022).
It is important to reiterate that the concept of distillation is used in very specific terms and put to work in very specific ways in relation to the R1 model by DeepSeek. At the same time, it is worth noting that this is a concept that has been circulating within AI research for a longer period. In a much-cited paper from 2015, with over 23,000 citations at the time of writing, AI researchers Geoffrey Hinton, Oriol Vinyals, and Jeff Dean develop the notion of ‘distillation’ based on earlier work in computer science on compression techniques in machine learning (Bucila et al., 2006). In Hinton et al.'s paper titled ‘Distilling the Knowledge in a Neural Network,’ the notion of distillation is seen as a ‘kind of training,’ in which a large model (or assembly of models) is trained, with the knowledge it has learned from data then transferred from this larger model to ‘a small model that is more suitable for deployment’ (Hinton et al., 2015: 1). Hinton and colleagues from Google argue that distillation is an effective way to ‘transfer the generalization ability of the cumbersome model to a small model’ (Hinton et al., 2015: 2). In its simplest form, they define distillation as when: ‘knowledge is transferred to the distilled model by training it on a transfer set and using a soft target distribution for each case in the transfer set that is produced by using the cumbersome model with a high temperature in its softmax’ (Hinton et al., 2015: 3).
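Hinton et al.'s definition can be illustrated with a small sketch of the temperature-raised softmax that produces the ‘soft targets’ they describe. The logits below are invented for illustration; real distillation would draw them from a trained teacher network:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Dividing logits by a temperature T > 1 flattens the distribution,
    # which is Hinton et al.'s "high temperature in its softmax".
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

teacher_logits = [9.0, 3.0, 1.0]  # hypothetical teacher outputs for 3 classes

hard = softmax(teacher_logits, temperature=1.0)  # near one-hot prediction
soft = softmax(teacher_logits, temperature=5.0)  # softened "transfer" targets

# At high temperature, the small probabilities assigned to the "incorrect"
# classes become visible; these relative magnitudes are the knowledge the
# small model is trained to reproduce.
assert soft[1] > hard[1] and soft[2] > hard[2]
```

Training the student on the softened distribution rather than on one-hot labels is, in Hinton et al.'s terms, what transfers the large model's way of generalizing.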
Despite potential differences in the use of the concept of distillation, Hinton et al.'s conceptualization is worth dwelling on as they take ‘a more abstract view of the knowledge’ learned by machine learning algorithms (Hinton et al., 2015). In their view, knowledge is free and abstracted from any particular instantiation and signifies ‘a learned mapping from input vectors to output vectors’ (Hinton et al., 2015: 1–2). In short, distillation is about the model's capacity to learn insights from seen input data and apply them to unseen data in the world after training. But more than that, it is about a kind of transfer, or as they state in the paper, the challenge is to know how to ‘change the form of the model but keep the same knowledge’ (Hinton et al., 2015: 1). Yet, they also note that training a large ‘cumbersome’ model poses a challenge since it not only assigns probabilities to what it is tasked with detecting (as with a discriminative machine learning model) but also assigns probabilities to ‘all of the incorrect answers and even when these probabilities are very small’ (Hinton et al., 2015: 2). This has an impact on the extent to which a model is able to learn structures and patterns from a seen dataset and generalize well to unseen input data. This therefore poses the question: how can a small algorithmic model learn to generalize well to unseen data? Or more specifically, how can a small model be trained ‘to generalize in the same way as the large model’? (Hinton et al., 2015: 2).
Hinton and colleagues sought to showcase the efficacy of this approach by training a large neural network on the full MNIST dataset - 60,000 training cases of handwritten digits. Afterwards, they omitted the digit 3 from the dataset or ‘transfer’ set. This means that ‘from the perspective of the distilled model,’ they state, ‘3 is a mythical digit that it has never seen’ (Hinton et al., 2015: 4). But with knowledge having been transferred from the large neural network, they showcased how ‘the distilled model gets 98.6% of the test 3s correct despite never having seen a 3 during training’ (Hinton et al., 2015: 4). In other words, by using the technique of distillation, a small model can be trained to generalize in much the same way, and with much the same accuracy, as a larger neural network model. Although their work differs from Hinton et al. (2015) in some respects - such as training approach and testing dataset - the prior work on ‘model compression’ by computer scientists from Cornell University, Bucila and colleagues (2006: 1), touched upon the same core idea as that of distillation, explaining that ‘in this paper we show how to compress the function that is learned by a complex model into a much smaller, faster model that has comparable performance’. They also comment on how ‘the main idea behind model compression is to use fast and compact model to approximate the function learned by a slower, larger, but better performing model’ (Bucila et al., 2006: 1). Together, however the concept is deployed in these different ways, both papers are suggestive of how the concept of distillation itself is circulating beyond individual cases and is reframing how reasoning, and therefore how AI and machine learning, are approached and developed. DeepSeek, it should be emphasized, make specific use of the term distillation whilst it is circulating more widely through AI research and development.
What, then, is the wider relevance of distillation for our understanding of contemporary AI, and how is it used in the case of the DeepSeek model? As Kate Soule, Technical Director of Product Management for IBM, recently put it, when commenting on the broader socio-economic implications of DeepSeek and its emphasis on distillation, ‘the incentive is to build really big models to help you build really small models’ (Soule cited in IBM, 2025). She added that ‘we are going to converge on to a point where we've got powerful enough tools to craft the smaller models that we need that are going to run 80–90% of our workflows for generative AI in the future’. Scalability and the scaling of AI models, in this perspective, are seen as valuable only insofar as they provide the potential for further processes of model distillation. In this sense, DeepSeek embodies an emergent logic of power without scale which promises that, through iterative processes of distillation, any small model can be made to behave like a big model. In contrast to the idea of ‘Big AI’ (van der Vlist et al., 2024), DeepSeek seems to be making the case for this type of ‘Small AI’. And similar to what Pfotenhauer et al. (2022: 6) say about scale as a concept, distillation has the potential to become ‘an imperative and framing device’ for tech companies and AI researchers that ‘prescribes what seems worth doing, what the rules of engagement are and how we define problems or solutions’. It creates an imperative and might then become prescriptive. DeepSeek's use of distillation, therefore, envisions a future where models increasingly beget other (albeit smaller) models, fit for a range of different real-world tasks. All AI models, regardless of their scale, become potentially distillable or, at least, follow the logic of an imperative to distill.
The reframing of reasoning
At this point we need to approach reasoning not as a fixed technical term, but as a wider set of logics and modes of thinking. This represents another example of where a tension can open up between the use of the concept, with its connotations and visions, and the more rigid technical applications in AI architectures. As the previous section emphasized, distillation is framed as an ongoing and iterative process. This is also reiterated in the framing of DeepSeek-R1 and in the accompanying research paper's concluding references to ‘the teacher model’ (DeepSeek-AI, 2025: 16). As they put it: ‘we further explore distillation the reasoning capability to small dense models. We use DeepSeek-R1 as the teacher model to generate 800 K training samples, and fine-tune several small dense models’ (DeepSeek-AI, 2025: 16). One industry explainer describes the underlying technique in similar terms: ‘One of the most common is called the teacher-student technique. In a nutshell, the teacher model is large, and is trained using standard training data with known, correct target values. The student model is trained by using the predictions of the teacher model instead of using training data. Put another way, the student model learns to mimic the teacher model’ (Pure AI, 2025).
The notion of the teacher model is important in this reframing of reasoning around the concept of distillation. In an overview of distillation in AI for IBM, Bergmann (2024) addresses the problem of scale mentioned previously, as well as the way distillation facilitates mobility: ‘most commercially viable LLMs are too large and computationally demanding to be used locally on mobile phones or other edge devices. This presents various logistical, computational and privacy complications that would otherwise be circumvented with a smaller model that could be run directly on mobile devices. KD's model compression thus presents a promising means to transfer the emergent qualities of large models to models small enough to be run on-device’ (Bergmann, 2024). ‘Knowledge distillation is a machine learning technique that aims to transfer the learnings of a large pre-trained model, the “teacher model,” to a smaller “student model.” It's used in deep learning as a form of model compression and knowledge transfer, particularly for massive deep neural networks… the primary objective in distilling knowledge is to train the student network to match the predictions made by the teacher network’ (Bergmann, 2024). ‘Knowledge distillation techniques aim to not only replicate the outputs of teacher models, but to emulate their “thought processes.” In the era of LLMs, KD has enabled the transfer of abstract qualities like style, reasoning abilities and alignment to human preferences and values’ (Bergmann, 2024).
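The teacher-student arrangement described in these passages can be sketched in minimal form. The ‘models’ here are toy linear maps and the data is random; this is an assumption-laden illustration of the mimicry principle, not a depiction of how DeepSeek-R1 or any LLM is actually trained:

```python
import numpy as np

rng = np.random.default_rng(1)

W_teacher = rng.normal(size=(5, 2))  # stands in for a large pretrained teacher
W_student = np.zeros((5, 2))         # small student, trained from scratch

X = rng.normal(size=(200, 5))        # unlabelled "transfer set"
Y_teacher = X @ W_teacher            # teacher predictions serve as the targets

# The student never sees ground-truth labels; it is trained by gradient
# descent to match the teacher's predictions, i.e. to mimic the teacher.
lr = 0.05
for _ in range(500):
    error = X @ W_student - Y_teacher
    W_student -= lr * X.T @ error / len(X)

mimic_gap = np.abs(X @ W_student - Y_teacher).mean()
assert mimic_gap < 1e-3  # the student now closely tracks the teacher
```

The design point is that the training signal is the teacher's output rather than the original data's labels, which is what the quoted passages mean by the student learning to ‘mimic’ the teacher.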
As well as representing a moment of realization for global capitalism, it was also one for the developers at DeepSeek and, they claim, for the system itself. On page 8 of the accompanying research paper appears a passage where the authors briefly outline what they call the ‘Aha moment of DeepSeek-R1-Zero’ (DeepSeek-AI, 2025: 8). It is an intriguing insight into the meshing of reasoning that occurs in the entanglements between AI models and the people involved in creating them. Effectively, the ‘aha moment’ is a label they have applied to a sudden moment of realization shared by both those developing the system and the system itself. The DeepSeek paper explains that ‘a particularly intriguing phenomenon observed during the training of DeepSeek-R1-Zero is the occurrence of an “aha moment”’ (DeepSeek-AI, 2025: 8). This is clearly a part of an attempt to emphasize the reasoning power of the model, and should be regarded in those terms. Interestingly, carving a space outside of the process, the authors are positioned as the ‘observer’ of the moment. This places them both inside and outside of the process, projecting a type of agency onto a system of which they are both a part and external to. There is a tension in the accounts regarding the human position within the system: an audience member rather than an active protagonist, situating agency within the machine. The Aha moment occurs at an intermediate point, where the development is still uncertain and where a particular act might be seen as pivotal to the next stages of the project. It is also something to be recognized just before and whilst it is occurring. As, ‘during this phase,’ the pace slows, and ‘DeepSeek-R1-Zero learns to allocate more thinking time to a problem by reevaluating its initial approach’ (DeepSeek-AI, 2025: 8). The Aha moment is actually an in-built part of the processing; it is where reflection and reevaluation take place, a sharing of the surprise of exploration.
The DeepSeek authors suggest that this is part of the learning of the model, suggesting that ‘this behavior is not only a testament to the model's growing reasoning abilities but also a captivating example of how reinforcement learning can lead to unexpected and sophisticated outcomes’ (DeepSeek-AI, 2025: 8).
The Aha moment is about process and outcome. It is relevant to the exploration of distillation because it is suggestive of how a form of reasoning power or even a hesitant agency is being attached to the model despite the distillation. This line of argument in DeepSeek's account emphasizes the moment of transition that they thought to be occurring. Yet, in the words of N. Katherine Hayles (2012: 18), it can also be conceptualized as a moment of ‘technogenesis’, where the unexpected ‘Aha moment’ is both produced by and highlights how humans and algorithms are always already enmeshed in complex adaptive systems, where technologies are ‘constantly changing as well as bringing about change in those whose lives are enmeshed with them’. Even if the developers might position themselves outside of the reasoning model, they are enmeshed with it. The suddenness of the Aha moment is about discontinuity, which serves as a reminder of the ways in which humans and algorithms intermingle and co-emerge. The accounts of DeepSeek also provide an example of what Luciana Parisi (2019a, 2019b: 32) has called ‘the alien subject of AI’, where algorithms can be thought to exhibit their own albeit denaturalized or ‘alien mode of thought stemming from within the instrument’. Appearing mid-way through, this moment of realization is associated with a particular set of ideals for the reformulation of AI models around the concept of distillation.
The notion of ‘distillation’ that the paper advances constitutes a kind of power deriving from the reasoning capability and reinforcement learning of a distilled version of the larger models. As we have seen, DeepSeek's focus on distillation does not come from nowhere. Rather, it is a concept that was already circulating among AI developers, taking on differing forms, expressions and applications in varied contexts. The concept is cut free of its original reference points and repurposed in different ways, such as in the DeepSeek-R1 paper, shaping understandings of AI and AI futures. Indeed, this is DeepSeek's stated objective, with the claim that ‘the open source DeepSeek-R1, as well as its API, will benefit the research community to distill better smaller models in the future’ (DeepSeek-AI, 2025: 4). One of the subheadings within DeepSeek's article reads: ‘Distillation: Empower Small Models with Reasoning Capability’ (DeepSeek-AI, 2025: 11). The thing being distilled in this model is reasoning capability. What DeepSeek's paper represents, then, is an attempt to reframe reasoning in particular terms, in line with the features of the R1 model and its so-labeled Aha moments. We can then ask what ‘regime of recognition’ (Amoore, 2020) is present within this reframing of reason, and what attributes of reason such a notion includes and is built around. There are clues to this in DeepSeek's account of the R1 model.
Terms like fine-tuning and refining are used to emphasize the particular repurposing of existing approaches and ‘base models’, which is followed by the passing back of those ideas to developers of AI more broadly. They explain that in order to ‘equip more efficient smaller models with reasoning capabilities like DeepSeek-R1, we directly fine-tuned open-source models like Qwen (Qwen, 2024b) and Llama (AI@Meta, 2024) using the 800k samples curated with DeepSeek-R1… Our findings indicate that this straightforward distillation method significantly enhances the reasoning abilities of smaller models’ (DeepSeek-AI, 2025: 11). The paper adds: ‘For distilled models, we apply only SFT and do not include an RL stage, even though incorporating RL could substantially boost model performance. Our primary goal here is to demonstrate the effectiveness of the distillation technique, leaving the exploration of the RL stage to the broader research community’ (DeepSeek-AI, 2025: 11).
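For readers unfamiliar with the technique being quoted, the recipe DeepSeek describe can be sketched in a few lines of code. This is an illustrative sketch only, not DeepSeek's actual pipeline: the teacher here is a stub function, and all names are invented for the example. The point is the shape of the method, in which a large ‘teacher’ model generates worked outputs that become the supervised fine-tuning (SFT) dataset for a smaller ‘student’ model, with no reinforcement learning stage.

```python
# Illustrative sketch (not DeepSeek's code) of sequence-level distillation:
# a large "teacher" model's outputs are curated into an SFT dataset for a
# smaller "student" model. The teacher is a stand-in stub here.

def teacher_generate(prompt: str) -> str:
    """Stand-in for a large reasoning model producing a worked answer."""
    return f"<think>step-by-step reasoning for: {prompt}</think> answer"

def build_sft_dataset(prompts):
    """Curate (prompt, teacher-output) pairs, analogous to the paper's
    800k samples generated with DeepSeek-R1."""
    return [{"prompt": p, "completion": teacher_generate(p)} for p in prompts]

dataset = build_sft_dataset(["2+2=?", "capital of France?"])

# The student would then be fine-tuned on these pairs with a standard
# next-token cross-entropy objective; no RL stage is applied, mirroring
# the SFT-only recipe quoted above.
```

The sketch makes visible what is at stake in the quoted passage: the smaller model never learns from the world directly, only from the traces of the larger model's reasoning.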
The benchmarking of power and potential: Comparing and evaluating distillation effects
The reframing of reasoning in AI through the concept of distillation is underpinned and justified by certain forms of measurement, that is, by ways of determining the comparative efficacy of distilled models and their reasoning capabilities. The benchmarks become a means of measuring reasoning, thus recognizing certain features that frame what reasoning constitutes. As such, a complex question that is made to seem simple could be: which reasoning is better? We have already mentioned the use of benchmarking to show the power at reduced scale that was seemingly enabled through processes of distillation in AI. Benchmarks commonly refer to datasets that ‘represent certain tasks or technical challenges and are routinized for comparing, replicating, and reproducing model results’ (Orr and Crawford, 2024: 4956). More than that, benchmarks have been considered a ‘major driver of progress in AI’ because, as AI researcher Francois Chollet (2019: 8) states, ‘they are reproducible (the test set is fixed), fair (the test set is the same for everyone), scalable (it is inexpensive to run the evaluation many times), easy to set up, and flexible enough to be applicable to a wide range of possible tasks’.
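The properties Chollet lists can be made concrete in a minimal sketch. This is a toy illustration under invented assumptions (the test items, model names and answers are all fabricated for the example): a fixed test set, the same for every model, scored by a single reproducible metric, here exact-match accuracy.

```python
# Illustrative sketch of benchmark logic: a fixed test set (reproducible),
# identical for every model (fair), scored by one cheap metric (scalable).
# All items and "models" are invented for the example.

FIXED_TEST_SET = [
    ("2+2", "4"),
    ("10/2", "5"),
    ("3*3", "9"),
]

def accuracy(model_fn):
    """Fraction of benchmark items the model answers exactly."""
    hits = sum(model_fn(q) == gold for q, gold in FIXED_TEST_SET)
    return hits / len(FIXED_TEST_SET)

# Two toy "models", represented as lookup functions.
model_a = {"2+2": "4", "10/2": "5", "3*3": "9"}.get  # answers all three
model_b = {"2+2": "4", "10/2": "5", "3*3": "6"}.get  # errs on the third

# The leaderboard ranks models by their benchmark score.
leaderboard = sorted(
    {"model_a": accuracy(model_a), "model_b": accuracy(model_b)}.items(),
    key=lambda kv: kv[1], reverse=True,
)
```

What the sketch also shows, in line with the argument of this section, is that the benchmark silently decides what counts: a model is only ‘better’ with respect to the fixed set of questions and the single metric chosen in advance.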
The section of DeepSeek's paper entitled ‘Distilled model evaluation’ includes a table providing a ‘Comparison of DeepSeek-R1 distilled models and other comparable models on reasoning-related benchmarks’. In analysing the table, DeepSeek expand that: ‘As shown in Table 5, simply distilling DeepSeek-R1's outputs enables the efficient DeepSeek-R1-7B (i.e., DeepSeek-R1-Distill-Qwen-7B, abbreviated similarly below) to outperform non-reasoning models like GPT-4o-0513 across the board. DeepSeek-R1-14B surpasses QwQ-32B-Preview on all evaluation metrics, while DeepSeek-R1-32B and DeepSeek-R1-70B significantly exceed o1-mini on most benchmarks. These results demonstrate the strong potential of distillation. Additionally, we found that applying RL to these distilled models yields significant further gains. We believe this warrants further exploration and therefore present only the results of the simple SFT-distilled models here.’ (DeepSeek-AI, 2025: 14)
Take the example of ImageNet, considered one of the main catalysts for the deep learning revolution of the early 2010s. ImageNet was an influential benchmark dataset for computer vision research. More specifically, it figured as a crucial resource of images for training and testing AI models on object detection and classification (Denton et al., 2021). Yet while the ImageNet dataset provided the AI community with a large dataset of diverse categories of images, it also constituted a kind of standard that helped shape the norms of AI development, with machine-learning-based models almost entirely replacing rules-based systems in subsequent ImageNet competitions. As Orr and Crawford (2024: 4956) argue, benchmark datasets ‘not only set the problems that are deemed worthy of solving by the community, but they also constitute the very metrics of success’. This means, they continue, that ‘for a model to be considered state-of-the-art, it must correctly predict the outputs of popular benchmarks’. In other words, what is considered ‘good performance’ is heavily predicated on and shaped by the benchmark dataset against which a model is evaluated and measured. As has since been shown, such benchmark standards can be heavily skewed in terms of their representation of gender or race (Crawford and Paglen, 2019; Noble, 2018), resulting in algorithmic models that perpetuate ethico-political biases in society.
In the DeepSeek paper, distillation is treated in similarly standardizing ways, setting ‘formats’ (Koopman, 2025) in which different distillation rates can be compared across different models. The power of distillation is seen predominantly through the lens of model performance and what it makes possible against some already-known benchmark dataset. In the section ‘Distillation v.s. Reinforcement Learning’, they claim that ‘we can see that by distilling DeepSeek-R1, the small model can achieve impressive results’ (DeepSeek-AI, 2025: 14). The benchmarking performs the role of legitimating such claims; it is what makes forms of model evaluation possible - whilst also setting the parameters of what distillation and reasoning are thought to represent. This benchmarking is what makes AI reasoning capabilities in certain models seem more reasonable than others. Yet this again is not considered an endpoint in the transformation of AI being proposed.
From this combination of benchmarking with a sense of ongoing reframing, DeepSeek concludes that ‘…RL training, achieves performance on par with QwQ-32B-Preview. However, DeepSeek-R1-Distill-Qwen-32B, which is distilled from DeepSeek-R1, performs significantly better than DeepSeek-R1-Zero-Qwen-32B across all benchmarks’ (DeepSeek-AI, 2025: 15).
The comparison (or competition) between distillation and reinforcement learning led DeepSeek to draw two conclusions: ‘First, distilling more powerful models into smaller ones yields excellent results, whereas smaller models relying on the large-scale RL mentioned in this paper require enormous computational power and may not even achieve the performance of distillation. Second, while distillation strategies are both economical and effective, advancing beyond the boundaries of intelligence may still require more powerful base models and larger-scale reinforcement learning.’ (DeepSeek-AI, 2025: 15)
Conclusion
It is necessary to develop an understanding of how concepts and AI interrelate - with concepts shaping models and architectural developments, and AI in turn shaping the conditions of emergence and circulation of concepts. Our contention is that the tensions between concepts and architectures need to be maintained for the sociological aspects of AI to be understood. Here we took a particularly notable case in order to examine the conceptual formations within it. In this article we have explored the related conceptualizations of distillation, density and power within the AI sector as they emerge within accounts of AI developers, in this case by closely reading DeepSeek's key explanatory research output associated with their influential R1 reasoning model. Whatever happens with DeepSeek in the future, our point is that the concept of distillation is likely to endure from this juncture. By focusing on a close reading of an AI developer output, in the form of the specific AI company research document that supported the release of DeepSeek's R1 model, we have shown how, in this specific case, the concept of distillation has become central in these particular transformations of AI reasoning. This, we have suggested, represents a reframing of reasoning within the logics of AI, in line with indicators or benchmarks of reasoning power. In particular, in this instance it is a reframing of reasoning around the core concept of distillation. The article has explored how distillation is used to present a difference between large and smaller, more concentrated AI models. Those seemingly denser AI models are framed as offering a type of power without scale (thus seeking to apply that concept to reverse wider expectations around the necessity of scaling-up processing). This reframing enables a notion of power to be attached to the ‘reduced’ and compressed models that have undergone distillation processes, which remove the processing demarcated as surplus to requirements.
The article showed how reasoning capability is understood to be distilled and yet retained, and then how that capacity is compared and benchmarked to reinforce or evidence the perceived power of distillation, both as a concept and as a computational technique.
As we have shown, distillation is a concept that has been circulating in AI research and development for some time. Yet it has been given new specificity and purpose by the impact of the release of the DeepSeek-R1 model in early 2025. Putting aside the economic and technical impacts of that moment, the concept of distillation, as an organizing and framing presence, is central to that development and to the perceived shift in the direction of AI now and into the future. With an eye on that future-making, DeepSeek offer a future plan in the conclusion to their paper. As such, the concept of distillation has particular importance for understanding the entire imaginary of AI and, therefore, its material directions. In other words, closely reading the DeepSeek research paper that accompanied its R1 model shows the importance of a repurposed concept of distillation as a central notion in the development of AI. Distillation is associated with the reduction in size of AI models, with the consequent associations of increased density and the achievement of a perceived concentration of processing power. Power is achieved here through a kind of imperative to subtract. That is to say, in these visions, the unnecessary is removed to leave a denser version of the model. The image is of a leaner, less energy- and resource-thirsty version of AI that still retains the reasoning capabilities of a much larger model. The broader concentration of power around the development of AI is usurped by the concentration of power within a single model. The consequence is that the concept of distillation is guiding this innovation, and its influence means it is likely to frame other innovations.
We have focused on one important and influential example in this article, yet our key point is that the concept of distillation, amongst other concepts, has a life before and after that particular DeepSeek model. We suggest that it is by examining such concepts and their relations with AI modeling that we can understand the sociological dimensions of AI, along with the processes and logics that shape its form and conditions of development. The concept of distillation itself also has consequences for the future political economy of AI more broadly, with the pursuit of models that are seen to represent a blend of power and efficiency of architecture. Concepts shape the models, and the models in turn shape the life of the concept. We cannot know what direction this will take, but, as we have shown, the concept of distillation has been reanimated in the development and framing of the DeepSeek reasoning model discussed here, and will continue to be reanimated in the framing of future models. This, then, is where an attentiveness to the life of concepts within AI is potentially revealing of the interface between the technical and the social. As reasoning models develop and adapt, attempts to distill and produce concentrated and smaller models that facilitate greater levels of power without scale will remain a framing objective. The concept of distillation, and the reframing of reasoning that it brings, has a structuring potential in AI development and in how those logics will then be integrated into social worlds.
Footnotes
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
