Abstract
On the verge of the AI Age—the Mechanocene—we tend to reproduce past narratives where humans are replaced by machines. This leads us to identify skills and activities that will not be automated by AI soon, and to prepare both current and future generations for those ‘safe’ occupations. However, there is a risk that the set of non-automatable tasks becomes empty sooner than expected, leaving new generations trained in skills that will not be needed in the end. Instead of this incremental route, in this paper we imagine a final destination where AI has all the skills humans have. We argue that in an age where humans can be intensively assisted, augmented and coupled with AI, we need to rethink enhancement and assessment under the extended mind thesis—a philosophical theory suggesting that technological tools can become integral parts of our cognitive processes. Under this perspective, we can still identify skills that are useful in this AI age, especially if humans want to understand and influence their world. Then, if these skills are set as the goals for education, we need to reliably assess that they are achieved. However, a world of ubiquitous AI extenders generates enormous challenges for the assessment of individuals, since humans will mostly be operating as part of AI-human hybrids and collectives. In this context, the skills of the individual, human or machine, must be evaluated in terms of their contribution to the human-machine teams they are expected to be embedded in. This further takes education and assessment from the traditional aspiration of achieving fully autonomous skills to the reality of the more integrated and interdependent human-machine scenarios of the AI age.
Introduction
One of the goals of education is to prepare younger generations for the future. But how are we going to do this for an AI future? One way of narrowing this question is to determine what skills will be needed in ten or twenty years’ time, and to ensure the new generation achieves those skills. But what skills will be needed?
This question is further narrowed down to determining the skills that are needed for the future of work, or associated with the question of what AI will be able to automate in the workplace. And here, despite the efforts of economists, AI researchers and labour specialists, among others, the uncertainty is palpable. Our predictions have been failing systematically. In 1995, Steven Pinker argued: “Most fears of automation are misplaced. As the new generation of intelligent devices appears, it will be the stock analysts and petrochemical engineers and parole board members who are in danger of being replaced by machines. The gardeners, receptionists, and cooks are secure in their jobs for decades to come.” (Pinker, 1995, p. 193)
This was a time when AI could not tell cats from dogs or distinguish types of tree leaves from a close-up photo. Hence the gardeners were safe for decades to come. But about two decades later, once deep learning was solving many of these perception problems, it was the reasoning problems that were challenging for AI. The conclusion from Frey and Osborne's analysis in 2017 was that white-collar jobs were safe. Indeed, the “financial analysts”, “chemical engineers” and “judges, magistrate judges, and magistrates” had probabilities of automation of 0.23, 0.017 and 0.4 respectively, whereas the “landscaping and groundskeeping workers”, “receptionists” and “cooks” had values of 0.95, 0.96 and 0.95 (Frey & Osborne, 2017). The gardeners were going to lose their jobs. Just the opposite of Pinker's prediction.
Only about five years later, new analyses have changed the situation again (Brynjolfsson et al., 2024; Cazzaniga et al., 2024; Eloundou et al., 2024; Tolan et al., 2021; Wiles et al., 2024). This is especially the case after the new generation of multimodal models has taken the world by surprise: as of today, they can do complex mathematics, chemistry, video editing, summarising, information retrieval, and so on, and so-called generative AI can create sound, music, images, video and full computer programs from a single prompt. Basically, all the knowledge-intensive occupations are in danger. The “financial analysts”, “chemical engineers” and “judges, magistrate judges, and magistrates” are back on top of the list. And the top of the list is already being strongly affected by automation: illustrators and generalist translators are simply in the course of being automated. It is not only those occupations that handle knowledge and process multimedia inputs and outputs; new AI models that reason much better are around the corner (Zhong et al., 2024), and then agents (Shavit et al., 2023), and so on. The pace is accelerating.
So, should we remove knowledge and reasoning from the curriculum of our students? Clearly not. There are two major mistakes in this simple rationale. First, education not only prepares for professions, but for many other activities; above all, basic education is a fundamental right that must allow us to understand the world and interact in society, in the spirit of the Enlightenment. We learn what a logarithm is because it is useful for understanding the world (decibels, earthquakes, etc., are expressed on logarithmic scales), not because most of us need to calculate them anymore. Second, even if we do not make the first mistake of thinking that education is just professional training, we can make another mistake: we may think that, in a future where AI can do almost everything humans can do, new generations will simply ask their AI assistants how the world works and how to act in society, and, as a consequence, they will not understand how the world works or why those actions are taken. The error here is more subtle: we tend to have the perception that, because we are using a calculator to get the square root of the 256 square metres of the area of a factory roof, it is the calculator and not us that is solving the problem. The truth is that we select the input and the operation, and we verify some elements of the output in a more global context than the operation itself. We still feel we are solving the problem.
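As a concrete illustration of both examples (the arithmetic is elementary and the numbers come from the text above): decibels are defined on a logarithmic scale, so a 20 dB increase means a hundredfold increase in power, and the roof example reduces to a single square root whose meaning, rather than its computation, is what matters:

$$L_{\mathrm{dB}} = 10 \log_{10}\!\left(\frac{P}{P_0}\right), \qquad \sqrt{256\ \mathrm{m}^2} = 16\ \mathrm{m}.$$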
The confusion comes from a simplistic interpretation of Figure 1. As AI skills increase, we tend to think that human skills will recede (children do not learn how to solve square roots these days) and be displaced by novel skills (such as humans “prompting” or supervising AI).

Figure 1. Evolution of human skills and AI skills as decoupled entities. The black arrows represent that AI skills are growing and intersecting more and more with human skills, but also introducing novel skills in other directions. The blue arrows represent that human skills are evolving too, partly because of the push of AI, exploring new areas but also receding in some others, because those skills are no longer needed.
Disentangling this dilemma will require an understanding of the “extended mind thesis” (Clark & Chalmers, 1998). This thesis postulates that those cognitive processes that take place outside of our skull are a genuine part of our cognition, provided there is sufficient coupling between what the brain does and the external tools, such as a person making a calculation with pen and paper. When this thesis is brought to the realm of AI, we need to understand what kind of AI systems can work as extenders. We will revisit the notion of AI extender in the following section.
From here, we will be able to explore possible scenarios of AI, including the one in which AI can do everything a human can do. In the possible trajectories to this maximalist scenario, we will not ask what skills will be left for our children to learn, but how we want to enhance children and adults, in such a way that we, as extended minds, understand the world and interact with society. And, if this is clear, how will we determine that the enhancement is going in the right direction, that is, how must assessment change?
From the perspective of the education of the next generations in the age of AI, there are many open questions, but there is a palpable tension between students using AI and keeping their cognitive and metacognitive autonomy (Yan et al., 2024), especially in evaluation: “how can we assess learning to reflect genuine knowledge and skill development rather than an AI-created performance illusion?” One perspective that can soften this tension is the extended mind thesis, and through it we can transition from the questions of what students should learn and how, to the thornier question of how they should be evaluated. This paper navigates this space, without the intention of being comprehensive or authoritative about this matter.
AI extenders
While the extended mind thesis has always been understood in the context of the relation of humans with tools and technology, including computers and AI (e.g., Carter & Nielsen, 2017), it is Hernández-Orallo and Vold (2019) who analyse this thesis specifically in the context of different kinds of AI, and situate it alongside other uses of AI that are also invoked when talking about human enhancement through AI. They present a continuum between fully externalised systems, loosely coupled with humans, where the human may become redundant, and fully internalised processes, with operations ultimately performed by the brain, making the tool redundant. We can see this continuum in Figure 2.

Figure 2. Kinds of AI systems and how they affect human cognition in the externalisation–internalisation continuum. Extended cognition happens outside the brain but can be considered a way of empowering humans. However, in contrast with internalised cognition, the effect is lost when the AI is not available or malfunctioning.
The four levels can be described as follows:
Autonomous AI: This was the most common interpretation until the recent AI chatbot revolution. AI was considered to be incarnated as an agent, not only robots but also digital assistants that would perform tasks on their own. The dominant narrative associated with autonomous AI is that the whole process is automated (replaced). For instance, you order a robot to clean a room.

Externalised AI: This is a very common interpretation today. Here, we split tasks or questions into chunks, and we give them to an outsourced service. This is sometimes known as cognition, or AI, as a service. The dominant narrative is partial automation (subprocesses are replaced). For instance, you ask a large language model to translate the instructions of a lamp into multiple languages before shipping it.

Extended cognition: The tool here is highly coupled and needed for the task, in such a way that the result would have been different, very inefficient or impossible without it. The dominant narrative is that people are empowered (machines and humans are coupled). For instance, you use a large language model to generate a bullet list of advantages and disadvantages of genetically modified food, you go through them, asking the model to refine some of them, and you end up writing an essay using the final list.

Internalised cognition: Here, the task is initially done externally, but is then reproduced internally through a process of teaching or explanation from the AI to the human. AI generates culture (new words, concepts, ideas, representations, etc.) that can be transmitted to humans. The dominant narrative is that people are enlightened, with computers that discover new things that are taught to us afterwards. For instance, a language model invents a new recipe that you learn and prepare regularly for dinner.
Of course, ideally, we would like to always have internalised cognition. When we use an AI-assisted app to learn a new language, and we actually learn a language, then we become autonomous. However, as AI can do more and better, it would not be practical or even possible to internalise everything AI can do.
If we do not want to do the task ourselves, autonomous and externalised AI are fine. There are situations where autonomous AI is ideal (e.g., mine exploration), and others where externalised AI is (e.g., gathering relevant readings for a trip). But if we really want to be empowered, feeling we are doing the task ourselves, then we can choose the extended cognition paradigm. However, this is sometimes difficult to distinguish from externalised AI, because in all these cases there is dependency. The key element, according to Hernández-Orallo and Vold (2019), is the coupling between the human and the AI system: “a cognitive extender is an external physical or virtual element that is coupled to enable, aid, enhance, or improve cognition, such that all—or more than—its positive effect is lost when the element is not present” (p. 509). The same paper claims that cognitive extenders using AI should be treated as distinct from other cognitive enhancers by all relevant stakeholders, including developers, policymakers and human users.
Interestingly, it seems that more coupling with the system, in contrast to the detachment of the first two levels (autonomous AI or externalised AI), is the key to empowering humans. This interpretation will have important implications in the analysis of education and assessment in the following sections. But let us explore some of the future scenarios first.
Fully transformative AI scenarios
When thinking about the future of skills in the age of AI, or any other future, we can think incrementally from the present and see how far we can forecast. Typically, uncertainty increases as we move away from the present time. But there are forecasting problems for which we are more certain about the destination than about the route or the timeframes to the destination. The future of AI may be one of those problems. It is quite plausible to believe that AI will one day have all the skills humans have. Then the exercise becomes playing the movie backwards to the present day or, if the destination is expected within a lifetime, preparing for that destination right now. Which approach—route or destination—seems more appropriate for analysing the future of human enhancement and assessment with AI?
How successful or insightful has the route approach been so far? In terms of incremental forecasting, some of the most challenging and volatile questions of our time deal with the future of AI. For instance, the forecasts about human-level machine intelligence used to have confidence bands of decades (Armstrong et al., 2014; Zhang et al., 2022a, 2022b). Today, some experts say genuine AGI (a shortcut for Artificial General Intelligence, but meaning many different things, such as human-level intelligence) will never happen with the current paradigm (Marcus, 2022), while others claim it will happen in a couple of years (Aschenbrenner, 2024). Timelines aside, the anticipation of which skills will be automated first has not been especially insightful either, according to the examples given in the previous section.
So let us take the optimistic stance that AI technology reaches an idealistic point, the destination, which we will define more precisely than by the term AGI. Let us assume that in less than twenty years there is an AI system that (1) can do every cognitive activity a human can do, (2) can be deployed safely, (3) is allowed to be deployed by the existing regulation and (4) is economically cheaper than humans. All these conditions together in less than two decades is perhaps unlikely for many, but not totally implausible. The important thing is that this maximalist scenario helps us understand the questions about what skills humans would need, how to achieve them and how to evaluate them. Also, the answer to these questions affects the current cohort of small children, who will become adults around that time. In other words, under this scenario, we should be implementing this now. As usual in any analysis of the future, there is another advantage of considering this scenario for education and the future of work: it is useful to prepare for the most disruptive scenario, because it needs more preparation, while the less disruptive ones require fewer paradigm shifts in conceptions and policies.
A similar maximalist scenario of human-level capabilities is considered by the OECD (2024c), though only for scientific reasoning and problem solving: not because it is necessarily the most plausible, but as a thought experiment used to clarify how education should be overhauled under that scenario, to the point of revisiting the purpose of education itself (e.g., “scientific literacy as a fundamental right”, OECD, 2024c).
Let us start by looking at how tasks are solved by humans and by AI. A naïve vision of automation (by machines, not only by AI) is represented by a process of “substitution”: the human is replaced by the AI system, but the task is performed in a very similar way. Although this is the most common analogy, this situation almost never happens. Even calculators, which seem to do the same operation we would do without them, perform the calculations better and faster than us. We have an augmentation of the operation, not simple replacement. But in many other cases, the process is modified or totally redefined, even with new human operations involved. For instance, passport control is usually performed by scanning the document first and then recognising the person's face, with some human operators moving around to give instructions and help people when they get stuck at the machine. These operations are not something that only happens with AI or computers; they have happened with tools throughout the history of human technology: agriculture is the best example of new tools augmenting, modifying and redefining a task. Frey and Osborne (2023) illustrate this with another well-known example: we did not “automate away the jobs of lamplighters by building robots capable of carrying ladders and climbing lampposts”.
This diversity of ways in which technology affects tasks is well captured by the Substitution Augmentation Modification Redefinition (SAMR) model, illustrated in Figure 3. The cleanest operation is at the bottom, where a human activity or task is replaced by a tool (e.g., an AI system) and nothing else changes. Moving up in the model, the human is still replaced by an AI solution, but the operation becomes better or enhanced. At the top of the model, we see modification and redefinition, with different degrees to which the task and its context can be altered. The connection of this model with the kinds of systems in Figure 2 is not direct; for example, an autonomous AI could be used at any of the four levels of the model, but of course internalised cognition is incompatible with the “substitution” level.

Figure 3. The substitution augmentation modification redefinition (SAMR) model (Hamilton et al., 2016; Puentedura, 2009). Image taken from Puentedura (2014). Much of the debate revolves around the two operations at the bottom, but most of what is happening with the penetration of AI technology occurs at the top two operations.
Looking at Puentedura's SAMR model, we may have the impression that modification and redefinition, placed at the top of the model, are the ‘poor’ cases. Under this perception, if we cannot do substitution or augmentation because AI cannot replace humans exactly, then we must use some tricks in the process, instead of creating an AI system that really has the capability that humans were employing to do the task (as with the lamplighters). Of course, exact replacement in an economical way may seem ideal organisationally, because a simple swap of human for machine would be enough, but the potential is limited and the machine will require maintenance, so it is never a perfect replacement. The reality in many cases, and increasingly so in the future, is that AI solves a task in a different way because the way humans solve it is quite inefficient or suboptimal. Indeed, for tasks that have already been automated, what we see is that as AI technology improves, the tasks are iteratively transformed further for the following generations. This process is seen in logistics centres: initially, a few robots moved parcels and products around the warehouse, with plenty of painted lines and codes here and there, given the limited perception capabilities of these robots. After a few years, the whole warehouse is redesigned for the next generation of robots, in a permanent co-evolution.
But because most of the transformation today happens through the redefinition or modification of tasks as they pass from humans to AI, it is no surprise that the penetration of AI is not as fast as the technology would allow. The penetration of generative AI in companies grows more slowly (McKinsey, 2023) than is perceived in society, especially in comparison to education and research, where the pace is accelerating. One explanation of these different speeds may be that education and research are knowledge-intensive, which is where large language models are especially good. Another explanation is the profile and demographics of the users in these areas. For instance, children and young people are more open to changes in the way they do things. Especially in education, it may be the first time students do a task, and they are open to doing it differently from the way it has traditionally been performed by other humans. Indeed, some preliminary evidence shows that students use generative AI in a different way from their teachers (Clos & Chen, 2024).
In sum, a more realistic and holistic view is the perspective of Transformative AI (Gruetzemacher & Whittlestone, 2022), seen as a general-purpose technology fuelling chains of transformation. So, let us imagine a situation where everything humans do can be done by robots, with warehouses, cities, shops, hospitals, schools, and so on constantly being redesigned for more efficient ways of performing their tasks. Would that world be incomprehensible to us, with no possibility of interaction? If understanding the world and interacting with it are the two main reasons for education, are they equally well served by externalisation and by extension?
Enhancement
The expectation in this maximalist scenario, using a traditional view of cognition, is that humans may not be able to understand the world and interact with it effectively, at least in those areas where AI becomes significantly involved, with interactions beyond human speed and capabilities. The only way to keep up the pace is for humans to become enhanced as well. This does not necessarily mean implanting chips in our brains; there is a lot we can do by using AI as coupled assistants, especially if we consider, as we have claimed, the human-AI hybrid system as an extended mind.
There is in fact a long tradition of the analysis of human enhancement as the result of technology (Bostrom & Sandberg, 2009; Racine et al., 2021). This can happen through external tools or through internal interventions (drugs, surgery, genetics, etc.). Here we will focus on the former, and in the context of the degrees of coupling of the enhancement. We can contemplate activities where the task is performed by the enhanced human in the same way as it was (augmentation) but more often as a transformation of the task, moving up in Puentedura's SAMR model.
While the scenario we are considering is extreme and years ahead, cognitive enhancement is not only an option for an ageing population (Ziegler et al., 2022); it is a situation that we see already, especially in students. Students are enhanced by smartphones, enhanced by computers, enhanced by their social network of contacts, and now enhanced by generative AI. And this is accelerating.
Students, especially at higher levels of education, use or will use grammar editors, translators and other NLP tools, multimedia editors and generators, and generic or domain-specific assistants (e.g., programming assistants), and this will go beyond Q&A or conversational interaction, as more complex tasks and modes of interaction become possible with these tools. This is on par with the evolution of these students who, once they become alumni (as adults), will use plenty of AI tools.
In many of these interactions, we tend to see positive uses and negative uses. The distinction seems to rely on “knowing what we are doing”, or the capability of supervising and intervening in the process. Let us illustrate this with an example. Imagine Alice tells an assistant to do the shopping for her, and seconds after her command she receives the notification of the purchased list, which is exactly what she would get if she had done it manually. This is an externalised situation according to the externalisation–internalisation continuum in Figure 2. But imagine that the bot doing the shopping finds that there is a new brand of toothpaste, or that there is an offer on it. Depending on how much the AI system knows about Alice's preferences, it will interact with Alice through questions, to get to know her preferences better or to inform her of new possibilities. This interaction is not, as it is today, due to a lack of autonomy of the agent (it gets stuck and needs to ask the human for help); it is more about ensuring that the human is empowered in the activity and gets exactly what they would like to get. Also, as a result of the process, Alice may get a different shopping list from the one she would assemble manually. In addition, she may have learnt (internalised) that it is better to buy the new toothpaste, changing her knowledge (cognition) and behaviour if she were to do the shopping manually the following week.
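A minimal sketch of this interaction pattern in Python (all names, brands and preferences here are hypothetical, and no real assistant API is assumed): the agent acts autonomously on routine items, but hands control back to Alice when it detects an option outside her known preferences, which is exactly where the coupling, and the potential internalisation, happens.

```python
# Toy sketch of a human-in-the-loop shopping agent. Routine items are
# bought autonomously from stored preferences; novel options are
# referred back to the human for supervision and intervention.
KNOWN_PREFERENCES = {"toothpaste": "brand_a", "coffee": "brand_b"}

def shop(shopping_list, novel_offers, ask_human):
    basket = {}
    for item in shopping_list:
        if item in novel_offers:
            # New brand or offer detected: keep the human in the loop.
            answer = ask_human(f"New option for {item}: "
                               f"{novel_offers[item]}. Take it? (y/n) ")
            basket[item] = (novel_offers[item] if answer == "y"
                            else KNOWN_PREFERENCES.get(item, "default"))
        else:
            # Routine purchase: act autonomously on stored preferences.
            basket[item] = KNOWN_PREFERENCES.get(item, "default")
    return basket

# Simulated human who accepts the new offer (use `input` to try it live).
print(shop(["toothpaste", "coffee"],
           {"toothpaste": "brand_c_on_offer"},
           ask_human=lambda prompt: "y"))
# -> {'toothpaste': 'brand_c_on_offer', 'coffee': 'brand_b'}
```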
The example above illustrates this supervision and intervention as key elements of empowerment, but also the possibility of an extended mind experience and even some internalisation. Now, the question is rooted in philosophy but has very important implications for our understanding and use of AI in the future. Has Alice been enhanced? Is she more cognitively capable? If we give a positive answer to this question, we can re-understand the evolution of cognition in AI and humans, as represented in Figure 4.

Figure 4. Evolution of human skills and AI skills under the extended mind thesis, in a future where AI skills subsume all human skills. Compare the area of human skills with Figure 1. The black arrows represent that AI skills are growing. The blue arrows represent that human skills are growing following AI skills, not retreating from them as in Figure 1. As AI becomes capable of more things, these capabilities can be internalised, or counted as actual human skills when obtained through a tightly coupled AI extender (in green).
In this figure, we see that human skills will be increasing, not decreasing, as long as we count those activities in the green area, where AI is coupled with us (extended cognition), or those that lead to internalisation. It is the green area that pushes human skills, in a future where AI skills will supersede those of humans. This can include higher levels of existing skills (e.g., translating languages faster and more effectively with the use of AI) or new skills that humans do not have (e.g., detecting that a person has the flu from the tone of their voice).
This has implications for education, both in what skills to prioritise and in how to redesign the learning processes, towards a use of technology that focuses on the two kinds of cognition on the right of Figure 2. Of these two, of course, we should try to internalise whenever possible, and that includes many of the skills covered in the early stages of education; but the new ways of coupling with AI systems and telling them our preferences emphasise some other skills, especially at higher levels of the educational process.
In particular, humans will need to give precise commands, supervise quality, and understand the ethics of the situation and the interactions of the process with other humans and AI systems. These are sometimes referred to as “fusion skills”, with the list being expressed slightly differently depending on the source: “intelligent interrogation, judgment integration, and reciprocal apprenticing” (Daugherty & Wilson, 2018; Wilson & Daugherty, 2024) or “critical thinking, problem-solving, self-regulation, and reflective thinking skills” (Yan et al., 2024). These are skills that require metacognition and critical thinking above all, more than knowing facts or knowing how. Knowing why and what for, and dealing with trade-offs involving costs and ethics, may become more important in the future.
It is not the goal of this paper to analyse all these skills and situate them in the context of particular educational levels, but to raise awareness that, in a world where AI will be able to do our cognitive work, we will end up with interactions where AI systems are always present, and extended cognition will be the most common situation in the continuum of human–AI interactions that really empower humans.
Having said this, we need to compare the benefits and risks even in this preferred situation. With the extended mind, we have several positive effects:

Empowering humans: under the extended mind interpretation, as seen in Figure 4, this makes humans more capable, not less.

Dealing with the cognitive decline of an ageing population: this can compensate for a shrinking area when ageing (the blue area minus the green area).

Equalising cognitive abilities: this can help compensate for some profile limitations in capabilities, even if at the moment the results are not conclusive (Brynjolfsson et al., 2023; Choi et al., 2023; Dell’Acqua et al., 2023; Noy & Zhang, 2023; Wiles et al., 2024).

Allowing humans to catch up with a world that is more and more redesigned for AI: we may rely on AI to interact with very sophisticated AI systems, in the way we are using large language models to create computer programs to run on other machines.

But AI extenders also have several negative effects and potential risks (Hernández-Orallo & Vold, 2019):

Atrophy and safety: with the continuous use of AI, human cognition may suffer some reorganisation, or even atrophy, of several cognitive functions. For instance, the Google effect (Brabazon, 2006; Sparrow et al., 2011) shows that we do not memorise facts directly but just remember where to find them. As with any other everyday technology that creates overreliance, from pencil and paper to the Internet, when it fails or is no longer available, humans cannot easily operate autonomously anymore. AI may exacerbate these effects (Dergaa et al., 2024).

Moral status and personal identity: when the interaction with the same or integrated tools intensifies, the attachment may go beyond mere overdependency to the identification of part of the self. Since the tool is considered part of the mind, when it is lost (through an accident, a malfunction or a malicious third party), the feeling is like losing a part of oneself. For instance, in a robbery, people prefer to lose their money rather than their phone, especially if a stolen phone implies a loss of data or privacy.

Reliability: even if AI gets better and better in the future, there will still be a problem of misunderstanding or misalignment about what we expect from AI (Zhou et al., 2024). We may end up using AI for things for which the outcome is not satisfactory, because we do not have good models of the operation of the AI system. For instance, people are starting to use language models as calculators, even though they are not reliable for medium-sized numbers.

Illusion of understanding: being coupled with the AI system and intervening in some of the intermediate steps does not mean that the process is well understood. This is becoming more of an issue in contexts where the AI system discovers or works with complex concepts whose details are beyond our understanding, such as education and science (Messeri & Crockett, 2024).

Responsibility and trust: this seems to be mostly associated with externalised and autonomous AI systems, but even in an AI extender situation, humans can blame the AI system when things go wrong. Even in the internalisation case, there can be issues, such as “I did this because the machine taught me”.

Interference and control: coupling with AI systems may have implications for our privacy and long-term behaviour, such as manipulation or a lack of independence, not only at the cognitive level but more broadly. For instance, the AI system, given our preferences for food items and our health status, may start blocking some products or making recommendations about what to buy and what not to buy.

Education and assessment: finally, AI extenders are regarded in very different ways depending on the context and the level of education, from embracing them to banning them. This is just an evolution of the ongoing debate about information technologies in education, with the cognitive angle exacerbating the positions. Even if framed as extended cognition, many professionals will not accept their use unless the benefits greatly outweigh the costs.
From the last item, we recognise that one major issue with enhancement in education, even if limited to situations where there is extended cognition, is the assessment of students during the learning process and after it. Traditionally, the use of enhancement has been equated with doping or cheating, as if the drugs or the buddy solving the task for the student were comparable to AI tools that can be permanently there. That is why we need to overhaul assessment.
Assessment
Assessment in a technological context has been a controversial issue at least since Plato's Phaedrus: the dialogue debated whether our memory would be spoilt by the use of pen and paper. About 2,400 years later, the debate focuses on whether our memory is spoilt by the use of the Internet (Sparrow et al., 2011). This view of technology as opposed to education is a simplification of a set of more diverse and profound reasons why technology has had a complex integration with education. This integration includes both the learning process and the assessment process. In the latter, some views are very radical, to the extent that for many assessment procedures the technology is simply forbidden.
One common example is the use of calculators. They were quickly introduced in the 1970s (Banks, 2011) but more than fifty years later they are still selectively banned in evaluations at many educational levels and situations depending on the skills and competencies. The right balance is hard to achieve. Calculators are usually permitted when calculation is not the skill to be trained on or evaluated, allowing students to focus on other tasks, especially in a world where calculators are always around us.
Some educational levels are more tech-friendly than others. For instance, vocational education involves an intrinsic use of machinery. Also, at the higher levels of education, undergraduate and graduate university degrees as well as PhDs, evaluations should be closer to the way students are going to operate in the real world, and the use of technology seems more welcome.
Let us focus on this stage of higher education, given that the presence of AI is more frequently considered acceptable there, to see how assessment can be implemented. As a reality check, we can start with an anecdotal example: a course on machine learning at the Massachusetts Institute of Technology (MIT), one of the most renowned and advanced technical universities in the world. In this course, professors and students are well aware of the possibilities of AI, but we still find some exams without the technology that is associated with machine learning. Not only are language models or specialised tools such as Copilot not allowed; even Python interpreters, and of course internet access, are forbidden (see Figure 5). This is not necessarily wrong, if complemented with practical evaluation during the course, provided the questions on these written exams focus on how to solve machine-learning problems rather than on secondary details such as programming language syntax or memorisation. But these written tests are an example of how intrinsic tools (Python interpreters, help systems, etc.) are completely removed for some evaluations.

Figure 5. About 500 MIT students taking the 3-hr final exam in Introduction to Machine Learning (autumn 2021). From Figure 1 in Zhang et al. (2022b).
It is no surprise that, with the advent of large language models and other systems with general capabilities, especially in knowledge-loaded domains from history to science, there has been a rushed effort to introduce multiple guidelines stating what is and what is not allowed with “generative AI”. The guidelines are different for students at different levels and disciplines, and are updated as AI comes with new features and possibilities.
But let us consider evaluations in pre-professional training courses with ecologically valid assessments. This happens to a greater or lesser degree from secondary education and certification onwards (OECD, 2024b), but it is again more common in vocational training and universities. In that case, we should evaluate students with all their AI extenders. But what does this mean? First, not everybody requires the same AI extenders, and different people may have different preferences. Second, some students could have more means than others, even outsourcing work to humans through these AI tools.
To be fair, students should be given the choice to select their AI tools, but resources should be budgeted; otherwise, some students could work with more powerful AI tools than others, just because some tools are expensive. In other words, they should use extenders that amount to the same cost. In many situations in an educational setting, this could be framed as allowing a budget for AI tools plus any other tool that can be used for free. On occasion, especially if the students are performing a complex project and the AI resources needed are high, solving the problem with a lower AI budget could be incentivised in the final score. Cheating in this context would be represented by actions that try to secure unauthorised extra resources, including humans. The use of a budget, together with effective monitoring, should also limit the possibility of some AI tools outsourcing to humans, which would not reflect an ecologically valid situation where there is a total budget for completing an assignment.
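A minimal sketch of such a budgeting rule in Python (all tool names, prices and bonus weights are hypothetical): students pick AI tools under a fixed budget, free tools are always allowed, and unspent budget can optionally translate into a small bonus in the final score.

```python
# Hypothetical catalogue of allowed tools and their per-assignment prices.
TOOL_PRICES = {"llm_pro": 12.0, "code_assistant": 8.0,
               "image_generator": 5.0, "open_source_llm": 0.0}
BUDGET = 15.0  # maximum spend per student, in the same (hypothetical) units

def check_selection(tools, prices=TOOL_PRICES, budget=BUDGET):
    """Return whether a selection of tools fits the budget, and its cost."""
    cost = sum(prices[t] for t in tools)
    return cost <= budget, cost

def frugality_bonus(cost, budget=BUDGET, max_bonus=0.5):
    """Optional incentive: up to `max_bonus` extra points for unspent budget."""
    return max_bonus * (budget - cost) / budget

ok, cost = check_selection(["llm_pro", "open_source_llm"])
print(ok, cost, round(frugality_bonus(cost), 2))  # -> True 12.0 0.1
```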
Under this scheme, using sophisticated AI tools in clever ways should be incentivised rather than penalised. Also, students specifying the AI tools they have used should be a sign of pride and transparency. This is, for instance, the author's personal experience in an undergraduate course in data science (Martínez-Plumed & Hernández-Orallo, 2023), where posters typically include the logos of all the tools (AI and non-AI) the students have used for a project. This project-based approach using AI is also becoming common in secondary education in some countries, combined with other, more traditional evaluation methods (written exams).
Of course, using AI in a meaningful way is an idealistic situation, because a proper assessment with AI requires intensive monitoring to avoid cheating. Table 1 shows different options for assessment where the monitoring and grading effort is variable.
Table 1. Several options for assessment, where the use of AI, cheating risk and monitoring effort cluster them into three columns.
In the first column, we see the classical assessment procedures: exams with no domain-related tools (with pen and paper, or a computer with no internet access but some tools, such as a compiler or an office suite). These include a variety of evaluation types, such as practical exams (e.g., making a 3D design on a computer and then 3D-printing the piece), but where AI tools would not be allowed. However, when we really want to evaluate skills in an ecologically valid scenario, with AI tools all around us, this is just a fall-back option, chosen for various reasons (cost or lack of alternatives) given the resources available. The middle column has become a very popular option in the past few decades, even at university level, in a move towards more continuous learning (in contrast to a final exam). However, if the exams and deliverables allow online or offline access to AI, the temptation for the students is going to be too strong. If we regulate with “honesty rules” (the guidelines for the use of generative AI mentioned above), not only is the exercise going to be hard and controversial to enforce, but it is going to penalise the honest students, given how easy it is to use AI tools these days.
If we discard the first two columns, either because they are no longer ecologically valid or because the risk of cheating is too high, then we are left with the options on the right. Here we have alternatives that involve high supervision (projects and practical assignments) or that are not parallelisable for the instructor (live demonstrations and some other oral examinations). This presents a problem of scalability: it is feasible with a small cohort of students, but not at the usual ratios we face at many universities. We will revisit the problem of scalability of these types of assessment later on. But for the moment, let us assume that we have a sufficient number of tutor hours to frame assessment in this way. What should we assess?
Here it is useful to distinguish between those tasks that are individual, which we expect the student to perform without assistance, with autonomy being rewarded, and those tasks that are collective, where interdependence, coordination and delegation are rewarded. In the middle, we find those individual tasks where the student can be assisted by tools, or where tools are even required for the task (e.g., a practical evaluation in a laboratory), and where the effective use of those tools is rewarded.
With the use of AI assistants, which will become more agential in the future, the boundaries between tool and partner (Collins et al., 2024) become blurrier, and the scenarios where students are coupled or collaborate with their AI assistants become more similar to collective tasks.
This means that, even if the task was individual, the student will be interacting with a pool of AI assistants. In this situation, determining who has done what (the student or AI) will become more complicated. Responsibility for success, and also for failure, may be affected by the expectations about technology or even bias in favour or against AI. Rubrics and procedures should be carefully devised to take all this into account.
In the end, the evaluation of AI-extended or enhanced individuals takes us closer to the evaluation of collectives in collaborative scenarios (von Davier & Halpin, 2013). These collectives are now seen in a human-AI hybrid context, sometimes referred to as centaurs (Swanson et al., 2021), and with tools that are more powerful than humans in many ways. We can reuse the experience of evaluation in ecologically valid situations (where there are always some other actors) or project-based learning (where the tasks are meant to be collective). In these multi-actor situations, assessing the responsibility and value of a single student in an individual or collective task should consider the following items:
Team formation and tool selection: choosing the right tools and using them well is something that should be evaluated. In the case of AI, we could be talking about assistants and agents that help with many subcomponents of the task, or perform the whole task. If the student finds the right tool that solves the whole task, then this is positive; it shows the potential of the student in the real world, especially if this is systematically evaluated across many tasks and assessments. Similarly, if there is the possibility of team formation including humans, this should also be taken into account, provided there are some mechanisms ensuring fairness. Under the mind-extension thesis, it will be important to determine which tools are coupled with the human participants and which should be considered autonomous team members or externalised tools.

Evaluating human contribution: apart from team formation (which in some cases can be determined or constrained by the instructor), how a person contributes to the team and their value matters more. The problem of attribution can be addressed from different perspectives, but one angle is to identify the value of each team member, for example, drawing inspiration from concepts such as the Shapley value, already applied to estimate the value of a “player” in multi-agent scenarios (Zhao & Hernandez-Orallo, 2024). Here, even more than in the team formation case, the mind-extension thesis determines what counts as an individual for the attribution.

While these two dimensions are key, and identifying them is crucial, this does not solve how to do the assessment in this context. For instance, the Shapley value is conceptually attractive, but in practice we cannot play with a sample of team configurations to extract the value of each participant (see the sketch after this list). Two ideas stand out on the side of the evaluators, currently with known limitations:

Using AI for evaluation: of course, this will appear more and more frequently in the future. Evaluation using AI should not carry a stigma if done carefully, in contrast with the rushed ways we have seen in the past (the algorithmic grading during the COVID-19 pandemic in England, Aloisi, 2023). Done well, AI for evaluation is a possible solution to the scalability problem. Here we can think of any level of Puentedura's SAMR model (Figure 3), from redefinition to substitution. In the end, lecturers and other kinds of instructors at higher-education levels are mostly knowledge workers, for whom the impact of AI will be felt strongly and quickly.

Co-evaluation: as we have humans and AI, we can ask the human students to co-evaluate each other, as done traditionally in project-based learning (Martínez-Plumed & Hernández-Orallo, 2023). We can go much beyond that and use the AI tools as observers and ultimately evaluators, as they have a very coupled integration with the team or individual to be evaluated. This can also be an advantage to avoid co-evaluations that are biased by affinity, friendship (or the lack thereof) and so on in the team.
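To make that combinatorial difficulty concrete, here is a minimal Python sketch of exact Shapley attribution (the team members and subset scores are purely hypothetical): it requires a performance score for every possible team configuration, which is precisely what we cannot sample in real assessments.

```python
from itertools import permutations

def shapley_values(members, score):
    """Exact Shapley attribution: average each member's marginal
    contribution over all orders in which the team could be assembled.
    `score` maps a frozenset of members to team performance; the cost
    grows factorially, so this is only feasible for tiny teams."""
    values = {m: 0.0 for m in members}
    orders = list(permutations(members))
    for order in orders:
        team = frozenset()
        for m in order:
            values[m] += score(team | {m}) - score(team)
            team |= {m}
    return {m: v / len(orders) for m, v in values.items()}

# Hypothetical scores for a student, a peer and an AI tool, for all subsets.
demo_scores = {
    frozenset(): 0.0,
    frozenset({"student"}): 0.4, frozenset({"peer"}): 0.3,
    frozenset({"ai"}): 0.5, frozenset({"student", "peer"}): 0.6,
    frozenset({"student", "ai"}): 0.8, frozenset({"peer", "ai"}): 0.7,
    frozenset({"student", "peer", "ai"}): 0.9,
}
print(shapley_values(["student", "peer", "ai"], demo_scores.__getitem__))
```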
Overall, the assessment in the context of powerful AI and under the perspective of the mind-extension thesis is challenging, but AI-assisted evaluation can be a solution for interaction-intensive evaluation approaches that would not be scalable otherwise.
Conclusion
Some geologists claim that we entered the Anthropocene sometime between the industrial revolution and the major climate change that fossil fuels are causing on our planet. But it is worth considering that, if not geologically, then from a cognitive point of view, we are about to enter the Mechanocene, the machine era.
We started our paper with two tenets: (1) cognition in the age of AI is hypothesised to lie along an externalisation–internalisation continuum, with AI extenders being the preferred practical option for empowerment, and (2) AI is assumed to become more powerful than humans in a safe, legal and cheap way, such that it can become mainstream for all the tasks humans are required for now. Under these two tenets, we can think of the destination and recognise the major challenges ahead:
What skills matter as individuals are extended by AI and hybridisation becomes ubiquitous? How can we develop and assess those skills in humans, machines, hybrids and collectives?
We are not the first to ask these questions, but we hope our perspective reconceptualises them in a way that brings us closer to their answers.
Acknowledgments
I would like to thank Prof. Fang Luo for very fruitful discussions at a summer school in Beijing in 2024, inviting me to give a related talk at the ICCPAE2024 conference, and the suggestion to write this piece. This paper has benefitted from insightful feedback from and discussions with Nóra Revai, Margarita Kalamova, Marc Fuster, Shivi Chandra and Imogen Casebourne. The comments during the reviewing phase were also helpful to improve the clarity of the paper.
Declaration of conflicting interests
The author declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
This work was funded by CIPROM/2022/6 (FASSLOW) and IDIFEDER/2021/05 (CLUSTERIA) funded by Generalitat Valenciana, the EC H2020-EU grant agreement No. 952215 (TAILOR) and Spanish grant PID2021-122830OB-C422 (SFERA) funded by MCIN/AEI/10.13039/501100011033 and “ERDF A way of making Europe”.
