Background: With the rapid advancement of large language models (LLMs), comparative performance evaluations have become essential to guide both research and real-world deployment. Differences in processing efficiency, cost, and scalability significantly influence their practical applications.
Purpose: The study aims to investigate and compare the performance of four leading LLMs—Llama 3.3 8B Instruct (Meta), Gemma 3n 4B (Google AI Studio), DeepSeek Prover V2, and DeepHermes 3 Llama 3 8B Preview (Chutes)—across selected computational metrics.
Research Design: The research employed a comparative experimental approach, where models were tested under varying token loads. Key evaluation parameters included token count, processing speed (tokens per second), execution duration, and cost efficiency.
Study Sample: The sample comprised multiple performance instances for each of the four selected models, tested under controlled computational conditions to ensure consistency.
Data Collection and/or Analysis: Performance statistics were collected systematically for each model. Descriptive analysis was carried out by comparing throughput rate, average response times, and relative cost differences.
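The descriptive analysis described above can be sketched as a short computation of throughput (tokens per second), average duration, and average cost per run. The field names and sample figures below are illustrative placeholders, not data from the study.

```python
# Hypothetical sketch of the descriptive analysis: per-run throughput
# plus averages across repeated runs of one model. All values are
# illustrative assumptions, not measurements from the study.

def throughput(tokens: int, seconds: float) -> float:
    """Tokens generated per second for a single run."""
    return tokens / seconds

def summarize(runs: list[dict]) -> dict:
    """Average throughput, response time, and cost across runs."""
    n = len(runs)
    return {
        "avg_tokens_per_sec": sum(throughput(r["tokens"], r["seconds"])
                                  for r in runs) / n,
        "avg_seconds": sum(r["seconds"] for r in runs) / n,
        "avg_cost_usd": sum(r["cost_usd"] for r in runs) / n,
    }

# Illustrative runs for one model (placeholder numbers)
runs = [
    {"tokens": 512, "seconds": 2.6, "cost_usd": 0.0004},
    {"tokens": 768, "seconds": 3.9, "cost_usd": 0.0006},
]
stats = summarize(runs)
```

Comparing the resulting summaries across models is what yields the relative throughput and cost differences reported in the Results.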
Results: Findings revealed substantial variation across the models. Llama 3.3 and DeepHermes consistently recorded the fastest throughput, exceeding 190 tokens per second with response times under 4 seconds. Gemma 3n slowed at higher token counts, reflecting a trade-off between speed and token handling. DeepSeek Prover V2 performed moderately across all metrics but lagged behind Llama 3.3 and DeepHermes in speed. The results suggest that model selection should be context-specific: Llama 3.3 and DeepHermes offer the best balance of speed and cost-effectiveness for practical NLP tasks, while the other models may be better suited to research-oriented applications.