Abstract
Background:
Diet interventions often have poor adherence due to burdensome food logging. Approaches using photographs assessed by artificial intelligence (AI) may make food logging easier if they are adequately accurate.
Method:
We used OpenAI’s GPT-4o model with one-shot prompts and no fine-tuning to assess energy, fat, protein, carbohydrate, fiber, and salt from photographs of 22 meals, comparing its assessments with each meal’s weighed food record and with dieticians’ assessments.
Results:
The model had poor performance overall. For fiber, though, the model achieved an intraclass correlation coefficient of 0.71 (95% CI 0.67–0.74), well above the dieticians’ ICC of 0.57.
Conclusions:
The simplest use of current AI, one-shot prompting with no fine-tuning, accurately assesses the fiber content of meals but is inaccurate for other nutritional parameters.
Introduction
Type 2 diabetes (T2D) affects hundreds of millions of patients worldwide, causing serious cardiovascular and microvascular complications, including blindness, amputation, and death.1 Changes in diet are a key part of treatment recommendations.2 Although dietary changes have proven clinical benefit,3 patients struggle to change their diets. One class of interventions relies on giving patients feedback about their food choices so that they can modify their dietary behavior based on data. Assessing diet throughout the day is challenging: logging food, such as from within a smartphone application, is notoriously difficult, with generally low usage rates.4 For example, our DialBetesPlus intervention5 had high usage rates for measuring exercise (68.5–81.6%) but low usage rates for diet recording (37.2–54.0%).
Anecdotally, our patients report that they often take pictures of meals and record them later. Researchers have experimented with assessing meals from photographs.6-9 For instance, we conducted trials in which the photographs were reviewed by licensed dieticians who provided nutritional assessments and feedback.10 This feedback was not real time, arriving after a delay of 1 or 2 days, and patients generally did not find it useful.
Researchers have also experimented with using machine learning (ML) to assess the nutritional content of meals. For example, we used custom models, trained on the opinions of licensed dieticians, to make an ML assessment of “healthiness.”11,12 By avoiding a human in the loop, these approaches offer near-real-time feedback at very low cost. Until recently, however, results were not especially good. The development of Transformer-based large language models (LLMs)13 has changed this landscape. Even if current LLMs cannot provide the accuracy needed to support a medical intervention in diet, it seems likely that more advanced LLMs, appearing soon, will. One company, Healthify, trained (via fine-tuning) an earlier version of OpenAI’s generative pre-trained Transformer (GPT) to give nutritional feedback based on photographs of meals, albeit in a health/wellness context rather than a medical one.14 Fine-tuning takes time and money, with the likelihood that all the work put into a particular LLM will be outdated by the next release of a far more advanced one, and multi-shot prompting worsens user engagement due to increased latency. Our objective in this study was to assess the accuracy of a current-generation LLM using simple one-shot prompting without fine-tuning.
Methods
We used a database of Japanese meals.15 The meals were prepared with all ingredients weighed, forming a weighed food record (WFR) that assesses nutritional content with very high accuracy. We treat these WFR assessments as ground truth. Meals were also photographed (Figure 1). We provided these photographs to an LLM to obtain nutritional estimates.

Figure 1. Example meal photograph.
Estimating nutritional content from a photograph alone is challenging, though it can achieve useful levels of accuracy. In our earlier work,15 we inserted these photographs into the workstream of dieticians who were assessing the nutritional content of meals of patients with diabetes using our DialBetics (DB) app, and we collected their assessments. The performance of these expert humans provides a good benchmark for comparison.
In this study, we used OpenAI’s “gpt-4o-2024-05-13” model, without fine-tuning, via its application programming interface (API)16 as a representative leading LLM. This model was released while this work was underway, so it was state-of-the-art at the time. We used image-processing-oriented settings (Table 1) with one-shot prompts17 that tasked the LLM with generating a JavaScript Object Notation (JSON) object capturing the requested nutritional estimate. We used separate prompts and separate LLM engagements for each of the 6 nutrients of interest (energy, fat, protein, carbohydrates, dietary fiber, and salt). Because LLMs show significant run-to-run variability, for each nutrient we ran the full set of 22 meals in 11 repeated trials and report the mean intraclass correlation coefficient (ICC), comparing the LLM’s estimates with ground truth from the WFR. All ICCs are two-way mixed effects, absolute agreement, single rater.18,19 We also assessed the mean absolute error (MAE) and root mean squared error (RMSE). All model software was custom Python code (Python 3.12); all statistical analysis was done in R (version 4.2.1).
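As an illustration of this setup, the sketch below shows how one such per-nutrient query could be made through the OpenAI Python SDK. It is a minimal sketch, not our exact implementation: the prompt wording, the one-shot example, the temperature, the image-detail setting, and the helper name estimate_nutrient are all illustrative assumptions, with the actual settings given in Table 1.

```python
import base64
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def estimate_nutrient(image_path: str, nutrient: str, unit: str,
                      example_value: float) -> float:
    """Query gpt-4o once for a single nutrient estimate from one photograph.

    The prompt wording, one-shot example, and settings below are
    illustrative assumptions; the study's exact prompts and settings
    (Table 1) may differ.
    """
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    prompt = (
        f"Estimate the total {nutrient} in the pictured meal. "
        f"One-shot example: for a 150 g bowl of plain white rice, answer "
        f'{{"estimate": {example_value}, "unit": "{unit}"}}. '
        f"Reply with JSON in the same format."
    )
    response = client.chat.completions.create(
        model="gpt-4o-2024-05-13",
        temperature=0,  # assumption: chosen to limit run-to-run variability
        response_format={"type": "json_object"},  # force a parseable JSON reply
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/jpeg;base64,{image_b64}",
                    "detail": "high",  # image-processing-oriented setting
                }},
            ],
        }],
    )
    return float(json.loads(response.choices[0].message.content)["estimate"])

# Example: one query for dietary fiber on one meal, with an illustrative
# one-shot example value supplied by the caller.
# fiber_g = estimate_nutrient("meal_01.jpg", "dietary fiber", "g", 0.6)
```

In the design described above, a call of this kind is repeated for each of the 6 nutrients over all 22 meals in each of the 11 trials.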
Table 1. Settings for LLM Analysis.
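For concreteness, the agreement statistics can be sketched as follows. Our analysis was done in R, so the Python version below is only an illustrative equivalent (assuming NumPy and SciPy): it computes the absolute-agreement, single-rater ICC directly from the two-way ANOVA mean squares, along with MAE, RMSE, and a t-based 95% CI for a mean taken across the 11 repeated runs.

```python
import numpy as np
from scipy import stats

def icc_agreement_single(truth: np.ndarray, est: np.ndarray) -> float:
    """Absolute-agreement, single-rater ICC (McGraw & Wong's ICC(A,1)),
    computed from two-way ANOVA mean squares. Rows are meals; the two
    columns are the WFR ground truth and the estimates."""
    x = np.column_stack([truth, est])          # shape (n_meals, k_raters)
    n, k = x.shape
    grand = x.mean()
    msr = k * np.sum((x.mean(axis=1) - grand) ** 2) / (n - 1)  # rows (meals)
    msc = n * np.sum((x.mean(axis=0) - grand) ** 2) / (k - 1)  # columns (raters)
    sse = np.sum((x - grand) ** 2) - msr * (n - 1) - msc * (k - 1)
    mse = sse / ((n - 1) * (k - 1))                            # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def mae(truth, est):
    """Mean absolute error between estimates and ground truth."""
    return float(np.mean(np.abs(np.asarray(est) - np.asarray(truth))))

def rmse(truth, est):
    """Root mean squared error between estimates and ground truth."""
    return float(np.sqrt(np.mean((np.asarray(est) - np.asarray(truth)) ** 2)))

def mean_ci95(values):
    """Mean and t-based 95% CI across repeated runs, e.g. across the
    11 per-run ICCs for one nutrient."""
    v = np.asarray(values, dtype=float)
    half = stats.t.ppf(0.975, len(v) - 1) * v.std(ddof=1) / np.sqrt(len(v))
    return v.mean(), (v.mean() - half, v.mean() + half)
```

Each run yields one ICC per nutrient; applying mean_ci95 to the 11 per-run ICCs then gives the reported mean and its 95% CI.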
Results
We analyzed the original database15 and identified 22 meals, comprising 56 dishes, with complete data. The WFRs for these meals (Table 2) show a diverse range of nutrient values, with energy ranging from 287 to 1115 kcal and dietary fiber ranging from 2.6 to 9.0 g.
Table 2. WFR Results for 22 Meals.
The ICC values for the DB dietician results and the LLM results (Figure 2) show that, while the dieticians generally did quite well relative to the WFR, the LLM had mixed and generally poor results. Results were quite poor for 5 of the 6 parameters (energy, fat, protein, carbohydrates, and salt), with ICCs of 0.36 and lower, far below the ICCs of the DB dieticians’ estimates. For fiber, though, the ICC of 0.71 was markedly superior to the DB dieticians’ ICC of 0.57. The trends were similar for MAE and RMSE (Table 3), with fiber the only nutrient for which the LLM achieved smaller errors than the DB dieticians. The individual fiber estimates were generally reasonable, with characteristics similar to the dietician estimates (Figure 3). Latency was low, averaging 5.4 seconds per photograph. Costs were moderately low, averaging around $0.01 per photograph. Our attempts to use the “gpt-4o-mini” LLM led to universally poor performance for this task (results not shown). Similarly, our attempts to use the updated “gpt-4o-2024-08-06” LLM produced ICCs markedly inferior to those from the “gpt-4o-2024-05-13” model (results not shown).

Figure 2. LLM ICCs and DB dietician ICCs for 6 parameters.
Table 3. Comparisons of Accuracy in Estimating Nutrients With ICC, MAE, and RMSE.
ICC: two-way mixed effects, absolute agreement, single rater. MAE: mean absolute error. RMSE: root mean squared error. 95% CIs for the means calculated across the 11 runs.

Figure 3. Scatter plot of DB dietician and LLM fiber estimates versus WFR.
Discussion
Performance was sensitive to minor changes in prompt wording. Adding instructions that laid out assessment steps modeled on the methods dieticians use frequently worsened performance. We also saw worse performance with a newer version of GPT-4o, suggesting that performance on this task is unusually sensitive to details of the specific LLM implementation; further study is needed to understand the reasons for and impacts of this. Performance for fiber started out well with our first prompt, with ICCs around 0.55, and incremental changes via trial and error improved it further. For the other parameters, better prompts may exist, but we were unable to find prompt changes that led to acceptable performance. For most nutrients, the current generation of LLMs, used with simple one-shot prompt engineering and no fine-tuning, does not provide adequate accuracy for most applications. For fiber, though, the level of accuracy is high enough that it may be acceptable for many applications. Performance for all nutrients would likely be much better if multi-shot prompting or fine-tuning were employed.
Predicting salt levels from a photograph is difficult: there is no way to see the salt, and assessments must be based on typical content for meals of a given type. It also makes sense that predicting fat and carbohydrates (and thus energy) is difficult, as some ingredients (such as oil or sugar) are largely invisible. We expected better estimates for protein, as many protein sources are visible in photographs, but we did not achieve good results there either. Although an assessment from a photograph should never be expected to match one from a weighed food record, dieticians are able to make photograph-based assessments with usable levels of accuracy, and we should expect that a stock commercial AI, without fine-tuning or multi-shot prompting, will some day be able to do so as well.
It is interesting that the LLM’s fiber estimates were more accurate than the dieticians’ estimates. Perhaps hidden ingredients have less of an impact, since the LLM can see items like legumes or vegetables in the photograph. Perhaps the task also benefits from specificity: most ingredients contribute little fiber, so the model can focus on the few high contributors.
The fiber results are very encouraging. Eating an adequate amount of fiber is a key part of diabetes treatment recommendations,2 but few patients meet the guidelines. For example, one cross-sectional study in Japan showed an average intake below 13 g per day among patients with diabetes,20 far below the recommended value of 35 g per day.21 Helping patients measure fiber levels with some degree of accuracy opens up interesting possibilities for interventions to increase fiber consumption among patients with diabetes, and we think the level of accuracy we have achieved will be useful in such interventions.
This study was a quick first look, and it has significant limitations. The number of meals, 22, was low. There is some risk that prompt iteration led to overfitting, though the simple nature of the prompts makes this unlikely. All photographs were taken under good conditions; actual patient photographs may be harder to evaluate and may lead to worse results. All meals were typical Japanese home-cooked meals, so they do not reflect a full range of cuisines and do not account for snacks, restaurant food, or packaged meals. Variation in meal preparation methods could influence results significantly. Simple one-shot prompts were used, with no fine-tuning; more involved methods may produce better results, including better accommodation of variation in meal preparation methods. Follow-up work is needed.
Conclusions
Using simple prompts with a current-generation LLM produces a useful level of accuracy in assessing dietary fiber, better than that previously achieved by dieticians. This measurement technique enables intervention approaches that might yield clinically significant improvements in glycemic control for patients with diabetes.
Performance for the other nutritional parameters was poor with the studied approach. The field of LLMs is advancing rapidly, however, and we expect that the next generation will provide better, and likely acceptable, performance with our one-shot, non-fine-tuned approach for more than just fiber. Combined with applications that support a healthy diet for people with T2D, the assessment of dietary fiber using commercial artificial intelligence might lead to better interventions.
Footnotes
Acknowledgements
We thank Daniel Lane for his support in manuscript editing and scientific discussions. We made no use of generative AI in the development of this paper.
Abbreviations
AI, artificial intelligence; API, application programming interface; CI, confidence interval; DB, DialBetics; GPT, generative pre-trained Transformer; ICC, intraclass correlation coefficient; JSON, JavaScript Object Notation; LLM, large language model; ML, machine learning; T2D, type 2 diabetes; WFR, weighed food record.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
