Dietary assessment is a technique for determining an individual’s intake, eating patterns, and food quality choices, as well as the nutritional values of consumed food. However, its procedures are costly, laborious, and time-consuming and rely on specially trained personnel (such as dietitians and nutritionists) to produce reliable results. Consequently, a strong need exists for novel methods with improved measurement capabilities that are accurate, convenient, less burdensome, and cost-effective []. Rather than relying solely on client self-report, traditional methods have incorporated food photographs taken before eating, such as a 3-day food record with food images, to reduce missing food records, incorrect food identification, and errors in portion size estimation. However, this technique still requires well-trained staff to translate food image information into reliable nutritional values and does not solve the labor-intensive and time-consuming issues.
The application of computer algorithms to translate food images into representative nutritional values has gained interest in both the nutrition and computer science communities. This combination has resulted in a new field called image-assisted dietary assessment (IADA), and various systems have been developed to address these limitations, ranging from simple estimation equations in early systems to more complex artificial intelligence (AI) models in recent years. Combined with the increasing use of smartphones and devices with built-in digital cameras, IADA has made real-time analysis of dietary intake data from food images possible with accurate results, reduced labor, and greater convenience, thus gaining attention among nutrition professionals. However, the technical nature of this field can make it difficult to understand for those without a background in computer science or engineering, leading to low involvement of nutrition professionals in its development. This gap is the rationale for this review.
Objectives
The objective of this review is to bridge that knowledge gap by providing an up-to-date overview of the gradual enhancement of AI integration in dietary assessment based on food images. The information is presented in chronological order and in a manner that is understandable and accessible to those who may not be familiar with the technical jargon and complexity of AI terminologies. In addition, the advantages and limitations of these systems are discussed. Finally, we propose auxiliary systems to enhance the accuracy of IADA and its potential adoption within the nutrition community.
To conduct this scoping review, we followed the methodology suggested by Arksey and O’Malley [] and adhered to the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines [].
Search Strategy
We searched 2 web-based databases, PubMed and Google Scholar, between February 2023 and March 2023. For PubMed, we used the query ((“food image”[Title/Abstract]) AND (classification[Title/Abstract] OR recognition[Title/Abstract] OR (“computer vision”[Title/Abstract]))); for Google Scholar, we used the terms “artificial intelligence,” “dietary assessment,” “computer vision,” “food image” recognition, “portion size,” segmentation, and classification.
Eligibility Criteria
This review included studies that focused on AI techniques used for IADA, specifically AI models, systems, or digital methods for food recognition and food volume estimation. For mobile apps or systems, we considered only articles that explained the underlying algorithms, rather than those limited to describing mobile apps, prototype testing, or conducting clinical research. Studies that used noncomputational techniques, such as using food images as a tool for training human portion estimation, were excluded. Eligible articles were published in peer-reviewed journals or conference proceedings and written in English.
Selection Process
We used Zotero (Corporation for Digital Scholarship) reference management software to collect search results using the add multiple results function. All automatic data retrieval functions were disabled to avoid exceeding Google Scholar’s traffic limits. Zotero’s built-in duplicate merger was used to identify duplicated records, and deduplicated records were exported to Excel Online (Microsoft Corp). In Excel, all authors independently screened article types, titles, and abstracts. The screening process removed all unrelated titles or abstracts, review and editorial articles, non-English articles, and conference abstracts without full text. For thesis articles, the corresponding published articles were identified using keywords from the title, first author, or corresponding author whenever possible. Each article required approval from 2 independent reviewers. In cases of conflict, a full-text review was conducted to resolve disagreements. After the initial screening, the full texts of articles were obtained to assess eligibility. All full-text articles, whether excluded or not, and review articles were read thoroughly to identify related articles of interest; these were classified as articles from other sources.
Data Extraction
A data extraction table was constructed, including the system name, classification algorithm, portion size estimation algorithm, accuracy of classification or portion estimation results, and the system’s notable advantages and drawbacks. Data were extracted from full texts.
We retrieved 44 (8.4%) items from PubMed, while Google Scholar provided 478 (91.6%) results from the search terms, giving a total of 522 items retrieved. In total, 122 (23.4%) duplicate items were removed using Zotero’s built-in duplicate merger. The remaining 400 (76.6%) deduplicated items were screened based on their titles and abstracts, resulting in 104 (19.9%) records for full-text review. After the full-text review process, 72 (13.8%) articles were included in this study. In addition, we manually identified and included 12 (2.3%) additional articles from other sources. An overview of the literature identification method and results is shown in , and the PRISMA-ScR checklist is available in .
Figure 1. PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) flowchart of the structured literature search, screening, and selection methodology.
Traditional Dietary Assessment Methods
When measuring individual food intake, dietary assessment methods are typically divided into 2 sequential processes: methods to obtain dietary intake and methods to estimate the nutritional values of food. In principle, an individual’s intake can be obtained by recording all consumed foods, beverages, herbs, or supplements with their portion sizes on a day-to-day basis or within a specific time frame (eg, a week), depending on variation in the nutrients of interest. These methods were developed early on and can be performed manually. Due to their simplicity, some are frequently used in nutrition professionals’ practices.
The 24-hour dietary recall (24HR) method is the simplest way to measure dietary intake, but accurately obtaining dietary intake information can be very challenging. The participant or their caregiver is asked by a trained interviewer to recall the participant’s food intake within the last 24 hours. This method relies heavily on the client’s memory and estimation of food portion size []. Unintentional misreporting of food intake is common, as clients often forget some foods. Underreporting of portion size is frequent because clients are not familiar with estimating food portion sizes [,]. In participants who are overweight or obese, intentional underreporting is also common []. Although this method is the simplest for determining dietary intake, each interview takes approximately 1 hour to complete. Moreover, a single 24HR result does not satisfactorily define an individual’s usual intake due to day-to-day variations in eating habits.
Estimated food records (EFRs) are more reliable but time-consuming. Clients are asked to record all food and beverage intake at eating times for a specified period. Details of the food are needed, along with portion sizes estimated by the client and rounded to household units (eg, half a cup of soymilk with ground sesame and 4 tablespoons of kidney beans without syrup). To improve accuracy, training in estimating portion size using standard food models is required. The EFR places a burden on clients, as they need to record at all eating times. Moreover, some clients temporarily change their intake habits during recording to minimize this burden, while others may intentionally omit certain foods to conceal particular eating habits. Food portion size estimation errors are sometimes found, but taking food photographs before and after eating can lower these errors [-].
A standardized weighing scale can be used to avoid errors caused by human estimation of portion sizes. This technique is known as the weighed food record and is considered the gold standard for determining personal intake. However, it is impractical to weigh all eaten food in the long term because measuring the weight of all food eaten throughout the day becomes a burden for the client []. This technique also eliminates only portion size estimation errors, while the other issues with EFRs may still persist.
After retrieving dietary intake information from sources such as the 24HR, EFR, or weighed food record, the next step is to estimate the representative nutritional value of the food using a food composition table. If the recorded foods match the food items and their descriptions in an available food composition table, the nutritional values can be obtained directly by multiplying the consumed food weight by the tabulated values per unit weight. However, if the food items are not found, the food needs to be analyzed and broken down into its components. The nutritional values of each component can then be obtained from the food composition table (or its nutrition label) and multiplied by the actual weight of each consumed component. When the portion size is recorded instead of the actual weight, the estimated weight can be obtained using standardized portion sizes from the food composition table. Nutrient analysis software can easily accomplish this task.
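As a minimal illustration of this calculation step, the sketch below scales per-100 g composition values to the consumed weight and sums them over a meal. The food items and nutrient values are illustrative placeholders, not entries from any real food composition table.

# Minimal sketch of nutrient calculation from a food composition table.
# All per-100 g values below are illustrative placeholders, not real data.
FOOD_COMPOSITION = {
    # food item: nutrients per 100 g of edible portion
    "steamed rice": {"energy_kcal": 130.0, "carbohydrate_g": 28.0, "protein_g": 2.7},
    "kidney beans": {"energy_kcal": 127.0, "carbohydrate_g": 22.8, "protein_g": 8.7},
}

def nutrients_for(food: str, weight_g: float) -> dict:
    """Scale the per-100 g composition values to the consumed weight."""
    per_100g = FOOD_COMPOSITION[food]
    return {nutrient: value * weight_g / 100.0 for nutrient, value in per_100g.items()}

# Example meal: 150 g of steamed rice and 60 g of kidney beans.
meal = [("steamed rice", 150.0), ("kidney beans", 60.0)]
totals: dict = {}
for food, weight in meal:
    for nutrient, value in nutrients_for(food, weight).items():
        totals[nutrient] = totals.get(nutrient, 0.0) + value
print(totals)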
IADA Methods
Overview
Digital devices are often used for dietary assessment. The first well-documented attempt to develop such a dedicated digital device was Wellnavi, reported by Wang et al []. Although the device yielded accurate results, its usability was limited by the technologies of the time, including short battery life, poor image quality, a bulky body, and a less sensitive touch screen [].
Several attempts have been made to use generic devices, such as Palm (Palm Inc) PDAs [], compact digital cameras [], and smartphones [], instead of inventing a dedicated food recording device. Users of these devices reported a lower burden of completing food records compared with traditional methods [,]. However, these devices still rely heavily on dietitians or nutritionists to analyze the nutritional values of food items.
Recent advancements in mobile phone technologies, including high-performance processors and high-quality digital cameras, have created the opportunity to build food image analysis systems on smartphones. While the exact origins of applying AI to IADA research are uncertain, one well-documented attempt to develop a simple system on smartphones was DiaWear []. The system implemented an artificial neural network, the technique that underlies deep learning, a recently advanced area of AI. Despite achieving an accuracy rate above 75%, which was considered remarkable at that time, the system’s usefulness was limited because it could identify only 4 types of foods—hamburgers, fries, chicken nuggets, and apple pie. In addition, the system could not determine the portion size from the captured food image; thus, it returned nutritional values based on a fixed portion size.
In this paper, the architecture of IADA is divided into multistage architectures, which were prevalent in the early stages of IADA development, and end-to-end architecture, which has emerged more recently with advancements in AI techniques and food image datasets. The multistage architectures, as implied by their name, include 4 individual processes: segmentation, food identification, portion estimation, and nutrient calculations using a food composition table. This sequential process is consistent across all early-stage IADA systems [-]. These subprocesses are trained independently because they require specific input variables, and optimization can only be done for each step individually, not for the entire process. By contrast, the end-to-end approach, which replaces a multistep pipeline with a single model, can be fine-tuned as a whole process, making it more advanced and increasingly the focus of researchers today.
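To make this multistage structure concrete, the following sketch chains the 4 processes together with placeholder functions. The function names, return values, and the single hard-coded food item are illustrative assumptions, not components of any specific published system; in a real system, each stage would be a separately trained model or a lookup against a food composition table.

# Illustrative skeleton of a multistage IADA pipeline; each stage is a placeholder.
from typing import Dict, List, Tuple

def segment(image) -> List[dict]:
    """Stage 1: split the image into candidate food regions (placeholder)."""
    return [{"region_id": 0, "pixels": image}]

def identify(region: dict) -> str:
    """Stage 2: classify the food type of one region (placeholder)."""
    return "steamed rice"

def estimate_portion(region: dict, label: str) -> float:
    """Stage 3: estimate the portion weight in grams (placeholder)."""
    return 150.0

def lookup_nutrients(label: str, weight_g: float) -> Dict[str, float]:
    """Stage 4: look up nutrients in a food composition table (placeholder result)."""
    return {"energy_kcal": 195.0, "carbohydrate_g": 42.0}

def assess(image) -> List[Tuple[str, float, Dict[str, float]]]:
    """Run the stages sequentially; each is optimized on its own, so errors propagate downstream."""
    results = []
    for region in segment(image):
        label = identify(region)
        weight = estimate_portion(region, label)
        results.append((label, weight, lookup_nutrients(label, weight)))
    return results

print(assess(image="placeholder_image"))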
Nowadays, multistage architectures are becoming obsolete and are often referred to as traditional IADA. Nevertheless, they played a significant role in the IADA timeline before the emergence of the end-to-end approach. Therefore, we delve into the multistage architectures, focusing particularly on food identification and portion estimation algorithms in their respective subsections, and provide details about the end-to-end approach in the Going Beyond the Traditional Approach With Deep Learning section. For better comparison, Figure 2 illustrates traditional dietary assessment methods and the corresponding IADA substitution processes, along with some notable systems that combine certain processes of the multistage architecture into a single model through deep learning [,-].
Figure 2. Comparison of traditional dietary assessment processes and the image-assisted dietary assessment (IADA) substitution processes for the same tasks, including systems that integrate the multistage architecture into a single model using deep learning. Systems referenced include DiaWear from Shroff et al [], GoCARB from Anthimopoulos et al [], FIVR from Puri et al [], Im2Calories from Myers et al [], Diabetes60 from Christ et al [], Multitask CNN from Ege and Yanai [], Fang et al [], and Technology Assisted Dietary Assessment (TADA) from Zhu et al [, ,]. 24HR: 24-hour dietary recall; CNN: convolutional neural network; EFR: estimated food record; GAN: generative adversarial network; ResNet50: residual network; SVM: support vector machine; VGG: visual geometry group; WFR: weighed food record.
Food Identification System
Image recognition is one of the milestones of the computer vision field. The goal is to detect and locate an object of interest in an image. Several researchers have applied this technique to food identification tasks that formerly relied on humans only. The early stage in the development of food identification systems spanned 2009 to 2015. Most of the existing systems were powered by machine learning algorithms that required human-designed input information, known in technical terms as features. Hence, these machine learning-based algorithms are classified as handcrafted algorithms.
The era of handcrafted algorithms began in 2009 with the release of the Pittsburgh Fast-Food Image Dataset [], a significant landmark that promoted research into food identification algorithms. This dataset consisted of 4545 fast-food images, including 606 stereo image pairs, of 101 different food items. The researchers also provided baseline detection accuracy results of 11% and 24% using only the image color histogram with a support vector machine (SVM)-based classifier and a bag-of-scale-invariant feature transform classifier, respectively. Although these classifiers were commonly used at the time, the results were not considered sufficient and demonstrated much room for improvement. Since then, various techniques have been proposed to improve the accuracy of food classification from images. In later studies, the same team used pairwise statistics to detect ingredient relations in food images, achieving an accuracy range of 19% to 28% on the Pittsburgh Fast-Food Image Dataset []. Taichi and Keiji [], from the University of Electro-Communications (UEC) team, used multiple kernel learning, which integrates different image features such as color, texture, and scale-invariant feature transform. This method achieved 61% accuracy on a new dataset of 50 food classes and 37.5% accuracy on real-world images captured using a mobile phone []. In 2011, Bosch et al [] from the Technology Assisted Dietary Assessment (TADA) team achieved an accuracy of 86.1% for 39 food classes using an SVM classifier that incorporated 6 features derived from color and texture []. These results suggest that including a larger number of features in the algorithms could potentially improve detection accuracy.
After active research, the accuracy of handcrafted algorithms reached a saturation point around 2014. The optimized bag-of-features model was applied to food image recognition by Anthimopoulos et al []; it achieved an accuracy of up to 77.8% for 11 classes of food on a food image dataset containing nearly 5000 images for the type 1 diabetes project GoCARB. Pouladzadeh et al [] achieved 90.41% accuracy for 15 food classes using an SVM classifier with 4 image features: color, texture, size, and shape. Kawano and Yanai [] (UEC) attained 50.1% accuracy on a new dataset comprising 256 food classes, using a one-vs-rest classifier with Fisher vector encoding of color features and RootHoG, a feature derived from the histogram of oriented gradients []. While handcrafted algorithms yielded high-accuracy results for their specific test datasets with fewer food classes, they struggled to handle larger class sets and real-world images. This difficulty arose from factors such as challenging lighting conditions, image noise, distorted food shapes, variations in food colors, and the presence of multiple items within the same image. Handcrafted algorithms appeared to have reached the limit of their ability to improve further.
In contrast, the novel approach called deep learning, which can automatically extract features from input data, appears to be more suitable for complex tasks such as food identification. The convolutional neural network (CNN), considered one of the approaches in deep learning, was developed for image analysis in 1998 []. A CNN reads a square patch of pixels of an input image, referred to as a receptive field, and then applies a mathematical function to the read data. The operation is performed repeatedly from the top-left corner until reaching the bottom-right corner of the input image, in a manner similar to matrix multiplication or the dot product in linear algebra. CNNs and deep learning were applied to the food identification task in 2014 by the UEC team []. This system achieved an accuracy of 72.3% on a dataset containing 100 classes of real-world Japanese food images, named UEC FOOD-100, surpassing their previous handcrafted system from 2012, which achieved 55.8% on the same dataset []. This marked the beginning of the era of applying deep learning techniques to food identification. Later that year, the UEC team also released an international food image dataset called UEC FOOD-256, containing 256 food classes, to facilitate further research []. Around the same time, the FOOD-101 dataset was made available, comprising nearly 101,000 images of 101 different food items []. Its creators also presented baseline classification results from a random forest-based algorithm, one of the handcrafted algorithms, and compared it with a CNN; the CNN achieved an accuracy of 56.4%, while the random forest-based algorithm achieved 50.76% on this dataset. These food image datasets have become favored benchmarks for subsequent food identification systems.
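As a concrete illustration of the sliding receptive-field operation described above, the short NumPy sketch below moves a single 3×3 kernel over a toy grayscale image and takes an elementwise product and sum at each position. It is a didactic example only, not the implementation used by any of the systems cited here.

import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide a square kernel over the image, computing a dot product at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):        # from the top-left corner ...
        for j in range(out_w):    # ... to the bottom-right corner
            receptive_field = image[i:i + kh, j:j + kw]
            output[i, j] = np.sum(receptive_field * kernel)  # elementwise product, then sum
    return output

# A 6x6 toy "image" and a 3x3 vertical-edge kernel.
image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=float)
print(conv2d(image, kernel))  # a 4x4 feature map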
Another important technique is transfer learning, which is well-known for training many deep learning algorithms, including CNNs. It involves 2 stages: pretraining and fine-tuning. Initially, the model is trained with a large and diverse image dataset, and then it is further trained with a smaller, more specific dataset to enhance detection accuracy. This approach is similar to how humans are educated, where broad knowledge is learned in school followed by deeper knowledge in university. The UEC team applied this training approach to the food identification task in 2015 and successfully achieved an accuracy of 78.77% on the UEC FOOD-100 dataset []. It has been reported that pretraining on large-scale datasets for both food and nonfood images could improve the classification system’s accuracy beyond 80% [-], which is considered to surpass all handcrafted algorithms and be sufficient for real-world applications.
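A minimal sketch of this pretrain-and-fine-tune idea, assuming PyTorch and a recent version of torchvision are installed: a network pretrained on ImageNet is loaded, its feature extractor is frozen, and the final layer is replaced to predict a chosen number of food classes. The class count, hyperparameters, and the random batch are placeholders standing in for a curated food image dataset.

import torch
import torchvision

NUM_FOOD_CLASSES = 100  # placeholder, eg, the size of a UEC FOOD-100-style label set

# Pretraining stage: reuse ImageNet weights instead of training from scratch.
model = torchvision.models.resnet50(weights="IMAGENET1K_V1")

# Freeze the pretrained feature extractor so only the new head is trained at first.
for param in model.parameters():
    param.requires_grad = False

# Fine-tuning stage: replace the ImageNet classification head with a food-specific one.
model.fc = torch.nn.Linear(model.fc.in_features, NUM_FOOD_CLASSES)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

# One illustrative training step on a random batch standing in for real food images.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, NUM_FOOD_CLASSES, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()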
Currently, numerous state-of-the-art object detection and classification models built on the pretrain-and-fine-tune training paradigm have been developed and are available, such as AlexNet (an image classification network that won the ImageNet Challenge in 2012; it is named after its first author, Alex Krizhevsky) [], region-based CNN (R-CNN; an object detection model that significantly improved detection performance by combining region proposals with CNNs) [], residual network (ResNet; a deep learning model that won the ImageNet Challenge in 2015, known for its innovative use of residual learning to train very deep networks) [], You Only Look Once (YOLO; an object detection model that frames detection as a single regression problem, predicting bounding boxes and class probabilities directly from full images in one evaluation) [], Visual Geometry Group (VGG) [], and Inception (an image classification architecture that won the ImageNet Challenge in 2014, recognized for a novel design that efficiently leverages computing resources inside the network) []. These models have been designed to automatically extract features from input images and to learn the distinct characteristics of each class during training. Deep learning-based models have shown great promise in image recognition, especially in complex tasks such as food identification, and these models and their derivatives are commonly found in many of the food identification systems developed later. The availability of these state-of-the-art models presents an exciting opportunity for nutrition researchers who may not have a background in computer engineering or data science: they can now create high-performance food identification systems for specific tasks by curating a food image dataset and training a model accordingly. With the various algorithms available, it is crucial to consider their characteristics carefully to select the most suitable one for a given application. Notable food identification systems are listed in Table 1.
Table 1. Overview of notable food identification systems, including the classifier algorithm, selected features, number of classes, name of the food dataset (noted as the team’s own dataset if unspecified), and accuracy results.a Columns: study and year; project or team; classifier; features; classes (dataset); accuracy results (%).
aNote that convolutional neural network–based classifiers do not require the number of features to be shown, as they extract features autonomously.
bPFID: Pittsburgh Fast-Food Image Dataset.
cSVM: support vector machine.
dBoSIFT: bag-of-scale-invariant feature transform.
eUEC: University of Electro-Communications.
fMKL: multiple kernel learning. This is a machine-learning technique that combines multiple kernels or similarity functions, to improve the performance and flexibility of kernel-based models such as support vector machines.
gSIFT: scale-invariant feature transform.
hBoF: bag-of-features.
iGabor is a texture feature extraction method based on Gabor filters, which are named after Dennis Gabor.
jHOG: histogram of oriented gradients—a feature descriptor based on the distribution of gradient orientations.
kTADA: Technology Assisted Dietary Assessment.
lTamura is a set of 6 texture features proposed by Hideyuki Tamura.
mHaar wavelet is a wavelet sequence used in mathematical analysis, named after Alfréd Haar.
nSteerable filter is an image filter introduced by Freeman and Adelson.
oDAISY is a local image descriptor introduced by Tola et al []; the authors did not define DAISY as a true acronym.
pHSV is a color model that describes colors by hue, saturation, and value; it is an alternative representation of the red-green-blue model.
qDCD: dominant color descriptor.
rMDSIFT: multiscale dense scale-invariant feature transform.
sSCD: scalable color descriptor.
tNot available.
uCNN: convolutional neural network.
vInception is an image classification architecture that won the ImageNet Challenge in 2014, recognized for a novel design that efficiently leverages computing resources inside the network.
wVGG: visual geometry group—an image classification architecture named after a research group at the University of Oxford.
xAlexNet is an image classification network that won the ImageNet Large-Scale Visual Recognition Challenge (also known as the ImageNet Challenge) in 2012; it is named after its first author, Alex Krizhevsky.
yWISeR: wide-slice residual.
zMSMVFA: multi-scale multi-view feature aggregation.
aaMADiMA: Multimedia Assisted Dietary Management.
Food Portion Size Estimation System
Overview
Food portion size estimation is a challenging task for researchers because it requires accurate information on the amount of food, ingredients, or cooking methods that cannot be obtained from a captured image alone without additional input, which makes it harder to create a food image dataset with portion size annotation. Furthermore, quantifying an object’s size from a single 2D image faces common image perspective distortion problems [,], as shown in Figure 3. First, the size of the object in the image changes with the distance between the object (food) and the capturing device (smartphone or camera): the white rice in Figure 3A appears smaller than in Figure 3B because the rice in Figure 3B is closer to the camera. Second, the angle at which the photo is taken also alters the perceived object size. For example, flattened foods such as rice spread out on a 23-cm (9-inch) circular plate appear at full size in a bird’s-eye shot (90°), as in Figure 3C, but appear smaller when photographed from approximately 30° above the tabletop, as in Figure 3D. Third, depth information is lost in the bird’s-eye views in Figures 3E and 3F, making it difficult to compare food B and food C. The weights of foods A, B, C, and D are 48, 49, 62, and 149 grams, respectively. We use these images to teach image-based portion estimation to dietetics students.
While pretrain-and-fine-tune training of CNNs has become a near-universal solution for food image identification, there is currently no equivalent solution for portion estimation. Many researchers are actively seeking ways to calibrate object size within an image to mitigate these errors, and several approaches are discussed here. Portion estimation can be broadly classified, based on complexity, into 4 progressive categories: (1) pixel density, (2) geometric modeling, (3) 3D reconstruction, and (4) depth camera. Table 2 provides an overview of notable systems for volume estimation.
Figure 3. Common image perspective distortion problems. First, position distortion: the white rice in (A) appears smaller than in (B) because the white rice in (B) is closer to the camera. Second, angle distortion: the white rice in (C) is fully visible at 90 degrees, while it appears smaller when taken from 30 degrees, as in (D). Third, there is a loss of depth information in the bird’s-eye views in (E) and (F), making it difficult to compare food B and food C.
Table 2. A comprehensive overview of notable publications for the 4 volume estimation approaches, arranged chronologically. Columns: approach, study, and year; project or team; reference object; item; reported error.
aNot available.
bN/A: not applicable.
cUEC: University of Electro-Communications.
dTADA: Technology Assisted Dietary Assessment.
eSLAM: simultaneous localization and mapping.
fMADiMA: Multimedia Assisted Dietary Management.
gMARE: mean absolute relative error.
Revisiting the Classic Pixel Density Approach
Pixel density is the simplest approach and can provide effective estimates. After a food image is segmented, the number of pixels in each segmented section is counted. Mathematical equations or other transformations are then used to convert the pixel count of each section into the portion size of the food presented in the image.
However, this approach suffers from image distortion problems, and several approaches have been implemented to combat this drawback. The simplest method is the use of a physical reference object or fiducial marker for calibrating the size of objects in an image. When the real size of the reference object is known, the real size of an object can be determined relative to the reference object. This method was chosen for food volume estimation during its early development stage [,,]. Various physical objects have been used as reference objects in the literature, including a special patterned card [,], a known-size circular plate [] or bowl [], chopsticks [], a 1-yuan coin [], a wallet [], a user’s thumb [,], or even rice grain size [].
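A minimal sketch of the reference-object idea follows: if a fiducial marker of known physical area occupies a known number of pixels, the segmented food region can be converted to square centimeters and then, via an assumed area-to-weight factor, to a portion estimate. All numbers here, including the grams-per-square-centimeter factor, are illustrative assumptions rather than values from the cited systems.

def pixels_to_area_cm2(food_pixels: int, reference_pixels: int, reference_area_cm2: float) -> float:
    """Convert a segmented pixel count to physical area using a known-size reference object."""
    cm2_per_pixel = reference_area_cm2 / reference_pixels
    return food_pixels * cm2_per_pixel

# Illustrative values: a credit card-sized marker (about 46 cm^2) covering 9200 pixels
# and a rice region covering 31,000 pixels in the same image.
food_area_cm2 = pixels_to_area_cm2(food_pixels=31_000,
                                   reference_pixels=9_200,
                                   reference_area_cm2=46.0)

# A food-specific factor (grams per cm^2 of visible area) would normally be fitted from
# training images with weighed ground truth; 0.9 g/cm^2 is a placeholder.
estimated_weight_g = food_area_cm2 * 0.9
print(round(food_area_cm2, 1), "cm^2 ->", round(estimated_weight_g, 1), "g")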
Geometric Modeling Approach
Assuming that the food has a cylindrical shape, such as compressed steamed rice (Figure 4A), its volume can be calculated using the conventional formula V = πr² × h, where the radius r and height h can be determined by counting pixels in the image. While this approach is effective for geometric shapes, it is less reliable for irregular shapes that lack a specific equation. This approach is demonstrated in Figure 4B, where the user selects a predefined shape and then manually fits (or registers) the geometric model to the image.
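As a worked example of the geometric modeling calculation, the sketch below converts pixel measurements of a cylindrical food to centimeters using a calibration factor and applies V = πr²h (1 cm³ corresponds to 1 mL). The pixel measurements and the calibration factor are assumptions chosen purely for illustration.

import math

def cylinder_volume_ml(radius_px: float, height_px: float, cm_per_px: float) -> float:
    """Fit a cylindrical model: convert pixels to cm, then apply V = pi * r^2 * h."""
    r_cm = radius_px * cm_per_px
    h_cm = height_px * cm_per_px
    return math.pi * r_cm ** 2 * h_cm  # cm^3, ie, mL

# Illustrative values: a mound of compressed rice 180 px wide (radius 90 px) and 60 px
# tall in the image, with a calibration of 0.05 cm per pixel from a reference object.
volume_ml = cylinder_volume_ml(radius_px=90, height_px=60, cm_per_px=0.05)
print(round(volume_ml, 1), "mL")  # approximately 190.9 mL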
The TADA team reported the use of several predefined shapes of foods, including cylindrical, flattop solid, spherical, and prismatic models [,,,]. Prismatic models were specifically used to estimate portion sizes of irregularly shaped foods. This approach allowed a more accurate estimation of portion sizes by considering the unique characteristics of each food item. The research team at the University of Pittsburgh proposed a similar technique known as wireframe modeling. This technique involves creating a skeletal representation of an object using lines and curves to define its structure to accurately capture the shape and dimensions of food items [,]. However, this approach is also affected by common image distortion problems. Initially, a physical reference object was used for calibration.
Geometric modeling shares a fundamental principle with augmented reality (AR), a technology that transforms 2D environmental images into 3D coordinates in a computer system. As AR has become more widely available on smartphones, many researchers have explored the feasibility of using AR as a calibration method instead of physical reference objects [,]. AR-based object length measurement is demonstrated in Figure 5.
Figure 4. Various approaches to estimating food volume. (A) A cylindrical shape of 75 grams of brown rice taken from a 60° angle. (B) Geometric modeling with a predefined cylindrical shape, where the user needs to adjust each point manually to fit the object. (C) A predicted depth map from a state-of-the-art dense prediction transformer. (D) A 3D reconstructed object using the depth information from (C). These images have been adjusted in size for visual comparison purposes.
Figure 5. Measuring the size of the same banana using different techniques. (A) A standard ruler used as the ground truth measurement, (B) the Samsung AR Zone app, and (C) the Apple iPhone Measure app. These apps use the gyroscope or accelerometer sensors in the mobile phone to accurately track the movement of the phone as the measurement line is drawn.
3D Reconstruction
This technique uses ≥2 images taken from different angles to create virtual 3D objects in 3D coordinates in a computer system. It shares the same principle as both AR and geometric modeling, and the reconstructed objects are represented similarly to the prismatic models in geometric modeling. Furthermore, this technique allows for the inclusion of shapes beyond traditional geometric shapes.
While several researchers have explored the use of 3D reconstruction [,,], 1 notable example is the GoCARB system []. This system requires 2 images taken from different angles to construct a 3D model of the food, achieving an accuracy within 20 grams for carbohydrate content estimation. This level of accuracy is comparable to estimates made by dietitians when the food is completely visible on a single dish with an elliptical plate and flat base [].
Figures 4C and 4D demonstrate a similar 3D reconstruction approach, implemented using a state-of-the-art dense prediction transformer model to predict a depth map from a single image (Figure 4A), followed by the reconstruction of the 3D object using the predicted depth map.
Depth Camera Approach
This method operates on the same principle as geometric modeling and 3D reconstruction, but it requires a special time-of-flight (ToF) sensor (also known as a depth camera) to measure an object’s size in 3D coordinates in a computer system. Initially, the application of depth cameras in food volume estimation was limited, primarily due to their high cost and limited availability []. However, with the introduction of consumer-grade depth cameras, such as Kinect (Microsoft Corp), Intel RealSense, and smartphones equipped with depth sensors, their accessibility increased, leading to wider use in food volume estimation applications [,,,,].
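A simplified sketch of how a depth (ToF) frame can be converted into a volume estimate under a bird’s-eye view: each pixel covers a small patch of the scene whose physical size depends on its depth and the camera’s focal lengths, and the food volume is the sum of the column heights above the table plane. The intrinsics and the synthetic depth map below are placeholders and do not use any depth camera vendor’s API.

import numpy as np

def volume_from_depth_ml(depth_m: np.ndarray, table_depth_m: float, fx: float, fy: float) -> float:
    """Integrate per-pixel columns between the food surface and the table plane (top-down view)."""
    height_m = np.clip(table_depth_m - depth_m, 0.0, None)  # food height above the table
    # Physical footprint of one pixel at its measured depth (pinhole camera model).
    pixel_area_m2 = (depth_m / fx) * (depth_m / fy)
    volume_m3 = float(np.sum(height_m * pixel_area_m2))
    return volume_m3 * 1e6  # 1 m^3 = 1,000,000 mL

# Synthetic example: a table 0.50 m from the camera, with a 4 cm tall food mound
# occupying the central region of a 100 x 100 depth map (fx, fy in pixels).
depth = np.full((100, 100), 0.50)
depth[20:80, 20:80] = 0.46
print(round(volume_from_depth_ml(depth, table_depth_m=0.50, fx=600.0, fy=600.0), 1), "mL")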
Nevertheless, the availability of depth sensors remains a significant challenge in implementing this system. Currently, only a limited number of mobile phone models are equipped with such sensors. In addition, some manufacturers integrate the sensor with the front camera for authentication purposes, such as Apple’s FaceID, making it impractical for capturing object photos. Moreover, certain mobile device manufacturers have omitted the ToF sensor in their recent models [], further reducing the availability of depth sensors and posing implementation challenges for the depth camera approach.
An example of depth information captured by the Intel RealSense D435i depth camera, displayed in red-green-blue with depth (RGBD) format, is shown in Figure 6B. Rendered objects from a captured polygon file are demonstrated as freely rotatable 3D objects in Figures 6C and 6D, with a regular RGB (red-green-blue; a color model based on additive color primaries) image shown for comparison in Figure 6A.
Figure 6. (A) A typical red-green-blue image showing 3 Burmese grapes, each weighing approximately 20 grams. (B) A red-green-blue image with depth captured by an Intel RealSense D435i from a bird’s-eye view. (C) and (D) 3D reconstructed objects from the polygon file, illustrating the height of each fruit from different angles.
Going Beyond the Traditional Approach With Deep Learning
Advancements in deep learning are opening more possibilities to improve IADA systems by merging some steps (or even all steps) of the multistep pipeline into a single model, which can then be fine-tuned as a whole. Given the rise in IADA research following the emergence of advanced algorithms, we can highlight only a few reports that demonstrate the gradual enhancements in IADA in this paper.
In 2015, Myers et al [] from Google proposed the Im2Calories system, using deep learning for all stages of IADA. The classifiers are based on the GoogLeNet architecture, and the classification results are used to improve the semantic segmentation handled by the DeepLab network. For volume estimation, a new CNN architecture, trained with an RGBD dataset, estimates a depth map from a single RGB image and then converts the depth map to a volume in the final step. Although the absolute error for some test foods could exceed 300 mL, the overall volume estimation results were deemed acceptable. The system still requires a food composition database to determine the nutritional values of the food in the final step.
The idea of using deep learning to estimate food volume is gaining popularity, and several systems are transitioning to deep learning algorithms that estimate food volume without an actual ToF sensor. In 2017, a carbohydrate counting algorithm named Diabetes60 was proposed by Christ et al []. The system reported food-specific portions called “bread units,” which are defined to contain 12 to 15 grams of carbohydrates; this definition closely resembles the “carb unit” widely used in the diabetes field or the “exchange unit” in dietetic practice. The system was based on ResNet50 and trained using an RGBD image dataset that contained human-annotated bread unit information. It achieved a root mean square error of 1.53 (approximately 18.4-23 g of carbohydrate), while humans could achieve a root mean square error of 0.89 (approximately 10.7-13.4 g of carbohydrate) when compared with the ground truth. A modified ResNet was also used for fruit volume estimation, achieving an error of 2.04% to 14.3% for 5 types of fruit and 1 fruit model []. Furthermore, Jiang et al [] introduced a system to classify liquid levels in bottles into 4 categories: 25%, 50%, 75%, and 100%. Using their own CNN architecture, they achieved a 92.4% classification accuracy when the system was trained with 3 methods of data augmentation; moreover, the system achieved 100% classification accuracy when the bottle images had labels removed.
One challenge in converting a single 2D image into a 3D object is the difficulty of capturing the back side of an object in single-view images due to factors such as view angle or occlusion; consequently, the food volume may be underestimated. Point2Volume was introduced in 2020 by Lo et al [] to address these limitations. The system builds upon 2 of their previous works: a deep learning view synthesis method [] and a point completion network []. When a single depth image is captured, a Mask region-based CNN (a combination of object detection and instance segmentation) is applied to the image, and the resulting partial point cloud is completed before the volume is estimated.