Well-being is a complex and multifaceted concept. As the US Centers for Disease Control and Prevention acknowledges, there is “no consensus around a single definition of well-being,” but generally, well-being refers to “judging life positively,” “feeling good,” and the experience of good physical health []. It also includes dimensions such as physical health, emotional health, economic circumstances, life satisfaction, and engaging activities and work []. In the early 21st century, numerous scholars suggested that well-being should be quantified to create indicators that could guide national public policies. Consequently, many scales have been proposed [-]. However, the absence of a universally accepted definition of well-being has made it challenging to establish a single, comprehensive measure []. Current measures often fall into categories such as “life evaluation,” “hedonic well-being,” and “eudaimonic well-being,” each capturing different aspects of the well-being spectrum []. Yet, because these dimensions are deeply interconnected, scales focused on specific aspects may not fully capture the overall well-being of individuals.
The Need for a Comprehensive Well-Being Indicator
Historically, well-being has been measured primarily by economic indicators such as gross domestic product (GDP). However, a prior study has suggested that economic indicators are “no longer a complete approximation of how well a nation is doing” []. GDP is only one aspect of well-being and does not capture socioeconomic inequalities, life satisfaction, or health status []. For the nation as a whole, previous research suggests that as the GDP or economic activity increases, the standard of living of the population improves []; however, in a well-developed society, there may be some citizens who sacrifice their well-being for economic efficiency. In fact, economic growth can lead to improved living conditions in some areas, but an examination of the United States revealed that despite a 3-fold GDP increase in the past 50 years, life satisfaction levels have not risen [], while depression and anxiety rates have risen []. Therefore, GDP growth is not equivalent to improved well-being, and more comprehensive well-being indicators that include multiple factors such as health, work, and social connections need to be used in policy evaluation. The proposal, however, is not to replace traditional economic indicators completely but to add new indicators of well-being and examine the relationship between the economy and well-being [].
Well-Being Indicators Currently Used for Policy Evaluation and Their Challenges
Several countries and international organizations use a variety of well-being indicators, which are a mix of subjective and objective indicators. Organizations such as the Organization for Economic Co-operation and Development (OECD) and United Nations Development Program, as well as countries including New Zealand, the United Kingdom, France, and Italy, use objective well-being metrics encompassing health (eg, lifespan), job opportunities (eg, employment rate), environmental conditions (eg, greenhouse gas emissions), safety (eg, crime rate), and governance (eg, voter turnout) [-]. Subjective indicators involve individuals’ self-evaluations of their lives, as exemplified by tools such as the Satisfaction With Life Scale [] and the Subjective Happiness Scale []. In practical policy implementation, dashboards that visualize data for both objective and subjective indicators have been devised to holistically assess social well-being. In countries like the United Kingdom and New Zealand, unique well-being indices have been developed to gauge citizen welfare and guide policy and fiscal decisions [-].
In addition, some international organizations are involved in the development of well-being indicators and are publishing reports on international comparisons. Typical examples include the Better Life Index (BLI) of the OECD [], the World Happiness Report of the United Nations [], and the Human Development Index of the United Nations Development Program []. Moreover, while many well-being indicators target national units, there are initiatives to establish metrics for more localized regional assessments within countries [], illustrating diverse methodologies to evaluate comprehensive well-being.
These preexisting well-being indicators have some limitations and challenges. Conducting large-scale epidemiological surveys, which are necessary for both objective and subjective indicators, incurs significant human, temporal, and financial costs. In several resource-limited nations, conducting surveys can be challenging, and individuals experiencing severe well-being deficiencies are often more prone to nonresponse, which hinders the assessment of their well-being. In addition, survey items rely on multiple government surveys whose survey years often vary, so the indicators can be evaluated only once every few years. It is also difficult to standardize the items surveyed across all countries, making international comparisons difficult. Therefore, an affordable indicator that is comparable across regions and can be obtained and evaluated in a timely manner for policy evaluation would be very useful for policymakers.
Web Log Data as a Policy Indicator
With the spread of the internet, methods for using web log data to predict statistics for policy evaluation have been reported. Statistics related to well-being indicators have also been associated with web log data, including health [], job availability [], environmental quality [], safety [], governance [], and subjective elements, such as emotional well-being [] and life satisfaction []. The web log data used in previous studies varied and included search volume logs from internet search engines, such as Google and Yahoo! Search, as well as log data from social networking services (SNSs), including X (Twitter), Facebook, and Instagram. This study used search volume log data from the search engine Google, which can be collected from Google Trends []. Google is one of the major search engines used in 195 countries around the world, making it easy to ensure reproducibility in other countries. Another advantage of search engines over social networking tools is that there is less bias in demographic information such as user age, gender, and race. While information posted on SNSs is often directed toward society and others and may contain only overly idealistic information, search behavior on search engines is an individual’s internal process and might be able to reduce the confirmation bias that amplifies the information that people find favorable.
Objective
This study aimed to develop a model that predicts comprehensive well-being indicators from search volume log data of internet search engines. This approach seeks to bypass the need for large-scale statistical surveys, thereby reducing budgetary and human resource requirements. In other words, it enables policymakers to assess the well-being of the public at low cost and in a timely manner, thereby facilitating more effective policy decisions.
The Regional Well-Being Index (RWI) for Japan [], structured based on the BLI methodology of the OECD, was used as the outcome variable. The RWI, like the BLI, consisted of 11 domains: “Income,” “Jobs,” “Housing,” “Health,” “Work-Life balance,” “Education,” “Community,” “Civic Engagement,” “Environment,” “Safety,” and “Life Satisfaction.” This index focuses on integrating both subjective and objective indicators to comprehensively evaluate well-being.
The “Regional Well-Being” of the OECD [] provides detailed well-being scores for subnational regions, at smaller geographical scales than the national level. However, in several countries, the administrative divisions responsible for policymaking do not coincide with these subnational regions. For instance, the OECD’s regional well-being for Japan is presented at a relatively macroscopic level, segmenting the nation into 10 regions, including Tohoku and Kansai, each encompassing several prefectures. This level of aggregation differs from the levels at which policy decisions are operationalized, typically the prefectural and municipal levels. To use well-being indicators more efficiently, it is important to calculate them for each administrative level where policy evaluation and decision-making are conducted. Therefore, this study adopted the RWI, a comprehensive well-being indicator at the prefectural level in Japan, based on the BLI methodology, as its outcome.
Yang and Taira [] provided domain-specific scores and an integrated RWI (IRWI) that aggregated all domains by prefecture. These RWIs assessed the well-being of all 47 Japanese prefectures for 2010, 2013, 2016, and 2019, affirming their reliability and validity relative to the BLI and the existing well-being index []. Due to the data unavailability of certain indicators constituting the RWI, the 2019 data remain the most recent. Subsequent updates have been delayed due to factors such as the COVID-19 pandemic and prolonged data aggregation processes.
Internet Search Log Data as Predictor Variables
The relative search volume (RSV) in the internet search engine Google was used as the predictor variable and was obtained from the Google Trends website. Google Trends enables the tracking of temporal variations in the popularity of specific search words on Google and YouTube while capturing regional search dynamics and related words. Google Trends was used to collect related words in addition to the RSVs for the main words. The “Related keywords” feature of Google Trends offers other words that are frequently searched in relation to a selected word. “Related keywords” include “Top searches” and “Rising searches,” where “Top searches” are the words most frequently searched within the same session as the selected word, and “Rising searches” are the words whose search frequency has increased the most significantly over a specific period. Google Trends collects up to 25 related words.
In Google Trends, RSVs are normalized on a 0-100 scale, where the word with the highest search volume in a specific period scores 100, and the frequencies of other words are adjusted accordingly. Moreover, Google Trends facilitates the comparison of search frequencies for words across different states or prefectures within a country by normalizing the data so that the area with the highest frequency for a word scores 100 and the other areas are scaled relative to it. For instance, a search frequency score of 50 for a word in one region suggests that its search volume is half of that of the highest-ranked region.
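As an illustration of the scaling described above, the following minimal Python sketch rescales a set of regional values so that the highest becomes 100. The function name `to_relative_search_volume` and the input figures are hypothetical; Google Trends does not expose raw counts, and rounding to whole numbers is an assumption made here for readability.

```python
def to_relative_search_volume(raw_values):
    """Rescale values so that the largest becomes 100 (illustrative only)."""
    peak = max(raw_values.values())
    if peak == 0:
        # No searches anywhere: every region scores 0.
        return {region: 0 for region in raw_values}
    return {region: round(100 * value / peak)
            for region, value in raw_values.items()}


# Hypothetical searches per 100,000 queries in three prefectures.
example_rates = {"Tokyo": 80, "Osaka": 40, "Okinawa": 20}
example_rsv = to_relative_search_volume(example_rates)
print(example_rsv)  # {'Tokyo': 100, 'Osaka': 50, 'Okinawa': 25}
```

In this toy example, Osaka scores 50 because its rate is half that of the highest-ranked region, which mirrors the interpretation given in the text.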
First, in the procedure for obtaining RSVs, following the methodology established in prior research that developed the RWI framework, representative words for each domain were selected based on their relevance to the specific domain, ensuring alignment with their Japanese translations (). After selecting the main word for each domain, related search words were collected using Google Trends, which can gather up to 25 related words for any given word. The collection period spanned between January 1, 2010, and December 31, 2019, aligned with the RWI measurement years, and RSVs of related search words during this period were collected. In addition, we collected RSVs for web searches within Japan, without limiting them to any specific category but including all categories. Data collection was conducted without using quotation marks around the search words. Since searching multiple words simultaneously would result in normalization with the word having the highest search volume set to 100, we collected RSVs for each term individually.
To extract the words for the predictive model, those that included specific regions, companies, or personal names were excluded. Additionally, 2 authors, YM and TK, double-checked and deleted words irrelevant to each domain. For example, words such as “福祉 住 環境” (welfare living environment) relating to the environment of an individual’s residence were judged inappropriate and removed, since the original “environment” domain contains the words referring to the “natural environment.” In the Community domain, words like “人間関係” (human relations) and “人間 関係,” differing only in spacing, were identified to represent the same concept. Since the search trends were similar () and it was considered that searchers had the same intent, only “人間関係” was included.
Finally, we collected the RSVs for all main and related words by prefecture. Google Trends allows for the comparison of search frequencies of specific words across states or prefectures within a selected country using its “Interest by Subregion” function. The function provides a country map shaded according to the term’s popularity. The color intensity represents the percentage of searches for the leading search term in a particular region. Search term popularity is relative to the total number of Google searches performed at a specific time, in a specific location. Data were collected for each keyword by setting specific time frames for the years 2010, 2013, 2016, and 2019 (eg, January 1, 2010, to December 31, 2010). In Google Trends, extremely low search frequencies are sometimes indicated as “less than 1” instead of “0”; such cases were treated as “0” in this study. The normalization process for RSVs was conducted for the data of each domain within each year, standardizing the annual prefecture-specific data. This means, for example, that the prefecture-specific data in the “Income” domain for the year 2010 were standardized to ensure comparability across prefectures.
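The two preprocessing steps described above (treating “less than 1” as 0 and standardizing prefecture-specific values within a year) can be sketched as follows. This is a minimal illustration, not the study’s code: the `"<1"` cell format, the helper names, and the use of the sample SD (as R’s `scale()` does by default) are assumptions, and the input values are invented.

```python
from statistics import mean, stdev


def parse_rsv(cell):
    """Convert an exported RSV cell to a float, treating '<1' as 0."""
    text = str(cell).strip()
    return 0.0 if text == "<1" else float(text)


def standardize(values):
    """Return z-scores using the sample SD (assumed; needs >= 2 values)."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        # All prefectures identical: no variation to standardize.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical RSVs for one word in one year across three prefectures.
raw_cells = ["100", "50", "<1"]
rsv_values = [parse_rsv(c) for c in raw_cells]
z_scores = standardize(rsv_values)
print(z_scores)  # approximately [1.0, 0.0, -1.0]
```

Standardizing within each year keeps prefectures comparable inside that year while discarding differences in overall search level between years.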
Statistical Analysis
A descriptive statistical analysis of the RWIs and RSVs was conducted for all words. We calculated the mean and SD of the RSVs for the representative word of each domain for the years 2010, 2013, 2016, and 2019 by prefecture to explore temporal variations in search behaviors across different regions. We also calculated Pearson partial correlation coefficients, adjusting for the data year and the population of each prefecture [], to evaluate the association between the RWI scores and the RSV of each keyword.
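For readers unfamiliar with partial correlation, the sketch below shows one standard way to compute it: the recursive formula that removes one control variable at a time. The study itself used R, so this Python version, its function names, and its toy data are illustrative assumptions only.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_corr(x, y, controls):
    """Correlation of x and y after adjusting for each control variable."""
    if not controls:
        return pearson(x, y)
    z, rest = controls[-1], controls[:-1]
    r_xy = partial_corr(x, y, rest)
    r_xz = partial_corr(x, z, rest)
    r_yz = partial_corr(y, z, rest)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy data: x and y are related only through the shared control z.
z = [-1, -1, 1, 1]
x = [-2, 0, 0, 2]
y = [0, -2, 0, 2]
raw_r = pearson(x, y)                # 0.5 before adjustment
adjusted_r = partial_corr(x, y, [z]) # approximately 0 after adjustment
```

In the study’s setting, `controls` would hold two sequences (data year and prefecture population), and `x` and `y` would be the RSV of a word and an RWI score.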
As a supplementary analysis, we conducted a spatial evaluation of IRWI scores to assess the geographical interrelationships of well-being across regions. Well-being may be influenced by regional cultural characteristics, leading to similarities in IRWI scores among neighboring areas. Understanding these spatial relationships can provide insights for considering regional collaboration in policy interventions.
We first applied Global Moran I [] to assess the overall spatial autocorrelation of IRWI scores across Japan, identifying whether scores were clustered or dispersed on a national level. A significant positive Global Moran I indicates the clustering of similar values, while a negative value suggests the dispersion of dissimilar values. Additionally, Local Moran I [] was used to analyze local similarities and differences between regions and their neighbors, allowing us to identify clusters of high or low scores. This analysis highlights regional disparities in well-being, which can inform targeted policy interventions. Prefectures without adjacent regions, such as Hokkaido and Okinawa, were excluded from this analysis.
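The Global Moran I statistic can be sketched in a few lines. The version below assumes binary contiguity weights (1 for adjacent regions, 0 otherwise); the source does not state which weighting scheme was used, and the function name and the 4-region chain are hypothetical.

```python
def global_morans_i(values, neighbors):
    """Global Moran's I with binary adjacency weights (assumed scheme).

    values: list of scores, one per region.
    neighbors: dict mapping a region index to the indices adjacent to it.
    """
    n = len(values)
    mean_value = sum(values) / n
    dev = [v - mean_value for v in values]
    cross_products = 0.0
    weight_sum = 0.0
    for i, adjacent in neighbors.items():
        for j in adjacent:
            cross_products += dev[i] * dev[j]
            weight_sum += 1.0
    squared_dev = sum(d * d for d in dev)
    return (n / weight_sum) * (cross_products / squared_dev)


# Four regions in a row: 0 - 1 - 2 - 3.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
clustered = global_morans_i([1, 2, 3, 4], chain)    # similar neighbors
alternating = global_morans_i([1, 4, 1, 4], chain)  # dissimilar neighbors
print(round(clustered, 3), round(alternating, 3))   # 0.333 -1.0
```

A region with an empty neighbor list contributes nothing to the weighted sum, which is one reason prefectures without adjacent regions are handled separately in such analyses.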
To predict the RWI, we used the Elastic Net, a machine learning technique designed to mitigate multicollinearity and overfitting in linear regression through regularization []. L1 regularization (lasso regression) applies a penalty to the absolute values of the coefficients, which excludes unnecessary variables from the model. L2 regularization (ridge regression) applies a penalty to the squared coefficients, thereby shrinking the coefficients of highly correlated explanatory variables to overcome multicollinearity. The Elastic Net combines these 2 types of regularization, enabling the creation of a model that addresses the multicollinearity between explanatory variables and selects important features. To further guard against overfitting, we estimated model parameters using data from 2010, 2013, and 2016, and then assessed the prediction errors with 2019 data. This approach enabled us to evaluate the model’s performance on out-of-sample predictions and thus to check for overfitting.
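For reference, the combination of the 2 penalties can be written as a single objective. The form below is a common parameterization (the one used by the glmnet package for R); the source does not state which package or parameterization was used, so it should be read as an assumption. Here N is the number of observations, y_i the outcome, x_i the vector of predictors, α the mixing ratio between the L1 and L2 penalties, and λ the overall penalty strength.

```latex
\hat{\beta} \;=\; \arg\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^{2}
\;+\; \lambda\Bigl[\alpha\,\lVert\beta\rVert_{1}
\;+\; \frac{1-\alpha}{2}\,\lVert\beta\rVert_{2}^{2}\Bigr]
```

Under this form, α=1 gives the lasso, α=0 gives ridge regression, and intermediate values blend the two.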
In this model, RSVs and prefectures served as predictor variables (with prefectures coded as dummy variables), whereas the IRWI score was used as the outcome variable. The performance of Elastic Net varies with 2 parameters: α, the mix ratio of L1 and L2 regularization, and λ, the strength of regularization. The optimal values for α and λ were determined by fixing α and identifying the λ that minimized the mean squared error through 10-fold cross-validation. This procedure was repeated 11 times, incrementally adjusting α from 0 to 1 by 0.1, and the model with the lowest mean squared error was selected. The predictive accuracy of the model was assessed using the root mean square error (RMSE) and the coefficient of determination (R2). All statistical analyses, including standard normalization, Pearson correlation coefficient calculation, and Elastic Net processing, were conducted using R software (version 4.1.3, The R Foundation).
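The tuning grid and the 2 accuracy metrics described above can be sketched as follows. This is an illustrative Python version rather than the R code used in the study; the function names and the toy values are hypothetical, and R2 is computed here as 1 minus the ratio of residual to total sum of squares, which is one common definition and an assumption about the source.

```python
from math import sqrt

# Eleven candidate mixing ratios: 0.0, 0.1, ..., 1.0.
alpha_grid = [k / 10 for k in range(11)]


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(observed, predicted):
    """Coefficient of determination as 1 - SS_res / SS_tot (assumed form)."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Toy held-out scores and predictions (invented numbers).
observed_scores = [1.0, 2.0, 3.0, 4.0]
predicted_scores = [1.5, 2.0, 3.0, 3.5]
example_rmse = rmse(observed_scores, predicted_scores)
example_r2 = r_squared(observed_scores, predicted_scores)
print(round(example_rmse, 3), round(example_r2, 3))  # 0.354 0.9
```

For each value in `alpha_grid`, a cross-validated search over λ would be run, and the pair with the lowest mean squared error retained, as the paragraph above describes.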
Ethical Considerations
This study made secondary use of publicly accessible data from Google Trends and open government statistics; therefore, ethical review was not required.
The median IRWIs by prefecture in Japan were 0.67 (IQR −2.48 to 2.71), 0.00 (IQR −2.85 to 2.76), 0.13 (IQR −3.05 to 2.49), and 0.19 (IQR −2.75 to 3.06) for 2010, 2013, 2016, and 2019, respectively () []. The RWIs by 11 domains were also shown in -.
From 2010 to 2019, the Global Moran I statistics for the spatial analysis of IRWI scores across Japan ranged from 0.297 to 0.526, reflecting significant spatial autocorrelation throughout the entire period (). Corresponding P values varied from 4.955 × 10⁻³ to 3.892 × 10⁻⁶, indicating strong statistical significance. For Local Moran I, median values in 2010, 2013, 2016, and 2019 were 0.333 (IQR 0.036 to 0.858), 0.214 (IQR 0.006 to 0.746), 0.113 (IQR −0.020 to 0.825), and 0.146 (IQR −0.055 to 0.829), respectively. shows the spatial distribution of Local Moran I scores, with regions color-coded according to their values for each year. Additionally, each prefecture is marked with red or blue dots to indicate whether their IRWI scores were above or below the median. This visualization showed the spatial clustering of regions with high or low IRWI scores.
Table 1. Integrated scores of the Regional Well-Being Index, by prefecture and year. All scores were standardized and unitless.
Eleven representative words, one for each domain, were extracted following a previous study [], and 275 related words were associated with these representative words. Of these, the RSVs for 211 words were collected after those that met the exclusion criteria were excluded ( and ). The mean search frequencies for the representative words of each domain during the data collection period (2010, 2013, 2016, and 2019) ranged from −1.587 to 3.902, with SDs ranging from 0.053 to 3.025 ().
The partial correlation coefficients between each domain of the RWI and the RSV of each word ranged as follows (minimum to maximum): for “Income,” from −0.301 to 0.226; for “Jobs,” from −0.315 to 0.133; for “Housing,” from −0.604 to 0.225; for “Health,” from −0.283 to 0.297; for “Work-Life Balance,” from −0.285 to 0.350; for “Education,” from −0.396 to 0.269; for “Community,” from −0.216 to 0.063; for “Civic Engagement,” from −0.233 to 0.269; for “Environment,” from −0.070 to 0.261; for “Safety,” from −0.183 to 0.036; and for “Life Satisfaction,” from −0.112 to 0.219, as detailed in . The partial correlation coefficients between the IRWI and the RSV of each word ranged from −0.409 to 0.362, as also noted in .
The best Elastic Net model was constructed using training data from 2010, 2013, and 2016 (α=0.1, λ=0.906, RMSE=1.290, and R2=0.940). This model was then used to predict outcomes for the 2019 test data, yielding an RMSE of 2.328 and an R2 of 0.665 ( and ). The model included 2 to 13 words per domain as features. The standardized partial regression coefficients for words ranged from −0.386 to 0.489, and those for the prefectures selected as features ranged from −0.704 to 0.439 ( and ).
Table 2. Elastic Net regression analysis metrics.
Index: Words selected for the model (standardized partial regression coefficient)ᵃ
Intercept: 4.078 × 10⁻¹⁵
Income: 給与 (−0.063), 所得 申告 (−0.006), 所得 証明 (−0.008), 所得 証明 書 (−0.021), 確定 申告 (0.489), 扶養 所得 (0.030), 譲渡 所得 (0.117), 退職 所得 (0.028), 住民 税 (0.013)
Jobs: 雇用 契約 (0.226), 雇用 保険 証 (0.020), 雇用 保険 者 証 (−0.125), 雇用 保険 被 保険 者 証 (0.043), 助成 金 雇用 (−0.104), 雇用 保険 料率 (−0.011), 失業 保険 (−0.386)
Housing: 住宅 (−0.116), 住宅 控除 (0.087), 賃貸 住宅 (−0.143), 市営 住宅 (−0.107), 住宅 ローン 控除 (−0.046), 注文 住宅 (0.033), マンション (−0.264), 住宅 展示 場 (−0.127), 住宅 情報 (−0.381), 高齢 者 住宅 (−0.025), リフォーム (0.059), 分譲 住宅 (0.299), エコ ポイント 住宅 (0.171)
Health: 健康 保険 組合 (0.360), 健康 センター (0.035), 健康 ランド (0.320), 健康 管理 (−0.048), 健康 保険 協会 (−0.026), 保険 証 (−0.069), 全国 健康 保険 協会 (−0.275), 健康 保険 とは (−0.135), 健康 管理 センター (0.075)
Work-life balance: 残業 手当 (−0.018), 労働 時間 (−0.164), 労働 基準 法 (0.001), 転職 (0.130), 月 残業 時間 (−0.159)
Education: 教育 (−0.118), 教育 大学 (−0.068), 教育 委員 会 (−0.338), 特別 教育 (−0.145), 教育 指導 (−0.247), 教育 ローン (−0.179), 教育 研究 所 (0.055), 義務 教育 (0.351), 教育 問題 (0.113)
Community: 疲れ た (−0.019), 人間関係 悩み (−0.053), 人間 関係 苦手 (0.141)
Civic engagement: 政治 (−0.048), 政治 ブログ (−0.307), 選挙 (0.250), 政治 問題 (−0.160)
Environment: 環境 基準 (−0.330), 自然 環境 (0.259)
Safety: 治安 (−0.134), 日本 治安 (−0.016)
Life satisfaction: 幸せ (−0.063), 幸せ の 時間 (−0.209), 幸せ 画像 (−0.007), 幸せ に なりたい (−0.133), 幸せ に なろう (−0.154), 幸せ に なるために (−0.034), 小さな 幸せ (−0.094)
Prefecture: Aomori (−0.260), Iwate (−0.246), Akita (0.010), Fukushima (−0.081), Saitama (−0.031), Tokyo (0.167), Toyama (0.033), Ishikawa (0.126), Fukui (0.275), Yamanashi (0.319), Nagano (0.089), Gifu (0.439), Shizuoka (0.323), Aichi (0.415), Mie (0.334), Shiga (0.243), Osaka (−0.704), Shimane (0.269), Okayama (−0.171), Hiroshima (−0.159), Yamaguchi (0.050), Tokushima (−0.188), Kagawa (−0.245), Kochi (−0.232), Fukuoka (−0.505), Kumamoto (−0.069), Miyazaki (0.044), Okinawa (−0.497)
ᵃ Please refer to for English translations of Japanese words.
Table 3. Model accuracy statistics.
ᵃ α represents the proportion of L1 to L2 regularization in Elastic Net. This value is calculated using the training data and is part of the model used to predict the test data but is not recalculated for the test data.
ᵇ Not applicable.
ᶜ λ represents the intensity of regularization. This value is calculated using the training data and is part of the model used to predict the test data but is not recalculated for the test data.
ᵈ RMSE: root mean square error; the square root of the mean squared error.
ᵉ R2: coefficient of determination.
The primary aim of this study was to predict the RWI for each prefecture in Japan using internet search log data. The best model in this study achieved an out-of-sample R2 value of 0.665 (in-sample R2 of 0.904; and ), which is relatively high compared with the R2 values ranging from 0.005 to 0.830 reported in previous regional-level well-being studies using web data () [,,-]. Most previous studies have focused on predicting subjective well-being or a single objective indicator, each representing only one aspect of well-being. Unlike earlier research, this study’s comprehensive approach to predicting the RWI underscores the efficacy of internet search data in evaluating overall well-being.
Table 4. Previous studies predicting well-being-related indicators using web data.
Columns: Levels and reference; Outcomes; Big Data measure; Big Data source; R2ᵃ
ᵃ R2: coefficient of determination.
ᵇ “VKontakte” is Russia’s social networking platform.
The primary advantage of this study is its potential to significantly reduce the time and economic resources required for conventional well-being assessments. The developed model allows for frequent and quick assessment of comprehensive well-being at the prefectural level in Japan using newly available search log data. This approach could be invaluable for policymaking aimed at enhancing well-being. Moreover, this methodology can be adapted beyond Japan, allowing countries where the BLI is measured to calculate their RWI based on the Japanese approach and to predict regional well-being using web data based on this study's methods. If the accuracy of the model is ensured, it will allow immediate and repeated assessments of comprehensive well-being. Additionally, the use of Google, a platform extensively used globally, in this study suggests its potential applicability even in many resource-limited nations where large-scale surveys pose challenges. The methodology proposed in this study could be tested in various countries, potentially enabling the assessment of well-being and the realization of evidence-based policy making (EBPM) for well-being improvement.
The analysis revealed marked disparities in mean search frequencies and their SDs when segmented by prefecture ( and ). These differences indicate variations in interest in well-being-related topics across regions. Notably, the tendency for higher search frequencies related to work-life balance in major urban areas, such as Tokyo and Osaka, might suggest an increased awareness of the working environment in these regions. Furthermore, the magnitude of the SDs reflects the degree of variability in search behavior during the observation period, suggesting the potential to assess changes in the interests and states of well-being among residents of different regions.
In the analysis of the relationship between each word and the RWI domain scores as well as the IRWI in this study, correlation coefficients showed both positive and negative values, with many remaining within the range of ±0.3. Notably, the “Housing” domain showed a relatively strong correlation, with a coefficient of approximately −0.6 for “マンション,” consistent with the range of −0.604 to 0.225 reported for this domain. A positive correlation coefficient suggests that as the search frequency for a word increases, so does the RWI domain or IRWI score, whereas a negative coefficient suggests the opposite. Although these correlations do not establish causality, they indicate a potential relationship between search behaviors related to specific words and the well-being of people in that area or with more comprehensive well-being. This implies that internet search data could be a viable means of understanding well-being.
As a supplementary analysis, Global Moran I results indicate spatial autocorrelation in Japan’s IRWI scores, suggesting that well-being in Japan is not randomly distributed but exhibits spatial patterns. Additionally, Local Moran I indicated where these spatial patterns specifically form. High-score clusters were predominantly found in the Chubu and Tokai regions, while low-score clusters were identified in parts of the Tohoku and Kyushu regions, as well as in Shikoku. This suggests that regional differences in well-being may not be solely due to isolated factors but could also be influenced by interrelationships with adjacent areas, possibly reflecting common factors across multiple regions. This finding could provide important implications for policy interventions aimed at improving well-being.
Limitations
While focusing on predicting a comprehensive well-being index (IRWI), this study did not consider the causal relationship between fluctuations in search log data and changes in each well-being indicator of the RWI. Although the out-of-sample R2 and RMSE indicate a high degree of accuracy, the gap between the in-sample and out-of-sample values suggests that the model might exhibit slight overfitting. This may be due to the limited data available, as the evaluation is restricted to a single year (2019). The RWIs used as outcomes in this study are based on somewhat outdated data, with 2019 being the most recent year available due to the unavailability of certain indicators constituting the RWI. Subsequent updates have been delayed by factors such as the COVID-19 pandemic and extended data aggregation processes. However, the year-to-year variations in RWIs are typically not substantial, suggesting that insights derived from data up to 2019 remain valuable and pertinent. Furthermore, including data from 2020 onward would introduce significant confounding effects due to the pandemic’s impact. Thus, the examination of these postpandemic trends and their implications remains an important issue for future research. Additionally, since this was an ecological study that used data at the prefecture level rather than at the individual level, we could not evaluate associations at the individual level. It also did not address the correlations with specific RWI domains or the relative importance of each domain. The well-being indicators, BLI and RWI, are designed to allow users to assess the significance of each domain in a flexible manner. Therefore, the results of this study, which treated each domain uniformly, may differ from the interpretation of the RWI when used in a prefecture. Consequently, implementation in the field should be a topic for future research. Another limitation is the lack of consideration for the emotional valence of search queries and the reason why certain specific words significantly correlated with well-being.
Although RSV captures public interest, which implicitly includes emotional aspects, the technical challenges of applying natural language processing techniques to individual search terms limited their use in this study. Therefore, future research is warranted to better understand regional characteristics and to derive more accurate interpretations of searchers’ intentions from search terms associated with well-being scores. Addressing these challenges could provide deeper insights into the relationship between web-based behavior and well-being.
In addition to these limitations, Google Trends data was limited by its relative nature, which might not have fully captured actual search activity. While it is possible that infrequently searched words would be overestimated, we believe that the impact is limited in this study because niche words were excluded during the process of extracting representative words. Additionally, the algorithms behind Google’s search functions introduced uncertainty in interpreting search intent []. It was also noted that the granularity of Google Trends data could have led to limitations in predictive accuracy [].
Comparison With Prior Work
This study successfully improved the predictive accuracy of comprehensive well-being indices at the regional level through the use of web data, achieving relatively higher accuracy compared with the outcomes of previous studies, as shown in . Kosinski et al [] and Schwartz et al [] used Facebook data and predicted individual-level life satisfaction but achieved relatively low R2 values of 0.003 and 0.090, respectively. Panicheva et al [] integrated various types of web data, including messages and attributes from VKontakte, an SNS popular in Russia and its neighboring countries, to predict individual-level life satisfaction and mental well-being; however, the improvement in R2 was limited.
In contrast, predictions at broader national and state levels yielded more accurate results. A notable example is Algan et al [], who achieved high predictive accuracy with R2 values of 0.940 at the national level and 0.720 at the state level for Life Satisfaction. Our study showed similar levels of accuracy in forecasting a more comprehensive well-being index, effectively capturing trends in regional well-being.
Furthermore, the predictive model in this study demonstrated a level of accuracy comparable to that reported by Carpi et al [], who predicted the Human Development Index, a comprehensive composite index analogous to the RWI used in this research. This result suggests that our methodology can efficiently and accurately predict comprehensive well-being indicators using web data and can therefore quickly identify broader well-being trends while saving time and money.
Implications and Actions Needed
Using the RWI calculation methodology of this study, OECD countries have the opportunity to calculate the RWI for their regions and apply it in regional policymaking. Moreover, the creation of high-accuracy models using internet search data facilitates the timely and continuous assessment of well-being. Governments and international organizations are shifting their focus from merely economic development and life expectancy to improving well-being; however, integrating well-being indicators into national and regional policy goals is traditionally time-consuming and costly. This study demonstrates the efficacy of our approach in predicting comprehensive well-being indicators through web data, indicating its potential to assess broader trends in well-being rapidly and at low cost. Furthermore, internet search data are likely to be less susceptible to biases common in traditional survey methods, such as recall and social desirability biases, and may therefore reveal aspects of well-being that conventional approaches overlook. Comparing this approach with traditional methodologies may yield insights into societal well-being and provide foundational data for policymaking and evaluations aimed at improving well-being. If this approach is effective, it could extend the reach of well-being assessments to additional countries and regions, thereby accelerating the adoption of policies designed to improve societal well-being.
Conclusion
This study predicted RWIs for Japanese prefectures with high accuracy using RSVs from internet search engines and an Elastic Net machine learning method. This approach provides an immediate and cost-effective alternative to traditional survey methods for comprehensive well-being assessments. It enables ongoing and quick assessments and can serve as foundational data for EBPM focused on enhancing well-being. Adding the well-being indicators predicted by the proposed method to conventional policy indicators will enable the agile detection of changes in the population and provide basic data for discovering new health issues and formulating policy. Moreover, this methodology suggests a potential solution for assessing well-being in resource-limited nations and regions, where large-scale epidemiological surveys are impractical. However, this study reflects ecological trends rather than individual behaviors, and further research is warranted to identify causal relationships between individual search terms and well-being.
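As a rough illustration of the kind of model named above, the following Python sketch fits an Elastic Net regression by coordinate descent and reports R² on held-out regions. It is not the authors' implementation: the data are synthetic, the features are assumed to be standardized search-term scores, and the function names, the 5 hypothetical search terms, the 37/10 split, and the values of `alpha` and `l1_ratio` are all assumptions chosen for illustration.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 part of the penalty)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net(X, y, alpha=0.1, l1_ratio=0.5, n_iter=200):
    """Minimize (1/2n)*||y - Xw||^2 + alpha*l1_ratio*||w||_1
    + 0.5*alpha*(1 - l1_ratio)*||w||^2 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    # Center features and target so the intercept can be recovered afterwards.
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    cols = [[row[j] - x_means[j] for row in X] for j in range(p)]
    residual = [v - y_mean for v in y]  # equals y - Xw while w is all zeros
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            col = cols[j]
            z = sum(c * c for c in col) / n
            if z == 0.0:
                continue  # constant feature carries no information
            rho = sum(c * (r + w[j] * c) for c, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1 - l1_ratio))
            delta = new_w - w[j]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, col)]
                w[j] = new_w
    intercept = y_mean - sum(wj * m for wj, m in zip(w, x_means))
    return w, intercept


def predict(X, w, intercept):
    return [intercept + sum(wj * xj for wj, xj in zip(w, row)) for row in X]


def r_squared(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - q) ** 2 for t, q in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Synthetic data (assumption): 47 regions, 5 hypothetical search terms,
# of which only the first two actually drive the index.
random.seed(0)
X = [[random.gauss(0, 1) for _ in range(5)] for _ in range(47)]
y = [2.0 * row[0] - 1.5 * row[1] + random.gauss(0, 0.1) for row in X]

# Hold out the last 10 regions to estimate predictive accuracy.
X_train, y_train = X[:37], y[:37]
X_test, y_test = X[37:], y[37:]

weights, intercept = fit_elastic_net(X_train, y_train)
r2_holdout = r_squared(y_test, predict(X_test, weights, intercept))
print("weights:", [round(v, 3) for v in weights])
print("held-out R^2:", round(r2_holdout, 3))
```

The combined L1 and L2 penalty shrinks the weights of uninformative terms toward zero while keeping correlated predictors stable, which is one reason such a model may suit settings with many search terms and few regions. In practice, `alpha` and `l1_ratio` would typically be tuned by cross-validation.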
Acknowledgments
We thank Dr Takahiro Itaya of Kyoto University for the insightful comments and constructive feedback during the initial stages of this study. We also wish to express our profound thanks to Dr Shiomi Misa for her persistent guidance and advice, which were invaluable over the entire course of our research. During the preparation of this work, the authors used DeepL [] and ChatGPT (OpenAI) [] to improve the English language. After using these tools, the authors reviewed and edited the content as needed and take full responsibility for the content of the published article.
Conflicts of Interest
None declared.
Edited by A Mavragani; submitted 23.07.24; peer-reviewed by Y Matsuda, R Guo, I Boumahdi; comments to author 13.08.24; revised version received 31.08.24; accepted 08.10.24; published 11.11.24.
©Myung Si Yang, Kazuya Taira. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 11.11.2024.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.