The data were extracted from the Household Income-Expenditure Survey (HIES) of Iran, conducted by the Statistical Center of Iran (SCI) in 2021. The survey was cross-sectional and covered a sample of 37,988 urban and rural Iranian households from 449 cities in 31 provinces. The SCI selected households by three-stage cluster sampling: first, census tracts were classified and selected; second, urban and rural blocks were selected; and third, sample households were identified. Data were collected through face-to-face interviews using a comprehensive, standardized questionnaire [30].
In line with our objectives, only households with at least one child under the age of 5 were included in this study. The sample consisted of 8,571 households from 425 cities across all provinces with complete information on the selected variables. The number of households per province ranged from 109 to 537. The HIES records the number of cigarettes consumed by the household per month; in the present study, the outcome variable is the number of cigarettes consumed by the household per day, obtained by converting the monthly figure to a daily figure and rounding. Two categories of variables were used to identify the factors influencing household cigarette consumption: head-of-household characteristics and household-level characteristics. The head-of-household characteristics were age, gender, education, marital status (married versus other: single/divorced/widowed), and occupation. The household-level socio-demographic variables were area of residence, family size, number of educated members, number of student members, number of working members, monthly health expenditure, house ownership status, house area, monthly income in four categories (less than 150 US$, low; 150–400 US$, low to middle; 400–600 US$, middle; more than 600 US$, high), and possession of a car/motorcycle, a bicycle, internet access, and a computer/tablet. All variables were selected based on a literature review and previous studies.
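The construction of the outcome variable can be illustrated with a short sketch. The 30-day month, the round-half-up rule, and the helper name `monthly_to_daily` are assumptions made here for illustration; the text only states that the monthly figure was converted to a daily figure and rounded.

```python
def monthly_to_daily(monthly_cigarettes, days_per_month=30):
    """Convert a monthly household cigarette count to a rounded daily count.

    days_per_month=30 is an assumption; the survey may use another convention.
    """
    if monthly_cigarettes < 0:
        raise ValueError("count must be non-negative")
    # int(x + 0.5) rounds halves up and avoids Python's round-half-to-even
    return int(monthly_cigarettes / days_per_month + 0.5)


# Hypothetical monthly counts for four households (not survey data)
monthly = [0, 10, 300, 620]
daily = [monthly_to_daily(m) for m in monthly]
```

Under these assumptions, a household that reports fewer than 15 cigarettes per month is recorded with a daily count of zero, which is one way zeros can accumulate in the outcome.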
Data analysis
Three count models from the family of generalized linear models are used in this study: the Poisson model, the negative binomial model, and the generalized Poisson model.
Poisson distribution (P)
The best-known count distribution is the Poisson distribution, given by:
$$f\left(y;\mu \right)=\frac{e^{-\mu }\mu ^{y}}{y!} ; y=0,1,2,\dots$$
(1)
where \(\mu>0\) is a real positive number representing the mean and variance of the distribution [31].
Negative Binomial distribution (NB)
The NB model is obtained by adding another source of variability to the P model, namely a dispersion parameter. The added parameter allows the variance to exceed the mean, so the NB distribution can accommodate overdispersion [32]. The probability distribution function (p.d.f) is as follows:
$$f\left(y;r,p\right)=\binom{y+r-1}{y}p^{r}\left(1-p\right)^{y} ; y=0, 1, 2,\dots , r>0 , 0< p <1$$
(2)
where \(r\) is the dispersion (overdispersion) parameter [33]. The mean and the variance of the distribution are \(\frac{r(1-p)}{p}\) and \(\frac{r(1-p)}{p^{2}}\), respectively.
Generalized Poisson distribution (GP)
The p.d.f of the GP distribution is as follows:
$$f\left(y;\alpha,\lambda\right)=\left\{\begin{array}{ll}\frac{1}{y!}\lambda\left(\lambda+\alpha y\right)^{y-1}e^{-\left(\lambda+\alpha y\right)} & ;\;y=0,1,2,\dots\\ 0 & ;\;for\;\;y>k\;\;if\;\;\alpha<0\end{array}\right.$$
(3)
where \(\lambda>0\), \(\max\left(-1,-\frac{\lambda}{k}\right)<\alpha < 1\), and \(k\ge 4\). The mean and variance of the distribution are \(\frac{\lambda}{1-\alpha}\) and \(\frac{\lambda}{(1-\alpha)^{3}}\), respectively. If \(\alpha =0\), this distribution reduces to the Poisson distribution [34].
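The three mass functions and their moments can be checked numerically. The sketch below is illustrative only: it assumes the NB mass function \(\binom{y+r-1}{y}p^{r}(1-p)^{y}\) and the GP mass function \(\frac{1}{y!}\lambda(\lambda+\alpha y)^{y-1}e^{-(\lambda+\alpha y)}\), and the function names and parameter values are hypothetical.

```python
import math


def poisson_pmf(y, mu):
    # exp(-mu) * mu^y / y!, computed on the log scale for stability
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))


def nb_pmf(y, r, p):
    # C(y+r-1, y) * p^r * (1-p)^y; log-gamma lets r be non-integer
    log_coef = math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
    return math.exp(log_coef + r * math.log(p) + y * math.log(1 - p))


def gp_pmf(y, alpha, lam):
    # lam * (lam + alpha*y)^(y-1) * exp(-(lam + alpha*y)) / y!
    base = lam + alpha * y
    if base <= 0:  # outside the support, which is truncated when alpha < 0
        return 0.0
    log_p = math.log(lam) + (y - 1) * math.log(base) - base - math.lgamma(y + 1)
    return math.exp(log_p)


def mean_var(pmf, n_terms=400):
    # Approximate mean and variance by summing the first n_terms support points
    probs = [pmf(y) for y in range(n_terms)]
    mean = sum(y * q for y, q in enumerate(probs))
    var = sum((y - mean) ** 2 * q for y, q in enumerate(probs))
    return mean, var


# Hypothetical parameter values chosen only for the check
nb_mean, nb_var = mean_var(lambda y: nb_pmf(y, r=2.5, p=0.4))
gp_mean, gp_var = mean_var(lambda y: gp_pmf(y, alpha=0.3, lam=2.0))
```

With these values the numerical moments should agree with \(r(1-p)/p\), \(r(1-p)/p^{2}\), \(\lambda/(1-\alpha)\), and \(\lambda/(1-\alpha)^{3}\), and `gp_pmf` with `alpha=0` should coincide with `poisson_pmf`.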
In regression analysis, the relationship between a dependent variable and one or more explanatory variables (\(x_{1},\dots ,x_{p}\)) is expressed through an equation for the mean of the outcome. A common approach models the logarithm of the mean (\(\mu\)) of the outcome distribution as a linear function of the explanatory variables [35]:
$$\log\left(\mu \right)=\alpha +\beta_{1}x_{1}+\cdots +\beta_{p}x_{p}$$
where \(\alpha\) is the intercept and \(\beta_{1},\dots ,\beta_{p}\) are the regression coefficients.
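A minimal sketch of how the log link is read: the mean is the exponential of the linear predictor, so raising one explanatory variable by one unit multiplies the expected count by the exponential of its coefficient. The function name and all coefficient values below are hypothetical.

```python
import math


def expected_count(intercept, coefs, x):
    """Mean of the count outcome under a log link: exp(intercept + sum(b * x))."""
    eta = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return math.exp(eta)


# Hypothetical intercept and coefficients for two explanatory variables
intercept = 0.2
coefs = [0.5, -0.3]

mu_base = expected_count(intercept, coefs, [1.0, 2.0])
mu_plus_one = expected_count(intercept, coefs, [2.0, 2.0])  # first variable + 1

# The ratio of the two means depends only on the first coefficient
ratio = mu_plus_one / mu_base
```

Here `ratio` equals `math.exp(0.5)` whatever the values of the other variables, which is why exponentiated coefficients of a count model are reported as ratios.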
Zero-inflated model
Zero-inflated count models are useful for datasets that contain more zeros than standard count distributions can describe. They are mixture models that combine a binary model (logit, probit, etc.) for the excess zeros with a count model (Poisson, Negative Binomial, etc.).
The structure of a zero-inflated distribution is as follows:
$$P\left(Y=y;p,\Theta \right)=\left\{\begin{array}{ll}p+\left(1-p\right)f\left(0\mid\Theta \right) & ;\;y=0\\ \left(1-p\right)f\left(y\mid\Theta \right) & ;\;y=1, 2, \dots\end{array}\right.$$
(4)
where \(f(y\mid\Theta )\) is the p.d.f of a count distribution with parameters \(\Theta\) and \(p\) is the probability of an excess zero [36, 37]. Coefficients are interpreted with two indices, one for each part of the model: the odds ratio (OR) for the zero part and the risk ratio (RR) for the count part.
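As an illustration of the mixture structure, the sketch below builds a zero-inflated Poisson mass function in which the excess-zero probability is added only at \(y=0\). The Poisson count part, the function names, and the parameter values are assumptions made for the example.

```python
import math


def poisson_pmf(y, mu):
    # exp(-mu) * mu^y / y!, computed on the log scale
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))


def zip_pmf(y, p, mu):
    """Zero-inflated Poisson: excess-zero probability p, Poisson mean mu."""
    count_part = (1 - p) * poisson_pmf(y, mu)
    # The excess-zero mass p contributes only at y = 0
    return p + count_part if y == 0 else count_part


# Hypothetical values: 60% excess zeros, count-part mean of 4
p_zero, mu = 0.6, 4.0
total = sum(zip_pmf(y, p_zero, mu) for y in range(100))
mean = sum(y * zip_pmf(y, p_zero, mu) for y in range(100))
```

The probabilities sum to one, the probability of a zero is `p_zero + (1 - p_zero) * exp(-mu)`, and the overall mean is `(1 - p_zero) * mu`, lower than the mean of the count part alone.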
Multilevel zero-inflated count regression model
To reflect the hierarchical structure of the data, we consider a multilevel zero-inflated count distribution for the response variable (\(Y_{ijk}\)), where \(i=1,\dots ,31\); \(j=1,\dots ,n_{i}\); \(k=1,\dots ,n_{ij}\). Because the data were collected in different cities and provinces, three-level (TL) count models were used for data analysis. The TL regression model is as follows:
$$\left\{\begin{array}{l}\log\left[\frac{p_{ijk}}{1-p_{ijk}}\right]=\eta_{ijk}=z_{ijk}^{T}\alpha +\delta_{i}+\gamma_{ij}\\ \log\left[\mu_{ijk}\right]=\xi_{ijk}=x_{ijk}^{T}\beta +\tau_{i}+v_{ij}\end{array}\right.$$
(5)
where \(z_{ijk}^{T}\) and \(x_{ijk}^{T}\) are the covariate vectors in the zero and count parts of the model, respectively. The index k refers to the household (the first level) in the jth city (the second level) of the ith province (the third level). \(\delta_{i}\) and \(\tau_{i}\), in the zero and count parts of the model respectively, are the random effects of the province, while \(\gamma_{ij}\) and \(v_{ij}\) are the random effects of the city. \(\delta\), \(\gamma\), \(\tau\), and \(v\) are assumed to be independent and normally distributed with mean zero and variances \(\sigma_{\delta}^{2}\), \(\sigma_{\gamma}^{2}\), \(\sigma_{\tau}^{2}\), and \(\sigma_{v}^{2}\), respectively [32].
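The three-level structure can be made concrete by simulating from a simplified version of the model. This sketch is not the fitted model: it assumes an intercept-only specification, a Poisson count part, province effects \(\delta_{i}\) and \(\tau_{i}\) and city effects \(\gamma_{ij}\) and \(v_{ij}\) in the zero and count parts, and hypothetical values for every parameter and sample size.

```python
import math
import random


def rpois(mu, rng):
    # Knuth's multiplication method; adequate for the small means used here
    limit = math.exp(-mu)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate(n_prov=5, n_city=4, n_hh=50, a0=0.5, b0=1.0,
             sd_delta=0.5, sd_gamma=0.3, sd_tau=0.4, sd_v=0.2, seed=1):
    """Simulate household counts nested in cities nested in provinces."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_prov):
        delta_i = rng.gauss(0, sd_delta)  # province effect, zero part
        tau_i = rng.gauss(0, sd_tau)      # province effect, count part
        for j in range(n_city):
            gamma_ij = rng.gauss(0, sd_gamma)  # city effect, zero part
            v_ij = rng.gauss(0, sd_v)          # city effect, count part
            # Inverse logit gives the excess-zero probability
            p = 1 / (1 + math.exp(-(a0 + delta_i + gamma_ij)))
            # Inverse log link gives the mean of the count part
            mu = math.exp(b0 + tau_i + v_ij)
            for k in range(n_hh):
                y = 0 if rng.random() < p else rpois(mu, rng)
                rows.append((i, j, k, y))
    return rows


rows = simulate()
zero_share = sum(1 for r in rows if r[3] == 0) / len(rows)
```

Households in the same city share \(\gamma_{ij}\) and \(v_{ij}\), and all cities in a province share \(\delta_{i}\) and \(\tau_{i}\), so both the zero probability and the mean count are correlated within clusters.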
Model fitting and selection
The average number of cigarettes consumed per day in the household is modeled as a function of gender, age, and the other explanatory variables using P, NB, GP, zero-inflated Poisson (ZIP), zero-inflated Negative Binomial (ZINB), and zero-inflated Generalized Poisson (ZIGP) regression models. The same explanatory variables are included in both the zero and count components of the zero-inflated models, and the same variables are used in all models so that the fitted models can be compared. Several fit indices are used to evaluate the models and determine the best fit: the log-likelihood, AIC, BIC, and MSE [29, 38]. AIC and BIC balance fit against parsimony, and lower values indicate a better model. MSE is the mean squared difference between the predicted and observed values; lower values indicate more accurate predictions. Data were analyzed using the glmmTMB package in R, version 4.3.2.
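The fit indices follow directly from the maximized log-likelihood and the predictions. The sketch below uses the standard definitions of AIC, BIC, and MSE; the log-likelihoods and parameter counts are hypothetical placeholders, not results from this study, and only the sample size of 8,571 is taken from the text.

```python
import math


def aic(loglik, n_params):
    # Akaike information criterion: 2k - 2 log L
    return 2 * n_params - 2 * loglik


def bic(loglik, n_params, n_obs):
    # Bayesian information criterion: k log(n) - 2 log L
    return n_params * math.log(n_obs) - 2 * loglik


def mse(observed, predicted):
    # Mean squared difference between observed and predicted values
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


# Hypothetical (log-likelihood, number of parameters) per model
candidates = {"P": (-1500.0, 5), "NB": (-1200.0, 6), "ZINB": (-1150.0, 11)}
n_obs = 8571

table = {name: (aic(ll, k), bic(ll, k, n_obs)) for name, (ll, k) in candidates.items()}
best_by_aic = min(table, key=lambda name: table[name][0])
best_by_bic = min(table, key=lambda name: table[name][1])
```

BIC penalizes each extra parameter by `log(n_obs)` rather than 2, so with a sample of this size it favors smaller models more strongly than AIC, and the two criteria need not select the same model.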