Section B: Substate Region Estimation Methodology

This report includes substate region-level estimates of 21 binary (0, 1) substance use measures using combined data from the 2006, 2007, and 2008 National Surveys on Drug Use and Health (NSDUHs) for persons aged 12 or older. Binary measures correspond to questions where a "yes" or "no" response is provided (in this case, "no" = 0 and "yes" = 1). Additionally, this report presents estimates of two binary (0, 1) measures for underage persons (aged 12 to 20): past month alcohol use and past month binge alcohol use.

The survey-weighted hierarchical Bayes (SWHB) methodology used in the production of State estimates from the 1999-2008 surveys also was used in the production of the 2006-2008 substate estimates. The SWHB methodology is described by Folsom, Shah, and Vaish (1999). A brief discussion of the precision of the estimates and interpretation of the prediction intervals (PIs) is given in Section B.1. Section B.2 lists the 21 substance use measures for which substate-level small area estimates were produced. The list of predictors used in the 2006-2008 substate-level small area estimation (SAE) modeling is given in Section B.3. In the production of the 2006-2008 substate small area estimates, new population projections (obtained from Claritas) were used. Information on the new projections and how they were used to create SAE model predictors is given in Section B.4. The methodology used to select relevant predictors is described in Section B.5. The procedures used to adjust the NSDUH weights for the purpose of obtaining substate small area estimates are described briefly in Section B.6. The goals of the SAE modeling, the general model description, and the implementation of SAE modeling remain the same and are described in Appendix E of the 2001 State report (Wright, 2003). A general model description is given in Section B.7. A short description of the calculation of the rate of first use of marijuana, major depressive episode (MDE), and underage drinking is included in Section B.8. Small area estimates specific to four age groups (12 to 17, 18 to 25, 26 or older, and 18 or older) for substate areas are produced separately from this report. Estimates of MDE will be included in those age group tables.

Small area estimates obtained using the SWHB methodology are design consistent (i.e., for States or substate areas with large sample sizes, the small area estimates are close to the corresponding robust design-based estimates). The substate small area estimates, when aggregated using the appropriate population totals, result in national small area estimates that are very close to the national design-based estimates. However, for many reasons, including internal consistency, it is desirable to have national small area estimates exactly match the national design-based estimates. Beginning in 2002, exact benchmarking was introduced (see Appendix A, Section A.4, in Wright & Sathe, 2005). The small area estimates presented here have been benchmarked to the national design-based estimates.

B.1. Precision and Validation of the Estimates

The primary purpose of this report is to give policy officials a better perspective on the range of prevalence estimates within and across States. Because the data were collected in a consistent manner by field interviewers who adhered to the same procedures and administered the same questions across all States and substate regions, the results are comparable across the 50 States and the District of Columbia.

The 95 percent PI associated with each estimate provides a measure of the accuracy of the estimate. It defines the range within which the true value can be expected to fall 95 percent of the time. For example, the estimated prevalence of past month use of marijuana in Region 1 in Alabama is 3.8 percent, and the 95 percent PI ranges from 2.9 to 5.2 percent. Therefore, the probability is 0.95 that the true value is within that range. The PI indicates the uncertainty due to both sampling variability and model bias and is also referred to as a "credible interval." A credible interval contains a given percentage of the posterior distribution of the parameter (or measure) of interest. Note the term "PI" may have been used in other applications to estimate future values of a parameter of interest; however, that interpretation does not apply to this report. The key assumption underlying the validity of the PIs is that the State- and substate-level error (or bias correction) terms in the models behave like random effects with zero means and common variance components.
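
The reading of a PI as a credible interval can be illustrated with a short sketch. It assumes that posterior draws of a prevalence are already available and takes an equal-tailed interval from their percentiles. The helper names (`quantile`, `credible_interval`) and the simulated draws are hypothetical and are not taken from the NSDUH software.

```python
import random


def quantile(sorted_values, q):
    """Linearly interpolated quantile of an already sorted list (0 <= q <= 1)."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def credible_interval(draws, level=0.95):
    """Equal-tailed interval containing `level` of the posterior draws."""
    ordered = sorted(draws)
    tail = (1.0 - level) / 2.0
    return quantile(ordered, tail), quantile(ordered, 1.0 - tail)


if __name__ == "__main__":
    # Hypothetical posterior draws for a prevalence near 4 percent.
    rng = random.Random(2008)
    draws = [rng.betavariate(40, 960) for _ in range(5000)]
    low, high = credible_interval(draws)
    print(f"95 percent interval: {100 * low:.1f} to {100 * high:.1f} percent")
```

About 95 percent of the draws fall between the two bounds, which is the sense in which the interval "contains a given percentage of the posterior distribution."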

A comparison of the standard errors (SEs) among substate regions with small (n ≤ 500), medium (500 < n ≤ 1,000), and large (n > 1,000) sample sizes for certain measures in this report shows that the small area estimates behave in predictable ways. Regardless of whether the substate region is from 1 of the 8 States with a large annual sample size (3,000 to 4,000) or 1 of the other States (n = 900 annually), the sizes of the PIs are very similar and are primarily a function of the sample size of the substate region and the prevalence estimate of the measure. Substate regions with large sample sizes had the smallest SEs.

For past month use of alcohol, where the national prevalence for all persons aged 12 or older was 51.2 percent (for 2006-2008), the average relative standard error (RSE; see end note 3) was about 5.2 percent, and the RSE for substate regions with a large sample size was about 3.3 percent. For substate regions with a medium sample size, the average RSE was 4.5 percent; for small sample sizes, the average RSE was 5.9 percent.

For past month use of marijuana (with a national prevalence of 6.0 percent), the overall national average RSE was 14.6 percent; by sample size, the average RSE was 10.2 percent for substate regions with large samples, 13.2 percent for medium samples, and 16.1 percent for small samples. Substance use measures with lower prevalences, such as past year use of cocaine (2.3 percent nationally), displayed larger average RSEs: 15.1 percent for substate regions with large samples, 18.2 percent for medium samples, and 20.1 percent for small samples.
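
As a small illustration of the quantities compared above, the sketch below computes an RSE (the posterior SE divided by the estimate, expressed as a percentage) and assigns a substate region to the small, medium, or large sample size class. The function names and the example values in the comments are hypothetical.

```python
def relative_standard_error(estimate, posterior_se):
    """RSE in percent: posterior SE divided by the estimate itself."""
    if estimate <= 0:
        raise ValueError("estimate must be positive")
    return 100.0 * posterior_se / estimate


def sample_size_class(n):
    """Classify a substate region by its combined sample size n."""
    if n <= 500:
        return "small"
    if n <= 1000:
        return "medium"
    return "large"


# Hypothetical region: prevalence of 6.0 percent with a posterior SE of
# 0.6 percentage points, based on 750 respondents.
rse = relative_standard_error(6.0, 0.6)
size = sample_size_class(750)
```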

The SAE methods used for substate regions in this report were previously validated for the NSDUH State-by-age group small area estimates (Wright, 2002). This validation exercise used direct estimates from pairs of large sample States (n = 7,200) as internal benchmarks. These internal benchmarks were compared with small area estimates based on random subsamples (n = 900) that mimicked a single year small State sample. The associated age group-specific small area estimates were based on sample sizes targeted at n = 300. Therefore, validation of the State-by-age group small area estimates should lend some validity to the small sample size substate small area estimates reported here.

B.2. Variables Modeled

Substate-level small area estimates were produced for the following set of 21 binary (0, 1) substance use measures, using combined data from the 2006-2008 NSDUHs for persons aged 12 or older:

  1. past month use of illicit drugs,
  2. past month use of illicit drugs other than marijuana,
  3. past month use of marijuana,
  4. average annual rate of first use of marijuana,
  5. perceptions of great risk of smoking marijuana once a month,
  6. past year use of marijuana,
  7. past year use of cocaine,
  8. past year nonmedical use of pain relievers,
  9. past month use of alcohol,
  10. past month binge alcohol use,
  11. perceptions of great risk of having five or more drinks of an alcoholic beverage once or twice a week,
  12. past month use of cigarettes,
  13. past month use of tobacco products,
  14. perceptions of great risk of smoking one or more packs of cigarettes per day,
  15. past year alcohol dependence,
  16. past year illicit drug dependence,
  17. past year alcohol dependence or abuse,
  18. past year illicit drug dependence or abuse,
  19. past year dependence on or abuse of illicit drugs or alcohol,
  20. needing but not receiving treatment for alcohol use in the past year, and
  21. needing but not receiving treatment for illicit drug use in the past year.

In addition to the 21 measures listed above, estimates also have been produced for underage (aged 12 to 20) past month use of alcohol and underage past month binge alcohol use. Table B1 at the end of this section lists all outcomes and the years (2002-2004, 2004-2006, and 2006-2008) for which substate-level small area estimates were produced going back to the 2002 NSDUH.

B.3. Predictors Used in Logistic Regression Models

Local area data used as potential predictor variables in the mixed logistic regression models were obtained from several sources, including Claritas, the U.S. Census Bureau, the Federal Bureau of Investigation (Uniform Crime Reports), Health Resources and Services Administration (Area Resource File), the Bureau of Labor Statistics, the Bureau of Economic Analysis, the Substance Abuse and Mental Health Services Administration (SAMHSA) (National Survey of Substance Abuse Treatment Services [N-SSATS]), and the National Center for Health Statistics (mortality data). The sources of data used in the modeling are provided in the following list.

For more information about the predictors defined from the above sources, see Appendix A, Section A.2, of the 2007-2008 State estimates report (Hughes et al., 2010).

B.4. Updated Claritas Data

Claritas data are used for the following in the NSDUH SAE process:

Up until the 2006-2007 State and 2004-2006 substate reports, 2002 Claritas data were used. The 2002 Claritas data had 2000 and 2002 population estimates, as well as 2007 population projections. For the 2007-2008 State estimates, new Claritas data with 2008 population estimates and 2012 population projections were used. The new Claritas data will be henceforth referred to as the 2008-2012 Claritas data, and the 2002 Claritas data will be referred to as the 2002-2007 Claritas data. The following main differences were observed between the two Claritas datasets:

  1. The format of the race/ethnicity data was different for the two sets of Claritas data. The age group by race by Hispanicity by gender population estimates at the block group level were not available in the 2002-2007 Claritas data. These population estimates were generated using the block group level (age by gender by race) and (race by Hispanicity) population distributions. It was assumed that in the 2002-2007 Claritas data each of the age by gender cells within a race group had the same Hispanicity distribution. Hence, the 2002-2007 Claritas data were manipulated to get the desired four-way cross of demographic domains. The 2008-2012 Claritas data have age group by race by Hispanicity by gender population distributions, so no assumptions or manipulations to the data had to be made.

  2. The 2007 and 2008 distributions of the population aged 20 to 24 in block groups were very different for the two datasets. In addition, more block groups had a population estimate of 0 for some of the 32 demographic cells (see end note 4) in the 2008 data than in the 2007 data.

  3. In prior State and substate reports, when the 32 cells were created using the 2002-2007 Claritas data, the population from the "two or more races" category was distributed among the black, white, and other race categories. Starting in 2008, a decision was made to merge the "two or more races" category with the "other" race category. This was based on a decision to discontinue creating a sample variable that split the "two or more races" respondents into black, white, or other. Because the "two or more races" respondents in the NSDUH sample were now all being grouped into the "other" category, the same technique was used to produce the 32 cell population estimates.

Some of the differences in the 2007 and 2008 population estimates can be attributed to the process used in creating the 32 cell population estimates and some to the data trends shifting over time (i.e., the 2008-2012 Claritas projections are based on updated population information). The production of 2007-2008 State and 2006-2008 substate small area estimates required the 2006, 2007 and 2008 population estimates at the 32 demographic cells. Because the 2008-2012 Claritas data are based on more recent intercensus population projections than the 2002-2007 Claritas data, which were obtained in 2002, it was decided that "new" 2006 and 2007 population projections would be obtained by "projecting back" the 2008-2012 Claritas data.
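
"Projecting back" can be pictured as fitting a straight line through the 2008 estimate and the 2012 projection for a cell and reading off earlier years, which matches the linear extrapolation mentioned in the summary steps of this section. The sketch below is only an illustration: the function name is hypothetical, and the floor at zero is an added assumption rather than a documented rule.

```python
def project_back(pop_2008, pop_2012, target_year):
    """Linearly extrapolate a cell population to an earlier (or later) year."""
    slope = (pop_2012 - pop_2008) / (2012 - 2008)  # persons per year
    value = pop_2008 + slope * (target_year - 2008)
    # Assumption: a negative extrapolated population is truncated at zero.
    return max(value, 0.0)


# Hypothetical block group cell: 1,000 persons in 2008 and 1,200 in 2012.
pop_2007 = project_back(1000, 1200, 2007)
pop_2006 = project_back(1000, 1200, 2006)
```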

In summary, based on the information above, the following steps were taken for the current 2006-2008 substate SAE analysis:

  1. For the predictors created using the Claritas data, linear extrapolations of the 2008-2012 Claritas data were done to get the 2006 and 2007 population projections. Each of the 13 block group, tract, and county-level predictors was re-created and merged onto the 2006 and 2007 sample and universe files (the universe file is a block group-level file containing predictor variables that are defined for the entire Nation).

  2. The 2007 and 2008 sample files (with the updated Claritas predictors) were pooled and used to create new decile cutoffs for all continuous predictor variables. These cutoffs were used to create categorical analogs (10-category predictor variable) of the corresponding continuous predictor variable. For each of the 10-category predictor variables, corresponding linear, quadratic, and cubic orthogonal polynomials were formed. These new cutoffs also were used to create an updated 2006 sample predictor file. Categorical predictors based on deciles are used instead of the continuous versions to protect against spurious predictions at the extremes of the covariate ranges where the data are sparse. Using continuous predictors to fit the model (possibly including squared and cubed terms) can cause serious bias in the estimation. The decile-based categorical predictors are transformed to orthogonal polynomials to minimize multicollinearity problems.

  3. The updated population estimates for the 32 cells (age group by race/ethnicity by gender population estimates) and the new deciles were used to create the updated universe files for all 3 years (2006, 2007, and 2008).

  4. For any analysis using 2002 through 2005 NSDUH data, the old projections based on the 2002-2007 Claritas data will be used (i.e., for calculating the correlations and p values for detecting change between the 2004-2006 and 2006-2008 substate estimates, the 2002-2007 Claritas projections will be used for the 2004 and 2005 data). The 2008-2012 Claritas projections, on the other hand, will be used on the 2006-2008 sample and universe files. Despite not knowing the exact impact of using new population estimates to calculate the correlation between 2004-2006 and 2006-2008 substate estimates, it is reasonable to expect that this should have minimal impact on the variances of change estimates (and consequently the p values). Note that these substate change estimates are not part of this report, but rather will be posted as tables on the OAS Web site when they become available.
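
The decile-based recoding in step 2 can be sketched as follows. The code derives nine decile cutoffs from pooled values, assigns each value to one of 10 categories, and builds linear, quadratic, and cubic orthogonal polynomial scores for those categories. It assumes equally spaced, unweighted category scores orthogonalized by Gram-Schmidt; the report does not specify the exact construction, and all function names are hypothetical.

```python
import bisect
import math


def decile_cutoffs(values):
    """Nine interpolated cutoffs splitting the pooled values into deciles."""
    ordered = sorted(values)
    n = len(ordered)
    cutoffs = []
    for d in range(1, 10):
        position = d / 10 * (n - 1)
        lower = int(position)
        upper = min(lower + 1, n - 1)
        fraction = position - lower
        cutoffs.append(ordered[lower] * (1 - fraction) + ordered[upper] * fraction)
    return cutoffs


def decile_category(value, cutoffs):
    """Category 0..9: the number of cutoffs strictly below the value."""
    return bisect.bisect_left(cutoffs, value)


def orthogonal_polynomials(levels=10, degree=3):
    """Orthonormal polynomial scores (linear, quadratic, ...) over the levels."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    basis = [[1.0] * levels]  # constant term, used only for orthogonalization
    for power in range(1, degree + 1):
        vec = [float(x ** power) for x in range(levels)]
        for b in basis:
            coef = dot(vec, b) / dot(b, b)
            vec = [v - coef * bi for v, bi in zip(vec, b)]
        norm = math.sqrt(dot(vec, vec))
        basis.append([v / norm for v in vec])
    return basis[1:]


# Hypothetical continuous predictor recoded for one record.
cutoffs = decile_cutoffs(range(1, 101))
category = decile_category(37, cutoffs)
linear, quadratic, cubic = orthogonal_polynomials()
scores = (linear[category], quadratic[category], cubic[category])
```

Because the three score vectors are mutually orthogonal across the 10 categories, the polynomial terms are far less correlated with one another than raw first, second, and third powers would be.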

For the 2006-2008 substate small area estimates, the 2006-2008 population projections were obtained from the new 2008-2012 Claritas data. However, it was decided not to reproduce the 2004-2006 substate small area estimates using the updated projections from the 2008-2012 Claritas data to be consistent with the current practice of not updating previously published estimates.

B.5. Selection of Independent Variables for the Models

No new variable selection was done. The same fixed-effect predictors that were used in producing the previously published State and substate estimates were used to produce the 2006-2008 substate estimates.

B.6. Adjustment of Weights

The person-level NSDUH weights are poststratified (adjusted) to match census population estimates at the State level. Because the objective here was to produce small area estimates for substate regions, it was decided to ratio adjust the person-level sampling weights to population projections (available from Claritas as shown in Table E1 in Section E) at the substate by age group by gender level. The advantage to doing this ratio adjustment is to ensure that the adjusted sampling weights better reflect the demography of the substate regions. The downside to this adjustment is that the design-based estimates based on the unadjusted sampling weights may be slightly different (at the national level) from the design-based estimates obtained from the adjusted weights. However, because the aim was to be able to produce reliable substate region-level small area estimates, this ratio adjustment to the weights seemed more appropriate. Note that this ratio adjustment was done at the substate region (362 regions) by age group (12 to 17, 18 to 25, 26 to 34, and 35 or older) by gender (male and female) level collectively over 3 years (2006, 2007, and 2008) of data.
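
A minimal sketch of a ratio adjustment of this kind is shown below. Within each substate region by age group by gender cell, every weight is multiplied by the ratio of the cell's population projection to the cell's current weight total. The record layout, key names, and numbers are hypothetical.

```python
from collections import defaultdict


def ratio_adjust(records, population):
    """Scale weights so each cell's weight total equals its population total.

    records: list of dicts with keys "region", "age_group", "gender", "weight".
    population: dict keyed by (region, age_group, gender) with projected totals.
    """
    weight_totals = defaultdict(float)
    for rec in records:
        cell = (rec["region"], rec["age_group"], rec["gender"])
        weight_totals[cell] += rec["weight"]

    adjusted = []
    for rec in records:
        cell = (rec["region"], rec["age_group"], rec["gender"])
        factor = population[cell] / weight_totals[cell]
        adjusted.append({**rec, "weight": rec["weight"] * factor})
    return adjusted


# Hypothetical respondents pooled over the 3 survey years.
sample = [
    {"region": "R1", "age_group": "12-17", "gender": "F", "weight": 100.0},
    {"region": "R1", "age_group": "12-17", "gender": "F", "weight": 300.0},
    {"region": "R1", "age_group": "18-25", "gender": "M", "weight": 250.0},
]
targets = {("R1", "12-17", "F"): 800.0, ("R1", "18-25", "M"): 200.0}
adjusted_sample = ratio_adjust(sample, targets)
```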

B.7. General Model Description

The model described here is similar to the logistic mixed hierarchical Bayes (HB) model that was used to produce the 2004-2006 substate small area estimates (OAS, 2008). The following model was used:

$$\log\left[\frac{\pi_{aijk}}{1 - \pi_{aijk}}\right] = x_{aijk}^{\prime}\beta_a + \eta_{ai} + \nu_{aij},$$

where $\pi_{aijk}$ is the probability of engaging in the behavior of interest (e.g., using marijuana in the past month) for person $k$ belonging to age group $a$ in substate region $j$ of State $i$. Let $x_{aijk}$ denote a $p_a \times 1$ vector of auxiliary variables associated with age group $a$ (12 to 17, 18 to 25, 26 to 34, and 35 or older) and $\beta_a$ denote the associated vector of regression parameters. The age group-specific vectors of auxiliary variables are defined for every block group in the Nation and also include person-level demographic variables, such as race/ethnicity and gender. The vectors of random effects $\eta_i = (\eta_{1i}, \ldots, \eta_{Ai})^{\prime}$ and $\nu_{ij} = (\nu_{1ij}, \ldots, \nu_{Aij})^{\prime}$ are assumed to be mutually independent with $\eta_i \sim N_A(0, D_\eta)$ and $\nu_{ij} \sim N_A(0, D_\nu)$, where $A$ is the total number of individual age groups modeled (generally $A = 4$). For HB estimation purposes, an improper uniform prior distribution is assumed for $\beta_a$, and proper Wishart prior distributions are assumed for $D_\eta^{-1}$ and $D_\nu^{-1}$. The HB solution for $\pi_{aijk}$ involves a series of complex Markov Chain Monte Carlo (MCMC) steps to generate values of the desired fixed and random effects from the underlying joint distribution. The basic process is described in Folsom et al. (1999), Shah, Barnwell, Folsom, and Vaish (2000), and Wright (2003).
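
To make the link function concrete, the sketch below converts one set of fixed-effect coefficients and the State-level and substate-level random effects for an age group into a predicted probability by inverting the logit. It covers only this final step for a single hypothetical draw and is not the MCMC procedure itself; the argument names and values are illustrative assumptions.

```python
import math


def predicted_probability(x, beta, state_effect, substate_effect):
    """Inverse logit of the linear predictor x'beta + eta + nu."""
    linear = sum(xi * bi for xi, bi in zip(x, beta)) + state_effect + substate_effect
    # Numerically stable inverse logit for large positive or negative values.
    if linear >= 0:
        return 1.0 / (1.0 + math.exp(-linear))
    e = math.exp(linear)
    return e / (1.0 + e)


# Hypothetical draw: intercept plus one predictor score, with small
# State-level and substate-level random effects.
p = predicted_probability([1.0, 0.3], [-2.8, 0.5], 0.10, -0.05)
```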

Once the required number of MCMC samples for the parameters of interest are generated and tested for convergence properties (see Raftery & Lewis, 1992), the small area estimates for each age group by race/ethnicity by gender cell within a block group can be obtained. These block group-level small area estimates then can be aggregated using the appropriate population estimate projections to form substate- and State-level small area estimates for the desired age group(s). These small area estimates then are benchmarked to the national design-based estimates (see Hughes et al., 2010).
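
The aggregation step amounts to a population-weighted average of the cell-level estimates. The sketch below shows that average and then a simple ratio benchmark that rescales regional estimates so that their population-weighted total matches a national design-based value. The ratio benchmark is a simplified stand-in chosen for illustration and is not the exact benchmarking procedure used for the published estimates; all names and numbers are hypothetical.

```python
def aggregate(estimates, populations):
    """Population-weighted average of cell-level prevalence estimates."""
    total_population = sum(populations)
    return sum(p * n for p, n in zip(estimates, populations)) / total_population


def ratio_benchmark(region_estimates, region_populations, national_design_estimate):
    """Rescale regional estimates to reproduce a national design-based value.

    Assumption: a single common ratio factor is applied to every region.
    """
    regions = list(region_estimates)
    national_sae = aggregate(
        [region_estimates[r] for r in regions],
        [region_populations[r] for r in regions],
    )
    factor = national_design_estimate / national_sae
    return {r: region_estimates[r] * factor for r in regions}


# Hypothetical cells within one substate region.
region_rate = aggregate([0.10, 0.30], [100, 300])

# Hypothetical two-region "nation" benchmarked to a design-based 0.33.
benchmarked = ratio_benchmark({"A": 0.20, "B": 0.40}, {"A": 100, "B": 100}, 0.33)
```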

B.8. Calculation of Average Annual Rate (Incidence) of First Use of Marijuana, Major Depressive Episode, and Underage Drinking

Incidence rates typically are calculated as the number of new initiates of a substance during a period of time (such as in the past year) divided by an estimate of the number of person years of exposure (in thousands). The incidence definition used in this report employs a simpler form of the at-risk population based on the model-based methodology. This model-based average annual incidence rate for first use of marijuana is defined as follows:

Average annual rate = 100*{[X1 ÷ (0.5 * X1 + X2)] ÷ 2},

where X1 is the number of marijuana initiates in the past 24 months and X2 is the number of persons who never used marijuana. Both X1 and X2 are based on binary measures that correspond to questions with a "yes" or "no" response option. For details on calculating the average annual rate of first use of marijuana from the NSDUH data, see Appendix A, Section A.7, of the 2007-2008 State estimates report (Hughes et al., 2010).
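
A worked example of the formula follows, using hypothetical counts. The function name is illustrative; the arithmetic mirrors the expression above term by term.

```python
def average_annual_rate(x1, x2):
    """Average annual rate of first use, in percent.

    x1: number of initiates in the past 24 months.
    x2: number of persons who never used.
    """
    at_risk = 0.5 * x1 + x2           # simplified at-risk population
    two_year_rate = x1 / at_risk      # initiation over the 24-month window
    return 100.0 * (two_year_rate / 2.0)


# Hypothetical counts: 40 initiates in the past 24 months, 980 never users.
# At-risk population = 0.5 * 40 + 980 = 1,000; two-year rate = 0.04.
rate = average_annual_rate(40, 980)
```

With these counts, the two-year rate of 0.04 is halved to 0.02 and expressed as 2.0 percent per year.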

Beginning in 2004, a module was included in the NSDUH questionnaire that obtained data related to having a major depressive episode (MDE); the module was based on the criteria specified for major depression in the Diagnostic and Statistical Manual of Mental Disorders, 4th edition (DSM-IV) (American Psychiatric Association [APA], 1994). These questions permit estimates to be calculated for lifetime and past year prevalence of MDE, treatment for MDE, and role impairment resulting from MDE. For this report, estimates of having MDE in the past year were produced only for youths aged 12 to 17.

In 2004, a split-sample design was implemented in which adults aged 18 or older in half of the sample received the depression module while adult respondents in the other half did not. All youths aged 12 to 17 were administered the adolescent depression module that year. In 2005, 2006, 2007, and 2008, all adult and adolescent respondents were administered their respective depression modules. Separate modules were administered to adults aged 18 or older and youths aged 12 to 17. To make the modules developmentally appropriate for youths, there are minor wording differences in a few questions between the adult and youth modules. Since 2004, the NSDUH questions that determine MDE have remained unchanged. However, because of changes to other mental health items that precede the adult MDE questions (K6, suicide, and impairment) in the 2008 questionnaire, the reporting on MDE questions among adults appears to have been affected. Thus, MDE small area estimates for adults were not produced for the 2007-2008 State report (see Section A.10 in Appendix A of Hughes et al., 2010) because the 2008 adult MDE estimates are not comparable with those for 2007. No questionnaire changes were made in 2008 that affected MDE items for youths aged 12 to 17. Hence, only substate region MDE estimates for youths aged 12 to 17 are produced for 2006-2008 and will be included in a set of age group tables separate from this report.

To obtain small area estimates for persons aged 12 to 20 for past month alcohol use and binge alcohol use, a separate set of models was fit for these two outcomes for the 12 to 17 age group and the 18 to 20 age group (similar to what was done for producing substate estimates using the 2004-2006 NSDUH data). For details on underage drinking, see Section A.8, Appendix A, of the 2007-2008 State estimates report (Hughes et al., 2010).

Table B1. Outcomes, by Survey Year, for Which Substate Small Area Estimates Are Available
Measure | 2002-2004 | 2004-2006 | 2006-2008
Illicit Drug Use in Past Month | Yes | Yes | Yes
Marijuana Use in Past Year | Yes | Yes | Yes
Marijuana Use in Past Month | Yes | Yes | Yes
Perceptions of Great Risk of Smoking Marijuana Once a Month | Yes | Yes | Yes
First Use of Marijuana | Yes | Yes | Yes
Illicit Drug Use Other Than Marijuana in Past Month | Yes | Yes | Yes
Cocaine Use in Past Year | Yes | Yes | Yes
Nonmedical Use of Pain Relievers in Past Year | Yes | Yes | Yes
Alcohol Use in Past Month | Yes | Yes | Yes
Underage Past Month Use of Alcohol | Yes | Yes | Yes
Binge Alcohol Use in Past Month | Yes | Yes | Yes
Underage Past Month Binge Alcohol Use | Yes | Yes | Yes
Perceptions of Great Risk of Having Five or More Drinks of an Alcoholic Beverage Once or Twice a Week | Yes | Yes | Yes
Tobacco Product Use in Past Month | Yes | Yes | Yes
Cigarette Use in Past Month | Yes | Yes | Yes
Perceptions of Great Risk of Smoking One or More Packs of Cigarettes Per Day | Yes | Yes | Yes
Alcohol Dependence or Abuse in Past Year | Yes | Yes | Yes
Alcohol Dependence in Past Year | Yes | Yes | Yes
Illicit Drug Dependence or Abuse in Past Year | Yes | Yes | Yes
Illicit Drug Dependence in Past Year | Yes | Yes | Yes
Dependence on or Abuse of Illicit Drugs or Alcohol in Past Year | Yes | Yes | Yes
Needing But Not Receiving Treatment for Illicit Drug Use in Past Year | Yes | Yes | Yes
Needing But Not Receiving Treatment for Alcohol Use in Past Year | Yes | Yes | Yes
Serious Psychological Distress in Past Year [1] | Yes | Yes | No
Having at Least One Major Depressive Episode in Past Year [2] | No | Yes | No

Yes = available, No = not available.
[1] Because of questionnaire changes, estimates for serious psychological distress (SPD) in the years 2002-2004 are not comparable with the 2004-2006 SPD estimates. For more details, see Section B.7 of the report on Substate Estimates from the 2004-2006 National Surveys on Drug Use and Health (OAS, 2008). Estimates for SPD are not available in the 2006-2008 substate report; for more details, see Section A.1 of this report.
[2] Questions used to determine a major depressive episode (MDE) were added in 2004. Estimates for adults aged 18 or older are not available in the 2006-2008 substate report. However, MDE substate estimates for youths aged 12 to 17 will be produced for 2006-2008 and will be included in a set of age group tables separate from this report. For more details, see Sections A.1 and B.8 of this report.
Source: SAMHSA, Office of Applied Studies, National Survey on Drug Use and Health, 2002, 2003, 2004, 2005, 2006, 2007, and 2008.

End Notes

3 The RSE of an estimate is the posterior SE divided by the estimate itself. Note that the RSEs have been calculated based on the unbenchmarked small area estimates.

4 The four age groups are 12 to 17, 18 to 25, 26 to 34, and 35 or older; the four race/ethnicity groups are non-Hispanic white, non-Hispanic black, non-Hispanic other, and Hispanic; and the two genders are male and female.
