National Survey on Drug Use and Health Generalized Correlations of Small Area
State Estimates between Nonoverlapping Time Periods:
Documentation for CSV and Excel Files

 


Description of the CSV File Type

Files with a comma-separated values (*.csv) extension are plain text. They store characters in a flat, nonproprietary format and can be opened by most computer programs. Each *.csv file contains a set of tabular data, with each record delimited by a line break and each field within a record delimited by a comma. A field that contains commas as part of its content is additionally enclosed in quote mark characters, one before and one after the field's contents. When a quote mark character is part of a field's content, it is written as two consecutive quote mark characters.
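The quoting rules described above can be checked with a short Python sketch that uses the standard library csv module. The sample text is hypothetical and is not taken from the NSDUH files; it only shows a field that contains both a comma and quote mark characters.

```python
import csv
import io

# Hypothetical CSV text (not from the NSDUH files). The second field of the
# data record contains a comma, so it is enclosed in quote marks, and the
# quote marks inside the field are doubled.
raw = 'State,Note\r\nAlabama,"Aged 18 to 25, ""young adults"""\r\n'

# Reading: the csv module removes the enclosing quotes and un-doubles the rest.
rows = list(csv.reader(io.StringIO(raw, newline="")))

# Writing: the default writer applies the same rules in reverse.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)
round_trip = buffer.getvalue()

print(rows[1][1])
```

Reading and then rewriting the text reproduces the original characters, which is a quick way to confirm that a program follows the same quoting convention.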

Computers with Microsoft Excel installed open *.csv files in Excel by default, with the fields automatically arranged appropriately in columns. Other database programs also open *.csv files with the fields appropriately arranged.

This zip archive holds 26 CSV files (i.e., "NSDUHsaeGenCorrTab#-2015.csv"), reflecting the 26 "Generalized Correlation Table #" tabs in the Excel file. Each CSV file contains the table title, table notes, column headings, and data.

Description of Generalized Correlations Used to Compare State Small Area Estimates between Two Nonoverlapping Time Periods

Starting with the comparison of 2002-2003 and 2003-2004 state estimates from the National Survey on Drug Use and Health (NSDUH), tests of significance of the difference in point estimates containing an overlapping year have been produced annually. In addition to these overlapping year comparisons, some nonoverlapping state comparisons with respect to the baseline period 2002-2003 (e.g., 2002-2003 vs. 2007-2008, 2002-2003 vs. 2008-2009, and beyond) have also been produced and are available for downloading from the Substance Abuse and Mental Health Services Administration (SAMHSA) at https://www.samhsa.gov/data/ (see End Note 1). However, users of NSDUH estimates based on small area estimation (SAE) might be interested in conducting tests of significance that have not been published for other nonoverlapping time periods, such as 2006-2007 versus 2009-2010. To produce the test statistic needed to determine whether a difference is statistically significant (e.g., the p value), three inputs are needed: the estimates, the Bayesian confidence interval (CI) for each estimate, and the correlation between the two estimates. The estimates and CIs are available at https://www.samhsa.gov/data/; however, the correlations were not available prior to the release of the 2014-2015 state estimates. These correlations, represented by generalized correlations, should be used along with the published small area estimates and Bayesian CIs to compare state prevalence rates between two nonoverlapping time periods. The methodology for conducting such comparisons is illustrated by an example given later in this document.

The correlation in state estimates over time periods results from simultaneously modeling the data associated with the time periods of interest and/or the commonality of the data between the two time periods (see End Note 2). The correlation due to this simultaneous modeling results mostly from the random effects for the population subgroups (age group by time period) being correlated over areas. For this simultaneous modeling, four age groups (12 to 17, 18 to 25, 26 to 34, and 35 or older), or three age groups (18 to 25, 26 to 34, and 35 or older) for the mental health outcomes, by two nonoverlapping time periods (i.e., eight or six subpopulation-specific models) were simultaneously fitted, each with its own set of fixed and random effects. In this case, the general covariance matrices for the state and within-state random effects were 8 × 8 or 6 × 6 matrices corresponding to the eight-element or six-element vectors of random effects. This correlation indicates that the area-level random contributions to the intercepts for the population subgroup-specific models can still be correlated for nonoverlapping years because the random intercept adjustments have similar up and down patterns over areas for the two nonoverlapping time periods. Having a fixed common set of predictors across time in the SAE models might contribute to this correlation; however, no commonality of the fixed-effect predictors is required for these population subgroup-specific intercept adjustments to be correlated across areas for nonoverlapping years.

The correlation in state estimates across overlapping time periods is a result of simultaneously modeling the data associated with the time periods of interest and the commonality of data associated with the middle year (e.g., in the 2006-2007 vs. 2007-2008 state change estimates, the data for 2007 are common to both sets of estimates). Conversely, the correlation in state estimates across nonoverlapping time periods results solely from simultaneously modeling the data associated with the time periods of interest. The overlapping year correlations tend to be larger than the nonoverlapping year correlations because of the commonality of the data associated with the middle year. The variance of the difference between state estimates depends on the underlying correlation between the state estimates. If the state estimates are assumed to be uncorrelated, or the correlation between them is assumed to be smaller than the actual correlation, then the variance of the difference is typically overstated and the difference is more likely to be declared nonsignificant. To obtain reasonable estimates of this difference over nonoverlapping time periods, it is desirable to include appropriate correlations in the estimation methodology, which would require simultaneous modeling of the data associated with the time periods of interest. Because of budget and time constraints, however, it is not practical to simultaneously model the data corresponding to all possible combinations of nonoverlapping time periods in advance. As a proxy, because nonoverlapping year correlations are expected to lie between the "long-term" change correlations (i.e., correlations between the baseline period of 2002-2003 and a time period several years beyond) and the overlapping year correlations, a conservative estimate of a nonoverlapping time period correlation is the average of the long-term change correlations.
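The role of the correlation can be made explicit with the standard identity for the variance of a difference. The sketch below uses generic symbols ($X_1$, $X_2$, $\sigma_1$, $\sigma_2$, $\rho$) that are introduced here only for illustration and are not part of the source methodology.

```latex
\[
\operatorname{Var}(X_2 - X_1) = \sigma_1^2 + \sigma_2^2 - 2\rho\,\sigma_1\sigma_2 ,
\qquad
z = \frac{X_2 - X_1}{\sqrt{\operatorname{Var}(X_2 - X_1)}} .
\]
```

For a positive $\rho$, replacing it with zero or with a smaller value increases the variance, shrinks $|z|$, and raises the p value. This is why an understated correlation makes the test conservative.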

Currently, seven sets of long-term change correlations are available for each substance use measure arranged according to outcome by state by age group: (a) 2002-2003 versus 2007-2008, (b) 2002-2003 versus 2008-2009, (c) two sets of 2002-2003 versus 2009-2010 (see End Note 3), (d) 2002-2003 versus 2010-2011, (e) 2002-2003 versus 2012-2013, and (f) 2002-2003 versus 2013-2014. Correlations for the four mental health outcomes are available for a different set of time periods, as discussed in the next paragraph. The average of these seven sets of correlations is henceforth referred to as a "generalized correlation." Averaging seven sets of correlations reduces variation and reduces the risk of using an outlier from a particular set of pair-years. Each of these seven sets of correlations was produced by simultaneously fitting 4 years of NSDUH data separately for each outcome measure. For example, to produce correlations between the 2002-2003 and 2007-2008 state estimates for past month marijuana use, four age groups (12 to 17, 18 to 25, 26 to 34, and 35 or older) by two time periods (2002-2003 and 2007-2008), that is, eight subpopulation-specific models, were fitted, each with its own set of fixed and random effects. In this case, the general covariance matrices for the state and within-state random effects were 8 × 8 matrices corresponding to the eight-element (age group × time period) vectors of random effects.

For three of the four mental health measures (i.e., AMI, SMI, and suicidal thoughts), six sets of correlations are available and are arranged according to outcome by state by age group: (a) 2008-2009 versus 2010-2011, (b) 2008-2009 versus 2011-2012, (c) 2008-2009 versus 2012-2013, (d) 2009-2010 versus 2011‑2012, (e) 2009-2010 versus 2012-2013, and (f) 2010-2011 versus 2012-2013. The average of these six sets of correlations is the "generalized correlation." Similarly, the fourth mental health measure—major depressive episode (MDE)—has eight sets of correlations available that are arranged by state and age group: (a) 2005-2006 versus 2007-2008, (b) 2005-2006 versus 2008-2009, (c) 2005-2006 versus 2009-2010, (d) 2005-2006 versus 2010‑2011, (e) 2005-2006 versus 2011-2012, (f) 2005-2006 versus 2012-2013, (g) 2006-2007 versus 2009-2010, and (h) 2008-2009 versus 2010-2011. The average of these eight sets of correlations is the "generalized correlation." Note that these correlations were produced in the same manner as discussed in the previous paragraph.
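The averaging step can be sketched in a few lines of Python. This is an illustration only: it assumes the "average" is a simple unweighted arithmetic mean, the function name is hypothetical, and the input values are made up rather than actual NSDUH correlations.

```python
from statistics import fmean

def generalized_correlation(correlations):
    """Average the available sets of correlations for one
    outcome, state, and age group (assumed: unweighted mean)."""
    values = list(correlations)
    if not values:
        raise ValueError("at least one correlation is required")
    return fmean(values)

# Hypothetical correlations, one per set of comparisons (seven sets, as for
# a substance use measure). These are NOT actual NSDUH values.
hypothetical_sets = [0.20, 0.25, 0.18, 0.22, 0.24, 0.19, 0.26]
gen_corr = generalized_correlation(hypothetical_sets)
print(round(gen_corr, 5))
```

The same function applies to the six sets for AMI, SMI, and suicidal thoughts and to the eight sets for MDE; only the number of inputs changes.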

These generalized correlations should be used by NSDUH data users to test the null hypothesis of no difference in state (or census region) prevalence rates for any two nonoverlapping time periods (e.g., 2006-2007 vs. 2010-2011). The national estimates are direct estimates, so the correlations for these are zero. Note that these generalized correlations are not to be used for conducting tests of significance between two overlapping time periods (e.g., 2010-2011 vs. 2011-2012).

The methodology that is used to compare state prevalence rates for two time periods is given in the "National Survey on Drug Use and Health: Comparison of 2002-2003 and 2011-2012 Model-Based Prevalence Estimates (50 States and the District of Columbia)" at https://www.samhsa.gov/data/. Note that a different set of generalized correlations was used to produce the p values for comparing the 2002-2003 and 2011-2012 small area estimates. Those generalized correlations were an average of five sets of correlations (all sets except the 2002-2003 vs. 2012-2013 correlations were available at the time). Using the methodology provided in that document, NSDUH data users can compare state prevalence rates for any two nonoverlapping time periods. To illustrate the procedure, an example comparing the 2006-2007 and 2011-2012 state prevalence rates of past month illicit drug use in Alabama among young adults aged 18 to 25 is given in the next section. Note that there were changes to the survey in 2002 (see End Note 4); thus, these correlations should be used to compare state prevalence rates only from 2002-2003 and beyond.

Comparison of State Estimates in Nonoverlapping Years Using a Generalized Correlation

This section describes a method for determining whether the difference in prevalence rates between two nonoverlapping time periods (e.g., 2002-2003 and 2011-2012) for a given state is statistically significant. Let $\pi 1_{sa}$ and $\pi 2_{sa}$ denote the prevalence rates at time period 1 and time period 2, respectively, for state $s$ and age group $a$. The difference between $\pi 1_{sa}$ and $\pi 2_{sa}$ is defined in terms of the log-odds ratio $\mathit{lor}_{sa}$, as opposed to the simple difference $\pi 2_{sa} - \pi 1_{sa}$, because the posterior distribution of $\mathit{lor}_{sa}$ is closer to Gaussian than the posterior distribution of the simple difference. The log-odds ratio $\mathit{lor}_{sa}$ is defined as

Equation 1:
$$\mathit{lor}_{sa} = \ln\left[\frac{\pi 2_{sa}/(1 - \pi 2_{sa})}{\pi 1_{sa}/(1 - \pi 1_{sa})}\right],$$


where ln denotes the natural logarithm. The p value is computed to test the null hypothesis of no change (i.e., $\pi 2_{sa} = \pi 1_{sa}$ or, equivalently, $\mathit{lor}_{sa} = 0$). An estimate of $\mathit{lor}_{sa}$ is given by

Equation 2:
$$\widehat{\mathit{lor}}_{sa} = \ln\left[\frac{p2_{sa}/(1 - p2_{sa})}{p1_{sa}/(1 - p1_{sa})}\right],$$

where $p1_{sa}$ and $p2_{sa}$ are the state estimates (i.e., the benchmarked small area estimates [BSAEs]) for the two time periods being compared. To compute the variance of $\widehat{\mathit{lor}}_{sa}$, that is, $v(\widehat{\mathit{lor}}_{sa})$, let $\hat{\theta}_1 = p1_{sa}/(1 - p1_{sa})$ and $\hat{\theta}_2 = p2_{sa}/(1 - p2_{sa})$; then

Equation 3:
$$v(\widehat{\mathit{lor}}_{sa}) = v(\ln\hat{\theta}_1) + v(\ln\hat{\theta}_2) - 2\,\operatorname{cov}(\ln\hat{\theta}_1, \ln\hat{\theta}_2),$$

where $\operatorname{cov}(\ln\hat{\theta}_1, \ln\hat{\theta}_2)$ denotes the covariance between $\ln\hat{\theta}_1$ and $\ln\hat{\theta}_2$. This covariance is defined in terms of the associated correlation as follows:

Equation 4:
$$\operatorname{cov}(\ln\hat{\theta}_1, \ln\hat{\theta}_2) = \operatorname{corr}(\ln\hat{\theta}_1, \ln\hat{\theta}_2)\,\sqrt{v(\ln\hat{\theta}_1)\,v(\ln\hat{\theta}_2)},$$

where $v(\ln\hat{\theta}_i) = \left[(U_i - L_i)/(2 \times 1.96)\right]^2$ for $i = 1, 2$; $U_i = \ln[\mathit{upper}_i/(1 - \mathit{upper}_i)]$; $L_i = \ln[\mathit{lower}_i/(1 - \mathit{lower}_i)]$; and $\mathit{lower}_i$ and $\mathit{upper}_i$ are the lower and upper limits of the 95 percent Bayesian CI.
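The conversion from a published 95 percent Bayesian CI to a variance on the log-odds scale, and from a correlation to a covariance, can be sketched in Python as follows. This is a minimal illustration rather than SAMHSA code: the function names are hypothetical, and the sample inputs are the Alabama intervals and generalized correlation from the example given later in this document.

```python
import math

def logit(x):
    """Log-odds of a proportion x, with 0 < x < 1."""
    return math.log(x / (1.0 - x))

def logit_variance_from_ci(lower, upper):
    """Variance of ln(theta_hat) implied by a 95 percent CI (lower, upper),
    computed as ((U - L) / (2 * 1.96)) ** 2 on the logit scale."""
    return ((logit(upper) - logit(lower)) / (2 * 1.96)) ** 2

def covariance_from_correlation(corr, var1, var2):
    """Covariance between ln(theta1_hat) and ln(theta2_hat) (Equation 4)."""
    return corr * math.sqrt(var1 * var2)

# Inputs from the Alabama example: CIs expressed as proportions.
v1 = logit_variance_from_ci(0.1318, 0.1905)   # 2006-2007
v2 = logit_variance_from_ci(0.1496, 0.2040)   # 2011-2012
cov = covariance_from_correlation(0.21994, v1, v2)
print(f"{v1:.5f} {v2:.5f} {cov:.5f}")
```

Note that the CI limits must be entered as proportions (e.g., 0.1318), not percentages, because the logit is undefined outside the interval from 0 to 1.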

For $\operatorname{corr}(\ln\hat{\theta}_1, \ln\hat{\theta}_2)$ for an outcome measure by state by age group, the generalized correlation will be used.

To calculate the p value for testing the null hypothesis of no difference ($\mathit{lor} = 0$), it is assumed that the posterior distribution of $\mathit{lor}$ is normal with mean $\widehat{\mathit{lor}}_{sa}$ and variance $v(\widehat{\mathit{lor}}_{sa})$. With the null value $\mathit{lor} = 0$, the Bayes p value, or significance level, for the null hypothesis of no difference is $2 \times P(Z \ge |z|)$, where $Z$ is a standard normal random variate, $z = \widehat{\mathit{lor}}_{sa} / \sqrt{v(\widehat{\mathit{lor}}_{sa})}$, and $|z|$ denotes the absolute value of $z$. This Bayesian significance level (or p value) for the null value of $\mathit{lor}$, say $\mathit{lor}_0$, is defined following Rubin (see End Note 5) as the posterior probability for the collection of $\mathit{lor}$ values that are less likely, or have smaller posterior density $d(\mathit{lor})$, than the null (no change) value $\mathit{lor}_0$. That is, $\text{p value}(\mathit{lor}_0) = P[d(\mathit{lor}) \le d(\mathit{lor}_0)]$. With the posterior distribution of $\mathit{lor}$ approximately normal, $\text{p value}(\mathit{lor}_0)$ is given by the above expression.

For overlapping time periods (see End Note 6), p values are given in published state reports and web documents, and the method described here should not be used. Also, because of changes to the survey in 2002, these generalized correlations should not be used to test differences between 1999-2000 small area estimates or 2000-2001 small area estimates and the other small area estimates beyond 2002.

Example. The following exhibit shows the prevalence estimates for past month illicit drug use among young adults aged 18 to 25 in Alabama for 2006-2007 and 2011-2012.

Time Period      State Estimate (%)    95% Confidence Interval (%)
2006-2007 (1)    15.90                 (13.18, 19.05)
2011-2012 (2)    17.51                 (14.96, 20.40)

(1) See Table 1 of the "2006-2007 NSDUH: Model-Based Prevalence Estimates (50 States and the District of Columbia)" at https://www.samhsa.gov/data.
(2) See Table 1 of the "2011-2012 NSDUH: Model-Based Prevalence Estimates (50 States and the District of Columbia)" at https://www.samhsa.gov/data.

The generalized correlation for illicit drug use for 18- to 25-year-olds in Alabama is 0.21994 (see End Note 7). Note that generalized correlations are on the logit scale (see End Note 8); that is, they are the correlation between the logit of p1 and the logit of p2 (not the correlation between p1 and p2), where p1 and p2 are the 2006-2007 and the 2011-2012 small area estimates, respectively.

The p value is calculated using the following methodology. Using the data from the exhibit, the following terms are first defined:

p1 = 0.1590, lower1 = 0.1318, upper1 = 0.1905, p2 = 0.1751, lower2 = 0.1496, and upper2 = 0.2040.

Then the following calculations are made:

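Equation 5:
$$\widehat{\mathit{lor}} = \ln\left[\frac{p2/(1 - p2)}{p1/(1 - p1)}\right] = \ln\left[\frac{0.1751/(1 - 0.1751)}{0.1590/(1 - 0.1590)}\right] = 0.1158,$$

Equation 6a:
$$U_1 = \ln\left[\frac{0.1905}{1 - 0.1905}\right] = -1.44676,$$

Equation 6b:
$$L_1 = \ln\left[\frac{0.1318}{1 - 0.1318}\right] = -1.88514,$$

Equation 6c:
$$U_2 = \ln\left[\frac{0.2040}{1 - 0.2040}\right] = -1.36148, \text{ and}$$

Equation 6d:
$$L_2 = \ln\left[\frac{0.1496}{1 - 0.1496}\right] = -1.73774.$$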

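Define $\hat{\theta}_1 = p1/(1 - p1)$ and $\hat{\theta}_2 = p2/(1 - p2)$; then the variances of $\ln\hat{\theta}_1$ and $\ln\hat{\theta}_2$ are given by the following:

Equation 7a:
$$v(\ln\hat{\theta}_1) = \left(\frac{U_1 - L_1}{2 \times 1.96}\right)^2 = \left(\frac{-1.44676 - (-1.88514)}{2 \times 1.96}\right)^2 = 0.01251 \text{ and}$$

Equation 7b:
$$v(\ln\hat{\theta}_2) = \left(\frac{U_2 - L_2}{2 \times 1.96}\right)^2 = \left(\frac{-1.36148 - (-1.73774)}{2 \times 1.96}\right)^2 = 0.00921.$$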


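Using the above variances and the generalized correlation, the variance of $\widehat{\mathit{lor}}$ is given by the following:

Equation 8:
$$v(\widehat{\mathit{lor}}) = v(\ln\hat{\theta}_1) + v(\ln\hat{\theta}_2) - 2\,\operatorname{cov}(\ln\hat{\theta}_1, \ln\hat{\theta}_2),$$

where

Equation 9:
$$\operatorname{cov}(\ln\hat{\theta}_1, \ln\hat{\theta}_2) = \operatorname{corr}(\ln\hat{\theta}_1, \ln\hat{\theta}_2)\,\sqrt{v(\ln\hat{\theta}_1)\,v(\ln\hat{\theta}_2)} = 0.21994 \times \sqrt{0.01251 \times 0.00921} = 0.00236.$$

Hence,

Equation 10:
$$v(\widehat{\mathit{lor}}) = 0.01251 + 0.00921 - 2 \times 0.00236 = 0.01700 \text{ and}$$

Equation 11:
$$z = \frac{\widehat{\mathit{lor}}}{\sqrt{v(\widehat{\mathit{lor}})}} = \frac{0.1158}{\sqrt{0.01700}} = 0.88808.$$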


The Bayes p value for the null hypothesis of no difference is $2 \times P(Z \ge |0.88808|) = 0.375$, where $|\cdot|$ denotes the absolute value and $Z$ is a standard normal random variable. Because the p value is greater than 0.05, the two prevalence rates are not significantly different at the 5 percent level of significance.
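The Alabama calculation can be reproduced end to end with a short Python sketch. It is offered as an illustration under stated assumptions, not as the official implementation: the function and class names are hypothetical, and the standard normal tail probability is computed with math.erfc, using the identity 2 × P(Z ≥ |z|) = erfc(|z| / √2).

```python
import math
from typing import NamedTuple

class ComparisonResult(NamedTuple):
    lor: float        # estimated log-odds ratio
    variance: float   # variance of the estimated log-odds ratio
    z: float          # lor divided by its standard error
    p_value: float    # two-sided Bayes p value under normality

def logit(x: float) -> float:
    """Log-odds of a proportion x, with 0 < x < 1."""
    return math.log(x / (1.0 - x))

def compare_nonoverlapping(p1, lower1, upper1, p2, lower2, upper2, corr):
    """Test of no change between two nonoverlapping periods.

    Estimates and CI limits are proportions; corr is the generalized
    correlation on the logit scale.
    """
    lor = logit(p2) - logit(p1)                               # Equation 5
    v1 = ((logit(upper1) - logit(lower1)) / (2 * 1.96)) ** 2  # Equation 7a
    v2 = ((logit(upper2) - logit(lower2)) / (2 * 1.96)) ** 2  # Equation 7b
    cov = corr * math.sqrt(v1 * v2)                           # Equation 9
    variance = v1 + v2 - 2 * cov                              # Equation 10
    z = lor / math.sqrt(variance)                             # Equation 11
    p_value = math.erfc(abs(z) / math.sqrt(2.0))              # 2 * P(Z >= |z|)
    return ComparisonResult(lor, variance, z, p_value)

# Alabama, past month illicit drug use, ages 18 to 25:
# 2006-2007 versus 2011-2012, generalized correlation 0.21994.
result = compare_nonoverlapping(
    0.1590, 0.1318, 0.1905,
    0.1751, 0.1496, 0.2040,
    0.21994,
)
print(f"lor={result.lor:.4f} z={result.z:.3f} p={result.p_value:.3f}")
```

Because the sketch carries full precision through every step, its intermediate values can differ in the last digit from a hand calculation that uses the rounded figures shown above.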



End Notes

1 Because of methodological changes implemented in the 2002 survey, a new baseline for all outcomes began that year. For the mental health outcomes, including any mental illness (AMI), serious mental illness (SMI), and serious thoughts of suicide, the baseline is 2008 because of new questions that were introduced in the survey that year.

2 For more information on this type of correlation, see the "National Survey on Drug Use and Health: Comparison of 2002-2003 and 2010-2011 Model-Based Prevalence Estimates (50 States and the District of Columbia)" and the "National Survey on Drug Use and Health: Comparison of 2009-2010 and 2010-2011 Model‑Based Prevalence Estimates (50 States and the District of Columbia)" at https://www.samhsa.gov/data/.

3 During regular data collection and processing checks for the 2011 NSDUH, data errors were identified. These errors affected the data for Pennsylvania (2006 to 2010) and Maryland (2008 and 2009) (for more details about the data errors, see Section A.7 of the "2011-2012 National Survey on Drug Use and Health: Guide to State Tables and Summary of Small Area Estimation Methodology" at https://www.samhsa.gov/data/). The first set of 2002-2003 versus 2009-2010 correlations that was produced before the data errors were identified included the erroneous data from Pennsylvania and Maryland. The second set of 2002-2003 versus 2009-2010 correlations was produced excluding the erroneous data from Pennsylvania and Maryland. The two sets of 2002-2003 versus 2009-2010 correlations were compared, and it was concluded that the data errors did not affect the underlying correlations. Therefore, the previously produced correlations (2002-2003 vs. 2007-2008 and 2002-2003 vs. 2008-2009) were not revised.

4 For details, see Section A.2 of the "2011-2012 National Surveys on Drug Use and Health: Guide to State Tables and Summary of Small Area Estimation Methodology" at https://www.samhsa.gov/data/.

5 Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys (Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics). New York, NY: John Wiley & Sons.

6 The overlapping time periods are as follows: 1999-2000 versus 2000-2001, 2002-2003 versus 2003-2004, 2003-2004 versus 2004-2005, 2004-2005 versus 2005-2006, 2005-2006 versus 2006-2007, 2006-2007 versus 2007-2008, 2007-2008 versus 2008-2009, 2008-2009 versus 2009-2010, 2009-2010 versus 2010-2011, 2010-2011 versus 2011-2012, 2011-2012 versus 2012-2013, and 2012-2013 versus 2013-2014.

7 See Table 1 of the generalized correlation Excel files at https://www.samhsa.gov/data.

8 The logit scale is defined as follows: $\operatorname{logit}(x) = \ln[x/(1 - x)]$, where ln denotes the natural logarithmic function.


Long Descriptions—Equations

Long description, Equation 1. The log-odds ratio, lor sub s and a, is defined as the natural logarithm of the ratio of two quantities. The numerator of the ratio is pi 2 sub s and a divided by 1 minus pi 2 sub s and a. The denominator of the ratio is pi 1 sub s and a divided by 1 minus pi 1 sub s and a.

Long description end. Return to Equation 1.

Long description, Equation 2. The estimate of the log-odds ratio, lor hat sub s and a, is defined as the natural logarithm of the ratio of two quantities. The numerator of the ratio is p 2 sub s and a divided by 1 minus p 2 sub s and a. The denominator of the ratio is p 1 sub s and a divided by 1 minus p 1 sub s and a, where p 1 sub s and a are the State estimates for time period 1 and p 2 sub s and a are the State estimates for time period 2.

Long description end. Return to Equation 2.

Long description, Equation 3. Variance v of the estimate of the log-odds ratio, lor hat sub s and a, is a function of three quantities: q1, q2, and q3. It is expressed as the sum of q1 and q2 minus q3. Quantity q1 is the variance v of the natural logarithm of Theta 1 hat, quantity q2 is the variance v of the natural logarithm of Theta 2 hat, and quantity q3 is 2 times the covariance between the natural logarithm of Theta 1 hat and the natural logarithm of Theta 2 hat.

Long description end. Return to Equation 3.

Long description, Equation 4. The covariance between the natural logarithm of Theta 1 hat and the natural logarithm of Theta 2 hat is equal to the correlation between the natural logarithm of Theta 1 hat and the natural logarithm of Theta 2 hat multiplied by the square root of the product of the variance v of the natural logarithm of Theta 1 hat and the variance v of the natural logarithm of Theta 2 hat.

Long description end. Return to Equation 4.

Long description, Equation 5. The estimate of the log-odds ratio, lor hat, is defined as the natural logarithm of the ratio of two quantities. The numerator of the ratio is p 2 divided by 1 minus p 2. The denominator of the ratio is p 1 divided by 1 minus p 1, where p1 is 0.1590 and p 2 is 0.1751. The estimate lor hat is calculated to be 0.1158.

Long description end. Return to Equation 5.

Long description, Equation 6a. Capital U sub 1 is defined as the natural logarithm of the ratio of 0.1905 and 1 minus 0.1905, which is −1.44676.

Long description end. Return to Equation 6a.

Long description, Equation 6b. Capital L sub 1 is defined as the natural logarithm of the ratio of 0.1318 and 1 minus 0.1318, which is −1.88514.

Long description end. Return to Equation 6b.

Long description, Equation 6c. Capital U sub 2 is defined as the natural logarithm of the ratio of 0.2040 and 1 minus 0.2040, which is −1.36148.

Long description end. Return to Equation 6c.

Long description, Equation 6d. Capital L sub 2 is defined as the natural logarithm of the ratio of 0.1496 and 1 minus 0.1496, which is −1.73774.

Long description end. Return to Equation 6d.

Long description, Equation 7a. The variance v of the natural logarithm of Theta 1 hat is equal to the square of quantity q. Quantity q is calculated as the difference between capital U sub 1 and capital L sub 1 divided by the product of 2 and 1.96. Here, capital U sub 1 is −1.44676, and capital L sub 1 is −1.88514. Hence, the variance v of the natural logarithm of Theta 1 hat is calculated to be 0.01251.

Long description end. Return to Equation 7a.

Long description, Equation 7b. The variance v of the natural logarithm of Theta 2 hat is equal to the square of quantity q. Quantity q is calculated as the difference between capital U sub 2 and capital L sub 2 divided by the product of 2 and 1.96. Here, capital U sub 2 is −1.36148, and capital L sub 2 is −1.73774. Hence, the variance v of the natural logarithm of Theta 2 hat is calculated to be 0.00921.

Long description end. Return to Equation 7b.

Long description, Equation 8. Variance v of the estimate of the log-odds ratio, lor hat, is a function of three quantities: q1, q2, and q3. It is expressed as the sum of q1 and q2 minus q3. Quantity q1 is the variance v of the natural logarithm of Theta 1 hat, quantity q2 is the variance v of the natural logarithm of Theta 2 hat, and quantity q3 is 2 times the covariance between the natural logarithm of Theta 1 hat and the natural logarithm of Theta 2 hat.

Long description end. Return to Equation 8.

Long description, Equation 9. The covariance between the natural logarithm of Theta 1 hat and the natural logarithm of Theta 2 hat is equal to the correlation between the natural logarithm of Theta 1 hat and the natural logarithm of Theta 2 hat multiplied by the square root of the product of the variance v of the natural logarithm of Theta 1 hat and the variance v of the natural logarithm of Theta 2 hat. For this example, this is equivalent to 0.21994 times the square root of the product of 0.01251 and 0.00921, which equals 0.00236.

Long description end. Return to Equation 9.

Long description, Equation 10. The variance of the estimate of log-odds ratio, lor hat, in this example is equivalent to the sum of 0.01251 and 0.00921 minus 2 times 0.00236, which equals 0.01700.

Long description end. Return to Equation 10.

Long description, Equation 11. The quantity z is equal to lor hat divided by the square root of the variance of lor hat. For this example, this is equivalent to 0.1158 divided by the square root of 0.01700, which equals 0.88808.

Long description end. Return to Equation 11.
