2014-2015 National Survey on Drug Use and Health:
Comparison of Population Percentages from the United States, Census Regions, States,
and the District of Columbia (Documentation for CSV and Excel Files)

 

Documentation for CSV and Excel Files

Description of the CSV File Type

Files with a comma separated value (*.csv) extension are in plain text. They contain characters stored in a flat, nonproprietary format and can be opened by most computer programs. Each *.csv file contains a set of tabular data, with each record delineated by a line break and each field within a record delineated by a comma. A field that contains commas as part of its content has the additional delineation of a quote mark character before and after the field's contents. When a quote mark character is part of a field's content, it is included as two consecutive quote mark characters.
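The quoting rules described above can be illustrated with a minimal Python sketch that uses the standard csv module. The sample record is hypothetical and is not taken from any of the P Value Table files; it only shows how a field with an embedded comma and a field with embedded quote mark characters are read and written.

```python
import csv
import io

# Hypothetical record (not from the NSDUH files): the second field contains
# a comma and the third contains quote mark characters, so both are wrapped
# in quote marks and the embedded quote marks are doubled.
raw = 'Alabama,"South, census region","the ""p value"" column"\r\n'

# Reading removes the wrapping quote marks and undoubles the embedded ones.
rows = list(csv.reader(io.StringIO(raw)))
fields = rows[0]

# Writing the same fields back out applies the same quoting rules.
buffer = io.StringIO()
csv.writer(buffer).writerow(fields)
round_trip = buffer.getvalue()
```

Here fields holds three plain strings, and round_trip reproduces the original record text.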

Computers with Microsoft Excel installed open *.csv files in Excel by default, with the fields automatically arranged appropriately in columns. Other database programs also open *.csv files with the fields appropriately arranged.

The 64 CSV files (i.e., "P Value Table#.csv") reflect the 64 Excel tables, and they contain the table title, table notes, column headings, and data. The webpage at http://www.samhsa.gov/data/ for the 2014-2015 NSDUH state p value tables includes a hyperlinked table of contents on the first sheet of the Excel file that combines all of the Excel tables, as well as a listing on the webpage itself of the individually linked CSV files.

How to Use the P Value Tables

The p values contained in these tables for each outcome and age group can be used to test the null hypothesis of no difference between the population percentages of any two of the geographic areas covered (the United States, census regions, states, and the District of Columbia).

In general, to find the p value when testing any two geographic areas, navigate to the row of the area with the higher order number, then navigate to the column of the other area. For example, within any given table, by scrolling across Alabama's state row to the South's census region column, the p value found will determine whether Alabama's state population percentage and the South's census region population percentage are significantly different for a particular outcome of interest. Note that the tests included here are for a given outcome and age group.1

For example, Table 1.2 contains p values for past year marijuana use among youths aged 12 to 17. The p value for testing the null hypothesis of no difference between Oregon and the West region population percentages for past year marijuana use among youths aged 12 to 17 is 0.012. Thus, the hypothesis of no difference (Oregon population percentage = West region population percentage) is rejected at the 5 percent level of significance, meaning that the two prevalence rates are statistically different. Note that the Oregon and West region estimates are 17.6 and 14.4 percent, respectively.2
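The lookup rule and the 5 percent decision described above can be sketched in Python. This is only an illustration: the order numbers, the nested-dictionary layout, and the function name lookup_p_value are assumptions made for the example and do not reflect the actual layout of the CSV or Excel files. The only value taken from the text is the Oregon versus West p value of 0.012.

```python
# Hypothetical order numbers; the real tables define their own ordering.
order = {"West": 1, "Oregon": 2}

# Hypothetical layout: p values keyed by row area, then by column area.
# The single entry is the Oregon vs. West value quoted in the text.
table = {"Oregon": {"West": 0.012}}


def lookup_p_value(table, order, area_a, area_b):
    """Return the p value found in the row of the higher-order area."""
    if area_a == area_b:
        raise ValueError("an area is not compared with itself")
    # The area with the higher order number supplies the row,
    # the other area supplies the column.
    row, col = sorted((area_a, area_b), key=order.__getitem__, reverse=True)
    return table[row][col]


p_value = lookup_p_value(table, order, "West", "Oregon")

# Reject the null hypothesis of no difference at the 5 percent level.
significant = p_value < 0.05
```

The areas can be passed in either order, because the function always reads from the row of the area with the higher order number.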

Comparison between Two Small Area Population Percentages

To produce state, census region, and national small area estimates, the 2014-2015 NSDUH data were modeled using the method discussed in Section B.1 of the "2014-2015 NSDUH: Guide to State Tables and Summary of Small Area Estimation Methodology" document at http://www.samhsa.gov/data/. This modeling results in 1,250 Markov Chain Monte Carlo (MCMC) samples that are used here to calculate p values for testing the null hypothesis of no difference between two small area population percentages.

Let $\pi_{1a}$ and $\pi_{2a}$ denote the 2014-2015 population percentages of two areas (e.g., state 1 vs. state 2 or state 1 vs. national) for age group $a$. The difference between $\pi_{1a}$ and $\pi_{2a}$ is defined in terms of the log-odds ratio,

$$\mathrm{lor}_a = \ln\left[\frac{\pi_{2a}/(1-\pi_{2a})}{\pi_{1a}/(1-\pi_{1a})}\right],$$

where $\ln$ denotes the natural logarithm, as opposed to the simple difference $\pi_{2a} - \pi_{1a}$, because the posterior distribution of the log-odds ratio is closer to Gaussian than the posterior distribution of the simple difference.
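As a small numerical illustration of the log-odds ratio, the following Python sketch applies the definition to the published point estimates quoted earlier (17.6 percent for Oregon and 14.4 percent for the West region), expressed as proportions. Treating the West as area 1 and Oregon as area 2 is an arbitrary choice for the example, and the function name log_odds_ratio is hypothetical. The result is not the model-based estimate, which is averaged over MCMC samples.

```python
import math


def log_odds_ratio(pi_1, pi_2):
    """Natural log of [pi_2 / (1 - pi_2)] divided by [pi_1 / (1 - pi_1)].

    Both arguments are proportions strictly between 0 and 1.
    """
    if not (0.0 < pi_1 < 1.0 and 0.0 < pi_2 < 1.0):
        raise ValueError("proportions must lie strictly between 0 and 1")
    odds_1 = pi_1 / (1.0 - pi_1)
    odds_2 = pi_2 / (1.0 - pi_2)
    return math.log(odds_2 / odds_1)


# Area 1 = West region (14.4 percent), area 2 = Oregon (17.6 percent).
lor = log_odds_ratio(0.144, 0.176)
```

The value is positive because the area 2 proportion is the larger of the two; swapping the areas flips the sign.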

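An estimate, $\widehat{\mathrm{lor}}_a$, of $\mathrm{lor}_a$ is given by the average of the 1,250 MCMC sample-based log-odds ratios. Let $\mathrm{lor}_a^{i}$ denote the log-odds ratio for the $i$-th MCMC sample. That is,

Equation 1.     $$\mathrm{lor}_a^{i} = \ln\left[\frac{\pi_{2a}^{i}/(1-\pi_{2a}^{i})}{\pi_{1a}^{i}/(1-\pi_{1a}^{i})}\right]$$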

Then

$$\widehat{\mathrm{lor}}_a = \frac{\sum_{i=1}^{1250} \mathrm{lor}_a^{i}}{1250},$$

and the variance of $\widehat{\mathrm{lor}}_a$ is given by

$$v(\widehat{\mathrm{lor}}_a) = \frac{\sum_{i=1}^{1250} \left(\mathrm{lor}_a^{i} - \widehat{\mathrm{lor}}_a\right)^2}{1250}.$$
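The averaging step can be sketched in Python as follows. The MCMC samples themselves are not distributed with these tables, so the sketch generates synthetic draws as stand-ins; the means, spreads, seed, and the function name lor_estimate_and_variance are all assumptions made for illustration. The variance divides by the number of samples, as in the definition above.

```python
import math
import random


def lor_estimate_and_variance(pi_1_samples, pi_2_samples):
    """Return the mean and variance of the per-sample log-odds ratios."""
    lors = [
        math.log((p2 / (1.0 - p2)) / (p1 / (1.0 - p1)))
        for p1, p2 in zip(pi_1_samples, pi_2_samples)
    ]
    n = len(lors)
    lor_hat = sum(lors) / n
    # Divide by n (not n - 1), matching the definition in the text.
    variance = sum((x - lor_hat) ** 2 for x in lors) / n
    return lor_hat, variance


def clamp(p):
    """Keep a synthetic draw strictly inside (0, 1)."""
    return min(max(p, 0.001), 0.999)


# Synthetic stand-ins for 1,250 MCMC draws per area (not real NSDUH output).
rng = random.Random(0)
pi_1 = [clamp(rng.gauss(0.144, 0.010)) for _ in range(1250)]
pi_2 = [clamp(rng.gauss(0.176, 0.012)) for _ in range(1250)]

lor_hat, v = lor_estimate_and_variance(pi_1, pi_2)
```

Because the synthetic draws for the two areas are generated independently, this example does not reproduce the between-area correlation that real MCMC samples would carry.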

To calculate the p value for testing the null hypothesis of no difference, $\mathrm{lor}_a = 0$, it is assumed that the posterior distribution of $\mathrm{lor}_a$ is normal with mean $\widehat{\mathrm{lor}}_a$ and variance $v(\widehat{\mathrm{lor}}_a)$. With $\mathrm{lor}_a = 0$, the Bayes p value or significance level for the null hypothesis of no difference is

$$p\ \text{value} = 2 \times P\left[Z \ge |z|\right],$$

where $Z$ is a standard normal random variate, $z = \widehat{\mathrm{lor}}_a / \sqrt{v(\widehat{\mathrm{lor}}_a)}$, and $|z|$ denotes the absolute value of $z$.

This Bayesian significance level (or p value) for the null value of $\mathrm{lor}$, say $\mathrm{lor}_0$, is defined following Rubin3 as the posterior probability for the collection of the $\mathrm{lor}$ values that are less likely or have smaller posterior density $d(\mathrm{lor})$ than the null (no change) value $\mathrm{lor}_0$. That is,

$$p\ \text{value}(\mathrm{lor}_0) = \Pr\left[d(\mathrm{lor}) \le d(\mathrm{lor}_0)\right].$$

With the posterior distribution of $\mathrm{lor}$ approximately normal, $p\ \text{value}(\mathrm{lor}_0)$ is given by the above expression. If the p value is less than 0.05, then it can be stated that the estimates for the two areas are statistically different from each other.
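The two-sided p value under the normal approximation can be computed with the Python standard library alone, using the identity 2 x P[Z >= |z|] = erfc(|z| / sqrt(2)). The sketch below is a minimal illustration; the function name bayes_p_value and the input values are hypothetical and are not taken from the tables.

```python
import math


def bayes_p_value(lor_hat, variance):
    """Two-sided p value, 2 * P[Z >= |z|], with z = lor_hat / sqrt(variance)."""
    if variance <= 0.0:
        raise ValueError("variance must be positive")
    z = lor_hat / math.sqrt(variance)
    # For a standard normal Z, 2 * P[Z >= |z|] equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical estimate and variance, chosen only for illustration.
p_value = bayes_p_value(lor_hat=0.25, variance=0.01)
different_at_5_percent = p_value < 0.05
```

The sign of the estimate does not matter, so the result is the same regardless of which area is labeled area 1.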

Note that in the 2014-2015 methodology and guide,4 Section B.7 discusses a method for comparing two state estimates to determine whether any differences are statistically significant. The discussion in that section was meant to provide a quick ad hoc way to test the differences in two state estimates using the assumption that state estimates are not correlated. However, even though between-state correlations are small, they are not strictly negligible, and state estimates definitely contribute to regional and national estimates. Thus, the assumption of no correlation does not work in that circumstance. The test described above is based on a direct calculation of the variance of the difference of the log-odds of two areas and thus takes into account the correlation5 between the estimated log-odds of two areas to calculate the p values. Therefore, the p values shown in these Excel and CSV tables for state versus state tests are more accurate and may be slightly different from the p values using the approximate test described in Section B.7 of the 2014-2015 methodology and guide.
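The effect of the correlation term can be seen in a short Python sketch based on the variance identity given in End Note 5. The variances and the correlation used here are hypothetical values chosen only to show the direction of the effect, and the function name variance_of_difference is an assumption for the example.

```python
import math


def variance_of_difference(var_x, var_y, corr_xy):
    """Var(X - Y) = Var(X) + Var(Y) - 2 * Cov(X, Y)."""
    cov_xy = corr_xy * math.sqrt(var_x * var_y)
    return var_x + var_y - 2.0 * cov_xy


# Hypothetical variances of the estimated log-odds of two areas.
var_x, var_y = 0.010, 0.006

# Assuming no correlation (the ad hoc approach) versus a hypothetical
# positive correlation of 0.3 between the two estimates.
uncorrelated = variance_of_difference(var_x, var_y, 0.0)
correlated = variance_of_difference(var_x, var_y, 0.3)
```

With a positive correlation the variance of the difference is smaller than under the no-correlation assumption, so the resulting z statistic is larger in absolute value and the p value is smaller.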


End Notes

1 The substance use and mental disorder outcomes in these tables focus on illicit drug use, alcohol use, tobacco use, alcohol use disorder, serious mental illness, any mental illness, suicidal thoughts and behavior, and major depressive episode. The age groups include individuals aged 12 or older, youths aged 12 to 17, young adults aged 18 to 25, adults aged 26 or older, and adults aged 18 or older. Alcohol use is also provided for individuals aged 12 to 20. Note that not all outcomes have data broken out by all age groups.

2 See Table 1 of the "2014-2015 NSDUH: Model-Based Prevalence Estimates (50 States and the District of Columbia)" at http://www.samhsa.gov/data/.

3 Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys (Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics). New York, NY: John Wiley & Sons.

4 See the "2014-2015 NSDUH: Guide to State Tables and Summary of Small Area Estimation Methodology" document at http://www.samhsa.gov/data/.

5 That is, $\mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) - 2\,\mathrm{Cov}(X, Y)$, where $\mathrm{Cov}(X, Y) = \mathrm{Corr}(X, Y)\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}$ and $X$, $Y$ are the estimated log-odds of two areas.


Long Descriptions—Equations and Figures

Long description, Equation 1. The log-odds ratio, lor i sub a, is defined as the natural logarithm of the ratio of two quantities. The numerator of the ratio is pi 2 i sub a divided by 1 minus pi 2 i sub a. The denominator of the ratio is pi 1 i sub a, divided by 1 minus pi 1 i sub a.
