Files with a comma-separated values (*.csv) extension are plain text files. They store characters in a flat, nonproprietary format and can be opened by most computer programs. Each *.csv file contains a set of tabular data, with each record delimited by a line break and each field within a record delimited by a comma. A field that contains commas as part of its content is additionally enclosed in quote mark characters, one before and one after the field’s contents. When a quote mark character is part of a field’s content, it is written as two consecutive quote mark characters ("").
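The quoting rules described here can be checked with a short Python sketch that uses the standard library csv module. The sample text is hypothetical and is not taken from any NSDUH file; it only shows how a field with a comma and an embedded quote mark is read back.

```python
import csv
import io

# Hypothetical two-record sample (not taken from the NSDUH files): the second
# field of the data record contains a comma and an embedded quote mark.
sample = 'Area,Note\r\nAlabama,"Estimate, see ""Table 2.3"""\r\n'

# csv.reader applies the rules described above: a quoted field may contain
# commas, and a doubled quote mark inside it is read as one quote mark.
rows = list(csv.reader(io.StringIO(sample, newline="")))

for row in rows:
    print(row)
```

The enclosing quote marks are removed on reading, and each doubled quote mark is returned as a single character.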
On computers with Microsoft Excel installed, *.csv files open in Excel by default, with the fields automatically arranged in columns. Other spreadsheet and database programs also open *.csv files with the fields arranged in columns.
The 181 *.csv files (i.e., “P Value Table#.csv”) reflect the 181 Excel tables, and they contain the table title, table notes, column headings, and data. The webpage at https://www.samhsa.gov/data/report/2022-2023-nsduh-substate-region-estimates-p-value-tables for the 2022‑2023 National Surveys on Drug Use and Health (NSDUH) state p value tables provides an Excel file that combines all of the Excel tables, with a hyperlinked table of contents on its first sheet, as well as a ZIP file containing all of the individual *.csv files. Additionally, the ZIP file includes a text file with a list of the table numbers and titles.
The p values contained in these tables for each outcome and age group can be used to test the null hypothesis of no difference between population percentages for the following types of comparisons:
In general, to find the p value for testing the difference between the population percentages of any two geographic areas, navigate to the row of the area with the higher order number, then navigate to the column of the other area. For example, within any given table, scrolling across Alabama’s state row to the South’s census region column gives the p value that determines whether Alabama’s state population percentage and the South’s census region population percentage are significantly different for a particular outcome of interest. Note that the tests included here, for a given outcome and age group, are produced using 2022 and 2023 data (see footnote 1).
For example, Table 2.3 contains the p values for testing the difference between the population percentages of any two geographic areas for marijuana use in the past year among people aged 18 to 25. The p value for testing the null hypothesis of no difference between the South region and Alabama population percentages for marijuana use in the past year among people aged 18 to 25 is 0.003 (Table 2.3, row=Alabama, column=South). Thus, the hypothesis of no difference (Alabama population percentage = South region population percentage) is rejected at the 1 percent level of significance, meaning that the two population percentages are statistically different. Note that the Alabama and South region estimates for marijuana use in the past year among people aged 18 to 25 are 26.43 and 32.57 percent, respectively (see footnote 2).
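The row and column lookup rule and the Table 2.3 example can be sketched in Python as follows. The sketch assumes that each p value is stored once, in the row of the area with the higher order number. The order numbers and the names ORDER, TABLE_2_3, lookup_p_value, and is_significant are hypothetical; only the value 0.003 comes from the example in the text.

```python
# Order numbers are hypothetical placeholders; the only property used is that
# the South census region is ordered before Alabama, as the example implies.
ORDER = {"South": 1, "Alabama": 2}

# Assumed layout: each p value is stored once, in the row of the area with the
# higher order number. 0.003 is the Table 2.3 value quoted in the text.
TABLE_2_3 = {"Alabama": {"South": 0.003}}


def lookup_p_value(table, order, area_a, area_b):
    """Return the p value for two areas, whichever order they are given in."""
    # The area with the higher order number supplies the row.
    row, col = sorted((area_a, area_b), key=order.__getitem__, reverse=True)
    return table[row][col]


def is_significant(p_value, alpha=0.01):
    """True when the null hypothesis of no difference is rejected at alpha."""
    return p_value < alpha


p = lookup_p_value(TABLE_2_3, ORDER, "South", "Alabama")
print(p, is_significant(p))
```

Sorting the two areas by order number before indexing means the caller does not need to know which area supplies the row.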
To produce state, census region, and national small area estimates, the 2022‑2023 NSDUH data were modeled using the method discussed in Section B.1 of 2022‑2023 National Surveys on Drug Use and Health: Guide to State Tables and Summary of Small Area Estimation Methodology at https://www.samhsa.gov/data/report/2022-2023-nsduh-guide-state-tables-and-summary-sae-methodology. This modeling results in 1,250 Markov Chain Monte Carlo (MCMC) samples that are used here to calculate p values for testing the null hypothesis of no difference between two small area population percentages.
Let $\pi_{1a}$ and $\pi_{2a}$ denote the population percentages of two areas (e.g., state 1 versus state 2 or state 1 versus national) for age group $a$. The difference between $\pi_{1a}$ and $\pi_{2a}$ is defined in terms of the log-odds ratio ($lor_a$) as opposed to the simple difference because the posterior distribution of $lor_a$ is closer to Gaussian than the posterior distribution of the simple difference ($\pi_{2a} - \pi_{1a}$). Let ln denote the natural logarithm, then $lor_a$ is defined as follows:
$$lor_a = \ln\left[\frac{\pi_{2a} / (1 - \pi_{2a})}{\pi_{1a} / (1 - \pi_{1a})}\right] \tag{1}$$
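As a rough illustration of the log-odds ratio in Equation 1, the following Python sketch evaluates it for the two estimates quoted in the Table 2.3 example, converted from percentages to proportions. Treating Alabama as area 1 and the South as area 2 is an assumption made only for this illustration, and the function name log_odds_ratio is hypothetical. The result is a plug-in value from two point estimates, not the MCMC-based estimate used to produce the tables.

```python
import math


def log_odds_ratio(pi_1, pi_2):
    """Equation 1, with pi_1 and pi_2 given as proportions strictly between 0 and 1."""
    return math.log((pi_2 / (1.0 - pi_2)) / (pi_1 / (1.0 - pi_1)))


# Estimates quoted in the Table 2.3 example (26.43 and 32.57 percent),
# expressed as proportions. Assigning Alabama to area 1 and the South to
# area 2 is an assumption for illustration only.
lor = log_odds_ratio(0.2643, 0.3257)
print(round(lor, 3))
```

Swapping the two areas changes only the sign of the result, so the two-sided test of no difference is unaffected by which area is labeled area 1.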
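An estimate of $lor_a$ is given by the average of the 1,250 MCMC sample-based log-odds ratios. Let $lor_a^{i}$ denote the log-odds ratio for the i‑th MCMC sample. That is,
$$lor_a^{i} = \ln\left[\frac{\pi_{2a}^{i} / (1 - \pi_{2a}^{i})}{\pi_{1a}^{i} / (1 - \pi_{1a}^{i})}\right], \quad i = 1, \ldots, 1{,}250. \tag{2}$$
Then $\widehat{lor}_a = \frac{1}{1{,}250} \sum_{i=1}^{1{,}250} lor_a^{i}$, and the variance of $\widehat{lor}_a$ is given by the following:
$$v(\widehat{lor}_a) = \frac{\sum_{i=1}^{1{,}250} \left(lor_a^{i} - \widehat{lor}_a\right)^2}{1{,}250} \tag{3}$$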
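The averaging and variance steps (Equations 2 and 3) can be sketched in Python as shown below. The draws are simulated stand-ins generated from arbitrary beta distributions; they are not NSDUH MCMC output, and the function names are hypothetical. The variance divides by the number of samples, matching the denominator of 1,250 in Equation 3.

```python
import math
import random


def log_odds_ratio(pi_1, pi_2):
    """Log-odds ratio of area 2 relative to area 1 for one pair of proportions."""
    return math.log((pi_2 / (1.0 - pi_2)) / (pi_1 / (1.0 - pi_1)))


def lor_mean_and_variance(pi_1_draws, pi_2_draws):
    """Average the per-sample log-odds ratios and compute their variance,
    dividing by the number of samples (1,250 in the source)."""
    if len(pi_1_draws) != len(pi_2_draws):
        raise ValueError("both areas need the same number of draws")
    lors = [log_odds_ratio(p1, p2) for p1, p2 in zip(pi_1_draws, pi_2_draws)]
    n = len(lors)
    lor_hat = sum(lors) / n
    variance = sum((x - lor_hat) ** 2 for x in lors) / n
    return lor_hat, variance


# Simulated stand-ins for MCMC output; the real draws come from the NSDUH
# small area estimation model and are not reproduced here.
rng = random.Random(0)
pi_1_draws = [rng.betavariate(26, 74) for _ in range(1250)]
pi_2_draws = [rng.betavariate(33, 67) for _ in range(1250)]

lor_hat, variance = lor_mean_and_variance(pi_1_draws, pi_2_draws)
print(lor_hat, variance)
```

Pairing the i‑th draw of one area with the i‑th draw of the other follows Equation 2, in which both proportions carry the same sample index.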
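To calculate the p value for testing the null hypothesis of no difference ($lor_a = 0$), it is assumed that the posterior distribution of $lor_a$ is normal with estimated mean $\widehat{lor}_a$ and variance $v(\widehat{lor}_a)$. The Bayesian p value or significance level for the null hypothesis of no difference ($lor_a = 0$) is $\text{p value} = 2 \times P[Z \geq \mathrm{abs}(z)]$, where $Z$ is a standard normal random variate, $z = \widehat{lor}_a / \sqrt{v(\widehat{lor}_a)}$, and $\mathrm{abs}(z)$ denotes the absolute value of $z$. This Bayesian significance level (or p value) for the null value of $lor$, say $lor_0$, is defined following Rubin (1987) (see footnote 3) as the posterior probability for the collection of the $lor$ values that are less likely or have smaller posterior density, $d(lor)$, than the null (no change) value, $d(lor_0)$. That is,
$$\text{p value}(lor_0) = \mathrm{Prob}\left[d(lor) \leq d(lor_0)\right] \tag{4}$$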
With the posterior distribution of $lor$ approximately normal, $\text{p value}(lor_0)$ is given by the above expression. If the p value is less than 0.01, for example, then it can be stated that the population percentages of two areas are statistically different from each other at the 1 percent level of significance.
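Under the normal approximation, the two-sided p value is twice the upper-tail probability of a standard normal variate at the absolute value of the estimated log-odds ratio divided by its standard deviation. A minimal Python sketch of that calculation follows. The inputs are hypothetical values chosen only to exercise the function, and the function name bayesian_p_value is an assumption.

```python
import math


def bayesian_p_value(lor_hat, variance):
    """Two-sided p value 2 * P[Z >= |z|] with z = lor_hat / sqrt(variance)."""
    z = lor_hat / math.sqrt(variance)
    # For a standard normal Z, 2 * P[Z >= |z|] equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical inputs (z = 3), chosen only to exercise the function.
p = bayesian_p_value(lor_hat=0.30, variance=0.01)
print(p, p < 0.01)
```

The complementary error function is used here instead of subtracting a cumulative probability from 1 because it keeps precision when the p value is very small.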
1 The outcomes in these tables focus on illicit drug use, alcohol use, tobacco use, perception of great risk of harm from substance use, substance use disorder (SUD), substance use treatment, any mental illness, serious mental illness, co-occurring SUD and mental illness, mental health treatment, major depressive episode, and suicidal thoughts and behavior. The age groups include people aged 12 or older, youths aged 12 to 17, young adults aged 18 to 25, adults aged 26 or older, and adults aged 18 or older. Alcohol- and tobacco-related outcomes are also provided for people aged 12 to 20 (i.e., underage people). Note that not all outcomes have data broken out by all age groups. Estimates for youths aged 12 to 17 are not available for past year heroin use because past year heroin use was extremely rare among youths aged 12 to 17 in the 2022 and 2023 NSDUHs. As a result, estimates for people aged 12 or older are also not produced. Thus, p value tables for these two age groups for past year heroin use are not available.
2 See Table 2 of 2022‑2023 National Surveys on Drug Use and Health: Model-Based Prevalence Estimates (50 States and the District of Columbia) at https://www.samhsa.gov/data/report/2022-2023-nsduh-state-prevalence-estimates.
3 See the following reference: Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys (Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics). John Wiley & Sons.
Long description, Equation 1: The log-odds ratio, lor sub a, is defined as the natural logarithm of the ratio of two quantities; the numerator of the ratio is pi sub 2 a divided by 1 minus pi sub 2 a. The denominator of the ratio is pi sub 1 a divided by 1 minus pi sub 1 a.
Long description, Equation 2: The log-odds ratio, lor i sub a, is defined as the natural logarithm of the ratio of two quantities. The numerator of the ratio is pi i sub 2 a divided by 1 minus pi i sub 2 a. The denominator of the ratio is pi i sub 1 a, divided by 1 minus pi i sub 1 a. The value i in the equation takes on values 1 to 1250.
Long description, Equation 3: The variance of lor hat sub a is defined as the ratio of two quantities. The numerator is the sum over 1,250 values of the square of the difference between lor i sub a and lor hat sub a. The denominator is 1,250.
Long description, Equation 4: The p value of log-odds ratio lor sub zero is equal to the probability of d of the log-odds ratio lor when it is less than or equal to d of the log-odds ratio lor sub zero.