HCUP Calculating Standard Errors - Accessible Version
HCUP Calculating Standard Errors - Introduction
- Standard Errors
- Importance of Calculating Standard Errors
- HCUP Nationwide Database Sample Design
- Finite Population Correction
- Statistical Software
- National Estimate Example
- Example Results
- Verification of Results
- Standard Errors for Subsets
- Calculating Standard Errors for Subsets
- Subsets: Recommended Method
- Subsets: Recommended Method Results
- Verification of Results
- Subsets: Alternate Method
- Subsets: Alternate Method Results
- Verification of Results
- Significance Testing
- Wrap-Up
Thank you for joining us for this Healthcare Cost and Utilization Project (HCUP) online tutorial on Calculating Standard Errors.
Before we get started, a quick word about HCUP:
The goal of this tutorial is to show you how to determine the precision of the estimates you calculate from HCUP nationwide databases so that you will be able to draw sound conclusions from your analyses.
Standard error is a measure of the precision of a statistic. It reflects the amount that a sample statistic's value would fluctuate if a large number of samples were to be drawn using the same sampling design. Less precise estimates have larger standard errors while more precise estimates have smaller standard errors.
The HCUP nationwide databases are not simple random samples. The NIS (beginning with data year 2012), KID, and NRD are stratified samples. The NIS was redesigned in 2012 to improve national estimates. Prior to its redesign, the NIS was a stratified two-stage cluster sample without replacement. The NEDS also is a stratified two-stage cluster sample without replacement. Standard formulas for a stratified two-stage cluster sample without replacement may be used to calculate standard errors in most applications for all four samples. Although a sample of hospitals is not drawn for the NIS (beginning with data year 2012), KID, or NRD, for estimation purposes, hospitals should be treated as though they were selected at the first stage of sampling from the entire universe of hospitals within each stratum. Examples provided in this tutorial use 2013 NIS data, but the same standard error calculations apply to prior data years of the NIS as well as to the NEDS, KID, and NRD. To review the sample designs, refer to the HCUP Sample Design Tutorial.
Prior to data year 2012, the NIS was a stratified two-stage cluster sample, similar to the NEDS. Beginning with the 2012 data year, the NIS is a stratified sample of hospital discharges. Discharges in the sampling frame are stratified by five key hospital characteristics. Then, a systematic random sample of discharges is chosen from each of the strata after the discharges are sorted by "control" variables ordered as follows: encrypted hospital ID, Diagnosis-Related Group (DRG), admission month, and a random number. Although the NIS is not a cluster sample (discharges are sampled from all frame hospitals), discharges are still clustered within hospitals. Consequently, each hospital is considered a cluster for the purpose of calculating standard errors.
The NEDS is a stratified two-stage cluster sample. Hospital-based emergency departments in the sampling frame are stratified by five key hospital characteristics. Then, a random sample of hospital-based emergency departments is chosen from each of the strata. In sampling terminology, each emergency department is considered a cluster. The NEDS includes all discharges from the selected clusters, or emergency departments.
The KID is a sample of pediatric discharges from all hospitals in the sampling frame. Discharges are stratified by whether they are an uncomplicated in-hospital birth, a complicated in-hospital birth, or a pediatric non-birth. For the KID, a random sample of 10% of uncomplicated in-hospital births and 80% of all other pediatric discharges is selected.
The NRD is drawn from HCUP State Inpatient Databases (SID) that contain reliable, verified patient linkage numbers that can be used to track a person across hospitals within a State, while adhering to strict privacy guidelines. All of the discharges in the sampling frame were included, making the NRD a sample of convenience. Discharges are post-stratified for the purpose of weighting by hospital characteristics (census region, urban/rural location, hospital teaching status, size of the hospital defined by the number of beds, and hospital control) and patient characteristics (sex and five age groups [0, 1-17, 18-44, 45-64, and 65 and older]).

The procedures described in this tutorial all assume inferences to a large population. Therefore, the finite population correction is not used. It is applied only when inferences are being made to the specific population of patients actually hospitalized during the year of the data. Usually analysts prefer not to use the finite population correction because they are interested in the long-run results for hospitals. For example, interest centers on the true, long-run mortality rate for a hospital over multiple years rather than on the mortality rate actually observed in a single year.
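For readers who want to see what leaving out the finite population correction means in a formula, the textbook Taylor-series (with-replacement) variance estimator for a weighted total under a stratified cluster design can be written as below. This is a sketch in generic notation, not a formula quoted from the HCUP documentation: $h$ indexes strata, $i$ indexes hospitals (clusters) within a stratum, $j$ indexes discharges within a hospital, $w_{hij}$ is the discharge weight, and $n_h$ is the number of sampled hospitals in stratum $h$.

```latex
\hat{Y} = \sum_{h=1}^{H} \sum_{i=1}^{n_h} y_{hi\cdot},
\qquad
y_{hi\cdot} = \sum_{j} w_{hij}\, y_{hij},
\qquad
\bar{y}_{h\cdot\cdot} = \frac{1}{n_h} \sum_{i=1}^{n_h} y_{hi\cdot}

\widehat{V}(\hat{Y}) = \sum_{h=1}^{H} \frac{n_h}{n_h - 1}
  \sum_{i=1}^{n_h} \left( y_{hi\cdot} - \bar{y}_{h\cdot\cdot} \right)^2,
\qquad
\widehat{\mathrm{SE}}(\hat{Y}) = \sqrt{\widehat{V}(\hat{Y})}
```

With a finite population correction, each stratum's term would be multiplied by $(1 - f_h)$, where $f_h$ is the first-stage sampling fraction in stratum $h$. Leaving that factor out, as this tutorial does, can only make the estimated standard errors equal or larger.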
Several statistical programming packages can be used to calculate sample statistics and appropriate standard errors based on data from complex sampling designs. Some examples of these statistical programming packages are SAS®, SUDAAN®, Stata®, and SPSS®. I will use SAS in today's demonstrations. In particular, I will use the SAS survey sampling and analysis procedures:
- SURVEYFREQ
- SURVEYLOGISTIC
- SURVEYMEANS
- SURVEYREG
These procedures incorporate the complex sample design of the HCUP nationwide databases into the analysis. They MUST be used when calculating national estimates, regional estimates, and standard errors. The HCUP reports "Calculating Nationwide Inpatient Sample (NIS) Variances for Data Years 2011 and Earlier" and "Calculating National Inpatient Sample (NIS) Variances for Data Years 2012 and Later" provide more information as well as example code for calculating standard errors using other statistical packages.

First I will show you how to produce standard errors for statistics based on the entire National Inpatient Sample. The SAS program code below produces national estimates of the sums, the means, and the standard errors for the number of discharges, the length of stay, the percentage of people who died during hospitalization, and the total hospital charges from the 2013 NIS.
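    LIBNAME NIS2013 "C:\";

    /* FDIED. IS NOT DEFINED IN THE ORIGINAL PROGRAM. THIS DEFINITION IS */
    /* AN ASSUMPTION INFERRED FROM THE LABELS SHOWN IN THE OUTPUT.       */
    PROC FORMAT;
       VALUE FDIED
          .  = ".: Missing"
          .A = ".A: Invalid"
          0  = "0: Did not die in hospital"
          1  = "1: Died in hospital";
    RUN;

    /* ADD A COUNTER (DISCHGS = 1) SO THAT DISCHARGES CAN BE SUMMED */
    DATA NIS_2013_CORE;
       SET NIS2013.NIS_2013_CORE;
       LENGTH DISCHGS 3;
       RETAIN DISCHGS 1;
    RUN;

    PROC SURVEYMEANS DATA=NIS_2013_CORE SUM STD MEAN STDERR MISSING;
       WEIGHT discwt;          /* DISCHARGE WEIGHT */
       CLASS died;             /* TREAT DIED AS CATEGORICAL */
       FORMAT died FDIED.;
       CLUSTER hosp_nis;       /* EACH HOSPITAL IS A CLUSTER */
       STRATA nis_stratum;     /* SAMPLING STRATUM */
       VAR DISCHGS los died totchg;
    RUN;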
Here are the results of the program.

    The SURVEYMEANS Procedure

    Data Summary
    Number of Strata              202
    Number of Clusters           4363
    Number of Observations    7119563
    Sum of Weights           35597792

    Class Level Information
    CLASS Variable  Label                        Levels  Values
    DIED            Died during hospitalization  4       .: Missing
                                                         .A: Invalid
                                                         0: Did not die in hospital
                                                         1: Died in hospital

    Statistics
    Variable  Level                       Label                        Mean       Std Error of Mean  Sum                Std Dev
    DISCHGS                                                            1.00       0.00               35,597,792         296,045
    LOS                                   Length of stay (cleaned)     4.55       0.02               161,796,496        1,466,640
    TOTCHG                                Total charges (cleaned)      39,513.25  480.47             1,378,643,839,214  21,505,352,862
    DIED      .: Missing                  Died during hospitalization  0.00       0.00               9,585              2,359
              .A: Invalid                 Died during hospitalization  0.00       0.00               3,575              855
              0: Did not die in hospital  Died during hospitalization  0.98       0.00               34,912,122         290,483
              1: Died in hospital         Died during hospitalization  0.02       0.00               672,510            7,974

As you can see, there are 202 sampling strata; 4,363 clusters, each of which is a hospital; and 7,119,563 unweighted sample records in the 2013 NIS.

According to the results, it is estimated that nationwide there were a total of 35,597,792 inpatient discharges with a standard deviation of 296,045. In this output, the Std Dev column is the standard deviation of the estimated sum, that is, the standard error of the total. The estimated average length of stay was 4.55 days with a standard error of 0.02 days.
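An estimate and its standard error are often turned into an approximate 95 percent confidence interval as estimate ± 1.96 × standard error. The DATA step below is a minimal sketch that applies this normal approximation to two of the values reported in the output. The data set name CI_SKETCH and its variable names are hypothetical, and survey procedures may use a t-based multiplier that differs slightly from 1.96.

```sas
/* Illustrative sketch: approximate 95% confidence limits from an   */
/* estimate and its standard error (normal approximation, 1.96).    */
/* CI_SKETCH and its variable names are hypothetical.               */
DATA CI_SKETCH;
   LENGTH MEASURE $ 24;

   MEASURE  = "Total discharges";
   ESTIMATE = 35597792;
   SE       = 296045;
   LOWER    = ESTIMATE - 1.96 * SE;
   UPPER    = ESTIMATE + 1.96 * SE;
   OUTPUT;

   MEASURE  = "Mean length of stay";
   ESTIMATE = 4.55;
   SE       = 0.02;
   LOWER    = ESTIMATE - 1.96 * SE;
   UPPER    = ESTIMATE + 1.96 * SE;
   OUTPUT;
RUN;

PROC PRINT DATA=CI_SKETCH NOOBS;
RUN;
```

Under these assumptions the intervals are roughly 35.0 to 36.2 million discharges and 4.51 to 4.59 days.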
The results of the example analysis can be verified using HCUPnet. Here are the results of an HCUPnet query corresponding to our SAS program. In this example, all of the estimates and standard errors from the SAS program agree with the HCUPnet output: total discharges, length of stay, total charges, and in-hospital deaths.

In your own analyses, you may notice small discrepancies between some SAS estimates and HCUPnet. HCUPnet uses data that are stored as SAS files, whereas the NIS files that are purchased through the HCUP Central Distributor are sent as ASCII files. Weights (for making national estimates) in the ASCII files are truncated at the fourth decimal place, so some resulting estimates will be slightly different from those from HCUPnet; however, the differences should be very small.
What if your research focuses on only a subset of discharges from the NIS, such as hospital stays in which a coronary artery bypass graft, or CABG (pronounced "cabbage"), was performed? Does calculating standard errors for a subset of discharges differ from calculating standard errors for estimates based on the entire sample? Yes. When you produce statistics based on all the discharges in the sample, you include discharges from all of the hospitals in the sample, and thus take all of the hospitals, or clusters, in the sample into account. When you analyze only a subset of discharges, some hospitals may have no discharges in the subset, but all hospitals in the sample must still be accounted for to obtain correct standard errors.
There are two methods you can use to account for all of the hospitals in the sample:
- The recommended method uses all of the records in the core file and identifies discharges of interest.
- The alternate method subsets the database and creates "dummy" records for hospitals in every stratum to ensure the appropriate calculation of standard errors. This method is sometimes necessitated by computer memory limitations, and may be of particular use when working with the Nationwide Emergency Department Sample, which contains 30 million unweighted observations.

We will look at both methods.
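The reason every hospital must stay in the analysis can be seen in the textbook with-replacement variance estimator for a subset (domain) total, sketched below in generic notation. It is an illustration, not a formula taken from the HCUP documentation. Here $\delta_{hij}$ equals 1 if discharge $j$ in hospital $i$ of stratum $h$ belongs to the subset and 0 otherwise, $w_{hij}$ is the discharge weight, and $n_h$ is the number of sampled hospitals in stratum $h$.

```latex
z_{hi} = \sum_{j} w_{hij}\, \delta_{hij}\, y_{hij},
\qquad
\bar{z}_{h} = \frac{1}{n_h} \sum_{i=1}^{n_h} z_{hi},
\qquad
\widehat{V}(\hat{Y}_D) = \sum_{h=1}^{H} \frac{n_h}{n_h - 1}
  \sum_{i=1}^{n_h} \left( z_{hi} - \bar{z}_{h} \right)^2
```

A hospital with no discharges in the subset has $z_{hi} = 0$, but it still counts toward $n_h$ and $\bar{z}_h$. Deleting such hospitals changes both quantities and can remove entire strata, which is why the standard errors change.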
The recommended method for calculating standard errors requires more disk space and CPU time than the alternate method because the HCUP nationwide databases have a large number of records, all of which are involved in the recommended method. This may present a challenge in terms of disk space or software capabilities when using a database such as the 2013 NEDS, which contains roughly 30 million unweighted observations. In this case the alternate method, which we will look at shortly, may be more appropriate.

The following program applies the recommended method. It keeps every record in the core file, flags the discharges of interest (CABG = 1 when the principal procedure is CABG, PRCCS1 = 44), and requests statistics by that flag with the DOMAIN statement.

    LIBNAME NIS2013 "C:\";

    /* FLAG CABG DISCHARGES; ALL RECORDS ARE KEPT */
    DATA CABGSUBSET;
       SET NIS2013.NIS_2013_CORE;
       LENGTH DISCHGS CABG 3;
       RETAIN DISCHGS 1;
       IF PRCCS1=44 THEN CABG=1;   /* PRINCIPAL PROCEDURE IS CABG */
       ELSE CABG=0;
    RUN;

    PROC SURVEYMEANS DATA=CABGSUBSET SUM STD MEAN STDERR MISSING;
       WEIGHT discwt;
       CLASS died;
       FORMAT died FDIED.;
       CLUSTER hosp_nis;
       STRATA nis_stratum;
       VAR DISCHGS los died totchg;
       DOMAIN CABG;                /* STATISTICS BY CABG FLAG */
    RUN;
The data summary shows that the output accounts for all 4,363 hospitals in the sample and all 7,119,563 unweighted observations. The first set of statistics, where CABG equals zero, is for discharges for which CABG was not the principal procedure. The second set of statistics, where CABG equals one, is for discharges for which CABG was the principal procedure.

    The SURVEYMEANS Procedure

    Data Summary
    Number of Strata              202
    Number of Clusters           4363
    Number of Observations    7119563
    Sum of Weights           35597792

    Class Level Information
    CLASS Variable  Label                        Levels  Values
    DIED            Died during hospitalization  4       .: Missing
                                                         .A: Invalid
                                                         0: Did not die in hospital
                                                         1: Died in hospital

    Domain Statistics in CABG
    CABG  Variable  Level                       Label                        Mean        Std Error of Mean  Sum                Std Dev
    0     DISCHGS                                                            1.00        0.00               35,440,072         294,316
          LOS                                   Length of stay (cleaned)     4.52        0.02               160,334,536        1,449,387
          TOTCHG                                Total charges (cleaned)      38,971.03   476.96             1,353,657,499,120  21,351,712,415
          DIED      .: Missing                  Died during hospitalization  0.00        0.00               9,545              2,356
                    .A: Invalid                 Died during hospitalization  0.00        0.00               3,545              854
                    0: Did not die in hospital  Died during hospitalization  0.98        0.00               34,757,392         288,803
                    1: Died in hospital         Died during hospitalization  0.02        0.00               669,590            7,935
    1     DISCHGS                                                            1.00        0.00               157,720            4,347
          LOS                                   Length of stay (cleaned)     9.27        0.06               1,461,960          41,387
          TOTCHG                                Total charges (cleaned)      160,477.45  2,469.74           24,986,340,094     738,469,446
          DIED      .: Missing                  Died during hospitalization  0.00        0.00               40                 18
                    .A: Invalid                 Died during hospitalization  0.00        0.00               30                 25
                    0: Did not die in hospital  Died during hospitalization  0.98        0.00               154,730            4,275
                    1: Died in hospital         Died during hospitalization  0.02        0.00               2,920              142

Results show an estimated total of 157,720 hospitalizations in which CABG is the principal procedure, with a standard deviation of 4,347 (the Std Dev column, which is the standard error of the estimated total). The average length of stay, indicated as LOS, is estimated at 9.27 days with a standard error of 0.06 days. The estimated average total charges were $160,477.45 with a standard error of $2,469.74. The mean of the flags indicating death during hospitalization was 0.02. In other words, 2 percent of stays resulted in death during hospitalization, with a standard error that rounds to 0.00 (less than 0.5 percentage points).

The results of the example analysis can be verified using HCUPnet. Here are the results of a query corresponding to our SAS program. Comparing the results of the SAS program with the HCUPnet output shows that all of the estimates are the same.

The alternate method for calculating appropriate standard errors is to subset the nationwide database to the observations of interest. Then, append one "dummy" observation for each of the hospitals included in the nationwide database that is not represented in the subset. The dummy observations ensure that all the hospitals in the sample are taken into account, resulting in the accurate calculation of standard errors. To do this, you must concatenate the subset of interest with the HOSPITAL file.
The Hospital File is a supplemental file that is provided with the NIS Core File. It contains a few key variables for each hospital included in the nationwide database. The following program applies the alternate method.

    LIBNAME NIS2013 "C:\";

    /* CREATE SUBSET OF CABG PROCEDURES */
    DATA CABGSUBSET;
       SET NIS2013.NIS_2013_CORE;
       LENGTH DISCHGS 3;
       RETAIN DISCHGS 1;
       IF PRCCS1=44;
    RUN;

    /* CREATE ANALYSIS FILE */
    DATA CABGSUBSET;
       SET CABGSUBSET
           NIS2013.NIS_2013_HOSPITAL (IN=INHOSP KEEP=HOSP_NIS NIS_STRATUM);
       LENGTH INSUBSET 3;
       INSUBSET = 1;
       IF INHOSP THEN DO;
          INSUBSET = 2;   /* ASSIGN A VALUE OUTSIDE THE SUBSET */
          DISCWT = 1;     /* ASSIGN A VALID WEIGHT */
          /* ASSIGN ANALYSIS VARIABLES TO 0 */
          DISCHGS = 0;
          los = 0;
          died = 0;
          totchg = 0;
       END;
    RUN;

    TITLE "CABG Subset Statistics Using Alternative Method";
    PROC SURVEYMEANS DATA=CABGSUBSET SUM STD MEAN STDERR MISSING;
       WEIGHT discwt;
       CLASS died;
       FORMAT died FDIED.;
       CLUSTER hosp_nis;
       STRATA nis_stratum;
       VAR DISCHGS los died totchg;
       DOMAIN INSUBSET;
    RUN;

Note that this program appends a dummy record for every hospital in the Hospital File, including hospitals that already have discharges in the subset. Because the dummy records are assigned INSUBSET = 2, they do not change the estimates reported for INSUBSET = 1.
The alternate method produces the same correct statistical output as the recommended method. Again, results of the analysis can be verified using HCUPnet.

    The SURVEYMEANS Procedure

    Data Summary
    Number of Strata                 202
    Number of Clusters              4363
    Number of Observations         35907
    Sum of Weights            162083.005

    Domain Statistics in INSUBSET
    INSUBSET  Variable  Level                       Label                        Mean        Std Error of Mean  Sum                Std Dev
    1         DISCHGS                                                            1.00        0.00               157,720            4,347
              LOS                                   Length of stay (cleaned)     9.27        0.06               1,461,960          41,387
              TOTCHG                                Total charges (cleaned)      160,477.45  2,469.74           24,986,340,094     738,469,446
              DIED      .: Missing                  Died during hospitalization  0.00        0.00               40                 18
                        .A: Invalid                 Died during hospitalization  0.00        0.00               30                 25
                        0: Did not die in hospital  Died during hospitalization  0.98        0.00               154,730            4,275
                        1: Died in hospital         Died during hospitalization  0.02        0.00               2,920              142
    2         DISCHGS                                                            0.00        0.00               0                  0
              LOS                                   Length of stay (cleaned)     0.00        0.00               0                  0
              TOTCHG                                Total charges (cleaned)      0.00        0.00               0                  0
              DIED      .: Missing                  Died during hospitalization  0.00        0.00               0                  0
                        .A: Invalid                 Died during hospitalization  0.00        0.00               0                  0
                        0: Did not die in hospital  Died during hospitalization  1.00        0.00               4,363              0
                        1: Died in hospital         Died during hospitalization  0.00        0.00               0                  0

Remember, if the alternate method is not applied correctly and not all hospitals in the sample are included in the analysis, the standard errors will be incorrect.
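One way to confirm that an analysis file built with the alternate method still represents every hospital is to count its distinct strata and hospitals before running the survey procedure. The query below is a sketch that assumes the CABGSUBSET analysis file created by the alternate-method program is available in the current SAS session; the column aliases N_STRATA and N_HOSPITALS are hypothetical. For the 2013 NIS, the counts should match the 202 strata and 4,363 clusters reported in the Data Summary.

```sas
/* Illustrative check: count distinct strata and hospitals in the   */
/* analysis file. Assumes CABGSUBSET exists in the WORK library.    */
PROC SQL;
   SELECT COUNT(DISTINCT nis_stratum) AS N_STRATA,
          COUNT(DISTINCT hosp_nis)    AS N_HOSPITALS
   FROM CABGSUBSET;
QUIT;
```

If either count is lower than expected, some hospitals were dropped and the standard errors from the survey procedure should not be trusted.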
Here is an example of output from a program that does not account for all hospitals in the sample.

    The SURVEYMEANS Procedure

    Data Summary
    Number of Strata                 124
    Number of Clusters              1110
    Number of Observations         31544
    Sum of Weights            157720.005

    Class Level Information
    CLASS Variable  Label                        Levels  Values
    DIED            Died during hospitalization  4       .: Missing
                                                         .A: Invalid
                                                         0: Did not die in hospital
                                                         1: Died in hospital

    Statistics
    Variable  Level                       Label                        Mean        Std Error of Mean  Sum                Std Dev
    DISCHGS                                                            1.00        0.00               157,720            3,439
    LOS                                   Length of stay (cleaned)     9.27        0.06               1,461,960          33,676
    TOTCHG                                Total charges (cleaned)      160,477.45  2,373.93           24,986,340,094     615,460,503
    DIED      .: Missing                  Died during hospitalization  0.00        0.00               40                 18
              .A: Invalid                 Died during hospitalization  0.00        0.00               30                 25
              0: Did not die in hospital  Died during hospitalization  0.98        0.00               154,730            3,385
              1: Died in hospital         Died during hospitalization  0.02        0.00               2,920              134

The number of strata and clusters do not reflect the complete sample: the output shows 124 strata and 1,110 clusters rather than 202 strata and 4,363 clusters. The standard errors produced when all hospitals are not accounted for are incorrect and could lead to erroneous conclusions in your research. For example, the standard deviation of the estimated number of discharges is 3,439 here, compared with 4,347 when all hospitals are accounted for. It is critical to ensure you obtain a correct standard error.

Once you have calculated standard errors for the subset of discharges you are studying, you may want to check to see if there are any statistically significant differences between outcomes or measures of hospital stays in your subset and other subsets. The Z-Test calculator is a convenient way to do just that. It can be accessed by clicking the Z-test calculator link below any HCUPnet query results page.
To test if the length of stay of a discharge with a principal procedure of CABG is significantly different from that of stays which did not have a principal CABG procedure, select the Z-Test calculator.
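A two-sample Z statistic of this kind is commonly computed as the difference between the two estimates divided by the square root of the sum of their squared standard errors. The DATA step below is a sketch of that textbook formula applied to the mean lengths of stay reported earlier for stays with and without a principal CABG procedure (9.27 and 4.52 days, with standard errors of 0.06 and 0.02). It assumes the two estimates can be treated as independent, and it is not taken from the HCUPnet calculator, whose exact computation is not described in this tutorial. The data set and variable names are hypothetical.

```sas
/* Illustrative sketch: two-sample Z statistic from two estimates   */
/* and their standard errors. Names are hypothetical.               */
DATA ZTEST_SKETCH;
   EST1 = 9.27;  SE1 = 0.06;   /* mean LOS, principal CABG          */
   EST2 = 4.52;  SE2 = 0.02;   /* mean LOS, no principal CABG       */

   /* difference divided by the standard error of the difference    */
   Z = (EST1 - EST2) / SQRT(SE1**2 + SE2**2);

   /* two-sided p-value from the standard normal distribution       */
   P_TWO_SIDED = 2 * (1 - PROBNORM(ABS(Z)));
RUN;

PROC PRINT DATA=ZTEST_SKETCH NOOBS;
RUN;
```

With these inputs Z is roughly 75, far beyond the conventional cutoff of 1.96 for a two-sided test at the 0.05 level.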
Perhaps I am also interested in testing to see if there has been a statistically significant change in the number of hospital stays with CABG between 2003 and 2013.
As you calculate sample statistics and standard errors from the HCUP nationwide databases, you should consider the following key points:
- The HCUP nationwide databases are not simple random samples and the usual variance calculations cannot be used.
- When using the HCUP nationwide databases to produce national and regional estimates, a statistical programming package that incorporates the complex sample design into the data analysis must be used.
- When calculating statistics such as standard errors, all hospitals in the sample must always be accounted for, even if you are only interested in a subset of records. This can be accomplished using either of the methods outlined in this tutorial.
If you are looking for more information on the subject matter covered here, several resources are available on the HCUP User Support (HCUP-US) website: www.hcup-us.ahrq.gov. If you can't find what you need, feel free to email the HCUP Technical Assistance staff at hcup@ahrq.gov. AHRQ has research personnel available to respond to technical questions you may have. Inquiries are answered within three business days.

Thank you for accessing this module. There are several other HCUP Online Tutorials. Take a look to see if there are other topics that could be helpful to you. If you have any feedback regarding this module, please email us at hcup@ahrq.gov.

Detailed documentation of HCUP is available on the HCUP User Support website (http://www.hcup-us.ahrq.gov). For documentation on each of the HCUP national databases, click on the links below:
- NIS Database Documentation
- KID Database Documentation
- NEDS Database Documentation
- NRD Database Documentation
Special Methods Documents are available at http://hcup-us.ahrq.gov/reports/methods.jsp.

Internet Citation: HCUP Calculating Standard Errors - Accessible Version. Healthcare Cost and Utilization Project (HCUP). November 2016. Agency for Healthcare Research and Quality, Rockville, MD. www.hcup-us.ahrq.gov/tech_assist/standarderrors/508/508course_2016.jsp.

Last modified 11/18/16