Data Source
Data from the IBM® MarketScan® Commercial Subset (October 1, 2015 – December 31, 2018) were used. This database comprises employer- and health plan-sourced data containing medical and pharmacy claims for beneficiaries (employees, their spouses, and dependents) who are covered by employer-sponsored private health insurance in all US census regions. It contains records of inpatient (IP) services, IP admissions, outpatient (OP) services, prescription drug claims, and other medical care, and it captures both the portion of payments made by the employer and the expenses incurred by patients. The database also includes standard demographic variables such as age and gender; however, information on race is not available. Because Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5) diagnoses are not available in claims data, International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) codes were used to identify symptoms and disorders based on clinical input.
The data are anonymized and comply with the requirements of the Health Insurance Portability and Accountability Act; therefore, institutional review board approval was not required.
Study design and sample selection
The analyses for this study used a retrospective cohort design to identify three groups: the true positive PTSD, probable PTSD, and without PTSD cohorts. The study population included civilian, commercially insured adults (aged 18-64 years) in the US. Patients diagnosed with PTSD (true positive PTSD cohort) were defined as those with ≥2 PTSD diagnoses (ICD-10-CM: F43.1) on different dates and ≥2 psychiatric evaluations within a 3-month period beginning on or before the first observed PTSD diagnosis. Patients confirmed not to have PTSD (true negative PTSD cohort) were defined as those who never had a diagnosis of reaction to severe stress/adjustment disorder (ICD-10-CM: F43) at any time and who had evidence that such a diagnosis was ruled out, based on the presence of ≥2 psychiatric evaluations within a 3-month period.
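The cohort rules above can be sketched in code. This is only an illustrative reading of the text, not the study's implementation: the function names, the 91-day approximation of "3 months", the choice to open the window at the earlier of two evaluations, and the handling of patients who have an F43 diagnosis but do not meet the true positive definition (returned here as `None`) are all assumptions.

```python
from datetime import date

WINDOW_DAYS = 91  # assumption: "3 months" approximated as 91 days


def has_two_evals_in_window(eval_dates, anchor=None):
    """True if two evaluations fall within one window.

    Assumption: the window opens at the earlier evaluation. If `anchor`
    is given, that earlier evaluation must be on or before the anchor.
    """
    days = sorted(eval_dates)
    # Checking adjacent pairs is enough: the closest later evaluation
    # is always the next one in sorted order.
    for first, second in zip(days, days[1:]):
        opens_in_time = anchor is None or first <= anchor
        if opens_in_time and (second - first).days <= WINDOW_DAYS:
            return True
    return False


def classify(ptsd_dx_dates, f43_dx_dates, eval_dates):
    """Return 'true_positive', 'true_negative', 'unlabeled', or None.

    ptsd_dx_dates: dates of F43.1 diagnoses
    f43_dx_dates:  dates of any F43 diagnosis (includes F43.1)
    eval_dates:    dates of psychiatric evaluations
    """
    ptsd_days = sorted(set(ptsd_dx_dates))  # distinct diagnosis dates
    if len(ptsd_days) >= 2 and has_two_evals_in_window(eval_dates, ptsd_days[0]):
        return "true_positive"
    if f43_dx_dates:
        return None  # assumption: not assigned to any cohort
    if has_two_evals_in_window(eval_dates):
        return "true_negative"
    return "unlabeled"


example = classify(
    ptsd_dx_dates=[date(2017, 1, 10), date(2017, 2, 1)],
    f43_dx_dates=[date(2017, 1, 10), date(2017, 2, 1)],
    eval_dates=[date(2016, 12, 20), date(2017, 1, 15)],
)
print(example)
```

In this example the patient has two PTSD diagnoses on different dates and two evaluations 26 days apart, the first of which precedes the first diagnosis, so the sketch assigns the true positive label.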
Patients with probable or improbable undiagnosed PTSD (probable PTSD and improbable PTSD cohorts) were identified among patients for whom PTSD status could not be confirmed from the available data (unlabeled cohort). This cohort included patients without a diagnosis of reaction to severe stress/adjustment disorder and without evidence ruling out such a diagnosis. Probable and improbable undiagnosed PTSD were defined based on model performance metrics, as described below. Patients in the improbable PTSD cohort and patients in the true negative PTSD cohort together formed a representative sample of the general civilian population without PTSD (without PTSD cohort); this population was used for descriptive comparison purposes only.
The index date was defined as follows: for patients in the true positive PTSD cohort, the calendar date of the first observed PTSD diagnosis; for patients in the true negative PTSD cohort, the most recent calendar date followed by 6 months of continuous health insurance enrollment; and for patients in the unlabeled cohort, a randomly selected calendar date within the most recent period of continuous enrollment, with at least 6 months of continuous enrollment both before and after the index date. For all three cohorts, the study period was defined as the 6-month period from the index date until the earlier of the end of data availability (December 31, 2018) or the end of continuous health insurance enrollment.
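As a rough illustration of the date logic, the sketch below draws a random index date for an unlabeled patient and derives the study period. The helper names, the 182-day approximation of "6 months", and the choice to cap the study period at the data cut-off are assumptions made for this example.

```python
import random
from datetime import date, timedelta

DATA_END = date(2018, 12, 31)      # end of data availability (from the text)
SIX_MONTHS = timedelta(days=182)   # assumption: 6 months approximated as 182 days


def random_index_date(enroll_start, enroll_end, rng):
    """Pick a random date with 6 months of enrollment before and after it.

    Returns None when the enrollment span is too short to allow this.
    """
    earliest = enroll_start + SIX_MONTHS
    latest = min(enroll_end, DATA_END) - SIX_MONTHS
    if earliest > latest:
        return None
    offset = rng.randrange((latest - earliest).days + 1)
    return earliest + timedelta(days=offset)


def study_period(index_date, enroll_end):
    """Study period: index date up to 6 months later, capped at the
    end of data availability or the end of enrollment."""
    end = min(index_date + SIX_MONTHS, DATA_END, enroll_end)
    return index_date, end


rng = random.Random(0)  # seeded only to make the example repeatable
index = random_index_date(date(2016, 1, 1), date(2018, 12, 31), rng)
start, end = study_period(index, date(2018, 12, 31))
print(start, end)
```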
Feature selection
Features were selected for inclusion in the machine learning model based on the scientific literature [1, 28,29,30,31,32,33], discussions with a clinical expert, and the medical history captured in claims data. Trauma indicators were expected to be substantially under-reported in claims data, because traumatic events that occurred before the start of data availability, or that did not lead to healthcare resource use, could not be captured. Therefore, trauma indicators were not used as a main feature in the model; instead, the model was built using information routinely collected in clinical practice. The features included information on patients' demographics, clinical characteristics, symptoms and complications potentially related to PTSD, treatments received, and emergency department (ED) utilization. Features were identified in the true positive PTSD and true negative PTSD cohorts on the index date (demographic characteristics) or during the study period (clinical characteristics, symptoms and complications potentially related to PTSD, treatments received, and ED utilization).
Both binary variables (i.e., presence of the feature) and count variables (i.e., the number of days on which a claim for the feature was observed) were included in the model. For example, whether or not a particular treatment was received was recorded as a binary variable, while the number of prescription fills for the treatment was recorded as a count variable [34]. A total of 490 features were available for modeling (Additional file 1).
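To make the distinction between binary and count variables concrete, here is a minimal sketch that derives both from a flat list of claims. The record layout `(patient_id, feature_name, service_date)` and the `_any` / `_days` suffixes are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_features(claims):
    """Derive binary and count features per patient.

    claims: iterable of (patient_id, feature_name, service_date) tuples
            (hypothetical layout, not the MarketScan schema).
    """
    # Collect the distinct days on which each feature was observed.
    days_seen = defaultdict(set)
    for patient_id, feature, service_date in claims:
        days_seen[(patient_id, feature)].add(service_date)

    features = defaultdict(dict)
    for (patient_id, feature), days in days_seen.items():
        features[patient_id][f"{feature}_any"] = 1           # binary: present
        features[patient_id][f"{feature}_days"] = len(days)  # count: distinct days
    return dict(features)


claims = [
    ("p1", "ssri", "2017-01-05"),
    ("p1", "ssri", "2017-01-05"),  # same-day duplicate, counted once
    ("p1", "ssri", "2017-02-05"),
    ("p2", "ed_visit", "2017-03-01"),
]
features = build_features(claims)
print(features)
```

Features a patient never had are simply absent from the dictionary here; a modeling pipeline would typically fill them with zero.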
Statistical analysis
Random forest model development
A random forest machine learning model was developed and trained to distinguish between patients with and without PTSD using the true positive PTSD and true negative PTSD cohorts. A random forest is an ensemble of decision trees in which each tree is built from a random sample of the data and a random selection of the features. This approach was chosen because it can model non-linear relationships between features and outcome variables and can handle a large feature space [25]. The random forest model was implemented with a maximum of 200 trees, above which model performance stabilized; default values were used for the minimum node size (one) and tree depth (unrestricted). The model was used to identify the most important features for predicting PTSD status; importance was measured by permutation (i.e., the amount of prediction error added to the model when the information carried by a feature is lost).
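The setup described above can be approximated with scikit-learn, as a sketch rather than a reproduction of the study's code. The text does not name the software used, so the mapping of "minimum node size" to `min_samples_leaf=1`, the synthetic data from `make_classification`, and the hold-out split are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labeled cohorts (1 = PTSD, 0 = no PTSD).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# 200 trees, minimum leaf size of one, unrestricted depth.
model = RandomForestClassifier(
    n_estimators=200, min_samples_leaf=1, max_depth=None, random_state=0
)
model.fit(X_train, y_train)

# Permutation importance: drop in score when one feature's values are shuffled.
result = permutation_importance(
    model, X_test, y_test, n_repeats=5, random_state=0
)
ranking = result.importances_mean.argsort()[::-1]  # most important first

# Predicted probability of the positive class, as would be computed
# for patients in an unlabeled cohort.
scores = model.predict_proba(X_test)[:, 1]
print(ranking[:3], scores[:5])
```

Shuffling a feature breaks its link to the outcome while keeping its distribution, which is why the resulting loss of accuracy serves as a measure of how much the model relies on that feature.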
The final random forest model, obtained after feature reduction [35], was trained on 324 predictive features and then applied to the unlabeled cohort to identify individuals with probable and improbable undiagnosed PTSD (Supplementary File 1).
Evaluation of the model performance
The performance of the random forest model was evaluated using the area under the receiver operating characteristic (ROC) curve (AUC) and F-beta scores. The AUC provides an aggregate measure of the model's performance across all classification thresholds. In general, the higher the AUC, the better the model performance; a model that randomly predicts the likelihood of patients having PTSD would have an AUC of 0.5, while a model that predicts the likelihood of patients having PTSD with 100% accuracy would have an AUC of 1.0.
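One way to see why an uninformative model scores 0.5 and a perfect one scores 1.0 is the pairwise interpretation of the AUC: it equals the probability that a randomly chosen positive patient receives a higher score than a randomly chosen negative patient, with ties counted as one half. The function below is an illustrative pure-Python sketch of that definition, not the study's evaluation code.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    labels: 1 for positive, 0 for negative
    scores: predicted probabilities or any ranking score
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # A correctly ordered pair counts 1, a tie counts 0.5, otherwise 0.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 3 of 4 pairs ordered correctly
```

This quadratic-time version is meant for clarity; library implementations compute the same quantity more efficiently from sorted scores.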
F-beta scores are a measure of model performance consisting of the weighted harmonic mean of model precision and model recall at each potential classification threshold. The value of beta sets the relative weighting of precision and recall: beta = 1 weights precision and recall equally, and beta < 1 weights precision more heavily than recall. As with the AUC, a higher F-beta score generally indicates better model performance. Because this study did not aim to identify all undiagnosed patients with PTSD, but rather sought confidence that patients predicted to have undiagnosed PTSD might actually have PTSD, multiple beta values that weight precision more heavily than recall were assessed.
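The standard F-beta formula is F_beta = (1 + beta²) · precision · recall / (beta² · precision + recall). The sketch below computes it at a single classification threshold; the helper names and the toy data are hypothetical, and the specific beta values assessed in the study are not given in the text.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Same precision and recall, two weightings: a smaller beta moves the
# score toward precision (0.8) and away from recall (0.4).
print(f_beta(0.8, 0.4, beta=1.0))
print(f_beta(0.8, 0.4, beta=0.5))
```

With precision 0.8 and recall 0.4, lowering beta from 1 to 0.5 raises the score, which illustrates how beta < 1 rewards a model for being right about the patients it flags rather than for finding every case.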
Descriptive analysis of patient characteristics by PTSD status
Patient characteristics, including demographic and clinical characteristics, symptoms and complications potentially associated with PTSD, treatments received, healthcare costs, and healthcare resource utilization (HRU), were described separately for the true positive PTSD, probable PTSD, and without PTSD cohorts. Demographic characteristics (e.g., age, sex) were described at the index date, while clinical characteristics (e.g., Charlson Comorbidity Index [CCI], comorbidities) were described during the study period. Symptoms and complications potentially associated with PTSD were described during the study period and included those related to general health or quality of life (e.g., difficulty sleeping); behavioral symptoms or disorders (e.g., eating disorders); symptoms affecting cognition or perception (e.g., somnolence, lightheadedness); physiological symptoms or reactions (e.g., abnormal blood pressure, abnormal heart rate); substance use indicators (e.g., rehabilitation services); and mental, behavioral, and neurodevelopmental disorders (e.g., major depressive disorder [MDD], anxiety disorders), among others. Treatments received by patients in the three cohorts were described during the study period. Total healthcare costs (2018 USD) and HRU incurred during the study period included medical (IP, OP, and ED) and pharmacy components and were reported per patient per 6 months (PPP6M). Means, standard deviations, and medians were reported for continuous variables, and frequency counts and percentages for categorical variables. No statistical comparisons were made between cohorts; all differences reported in this study are numerical.