Details will be discussed in the next subsection. We will refer to the summarized
information of the medical factors over a specific time interval as features.
Each feature related to Diagnoses, Procedures CPT, Procedures ICD9 and Visits to the
Emergency Room is an integer count of such records for a specific patient during the
specific time interval. Zero indicates the absence of any record. Blood pressure and lab test
features are continuous-valued. Missing values are replaced by the average value among
patients with a record in the same time interval. Features related to tobacco use are
indicators of current- or past-smoker status in the specific time interval. Admission features
contain the total number of days of hospitalization over the corresponding time interval.
Admission records are used both to form the Admission features (past
admission records) and in order to calculate the prediction variable (existence of admission
records in the target year). We treat our problem as a classification problem and each patient
is assigned a label: 1 if there is a heart-related hospitalization in the target year and 0
otherwise.
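As an illustration of the imputation and labeling rules just described, the following is a minimal sketch and not the implementation used in this study. The function names, the use of `None` to mark a missing value, and the representation of admissions as a list of calendar years are all assumptions made for the example.

```python
from statistics import mean
from typing import List, Optional, Sequence


def impute_with_interval_mean(values: Sequence[Optional[float]]) -> List[float]:
    """Fill missing entries (None) with the mean of the observed entries.

    `values` holds one continuous feature (e.g. a lab test in one time
    interval) across all patients; None marks a patient with no record.
    This encoding is an assumption of the sketch.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no patient has a record for this feature")
    fill = mean(observed)
    return [fill if v is None else float(v) for v in values]


def hospitalization_label(admission_years: Sequence[int], target_year: int) -> int:
    """Return 1 if a heart-related admission falls in the target year, else 0."""
    return int(target_year in admission_years)
```

For example, `impute_with_interval_mean([120.0, None, 140.0])` fills the missing entry with the mean of the two observed values, and `hospitalization_label([2008, 2010], 2010)` evaluates to 1.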
2.2 Data Preprocessing
In this subsection we discuss several data organization and preprocessing choices we make.
For each patient, a target year is fixed (the year in which a hospitalization prediction is
sought) and all past patient records are organized as follows.
• Summarization of the medical factors in the history of a patient: Based on
experimentation, an effective way to summarize each patient's medical history is to
form four time blocks for each medical factor, with all corresponding records
summarized over one, two, and three years before the target year and all earlier
records summarized in a fourth block. For blood pressure and tobacco use,
only the year before the target year is kept. This process results in a vector of 212
features for each patient.
• Selection of the target year: Owing to the nature of the data, the two classes are
highly imbalanced. When we fix the target year for all patients to be 2010, the
number of hospitalized patients is about 2% of the total number of patients, which
makes the classification problem much more challenging. Thus, to increase the
number of hospitalized patient examples, if a patient had only one hospitalization
throughout 2007-2010, the year of hospitalization is set as the target year for that
patient. If a patient had multiple hospitalizations, a target year between the first and
the last hospitalization is randomly selected.
• Setting the target time interval to be a year: A year proved to be an
appropriate time interval for prediction for our data set. We conducted trials setting
the time interval for prediction to be 1, 3, 6 and 12 months and used a Support
Vector Machine classifier (a method described later in more detail). Setting the
target time interval to one year yielded the best results. Moreover, given that
hospitalization occurs roughly uniformly within a year, we take the prediction time
interval to be a calendar year.
Dai et al.
Int J Med Inform. Author manuscript; available in PMC 2016 March 01.