Aligned data models for predictive analytics

Date
2018
Publisher
University of Delaware
Abstract
This dissertation is a study of three application areas where the underlying data is misaligned with the question that is being asked of it. In these three problems, novel transformations and/or structuring of the captured data are deployed prior to any modeling efforts. Subsequent modeling steps are customized to take advantage of the transformed data and to reveal insights desirable to a firm querying its captured data sources. We refer to the manipulated data and the associated modeling techniques as aligned data models.

The first aligned data model developed in this dissertation seeks to model the concept of uniqueness and to explore the predictive capabilities of representing uniqueness within a modeling process. Two application areas are explored: 1) predicting competition intensity, where the more unique offerings of a firm are used to differentiate competitors, and 2) predicting purchase behavior, where the unique attributes of customers are presumed to make them more likely to exhibit similar purchase behavior. This study develops a uniqueness-motivated probabilistic similarity measure, extending the existing literature on similarity measures, to capture the notion of uniqueness, and shows its value in applications. The developed method can capture uniqueness for various kinds of data, including measures that are numeric, categorical, hierarchical, and distance-based.

The second aligned data model focuses on identifying areas where the intensity of bivariate relationships becomes interesting to a decision maker. Since a linear model assumes a relationship whose intensity does not vary over the domain of the explanatory variable, this study examines how to model a bivariate non-linear relationship so that areas of relationship intensity that are of interest to the decision maker can be uncovered. These areas might be regions of no relationship, regions of an intense positive/negative relationship, or regions within a defined interval of intensity.

To facilitate this mining for relationship intensity, we developed the R package gpviz and its underlying methodology. The package learns non-linear relationships using a Gaussian process and numerically approximates a distribution of relationship intensity by using the first derivatives of a Gaussian process prediction model. By visualizing the distribution of first derivatives, users immediately have access to the subtleties of any non-linear bivariate relationship. By specifying the bounds of relationship intensity, users can identify regions of interest (i.e., values of the explanatory variable) that are particularly pertinent to their decision making. Example applications are shown, including one where data from failed commercial banks is used to identify ranges of a bank's deposit-asset ratio where FDIC losses, associated with a bank's failure, tend to increase dramatically.

The third aligned data model extracts insight from problems where particular sequences of hierarchically-structured activities, not just the presence of activities, provide predictive power for future events. In this study, we highlight an application of great value to medical practitioners and their patients, namely surgical adverse event prevention. Here the model seeks to predict the probability of hospitalized patients experiencing surgical adverse events. Given that activities in this setting have an underlying hierarchical relationship, we reshape the data and then leverage a dynamic recurrent neural network to incorporate this data structure, and show that it is capable of improving prediction performance.

In summary, this dissertation focuses on proposing data models which align to the needs and insights of the problems they seek to address. As opposed to workflows that mine associations from captured data, this study motivates all modeling decisions with real needs prior to performing any inference or algorithmic processes. The three parts of this study differ slightly in focus and hence contribute to a well-rounded study of making aligned data models. The uniqueness work is methodologically focused, the non-linear work is computationally focused, and the activity-sequence work is more application driven. Taken together, this work proposes several useful machine learning models and a sophisticated software package, all designed to solve real problems by better aligning a data model with the insights that are to be extracted.
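
The derivative-based idea described for the second aligned data model can be illustrated with a minimal sketch. This is not the gpviz package or its API (gpviz is an R package); it is a NumPy illustration under stated assumptions: a squared-exponential kernel with fixed, hand-picked hyperparameters, posterior function draws on a grid, and finite differences to approximate the first derivatives. The function names `rbf_kernel`, `derivative_draws`, and `region_of_interest`, the synthetic data, and the 0.9 probability threshold are all hypothetical choices made for this example.

```python
import numpy as np


def rbf_kernel(a, b, length_scale, signal_var):
    """Squared-exponential covariance between 1-D input arrays a and b."""
    diff = a[:, None] - b[None, :]
    return signal_var * np.exp(-0.5 * (diff / length_scale) ** 2)


def derivative_draws(x, y, grid, n_draws=500, length_scale=1.0,
                     signal_var=1.0, noise_var=0.01, seed=0):
    """Draw functions from a GP posterior on `grid` and differentiate them.

    Hyperparameters are fixed here (an assumption); a real workflow would
    estimate them from the data. Returns an (n_draws, len(grid)) array of
    finite-difference first derivatives.
    """
    rng = np.random.default_rng(seed)
    K = rbf_kernel(x, x, length_scale, signal_var) + noise_var * np.eye(len(x))
    Ks = rbf_kernel(grid, x, length_scale, signal_var)
    Kss = rbf_kernel(grid, grid, length_scale, signal_var)

    # Standard GP regression posterior via a Cholesky factorization.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks @ alpha
    V = np.linalg.solve(L, Ks.T)
    cov = Kss - V.T @ V

    # Sample through an eigendecomposition, clipping tiny negative
    # eigenvalues that arise from floating-point round-off.
    w, U = np.linalg.eigh(cov)
    scale = np.sqrt(np.clip(w, 0.0, None))
    z = rng.standard_normal((n_draws, len(grid)))
    draws = mean + (z * scale) @ U.T

    # Numerical first derivative of every sampled function along the grid.
    return np.gradient(draws, grid, axis=1)


def region_of_interest(grid, derivs, lower, upper=np.inf, min_prob=0.9):
    """Grid values where the derivative lies in [lower, upper] in at
    least `min_prob` of the posterior draws."""
    inside = (derivs >= lower) & (derivs <= upper)
    return grid[inside.mean(axis=0) >= min_prob]


# Synthetic example (hypothetical data): a relationship that is steep
# near zero and flat in the tails.
data_rng = np.random.default_rng(1)
x = np.linspace(-4.0, 4.0, 60)
y = np.tanh(x) + 0.02 * data_rng.standard_normal(x.size)
grid = np.linspace(-3.5, 3.5, 71)

derivs = derivative_draws(x, y, grid)
roi = region_of_interest(grid, derivs, lower=0.5)
print("explanatory-variable values with slope >= 0.5:", roi)
```

In this sketch, each column of `derivs` approximates the distribution of relationship intensity at one value of the explanatory variable, and the bounds passed to `region_of_interest` play the role of the user-specified intensity bounds described above.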
Keywords
Pure sciences, Applied sciences, Social sciences, Competitor identification, Data interpretation, Hybrid neural network, Predictive modeling, Region of interest, Uniqueness-motivated similarity