Data Fusion: A Way to Provide More Data to Mine in?
Peter van der Putten a,b
a Sentient Machine Research, Baarsjesweg 224, 1058 Amsterdam, The Netherlands
pvdputten@smr.nl
Abstract
In everyday data mining practice, the availability of data is often a serious problem. For instance, in database marketing elementary customer information resides in customer databases, but market survey data is only available for a subset or even a different sample of customers. Data fusion provides a way out by combining information from different sources for each customer. We present a simple data fusion procedure based on a nearest neighbor algorithm. We suggest different measures to evaluate the quality of the fusion process. An experiment on real world data is described to illustrate the added value of our approach.
In marketing, direct forms of communication are becoming increasingly popular. Instead of broadcasting a single message to all customers through traditional mass media such as television and print, the most promising potential customers receive personalized offers at the most appropriate time and through the most appropriate channels. It therefore becomes more important to gather information about media consumption, attitudes, product propensity, etc. at the individual level.
The amount of data collected about customers is generally growing very fast; however, it is often scattered across a large number of sources. For instance, elementary customer information resides in customer databases, but market survey data depicting a richer view of the customer is only available for a small sample. Simply collecting all this information for the whole customer database in a single source survey is often too expensive.
Figure 1: Data Fusion in a nutshell
A widely accepted alternative within database marketing is to buy external socio-demographic data that has been collected at a regional level. All customers living in a single region, for instance in the same zip code area, are assigned the same values. However, the kind of information that can be acquired this way is relatively limited. Furthermore, the assumption that all customers within a region are similar is questionable at the very least.
Data fusion techniques can provide a way out.
Information from different sources is combined by matching customers on
variables that are available in both data sources. The resulting enriched data
set can be used for all kinds of data mining and database marketing analyses.
In this paper we will give a practical introduction to the application of data fusion for data mining in a database marketing context (section 2). This will be illustrated with some preliminary empirical results on a real world data set (section 3). Rather than presenting a complete solution, we argue that data fusion can be a valuable tool in everyday data mining practice. Furthermore, we aim to demonstrate that the data fusion problem is far from simple and contains many interesting topics for future algorithmic and methodological data mining research (section 4).
Data fusion is not new. From the 1970s through the 1990s the subject was quite popular, particularly in the fields of media research [1,2,4,5,7,16] and micro-economic analysis [3,10,11]. To this day, data fusion is used to reduce the required number of respondents or questions in a survey. For instance, for the Belgian National Readership Survey, questions about media and questions about products are collected from two separate groups of 10,000 respondents and fused into a single survey, thus reducing costs and the time a respondent needs to complete the survey [9].
In our research the ultimate aim is to fuse entire customer databases with surveys instead of merging surveys with surveys. This opens up new ways to exploit existing survey data. Furthermore, the single source alternative - asking all the questions to all the customers - might be an option for merging surveys, but in most cases it is not even a possibility when merging large customer databases with survey data.
The core data fusion concept is illustrated in figure 1. Assume a company has one million customers. For each customer, 50 variables are stored in the customer database. Furthermore, there exists a market survey with 1000 respondents, not necessarily customers of the company, who were asked questions corresponding to 1500 variables. In this example 25 variables occur in both the database and the survey: these are called common variables. Now assume that we want to transfer the information from the market survey, the donor, to the customer database, the recipient. For each record in the customer database the data fusion procedure is used to predict the most probable answers to the market survey questions, thus creating a virtual survey for each customer. The variables to be predicted are called fusion variables.
The most straightforward procedure to perform
this kind of data fusion is statistical
matching, which can be compared to k-nearest neighbor classification. For
each recipient record those k donor
records are selected that have the smallest distance to the recipient record,
with the distance measure defined over the common variables. Based on this set
of k donors the values of the fusion
variables are estimated, e.g. by taking the average for ordinals or the mode
for nominals.
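To make this procedure concrete, the following minimal sketch implements statistical matching as k-nearest neighbor prediction over the common variables. It is illustrative only: the use of Python with numpy and pandas, the function name fuse, the Euclidean distance and the equal weighting of standardized commons are our own assumptions, and the commons are assumed to be numerically coded.

    import numpy as np
    import pandas as pd

    def fuse(donor, recipient, commons, fusion_vars, k=10):
        # Standardize the commons on donor statistics so no single variable dominates the distance
        mu, sigma = donor[commons].mean(), donor[commons].std()
        d = (donor[commons] - mu) / sigma
        r = (recipient[commons] - mu) / sigma
        preds = []
        for _, row in r.iterrows():
            dist = np.sqrt(((d - row) ** 2).sum(axis=1))   # distance over the common variables
            nearest = donor.loc[dist.nsmallest(k).index]   # the k most similar donor records
            est = {}
            for v in fusion_vars:
                if pd.api.types.is_numeric_dtype(donor[v]):
                    est[v] = nearest[v].mean()             # average for ordinal/binary fusion variables
                else:
                    est[v] = nearest[v].mode().iloc[0]     # mode for nominal fusion variables
            preds.append(est)
        # Return the recipient records enriched with the predicted fusion variables
        return pd.concat([recipient.reset_index(drop=True), pd.DataFrame(preds)], axis=1)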
Sometimes separate fusions are carried out for
groups for which 'mistakes' in the predictions are unacceptable, e.g.
predicting 'pregnant last year' for men. In this case the gender variable will
become a so-called cell variable; the
match between recipient and donor must be 100% on the cell variable, otherwise
they won't be matched at all.
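Continuing the sketch above (again our own illustration rather than a prescribed procedure), a cell variable can be enforced by running a separate fusion per cell, so that a recipient is only ever matched to donors with the identical cell value:

    def fuse_with_cells(donor, recipient, cell_var, commons, fusion_vars, k=10):
        parts = []
        for value, rec_part in recipient.groupby(cell_var):
            don_part = donor[donor[cell_var] == value]   # only donors with a 100% match on the cell variable
            parts.append(fuse(don_part, rec_part, commons, fusion_vars, k))
        return pd.concat(parts, ignore_index=True)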
An important issue in data fusion is measuring
the quality of the fusion; this is far from straightforward.
The bottom line evaluation is what we call the
external evaluation. Assume for instance that we want to improve the response
on mailings for a certain set of products, so this was the reason why the
fusion variables were added in the first place. In this case, one way to
evaluate the external quality is to check whether an improved mail response
prediction model can be built when fused data is included in the input.
However, socio-demographic and other external variables often add only limited value for purely predictive modeling. They are more valuable for descriptive data mining, e.g. discovering why people are interested in these products [13,14]. The added value of the fused database is that
associations can be determined between variables that are unique for the
customer database and fusion variables from the market survey.
The internal evaluation of the data fusion procedure is simply the a priori evaluation carried out before external evaluation has taken place. We distinguish between evaluating representativeness and evaluating predictiveness, although making this distinction formal is an interesting problem in its own right. One challenge for both the fusion procedure and the evaluation of representativeness of the fused data is that the donor and the recipient might be samples from different populations, e.g. a customer database from a bank versus a national media survey. If both donor and recipient are samples from the same population, penalty factors can be used to punish 'popular' donors, to ensure that donors are used evenly and that joint distributions between commons and fusion variables are retained [11]. The penalty factor is a popularity measure that is added to the distance measure. Standard statistical tests can be used
to check whether there are significant deviations in frequency distributions
for variable values in the fused data set. An interesting problem when testing
predictiveness is that in general there are no target values available for the
recipient, so measures like root mean squared error and classification error
can generally only be calculated for the donor.
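As an illustration of the penalty idea mentioned above, the variant below (a simplification of ours, not the exact scheme of [11]) keeps a running count of how often each donor has been used and adds it, scaled by a factor alpha, to the distance, so that 'popular' donors gradually become less attractive; for brevity only numeric fusion variables are handled:

    def fuse_with_penalty(donor, recipient, commons, fusion_vars, k=10, alpha=0.1):
        mu, sigma = donor[commons].mean(), donor[commons].std()
        d = (donor[commons] - mu) / sigma
        r = (recipient[commons] - mu) / sigma
        usage = pd.Series(0.0, index=donor.index)           # how often each donor has served so far
        preds = []
        for _, row in r.iterrows():
            dist = np.sqrt(((d - row) ** 2).sum(axis=1)) + alpha * usage
            nearest_idx = dist.nsmallest(k).index
            usage.loc[nearest_idx] += 1                      # punish these donors in later matches
            nearest = donor.loc[nearest_idx]
            preds.append({v: nearest[v].mean() for v in fusion_vars})
        return pd.DataFrame(preds, index=recipient.index)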
In this section we will describe some
preliminary experiments and results with a standard statistical matching data
fusion procedure. We assume the following hypothetical business case. A bank
wants to learn more about its credit card customers and expand the market for
this product. Unfortunately, there is no survey data available that includes
credit cardholdership; this variable is only known for actual customers. Data fusion is used to enrich a customer database with survey data; we assume that donor and recipient are random samples from the same population. The resulting
data set serves as a starting point for further descriptive and predictive data
mining analysis.
To approximate the bank case we did not use a separate donor, but chose to split up an existing real world market survey into a donor and a recipient. The recipient contained 2000 records, with a cell variable for gender and commons for age, marital status, region, number of persons in the household and income. Furthermore, the recipient contained a unique variable for credit card ownership, the target variable to model. The donor contained 4880 records, with 36 variables that we expected might be related to credit cardholdership: general household demographics, holiday and spare time activities, financial product usage and personal attitudes. The original survey contains over a thousand variables and over 5000 possible variable values.
We fused the donor and the recipient using 4-fold cross-validation on the donor to determine the optimal k, based on root mean squared error. Only ordinal and binary fusion variables were included, so we restricted ourselves to predicting averages.
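A minimal sketch of how such a selection of k could be organized, reusing the fuse sketch given earlier and scikit-learn's KFold; the candidate values of k and the preprocessing are our own illustrative choices:

    from sklearn.model_selection import KFold

    def select_k(donor, commons, fusion_vars, candidates=(1, 5, 10, 20, 50)):
        best_k, best_rmse = None, np.inf
        for k in candidates:
            errors = []
            for train_idx, test_idx in KFold(n_splits=4, shuffle=True).split(donor):
                train, test = donor.iloc[train_idx], donor.iloc[test_idx]
                # Hide the true fusion values, predict them with the fusion sketch, then compare
                pred = fuse(train, test.drop(columns=fusion_vars), commons, fusion_vars, k)
                err = pred[fusion_vars].to_numpy() - test[fusion_vars].to_numpy()
                errors.append(np.sqrt(np.mean(err ** 2)))
            if np.mean(errors) < best_rmse:
                best_k, best_rmse = k, np.mean(errors)
        return best_k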
Apart from the root mean squared error cross-validation procedure, we restricted ourselves to representativeness evaluation.
First we compared averages for all variables for the donor and the recipient. As could be expected from the donor and recipient sizes and the fact that both sets were generated from the same source, there were not many significant differences between donor and recipient for the common variables. Within the recipient 'not married' was over-represented (30.0% instead of 26.6%), 'married and living together' was under-represented (56.1% versus 60.0%), and the countryside and larger families were slightly over-represented. The average fusion variable values were very well preserved in the
recipient survey compared to the donor survey. Only the averages of "Way
Of Spending The Night during Summer Holiday" and "Number Of Savings
Accounts" differed significantly, respectively by 2.6% and 1.5%.
Apart from general statistics we wanted to
evaluate the preservation of relations between variables, for which we used the
following weak measures. For each common variable we listed the correlation
with all fusion variables (the real values for the donor and the predicted
values for the recipient). The mean difference between common-fusion
correlations in the donor versus the recipient was 0.12 ± 0.028. In other
words, these correlations were very well preserved. A similar procedure could
be carried out for the fusion variables with respect to each other. Further
work should also be done on the application of penalty factors to improve
representativeness. However, our preliminary experiments have demonstrated that penalties have a negative effect on the prediction quality (measured by RMSE).
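The correlation comparison itself is compact; the sketch below (our own formulation, assuming numerically coded variables) reports the mean and standard deviation of the absolute differences between common-fusion correlations in the donor and in the enriched recipient:

    def correlation_preservation(donor, fused_recipient, commons, fusion_vars):
        diffs = []
        for c in commons:
            for f in fusion_vars:
                corr_donor = donor[c].corr(donor[f])                      # real fusion values
                corr_recip = fused_recipient[c].corr(fused_recipient[f])  # predicted fusion values
                diffs.append(abs(corr_donor - corr_recip))
        return np.mean(diffs), np.std(diffs)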
To experiment with the added value of data fusion for further analysis (external evaluation), we first performed some descriptive data mining to discover relations between the target variable, credit cardholdership, and the fusion variables, using straightforward univariate techniques. First we selected the top 10 fusion variables with the highest absolute correlations with the target (see Table I). Note that, in contrast to standard practice, it is perfectly acceptable to include fusion variables such as 'frequency usage credit card' in the set of input variables for prediction, because these variable values are themselves predictions based on the common variables. Smaller effects included 'Need for cognition' (average 1.05 times higher) and fewer 'housewives' (0.9 times lower). These results can already offer a lot of insight to a marketer. Also, when new products are launched, their relation with survey variables can be estimated without carrying out a new survey.
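A sketch of this selection step, assuming the enriched recipient is held in a pandas DataFrame and that the target is stored in a hypothetical binary column named credit_card:

    # Ten fusion variables most strongly correlated (in absolute value) with the target
    top10 = (fused_recipient[fusion_vars]
             .corrwith(fused_recipient['credit_card'])
             .abs()
             .sort_values(ascending=False)
             .head(10))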
The descriptive results were also used to guide the predictive data mining modelling process. In this case we wanted to investigate whether different computational learning methods would be able to exploit the added information in the fusion variables. We included neural networks, linear regression, k-nearest neighbor and a version of Naive Bayes adapted for ordinals (Naive Bayes Gaussian). We report results over 10 runs with train and test sets of equal size. The quality of the models was measured by the so-called c-index, a rank-based non-parametric measure related to Kendall's tau [12], which measures the concordance between the ordered lists of real and predicted cardholders (see [15] for details on the algorithms and the c-index).
We compared models trained on commons only, for which no fusion was actually needed, with models trained on commons plus either all fusion variables or a selection of correlated fusion variables (see Table II; c=0.5 means random prediction, c=1 means perfect prediction). These results indicate that for this data set the models that include the fusion variables outperform the models built using commons only. For linear regression these differences were most significant. Significance was tested with a one-sided two-sample t-test on the 'fusion' runs versus the 'only commons' runs. In figure 2, cumulative response curves are drawn for the linear regression models. The test recipients are ordered from high score to low score on the x-axis. The data points correspond to the actual proportion of cardholders up to that percentile. Random selection of customers results in an average proportion of 32.5% cardholders. Credit cardholdership can be predicted quite well: the top 10% of cardholder prospects according to the model contains around 50-65% cardholders. The added logarithmic trend lines indicate that the models which include fusion variables are better at 'creaming the crop', i.e. selecting the top prospects.
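For reference, a small sketch of the c-index as described above: the concordance between predicted scores and actual cardholdership, i.e. the probability that a randomly chosen cardholder is scored higher than a randomly chosen non-cardholder, with ties counted as one half (for a binary target this coincides with the area under the ROC curve). Inputs are assumed to be numpy arrays:

    def c_index(y_true, y_score):
        pos = y_score[y_true == 1]                          # scores of actual cardholders
        neg = y_score[y_true == 0]                          # scores of non-cardholders
        concordant = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
        return concordant / (len(pos) * len(neg))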
Welfare class
Income household above average
Is a manager
Manages which number of people
Time per day of watching television
Eating out (privately): money per person
Frequency usage credit card
Frequency usage regular customer card
Statement current income
Spend more money on investments

Table I: Fusion variables in the recipient strongly correlated with credit card ownership
Figure 2: Lift chart linear regression models
for predicting credit card ownership (7 randomly selected runs)
                                   SCG Neural Network          Linear Regression            Naive Bayes Gaussian         k-NN
Only commons                       c=0.692 ± 0.012             c=0.692 ± 0.014              c=0.701 ± 0.015              c=0.702 ± 0.012
Commons & correlated fusion vbls   c=0.703 ± 0.015 (p=0.041)   c=0.724 ± 0.012 (p=2.1e-5)   c=0.720 ± 0.012 (p=0.0034)   c=0.716 ± 0.013 (p=0.0093)
Commons & all fusion vbls          c=0.694 ± 0.019 (p=0.38)    c=0.713 ± 0.013 (p=0.0017)   c=0.719 ± 0.012 (p=0.0049)   c=0.720 ± 0.012 (p=0.0023)

Table II: c-indexes (p-values are for the comparison with the corresponding commons-only model)
One could argue that in theory by applying data
fusion no information is added to the recipient survey, because this
information is derived directly from the commons. However, in practice data
fusion can still be a valuable tool. For descriptive data mining tasks, the
fusion variables and the patterns derived from these variables can be more
understandable and easier to interpret for an end user than patterns derived
solely from commons. Furthermore, it is a well-known practical fact that it often makes sense to include derived variables to improve prediction quality.
In this case, fusion can make it easier for ‘imperfect’ algorithms such as
linear regression to discover complex non-linear relations between commons and
target variables, by exploiting the information in the fusion variables. It is
highly recommended to use appropriate variable selection techniques to remove
the noise that is added by ‘irrelevant’ fusion variables (to counter the ‘curse
of dimensionality’).
It goes without saying that evaluating the
quality of data fusion is crucial for acceptance. We hope to have demonstrated
that this is not straightforward. A lot of interesting research can be done in
this area, especially in the field of evaluating the recipient fusion variable
predictions, for which no targets are available. Even a relatively simple
question such as determining the optimal set of commons has interesting research dimensions. To structure all these choices we have started to build a data fusion process model, analogous to the CRISP-DM model for data mining [6].
Also, the core fusion algorithms provide a lot
of room for research and improvement. There is no fundamental reason why the
fusion algorithm should be based on k-nearest neighbor prediction instead of
clustering methods, decision trees, regression, the expectation-maximization
(EM) algorithm or other statistical and machine learning algorithms (see for instance [8]). When shifting from fusing surveys to fusing customer databases with surveys, an extra challenge must be faced: scoring millions of customer database records instead of thousands of survey records. All these efforts work towards a single vision:
keeping all knowledge about a customer up to date, including soft information
such as predictions based on measurements from different sources.
The promise of data fusion is indeed attractive: gaining insight into individual customers at a fraction of the price it would cost to collect all this information in a single source survey. The application of data fusion will increase the value of data mining, because there is more integrated data to mine in. However, there is still a lot of interesting research to be done to evaluate data fusion quality and to improve the still rather straightforward data fusion algorithms.
Acknowledgements
We would like to thank Michel de Ruiter,
Martijn Ramaekers, Evelien Langendoen, Michiel van Wezel and Joost Kok for
their comments. Part of this work has been performed within "The Fusion
Factory" project, which is supported by the Dutch Ministry of Economic
Affairs, through the KREDO stimulation initiative for development of electronic
services.
References
[1] Antoine, J. and G. Santini (1987). European Research, 15 (August), 178-187.
[2] Baker, K., P. Harris and J. O'Brien (1989). Data Fusion: An Appraisal and Experimental Evaluation. Journal of the Market Research Society, 31 (2), 152-212.
[3] Radner, D.B., A. Rich, M.E. Gonzalez, T.B. Jabine and H.J. Muller (1980). Report on Exact and Statistical Matching Techniques. Statistical Working Paper 5, Office of Federal Statistical Policy and Standards, US Department of Commerce.
[4] O'Brien, S. (1991). The Role of Data Fusion in Actionable Media Targeting in the 1990's. Marketing and Research Today, 19 (February), 15-22.
[5] Bronner, A.E. (1989). Einde van de fusie fobie in Nederland? [End of the fusion phobia in the Netherlands?]. In: Jaarboek van de Nederlandse vereniging van marktonderzoekers 1988/1989, 9-18.
[6] Chapman, P., J. Clinton, T. Khabaza, T. Reinartz and R. Wirth (1999). The CRISP-DM Process Model. Draft discussion paper, CRISP Consortium, March 1999. http://www.crisp-dm.org/.
[7] Harris, P. and K. Baker (1998). Data Fusion. Admap, June 1998.
[8] Kamakura, W.A. and M. Wedel (1996). Statistical Data-Fusion for Cross-Tabulation. Research Report, SOM Institute, Groningen University, The Netherlands.
[9] Lokker, R. (1998). Bereikstudies Pers, Bioscoop en PMP [Reach studies for press, cinema and PMP]. Centrum voor Informatie over de Media, Brussels, Belgium.
[10] van Noordwijk, A.J. (1983). Technical Notes on a Statistical Matching Experiment. Chapter 8 in: Koppeling van Databestanden [Linking of Data Files], Sociaal en Cultureel Planbureau, Rijswijk, The Netherlands.
[11] Paass, G. (1986). Statistical Match: Evaluation of Existing Procedures and Improvements by Using Additional Information. In: Microanalytic Simulation Models to Support Social and Financial Policy, Orcutt, G.H. and K. Merz (eds.). Elsevier Science Publishers BV, North-Holland.
[12] Press, W.H., S.A. Teukolsky, W.T. Vetterling and B.P. Flannery (1992). Numerical Recipes in C: The Art of Scientific Computing. 2nd edition, Cambridge University Press, Cambridge.
[13] van der Putten, P. (1999). Datamining in Direct Marketing Databases. In: Baets, W. (ed.), A Collection of Essays on Complexity and Management. World Scientific, Singapore.
[14] van der Putten, P. (1999). A Datamining Scenario for Stimulating Credit Card Usage by Mining Transaction Data. In: Proceedings of Benelearn-99.
[15] de Ruiter, M. (1999). Bayesian Classification in Data Mining: Theory and Practice. MSc thesis, BWI, Free University of Amsterdam, The Netherlands.
[16] Jephcott, J. and T. Bock (1998). The Application and Validation of Data Fusion. Journal of the Market Research Society, 40 (3), 185-205.
b The author is also affiliated with the Leiden Institute of Advanced Computer Science (LIACS), P.O. Box 9512, 2300 RA Leiden, The Netherlands, putten@liacs.nl.