Conceptual Modeling for ETL Processes
Panos Vassiliadis Alkis Simitsis Spiros Skiadopoulos
National Technical University of Athens,
Dept. of Electrical and Computer Eng.,
Iroon Polytechniou 9, 157 73, Athens, Greece,
Tel: +30-10-772-1602
pvassil@dbnet.ece.ntua.gr asimi@dbnet.ece.ntua.gr spiros@dbnet.ece.ntua.gr
ABSTRACT
Extraction-Transformation-Loading (ETL) tools are pieces of
software responsible for the extraction of data from several
sources, their cleansing, customization and insertion into a data
warehouse. In this paper, we focus on the problem of the
definition of ETL activities and provide formal foundations for
their conceptual representation. The proposed conceptual model is
(a) customized for the tracing of inter-attribute relationships and
the respective ETL activities in the early stages of a data
warehouse project; (b) enriched with a 'palette' of a set of
frequently used ETL activities, like the assignment of surrogate
keys, the check for null values, etc; and (c) constructed in a
customizable and extensible manner, so that the designer can
enrich it with his own re-occurring patterns for ETL activities.
Categories and Subject Descriptors
H.2.1 [Database Management]: Logical design - data models,
schema and subschema.
General Terms
Design
Keywords
Data warehousing, ETL, conceptual modeling
1. INTRODUCTION
Extraction-Transformation-Loading (ETL) tools is a category of
specialized tools with the task of dealing with data warehouse
homogeneity, cleaning and loading problems. [29] reports that
ETL and Data Cleaning tools are estimated to cost at least one
third of effort and expenses in the budget of the data warehouse
while [8] mentions that this number can rise up to 80% of the
development time in a data warehouse project. [14] mentions that
the ETL process costs 55% of the total costs of data warehouse
runtime. Still, due to the complexity and the long learning curve
of these tools, many organizations prefer to turn to in-house
development to perform ETL and data cleaning tasks. In fact,
while data warehouse expenses are expected to come up to 14
billion dollars worldwide, projected sales for ETL and data
cleaning tools are expected to rise to only (!) 300 million dollars.
Thus, it is apparent that the design, development and deployment
of ETL processes, which is currently performed in an ad-hoc, in
house fashion, needs modeling, design and methodological
foundations. Unfortunately, as we shall show in the sequel, the
research community has a lot of work to do to confront this
shortcoming. In the rest of the paper, we will not discriminate
between the tasks of ETL and Data Cleaning and adopt the name
ETL for both these kinds of activities.
In Fig. 1, we abstractly describe the general framework for ETL
processes. In the bottom layer we depict the data stores that are
involved in the overall process. On the left side, we can observe
the original data providers (typically, relational databases and
files). The data from these sources are extracted (as shown in the
upper left part of Fig. 1) by extraction routines, which provide
either complete snapshots or differentials of the data sources.
Then, these data are propagated to the Data Staging Area (DSA)
where they are transformed and cleaned before being loaded to the
data warehouse. The data warehouse is depicted in the right part
of Fig. 1 and comprises the target data stores, i.e., fact tables and
dimension tables. Eventually, the loading of the central warehouse
is performed through the loading activities depicted on the upper
right part of the figure.
Sources DSA DW
Extract
Transform
& Clean
Load
Figure 1. The environment of ETL processes
In this paper, we focus on the conceptual part of the definition of
the ETL process. More specifically, we are dealing with the
earliest stages of the data warehouse design. During this period,
the data warehouse designer is concerned with two tasks which
are practically executed in parallel. The first of these tasks
involves the collection of requirements from the part of the users.
The second task, which is of equal importance for the success of
the data warehousing project, involves the analysis of the
structure and content of the existing data sources and their
intentional mapping to the common data warehouse model.
Related literature [19] and personal experience [33] suggest that
the design of an ETL process aims towards the production of a
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
DOLAP’02, November 8, 2002, McLean, Virginia, USA.
Copyright 2002 ACM 1-58113-590-4/02/0011…$5.00.