6 DATA ANALYSIS AND DATA MINING
methods with high computational cost become less feasible. Clearly, in such cases,
we cannot identify an exact rule, because various factors other than those already
mentioned come into play, such as available resources for calculation and the time
needed for results. However, the effect unquestionably exists, and it prevents the
use of some tools, or at least renders them less practical, while favoring others of
lower computational cost.
It is also true that there are situations in which these aspects are of only marginal
importance, because the amount of data is not large enough to strain the available
computing resources; this is partly thanks to the enormous increase in the power of computers.
This is often the case even for a large-scale problem, provided it can be broken down into
subproblems, each involving a more manageable portion of the data. Traditional
methods of venerable age have therefore not been put to rest. On the contrary, many of
them, having been developed in a period of limited computing resources, are much less
demanding in terms of computational effort and remain valid if suitably applied.
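As a small illustration of breaking a problem into subproblems, the following Python sketch computes a mean and a variance from per-portion summaries, so that no step needs the whole data set at once. The function names and the toy data are our own assumptions, introduced only for illustration. The sum-of-squares formula used here is the simplest one, not the most numerically stable.

```python
from typing import Iterable, List, Tuple

Summary = Tuple[int, float, float]  # (count, sum, sum of squares)


def chunk_summary(chunk: List[float]) -> Summary:
    """Summarize one portion of the data without looking at the others."""
    return len(chunk), sum(chunk), sum(x * x for x in chunk)


def combine(summaries: Iterable[Summary]) -> Tuple[float, float]:
    """Combine per-portion summaries into the overall mean and population variance."""
    n_tot, s_tot, ss_tot = 0, 0.0, 0.0
    for n, s, ss in summaries:
        n_tot += n
        s_tot += s
        ss_tot += ss
    mean = s_tot / n_tot
    variance = ss_tot / n_tot - mean ** 2  # simple formula; may lose precision on hard data
    return mean, variance


# Toy data split into three portions (hypothetical values).
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
chunks = [data[0:3], data[3:6], data[6:8]]
mean, variance = combine(chunk_summary(c) for c in chunks)
```

Each portion could be summarized on a different machine or at a different time; only the three numbers per portion need to be kept.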
1.1.3 SQL, OLTP, OLAP, DWH, and KDD
We have repeatedly mentioned the great availability of data, now collected in an
increasingly systematic and thorough way, as the starting point for processing.
However, the conversion of raw data to “clean” data is time-consuming and
sometimes very demanding.
We cannot presume that all the data of a complex organization can fit into a
single database on which we can simply draw and develop. In the business world,
even medium-sized companies are equipped with complex IT systems made up
of various databases designed for various aims (customers and their invoices,
employees’ careers and wages, suppliers, etc.). These databases are used by various
operators, both to insert data (e.g., from outlying sales offices) and to answer
queries about single entries, necessary for daily activities—for example, to know
whether and when customer X has paid invoice Y issued on day Z. The term
for methods of querying specific information in these databases, which are called
operational, is OnLine Transaction Processing (OLTP). Typically, these tools are
based on Structured Query Language (SQL), the standard tool for database queries.
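A query of this operational kind might look like the following sketch, which uses Python's standard sqlite3 module on an in-memory database. The table `invoices`, its columns, the identifiers, and the function `payment_status` are hypothetical, chosen only to mirror the "customer X, invoice Y, day Z" example; a real operational system would have its own schema.

```python
import sqlite3

# In-memory database standing in for an operational invoicing system (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE invoices (
           invoice_id  TEXT PRIMARY KEY,
           customer_id TEXT NOT NULL,
           issue_date  TEXT NOT NULL,
           paid_date   TEXT            -- NULL while the invoice is unpaid
       )"""
)
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?, ?)",
    [
        ("Y-001", "X-17", "2024-03-01", "2024-03-20"),
        ("Y-002", "X-17", "2024-04-01", None),
        ("Y-003", "X-42", "2024-04-02", "2024-04-15"),
    ],
)


def payment_status(conn, customer_id, invoice_id, issue_date):
    """Answer whether and when a given customer paid a given invoice."""
    row = conn.execute(
        "SELECT paid_date FROM invoices "
        "WHERE customer_id = ? AND invoice_id = ? AND issue_date = ?",
        (customer_id, invoice_id, issue_date),
    ).fetchone()
    if row is None:
        return "no such invoice"
    return f"paid on {row[0]}" if row[0] is not None else "not yet paid"
```

For instance, `payment_status(conn, "X-17", "Y-002", "2024-04-01")` looks up a single entry, which is typical of OLTP: the query touches one record rather than scanning the data as a whole.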
For decision support, in particular analysis of data for CRM, these operational
databases are not the proper sources on which to work. They were designed for
different goals, in two senses: they were usually created for administrative
and accounting purposes rather than for data analysis, and those purposes
themselves differ from one database to another. This
means that their structures are heterogeneous and very often contain inconsistent
data, sometimes even structurally, because the definitions of the recorded variables
may be similar but are not identical. Nor is it appropriate for the strategic activities
of decision support to interfere with daily work on systems designed to work on
operational databases.
For these reasons, it is appropriate to develop focused databases and tools. We
thus construct a strategic database or Data WareHouse (DWH), in which data
from different database systems are merged, “cleaned” as much as possible, and
organized around the subsequent processing phase.
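The merging and cleaning step can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the two source systems, their field names, and the cleaning rule (dropping records with a missing amount) are hypothetical, and a real DWH would involve far more elaborate checks.

```python
from datetime import datetime

# Two operational sources recording similar, but not identical, variables (toy data).
sales_db = [
    {"cust_code": "x17", "amount_cents": 125000, "date": "01/03/2024"},
    {"cust_code": "X42", "amount_cents": 9900, "date": "02/04/2024"},
]
billing_db = [
    {"customer_id": "X17", "amount_eur": 310.0, "issued": "2024-04-01"},
    {"customer_id": "X99", "amount_eur": None, "issued": "2024-04-05"},
]


def from_sales(rec):
    """Map a sales record to the common layout: upper-case id, euros, ISO date."""
    return {
        "customer_id": rec["cust_code"].upper(),
        "amount_eur": rec["amount_cents"] / 100,
        "date": datetime.strptime(rec["date"], "%d/%m/%Y").date().isoformat(),
        "source": "sales",
    }


def from_billing(rec):
    """Map a billing record to the common layout (already in euros and ISO dates)."""
    return {
        "customer_id": rec["customer_id"].upper(),
        "amount_eur": rec["amount_eur"],
        "date": rec["issued"],
        "source": "billing",
    }


def build_warehouse(sales, billing):
    """Merge both sources, then 'clean' by dropping records with a missing amount."""
    unified = [from_sales(r) for r in sales] + [from_billing(r) for r in billing]
    return [r for r in unified if r["amount_eur"] is not None]


warehouse = build_warehouse(sales_db, billing_db)
```

The point of the common layout is that later analysis can treat all records alike, whatever system they came from, while the `source` field keeps track of their origin.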
The development of a DWH is complex, and it must be carefully designed for
its future aims. From a functional point of view, the most common method of