Dance of Giants Starts Slowly
By Philip J. Gill
Someday, massively parallel hardware and very large databases may make
beautiful music for large-scale data warehouse applications.
Just 18 months ago the idea of a commercial market for massively parallel
processing (MPP) systems seemed imperiled. Once-promising start-up companies,
including Thinking Machines Corp. and Kendall Square Research Corp., floundered
in their attempts to bring MPP systems to market. And not just hardware
was proving problematic. There was little if any software, including relational
database management systems (RDBMSs), that could exploit the potential of
MPP architectures.
Today, even though parallel versions of some popular RDBMSs are still in
beta testing, the affinity of MPP systems for very large databases (VLDBs)
is starting to show promise, particularly for large-scale data warehousing
applications. Some early adopters are having success using
such combinations for sales and marketing ventures. For instance, communications
firm MCI Corp. has used its data warehouse as a focal point in shifting
from product-driven to customer- or relationship-driven marketing. (For
more, see the accompanying case study.)
This is not to say that these large systems are widespread. "The technologies
are in their early days," says Nagraj Alur, a principal at DataBase
Associates, a Morgan Hill, CA, database consulting firm, which recently
completed the first comprehensive study of IBM SP/2 users. The SP/2, now known as
the RS/6000 SP, is IBM's entry into the MPP market.
"This is still an immature area," agrees Herb Edelstein, president
and cofounder of Two Crows Corp., a Potomac, MD, consulting firm that specializes
in data warehousing and data mining technologies. Nevertheless, it seems
likely that it will take something on the grand scale of MPP to manage and
keep pace with the incessant growth of VLDBs. "It is one of those marriages
made in heaven," says Alur.
Early Emergence
By definition, an MPP system can support from a few to a few hundred independent
processors, each with its own memory and copy of the operating system. Analysts
generally agree that a database of 100GB or more qualifies as a VLDB, though
they are quick to add that few databases of that size exist today. "The
number of these systems in use today is on the order of dozens, not hundreds,"
says Ken Rudin, a managing partner at Emergent Corp., a San Mateo, CA, database
consulting firm that advises customers on implementing and maintaining large-scale
data warehouse systems.
Because the technologies are in their early stages, they present users with
some unique challenges--including buying them. There are few DBMSs available
for MPP systems. IBM has promised the DB2 Parallel Edition, while Informix
has promised Informix-XPS. Neither database is in commercial release yet,
though some early adopters, such as MCI, are using beta versions of the
Informix system.
Oracle's Parallel Server, based on Oracle 7 release 2, has been available
since last May, but it does not support partitioned disks, which analysts
say is necessary for a truly parallel database. Also available is the Teradata
parallel database software from NCR Corp. of Dayton, OH. However, it runs
only on a proprietary Teradata computer or NCR's Unix-based 3700 MPP platform.
Not surprisingly, given the state of the technologies, managing these MPP
platforms also can be difficult. "There are very few tools for managing
these systems today," says Alur.
An Ongoing Argument
Software isn't the only issue. Despite the current linkage of MPP platforms
to VLDB data warehouse applications, the issue of symmetric multiprocessing
(SMP) systems versus MPP systems isn't finally settled. Although MPP took
a clear lead in serving large-scale data warehouse systems about two years
ago, SMP has begun to catch up in both scalability and quality.
Users shouldn't be fooled into thinking the larger the MPP system the better,
says Edelstein. Two years ago, SMP systems topped out after eight to 12
CPUs and, more importantly, they didn't deliver a nearly linear increase in performance
with each additional processor. Today, however, SMP systems have matured,
and some support 30 or more processors. "SMP technology has improved,"
says Edelstein. "Today, they are more scalable, and the individual
CPUs are better."
In addition, vendors have begun to cluster their SMP systems into large,
loosely coupled configurations. Digital Equipment Corp.'s TruCluster software
allows up to four 12-CPU SMP AlphaServers to function in a cluster of up
to 48 CPUs. DEC has promised to raise that limit.
For some users, then, SMP may be the appropriate solution, says Edelstein.
For others, however, even the best SMP system or SMP clusters won't do.
"In some applications, they need to keep adding more and more and more
data," he says. "So the need for MPP is there."
Knowing the difference is crucial, Edelstein adds. Determining which platform
is best is not as simple as estimating the size of the database. "The
kind of operations and the complexity of the information are probably more
important," he says. "It will vary from application to application.
In a retail data warehouse, there are typically fewer tables, but they are
very large. In an insurance company, the data warehouse has many more tables,
but they are smaller. So what pushes you to MPP in insurance is probably
not the same as what pushes you to MPP in retail."
In short, according to Edelstein, four factors should determine the choice
of MPP or SMP: "the nature of the application, the complexity of the database,
the size of the database and the number of users."
One argument users waste a lot of time on, says Emergent's Rudin, is the
debate over shared disk (SMP) versus shared nothing (MPP). He calls this
controversy little more than a marketing ploy by which some database software
vendors are attempting to distinguish themselves from their competitors.
There's no sound technical basis for it, says Rudin. "This is nothing
but a lot of noise. Users need to pick a database based on its query optimization
capabilities."
Building the Warehouse
Once users have jumped the hardware and software hurdles, some important
implementation and management issues still stand in the way of building
successful data warehouse applications. Rudin says the first challenge users
typically face is justifying the expense of a data warehouse investment.
These systems can cost hundreds of thousands of dollars, and the payoff
isn't always clear-cut. Since marketing and sales are the main beneficiaries
of many data warehouses today, he says, users "should make the same
business case they would for any marketing campaign."
Extraction, transformation and cleansing of data form a major challenge
that must be met before the data warehouse becomes functional. Users have
to establish the processes by which data from the operational systems get
copied into the data warehouse. "It always takes a lot longer than
anyone thinks," Rudin says. "Users have to make a lot of decisions
up front about what the data means. A simple thing like 'male/female'
could be different in each system--an X in a check box in one system, an
M or an F in another, a 1 or a 2 in a third." The information has to
be translated into a common format before going into the data warehouse.
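The translation step Rudin describes can be sketched in a few lines of Python. This is only an illustration, not any vendor's tool. The system names (billing, orders, crm), the meaning assigned to each code (for example, that 1 means male) and the choice of "unknown" for unrecognized values are all assumptions made for the example.

```python
# Hypothetical per-system encodings of the same attribute. The system names
# and the meaning of each code are assumptions made for illustration.
GENDER_CODES = {
    "billing": {"M": "male", "F": "female"},
    "orders": {"1": "male", "2": "female"},
    # Assumes a single check box labeled "female": X if checked, blank if not.
    "crm": {"X": "female", "": "male"},
}


def normalize_gender(system, raw):
    """Translate one source system's code into the warehouse's common format."""
    codes = GENDER_CODES[system]
    key = str(raw).strip().upper()
    # Anything the mapping does not recognize is flagged rather than guessed.
    return codes.get(key, "unknown")


for system, raw in [("billing", "F"), ("orders", 1), ("crm", "x"), ("orders", "9")]:
    print(system, repr(raw), "->", normalize_gender(system, raw))
```

Deciding what each code means, and what to do with values that match nothing, is exactly the kind of up-front decision that Rudin says takes longer than anyone expects.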
Second, companies should build a data warehouse that will scale. "People
are always underestimating the size of a data warehouse," says Rudin.
"It's not just the amount of data that increases rapidly, it's also
the complexity of the queries and the number of users. Data begets data,
and usage begets usage. The size of the database will double in a year to
18 months, and the number of users will increase by an order of magnitude."
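To see what that rate of growth implies, the arithmetic can be worked through in a short Python sketch. The 100GB starting size and the three-year horizon are hypothetical, and the calculation assumes the doubling rate Rudin cites holds steady over the whole period, which the article does not claim.

```python
def projected_size_gb(initial_gb, months, doubling_months):
    """Size after `months` if the database doubles every `doubling_months`."""
    return initial_gb * 2 ** (months / doubling_months)


# Hypothetical starting point: a 100GB warehouse, projected three years out.
fast = projected_size_gb(100, 36, 12)  # doubling every year
slow = projected_size_gb(100, 36, 18)  # doubling every 18 months
print(f"after 3 years: {slow:.0f}GB to {fast:.0f}GB")
```

Under these assumptions the 100GB warehouse reaches between 400GB and 800GB in three years, before accounting for the growth in users and query complexity.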
Rudin suggests that the best solution for users may be to pick a vendor
that delivers SMP systems that can scale up to MPP. That allows the user
to start small and grow the system. However, this shift in hardware also
involves modification of software, including applications.
A few vendors offer specialized software tools for data warehousing. Rudin
recommends against using these special tools for a central, enterprise-wide
data warehouse. "This is a business decision, not purely a technical
decision," he says. "These specialized databases are powerful
tools, but do you want to introduce yet another database into your company?
And will you need a specially trained database administrator to support
it?"
Rudin believes companies are better off using their corporate standard database
platform for the core data warehouse. That way they can leverage their existing
administration expertise from operational systems to the data warehouse.
Performance is another common issue users must face. Rudin recommends most
companies implement their data warehouse in a star-schema layout to optimize
performance. A star schema has one large table in the middle and
many smaller tables that hang off it.
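A small sketch may make the layout concrete. The large central table is commonly called the fact table and the smaller ones dimension tables. The example below uses Python's built-in sqlite3 module purely for illustration. The table names, columns and rows are invented, and a real warehouse would run on one of the parallel databases discussed above rather than on SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Small tables that hang off the central one (hypothetical names).
    CREATE TABLE store   (store_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, category TEXT);
    -- The large central table: one row per sale, keyed to the small tables.
    CREATE TABLE sales (
        store_id   INTEGER REFERENCES store(store_id),
        product_id INTEGER REFERENCES product(product_id),
        amount     REAL
    );
    INSERT INTO store VALUES (1, 'West'), (2, 'East');
    INSERT INTO product VALUES (10, 'Phones'), (11, 'Pagers');
    INSERT INTO sales VALUES (1, 10, 100.0), (1, 11, 40.0), (2, 10, 60.0);
""")

# A typical query joins the central table to the small tables and aggregates.
rows = conn.execute("""
    SELECT s.region, p.category, SUM(f.amount)
    FROM sales AS f
    JOIN store AS s ON s.store_id = f.store_id
    JOIN product AS p ON p.product_id = f.product_id
    GROUP BY s.region, p.category
    ORDER BY s.region, p.category
""").fetchall()

for row in rows:
    print(row)
```

Because every query follows the same shape, joining the central table to a handful of small ones, the database's query optimizer has a predictable pattern to work with.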
Enterprise-Wide and Local
Organizations also should select front-end tools through which users, usually
business or marketing analysts, can access, query and analyze the data.
Although Rudin says he doesn't believe in using a special data warehouse
DBMS for the central data store, he does support using a variety of front-end
tools for querying and decision support.
The most popular of these are multidimensional databases (MDDs), which he
says are most appropriate for data marts: smaller, targeted "slices"
of the central data warehouse that contain information tailored to the specific
needs of a division, department or line of business.
Rudin recommends a two-tier approach to data warehousing: building a central
data store for enterprise-wide information and data marts to serve local
needs. Data marts typically download selected information from the central
data warehouse at intervals, either on an ad hoc basis or at predetermined
times. Local users then perform queries and analyses on the smaller, specialized
data mart, keeping the central system free for other operations. The in-line
or divisional IS organization, or even power users, may be able to supply
the support these systems need.
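The download step in this two-tier approach might look something like the following Python sketch, again using the built-in sqlite3 module as a stand-in for real warehouse and data mart software. The accounts table, its columns and the refresh_mart function are hypothetical, and a full replace on each refresh is just one simple design choice among several.

```python
import sqlite3


def refresh_mart(warehouse, mart, division):
    """Replace the mart's contents with the warehouse rows for one division."""
    rows = warehouse.execute(
        "SELECT customer, revenue FROM accounts WHERE division = ?",
        (division,),
    ).fetchall()
    with mart:  # commit on success, roll back if an error is raised
        mart.execute(
            "CREATE TABLE IF NOT EXISTS accounts (customer TEXT, revenue REAL)"
        )
        mart.execute("DELETE FROM accounts")
        mart.executemany("INSERT INTO accounts VALUES (?, ?)", rows)
    return len(rows)


# Hypothetical central warehouse holding data for every division.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE accounts (customer TEXT, division TEXT, revenue REAL)"
)
warehouse.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [("Acme", "retail", 500.0), ("Bolt", "wholesale", 300.0), ("Cove", "retail", 200.0)],
)

# The retail division's data mart receives only its own slice.
mart = sqlite3.connect(":memory:")
loaded = refresh_mart(warehouse, mart, "retail")
print(loaded, "rows loaded into the retail data mart")
```

Once the slice is loaded, local queries run against the mart alone, which is what keeps the central system free for other operations.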
The newest front end, as everywhere else, is a World Wide Web browser. Rudin
sees a Web front end to data warehouses as the simplest and most cost-effective
way to provide universal access. Of course, a Web front end could take a
warehouse from a few dozen users to hundreds or even thousands overnight,
which only increases the need to marry large-scale data warehouses to MPP
systems.
Philip J. Gill is a free-lance writer and editor based
in San Diego. He can be reached at philipgill@aol.com.