Dance of Giants Starts Slowly

By Philip J. Gill

Someday, massively parallel hardware and very large databases may make beautiful music for large-scale data warehouse applications.

Just 18 months ago the idea of a commercial market for massively parallel processing (MPP) systems seemed imperiled. Once-promising start-up companies, including Thinking Machines Corp. and Kendall Square Research Corp., floundered in their attempts to bring MPP systems to market. And not just hardware was proving problematic. There was little if any software, including relational database management systems (RDBMSs), that could exploit the potential of MPP architectures.

Today, even though parallel versions of some popular RDBMSs are still in beta testing, the affinity of MPP systems for very large databases (VLDBs) is starting to show promise, particularly for large-scale data warehousing applications. In particular, some early adopters are having success in using such combinations for sales and marketing ventures. For instance, communications firm MCI Corp. has used its data warehouse as a focal point in shifting from product-driven to customer- or relationship-driven marketing. (For more, see the accompanying case study.)

This is not to say that these large systems are widespread. "The technologies are in their early days," says Nagraj Alur, a principal at DataBase Associates, a Morgan Hill, CA, database consulting firm that recently completed the first comprehensive study of IBM SP/2 users. The SP/2, now known as the RS/6000 SP, is IBM's entry into the MPP market.

"This is still an immature area," agrees Herb Edelstein, president and cofounder of Two Crows Corp., a Potomac, MD, consulting firm that specializes in data warehousing and data mining technologies. Nevertheless, it seems likely that it will take something on the grand scale of MPP to manage and keep pace with the incessant growth of VLDBs. "It is one of those marriages made in heaven," says Alur.

Early Emergence

By definition, an MPP system can support from a few to a few hundred independent processors, each with its own memory and copy of the operating system. Analysts generally agree that a database of 100GB or more qualifies as a VLDB, though they are quick to add that few databases of that size exist today. "The number of these systems in use today is on the order of dozens, not hundreds," says Ken Rudin, a managing partner at Emergent Corp., a San Mateo, CA, database consulting firm that advises customers on implementing and maintaining large-scale data warehouse systems.

Because the technologies are in their early stages, they present users with some unique challenges--including buying them. There are few DBMSs available for MPP systems. IBM has promised the DB2 Parallel Edition, while Informix has promised Informix-XPS. Neither database is in commercial release yet, though some early adopters, such as MCI, are using beta versions of the Informix system.

Oracle's Parallel Server, based on Oracle 7 release 2, has been available since last May, but it does not support partitioned disks, which analysts say is necessary for a truly parallel database. Also available is the Teradata parallel database software from NCR Corp. of Dayton, OH. However, it runs only on a proprietary Teradata computer or NCR's Unix-based 3700 MPP platform.

Not surprisingly, given the state of the technologies, managing these MPP platforms also can be difficult. "There are very few tools for managing these systems today," says Alur.

An Ongoing Argument

Software isn't the only issue. Despite the current linkage of MPP platforms to VLDB data warehouse applications, the issue of symmetric multiprocessing (SMP) systems versus MPP systems isn't finally settled. Although MPP took a clear lead in serving large-scale data warehouse systems about two years ago, SMP has begun to catch up in both scalability and quality.

Users shouldn't be fooled into thinking the larger the MPP system the better, says Edelstein. Two years ago, SMP systems topped out after eight to 12 CPUs, and more importantly they didn't offer a nearly linear increase in performance with each additional processor. Today, however, SMP systems have matured, and some support 30 or more processors. "SMP technology has improved," says Edelstein. "Today, they are more scalable, and the individual CPUs are better."

In addition, vendors have begun to cluster their SMP systems into large, loosely coupled configurations. Digital Equipment Corp.'s TruCluster software allows up to four 12-CPU SMP AlphaServers to function in a cluster of up to 48 CPUs. DEC has promised to raise that limit.

For some users, then, SMP may be the appropriate solution, says Edelstein. For others, however, even the best SMP system or SMP clusters won't do. "In some applications, they need to keep adding more and more and more data," he says. "So the need for MPP is there."

Knowing the difference is crucial, Edelstein adds. Determining which platform is best is not as simple as estimating the size of the database. "The kind of operations and the complexity of the information are probably more important," he says. "It will vary from application to application. In a retail data warehouse, there are typically fewer tables, but they are very large. In an insurance company, the data warehouse has many more tables, but they are smaller. So what pushes you to MPP in insurance is probably not the same as what pushes you to MPP in retail."

In short, according to Edelstein, four factors should determine the choice of MPP or SMP: "the nature of the application, the complexity of the database, the size of the database and the number of users."

One argument users waste a lot of time with, says Emergent's Rudin, is the debate over shared disk (SMP) versus shared nothing (MPP). He calls this controversy little more than a marketing ploy by which some database software vendors are attempting to distinguish themselves from their competitors. There's no sound technical basis for it, says Rudin. "This is nothing but a lot of noise. Users need to pick a database based on its query optimization capabilities."

Building the Warehouse

Once users have jumped the hardware and software hurdles, some important implementation and management issues still stand in the way of building successful data warehouse applications. Rudin says the first challenge users typically face is justifying the expense of a data warehouse investment. These systems can cost hundreds of thousands of dollars, and the payoff isn't always clear-cut. Since marketing and sales are the main beneficiaries of many data warehouses today, he says, users "should make the same business case they would for any marketing campaign."

Extraction, transformation and cleansing of data form a major challenge that must be met before the data warehouse becomes functional. Users have to establish the processes by which data from the operational systems get copied into the data warehouse. "It always takes a lot longer than anyone thinks," Rudin says. "Users have to make a lot of decisions up front about what the data means. A simple thing like 'male/female' could be different in each system--an X in a check box in one system, an M or an F in another, a 1 or a 2 in a third." The information has to be translated into a common format before going into the data warehouse.
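The translation step Rudin describes might look something like the following minimal Python sketch. The three source systems, their field names (`male_box`, `female_box`, `sex`, `sex_code`), the choice of M/F as the common format and the mapping of 1 and 2 are all assumptions made for illustration; a real project would have to settle each of them with the owners of the operational systems.

```python
UNKNOWN = "U"  # assumed placeholder for values that cannot be translated


def from_checkbox(record):
    # Hypothetical system A: one check box per value, marked with an X.
    if record.get("male_box") == "X":
        return "M"
    if record.get("female_box") == "X":
        return "F"
    return UNKNOWN


def from_letter(record):
    # Hypothetical system B: stores M or F; tidy up case and whitespace.
    value = str(record.get("sex", "")).strip().upper()
    return value if value in ("M", "F") else UNKNOWN


def from_number(record):
    # Hypothetical system C: stores 1 or 2. Which number means which
    # is an assumption that must be confirmed for the real system.
    return {1: "M", 2: "F"}.get(record.get("sex_code"), UNKNOWN)


TRANSLATORS = {
    "system_a": from_checkbox,
    "system_b": from_letter,
    "system_c": from_number,
}


def to_common_format(source, record):
    """Translate one source record into the warehouse's common code."""
    return TRANSLATORS[source](record)
```

Keeping one small translator per source system makes each up-front decision about "what the data means" visible in one place, and unrecognized values are flagged rather than silently guessed.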

Second, companies should build a data warehouse that will scale. "People are always underestimating the size of a data warehouse," says Rudin. "It's not just the amount of data that increases rapidly, it's also the complexity of the queries and the number of users. Data begets data, and usage begets usage. The size of the database will double in a year to 18 months, and the number of users will increase by an order of magnitude."
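To get a rough feel for what Rudin's doubling estimate implies, the short Python sketch below projects database size under the assumption of steady exponential growth. The function name, the 100GB starting size and the three-year horizon are illustrative choices, not figures from Rudin.

```python
def projected_size_gb(initial_gb, months, doubling_months):
    """Project size assuming it doubles once every `doubling_months` months."""
    return initial_gb * 2 ** (months / doubling_months)


# A 100GB warehouse after three years, at both ends of the
# 12-to-18-month doubling range.
fast_growth = projected_size_gb(100, 36, 12)  # doubles three times
slow_growth = projected_size_gb(100, 36, 18)  # doubles twice
```

Under these assumptions the same warehouse ends up between four and eight times its starting size in three years, before accounting for the growth in users and query complexity.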

Rudin suggests that the best solution for users may be to pick a vendor that delivers SMP systems that can scale up to MPP. That allows the user to start small and grow the system. However, this shift in hardware also involves modification of software, including applications.

A few vendors offer specialized software tools for data warehousing. Rudin recommends against using these special tools for a central, enterprise-wide data warehouse. "This is a business decision, not purely a technical decision," he says. "These specialized databases are powerful tools, but do you want to introduce yet another database into your company? And will you need a specially trained database administrator to support it?"

Rudin believes companies are better off using their corporate standard database platform for the core data warehouse. That way they can leverage their existing administration expertise from operational systems to the data warehouse.

Performance is another common issue users must face. Rudin recommends most companies implement their data warehouse in a star-schema layout to optimize performance. A star schema has one large table in the middle and many smaller tables that hang off it.
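As an illustration only, a tiny star schema can be sketched with Python's built-in sqlite3 module. The table and column names (`sales`, `product`, `store`) and the sample rows are hypothetical; a production warehouse would use the company's own DBMS and a far larger central table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Smaller tables that hang off the central table
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE store (store_id INTEGER PRIMARY KEY, region TEXT);

-- The one large table in the middle, keyed to the smaller tables
CREATE TABLE sales (
    product_id INTEGER REFERENCES product(product_id),
    store_id INTEGER REFERENCES store(store_id),
    amount REAL
);

INSERT INTO product VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO store VALUES (10, 'east'), (20, 'west');
INSERT INTO sales VALUES (1, 10, 5.0), (1, 20, 7.0), (2, 10, 3.0);
""")


def sales_by_region(connection):
    """Total sales per region: one join from the central table to a small one."""
    return connection.execute("""
        SELECT st.region, SUM(s.amount)
        FROM sales AS s
        JOIN store AS st ON s.store_id = st.store_id
        GROUP BY st.region
        ORDER BY st.region
    """).fetchall()
```

A typical query touches the central table once and joins it to only the small tables it needs, which is the access pattern the layout is meant to favor.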

Enterprise-Wide and Local

Organizations also should select front-end tools through which users, usually business or marketing analysts, can access, query and analyze the data. Although Rudin says he doesn't believe in using a special data warehouse DBMS for the central data store, he does support using a variety of front-end tools for querying and decision support.

The most popular of these are multidimensional databases (MDDs), which he says are most appropriate for data marts: smaller, targeted "slices" of the central data warehouse that contain information tailored to the specific needs of a division, department or line of business.

Rudin recommends a two-tier approach to data warehousing: building a central data store for enterprise-wide information and data marts to serve local needs. Data marts typically download selected information from the central data warehouse at intervals, either irregular or predetermined. Local users then perform queries and analyses on the smaller, specialized data mart, keeping the central system free for other operations. The in-line or divisional IS organization, or even power users, may be able to supply the support these systems need.
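The two-tier idea can be sketched in a few lines of Python. Everything here is hypothetical: the `refresh_mart` function, the `division` field and the sample rows stand in for whatever extract mechanism and data a real site would use.

```python
def refresh_mart(warehouse_rows, division, columns):
    """Copy only the rows and columns one division needs into its data mart."""
    return [
        {col: row[col] for col in columns}
        for row in warehouse_rows
        if row["division"] == division
    ]


# Hypothetical central warehouse contents
warehouse = [
    {"division": "retail", "customer": "A", "revenue": 120, "cost": 80},
    {"division": "wholesale", "customer": "B", "revenue": 300, "cost": 210},
    {"division": "retail", "customer": "C", "revenue": 90, "cost": 60},
]

# The download step, run whenever the mart is due for a refresh
retail_mart = refresh_mart(warehouse, "retail", ["customer", "revenue"])

# Local analysis then runs against the mart, not the central system
retail_revenue = sum(row["revenue"] for row in retail_mart)
```

Because the mart holds its own copy of the slice, local queries such as the revenue total above place no load on the central warehouse between refreshes.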

The newest front end, as everywhere else, is a World Wide Web browser. Rudin sees a Web front end to data warehouses as the simplest and most cost-effective way to provide universal access. Of course, the potential to go overnight from a few dozen users to hundreds or even thousands only increases the need to marry large-scale data warehouses to MPP systems.

Philip J. Gill is a free-lance writer and editor based in San Diego. He can be reached at philipgill@aol.com.