The Race for Data Access
by Bill Roberts
[Chart: Uniprocessor System]
[Sidebar: Schubert Ticketing Services]
[Sidebar: Pyxis Corporation]
[Sidebar: Thinking Massively]
Parallel processing is neither cheap nor simple. How do you
know when your data access requirements justify it?
Has data access got your enterprise stumped? Consider the New England farmer
who faced a similar problem clearing his field: a stump he couldn't budge.
He retired his mule for the day, borrowed two teams of draft horses and
harnessed the power of four to solve his problem.
The late Grace Hopper (the pioneering U.S. Navy rear admiral who was instrumental
in bringing advanced computing to the armed forces) often used that anecdote
to explain parallel processing, the harnessing of two or more CPUs
to return a query or solve a calculation faster than one processor can.
Once the sole domain of expensive scientific computers, parallel processing
has plunged in price, and in response commercial applications have proliferated.
It now replaces many mainframe systems for online transaction processing
and online analytical processing, including decision support.
A parallel computer paired with a parallel database engine is increasingly
touted by vendors, analysts and users as the preferred solution for the
rapid access, flexible queries, modeling and analysis needed in decision
support systems. It finds many ready uses. A stock trader who can analyze
derivatives faster than competitors makes more money. Large retailers can
study the buying trends of groups or individuals. Banks can detect credit
card fraud. Airlines can analyze travel patterns. Pharmaceutical companies
can tailor sales pitches to individual doctors based on their prescription-writing
patterns. Because decision support applications like these are mission-critical,
parallel processing earns consideration.
Before getting into the technical details, an organization should identify
its database and data query needs. Once IS staff and business managers understand
the data they have or will have, as well as their business processes, what
access they need and how much they are willing and able to spend, they can
decide whether parallel processing may be the answer. If that decision is
affirmative, they are ready to choose a database engine and a hardware architecture.
A Commercial Challenge
Parallel processing is based on a simple idea. Software splits large problems
into smaller subtasks and distributes them to different processors, which
simultaneously handle the subtasks and so reach a solution faster than a
uniprocessor system does (see chart).
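In miniature, the pattern looks like the following Python sketch. The data, the filter and the worker count are invented for illustration; a real parallel database engine divides scans, sorts and joins the same way, only at far greater scale.

    # A minimal sketch of divide-and-distribute: split one large scan
    # into chunks, hand the chunks to worker processes, combine results.
    from multiprocessing import Pool

    def scan_chunk(chunk):
        # Each worker handles one subtask: scanning its slice of the data.
        return sum(value for value in chunk if value > 0)

    def parallel_scan(data, workers=4):
        # Split the large problem into one subtask per processor.
        size = (len(data) + workers - 1) // workers
        chunks = [data[i:i + size] for i in range(0, len(data), size)]
        with Pool(workers) as pool:
            partials = pool.map(scan_chunk, chunks)  # subtasks run at once
        return sum(partials)  # merge the partial answers into one

    if __name__ == "__main__":
        print(parallel_scan(list(range(-500, 1_000_000))))

The merge step matters: every parallel plan ends with some serial work to combine partial results, which is one reason adding processors never yields a perfectly linear speedup.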
Various trends are driving commercial parallelism. Data in legacy systems
is not easily accessible. Multiple-processor architectures have become faster,
more scalable and easier to administer. Tools for installing and administering
the multithreaded operating systems and database engines on these platforms
have improved. Application development is also less onerous.
More than any of these reasons, the driving force is that databases are
growing. Databases of 2GB to 3GB, once considered large, are dwarfed by
those of 10GB and more. Some enterprise-wide databases exceed 50GB.
From the hardware perspective, parallel processing is almost a moot point.
Database servers increasingly have parallel capability. Many servers that
run the Unix operating system house two, four or more CPUs. Two CPUs are
common in the Intel environment. Low-end solutions can be bought for $10,000
or less. Even so, just because you have the capacity for more processors
doesn't mean you must buy and use them. If you're not ready for parallel
processing today, you could lay out a migration path to your next database
server.
In tandem with data growth, more users demand access to databases across
the company, and their queries are getting more complex, more ad hoc and
more sophisticated. The benefit of parallel data querying depends on the
number of queries, their complexity, the complexity of the data and the
number of users accessing the system. In a typical executive information
system, the usual hit rate is low, because only a few people use it, but
the queries might be especially complex and the data could be large. Once
decision support spreads to the masses, the database will sustain an enormous
number of hits. When users find out what kind of questions they can ask,
they start doing more complicated queries. One vendor calls it the "potato
chip syndrome"--the more you eat, the more you want.
Parallel processing can be less efficient for applications that let users
ask a series of questions of small subsets of data, drilling deeper with
each subsequent query. In a parallel system, each query would scan far more
data than needed. For these applications, data indexing that leads more
directly to the categories most often used could achieve quicker response
times. But for dealing with large amounts of data and complex analysis,
parallel processing can be the answer. If your query requires multiple operations
and you want to improve response time, you probably need parallel processing.
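A toy Python comparison makes the trade-off concrete (the table and the category key are invented for illustration): a full scan touches every row on every query, while an index built on the most-queried category touches only the rows that match.

    # Contrast a full-table scan with an index lookup for narrow,
    # drill-down queries (illustrative example only).
    from collections import defaultdict

    rows = [{"region": "east", "sales": 10}, {"region": "west", "sales": 7},
            {"region": "east", "sales": 3}, {"region": "north", "sales": 5}]

    def scan_query(region):
        # A scan reads every row, as a parallelized scan also would.
        return sum(r["sales"] for r in rows if r["region"] == region)

    # One pass builds an index keyed on the most-queried category ...
    index = defaultdict(list)
    for r in rows:
        index[r["region"]].append(r)

    def indexed_query(region):
        # ... so each drill-down query reads only the rows it needs.
        return sum(r["sales"] for r in index[region])

    assert scan_query("east") == indexed_query("east") == 13

With four rows the difference is invisible; with 50 million, the indexed query reads a sliver of the data while the scan, parallel or not, reads it all.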
It's not an easy path to choose. It can be costly. Applications can require
recoding. Moving data to a parallel architecture is work, although tools
exist to help. You must train people for the environment. A key issue is
figuring how to spread your data properly, which requires understanding
the nature of the data and how it will be used.
Given the increasing reliance of many organizations on data manipulation,
a boom in parallel processing for data querying probably is on the way.
"Any scalable database application will demand parallel processing,"
says John Oltsik, an analyst at Forrester Research in Cambridge, MA. "Applications
tend to grow, and data tends to grow a lot faster than anticipated. You
need to plan for the future."
Comparing Platforms
A decision support system is oriented toward large-scale sorting and searching
in large databases, usually composed of historical data. I/O is large
and sequential. Queries typically involve full-table scans requiring minimal
indexing. In this environment, speed is everything.
All the major vendors of open systems databases strive to enhance the performance
of their products as new processors and higher clock speeds appear on the
market. Oracle of Redwood Shores, CA, was first to market with a parallel
database--albeit as an add-on--and the first to support 300MHz and 350MHz
processors, though Sybase of Emeryville, CA, wasn't far behind. Informix
Software of Menlo Park, CA, has a uniquely architected parallel processing
database. Red Brick Systems of Los Gatos, CA, focuses on the data warehouse.
According to Rob Tholemeier, vice president and database analyst for the
Meta Group in Burlingame, CA, Informix, Oracle and Red Brick products offer
"dynamic parallel processing," which does not depend heavily on
the data layout and is able to determine which CPUs are loafing and optimize
individual queries as appropriate. Dynamic parallel processing requires
a symmetric multiprocessing (SMP) architecture, also called "shared
memory" or "shared everything." AT&T Teradata, IBM and
Tandem Computers sell solutions that are examples of what Tholemeier calls
"static parallel processing," which requires a more coherent layout
of data and runs in a massively parallel processing (MPP) architecture,
sometimes called "shared nothing." The problem with static parallelism
is that, as data changes, so do the layout needs. "It's relatively
easy to do the first load [of the database in such a system]," he says.
"The continuing administration is much more difficult."
Commercial parallel databases from Informix, Oracle and other vendors perform
best in SMP environments. "Running lots of jobs where all of them share
memory and share resources, SMP is the biggest bang for the buck,"
says Tholemeier, an SMP proponent. A recent report he authored concludes,
"Users who can predefine a path to data for time-critical transactions
can use shared nothing architectures (MPP) by effectively partitioning data.
Otherwise, shared everything (SMP) architectures will support both defined
path and more flexible database designs."
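"Predefining a path to data" in a shared-nothing design usually means partitioning rows across nodes by a key. A minimal Python sketch follows; the key, the node count and the hash scheme are invented for illustration (production systems use a hash that stays stable across restarts).

    # Static, shared-nothing partitioning: a hash of the key decides
    # which node owns each row (illustrative sketch only).
    NODES = 4
    partitions = {n: [] for n in range(NODES)}

    def node_for(customer_id):
        # The predefined path: the key alone names the owning node.
        return hash(customer_id) % NODES

    def insert(row):
        partitions[node_for(row["customer_id"])].append(row)

    def lookup(customer_id):
        # A query on the partition key touches exactly one node.
        node = node_for(customer_id)
        return [r for r in partitions[node]
                if r["customer_id"] == customer_id]

    insert({"customer_id": "C42", "balance": 100})
    print(lookup("C42"))

The catch is the one Tholemeier describes: if one key grows hot, or nodes are added, the rows must be re-spread across the machine--the continuing administration that makes static parallelism hard to live with.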
SMP uses shared memory and disk I/O subsystems to service more than one
CPU. SMP systems run a single copy of the operating system and share a single
copy of the application. CPUs can be added without impacting the operating
system, the application or the data. Tasks or processes are automatically
distributed among CPUs through dynamic load balancing. As more CPUs are
added, applications need not be retuned, and system management gets no more
complicated. The shared bus eventually runs out of bandwidth, although the
efficiencies of sharing help make the most of what there is.
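The dynamic load balancing works because every processor sees the same memory. The following Python sketch shows the pattern with threads pulling from a shared queue; it is illustrative only (Python's global interpreter lock keeps these particular threads from running truly in parallel, but the scheduling idea is the point).

    # SMP-style dynamic load balancing: workers share one memory space
    # and pull tasks from a common queue, so work spreads itself across
    # however many processors are available.
    import queue
    import threading

    tasks = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                item = tasks.get_nowait()
            except queue.Empty:
                return  # no work left; this worker retires
            with lock:
                results.append(item * item)  # all workers share `results`

    for i in range(100):
        tasks.put(i)

    # Adding CPUs just means starting more workers; nothing is retuned.
    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(len(results))  # all 100 tasks done, balanced automatically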
Analysts and vendors estimate that uniprocessor and SMP systems together
account for 80 to 90 percent of the commercial database server market, and
SMP increasingly claims the larger share. The most powerful SMP machines
can handle most commercial
database applications, but of course there are trade-offs. Due to system
overhead, efficiency declines beyond a certain number of processors--eight
or 16 in most Unix systems. At that point, the choice
becomes moving to clusters of SMP boxes or to MPP. In a cluster,
several computers are tied together with a high-speed bus and talk to the
same database. Each computer could contain one or more CPUs. Clusters now
account for between 10 and 20 percent of the commercial database market.
Open systems hardware vendors have become accomplished in SMP. "There
are many terrific boxes out there that do really good SMP up to 64 CPUs,"
Tholemeier says. Although a high-end SMP machine can run $50,000 and SMP
clusters $100,000 or more, a more common, less expensive configuration is
a 5GB database on a single four-processor machine, which may cost $20,000
to $30,000.
A Rule of Thumb
An MPP architecture can have hundreds of processors or more in one computer.
Each node in an MPP machine is a self-contained computer with CPU and memory.
Connected to the others by a high-speed bus or switch, each node functions
on its own, and hardware resources are generally not shared. It's more scalable
than SMP at the hardware level and good for very large data warehouses of
200GB or more. Although theoretically such a system could have thousands
of nodes, commercial applications typically use fewer than 100 nodes. (For
MPP adoption criteria, see "Thinking Massively," at left.)
Today this leading-edge technology probably has less than 2 percent of the
commercial database market. The Standish Group International of Dennis,
MA, which has studied the MPP market, views the leaders as AT&T Teradata,
IBM and Tandem. These systems cost up to $2 million.
SMP proponents assert that it's much easier to implement and maintain than
MPP. Jim Johnson, Standish chairman, doesn't necessarily agree. "There
aren't a lot of people who have done both. I haven't found an end user who
will stand up and say SMP is easier." Size is the best metric, he believes.
"Anybody who has 200GB should go MPP. Between 100 and 200 is no man's
land. For 100GB down to 50GB, SMP is the way to go."
Bill Roberts is a free-lance writer who covers business,
technology and management. He can be reached at wcrober@aol.com.