The Race for Data Access
by Bill Roberts
[Chart: Uniprocessor System]
[Sidebar: Schubert Ticketing Services]
[Sidebar: Pyxis Corporation]
[Sidebar: Thinking Massively]
Parallel processing is neither cheap nor simple. How do you
know when your data access requirements justify it?
Has data access got your enterprise stumped? Consider the New England farmer
who faced a similar problem clearing his field: a stump he couldn't budge.
He retired his mule for the day, borrowed two teams of draft horses and
harnessed the power of four to solve his problem.
The late Grace Hopper (the pioneering U.S. Navy rear admiral who was instrumental
in bringing advanced computing to the armed forces) often used that anecdote
to explain parallel processing, the harnessing of two or more CPUs
to return a query or solve a calculation faster than one processor can.
Once the sole domain of expensive scientific computers, parallel processing
has plunged in price, and in response commercial applications have proliferated.
It now replaces many mainframe systems for online transaction processing
and online analytical processing, including decision support.
A parallel computer paired with a parallel database engine is increasingly
touted by vendors, analysts and users as the preferred solution for the
rapid access, flexible queries, modeling and analysis needed in decision
support systems. It finds many ready uses. A stock trader who can analyze
derivatives faster than competitors makes more money. Large retailers can
study the buying trends of groups or individuals. Banks can detect credit
card fraud. Airlines can analyze travel patterns. Pharmaceutical companies
can tailor sales pitches to individual doctors based on their prescription-writing
patterns. Because decision support applications like these are mission-critical,
parallel processing earns consideration.
Before getting into the technical details, an organization should identify
its database and data query needs. Once IS staff and business managers understand
the data they have or will have, as well as their business processes, what
access they need and how much they are willing and able to spend, they can
decide whether parallel processing may be the answer. If that decision is
affirmative, they are ready to choose a database engine and a hardware architecture.
A Commercial Challenge
Parallel processing is based on a simple idea. Software splits large problems
into smaller subtasks and distributes them to different processors, which
simultaneously handle the subtasks and so reach a solution faster than a
uniprocessor system does (see chart).
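In miniature, the pattern looks like the following Python sketch. The data, the filter and the worker count are invented for illustration; a real parallel database engine divides scans, sorts and joins the same way, only at far greater scale.

    # A minimal sketch of divide-and-distribute: split one large scan
    # into chunks, hand the chunks to worker processes, combine results.
    from multiprocessing import Pool

    def scan_chunk(chunk):
        # Each worker handles one subtask: scanning its slice of the data.
        return sum(value for value in chunk if value > 0)

    def parallel_scan(data, workers=4):
        # Split the large problem into one subtask per processor.
        size = (len(data) + workers - 1) // workers
        chunks = [data[i:i + size] for i in range(0, len(data), size)]
        with Pool(workers) as pool:
            partials = pool.map(scan_chunk, chunks)  # subtasks run at once
        return sum(partials)  # merge the partial answers into one

    if __name__ == "__main__":
        print(parallel_scan(list(range(-500, 1_000_000))))

The merge step matters: every parallel plan ends with some serial work to combine partial results, which is one reason adding processors never yields a perfectly linear speedup.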
Various trends are driving commercial parallelism. Data in legacy systems
is not easily accessible. Multiple-processor architectures have become faster,
more scalable and easier to administer. Tools for installing and administering
the multithreaded operating systems and database engines on these platforms
have improved. Application development is also less onerous.
More than any of these reasons, the driving force is that databases are
growing. Databases of 2GB to 3GB, once considered large, are dwarfed by
those of 10GB and more. Some enterprise-wide databases exceed 50GB.
From the hardware perspective, parallel processing is almost a moot point.
Database servers increasingly have parallel capability. Many servers that
run the Unix operating system house two, four or more CPUs. Two CPUs are
common in the Intel environment. Low-end solutions can be bought for $10,000
or less. Even so, just because you have the capacity for more processors
doesn't mean you must buy and use them. If you're not ready for parallel
processing today, you could lay out a migration path to your next database
server.
In tandem with data growth, more users demand access to databases across
the company, and their queries are getting more complex, more ad hoc and
more sophisticated. The benefit of parallel data querying depends on the
number of queries, their complexity, the complexity of the data and the
number of users accessing the system. In a typical executive information
system, the usual hit rate is low, because only a few people use it, but
the queries might be especially complex and the data could be large. Once
decision support spreads to the masses, the database will sustain an enormous
number of hits. When users find out what kind of questions they can ask,
they start doing more complicated queries. One vendor calls it the "potato
chip syndrome"--the more you eat, the more you want.
Parallel processing can be less efficient for applications that let users
ask a series of questions of small subsets of data, drilling deeper with
each subsequent query. In a parallel system, each query would scan far more
data than needed. For these applications, data indexing that leads more
directly to the categories most often used could achieve quicker response
times. But for dealing with large amounts of data and complex analysis,
parallel processing can be the answer. If your query requires multiple operations
and you want to improve response time, you probably need parallel processing.
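A toy Python comparison makes the trade-off concrete (the table and the category key are invented for illustration): a full scan touches every row on every query, while an index built on the most-queried category touches only the rows that match.

    # Contrast a full-table scan with an index lookup for narrow,
    # drill-down queries (illustrative example only).
    from collections import defaultdict

    rows = [{"region": "east", "sales": 10}, {"region": "west", "sales": 7},
            {"region": "east", "sales": 3}, {"region": "north", "sales": 5}]

    def scan_query(region):
        # A scan reads every row, as a parallelized scan also would.
        return sum(r["sales"] for r in rows if r["region"] == region)

    # One pass builds an index keyed on the most-queried category ...
    index = defaultdict(list)
    for r in rows:
        index[r["region"]].append(r)

    def indexed_query(region):
        # ... so each drill-down query reads only the rows it needs.
        return sum(r["sales"] for r in index[region])

    assert scan_query("east") == indexed_query("east") == 13

With four rows the difference is invisible; with 50 million, the indexed query reads a sliver of the data while the scan, parallel or not, reads it all.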
It's not an easy path to choose. It can be costly. Applications can require
recoding. Moving data to a parallel architecture is work, although tools
exist to help. You must train people for the environment. A key issue is
figuring how to spread your data properly, which requires understanding
the nature of the data and how it will be used.
Given the increasing reliance of many organizations on data manipulation,
a boom in parallel processing for data querying probably is on the way.
"Any scalable database application will demand parallel processing,"
says John Oltsik, an analyst at Forrester Research in Cambridge, MA. "Applications
tend to grow, and data tends to grow a lot faster than anticipated. You
need to plan for the future."
Comparing Platforms
A decision support system is oriented toward large-scale sorting and searching
in large databases, usually composed of historical data. I/O is large
and sequential. Queries typically involve full-table scans requiring minimal
indexing. In this environment, speed is everything.
All the major vendors of open systems databases strive to enhance the performance
of their products as new processors and higher clock speeds appear on the
market. Oracle of Redwood Shores, CA, was first to market with a parallel
database--albeit as an add-on--and the first to support 300MHz and 350MHz
processors, though Sybase of Emeryville, CA, wasn't far behind. Informix
Software of Menlo Park, CA, has a uniquely architected parallel processing
database. Red Brick Systems of Los Gatos, CA, focuses on the data warehouse.
According to Rob Tholemeier, vice president and database analyst for the
Meta Group in Burlingame, CA, Informix, Oracle and Red Brick products offer
"dynamic parallel processing," which does not depend heavily on
the data layout and is able to determine which CPUs are loafing and optimize
individual queries as appropriate. Dynamic parallel processing requires
a symmetric multiprocessing (SMP) architecture, also called "shared
memory" or "shared everything." AT&T Teradata, IBM and
Tandem Computers sell solutions that are examples of what Tholemeier calls
"static parallel processing," which requires a more coherent layout
of data and runs in a massively parallel processing (MPP) architecture,
sometimes called "shared nothing." The problem with static parallelism
is that, as data changes, so do the layout needs. "It's relatively
easy to do the first load [of the database in such a system]," he says.
"The continuing administration is much more difficult."
Commercial parallel databases from Informix, Oracle and other vendors perform
best in SMP environments. "Running lots of jobs where all of them share
memory and share resources, SMP is the biggest bang for the buck,"
says Tholemeier, an SMP proponent. A recent report he authored concludes,
"Users who can predefine a path to data for time-critical transactions
can use shared nothing architectures (MPP) by effectively partitioning data.
Otherwise, shared everything (SMP) architectures will support both defined
path and more flexible database designs."
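"Predefining a path to data" in a shared-nothing design usually means partitioning rows across nodes by a key. A minimal Python sketch follows; the key, the node count and the hash scheme are invented for illustration (production systems use a hash that stays stable across restarts).

    # Static, shared-nothing partitioning: a hash of the key decides
    # which node owns each row (illustrative sketch only).
    NODES = 4
    partitions = {n: [] for n in range(NODES)}

    def node_for(customer_id):
        # The predefined path: the key alone names the owning node.
        return hash(customer_id) % NODES

    def insert(row):
        partitions[node_for(row["customer_id"])].append(row)

    def lookup(customer_id):
        # A query on the partition key touches exactly one node.
        node = node_for(customer_id)
        return [r for r in partitions[node]
                if r["customer_id"] == customer_id]

    insert({"customer_id": "C42", "balance": 100})
    print(lookup("C42"))

The catch is the one Tholemeier describes: if one key grows hot, or nodes are added, the rows must be re-spread across the machine--the continuing administration that makes static parallelism hard to live with.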
SMP uses shared memory and disk I/O subsystems to service more than one
CPU. SMP systems run a single copy of the operating system and share a single
copy of the application. CPUs can be added without impacting the operating
system, the application or the data. Tasks or processes are automatically
distributed among CPUs through dynamic load balancing. As more CPUs are
added, applications need not be retuned, and system management gets no more
complicated. The shared bus eventually runs out of bandwidth, although the
efficiencies of sharing help make the most of what there is.
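The dynamic load balancing works because every processor sees the same memory. The following Python sketch shows the pattern with threads pulling from a shared queue; it is illustrative only (Python's global interpreter lock keeps these particular threads from running truly in parallel, but the scheduling idea is the point).

    # SMP-style dynamic load balancing: workers share one memory space
    # and pull tasks from a common queue, so work spreads itself across
    # however many processors are available.
    import queue
    import threading

    tasks = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                item = tasks.get_nowait()
            except queue.Empty:
                return  # no work left; this worker retires
            with lock:
                results.append(item * item)  # all workers share `results`

    for i in range(100):
        tasks.put(i)

    # Adding CPUs just means starting more workers; nothing is retuned.
    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(len(results))  # all 100 tasks done, balanced automatically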
Analysts and vendors estimate that uniprocessor and SMP systems together
account for 80 to 90 percent of the commercial database server market, and
SMP increasingly claims the larger share. The most powerful SMP machines
can handle most commercial
database applications, but of course there are trade-offs. Due to system
overhead, efficiency declines beyond a certain number of processors--eight
or 16 in most Unix systems. At that point, the choice
becomes moving to clusters of SMP boxes or to MPP. In a cluster,
several computers are tied together with a high-speed bus and talk to the
same database. Each computer could contain one or more CPUs. Clusters now
account for between 10 and 20 percent of the commercial database market.
Open systems hardware vendors have become accomplished in SMP. "There
are many terrific boxes out there that do really good SMP up to 64 CPUs,"
Tholemeier says. Although a high-end SMP machine can run $50,000 and SMP
clusters $100,000 or more, a more common, less expensive configuration is
a 5GB database on a single four-processor machine, which may cost $20,000
to $30,000.
A Rule of Thumb
An MPP architecture can have hundreds of processors or more in one computer.
Each node in an MPP machine is a self-contained computer with CPU and memory.
Connected to the others by a high-speed bus or switch, each node functions
on its own, and hardware resources are generally not shared. It's more scalable
than SMP at the hardware level and good for very large data warehouses of
200GB or more. Although theoretically such a system could have thousands
of nodes, commercial applications typically use fewer than 100 nodes. (For
MPP adoption criteria, see "Thinking Massively," at left.)
Today this leading-edge technology probably has less than 2 percent of the
commercial database market. The Standish Group International of Dennis,
MA, which has studied the MPP market, views the leaders as AT&T Teradata,
IBM and Tandem. These systems cost up to $2 million.
SMP proponents assert that it's much easier to implement and maintain than
MPP. Jim Johnson, Standish chairman, doesn't necessarily agree. "There
aren't a lot of people who have done both. I haven't found an end user who
will stand up and say SMP is easier." Size is the best metric, he believes.
"Anybody who has 200GB should go MPP. Between 100 and 200 is no man's
land. For 100GB down to 50GB, SMP is the way to go."
Bill Roberts is a free-lance writer who covers business,
technology and management. He can be reached at wcrober@aol.com.