However, the character of much of the data generated by businesses today does not match the strengths of the RDBMS in virtually any respect. This mismatch is revealed within the context of Information Lifecycle Management, or assessing the handling of data from the time of its creation to its obsolescence. ILM is rapidly gaining favor within enterprise IT departments as an effective approach for coping with rapidly growing volumes of corporate data.
Consider two of the hot-button IT issues on the top of everyone's list--the requirements of RFID and Sarbanes-Oxley. From a raw data perspective, they have a great deal in common with other less pervasively covered IT challenges, such as mobile service carriers' call data records or manufacturers' bill-of-material information.
The huge volumes of data these sources generate are related to past business events. This category of data possesses three key characteristics:
1) The data records occur at high transaction rates (usually from automated sources), resulting in a large volume of stored data.
2) The data records never change once they are created.
3) The data records must be saved primarily for historical-reference purposes and will be infrequently (if ever) accessed.
Two weeks of mobile call records easily fill a database to four terabytes (four thousand gigabytes) or more, and this volume will be multiplied ten- or twenty-fold for so-called "3G" mobile networks. In the case of RFID records, major retailers and distributors are expected to generate anywhere from tens of terabytes to, by some incredible estimates, millions of terabytes of these records daily.
Herein lies the mismatch. Relational databases--with their transactional, dynamic and multi-user features--come with functionality that far exceeds what's needed for simply storing and accessing write-once/read-maybe business data. This excess functionality requires sizable hardware and software investments that grow in proportion to the amount of data handled. With costs easily in the seven-figure range, even the most well-funded datacenter would have a difficult time spending its way out of this problem.
The answer likely resides in pairing the RDBMS with a complementary technology that is particularly suited to the demands of capturing and storing large volumes of this write-once data. Ironically, a technology previously destined for the history books may well fit current and future requirements perfectly: the flat file.
Long relegated to application-embedded databases and desktop programs, a flat file that borrows a key feature from the relational database--the index--meets all of the requirements previously described for digital-business event data.
In databases, an index speeds up query access to large volumes of data by providing an entry for each field (such as username, phone number, etc.) and the location of the specific matching record(s). Applying an index to a flat file results in a very accessible repository--much more accessible than a tape library--that can respond quickly to enterprise reporting needs. Further, it can do so using comparatively modest server hardware. Coupled with the ever-decreasing cost of disk-based storage, using a flat file becomes a highly cost-effective approach.
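To make the idea concrete, here is a minimal sketch in Python of an index over an append-only flat file. It is an illustration only, not a description of any vendor's product: the pipe-delimited record layout, the sample call records, and the function names `build_index` and `lookup` are all assumptions made for this example. The index maps each key to the byte offsets of its records, so a lookup seeks straight to the matching lines instead of scanning the file.

```python
import io

def build_index(f, key_field=0, sep=b"|"):
    """Scan the flat file once and map each key to the byte offsets of its records."""
    index = {}
    f.seek(0)
    while True:
        offset = f.tell()              # position where the next record starts
        line = f.readline()
        if not line:                   # end of file
            break
        key = line.rstrip(b"\n").split(sep)[key_field].decode()
        index.setdefault(key, []).append(offset)
    return index

def lookup(f, index, key):
    """Fetch only the records for one key by seeking to their stored offsets."""
    records = []
    for offset in index.get(key, []):
        f.seek(offset)
        records.append(f.readline().rstrip(b"\n").decode())
    return records

# Hypothetical write-once call records: number | timestamp | duration in seconds.
# An in-memory file stands in for a file on disk.
call_log = io.BytesIO(
    b"555-0100|2005-08-01T09:00|120\n"
    b"555-0199|2005-08-01T09:02|45\n"
    b"555-0100|2005-08-01T09:30|300\n"
)
index = build_index(call_log)
matches = lookup(call_log, index, "555-0100")
```

Because the records never change once written, the offsets never go stale; new records only add entries to the index.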
Equally important, moving large volumes of business event data from the RDBMS to a complementary flat-file-based solution enhances the performance of the RDBMS for the tasks it's meant for. At the same time, this approach also delivers on the promise of ILM by putting the right data in the right place for the right cost without sacrificing support for the business.
The time is right to rethink how to deal with the looming explosion in data volumes. The relational database is an impressive technology, but it is also the most expensive way to store large volumes of static data simply to provide for potential access some time in the future.
It is a frequently referenced fact that 80 percent of the data stored in relational databases is never accessed once it is written to the database. For digital business event data, this percentage will be much higher and we know right now that it is unlikely the records will ever be accessed once they are written, let alone be altered by multiple users.
Simply put, the relational database is too much hammer for the digital business-event nail.
Biography
Kate Mitchell is CEO of CopperEye. She previously served as senior vice president for marketing and development at SeeBeyond Technology.

But that's okay, because Put Down Pete loves environments where babble is the order of the day.
Relational database under attack? Yah? I can't wait for the author's next article, "The continuing benefits of COBOL programs in the 21st century."
Over the last 25 years (started writing software in high school, about the time Bill Gates was setting up M$), there has been a continual battle to capture as much data as possible & then try to find a way to store it (and possibly to retrieve it too). This problem is nothing new.
In 1999, I designed an application for a major credit card company to store all of their transactions for North America, for the last 6 months. It faced similar problems. I created a partitioned DB2 table for the main data repository. Each day's load overlaid the partition with the oldest data. It also updated a table to indicate what data was in each partition. On the rare occasions the data changed, a concurrent Reorg would run (normally saved up for a weekend) and the partition was Reorg'ed. At each change, concurrent image copies were taken & that was all we needed to store & maintain the data.
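The rolling-partition scheme described above can be sketched roughly as follows. This is an in-memory Python illustration only, not DB2: the class name `RollingStore`, its methods, and the sample data are hypothetical, and the Reorg and image-copy steps are left out.

```python
class RollingStore:
    """Fixed number of partitions; each day's load overwrites the oldest one."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        self.catalog = [None] * num_partitions   # which day each partition holds
        self.next_slot = 0                       # partition holding the oldest data

    def load_day(self, day, records):
        """Overlay the oldest partition with the new day's records."""
        slot = self.next_slot
        self.partitions[slot] = list(records)
        self.catalog[slot] = day                 # update the "what is where" table
        self.next_slot = (slot + 1) % len(self.partitions)
        return slot

    def find(self, day):
        """Return the records for a day, or None if that day has been overwritten."""
        if day in self.catalog:
            return self.partitions[self.catalog.index(day)]
        return None

store = RollingStore(3)
for day in ["d1", "d2", "d3", "d4"]:
    store.load_day(day, [f"txn-{day}"])
```

With three partitions and four loads, the fourth day lands in the slot that held the first day, so storage stays bounded no matter how long the loads continue.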
That approach worked in that situation, but it won't work here - just TOO MUCH data. But looking at the requirements, let's go back to before RDBMSs became so popular and use a set of concatenated tape-based VSAM files! Before technology became so enamoured with adding bells & whistles, VSAM was pretty cool as a basic indexed flat file.
BTW, with things such as OO COBOL (even being used in .NET) and Unix System Services (a product that allows a mainframe program to write a file onto a Unix server, in almost the same way it would write a mainframe file), COBOL is still very much alive & kicking. All those DB2 tables in the EIS Tier have to be maintained somehow & although I had fun writing PL/I & assembler, COBOL is simply better. With (OO) COBOL getting the job done, mainframe Java is not yet essential.
There is a reason why SMTP servers use flat files: they scale well, are efficient, they're portable, and no corruption will occur. Ask a MS Exchange Admin who's on his 50th mailstore recovery if he likes the "all eggs in one basket" approach.
Data retention will be an ever growing problem, not just with space allotment, but with future data formats. How do you know in 20 years if the data you have now can even be read? Archiving records in plain-text is the only way to ensure your data can be read 5, 10, even 50 years from now.
- data request (time-framed)
- get related DB file(s) from tape and cache them to disk
- mount DB file(s)
- exec query against mounted DB's
- unmount DB file(s)
- save updated DB's back to tape (if needed).
Of course, such automation may require some middle steps to be fast/reliable enough, but one gets the idea.
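As a rough illustration of those steps, the Python sketch below uses SQLite files as the archived "DB files". Everything here is an assumption for the example: a plain directory stands in for the tape library, `shutil.copy` stands in for the tape fetch, SQLite's ATTACH/DETACH stands in for mount/unmount, and `query_archive`, the `calls` table, and the sample rows are hypothetical. The optional "save back to tape" step is omitted.

```python
import os
import shutil
import sqlite3
import tempfile

def query_archive(archive_path, cache_dir, sql, params=()):
    """Cache an archived DB file on disk, attach it, run one query, detach it."""
    cached = os.path.join(cache_dir, os.path.basename(archive_path))
    shutil.copy(archive_path, cached)            # "get DB file from tape, cache to disk"
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("ATTACH DATABASE ? AS archive", (cached,))   # "mount"
        rows = conn.execute(sql, params).fetchall()               # "exec query"
        conn.execute("DETACH DATABASE archive")                   # "unmount"
    finally:
        conn.close()
    return rows

with tempfile.TemporaryDirectory() as tmp:
    tape_dir = os.path.join(tmp, "tape")         # stand-in for the tape library
    cache_dir = os.path.join(tmp, "cache")
    os.mkdir(tape_dir)
    os.mkdir(cache_dir)

    # Build a small archived file to query.
    archive = os.path.join(tape_dir, "calls_2005_08.db")
    src = sqlite3.connect(archive)
    src.execute("CREATE TABLE calls (number TEXT, day TEXT, seconds INTEGER)")
    src.executemany(
        "INSERT INTO calls VALUES (?, ?, ?)",
        [("555-0100", "2005-08-01", 120),
         ("555-0199", "2005-08-01", 45),
         ("555-0100", "2005-08-02", 300)],
    )
    src.commit()
    src.close()

    # Time-framed data request against the archived file.
    rows = query_archive(
        archive, cache_dir,
        "SELECT number, seconds FROM archive.calls WHERE day = ? ORDER BY number",
        ("2005-08-01",),
    )
```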
It's OK to ask the DBA/data modeler to optimize their designs. But if the data doesn't need to go in a DB, you may be talking to the wrong people.
Just because something is old does not mean it is obsolete junk. Pete, I guess you have given up on the wheel and fire.
transactions shows just how much you don't know. Major
rdbms engines do:
-marshalling resources for thousands of users
-security
-sql query parsing and optimization
-data abstraction and normalization
-recovery of data from system failure (hardware & software)
-relational mathematics to join disparate data structures
-advanced connectivity of a variety of user platforms
-advanced MPP processing & parallelization of queries
against partitioned data structures.
to name but a few.....
I doubt you came up with this idea on your own, but
whoever put you up to it is demonstrating both of your
ignorance. Go back to doing marketing for Spencers.
Leave the complicated stuff to professionals...
It is a high performance object/relational persistence
service for JAVA applications.
Most professionals are using it with great success, in
conjunction with Websphere or Weblogic.
Also, your complaint about performance doesn't wash,
consider Walmart uses an rdbms as their data warehouse,
which has about 50 TB or more.... This is where MPP
processing comes into play...
All of the major rdbms vendors have MPP / Grid scalability
as well as initiatives on RFID data. Use Google to query, for
instance, type in: Teradata RFID. or Oracle RFID, or Sybase
RFID... you'll see that these companies are investing very
heavily, as are all of the storage vendors (Network
Appliance, EMC, etc...).
<<The use of relational databases easily doubles if not triples the cost of OO projects, and cripples the performance that could be achieved if OO databases were used for those kinds of applications. >>
Okay... first of all, there is no such thing as "ReadMaybe." If you "Might" have to read the data, then you need read capability. This is a WriteOnce/ReadMany scenario, where not all data may be read. Nothing unique here.
Overhead? The author wants to use FLAT FILES "instead" of a relational database? Silly. First of all, you can create "flat tables" in a relational database. There is this really interesting concept some of us have heard of... it's called "Data Warehousing." And it addresses the vast majority of shortcomings in transaction-based databases for this type of storage. Ask a good DBA. They'll tell you that database designs can vary widely depending on the expected data usage. Apparently, the author thinks that RDBMS systems are only for transaction processing.
But let's look at the other example here.... recording thousands of 3G telephone calls? WHY? What company creates voice recordings of EVERY phone call right now? Is this REALLY a practical application? Even RFID applications only store a small code. I assume somebody probably records phone calls digitally, though... and I assume they are using a DATABASE in the most robust of implementations. Their reason for taking this approach is likely the same reason they'll instantly reject the idea of flat files for data storage: Raw, flat-file IO is more costly than database lookups in terms of system resources. Read: LESS PERFORMANCE.
Now, I'm not hung up on relational databases... with some of the object-oriented work going on, and more recently, aspect-oriented data storage, there are certainly alternative ways to contemplate storing data. But if you are going to suggest that NEW storage systems are needed, to meet NEW demands, the last thing you want to do is to suggest OLD technology that is far less suited for the job than a RDBMS.
It's not that RDBMS is bad, it's just that it's not a good fit for simple but big data, and that if you are stuck with Oracle it's damn expensive for what you actually use.
If the CEO whose name is on the byline wrote it, recommendation to investors: get out quick.
If the CEO whose name is on the byline DIDN'T write the article, recommendation to CEO: FIRE the author. They don't know a thing.
If you guys were in charge, you would have made Google run on Oracle, queries would come back in a week and they would have never gotten off the ground due to horrible performance and pathetic scalability. So much for your RDBMS vision.
They store petabytes of data in flat files, not relational databases. They do this for a reason.
RDBMSs are tools that are good for certain situations, but not all of them. The type of data the author of the article mentioned is one of them where RDBMSs are not suitable.
Yeah, sure you can partition the database, offline the partitions and bring 'em back when you need them. This is way too slow and cumbersome just to look up a single record.
The bigger problem is that corporations are data packrats. In most situations all this data is useless and mining 5 year old data to find out a conclusion that may not be valid anymore is also idiotic.
Of course, in some industries there are legal reasons you have to retain the data and doing so in an RDBMS can be expensive and slow.
Of course bringing partitions offline/online is time consuming, but after all the whole point of this story was the write once/read maybe scenario. Our proposed approach in the comment above should keep the physical layer homogeneous despite TB/PB ranges and the existing level of partitioning/clustering. If you're going to use (archived) flat files for a search, it's going to be slow too, and more cumbersome to manage if that's not your native format.
The google story is interesting by itself but it doesn't relate to this story either. Data there is write/read a lot.
Retrospective data mining is simply an option, and it depends on the type of business and the data being collected. Sometimes patterns are a statistical event and it could take a considerable amount of time to collect that data. I find nothing "idiotic" about being able to do research on old data. What's the point of collecting it if it's not usable when/if needed?
As said by others, OODBMS is still a very promising approach, but it consists of adding more structure to the DBMS, not removing it. So future sounds not very promising for a bunch of dull flat files.
We have a marketplace to sort out the nonsense of all this.
If Oracle is nonsense, show us the better way.
The point is that relational databases have been made out to be the best way, period. That is wrong. Sometimes it is the best solution and sometimes it is not. And yet many think it is, always.
Google is a perfect example of why relational databases are not always the best way to go.
A flat file may be 'boring', but it is often as fast and efficient as anything else, and many times it is faster by a lot. Same goes for a database that persistently stores objects.
This whole discussion reminds me of the #1 rule of professional, efficient programming: "know your data".
If you don't understand the implications of that rule, then you will be condemned to use what is more popular, without knowing whether it is the best solution for you in any specific case.
In both cases data volumes are very high.
Structured:
This type of data is already structured (sits in Databases), there is very minimal scope for rearranging and giving a unique context for the data.
Unstructured:
This type of data can deliver a set of data depending on the context.
The Solution:
Any form of data can be retrieved and arranged with more context-driven tools, which in turn maintain only the context of each end-user.
The static data which never changes or which has a permanent scope can be made available in database and the remaining data can be there as "unstructured data". Which are driven by "context engines".
Follow the principle of 80-20, where 20% of static data moved into database and retain 80% of data in unstructured way... which serves all type of contexts..!!
- Thanks
Ramesh N T
- Trite nonsense
-
by ERK
August 15, 2005 9:10 AM PDT
- Your article uses common (as in vulgar) industry buzzwords to attempt to justify unconditional acceptance of "trends." That's not what I'd call "rethinking". While it's always useful to attack orthodoxies, one needs some real weapons (in this case reason) to do so.
> However, the character of much of the data
> generated by businesses today does not match the
> strengths of the RDBMS in virtually any respect.
Completely false - the cases where XML and OODBMS might make some sense are the niches.
> This mismatch is revealed within the context of
> Information Lifecycle Management, or assessing
> the handling of data from the time of its
> creation to its obsolescence.
Giving something an acronym (ILM) does not legitimize it, nor make it a useful field of study.
> ILM is rapidly gaining favor within enterprise
> IT departments as an effective approach for
> coping with rapidly growing volumes of corporate
> data. The time is right to rethink how to deal
> with the looming explosion in data volumes.
So ILM has happened prior to this "rethinking" that we need. And yet you're using the admittedly ad hoc ILM to justify a "rethinking"?
> Relational databases--with their transactional,
> dynamic and multi-user features--come with
> functionality that far exceeds what's needed for
> simply storing and accessing write-once /
> read-maybe business data.
"Relational" has nothing to do with the fact that you're describing (I guess) limitations on current SQL DBMSs. DBMS != "database". Relational != SQL.
> This excess functionality requires sizable
> hardware and software investments that grow in
> proportion to the amount of data handled.
You're saying that because I have some extra subroutines and such, that this requires not only linearly-scaled hardware, but additional software? How do you justify this complete non sequitur and mischaracterization of DBMS software?
Besides, if current DBMSs are doing things badly, they should improve. There are no fundamental theoretical issues here - merely bad implementations which have nothing to do with "the nature of data."
> The answer likely resides in pairing the RDBMS
> with a complementary technology that is
> particularly suited to the demands of capturing
> and storing large volumes of this write-once
> data. Ironically, a technology previously
> destined for the history books may well fit
> current and future requirements perfectly: the
> flat file.
Any DBMS worth its salt should be able to offload infrequently accessed data to another disk or even a flat file, without any modification to the query language. A simple concept in computer systems: HIDE IT FROM THE USERS. Make whatever optimizations you must, but make the interface clean and seamless.
> Long relegated to application-embedded databases
> and desktop programs, a flat file that borrows a
> key feature from the relational database--the
> index--meets all of the requirements previously
> described for digital-business event data.
Except speed, query languages, logical organization, physical organization, separation of logical and physical, etc. etc. etc.
> At the same time, this approach also delivers on
> the promise of ILM by putting the right data in
> the right place for the right cost without
> sacrificing support for the business.
And why exactly can't the DBMS vendors update their software to hide this manifestation of physical storage from users? Sheer laziness, if anything. This is a burden for the DBMS vendors.
> The time is right to rethink how to deal with
> the looming explosion in data volumes. The
> relational database is an impressive technology,
> but it is also the most expensive way to store
> large volumes of static data simply to provide
> for potential access some time in the future.
Relational has nothing to do with this. You are confusing the implementation with the model; the SQL "model" with the relational model; and physical storage handlers with the DBMS itself. Give this some more thought, I beg you, for the sake of your readers.
> It is a frequently referenced fact that 80
> percent of the data stored in relational
> databases is never accessed once it is written
> to the database.
Nonsense, but if it were true... which 80%?
> Simply put, the relational database is too much
> hammer for the digital business-event nail.
A trite bit of nonsense. Better implementations, not "models" of I.T. like "ILM", are the rational solution.