The elephant's dilemma: What does the future of databases really look like?

MIT's leading database researcher Michael Stonebraker talks about receiving the Alan Turing Award and the future of databases.
Written by Colin Barker, Contributor
Stonebraker: "The legacy implementations from Oracle, IBM and Microsoft are good for, essentially, nothing".
Photo: MIT

As a man in his seventies, Michael Stonebraker could be forgiven for taking things easy but there seems little chance of that. The database research pioneer still works most of the week at MIT and for the rest he is involved with three of his own start-ups.

On top of that he has just won the Alan Turing Award, the annual prize for excellence in computing, in recognition of his "fundamental contributions to the concepts and practices underlying modern database systems".

The award is now financed by Google and comes with a $1m prize: previous winners have included Alan Kay, for his work on object-oriented programming and the development of Smalltalk, and Doug Engelbart who invented the mouse, hypertext and programming on split screens, among others.

Stonebraker's achievements include the development of the first fully relational database, Ingres, as well as Postgres. ZDNet caught up with Stonebraker to find out what motivates him and to get his view on the future for relational databases.

ZDNet: Congratulations on winning the Alan Turing Award. Have you thought how you are going to spend the money?

Stonebraker: I have no idea and I am still trying to absorb the idea that I won it and deal with the deluge of journalists and photo requests so I am, all of a sudden, really, really, really busy.

I think that since I play bluegrass banjo, I will probably treat myself to a classy banjo and after that I will probably use some money for research and after that I have no idea.

Q: You are working at MIT most of the week but what do you do the rest of the time?

A: I am also the chief technology officer (CTO) of three start-ups on the side so I am three to four days a week at MIT and the rest doing outside stuff.

Q: In an interview you did recently you talked about the long period of dominance in the database market of companies like Oracle and you suggested that those days are now over. Do you still think that?

A: The database market until 2000 or so was 'one size fits all' and 'Oracle is the answer'. Now if you are the guy with the hammer then everything looks like a nail.

I think that all abruptly changed during the first decade of the 2000s.

Now I think that the database market is a third transaction processing, a third data warehouses and a third everything else. Now what has happened over the last 15 years is that the data warehouse market has converted almost completely from row stores to column stores. Now column stores are just wildly faster than row stores.
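To make the row store versus column store distinction concrete, here is a minimal sketch in Python. It is a toy illustration, not how any product mentioned here is implemented: the table, its column names and the "values touched" counter are all assumptions made up for the example. It shows only that an aggregate over one column reads a fraction of the data when each column is stored together.

```python
# Toy illustration only: the same three-column table stored two ways.
# Table contents and column names are invented for this example.
rows = [
    (1, "north", 120.0),
    (2, "south", 75.5),
    (3, "north", 210.0),
    (4, "east", 42.25),
]


def total_sales_row_store(records):
    """Row layout: each record is kept together, so a scan reads whole records."""
    values_touched = 0
    total = 0.0
    for record in records:
        values_touched += len(record)  # every field of the record is read
        total += record[2]             # only the amount is actually needed
    return total, values_touched


# Column layout: each column is kept together as its own list.
columns = {
    "order_id": [r[0] for r in rows],
    "region": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}


def total_sales_column_store(cols):
    """Column layout: the scan reads the one column the query asks for."""
    amounts = cols["amount"]
    return sum(amounts), len(amounts)


row_total, row_touched = total_sales_row_store(rows)
col_total, col_touched = total_sales_column_store(columns)
print("row store:   ", row_total, "values touched:", row_touched)
print("column store:", col_total, "values touched:", col_touched)
```

Both layouts return the same total; the difference is how much data the scan has to visit. Warehouse tables tend to be wide and warehouse queries tend to read few columns, and that gap is one commonly cited reason for the speed difference described above.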

That leaves all of the legacy vendors on the wrong side of the technical disconnect because they are all selling row stores.

Also in the transaction processing world, main memory has gotten cheap enough that you can put most, but not all, transaction databases in main memory.

Now I know that Facebook is gigantic but in most cases a terabyte is a very, very big transaction processing database. Now you can buy a terabyte of memory for $25,000 so, in my opinion, the transaction processing market is in the middle of morphing from disk-based row stores to main memory systems with very, very different transaction implementations.

So [SAP] HANA is an example of this, Hekaton [an OLTP implementation from Microsoft on SQL Server] also and VoltDB, a company I started, is an example of this. I think the market is in the process of moving to main memory solutions. Again, the legacy vendors are on the wrong side of this disconnect.

Now if you look at the 'everything-else' market, that is in four pieces. There are the NoSQL guys, which is 100 or so vendors with various different data models and various different capabilities.

Mongo is popular, Redis is popular, and there are bunches and bunches of them. None of them look like a relational database system.

The second group is people who want to do complex analytics which is machine learning, data clustering, singular value decomposition and so on. That stuff is all defined on arrays and not on tables.

Now that stuff - complex analytics - is going to get much more popular off into the future but that is array based. So either you simulate that stuff by simulating arrays on top of tables or you use an array-based engine.
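As a rough sketch of what "simulating arrays on top of tables" can mean, the Python below computes the same matrix-vector product twice: once on a native array, and once on a table of (row, column, value) records using a join and a group-by. The data and the function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# A 2x3 matrix held as a native array (list of lists), plus a vector.
matrix = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]
vector = [1.0, 0.5, 2.0]


def matvec_array(m, v):
    """Array engine view: walk each row of the array directly."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


# The same data "simulated" as tables: one record per matrix cell
# (i, j, value) and one record per vector entry (j, value).
cell_table = [
    (i, j, value)
    for i, row in enumerate(matrix)
    for j, value in enumerate(row)
]
vector_table = [(j, value) for j, value in enumerate(vector)]


def matvec_table(cells, vec_rows, n_rows):
    """Table view: join cells to the vector on j, then group by i and sum."""
    vec_lookup = dict(vec_rows)          # join key j -> vector value
    sums = defaultdict(float)
    for i, j, value in cells:
        sums[i] += value * vec_lookup[j]  # join on j, aggregate per i
    return [sums[i] for i in range(n_rows)]


print(matvec_array(matrix, vector))
print(matvec_table(cell_table, vector_table, len(matrix)))
```

The answers match, but the table version stores an index pair with every value and needs a join plus an aggregation to do what the array version does with a simple loop. That is the kind of overhead the simulation approach carries.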

We will see what happens but this is a market where there is no competitive advantage to a relational database system.

The third group is graph processing, so if you are Facebook, your social network is a big graph. So if you want to find the average distance from me to you that is a graph calculation. Again, table-based systems have no obvious advantage and we'll see how that market unfolds as it gets bigger.
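The "average distance" calculation mentioned above can be sketched as a breadth-first search over an adjacency list. This is a minimal Python illustration on an invented five-person network; the names, the friendships and the helper functions are assumptions for the example, and a real social graph would need a far more scalable approach.

```python
from collections import deque

# Hypothetical, tiny friendship graph (undirected, stored as adjacency lists).
friends = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann", "dan"],
    "dan": ["bob", "cat", "eve"],
    "eve": ["dan"],
}


def hops_from(graph, start):
    """Breadth-first search: fewest hops from start to every reachable person."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def average_distance(graph):
    """Mean shortest-path length over all ordered pairs of distinct, connected people."""
    total = 0
    pairs = 0
    for person in graph:
        for other, hops in hops_from(graph, person).items():
            if other != person:
                total += hops
                pairs += 1
    return total / pairs


print(hops_from(friends, "ann"))
print(average_distance(friends))
```

The work here is repeated traversal from node to neighbour, which maps naturally onto adjacency lists. Expressed over tables, each hop would typically become another self-join.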

And then there is the Hadoop market.

So, as I see it, in two thirds of the market, the legacy vendors are on the wrong side of the technical disconnect. In the other one third, they have no obvious competitive advantage.

In 2005 I wrote a paper that said, "One size fits all: an idea whose time has come and gone" and in 2015 I make a stronger statement which is, "One size fits none". The legacy implementations from Oracle, IBM and Microsoft are good for, essentially, nothing.

Q: Isn't that a very strong statement?

A: If you believe my categorization there are two pieces that are each a third and four pieces that are each a twelfth. And in all cases the current implementations from the big relational vendors, who I'll affectionately call the elephants, are not particularly good at any of those markets.

Or, to put it differently, there is something else at every point that is better than every single one of them.

Q: So is it fair to say that you see the RDBMS market becoming not just more of the same, with a handful of suppliers like Oracle dominating, but many different, specialised implementations instead?

A: Well, firstly yes, but it is unclear just exactly how many of these there are going to be. For instance, let's look at the Hadoop market.

OK, so the current poster child in the Hadoop market is Spark [an Apache engine for fast cluster computing], and according to Matei Zaharia (creator of Spark and co-founder and CTO of Databricks) some 79 percent of the accesses to Spark are from SQL.

So Spark is a SQL market. And if you look at Cloudera, they are pushing Impala and Impala is a SQL implementation, so I think the MapReduce [a model for processing large data sets] piece of the Hadoop stack is dead on arrival.

Basically the really important Hadoop market is going to be a SQL market. So if you look at Impala, architecturally Impala is a column store and it looks exactly like [HP] Vertica or [Amazon] Redshift or any of the other relational, column store implementations. So I think most of the Hadoop market is going to coalesce with the data warehouse market.

Then you have the NoSQL market, which is in two pieces. There was a system called Sleepycat [the company behind Berkeley DB] which was sold to Oracle some years ago, and if you want a low-end system that's fine. There is a lot of room for systems like Sleepycat and that is largely what the NoSQL market looks like.

Other than that there are 100 or so vendors involved in that market who can't possibly survive without some standards and the most likely standard is going to be SQL in the end.

If you look at ones like [Apache] Cassandra or Mongo they both have higher level languages that look a lot like SQL.

There are eventually going to be four or five and I expect there to be vertical market implementations, times four, five, six or seven. Some number like that.

If you are one of the established vendors who are selling a legacy, relational, row store and you are looking out at this landscape and the possibility of having to have four or five different systems it will be interesting to see how they plan to morph from what they have got off into this future.

There is this great book by Clayton Christensen called The Innovator's Dilemma and I think that the legacy database vendors are up against the innovator's dilemma in spades.

I think it is an exciting time to be a database professional because the market is at this time of transition.

Q: So what about these large vendors like HP, IBM and so on. How are they going to get on?

A: HP is a good one. They bought Vertica which is a very, very, very good data warehouse product. I don't have anything to do with them but I hear they are doing very well.

So the question for them is what is your strategy going forward? I don't work for HP so I can't talk about what they might do but I can talk about what I think might be a winning strategy.

I think data warehouses never exist in isolation. Take the example of a guy I spoke to at Amazon a few years ago. He said, "What I want to do is to compare today against yesterday. Yesterday is in my data warehouse and today is in my transaction processing system."

So upstream from every warehouse is either a stream processing or operational data system, so the minute you reach over to try and compare real-time against history you have got to integrate systems that are upstream from a warehouse. So I think a winning strategy is to find out what upstream systems you need to integrate with and then integrate with them.

Another winning strategy is that if you look at someone like FedEx they have got something like 5,000 operational data systems and that is not atypical - the large telcos like Verizon have 10,000.

So if you look at a company like that now, they are getting data from 10, maybe 20 operational data systems and they are using traditional extract, transform and load technology.

Now, what about the other 4,980 data systems? They are currently just silos. And what about public data off the web and so on. So, in my opinion, the desire to integrate silos is huge.

And following on from that the desire for the business analysts to make better decisions is the next step. I would be investing very heavily in data integration technology, especially in the integration with upstream systems.

Another point that is significant is that right now this data is used by business analysts using various tools, like [IBM] Cognos and so on. Now a BI tool is simply a graphical interface that lets you submit SQL queries.

Now suppose you are Walmart. Walmart has a system that monitors any item that goes under any wand in the Walmart system. Now we had a lot of snow in Boston last winter and if you are the Walmart guy responsible for re-provisioning those stores you would want to run queries like show me the stuff that sold in the stores the week before the storm and show me the stuff that sold in the week after the storm.
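The before-and-after query described above might look something like the following sketch, run here against an in-memory SQLite database from Python. Everything in it is hypothetical: the `sales` table, its columns, the sample rows and the storm date are invented for illustration and say nothing about how Walmart's systems are actually built.

```python
import sqlite3

# Hypothetical point-of-sale table; schema and rows are made up for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, sold_on TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("snow shovel", "2015-01-22", 40),
        ("snow shovel", "2015-01-29", 5),
        ("bottled water", "2015-01-24", 30),
        ("bottled water", "2015-01-30", 12),
        ("rock salt", "2015-01-25", 25),
        ("garden hose", "2015-01-10", 3),  # falls outside both one-week windows
    ],
)

STORM_DAY = "2015-01-27"  # assumed date, for illustration only

# Units sold per item in the week before and the week after the storm day.
query = """
SELECT item,
       SUM(CASE WHEN sold_on < :storm THEN quantity ELSE 0 END) AS week_before,
       SUM(CASE WHEN sold_on > :storm THEN quantity ELSE 0 END) AS week_after
FROM sales
WHERE sold_on BETWEEN date(:storm, '-7 days') AND date(:storm, '+7 days')
GROUP BY item
ORDER BY item
"""
report = conn.execute(query, {"storm": STORM_DAY}).fetchall()

for item, week_before, week_after in report:
    print(f"{item:15} before={week_before:3} after={week_after:3}")
```

The result is exactly the kind of "big table that told you what you sold" that a BI tool would display: a historical summary, with no prediction attached.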

Now suppose that instead of hiring a business intelligence person you hired a data scientist instead. He will build a predictive model of what will sell. So ask yourself, would you rather have a predictive model or a big table that told you what you sold?

So what I think is going to happen is that over the next decade or so data scientists are going to replace business analysts.

Right now there are not enough data scientists so the supply is going to be limited because there are not enough talented people. So eventually that will get fixed and so we will upgrade to more sophisticated analytics.

Q: So how would you define a good data scientist?

A: Well, you aren't going to retrain your analysts, because a data scientist has to understand statistics and data mining, so it is a skill-set that is just now becoming a graduate programme.
