I want to propose something that’s mere imagination at this point. There’s a lot of promises made by database vendors whether closed or open source that they have created an RDBMS that can scale across nodes, making a data storage service that can expand with user needs as well as deal with hardware failures. We have the traditional players:
- Oracle has RAC
- MySQL has their cluster product
- Postgres has synchronous replication
- SQL Server has replication and fail-over clustering
The sad truth is this: none of these are fully there yet. Most of them trade the ability to transparently scale and fail over for availability. These solutions do not require major application refactoring. Oracle RAC has a shared disk, which means no IO scaling and a single point of failure (SPF). MySQL Cluster has heterogeneous nodes (although you can remove all SPFs), and it does not perform foreign key constraints. Postgres and SQL Server use replication to scale reads, but not writes. In addition, all of these solutions have performance gotchas that are not simple trade-offs, but are instead major pain points. A lot of progress has been made, and I would much, much rather use these clustering technologies than not. If MySQL Cluster supported online schema changes, hot backups, and foreign keys, I would be satisfied.
Another batch of new database systems has also arrived, termed NewSQL. These are systems that can scale and recover from failures transparently. They do not have a clustering as an option; most of these require multiple nodes in order to guarantee performance and reliability.
These put performance above all else. However, the vendors I have looked at require a major or even total replacement of database interaction code. VoltDB requires a total cluster restart when schema changes are made, and application accesses data through stored procedures. Drizzle does not implement stored procedures, views, or triggers. ScaleBase is pretty complete, but it uses sharding to implement scaling, and it looks like it requires a developer to configure the initial sharding scheme. Several solutions that look promising are still in beta (NuoDB, GenieDB, ScaleDB).
I think we can do better. I want a database that combines that power of existing SQL standards as well as allowing me to add more servers to the system. I also know that trade-offs must be made. I think an application could take a significant hit (25% – 50%) in throughput per node if it meant that to regain that performance, you could just add another node. There’s also a ton of features that most applications do not need all of the time. XML, large files, replication, non-SQL stored procedures (Java, CLR), message queues, etc., auditing, are all features that are superfluous for many applications. Here’s my wish list that I think would great a killer NewSQL database.
- Full support of standard SQL (SQL:1999, SQL:2003, or SQL:2008)
- Full support of one proprietary SQL (Oracle, T-SQL, or MySQL)
- Full ACID support
- Foreign keys and check constraints
- Materialized / indexed views
- Stored procedures / functions / triggers
- Hot schema changes
- ODBC and JDBC support
- Ad-hoc querying
- General, spatial, and bitmap indexes
- Transparent, linear scalability of reads and writes
- Homogeneous nodes
- Ability to add and remove nodes without taking cluster down
- Hot backups
- Fully script-able (programmatic API would be nice as well)
I am looking forward to trying out some of the previously-mentioned RDBMS as soon as they go into public beta or GA. I think my vision is achievable, and I think it will help solve the issues that keep arising as many applications now appeal to (hundreds of) thousands of users at a time, who are also demanding better insight into their data. There is a large need for a database that is distributed and fast such as Riak combined with the power and flexibility of development and reporting found in SQL. I hope it comes around soon!