CPUs, Memory, Storage and Database Engines: The Shape Of Things To Come

flying car

The problem with predicting the future is that you can end up with things wildly out of kilter with the way reality unfolds. As an illustration, take the “flying car” on the right: this is the sort of thing we should all be travelling around in by now if predictions made back in the 1960s had come true.

I will start off by outlining where we have come from and where we are right now in terms of Intel Xeon CPU architecture. This is (at the time of writing) the previous incarnation of the Xeon, the Core 2 series to be exact:

Core 2

The key points to take away from this are that the Core 2 architecture is not modular; for example, four-core versions of Core 2 Xeons were created by fusing two dual-core CPUs together. All memory access was performed via the “north bridge” memory controller and all IO device access via the “south bridge”. The ramification of this is added latency in talking to the north and south bridge chips, even before talking to memory and IO devices respectively. Finally, only the very last generation of Xeons based on the Core 2 architecture had any L3 cache.

In 2008 the “Core i” architecture appeared. Here is what its first iteration, ‘Nehalem’, looks like:

Nehalem

This is a genuinely modular architecture: the memory controller is now integrated onto the CPU die and the front side bus is replaced by the QuickPath Interconnect (QPI). Hyper-threading died out after the Pentium 4 (thanks to Joe Chang for information on this), having never realised its potential because two threads per physical core choked on the three execution units at the CPU’s back end. Hyper-threading returned in the Core i series; more execution units, plus the use of CPU cycles that would otherwise be dead while one thread waits on main memory to service a second thread per physical core, resulted in something that fulfilled its potential. Finally, the cache hierarchy now has a level 0 cache for storing decoded micro-operations (the RISC-like instructions that the CPU runs on internally).

Here is an illustration of how hyper-threading works in the Core i series Xeons:

hyperthreading

To recap on the pipelined architecture that modern CPUs have, here is a slide from my deck:

pipeline

Empty slots in the pipeline are dubbed ‘bubbles’:

bubbles
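To make bubbles a little more concrete, here is a minimal sketch of a toy, single-issue core. It is nothing like a real Xeon pipeline: the `run` function, the "stall count" model and the sample workload are all assumptions made up for illustration. Each hardware thread is a list of stall counts (the number of cycles the thread has to wait, on main memory say, after issuing each instruction), and a cycle in which no thread has an instruction ready is counted as a bubble.

```python
def run(threads):
    """Simulate a toy single-issue core shared by one or more hardware threads.

    Each thread is a list of stall counts: after an instruction issues, the
    thread must wait that many cycles (think cache miss) before its next
    instruction is ready. Returns (total_cycles, bubbles).
    """
    ready_at = [0] * len(threads)   # first cycle each thread may issue again
    position = [0] * len(threads)   # index of each thread's next instruction
    cycle = bubbles = 0
    while any(position[t] < len(threads[t]) for t in range(len(threads))):
        for t in range(len(threads)):
            if position[t] < len(threads[t]) and ready_at[t] <= cycle:
                # Issue this thread's next instruction and start its stall.
                ready_at[t] = cycle + 1 + threads[t][position[t]]
                position[t] += 1
                break
        else:
            bubbles += 1  # no thread had an instruction ready: an empty slot
        cycle += 1
    return cycle, bubbles


work = [0, 3, 0]            # the second instruction stalls for three cycles
print(run([work]))          # one thread on its own
print(run([work, work]))    # two threads sharing the core
```

In this toy model one thread on its own takes 6 cycles, 3 of them bubbles. Two such threads sharing the core finish in 8 cycles with 2 bubbles, rather than the 12 cycles and 6 bubbles of running them back to back, which is the "use of dead CPU cycles" idea behind hyper-threading in miniature.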

The second generation microarchitecture of the Core i series was dubbed ‘Sandy Bridge’:

sandybridge

This improved on the previous generation through an integrated PCI Express 2.0 controller and a bidirectional ring bus connecting the level three cache to the CPU cores. Advanced Vector Extensions (AVX), a new “single instruction multiple data” (SIMD) instruction set, are introduced, and finally there is something quite interesting called Data Direct IO (DDIO), which allows data to be read from and written to the level three cache whilst bypassing main memory:

ddio

Data Direct IO was primarily developed in order to reduce latency when talking to gigabit ethernet cards; however, I understand that there are quad port host bus adapters out there which can use this. The graph below is reproduced from an Intel white paper and illustrates the performance gains to be had from leveraging Data Direct IO:

ddio perf

The third generation microarchitecture of the Core i series is Haswell. This is the publicly visible road map of Core i processors:

Intel road map

All of this raises the question of what we are going to see in the future. Let’s rewind a bit to Moore’s law, a ‘law’ that was coined by Intel co-founder Gordon E. Moore:

Moores Law

“Moore’s law” states that the density of transistors on an integrated circuit doubles every two years. The graph above is largely based on chips being fabricated using the complementary metal-oxide-semiconductor (CMOS) process. The foundation of integrated circuits as we know them is the transistor: charge is applied at a ‘gate’ to make a transistor work like a switch by allowing current to travel from a part of the transistor called the ‘source’ to another part called the ‘drain’. The gate uses an insulator to put the transistor in an off state. When the manufacturing process gets down to a certain size (somewhere in the region of 5nm has been mooted), an electron tunnelling effect can take place whereby electrons can travel from the source to the drain irrespective of whether any charge has been applied at the gate. If you will accept my apology for this brief diversion into the world of physics, this is when a new manufacturing process will be required; photonics and the use of graphene have been bandied around as potential successors to CMOS.
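The arithmetic behind the two-year doubling is easy to sketch. The function below is purely illustrative: the name `projected_density` and its parameters are my own, and it assumes a perfectly smooth exponential, which real process nodes do not follow.

```python
def projected_density(initial, years, doubling_period=2.0):
    """Transistor density after `years`, assuming it doubles every
    `doubling_period` years (an idealised reading of Moore's law)."""
    return initial * 2 ** (years / doubling_period)


# Relative to a starting density of 1, ten years means five doublings.
print(projected_density(1, 10))
```

Five doublings in ten years is a factor of 32, and twenty years is a factor of 1024, which is why even a small slip in the doubling period matters so much over a couple of decades.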

The future can sometimes be predicted by extrapolating from what has happened in the past. The general trend in CPU design has been for CPUs to evolve into systems on a chip, integrating more and more features that formerly lived off the CPU die. I see this general trend continuing:

CPU future

Perhaps the most interesting things are going to happen in the world of storage: at some stage in the future, primary storage and main memory will converge. Hewlett Packard are devoting a significant amount of their R&D budget to a project called “The Machine”. I will not go into this in detail; suffice it to say that it focusses on collapsing the memory and storage hierarchy such that everything is stored in memristor non-volatile RAM (NVRAM), and on the use of photonics instead of copper to glue everything together.

future storage

Storage and CPU technology have not followed the same Moore’s law curve; column store technology exists as a means of bridging this gap. The tenets on which spinning disk storage has been based hark back to this beast, the IBM RAMAC, the first commercial computer to use a moving head disk drive:

RAMAC

When NVRAM becomes viable on a large commercial scale, we might well see the same quantum leap experienced when mainframes made the transition from tape to spinning disks. I still don’t think that storage operating at the speed of RAM will be fast enough; this is where stacked memory comes in. There are two competing consortiums: the “Hybrid Memory Cube” consortium includes Intel and Micron amongst others, and the other, which is behind “High Bandwidth Memory”, includes AMD and Hynix. Both technologies are very similar. This quote, taken directly from the Micron site, summarises what stacked memory technology can deliver:

With HMC, you can move data up to 15 times faster than with a DDR3 module and use up to 70% less energy and 90% less space than with existing memory technologies

In the here and now of storage, 2015/16 should see Non-Volatile Memory Express (NVMe) become the de facto standard for talking to flash storage. NVMe is much more efficient in terms of latency and CPU cycles than legacy protocols (SCSI etc.) because the stack used to communicate between CPUs and the flash is much smaller; it has also been designed from the ground up to exploit parallel IO. AnandTech has a very good article on NVMe. Following on from this will be the ability of CPUs to talk to flash storage over network fabrics via “NVMe over Fabrics”.

Finally, I’d like to cover one or two things that I see coming to the world of databases. Column store databases will be the standard by which most sub web scale data warehouses and marts will be processed (I haven’t used the term big data, because I hate it !!! data is data is data !!!). Most people probably know this already. In the Microsoft research paper Enhancements To SQL Server Column Stores, reference is made to a database project from the University of Amsterdam called MonetDB; if you read up on this, it includes database cracking and query recycling. An in-depth look into database cracking can be found in the research paper Database Cracking: Towards Auto-Tuning Database Kernels. To give you a flavour of what database cracking is about, here is an excerpt taken from it:

Let us give a simplified example using a simple selection query. Cracking is applied at the attribute level, thus a query results in physically reorganizing the column (or columns) referenced, and not the complete table. Assume a query that requests A < 10 from a table. A cracking DBMS clusters all tuples of A with A < 10 at the beginning of the column, pushing all tuples with A ≥ 10 to the end. A future query requesting A > v1, where v1 ≥ 10, has to search only the last part of the column where values A ≥ 10 exist. Similarly, a future query that requests A < v2, where v2 < 10, has to search only the first part of the column.
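The excerpt can be turned into a small sketch. This is my own simplified illustration of the idea and not the MonetDB implementation: the names `crack_in_two` and `CrackedColumn`, and the pair of sorted lists used as a "cracker index", are assumptions. Each query partitions only the piece of the column that its pivot falls into, and remembers where the boundary ended up so that later queries touch less data.

```python
import bisect


def crack_in_two(column, lo, hi, pivot):
    """Partition column[lo:hi] in place so that values < pivot come first.
    Returns the index of the first value >= pivot."""
    i, j = lo, hi - 1
    while i <= j:
        if column[i] < pivot:
            i += 1
        else:
            column[i], column[j] = column[j], column[i]
            j -= 1
    return i


class CrackedColumn:
    def __init__(self, values):
        self.values = list(values)
        self.pivots = []     # sorted pivots the column has been cracked on
        self.positions = []  # positions[k]: index of first value >= pivots[k]

    def _crack(self, pivot):
        k = bisect.bisect_left(self.pivots, pivot)
        if k < len(self.pivots) and self.pivots[k] == pivot:
            return self.positions[k]          # already cracked on this pivot
        # Only the piece between the neighbouring boundaries is reorganised.
        lo = self.positions[k - 1] if k > 0 else 0
        hi = self.positions[k] if k < len(self.pivots) else len(self.values)
        pos = crack_in_two(self.values, lo, hi, pivot)
        self.pivots.insert(k, pivot)
        self.positions.insert(k, pos)
        return pos

    def less_than(self, v):
        return self.values[:self._crack(v)]

    def at_least(self, v):
        return self.values[self._crack(v):]


column = CrackedColumn([13, 4, 55, 9, 12, 7, 1, 19])
print(sorted(column.less_than(10)))  # cracks the whole column on 10
print(sorted(column.at_least(12)))   # only reorganises the A >= 10 piece
print(sorted(column.less_than(5)))   # only reorganises the A < 10 piece
```

The first query has to look at every value, but the second and third each reorganise only four values, mirroring the A < 10 example in the excerpt. Note that the sketch answers A ≥ v rather than A > v to keep the boundary handling simple.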

In a nutshell, database cracking is about re-organizing the data in a database online, based on the query workload. “Query recycling” is also mentioned in the MonetDB project; this is the storing of partial query results such that, where possible, results can be returned for queries without having to construct the entire result set from base table data from scratch. Both Oracle and MySQL have very basic result caching; from what I understand, MonetDB has something more sophisticated whereby the entire query workload is scanned and intermediate results are cached on the fly.
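As a rough illustration of the recycling idea, here is a minimal sketch that caches the intermediate result of each individual predicate rather than the final result of each whole query. The `Recycler` class, its methods and the sample table are all hypothetical; I am not claiming this is how MonetDB does it, only that it shows why caching intermediates goes further than caching whole result sets.

```python
import operator

# Comparison operators the toy query language understands.
OPS = {"<": operator.lt, ">=": operator.ge, "==": operator.eq}


class Recycler:
    def __init__(self, table):
        self.table = table   # column name -> list of values
        self.cache = {}      # (column, op, value) -> matching row ids
        self.hits = 0        # how many intermediates were reused

    def select(self, column, op, value):
        """Row ids satisfying one predicate, computed at most once."""
        key = (column, op, value)
        if key in self.cache:
            self.hits += 1
        else:
            compare = OPS[op]
            self.cache[key] = frozenset(
                row for row, v in enumerate(self.table[column])
                if compare(v, value)
            )
        return self.cache[key]

    def query(self, first, *rest):
        """AND together one or more (column, op, value) predicates."""
        rows = self.select(*first)
        for predicate in rest:
            rows = rows & self.select(*predicate)
        return sorted(rows)


sales = Recycler({
    "region": ["eu", "us", "eu", "eu"],
    "amount": [5, 20, 30, 8],
})
print(sales.query(("region", "==", "eu"), ("amount", "<", 10)))
print(sales.query(("region", "==", "eu"), ("amount", ">=", 10)))
print(sales.hits)
```

The two queries are different, so a whole-query result cache would miss on the second one, yet the second query still reuses the cached `region == "eu"` intermediate and only has to scan the `amount` column.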

If what is mooted about non-volatile memory comes to pass, IO will largely be a solved problem. The focus in optimising database engine performance at a low level will shift to crunching data as fast as possible once it’s on the CPU, and the answer to this problem to a large degree includes the leveraging of vectorisation. More simply put, vectorisation is the ability to process data in sets using as few CPU cycles as possible. Single instruction multiple data (SIMD) instructions permit this, as does parallel processing.

Graphics processing units (GPUs) are ideally suited for crunching data in parallel on a large scale. There are research papers available which discuss the leveraging of GPUs by database engines, and there is also a startup company called SQREAMDB who claim to have developed a column store engine that uses GPU technology. There is one glaring problem with this, and that is that you cannot run an operating system on a GPU; therefore, unless the GPU co-exists on the same die as the CPU, performance is bound by the PCI bus. This might not be a problem for much longer, as Intel have mooted the possibility of a socketed version of their Xeon Phi Knights Landing co-processor, which could talk to a conventional CPU via the QuickPath Interconnect.

The Xeon Phi is part of Intel’s “Many Integrated Core” (MIC) product range. People of a similar age to myself may remember that before the 80486 came out, you could buy a maths co-processor to sit alongside the 80386; this was then integrated onto the 80486 processor. A Xeon Phi is essentially a massively parallel processing co-processor on steroids.

In summary, below is what I think the hardware architecture of the future, to which database engines will have to align, will look like. The SIMD capabilities of conventional processors will evolve considerably and be augmented by GPU/APU cores on the same die. A socketed version of a Xeon Phi co-processor will make it practical to offload activities that can be broken down into many smaller tasks for parallel execution on a massive scale. Stacked memory cube technology may replace what we currently know as main memory, and mass storage will be replaced by non-volatile random access memory.

future

I should give a final word to AMD, because whilst it’s well known that Intel is the undisputed single threaded performance king of the x86/64 world, there are lots of things we use in the x86/64 world of CPUs today which are down to AMD:

amd
