In the past couple of years, we have seen the legacy data access and delivery system industry falter while a new software-defined data access and delivery (SDS) industry has emerged. Hitherto the definition of SDS has been nebulous, leaning vaguely on software-defined networking (SDN) and virtualization. Data access and delivery system buyers need a criterion that lets them assess whether a system is for pioneers crossing the chasm or an old legacy product for laggards. We argue that flexible multi-tiering is Occam's razor: the single criterion that winnows down the field.
Server Hardware Is Now a Commodity
Like most technologies, SDS emerges at a confluence of trends and players. Factors include server commoditization, virtualization, fast networking and software-defined networking (SDN), the sunsetting of mechanical media, and the abandonment of rigid architectures.
Servers have matured to the point where there is little differentiation between brands other than price, leading to a decline in prices and a transition from branded servers to white boxes based on initiatives like the Open Compute Project (OCP). Original design manufacturers (ODMs) bid with proposals based on OCP specifications. This has hurt the traditional data access and delivery industry, which has relied on generous profit margins on its custom hardware.
Virtualization and SDN allow the physical locus of computation to be moved around the data center, spreading heat generation evenly across the facility and shifting network bandwidth to wherever there is a bottleneck. Fast 40GbE and 100GbE local area networks (LANs) no longer require computation and data to sit close together. We are no longer interested in geometry, just in topology.
Storage Homogeneity Is an Oxymoron in the Age of SSDs
Drive manufacturers have evolved from mechanical hard disk drives (HDDs) to solid state drives (SSDs) based on flash memory. By the third quarter of 2015, with the introduction of 3D flash memory chips, SSDs reached the magic $1.50 per GB price point of high-performance, enterprise-class 15,000 rpm HDDs. Concomitantly, new write algorithms optimized at the system level instead of the drive level have extended the life of SSDs to dozens of years, compared with the typical three years of HDDs. Last but not least, SSDs require less power and cooling than HDDs.
The typical three-year HDD lifetime has led legacy manufacturers to design rigid systems with limited expandability: after three years the customer is expected to do a forklift upgrade and replace the entire data access and delivery system. Under the assumption of forklift upgrades, it is possible to dictate that all media be of the same model. A pain point of the forklift upgrade is that while the new system is being installed, datacenter operation is disrupted for weeks as data is migrated from the old system to the new one.
An SDS system must be hardware-agnostic and run on any server using any media. In particular, it must be possible to mix and match any collection of drives, because drives are no longer replaced when they accumulate too many errors but when they become technologically obsolete, and obsolescence happens at varying rates.
Multi-tiering Represents the Future of Data Access Systems
At this point in our quest for Occam's razor, we should shave away the property of being scale-up or scale-out: both legacy and SDS systems can be either. We should also shave away compression, deduplication, striping and error correction, because they apply to old and new architectures alike. Virtualization is not a discriminating factor either, because it is just a way to provision a datacenter. Finally, object storage can also be shaved away, because both legacy and emerging data access and delivery systems can be based on blocks, objects, or files.
What is new is that rigid legacy systems have two to four media tiers: HDD for capacity, SSD for IOPS, sometimes RAM for ultimate performance, and maybe the cloud for disaster recovery. With the longevity of SSDs, an SDS system will accumulate drives across a range of capacities and performance levels. The system must behave more like an organism to which higher-performance drives are continuously added: today's fast SSD for hot data becomes tomorrow's slow drive for cold archival data. The forklift upgrade is an evil of the past.
Instead of just a few tiers, an SDS system should have hundreds of tiers into which drives are categorized. This is multi-tiering. Flexibility refers to the ability to move files to the most appropriate storage tier depending on usage patterns. Of course, there should be no need for a skilled IT architect to plan a well-balanced system: the user can write scripts (or the system can generate them from a form in the GUI) to automate workflows that move files or directories to a different tier, as in the sketch below.
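To make the idea concrete, here is a minimal sketch of the kind of policy script a user might write to demote files that have not been read recently. It is not Aureum's actual API: the `move_to_tier` call, the tier names, and the paths are assumptions made for the example.

```python
import time
from pathlib import Path

# Hypothetical tier names; a real system would expose its own catalog of tiers.
COLD_TIER = "sata-cold"
COLD_AFTER_DAYS = 90  # demote files that have not been read for 90 days


def move_to_tier(path: Path, tier: str) -> None:
    """Placeholder for the data access system's tiering call.

    In a real deployment this would invoke the vendor's CLI or REST API;
    here we only print the intended action.
    """
    print(f"{path} -> {tier}")


def demote_cold_files(root: Path) -> None:
    """Walk a directory tree and demote files whose last access is older than the threshold."""
    cutoff = time.time() - COLD_AFTER_DAYS * 86400
    for path in root.rglob("*"):
        if path.is_file() and path.stat().st_atime < cutoff:
            move_to_tier(path, COLD_TIER)


if __name__ == "__main__":
    demote_cold_files(Path("/data/projects"))
```

A script like this could run on a schedule, or the system could generate an equivalent policy from a form in the GUI.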
At Peaxy, we call Aureum's flexible multi-tiering storage classes, because storage classes are much richer than plain tiers. For example, by defining policies, the user can automate workflows that move files or directories to a different storage class. With auto-tiering, the data access and delivery system autonomously moves files or directories to the classes best suited to their actual usage patterns. The namespace and the storage classes are orthogonal: files can be moved to a different storage class without being renamed, or renamed without being moved, as the sketch below illustrates.
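One way to picture this orthogonality is to model the storage class as per-file metadata rather than as part of the path. The sketch below is our own illustration of the principle (the `Catalog` and `FileRecord` names are invented for the example), not a view of Aureum internals.

```python
from dataclasses import dataclass


@dataclass
class FileRecord:
    """Illustrative catalog entry: the path (namespace) and the storage class are independent fields."""
    path: str
    storage_class: str


class Catalog:
    def __init__(self):
        self._files = {}

    def add(self, path: str, storage_class: str) -> None:
        self._files[path] = FileRecord(path, storage_class)

    def retier(self, path: str, new_class: str) -> None:
        # Change the storage class; the name under which clients see the file is untouched.
        self._files[path].storage_class = new_class

    def rename(self, old: str, new: str) -> None:
        # Change the name; the data stays on whatever media its class maps to.
        record = self._files.pop(old)
        record.path = new
        self._files[new] = record


catalog = Catalog()
catalog.add("/sim/turbine/run42.h5", "nvme-hot")
catalog.retier("/sim/turbine/run42.h5", "sata-cold")                  # moved, not renamed
catalog.rename("/sim/turbine/run42.h5", "/archive/turbine/run42.h5")  # renamed, not moved
```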
Speed is only one factor; storage classes use other parameters to categorize the media. In a petabyte-scale system, backups are no longer meaningful, and there are a number of mechanisms to provide high availability: replication, error correction, and striping. Other parameters differentiating classes are indexing of file contents, encryption, remote replication for disaster recovery, and document lifecycle management.
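A storage class can therefore be thought of as a bundle of such parameters. The descriptor below is a hedged sketch that only mirrors the list of attributes above; the field names are ours, not a published schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StorageClass:
    """Illustrative descriptor for a storage class; all field names are hypothetical."""
    name: str
    media: str                    # e.g. "nvme", "sata-ssd", "hdd"
    replicas: int                 # number of synchronous copies
    erasure_coding: bool          # parity-based protection instead of (or besides) replicas
    stripe_width: int             # number of drives a file is striped across
    content_indexed: bool         # full-text indexing of file contents
    encrypted: bool
    remote_replication: bool      # asynchronous copy to another site for disaster recovery
    retention_days: Optional[int] # document lifecycle: None means keep indefinitely


hot = StorageClass("nvme-hot", "nvme", replicas=3, erasure_coding=False, stripe_width=8,
                   content_indexed=True, encrypted=True, remote_replication=False,
                   retention_days=None)
archive = StorageClass("hdd-archive", "hdd", replicas=1, erasure_coding=True, stripe_width=16,
                       content_indexed=False, encrypted=True, remote_replication=True,
                       retention_days=3650)
```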
Smart Multi-tiering Requires “Common Consciousness”
A local drive is active only when IO requests are issued. In a file system, drives are more active because the system must be able to detect and remedy bit rot, which becomes a real problem when there are hundreds or thousands of drives. Flexible storage classes require even more IO activity, because the system continuously monitors itself, moves files between storage classes, performs asynchronous replications, and so on.
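As a rough illustration of such background activity, the sketch below recomputes each file's checksum and compares it with the value recorded when the file was written. The `checksums.json` manifest, its format, and the paths are assumptions made for the example; a real system would scrub continuously and repair from replicas.

```python
import hashlib
import json
from pathlib import Path


def sha256(path: Path) -> str:
    """Stream a file through SHA-256 so arbitrarily large files can be scrubbed."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scrub(root: Path, manifest: Path) -> list:
    """Compare current checksums against the recorded ones; return files suspected of bit rot."""
    recorded = json.loads(manifest.read_text())  # assumed format: {"relative/path": "hex digest", ...}
    damaged = []
    for rel, expected in recorded.items():
        candidate = root / rel
        if not candidate.exists() or sha256(candidate) != expected:
            damaged.append(candidate)  # a real system would repair from a replica here
    return damaged


if __name__ == "__main__":
    for path in scrub(Path("/data"), Path("/data/checksums.json")):
        print(f"bit rot suspected: {path}")
```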
Coherency is a key performance parameter. The implementer must be well-versed in the design of concurrent systems, a difficult combination of science and art that requires skill, experience, and the patience to run Monte Carlo simulations to find potential deadlocks.
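A Monte Carlo search for deadlocks can be as simple as replaying lock-acquiring tasks under many random schedules and flagging the schedules in which neither task makes progress. The sketch below uses lock-acquisition timeouts as a crude detector; it is illustrative only and not a description of any particular product's test harness.

```python
import random
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()


def worker(first, second, report):
    """Acquire two locks; a timeout on the second acquisition is treated as a potential deadlock."""
    with first:
        if second.acquire(timeout=0.05):
            second.release()
        else:
            report.append("timed out waiting for second lock")


def one_trial(rng):
    """Run two workers with randomly chosen lock orderings and report suspected deadlocks."""
    report = []
    orders = [(lock_a, lock_b), (lock_b, lock_a)]
    threads = [threading.Thread(target=worker, args=(*rng.choice(orders), report))
               for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bool(report)


if __name__ == "__main__":
    rng = random.Random(42)
    suspected = sum(one_trial(rng) for _ in range(200))
    print(f"{suspected} of 200 random schedules showed a potential deadlock")
```

When the two workers happen to pick opposite lock orderings, each holds the lock the other needs; the timeouts break the cycle so the simulation terminates and the schedule is flagged.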
Take a look at the image included with this post. A whale takes about 5 minutes to turn 180°. On the other hand, a school of small fish switches direction in an instant. The blue whale has no escape!
The secret is that each fish in the school must act autonomously but according to common patterns; otherwise the coherency problem explodes out of control. The nodes in a distributed file system must behave as if they are entangled and share a common consciousness. If the red fish changes direction, all the other fish immediately follow it. It is the school's consciousness that entangles them, by virtue of a shared self-organization model.
We have described Occam's razor for assessing the SDS-ness of a data access and delivery system. The commoditization of server hardware and the mixing of heterogeneous storage hardware mean that smart multi-tiering is what differentiates modern data access and delivery systems. This criterion allows you to determine whether an SDS will let you cross the chasm and propel your business to new horizons, or leave you behind.