
Two Lessons: Minimize Data Movement, Store Only What You Need

November 1, 2016 Big Data, Data Access Platform

These days storage is cheap. You can buy a 2.5-inch 16 TB SSD (actual capacity 15.36 TB) based on third-generation 3D V-NAND with 48 layers of TLC cells and 256 Gbit per die, with the promise of 40 percent higher density, twice the speed, and half the power requirement of the previous generation. You can also buy a Samsung server rack with 48 such drives, for a total capacity of 768 TB and 2 million IOPS. This might be good cause for a CIO to break out into a “petabyte-scale” jig. But hold on there: if you’re going to scale that high, there are a few reality checks to consider.

Lesson 1 – Minimize the Movement of Data

Let’s start with a few back-of-the-envelope calculations. The gizmo I used to type this post has an internal SSD. Let us also attach a fresh external 7200 RPM SATA III disk over FireWire 800. To get realistic numbers we should measure an actual application rather than bare-metal speed, so let us run a fresh backup onto the reformatted external disk. The program reports 399.4 GB needing backup (479.45 GB with padding) and a backup duration of 2:39:02, or 9,542 seconds. At that rate, 1 TB would take 9,542 / 479.45 × 1,024 ≈ 20,380 seconds, or about 5:39:40.
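
If you want to reproduce the arithmetic, here is the extrapolation as a short Python sketch; the figures are simply the ones measured above.

```python
# Back-of-the-envelope check on the backup run described above.
backed_up_gb = 479.45                   # data written, including padding
elapsed_s = 2 * 3600 + 39 * 60 + 2      # 2:39:02 -> 9,542 seconds

throughput_mb_s = backed_up_gb * 1024 / elapsed_s
seconds_per_tb = elapsed_s / backed_up_gb * 1024

print(f"sustained throughput: ~{throughput_mb_s:.0f} MB/s")  # ~51 MB/s
print(f"time for 1 TB: ~{seconds_per_tb:,.0f} s")            # ~20,380 s
```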

OK, data centers have beefier hardware than what is sitting on my desk. Still, gigabit Ethernet tops out at roughly 120 MB per second once you factor in protocol overhead. The first lesson is that moving data takes a lot of time, and you do not want to move it around more than absolutely necessary.
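
The same treatment for the network path, assuming the link is fully saturated:

```python
# Moving 1 TB over a saturated gigabit Ethernet link.
gige_mb_s = 120            # ~120 MB/s of payload after protocol overhead
mb_per_tb = 1024 * 1024    # 1 TB expressed in MB (binary units)

hours_per_tb = mb_per_tb / gige_mb_s / 3600
print(f"1 TB over gigabit Ethernet: ~{hours_per_tb:.1f} hours")  # ~2.4 hours
```

And that is the best case; in practice the link is shared and the effective time is longer.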

Lesson 2 – Store Only the Data You Need

Do you really have petabytes of useful data? For a reality check, let us look at what the big boys have. The CERN Data Centre has recorded over 100 PB of physics data over the last 20 years. Collisions in the Large Hadron Collider (LHC) generated about 75 PB of that data in the first three years of running, of which 13 PB are stored in the EOS disk pool, a system optimized for fast analysis access by many concurrent users. The EOS data live on over 17,000 disks attached to 800 disk servers; data are replicated automatically after hard-disk failures, and a scalable namespace enables fast concurrent access to millions of individual files.

The scientists at CERN are not the biggest game in town: the defenders of the free universe have 5 ZB in their Bluffdale data farm. Expense is not an issue in fulfilling this duty: to synchronize the data farm in Utah with the one at the headquarters in Fort Meade, Maryland, they have a backbone on which a single 10 Gbps tap can process 1.5 GB of packet data per second, which is 5,400 GB per hour, or 129.6 TB per day.
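
Those throughput figures are easy to verify:

```python
# Sanity check on the quoted backbone numbers.
gb_per_s = 1.5                        # packet data processed per 10 Gbps tap
gb_per_hour = gb_per_s * 3600         # 5,400 GB per hour
tb_per_day = gb_per_hour * 24 / 1000  # 129.6 TB per day (decimal TB)
print(gb_per_hour, tb_per_day)        # 5400.0 129.6
```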

So the second lesson here is that you should store only the data you can really use for your business. When shopping for a data access system, be realistic about the amount of valuable data you really have, then choose a system that minimizes the movement of data for your advanced analytics methods. By the way, as a closing reality check, the data farm in Bluffdale draws 65 MW of power.

The Modern Way to Manage Data Access

Data access platforms like Peaxy Aureum are hardware agnostic. You can start with a small system that satisfies your data needs for the next six months or so. When you start running out of capacity, you can add whatever storage devices make sense at that moment. They can be of any kind, and if they offer higher performance, Aureum’s storage-class policies will automatically and transparently migrate the hottest files to them.
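
To make the idea concrete, here is a minimal sketch of what a storage-class migration sweep does in general. It is not Aureum’s actual implementation or API; the tier paths and the one-week “hot” threshold are assumptions chosen purely for illustration.

```python
import os
import shutil
import time

# Illustrative only: tier paths and the one-week threshold are assumptions.
FAST_TIER = "/mnt/fast_tier"   # e.g. newly added high-performance SSDs
SLOW_TIER = "/mnt/slow_tier"   # e.g. older spinning disks
HOT_WINDOW_S = 7 * 24 * 3600   # files touched within a week count as "hot"

def is_hot(path):
    """A file is 'hot' if it was accessed within the assumed window."""
    return time.time() - os.path.getatime(path) < HOT_WINDOW_S

def migrate(src_dir, dst_dir, predicate):
    """Move every regular file in src_dir that satisfies predicate to dst_dir."""
    for name in os.listdir(src_dir):
        path = os.path.join(src_dir, name)
        if os.path.isfile(path) and predicate(path):
            shutil.move(path, os.path.join(dst_dir, name))

# One sweep: promote hot files to the fast tier, demote cold files to the slow tier.
migrate(SLOW_TIER, FAST_TIER, is_hot)
migrate(FAST_TIER, SLOW_TIER, lambda p: not is_hot(p))
```

In a real platform this runs continuously under a single namespace, so applications keep using the same paths while the bytes move between tiers.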

How is a data access platform different from software-defined storage? Besides storage classes that automatically move the data you need onto more accessible media, Aureum has a distributed namespace. In other words, your system can grow to a vast number of files without losing performance.

Regardless of which storage system you use, be realistic about the time it takes to move your data around, and be humble enough to store only the data you really need. If you start with a large system, there is a temptation to store everything, whether you will need it or not. With a data access platform like Aureum you can build an inexpensive on-premises system that keeps your data inside the local area network. Its HDFS API lets you run MapReduce directly on the data access system, instead of first moving the input data onto an HDFS storage cluster and then moving the results out at the end.
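
To illustrate what running MapReduce directly over an HDFS-compatible namespace looks like, here is a minimal word-count job written with the open-source mrjob library; the hdfs:// URI in the usage line is a placeholder for whatever endpoint your data access platform exposes, not a Peaxy-specific address.

```python
# wordcount.py -- a minimal MapReduce job using the mrjob library.
from mrjob.job import MRJob

class WordCount(MRJob):
    def mapper(self, _, line):
        # Emit (word, 1) for every word in the input line.
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # Sum the counts emitted for each word.
        yield word, sum(counts)

if __name__ == "__main__":
    WordCount.run()
```

Launched with `python wordcount.py -r hadoop hdfs://<your-endpoint>/path/to/input`, the job reads its input in place: there is no staging copy onto a separate HDFS cluster and no copy of the results back out.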