The Peaxy Executive Summary Series is designed to explain quickly and simply what business leaders need to know about big data and data access systems.
The world is being “datafied,” and the result is Big Data—data too large, complex, and dynamic for conventional data tools to capture, store, manage, and analyze. For background on the proliferation of data and the many ways companies are using it, see our previous posts, What is Big Data? and Big Data Examples.
WHAT ARE THE CHALLENGES FOR BIG DATA?
Research and advisory firm Gartner has expressed them as the three Vs:
Volume: The magnitude and scale of data generated every second.
Velocity: The pace at which data is generated, flows through the system, and is analyzed.
Variety: The myriad types of data we want to store and analyze, both structured (tabular) and unstructured (free-form).
WHAT ARE THE ISSUES WITH DATA VOLUME?
The primary challenges posed by this flood of data are storage, access, and analysis, all of which exceed the capabilities of conventional relational databases. Storage is mainly a question of choosing and implementing a scalable data infrastructure.
Access and analysis are more problematic. Many companies already store large quantities of archived data (such as logs), but cannot readily process and analyze it in place. A hyperscale data management system like Peaxy Hyperfiler® can enable rapid access across a vast volume of data.
To query and analyze Big Data, you can choose between two broad architectures—a data warehouse using massively parallel processing (MPP) or a Hadoop-based solution. The decision may hinge on another “V”—the variety of your data. While data warehouses are oriented toward structured data, Hadoop places no restrictions on data type or structure.
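For the technically curious, here is a minimal sketch of what the Hadoop side of that choice looks like: a mapper and reducer for Hadoop Streaming that count error codes in free-form log lines. The file name, the ERR-prefix log convention, and the HDFS paths are illustrative assumptions, not a prescribed setup.

```python
#!/usr/bin/env python3
"""Minimal Hadoop Streaming sketch: count error codes in free-form logs.

Indicative invocation (paths and log format are assumptions):

  hadoop jar hadoop-streaming.jar \
      -files wordcount.py \
      -mapper "python3 wordcount.py map" \
      -reducer "python3 wordcount.py reduce" \
      -input /logs/raw -output /logs/error-counts
"""
import sys

def mapper():
    # Emit "<code>\t1" for each token that looks like an error code.
    for line in sys.stdin:
        for token in line.split():
            if token.startswith("ERR"):  # assumed log convention
                print(f"{token}\t1")

def reducer():
    # Hadoop sorts mapper output by key, so identical codes arrive together.
    current, count = None, 0
    for line in sys.stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current and current is not None:
            print(f"{current}\t{count}")
            count = 0
        current = key
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

The MPP warehouse equivalent would be a single SQL GROUP BY, but the warehouse needs its schema defined up front; Hadoop parses the raw lines at job time.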
HOW SHOULD I HANDLE DATA VELOCITY?
Velocity is the speed at which data is created, stored, analyzed, and visualized. In some cases, the data streams in too fast even to store, so some level of preprocessing must occur. (The Large Hadron Collider generates so much data that scientists have to discard most of it and hope they’ve retained what’s useful.) Otherwise, it’s relatively straightforward to stream fast-moving data into bulk storage for later “batch” analysis.
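As a simple illustration of that preprocessing step, the sketch below thins a hypothetical sensor feed so that only meaningful changes reach bulk storage; the stream, field names, and threshold are all assumptions for this example.

```python
import json
import random
import time

def sensor_stream(n=10_000):
    """Stand-in for a real firehose (hypothetical data source)."""
    for _ in range(n):
        yield {"t": time.time(), "temp_c": 20 + random.gauss(0, 0.5)}

def significant_changes(stream, threshold=1.0):
    """Drop readings that barely differ from the last one we kept."""
    last = None
    for reading in stream:
        if last is None or abs(reading["temp_c"] - last) >= threshold:
            last = reading["temp_c"]
            yield reading

# Stream the reduced feed into bulk storage for later batch analysis.
with open("readings.jsonl", "w") as out:
    for reading in significant_changes(sensor_stream()):
        out.write(json.dumps(reading) + "\n")
```

The shape is the same whatever the feed: make a cheap decision per record, keep what matters, and leave the heavy analysis for later.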
The key challenge of data velocity is, again, to access, analyze, and respond to that firehose of data. For example, online retailers can gain competitive advantage if they can quickly analyze the products their customers view and immediately recommend additional purchases.
If quick response is a priority, consider Apache HBase rather than plain Hadoop, which was designed for after-the-fact batch processing. HBase is a real-time database front end to the Hadoop Distributed File System (HDFS); Facebook has used it for messaging and other quick-response services. Other non-traditional (specifically “NoSQL”) databases have also arisen for fast retrieval of information when relational database models don’t fit.
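To make the HBase option concrete, here is a minimal sketch using happybase, a Python client for HBase’s Thrift gateway. The hostname, table name, row key, and column family are assumptions for illustration only.

```python
import happybase  # Python client for HBase's Thrift gateway

# Connection details are illustrative assumptions.
connection = happybase.Connection("hbase-thrift.example.com", port=9090)
table = connection.table("page_views")

# Write a single cell: low-latency, row-keyed access on top of HDFS.
table.put(b"user42|2024-06-01", {b"cf:product": b"SKU-1009"})

# Read it straight back -- no batch job required.
row = table.row(b"user42|2024-06-01")
print(row[b"cf:product"])  # b'SKU-1009'
```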
WHAT CAN I DO ABOUT DATA VARIETY?
Today’s incoming data sources—including emails, photos, video, social media posts, medical records, voice recordings, and sensor data—far outstrip the numbers, dates, and short text strings anticipated by traditional relational databases. These free-form data types don’t fit neatly into an RDBMS and can’t easily be retrieved or analyzed there. While database systems have evolved somewhat to handle unstructured data, they generally require the data types and content to be defined beforehand.
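One common workaround is “schema on read”: store records exactly as they arrive and impose structure only at query time. A minimal sketch, with the record formats and field names as assumptions:

```python
import json

# Heterogeneous records, as they might arrive from different sources.
raw_lines = [
    '{"type": "email", "from": "a@example.com", "subject": "Q3 numbers"}',
    '{"type": "sensor", "id": 7, "temp_c": 81.4}',
    '{"type": "tweet", "user": "@ops", "text": "deploy complete"}',
]

# No upfront schema: interpret each record only when we read it.
for line in raw_lines:
    record = json.loads(line)
    if record["type"] == "sensor" and record["temp_c"] > 80:
        print(f"hot sensor {record['id']}: {record['temp_c']} C")
```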
Peaxy Hyperfiler is optimized for accessing unstructured data files both large and small and can also manage structured data.
ARE THOSE THE ONLY CHALLENGES?
To these three Vs, two more are sometimes added, along with an L:
Veracity: The uneven accuracy and credibility of raw data as it flows in, which may need to be corrected and transformed. Credit rating bureaus must constantly correct credit reports that are inaccurate due to misreporting from creditors or the ambiguity of multiple people with the same name (or the same person going by different names, such as nicknames, maiden names, or initials).
To deal with data “messiness,” data marketplaces such as Gnip offer common data sources (such as social media feeds) with some cleanup and correction already applied. These feeds can be useful if you don’t need to collect the data yourself.
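When you do source your own data, even simple normalization goes a long way. Here is a minimal sketch of the name-variant problem described above, using a hypothetical alias table; real bureaus rely on far richer matching logic.

```python
# Hypothetical alias table: variants seen in the wild -> canonical name.
ALIASES = {
    "bob smith": "Robert Smith",
    "r. smith": "Robert Smith",
    "roberta smith": "Roberta Smith",  # a *different* person
}

def canonicalize(name: str) -> str:
    """Map a reported name onto its canonical form, if we know one."""
    return ALIASES.get(name.strip().lower(), name.strip())

reports = ["Bob Smith", "R. Smith", "Roberta Smith", "Jane Doe"]
for name in reports:
    print(f"{name!r} -> {canonicalize(name)!r}")
```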
Variability: The inconsistency of raw data as it flows in, which may need to be interpreted. If an article refers to Washington, does that mean the person, the city, the state, or any other entity with that name? If the same word has different meanings in different contexts, it can complicate sentiment analysis in tweets or Facebook posts.
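A toy sketch of context-based disambiguation follows. The keyword lists are deliberate oversimplifications—every cue word here is an assumption—and production systems use statistical language models instead.

```python
# Crude word-sense disambiguation: score each sense by context keywords.
SENSES = {
    "person": {"george", "president", "delaware"},
    "state":  {"seattle", "governor", "rainier"},
    "city":   {"dc", "congress", "capitol"},
}

def disambiguate(text: str) -> str:
    """Pick the sense of 'Washington' whose cue words best match the text."""
    words = set(text.lower().split())
    scores = {sense: len(words & cues) for sense, cues in SENSES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(disambiguate("Washington addressed Congress near the Capitol"))  # city
print(disambiguate("Washington crossed the Delaware"))                 # person
```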
Longevity: The need to keep data around long enough to be accessible when needed. Although incoming data is often used immediately, its useful life may span years or decades. Health status in middle or old age may correlate longitudinally with health measures in childhood. Predictive maintenance in aircraft engines and other machinery may require access to design documents, simulation data, and service records from years earlier.
CAN BIG DATA HELP MY BUSINESS?
Big Data can open up new analytical opportunities that were previously infeasible. For more details and some suggested next steps, see our upcoming post, Big Data Opportunities.
WHERE CAN I LEARN MORE ABOUT BIG DATA?
We also recommend these resources for a deeper dive into Big Data:
- What the heck is … big data? by Bernard Marr on LinkedIn
- Volume, velocity, variety: What you need to know about Big Data by O’Reilly Media in Forbes
WHO IS PEAXY? WHY ARE YOU TELLING ME THIS?
Peaxy software enables universal data access for the enterprise, so you can save, find, analyze, manage, and reuse your data, whenever it was created and wherever it is located. Our Executive Series is designed to help get you up to speed quickly on the key topics related to Big Data and data access. Because the more you know, the more you’ll prefer Peaxy.