A data lake is “a massive, easily accessible data repository built on (relatively) inexpensive computer hardware for storing ‘big data.’” The term was coined by James Dixon of Pentaho to describe the vast data repositories used in modern Big Data applications.
Enterprises often use data lakes as repositories for reduced-order, structured data sets. Post-processing of massive data sets is usually done at the edge, which eases the bandwidth bottleneck between on-premises data and the cloud. The data lake then makes those data sets available across the company, where advanced analytics can be applied to particular business problems. But the truth is that the data lake is a fluid concept (pun intended) – here’s one illustration of the concept from EMC.
What are the advantages of data lakes?
First, they avoid duplication by consolidating data into a single repository, which, as Gartner puts it, “theoretically results in increased information use and sharing, while cutting costs through server and license reduction.”
Second, data lakes expand to encompass a company’s full data store. This is important since most companies, even those that implement data life cycle policies, end up keeping everything forever, “just in case.” Few system architects or admins can get the sign-offs required to actually press the “delete” button.
Third, they simplify the data acquisition and storage process. If high-fidelity data – for example, simulation data from heavy industrial equipment with many sensors – is too large to be transferred to the data lake, post-processing work can tease out the most useful parts needed for analytics and send only those. Advanced analytics apps can then analyze the data in the data lake.
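For readers who want to see what that post-processing step might look like, here is a minimal Python sketch. It assumes the simplest possible case: a list of raw sensor readings that is collapsed into per-window summaries before anything is sent to the data lake. The function name `reduce_readings`, the window size, and the sample values are all hypothetical and chosen only for illustration; real pipelines will use whatever reduction suits the equipment and the analytics.

```python
from statistics import mean


def reduce_readings(readings, window):
    """Collapse raw sensor readings into one summary per fixed-size window.

    Hypothetical helper: the summary fields are an assumption, not a standard.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    summaries = []
    for start in range(0, len(readings), window):
        chunk = readings[start:start + window]  # last chunk may be shorter
        summaries.append({
            "start_index": start,
            "count": len(chunk),
            "min": min(chunk),
            "max": max(chunk),
            "mean": mean(chunk),
        })
    return summaries


# Made-up temperature readings; one spike lands in the second window.
raw = [20.0, 20.5, 21.0, 35.0, 20.0, 19.5, 20.0]
reduced = reduce_readings(raw, window=3)
for row in reduced:
    print(row)
```

Seven raw values become three summary rows, and the spike is still visible in the `max` field of the second row. That is the general idea: the reduced data set is much smaller to move, yet keeps the features analysts are likely to look for.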
Finally, and perhaps most importantly, they democratize data within the enterprise. As Andrew Oliver observes, business units wishing to analyze company data no longer need to design, budget, get approvals for, and implement a data mart project that answers only a few predetermined questions. The data is all in the lake; given proper access, they can simply extract and analyze it.
What are the risks and challenges of data lakes?
The primary challenges are the flip sides of those benefits:
- Access to data across the organization
- Data quality and curation
- Security and access control
We’ll elaborate on these in our upcoming post, Meeting the Challenges of Data Lakes.
Where can I learn more about data lakes?
We recommend this resource for a deeper dive into data lakes:
- How to create a data lake for fun and profit by Andrew C. Oliver in InfoWorld
The Peaxy Executive Series is designed to explain quickly and simply what business leaders need to know about managing big data and data access systems.