
SEARCH ENABLING SERENDIPITY

OCTOBER 30, 2015 Big Data

Our working memory can hold only 7 ± 2 chunks of information. When the number of chunks grows beyond that, we recode the information by breaking it into categories that each contain 7 ± 2 chunks. We also saw that, in the case of data, we prefer a hierarchical system of managing files in which the categories are represented as folders (a.k.a. directories). Recoding, in this case, consists of creating new subfolders, or, as engineers call it, implementing an incrementally expandable pathname hierarchy.
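To make this concrete, here is a minimal sketch of such recoding in Python (the folder name, the grouping rule, and the fixed chunk size of seven are illustrative assumptions, not a description of any Peaxy tool):

```python
import os
import shutil

CHUNK = 7  # the 7 +/- 2 capacity of working memory

def recode(folder):
    """Split a flat folder into subfolders of at most CHUNK files each.

    The generic group names are a stand-in for whatever categories
    the user actually thinks in; real recoding would be recursive.
    """
    files = sorted(f for f in os.listdir(folder)
                   if os.path.isfile(os.path.join(folder, f)))
    if len(files) <= CHUNK:
        return  # still fits in working memory; nothing to recode
    for i in range(0, len(files), CHUNK):
        subdir = os.path.join(folder, f"group-{i // CHUNK:02d}")
        os.makedirs(subdir, exist_ok=True)
        for name in files[i:i + CHUNK]:
            shutil.move(os.path.join(folder, name),
                        os.path.join(subdir, name))

# recode("projects")  # grows the pathname hierarchy one level at a time
```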

As Immanuel Kant noted, categories are not natural or genetic entities; they are purely the product of acquired knowledge. One of the functions of the school system is to create a common cultural background, so that people learn to categorize according to similar rules and understand each other’s classifications. For example, in biology class we learn to organize plants according to the Systema Naturae, compiled by Carl Linnaeus in 1735.

CATEGORIZATION CHANGES OVER TIME

As we know from Jean Piaget’s epistemological studies of children, assimilation occurs when a child responds to a new event in a way that is consistent with an existing classification schema, while accommodation occurs when a child either modifies an existing schema or forms an entirely new one to deal with a new object or event. Piaget conceived of intellectual development as an upward-expanding spiral in which children must constantly reconstruct the ideas formed at earlier levels with the new, higher-order concepts acquired at the next level.

Peaxy caters to engineers and their vital datasets. In a successful organization, the engineers continuously assimilate new science and technology. To cope with the increasing cognitive load, their classification schemata are in a state of periodic accommodation. For business reasons, accommodation is not applied to artifacts produced in the past: it is not commercially meaningful to recode bygone projects.

SEARCH TRANSCENDS CATEGORIES

In the enterprise, it is common to need access to the design documentation for devices that were created long ago. How can we find a file that was classified according to a schema we can no longer remember in sufficient detail?

Fortunately, computers are good at processing vast amounts of data, and it is easy to index the contents of every file. When we are looking for a design from an old project, we can simply describe it with a few keywords and look them up in the index. A popular incarnation of such an information retrieval system is Solr, which is built on the Lucene indexing engine. This technology was created for building the search engines that let us retrieve information from the World Wide Web.
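Once the files are indexed, a keyword lookup is a single HTTP request to Solr’s standard select handler. Here is a minimal sketch, assuming a hypothetical core named designs whose documents carry id and content fields:

```python
import requests

# Query a hypothetical Solr core named "designs" through the
# standard /select request handler; "q" carries the query string.
resp = requests.get(
    "http://localhost:8983/solr/designs/select",
    params={"q": "content:(impeller AND titanium)", "wt": "json", "rows": 10},
)
for doc in resp.json()["response"]["docs"]:
    print(doc["id"])  # e.g. the path of each matching file
```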

Since the Web is an amalgam of unrelated websites, it is natural to use divide-and-conquer methods: partition the Web and index the partitions in parallel. This need led to the creation of MapReduce and Hadoop.
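The shape of that computation is easy to show in miniature: a map step emits (word, document) pairs, and a reduce step merges them into an inverted index. The sketch below is plain in-memory Python, not Hadoop, and the sample documents are invented:

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Emit one (word, doc_id) pair per word occurrence."""
    for word in text.lower().split():
        yield word, doc_id

def reduce_phase(pairs):
    """Merge the pairs into an inverted index: word -> set of docs."""
    index = defaultdict(set)
    for word, doc_id in pairs:
        index[word].add(doc_id)
    return index

docs = {"a.txt": "turbine blade design", "b.txt": "blade cooling channels"}
pairs = (pair for doc_id, text in docs.items()
         for pair in map_phase(doc_id, text))
print(sorted(reduce_phase(pairs)["blade"]))  # ['a.txt', 'b.txt']
```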

Because the Hyperfiler is already a distributed system, we do not have to use MapReduce for indexing. Each data node in the Hyperfiler indexes only its own files and does so incrementally as the files are modified. This avoids unnecessary network and disk I/O, and continuous crawling: each data node’s index is always up to date. Queries are sent to all data nodes, which respond in parallel with their results for aggregation in the Hyperfiler GUI.

[Figure: Hyperfiler indexing subsystem]
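The query side is then a plain scatter-gather. A minimal sketch, assuming each data node exposes a hypothetical /search endpoint that returns JSON; the node URLs and the response shape are illustrative, not the actual Hyperfiler API:

```python
import concurrent.futures
import requests

NODES = ["http://node1:8983", "http://node2:8983"]  # hypothetical data nodes

def query_node(node, terms):
    """Ask one data node for hits from its local, always-current index."""
    resp = requests.get(f"{node}/search", params={"q": terms}, timeout=5)
    return resp.json()["hits"]

def scatter_gather(terms):
    """Fan the query out to every node and merge the partial results."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        partials = pool.map(lambda node: query_node(node, terms), NODES)
    return [hit for hits in partials for hit in hits]

# scatter_gather("impeller titanium")  # results aggregated in the GUI
```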

At this point, you might interject: if global search is this good, why do I need to categorize files into folders at all? First, while a collection of files is in active use, you want to keep it organized in folders of sizes that fit into your working memory, keeping the cognitive load low. Second, when the folders are no longer actively used, a search for one piece of data turns up not only the desired file but also the context of related files that surround it. In a well-organized file system, search enables serendipity.
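Here is a small sketch of how that context can be recovered from a single hit (standard library only; the project and file names are invented for the demonstration):

```python
import tempfile
from pathlib import Path

def hit_with_context(hit):
    """Return the matching file plus its siblings: the folder context
    that the original author built around it."""
    siblings = sorted(p.name for p in hit.parent.iterdir() if p != hit)
    return {"hit": hit.name, "folder": str(hit.parent), "context": siblings}

# Build a throwaway project folder to demonstrate.
root = Path(tempfile.mkdtemp()) / "pump-2009"
root.mkdir()
for name in ("impeller-rev3.stp", "housing.stp", "test-report.pdf"):
    (root / name).touch()

# A search for "impeller" returns one file, but its folder also
# surfaces the related drawings and reports around it.
print(hit_with_context(root / "impeller-rev3.stp"))
```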


There have been efforts to create universal standards for naming and tagging resources, such as the Dublin Core. However, creating and maintaining these systems imposes a high cognitive load of its own. For the moment, a distributed file system combined with a distributed search engine appears to be the best compromise for locating and accessing the vital datasets in engineering applications. This is the basis for the enhanced Search functionality we have added to Peaxy Hyperfiler 3.0.