Large organisations like Facebook are "packed to the gills" with various types of data artefacts: tables that store raw data, AI data sets, dashboards, and many other resources. As these companies continue to grow, so does the distance (physical and organisational) between the teams that create the data and the teams that need to be able to find and consume it.
When it is hard to find the most relevant and accurate information, it is hard to make an informed decision and take action. This is a common challenge for companies of all kinds and sizes around the world. To address it, some organisations turn to off-the-shelf data management solutions, while others build and maintain their own custom search systems that support internal data discovery at scale. One such system, Facebook's Nemo, is designed around the intricacies of the company's vast data landscape.
While most of Nemo's implementation details remain unpublished, the platform description on the company's engineering blog does shed some light on its overall architecture.
At a high level, Nemo consists of two main components: indexing and serving. Its primary search backend is Unicorn, an inverted-index system that is also used for many other projects at Facebook, including the social graph itself. Unicorn replaces the Elasticsearch search engine used by Nemo's predecessor. That older data discovery solution only supported plaintext-based search and could not keep up with the growing amounts of data while maintaining the quality of search results.
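To make the term "inverted index" concrete, here is a minimal sketch in Python. It is not Unicorn's implementation, whose internals are not described here. The class name `InvertedIndex`, the `tokenize` helper, and the table identifiers are all hypothetical and chosen purely for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Toy inverted index: maps each term to the set of artefacts containing it.

    Illustrative only; a production system such as Unicorn is far more elaborate.
    """

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        # Indexing side: record the artefact under every term it contains.
        for term in tokenize(text):
            self._postings[term].add(doc_id)

    def search(self, query):
        # Serving side: return artefacts that contain every query term.
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


index = InvertedIndex()
index.add("t1", "Weekly active users on Instagram")  # hypothetical table
index.add("t2", "Daily active users on Facebook")    # hypothetical table
matches = index.search("instagram users")
```

The two methods mirror the indexing and serving split: `add` builds the term-to-artefact mapping ahead of time, so `search` only has to intersect a few precomputed sets instead of scanning every artefact.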
As with all search engines, an important part of Nemo's implementation is its ranking system. Nemo's ranking process incorporates signals that reflect the properties of the indexed data artefacts, such as recency (freshness of the data), quality (how likely it is that the result is a reliable source of data), and usage (how often the table has been accessed over the past month). The ranking process also takes into account the user's role within the company, which is used to return more personalised, and therefore more relevant, search results.
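One simple way to picture how such signals could be combined is a weighted sum. The sketch below is an assumption made for illustration, not Nemo's actual ranking function. The weights, the 30-day recency decay, the 10,000-access usage cap, and the `Artefact` fields are all invented.

```python
import math
from dataclasses import dataclass


@dataclass
class Artefact:
    name: str
    days_since_update: int          # recency signal
    quality: float                  # 0..1, higher means more trustworthy
    accesses_last_month: int        # usage signal
    roles: frozenset = frozenset()  # roles that typically use this artefact


# Hypothetical weights; they sum to 1 so the score stays within 0..1.
WEIGHTS = {"recency": 0.3, "quality": 0.3, "usage": 0.3, "role": 0.1}


def score(artefact, user_role):
    """Combine the signals into a single relevance score between 0 and 1."""
    # Fresh data scores close to 1 and decays as the artefact ages.
    recency = 1.0 / (1.0 + artefact.days_since_update / 30.0)
    # Log scaling stops very popular tables from dominating; capped at 1.
    usage = min(1.0, math.log1p(artefact.accesses_last_month) / math.log1p(10_000))
    # Personalisation: a small boost when the artefact matches the user's role.
    role = 1.0 if user_role in artefact.roles else 0.0
    return (WEIGHTS["recency"] * recency
            + WEIGHTS["quality"] * artefact.quality
            + WEIGHTS["usage"] * usage
            + WEIGHTS["role"] * role)


def rank(artefacts, user_role):
    """Return artefacts ordered from most to least relevant for this user."""
    return sorted(artefacts, key=lambda a: score(a, user_role), reverse=True)
```

With these made-up weights, a table updated yesterday, rated as high quality, and queried thousands of times last month would outrank a stale, rarely used copy of similar data, which is the behaviour the signals are meant to produce.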
Additionally, searches can be phrased as natural language questions, for example "How many weekly active users are there on Instagram?". Such queries are parsed by a spaCy-based NLP library and answered by pointing to the tables that contain the relevant data. This follows one of the recent trends in search: the shift towards fulfilling the user's intent rather than simply finding keyword matches.
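The idea of turning a question into a table lookup can be sketched as follows. This sketch deliberately does not use spaCy: a hand-written stop-word list stands in for the real parsing step, and the table names, descriptions, and function names are hypothetical.

```python
import re

# Small illustrative stop-word list; a real NLP pipeline would do much more
# (part-of-speech tagging, entity recognition, and so on).
STOP_WORDS = frozenset({
    "how", "many", "are", "there", "on", "the", "a", "an", "of", "in",
    "is", "what", "for", "do", "we", "have", "by",
})


def key_terms(question):
    """Keep only the content-bearing words of a question."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def best_tables(question, table_descriptions, top_k=3):
    """Return up to top_k table names whose descriptions share terms with the question."""
    terms = set(key_terms(question))
    scored = []
    for name, description in table_descriptions.items():
        overlap = len(terms & set(key_terms(description)))
        if overlap:
            scored.append((overlap, name))
    # Highest overlap first; ties are broken alphabetically by table name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


tables = {  # hypothetical table metadata
    "ig_weekly_active_users": "Weekly active users on Instagram by country",
    "fb_page_views": "Daily page views on Facebook",
}
answer = best_tables("How many weekly active users are there on Instagram?", tables)
```

The point of the example is the shape of the flow: the question is reduced to its meaningful terms, and the answer is a pointer to the tables most likely to hold the data, not the number itself.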
A sophisticated data discovery engine, Nemo makes sure that the right data is quickly put in the right hands and supports the decision-making process and analysis performed by Facebook's data engineers, product managers, production engineers, and other users. It incorporates a variety of search signals to surface the most relevant, accurate, recent, and trusted results, and thus promotes data health and trustworthiness across the organisation.