Data Architectural Patterns

Data Fabric

Data fabric is an architecture that facilitates the end-to-end integration of various data pipelines and cloud environments through intelligent and automated systems. Over the last decade, developments in hybrid cloud, artificial intelligence, the internet of things (IoT), and edge computing have driven exponential growth in big data, creating even more complexity for enterprises to manage. That growth has produced significant challenges, such as data silos, security risks, and general bottlenecks to decision making, which in turn has made the unification and governance of data environments an increasing priority. Data management teams are addressing these challenges head on with data fabric solutions, leveraging them to unify disparate data systems, embed governance, strengthen security and privacy measures, and give workers, particularly business users, broader access to data.

These data integration efforts via data fabrics allow for more holistic, data-centric decision-making. Historically, an enterprise may have had different data platforms aligned to specific lines of business. For example, you might have an HR data platform, a supply chain data platform, and a customer data platform, which house data in different and separate environments despite potential overlaps. A data fabric, however, allows decision-makers to view this data more cohesively to better understand the customer lifecycle, making connections between datasets that didn’t exist before. By closing these gaps in the understanding of customers, products, and processes, data fabrics are accelerating digital transformation and automation initiatives across businesses.

Data Virtualization

Data virtualization is one of the technologies that enables a data fabric approach. Rather than physically moving the data from various on-premises and cloud sources using the standard ETL (extract, transform, load) processes, a data virtualization tool connects to the different sources, integrating only the metadata required and creating a virtual data layer. This allows users to leverage the source data in real-time.
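To make the idea concrete, here is a toy sketch using sqlite3's ATTACH as a stand-in for a virtualization layer: two physically separate "source" databases are joined at query time through one virtual connection, with no ETL copy of the underlying rows. The databases, tables, and fields here are hypothetical; a real virtualization tool would connect to live warehouses and expose them through one logical layer.

```python
import os
import sqlite3
import tempfile

tmp = tempfile.mkdtemp()
hr_path = os.path.join(tmp, "hr.db")
crm_path = os.path.join(tmp, "crm.db")

# Two separate "source systems", each its own physical database.
hr = sqlite3.connect(hr_path)
hr.execute("CREATE TABLE employees (id INTEGER, name TEXT)")
hr.execute("INSERT INTO employees VALUES (1, 'Ada'), (2, 'Grace')")
hr.commit()
hr.close()

crm = sqlite3.connect(crm_path)
crm.execute("CREATE TABLE accounts (owner_id INTEGER, account TEXT)")
crm.execute("INSERT INTO accounts VALUES (1, 'Acme'), (2, 'Initech')")
crm.commit()
crm.close()

# The "virtual layer": one connection that attaches both sources and
# joins them at query time -- the rows are never copied into a new store.
virtual = sqlite3.connect(":memory:")
virtual.execute(f"ATTACH DATABASE '{hr_path}' AS hr")
virtual.execute(f"ATTACH DATABASE '{crm_path}' AS crm")
rows = virtual.execute(
    "SELECT e.name, a.account FROM hr.employees e "
    "JOIN crm.accounts a ON a.owner_id = e.id ORDER BY e.id"
).fetchall()
print(rows)  # a joined view over two physically separate sources
```

The point of the sketch is the shape of the access pattern, not the tool: consumers query one logical layer while the data stays where it lives.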

Data Mesh

Data mesh is an architectural and organizational paradigm that challenges the age-old assumption that big analytical data must be centralized to be used: stored all in one place and managed by a centralized data team in order to deliver value. Data mesh claims that for big data to fuel innovation, its ownership must be federated among domain data owners who are accountable for providing their data as products, with the support of a self-serve data platform that abstracts the technical complexity involved in serving data products. It must also adopt a new form of federated governance through automation to enable interoperability of domain-oriented data products. Decentralization, along with interoperability and a focus on the experience of data consumers, is key to the democratization of innovation using data.

If your organization has a large number of domains with numerous systems and teams generating data or a diverse set of data-driven use cases and access patterns, we suggest you assess data mesh. Implementation of data mesh requires investment in building a self-serve data platform and embracing an organizational change for domains to take on the long-term ownership of their data products, as well as an incentive structure that rewards domains serving and utilizing data as a product.

Data Lake

- Standardize and document the collection point for your data. Create a tracking plan which documents how data is generated, and have an API or libraries which enforce the schema you want. If the sources of data are inconsistent, it’s going to be hard to link them together over time.
- Load all of the raw (but formatted) data onto S3 in a consistent format. This can be your long-term base for building a pipeline, and the source for loading data into a warehouse.
- Load that data into BigQuery (or potentially Postgres) for interactive querying of the raw data. For your dataset, the cost will be totally insignificant, and the results should give your analysts a way to explore your data from a consistent base.
- Have a set of airflow jobs which take that raw data and create normalized views in your database. Internally we call these “Golden” reports, and they are a more approachable means of querying your data for the questions you might ask all the time. The key is that these are built off the same raw data as the interactive queries.
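The schema-enforcement step at the collection point can be sketched in plain Python. The event types, field names, and `collect` helper below are hypothetical, standing in for whatever tracking plan and collection API you actually run:

```python
import json
from datetime import datetime, timezone

# Hypothetical tracking plan: each event type declares its required
# fields and their types. Enforcing this at the collection API keeps
# the raw data consistent enough to link together over time.
TRACKING_PLAN = {
    "page_viewed": {"user_id": str, "url": str},
    "order_placed": {"user_id": str, "amount": float},
}

def collect(event_type, properties):
    """Validate an event against the tracking plan, then format it
    as a newline-delimited JSON record headed for the raw store."""
    schema = TRACKING_PLAN.get(event_type)
    if schema is None:
        raise ValueError(f"unknown event type: {event_type}")
    for field, ftype in schema.items():
        if not isinstance(properties.get(field), ftype):
            raise ValueError(f"{event_type}.{field} must be {ftype.__name__}")
    record = {
        "type": event_type,
        "received_at": datetime.now(timezone.utc).isoformat(),
        **properties,
    }
    return json.dumps(record)

line = collect("order_placed", {"user_id": "u42", "amount": 19.99})
```

Events that violate the plan are rejected at the door, which is far cheaper than reconciling inconsistent sources downstream.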

We use Segment to manage all of the top three bullets (collect consistently, load into S3, load into a warehouse). Then we use airflow to create the golden reports that analysts query via Mode and Tableau. As other commenters have mentioned, there are a number of tools to do this (Stitch, Glue, Dataflow), but the key is getting consistency and a shared understanding of how data flows through your system.
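As a toy illustration of the golden-report step, the sketch below normalizes hypothetical raw order events into a per-user summary in plain Python; a real implementation would run as an airflow job materializing a view in the warehouse:

```python
import json
from collections import defaultdict

# Hypothetical raw event lines, as they might sit in S3 after collection.
raw_lines = [
    '{"type": "order_placed", "user_id": "u1", "amount": 10.0}',
    '{"type": "order_placed", "user_id": "u1", "amount": 5.0}',
    '{"type": "order_placed", "user_id": "u2", "amount": 7.5}',
]

def golden_orders_report(lines):
    """Normalize raw order events into a per-user summary -- the kind
    of "Golden" view a scheduled job might materialize nightly.
    Crucially, it reads the same raw data the interactive queries see."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        event = json.loads(line)
        if event["type"] != "order_placed":
            continue
        totals[event["user_id"]] += event["amount"]
        counts[event["user_id"]] += 1
    return {
        u: {"order_count": counts[u], "total_spend": totals[u]}
        for u in totals
    }

report = golden_orders_report(raw_lines)
```

Analysts can then query the normalized summary directly instead of re-deriving it from raw events in every dashboard.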

This is a pattern we’ve started to see hundreds of customers converge on: a single collection API that pipes into object storage, which is then loaded into a warehouse for interactive queries. Custom pipelines are built with Spark and Hadoop on this dataset, coordinated with airflow.

Gold Records: Single Source of Truth

Build “gold” records for specific sets/types that are valuable, and pipe them from S3 into Redshift/BigQuery (we built a streaming layer for this). This speeds up querying, makes governance easy, and is extremely reliable.
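One common way to construct such gold records is a latest-record-wins merge per entity key. The `customer_id` and `updated_at` fields below are hypothetical; in practice this merge would happen in the streaming layer or the warehouse rather than in application code:

```python
# A minimal latest-record-wins sketch of building "gold" records:
# for each entity key, keep the most recent version seen across sources,
# so downstream queries hit a single source of truth per entity.
def build_gold_records(records, key="customer_id", ts="updated_at"):
    gold = {}
    for rec in records:
        k = rec[key]
        # ISO-8601 date strings compare correctly as plain strings.
        if k not in gold or rec[ts] > gold[k][ts]:
            gold[k] = rec
    return gold

records = [
    {"customer_id": "c1", "updated_at": "2024-01-01", "email": "old@example.com"},
    {"customer_id": "c1", "updated_at": "2024-03-01", "email": "new@example.com"},
    {"customer_id": "c2", "updated_at": "2024-02-10", "email": "x@example.com"},
]
gold = build_gold_records(records)
```

Because the merge is deterministic given the raw records, the gold set can be rebuilt from S3 at any time, which is what makes it easy to govern and reliable.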

Related Notes