Creating a Data Strategy
What is a data strategy?
Imagine this familiar situation: as an analyst in your company, you’ve been handed the daunting task of assimilating all of your organization’s data to produce unique and comprehensive insights. But this is easier said than done. Business development has much of its data siloed in a proprietary CRM solution, finance keeps theirs hidden away in spreadsheets, and application developers have SDK and IoT data streaming into separate on-prem databases with no fault tolerance built in. On top of that, compliance and security issues were never even considered. There seems to be no rhyme or reason to how everything works, and it’s impossible to get a unified view across all of the enterprise data. “Data science” is mostly done around the organization by sampling data from different pools and then making a “seat of the pants” guesstimate, which is neither productive nor reliable. What a mess!
You need to have a strategy for your data. How will you do this? What data will you collect? Which data will you store — and where? Who is the audience for your data? Who consumes your analyzed data? What kind of access controls and permissions do you want to have on your data?
This blog will guide you through the kinds of questions you’ll need to explore as you’re planning your data strategy and starting to think about your data architecture.
But, first off, what is a data strategy?
- A vision and action steps toward an organization’s ability to be data-driven;
- A plan for how an organization obtains, stores, assimilates, controls, administers, and uses data;
- A roadmap and feature list for today’s — and tomorrow’s — data pipeline;
- A checklist for developing that roadmap;
- A set of methods, procedures, and tools for everything that an organization must do with data.
Why do you need a data strategy?
Without a data strategy, an organization’s use of data is likely to be ad hoc and chaotic, with an attendant lack of ownership and purpose surrounding data-driven functions. It will also be much harder to reliably consolidate data from multiple sources into consistent, reproducible, and comprehensive insights.
A data strategy makes certain things possible:
- data can be managed and deployed like an asset;
- data assets can be utilized, tracked, allocated, and moved with minimal effort;
- all procedures and methods for manipulating data are repeatable, reliable, and consistent;
- regulatory compliance and security requirements for data are addressed;
- when problems arise, there is a predictable method for recognizing and implementing required changes across the data pipeline and data assets themselves.
What’s more, a data strategy may be driven by the following needs:
- A desire for data infrastructures and functions to work and provide value without continual intervention;
- Alignment of vision, across business and IT, on leveraging data as an asset;
- Definition of metrics and success criteria for data management across the organization and user base;
- The drive toward a profitable, or at least zero-net-cost, data-management effort;
- Elimination of technical debt, along with the tendency to accumulate technical debt.
If your data systems and procedures work without requiring you to constantly put out fires, you can focus on using your data assets to add value to your business.
The fundamentals
When starting out on developing a data strategy, there are a number of questions you’ll want to ask. Let’s go through each one in turn:
What do I want my data to communicate (or do)? What purpose do I want my data pipeline (or pipelines) to serve, specifically? What metrics do I want to collect? What correlations (between disparate data sets) do I want to find? What other insights do I want to derive from my data? What actions (automated or human) do I want performed as a part of this strategy?
Who are the stakeholders for my data and/or data pipeline? Are they customers? Professional peers? Senior management or shareholders? Who else?
What is my data pipeline, specifically? Is it a pipeline that supports a function (or many functions) within a business? Is the pipeline the core of a data-driven system or service? Is the pipeline a product or service in and of itself?
Is the data pipeline a singular one-way monolith, or does it nest many two-way pipelines (e.g. a core-to-edge system)? Are there smart devices (e.g. sensors or a responsive UI) that provide combined means for input and output?
What other purpose should my data infrastructure serve? For example, is my pipeline just relaying data, or is the output from one system the input to another, like a data set for machine learning training?
Collecting data
The goal for almost any data infrastructure is to pull data from many disparate systems into a single comprehensive, canonical data store for unique insights and analysis. Here are some questions you can ask about collecting data.
Is my data to be ingested in real time, near real time, in batch, or some hybrid approach? Some systems require data to be updated immediately, while others can tolerate — or even require — some delay.
From where and how will you collect data? There are many potential data sources to consider, which include:
- Software SDKs (events fired when specific code is executed; see the sketch after this list);
- Web logs (like click through rates, originating IP address, timestamp, etc.);
- Sensors (like those for IoT, energy consumption, manufacturing systems, or infrastructure);
- Smart devices like TVs, meters, or security systems;
- Transactional RDBMSs;
- NoSQL data stores;
- Data warehouses;
- Integrated platforms like Google Ads and Google Analytics;
- Enterprise systems such as ERP and CRM platforms (for example Salesforce or Marketo), or operational platforms like Splunk.
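As a concrete illustration of the SDK source above, an SDK event is usually just a small structured payload fired from application code and posted to a collection endpoint. Here is a minimal sketch in Python; the endpoint URL, event schema, and field names are illustrative assumptions, not any particular vendor’s API.

```python
import json
import time

import requests  # pip install requests

# Hypothetical collection endpoint -- replace with your vendor's or your own ingest URL.
INGEST_URL = "https://collector.example.com/v1/events"

def track_event(event_name, properties):
    """Fire a single analytics event when specific application code runs."""
    payload = {
        "event": event_name,
        "timestamp": time.time(),
        "properties": properties,
    }
    response = requests.post(
        INGEST_URL,
        data=json.dumps(payload),
        headers={"Content-Type": "application/json"},
        timeout=5,
    )
    response.raise_for_status()

# Example: emit an event when a user completes checkout.
track_event("checkout_completed", {"user_id": "u-123", "cart_value": 42.50})
```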
Transporting and loading data
Data transport involves not only ingesting data (a category that overlaps somewhat with collecting data), but also moving it from one place to another, or deciding not to move it at all. If your data pipeline is not a one-way monolith, you may need to consider several transport and load methods as you develop your strategy.
How will you load data?
- SDKs
- Connectors
- Open source or proprietary log collectors (open source examples include Fluentd and Fluent Bit)
- Bulk loaders (like Embulk)
- Real-time message queues (like RabbitMQ or ZeroMQ)
- A pub-sub architecture (like Apache Pulsar or Apache Kafka)
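To make the pub-sub option above concrete, here is a minimal sketch using the kafka-python client. The broker address, topic name, and event fields are assumptions for illustration; a real deployment would also need decisions about serialization, partitioning, and retries.

```python
import json

from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

# Assumed local broker and topic name -- adjust for your environment.
BROKER = "localhost:9092"
TOPIC = "page-views"

# Producer: publish one event to the topic.
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"user_id": "u-123", "path": "/pricing"})
producer.flush()

# Consumer: read events back from the beginning of the topic.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)
```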
Storing and analyzing data
Because data volume tends to grow much faster than the storage capacity you can reasonably provision, you may also need to decide which data to persist and which data remains ephemeral.
When data requires persistence, though, where will you store your data? That depends on your source data format, allowable speed of writes and lookups, and purpose of the data, to name a few criteria. Some options and considerations:
- A cloud data warehouse. There are many possibilities to choose from.
- An RDBMS system for schema-rigid, transactional data.
- Apache Hadoop (using MapReduce or Apache Spark), useful for extremely large datasets running on commodity hardware. MapReduce must read from and write to disk between stages, while Apache Spark can keep working data in memory and offers a SQL-like dialect (Spark SQL) to simplify queries.
- An eventually-consistent, distributed database, for writing events related to distributed ledgers, where a tradeoff between availability and consistency can be adjusted.
- Columnar storage, for ACID-like transactions while allowing for some schema flexibility.
- Time-series databases like InfluxDB, optimized for timestamped data like that which comes at high volume and velocity from IoT systems.
- A key-value store, for fast data lookups by key.
- A document database, like MongoDB, for fast lookups on schema-flexible, voluminous data.
- A text-search system like Lucene or Elasticsearch, for storing and querying unstructured, text-based data.
- Amazon S3, for storing and accessing data written to flat files such as CSV or JSON (see the sketch after this list). S3 also gives you access to Amazon’s ML and AI capabilities.
- A lambda architecture, which combines more than one storage method for different data requirements (for example, a document database for fast lookups on recent data and a Hadoop cluster for slower lookups on historical data).
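As one concrete example from the list above, writing events to Amazon S3 as flat files takes only a small amount of code with the boto3 client. The bucket name and key layout here are assumptions, and a real pipeline would batch and compress writes.

```python
import json

import boto3  # pip install boto3

s3 = boto3.client("s3")
BUCKET = "my-analytics-bucket"  # assumed bucket name

events = [
    {"user_id": "u-123", "event": "checkout_completed", "cart_value": 42.50},
    {"user_id": "u-456", "event": "page_view", "path": "/pricing"},
]

# Write a batch of events as newline-delimited JSON, a common flat-file layout
# that downstream query engines (such as Amazon Athena) can read directly.
body = "\n".join(json.dumps(e) for e in events).encode("utf-8")
s3.put_object(Bucket=BUCKET, Key="events/2024/01/01/batch-0001.json", Body=body)

# Read the batch back.
obj = s3.get_object(Bucket=BUCKET, Key="events/2024/01/01/batch-0001.json")
print(obj["Body"].read().decode("utf-8"))
```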
Consolidating data from multiple sources
To get real insights, you’ll need to explore correlations between different data sets, and this involves transforming data and combining it into one canonical store. Where you store this combined data depends very much on the requirements of the data itself, as we’ve examined above.
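As a lightweight illustration, you can prototype this kind of consolidation with a join on a shared key before hard-wiring anything into a pipeline. The following pandas sketch assumes two made-up CSV exports, one from a CRM and one from a billing system.

```python
import pandas as pd  # pip install pandas

# Illustrative exports from two separate systems (file and column names are assumptions).
crm = pd.read_csv("crm_accounts.csv")         # columns: account_id, industry, owner
billing = pd.read_csv("billing_monthly.csv")  # columns: account_id, month, revenue

# Combine the two sources on their shared key to form one canonical view.
combined = billing.merge(crm, on="account_id", how="left")

# A correlation you could not see in either source alone: revenue by industry.
print(combined.groupby("industry")["revenue"].sum().sort_values(ascending=False))
```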
Securing the data
Data security and compliance merit careful consideration, particularly when personally identifiable information (PII) is involved. For example, architectures that use a pub-sub mechanism face the challenge of writing all events to an immutable, ordered event log while keeping PII secure, up to date, deletable at the user’s request, and restricted to the minimum necessary. Do you discard all PII before writing it? Partition the event log and secure the different partitions? Or not move data containing PII into your system at all (and run queries remotely)? To be secure and to comply with regulations like GDPR, you may need to choose among these alternatives.
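As a small illustration of the first option (transforming or discarding PII before it reaches the immutable log), events can be pseudonymized on the way in. This sketch uses a keyed hash; the field names and secret handling are assumptions, and keyed hashing is pseudonymization rather than anonymization, so deletion requests may still require rotating or deleting the key.

```python
import hashlib
import hmac
import os

# Secret used for pseudonymization -- in practice, load this from a secrets manager.
PEPPER = os.environ.get("PII_PEPPER", "change-me").encode("utf-8")

# Fields treated as PII that should never enter the event log in the clear.
PII_FIELDS = {"email", "full_name", "ip_address"}

def pseudonymize(event: dict) -> dict:
    """Replace PII values with keyed hashes so events can still be joined per user,
    while the raw identifiers stay out of the immutable log."""
    safe = {}
    for key, value in event.items():
        if key in PII_FIELDS:
            digest = hmac.new(PEPPER, str(value).encode("utf-8"), hashlib.sha256)
            safe[key] = digest.hexdigest()
        else:
            safe[key] = value
    return safe

print(pseudonymize({"email": "ada@example.com", "event": "login", "ip_address": "203.0.113.7"}))
```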
Outputting data
Now that you’ve ingested, processed, and stored your data, what’s next? Is your refined data intended for:
- further analysis?
- historical trend tracking?
- visualization?
- conversion to automated commands for a subsequent system?
- machine learning training?
- anomaly detection, k-nearest neighbor analysis, or similar examination?
Depending on its purpose, your data may require more refinement, adjustment, or destination schema changes.
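For instance, if the destination is machine learning training or a k-nearest-neighbor analysis, the refined output usually needs to be reshaped into a feature matrix plus a label column. This scikit-learn sketch is purely illustrative; the feature names and values are made up.

```python
import pandas as pd  # pip install pandas scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Illustrative refined output from the pipeline: one row per user, numeric features.
df = pd.DataFrame({
    "sessions_last_30d": [3, 42, 7, 55, 2, 31],
    "avg_cart_value":    [10.0, 85.5, 22.0, 99.0, 5.0, 60.0],
    "churned":           [1, 0, 1, 0, 1, 0],
})

X = df[["sessions_last_30d", "avg_cart_value"]]
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# A k-nearest-neighbor classifier, one of the downstream uses listed above.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```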
Using data exploration to understand data sets
Data exploration means testing out different queries, SQL manipulations, and even dashboard visualizations to verify different query results and outputs. By most accounts, data exploration using tools like D3.js/Vega, R, or Python requires that the data be in a two-dimensional (or higher) array or a data frame.
There are similar requirements for data that’s used in BI tools like Tableau, Qlikview, Looker, Pentaho, or even Excel.
Most popular data destinations can expose data in these forms, making data exploration techniques and BI tools an everyday part of the data science toolbox.
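As a minimal sketch of this kind of exploration, the following loads a flat export into a pandas data frame and tries a few quick aggregations before anything is hard-wired into a dashboard; the file name and column names are placeholders.

```python
import pandas as pd  # pip install pandas

# Placeholder export -- any flat file or query result that fits in a data frame will do.
df = pd.read_csv("web_logs_sample.csv", parse_dates=["timestamp"])

# Quick exploratory checks before committing to a dashboard or report.
print(df.describe())                                    # distribution of numeric columns
print(df["status_code"].value_counts().head(10))        # most common response codes
print(df.set_index("timestamp").resample("1D").size())  # daily request volume
```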