Data Ingestion: The First Step to Generating Valuable Business Insights
Your business may accumulate a lot of data from multiple sources. But if those sources are not connected, it’s difficult to leverage the data to its full value, particularly as your business grows. The analytics built into each transactional system, such as Microsoft Dynamics, provide only limited reporting capabilities out of the box.
Perhaps you have deployed a Microsoft Dynamics ERP solution along with other enterprise applications to assist with sales, marketing, and operations. To tap into the synergy of combining these data sources, you need to extract the data, ingest it into a centralized data lake, and then enrich it so that a solution like Microsoft Power BI can generate more sophisticated reports. Those insights become even more powerful as you connect additional data sources.
Just Having the Data Is Not Enough
This blog is the second in our series on Leveraging Data to Generate Business Insights. In the first blog, we discussed why data management is important. Just having data doesn't make it a business enabler. To help your company increase revenue, lower operational costs, and improve employee productivity, you need to deploy processes for ingesting, transforming, modeling, analyzing, visualizing, collaborating on, and governing the data.
In this blog, we examine data ingestion, the first phase in the data management lifecycle. One key is to design an ingestion process that optimizes the speed of the data loads, striking a balance between load performance and query performance. Many businesses focus on query speed because end users need insights fast, but if data loading is cumbersome and slow, it delays every downstream step.
Key Questions for Designing the Ingestion Process
As you design your data ingestion process, there are several questions to answer:
How many data sources do you have, and where is the data coming from?
Sources of operational and transactional data could include internal systems such as Dynamics 365 Business Central, Finance and Operations, or Customer Engagement. You may have other business applications, such as an e-commerce website or a warehousing application. There could also be external sources: market data from third-party providers such as Nielsen, Gartner, or IDC, as well as sales data from distributors and business partners. You might also collect customer sentiment data from social networks.
Other potential data sources are on-premises legacy systems that store historical data. Perhaps you are sunsetting Dynamics AX or Dynamics GP as you migrate to a cloud-based Dynamics 365 ERP system or adopt SaaS applications, and you don't want to lose legacy data that may span many years. You might also collect streaming data from production machines on the manufacturing floor, or from environmental or building sensors on an IoT (Internet of Things) network.
How often is data refreshed?
Is your data uploaded in batches, in real time, or in near real time? Perhaps you’re on an hourly, daily, weekly, or monthly schedule, or there’s a mix among your various data sources.
It’s also important to consider whether you need incremental updates (deltas) or full updates. For example, you might want periodic data snapshots so you can compare this year’s data to last year’s at the same point in time.
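To make the distinction concrete, here is a minimal Python sketch of the two approaches, assuming a SQL-accessible source with a last-modified timestamp; the sales_orders table and its column names are hypothetical, not from any particular system.

```python
import sqlite3  # stand-in for any SQL-accessible source system

# Hypothetical schema: a "sales_orders" table with a "last_modified" column.
def incremental_load(conn: sqlite3.Connection, watermark: str) -> list:
    """Pull only the rows changed since the last successful load (a delta)."""
    cursor = conn.execute(
        "SELECT order_id, amount, last_modified "
        "FROM sales_orders WHERE last_modified > ?",
        (watermark,),
    )
    return cursor.fetchall()

def full_load(conn: sqlite3.Connection) -> list:
    """Pull every row; simpler, and useful for point-in-time snapshots,
    but increasingly slow as the table grows."""
    cursor = conn.execute(
        "SELECT order_id, amount, last_modified FROM sales_orders"
    )
    return cursor.fetchall()
```

After each incremental run, you would record the highest last_modified value you saw and use it as the watermark for the next run.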
Where and how is data stored?
At many businesses, data from different sources is siloed in multiple disparate systems, which makes it hard to generate insights across the organization. Some businesses solve this problem by storing information in a data lake, which holds all types of data in open formats and makes it accessible through open APIs. A data lake also scales easily, and the cost is usually low.
By contrast, data warehouses and traditional databases are typically more established and offer structured, high-quality data along with fine-grained security and governance. Data warehouses are usually more expensive than data lakes, however, and often use closed, proprietary formats.
An alternative is the data lakehouse, a hybrid between a data lake and a data warehouse that gives you the best of both worlds. A data lakehouse combines the open formats of data lakes with the structure of SQL (Structured Query Language) data warehouses: you get the rigor of the warehouse with the openness and scalability of the lake. This approach also offers a single data architecture for many workloads.
How is the data formatted?
A structured tabular format, with rows and columns, is easy for most end users to make sense of. Another common format is CSV (comma-separated values); almost any data source can export to CSV, which makes it a good format for data interchange.
There’s also semi-structured data, such as text or chat transcripts; it's not totally unstructured, because you can usually recognize words, sentences, and phrases. Fully unstructured data usually comes in the form of pictures, videos, or audio.
Then there are the advanced modern formats, which are becoming increasingly popular. Apache Parquet is an open-source, column-oriented data file format that works well for data lakes and for storing the point-in-time snapshots described earlier. Apache Avro is a serialization format for record data and a good choice for streaming data. JSON (JavaScript Object Notation) is a text-based format commonly used for transmitting data in web applications.
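As a quick illustration, here is a short Python sketch that converts a CSV export into Parquet and JSON, assuming the pandas and pyarrow libraries are installed; the file names are hypothetical.

```python
import pandas as pd  # assumes pandas (and pyarrow, for Parquet) are installed

# "daily_sales.csv" is a hypothetical export from a source system.
df = pd.read_csv("daily_sales.csv")

# Columnar Parquet: compact and efficient to scan in a data lake.
df.to_parquet("daily_sales.parquet", engine="pyarrow")

# JSON: text-based and convenient for web applications, one record per line.
df.to_json("daily_sales.json", orient="records", lines=True)
```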
How is data accessed and is cleansing required?
Most of the time, data management systems access databases and data warehouses using SQL. For data lakes, open APIs are common, but you can also use Python. Most businesses use a combination of open APIs and SQL, depending on the data source.
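As one illustration of mixing the two, the sketch below uses DuckDB, an embedded engine that can run SQL directly against data lake files from Python; DuckDB is our example choice here, not something prescribed by any particular platform, and the file and column names are hypothetical.

```python
import duckdb  # an embedded engine that runs SQL over data lake files

con = duckdb.connect()

# Hypothetical file and columns, continuing the earlier example.
df = con.execute(
    "SELECT region, SUM(amount) AS total_sales "
    "FROM read_parquet('daily_sales.parquet') "
    "GROUP BY region"
).fetchdf()  # returns a pandas DataFrame for further work in Python

print(df)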
Data is rarely usable as-is, so you will likely need to cleanse your data as you ingest it, which greatly improves data quality. A key consideration is whether to cleanse the data before or after loading; we will address this in our next blog on data transformation.
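As a taste of what cleansing during ingestion can look like, here is a minimal pandas sketch; the rules and column names are hypothetical and would depend entirely on your data.

```python
import pandas as pd

df = pd.read_csv("daily_sales.csv")  # hypothetical raw extract

# Typical light-touch cleansing steps applied at ingestion time:
df = df.drop_duplicates()  # remove rows repeated by the extract process
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")  # normalize dates
df["amount"] = df["amount"].fillna(0)  # handle missing values explicitly
df = df[df["amount"] >= 0]  # drop obviously invalid records

df.to_parquet("daily_sales_clean.parquet")  # land the cleansed data in the lake
```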
Answers Provide a Better Understanding of the Required Design
In addition to answering the questions above, consider how robust the ingestion process needs to be. For example, if an ingestion workflow fails, you may want to program it to retry automatically and then alert you when it completes or if it keeps failing. Security controls are also a key consideration.
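A minimal retry wrapper in Python might look like the sketch below; the alerting here is just a log message, standing in for whatever channel (email, Teams, a monitoring tool) you actually use.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingestion")

def run_with_retries(job, attempts: int = 3, delay_seconds: int = 60) -> bool:
    """Run an ingestion job, retrying on failure and reporting the outcome.

    `job` is any callable that raises an exception when it fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            job()
            log.info("Ingestion succeeded on attempt %d", attempt)
            return True
        except Exception as exc:
            log.warning("Attempt %d failed: %s", attempt, exc)
            if attempt < attempts:
                time.sleep(delay_seconds)  # simple fixed delay between retries
    log.error("Ingestion failed after %d attempts; alert the on-call team", attempts)
    return False
```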
By answering these questions, your organization will better understand how to set the design goals and key considerations for the data ingestion process. You will also identify the tools and methods to use in this phase.
If you need help leveraging your data to generate business insights, Western Computer is here to help. Contact us today to learn more about our Microsoft solutions and how they can help you aggregate, analyze and govern all your data sources.