With all the shiny new and ever-changing systems and solutions in the market, isn’t it the right time to rethink your enterprise business intelligence solution? Is the same traditional design still the best approach, where:
- You have SSIS packages, along with numerous console applications, extracting data from the source systems, then manipulating it and storing it in a staging database.
- Another layer of SSIS packages and stored procedures prepares the data and moves it into the data warehouse.
- Your rigid reporting tool connects to the data warehouse and to the multidimensional model.
- You need a report developer to create new reports, and an SSIS developer to work on new packages, for every change.
Some organizations are considering a move to the cloud to avoid the ever-increasing storage and computational power needed to retrieve data from terabytes of storage. So how does the above design approach fit into a cloud-first design, or is there a better solution?
What about all the new online sources: social feeds, streams of data about your consumers’ footfall, surveys from SurveyMonkey, interactions with your bot, and so on? Is it really the best approach to fetch these using the same traditional packages, purchasing connectors and adapters to connect to these systems and re-implementing what has already been done so many times, or can we utilize cloud-based services that provide all these connectors and logic as part of the platform?
With all this in mind, we can go with a cloud-only (Azure-only) solution with minimal development effort. Obviously, design and implementation effort for databases and for the SQL queries that cleanse and prepare the data will still be needed.
Azure SQL Data Warehouse (ADW) is a very good contender, with Massively Parallel Processing (MPP) capabilities and the option to elastically scale compute nodes up and down.
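To make the MPP idea concrete, here is a toy Python sketch of how a hash-distributed ADW table spreads rows across its 60 distributions. This is purely illustrative: ADW uses its own internal hash function (not MD5), and the `orders`/`customer_id` data is made up.

```python
import hashlib

NUM_DISTRIBUTIONS = 60  # ADW spreads every table across 60 distributions

def assign_distribution(key_value) -> int:
    """Illustrative only: map a distribution-column value to one of the
    60 distributions. ADW uses its own internal hash, not MD5."""
    digest = hashlib.md5(str(key_value).encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_DISTRIBUTIONS

# Hypothetical fact rows; customer_id is the chosen distribution column.
orders = [{"order_id": i, "customer_id": i % 7} for i in range(1000)]

# Hash-distributing on customer_id keeps each customer's rows together,
# so joins and aggregations on customer_id avoid data movement across nodes.
placement = {}
for row in orders:
    placement.setdefault(assign_distribution(row["customer_id"]), []).append(row)

print(len(placement))  # number of distinct distributions used (at most 60)
```

The design choice being sketched: picking a distribution column that is used in joins and aggregations is what lets the parallel nodes work independently.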
Azure SQL Database
Why do you need a SQL database when you already have a data warehouse? You need landing and staging areas to clean, process, and prepare the data before pushing it to the data warehouse. ADW is not the right contender for data manipulation; it is best for querying. Secondly, conceptually it does not make sense to use your data warehouse for landing and staging purposes.
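The landing-then-staging flow described above can be sketched as follows. This is a minimal illustration using SQLite in place of Azure SQL Database, with a made-up customer feed; the table names and cleansing rules are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the Azure SQL staging DB
cur = conn.cursor()

# Landing table: raw extract, everything as text, duplicates and nulls allowed.
cur.execute("CREATE TABLE landing_customers (customer_id TEXT, email TEXT)")
cur.executemany(
    "INSERT INTO landing_customers VALUES (?, ?)",
    [("1", " A@X.COM "), ("1", " A@X.COM "), ("2", None), ("3", "b@y.com")],
)

# Staging table: typed, cleansed, deduplicated - ready to push to the warehouse.
cur.execute(
    "CREATE TABLE staging_customers (customer_id INTEGER PRIMARY KEY, email TEXT)"
)
cur.execute(
    """
    INSERT INTO staging_customers
    SELECT DISTINCT CAST(customer_id AS INTEGER),
           LOWER(TRIM(email))
    FROM landing_customers
    WHERE email IS NOT NULL
    """
)

rows = cur.execute("SELECT * FROM staging_customers ORDER BY customer_id").fetchall()
print(rows)  # [(1, 'a@x.com'), (3, 'b@y.com')]
```

The point is that this messy, row-by-row manipulation belongs in the relational staging database, leaving ADW to do what it is good at: querying clean data.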
The following are the ETL technologies I propose:
- Azure Data Factory (ADF)(preferred)
- Microsoft Flow
Azure Data Factory (ADF)
With a growing list of supported data sources and the ability to develop a custom .NET activity or use SSIS (in v2), you can use it to connect to almost any data source. The list of supported data sources is available here.
To integrate on-premises sources, you will need to install a data gateway on an on-premises server that has access to all internal sources and to the Azure SQL database.
ADF will also be used to move the data from the staging SQL database to Azure SQL Data Warehouse.
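As a rough sketch, the staging-to-warehouse step could look something like this trimmed, ADF v1-style copy pipeline. The dataset names and query are hypothetical, and a real pipeline also needs the linked services and dataset definitions behind them:

```json
{
  "name": "StagingToADW",
  "properties": {
    "activities": [
      {
        "name": "CopyCustomersToADW",
        "type": "Copy",
        "inputs":  [ { "name": "StagingCustomers" } ],
        "outputs": [ { "name": "AdwCustomers" } ],
        "typeProperties": {
          "source": {
            "type": "SqlSource",
            "sqlReaderQuery": "SELECT customer_id, email FROM staging_customers"
          },
          "sink": { "type": "SqlDWSink", "allowPolyBase": true }
        }
      }
    ],
    "start": "2017-01-01T00:00:00Z",
    "end": "2018-01-01T00:00:00Z"
  }
}
```

Enabling PolyBase on the sink is the usual choice for bulk loads into ADW, since it loads in parallel across the compute nodes rather than row by row.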
If we can use ADF for all kinds of sources, why do we need Microsoft Flow? Although the list of sources supported by ADF is extensive, there are still many common sources that are not so easy to connect to with ADF and require you to write code: for example, Excel files on your network share, OneDrive, SFTP, or Dropbox, or data within SharePoint Online.
Microsoft Flow has a much more extensive list of connectors. What I like most about Microsoft Flow is that it is event-triggered: you will not need to pull the data from your budget file every month; it can be pushed as soon as an updated file is available. Please note that Microsoft Flow is not meant to move large volumes of data; it is good for small chunks.
If you need to pull your budgets from OneDrive or SFTP, or your footfall data from Dropbox, then I recommend Microsoft Flow. It does not require you to write code, and you can even add approval/review steps.
Please note that you will need to install a gateway for Microsoft Flow within your network to connect to on-premises sources. Its gateway is different from ADF’s, but the same gateway works for other services like PowerApps, Azure Analysis Services, and Power BI.
In the above design, I divided all sources into four broad categories:
- Internal Sources
- Azure/Office 365 (Azure Event Hubs, Azure Stream Analytics, Dynamics 365, SharePoint Online, etc.)
- Other public sources
- Data entered by users (PowerApps) (e.g. metadata, external factors like public holidays etc.)
Azure Analysis Services (AAS)
This will be your semantic layer on top of ADW. It will provide the aggregations and performance that you need for your reporting.
AAS will be processed at regular intervals by an Azure Automation runbook.
Azure Data Factory and Microsoft Flow have their own schedulers, but we will not use them to process the AAS database. For this, you can use the Azure Automation scheduler to run the runbooks.
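Inside the runbook, one way to trigger processing is the Azure Analysis Services asynchronous-refresh REST API. The sketch below assumes you have already obtained an Azure AD bearer token (for example via a service principal); the region, server, and model names are placeholders, and a production runbook would also poll the returned status URL until the refresh completes.

```python
import json
import urllib.request


def refresh_url(region: str, server: str, model: str) -> str:
    """Build the AAS asynchronous-refresh endpoint for a model."""
    return (f"https://{region}.asazure.windows.net"
            f"/servers/{server}/models/{model}/refreshes")


def request_full_refresh(region: str, server: str, model: str,
                         bearer_token: str) -> str:
    # POST a full-process request; AAS queues it asynchronously and
    # returns a Location header you can poll for the refresh status.
    body = json.dumps({"Type": "Full", "CommitMode": "transactional"}).encode()
    req = urllib.request.Request(
        refresh_url(region, server, model),
        data=body,
        headers={
            "Authorization": f"Bearer {bearer_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.headers.get("Location")
```

Because the refresh is asynchronous, the runbook does not have to stay alive for the whole processing window; it fires the request and can check back on the status URL later.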
It is not recommended, but if you have to integrate with sources supported by neither Azure Data Factory nor Microsoft Flow, and you don’t want to write .NET code or develop an SSIS package, you can utilize Azure Automation runbooks. They support Python as well as PowerShell, with an extensive list of modules.
Power BI is my preferred choice for reporting, with an extensive list of capabilities and ease of use. You can hook it up to Azure Analysis Services with a live connection and then get creative with your reports. Becoming a Power BI report designer is as easy as being a power user of Excel.
I made the overall design look very simple 🙂 but it is not so simple once you start implementing it. Let me know your thoughts and feedback.