In this blog I describe a few of the challenges you might face when you have a lot of dataflows. I will describe a few tips and tricks I am applying to sanitize your dataflow approach, organize dataflows and easy to browse through.
Dataflows are increasingly used as shared resource or staging layer inside the Power BI platform. While having dataflows, you can push down logic and reuse across different datasets. This lowers the impact on the source by extracting the data once from source to Power BI, helps in centralizing logic, having one version of the truth and lots of other advantages.
If you are not familiar yet with dataflows, I advise you to first read this documentation before you continue reading this blog.
Dataflow best practices
Building dataflows is very similar to building queries in Power BI Desktop. Although there was a great improvement of the user interface to build dataflows, I personally still prefer building the queries in Power BI desktop. Afterwards you can easily copy-paste the query from the advanced editor into a dataflow.
In the below video Patrick Leblanc highlights some more advantages of building queries in Power BI desktop and then move them to dataflows.
Besides all these personal best practices, Microsoft has also put a bunch of best practices together. I’m definitely not going to repeat everything that is already written down in the docs, but let’s point out a few of the things I always advise to others.
- Start every new solution by using dataflows from the beginning! By leveraging dataflows, you can take advantage of separate refresh schedules and easier error traceability. Is one of your dataflows failing to refresh, it will still contain the last successful set of data and not affect the data model refresh directly. This all results in a higher success rate on your data model refresh.
- Avoid ending up with one big mess! Push all your transformations down to dataflows and avoid adding any logic in the data model. By doing this, you keep it all well organized and consistent in one place.
- If you can, take advantage of linked and computed entities. This requires Power BI premium. In line with the Microsoft best practices, you can split data ingestion from transformation.
- Give every dataflow a reasonable name and description. For you it might be clear what is inside the dataflow, but your colleagues might start using the dataflows as well in the future. A reasonable name and description will help them to understand what is inside.
Even when applying all the above best practices, some challenges remain.
Challenges in organizing dataflows
A lot of things can be implemented in organizational best practices and ways of working. Still the challenge remains. Below I describe two challenges in more detail that I have faced recently.
If you centrally build dataflows and want to share the end-result only with others. To do so, you need to grant viewer permissions on workspace level, since there is currently no other way of sharing only a dataflow with other users. If you do so, you also grant read access to everything inside the workspace. This includes the data ingestion dataflows if they are in the same workspace.
A mistake is made quickly, and you want to ensure that they leverage the correct dataflows. According to Microsoft documentation, you can start build cross-workspace linked entities.
When you start leveraging linked entities in dataflows, you actually start building one dataflow on top of another. This is a great way of splitting the data ingestion from transformation steps. By doing so, you simply put the ingestion dataflow in another workspace then the transformation dataflows. But here we directly face the challenge.
- Linked Entities inside the same workspace, will trigger the next dataflow in line, when the first one finished refreshing.
- Cross workspace linked entities, will not automatically trigger the next workspace in line and it will behave like an external data source. When the Ingest dataflow is refreshed, it will not directly affect the linked entities.
In other words, by sharing only the end-result dataflows with our users, we need setup cross-workspace linked entities. This approach brings up two questions for me. The first one is fairly easy to solve, but the second (more complex) one, is much more complex.
Centrally manage dataflow refreshes
The cross workspace linked entities require separate schedules that must be aligned. This brings additional admin work and makes it more complex to see cross dataflow dependencies.
Although it is suboptimal, this challenge can be solved fairly easy by setting up a centrally managed dataflow refresh schedule as I described in one of my earlier blogs. With that approach, you can trigger one dataflow after the other and centrally maintain schedules.
Enhanced Compute Engine
The second question is more a technical one and requires a technical understanding of how dataflows work. Maybe it is even more a consideration then a challenge. I will try to explain it as simple as possible.
When a dataflow is triggered to refresh, it computes everything in Power Query (mash-up) engine. After processing is completed, it will save the resulting dataset as a CSV file in a CDM folder in Azure Datalake. This can be the default location or your own custom setup Azure Datalake gen2 as described in the docs. Available memory for processing is limited to the parameters set on the Premium capacity.
If you leverage the enhanced compute engine, you leverage a SQL based cache to process the dataflows. By leveraging the enhanced compute engine, you increase the process power by being able to push certain operations that can be folded upstream to process as much as possible inside the SQL based cache. This all is handled by query folding. Read the query folding documentation if you want to know which operations are supported.
If you want to know more about the enhanced compute engine and how it works, I advise you to watch this great video by Matthew Roche where he explains it in detail.
While doing a bit more research in the advantages and disadvantages of the enhanced compute engine, I had a talk to Shuvro Mitra, Principal Architect in the Power BI team. He mentioned that the SQL cache is usually permanent but can be dropped under some conditions. If this happens, it will be recreated in the next refresh iteration.
Though, there is a disadvantage of putting linked entities in a separate workspace. The enhanced compute engine caches per dataflow. If the dataflow is referenced by another dataflow (linked entity) in the same workspace, it re-uses the cache. Even while the linked entity uses the SQL cache, you will pay the penalty of caching in the second (linked entity) dataflow refresh, since this is another workspace. Shuvro mentioned that measures have shown an additional load time around 10% for each entity to refresh, though this number is not very scientific.
Confusing names with linked entities
Last but not least, by having multiple layers of dataflows the names can get very confusing. In line with one of the earlier mentioned best practices, you want to give an explanatory name to every dataflow. You might end up with very similar names for your linked entities as the ingestion workspaces.
To easily recognize the placing of the dataflow, type of the dataflow and where it relates to, applying naming convention might help here. Of course there is also the lineage view that can help you to get some more order in the chaos. Unfortunately, this will only be relevant to a limited group of people. End users with viewer permissions on the workspace, cannot see the lineage view.
So, what to do to add some more structure to the dataflow setup? One of our recently added best practices is to apply naming convention to your dataflows. This will help to get some more structure, nicely group the related dataflows together in the workspace and being able to recognize which part of the process is represented by the dataflow (ingest or transform). The following subjects can be relevant to be part of your naming convention:
- Data source where the data is extracted from. Of course, this is visible in the lineage view as well, but the name will help here as well for other users that cannot access the lineage view.
- Type of dataflow, whether it only extracts data (E), or is built as linked entity on top to transform (T) the data.
- Order number, in case you have multiple linked entities with transformations stacked.
- Some free explanator text. In most cases describing what data can be read from the dataflow.
Below a setup of three dataflows in line with naming convention applied.
- ALA-E-PowerBiAuditLog, which represents a dataflow that reads data from Azure Log Analytics, is the Extract dataflow and includes data from the Power BI Audit Log.
- ALA-T1-PowerBiAuditLog, represents a linked entity on top of the above dataflow, which is the first layer of transformations, represented by T1.
- ALA-T2-PowerBiAuditLog, represents a second linked entity (T2) reading data from transformation layer 1.
Wrap up and remarks
Organizing your dataflows comes down to a few different things. First, clear agreements within the organization for usage of dataflows will bring you halfway. It is important to all work according to the same principals. Simply to know what you can expect and where to find what. This can be achieved by applying naming convention, next to a lot of other best practices. Obviously, I cannot define your naming convention. This is something where I can only inspire and share how we are doing this. You might have other naming conventions already applied within your organization, that should be in line.
Besides the agreements, there are some considerations to make. You need to decide if you want to split the dataflows in multiple workspaces, but by doing that you agree in paying the (small) penalty for additional load time. Though, this might add more clarity for you end users if you allow them to leverage your dataflows.
Third, is a depending on the decision you made for using multiple workspaces or not. If you decide to split in multiple workspaces, you must deal with the separate refresh schedules. As said, centrally managing the schedules might give you a little more insight already. Personally, I would opt for a new setting in a dataflow that allows refresh triggers for linked entities, even if they are in a different workspace. This will add some more flexibility. I noticed the idea was already on the ideas forum, so I directly voted for it and added a comment. If you agree, raise your voice, comment and vote for it!