Data Flow: Making Data Processing A Lot More Efficient

The end-of-year holidays are around the corner, and ClicData is bringing you some sweet and awesome features!

Our latest product release includes features that will take your data processing to the next level! With 3 new nodes as well as designer and performance improvements, the Data Flow module now goes beyond the capabilities of our former system of Views, Merges, and Fusions.

As a quick summary, Data Flow is designed to become the central tool for data processing, transformation, and augmentation in ClicData. Now is the time to switch to Data Flow for all your data cleaning, enhancing, and modeling operations. And this is why:

Clear Overview of Data Transformations

The name is a hint: with Data Flow, you build flows from A to Z. All data cleaning, calculations, and similar operations now happen in a single visual interface where each step can be easily documented and understood by all users.

From input table(s) to output table(s), you build a continuous chain of interconnected nodes that transform the data step by step.

[Screenshot: the Data Flow interface]

Powerful Capabilities and Endless Possibilities

A single data flow can produce as many output tables as you wish, from as many input tables and branches as needed. Multiple data transformations can run either in parallel or in sequence, ensuring that all outputs are up to date at the same time.

Side management steps, such as building Views, Fusions, and Merges or maintaining a clear naming convention and folder organization, are no longer necessary. Data flows and tables are kept in separate explorers in the UI: the data on one hand, the processing on the other.

Previews of the current flow are only shown when you ask for them! This keeps flow editing quick and your work environment efficient.

Better Control

A Data Flow is executed when you want it to be executed: either on a schedule or on demand. This contrasts with Views, Merges, and Fusions, which were automatically rebuilt each time the data was refreshed. You can therefore load multiple sets of data and only prepare the new data for analysis or visualization when you are ready. You decide when an output table really needs to be updated.

[Screenshot: scheduling a Data Flow]

It also allows you to preview the flow prior to actually executing it, ensuring that any kinks are worked out beforehand.

Higher Performance

Data Flow can be viewed as the process of taking data from one or many physical tables, processing the data, and then putting the results in another physical table.

Unlike Views, Merges, and Fusions, which needed to be cached, data flows write their results to physical tables that do not need to be recalculated or reprocessed each time they are used in dashboards, reports, or other areas of ClicData.

You no longer need to worry about constraints and actions that were previously required, such as cache management and dependency checking.

The nodes themselves are implemented using a variety of techniques, including:

  • Common Table Expressions (CTEs),
  • the automatic building of joins for faster lookups or aggregations, depending on the number of items,
  • Columnar Store Indexes,
  • and more.
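To make the first two techniques concrete, here is a minimal, hypothetical sketch of the kind of SQL such a node might generate: a CTE stages an aggregation, and a join then performs the label lookup, all in one statement. The tables and columns are invented for illustration; this is not ClicData's actual generated SQL. The example uses Python's built-in sqlite3 so it can run anywhere:

```python
import sqlite3

# Illustrative only: sample data standing in for a Data Flow's input tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, country_code TEXT, amount REAL);
    CREATE TABLE countries (code TEXT, name TEXT);
    INSERT INTO orders VALUES (1, 'FR', 120.0), (2, 'US', 80.0), (3, 'FR', 40.0);
    INSERT INTO countries VALUES ('FR', 'France'), ('US', 'United States');
""")

rows = conn.execute("""
    WITH totals AS (                      -- CTE: the aggregation step
        SELECT country_code, SUM(amount) AS total
        FROM orders
        GROUP BY country_code
    )
    SELECT c.name, t.total                -- join: the lookup step
    FROM totals t
    JOIN countries c ON c.code = t.country_code
    ORDER BY c.name
""").fetchall()

print(rows)  # [('France', 160.0), ('United States', 80.0)]
```

Because the CTE and the join compile into a single query, the database engine can optimize the whole transformation at once instead of materializing each step separately.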

Logical Flow Sequences

You can chain flows one after the other to build logical processing groups.

For example, you can have an initial flow that cleans the data and standardizes the output schema.

In the screenshot below we can see how historical LinkedIn metrics are prepared for further processing (here we work with monthly follower counts by country, industry, and sector). This flow has been run manually only once, as it relates to historical data that won't change anymore.

[Screenshot: flow preparing historical LinkedIn metrics]

After that, a second flow can use the output table to then calculate standard metrics for various reports and dashboards.

Continuing with our example, we see how live metrics are processed for the 3 dimensions of LinkedIn followers in the flow below. They are then combined with the historical outputs from the first Data Flow.

[Screenshot: flow processing live LinkedIn metrics]

And potentially, a third flow can then use the outputs of the previous flows to pre-group or pre-aggregate some data.

This keeps data processing clean and modular while making it easier to identify potential areas of performance improvement.
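The staging pattern described above can be sketched as plain functions, each standing in for one flow that reads the previous flow's output table. The function and column names are hypothetical, chosen only to mirror the LinkedIn example; this is not ClicData's API:

```python
from collections import defaultdict

def clean_flow(raw_rows):
    """Flow 1: clean the data and standardize the output schema."""
    cleaned = []
    for row in raw_rows:
        if row.get("Followers") is None:
            continue  # drop incomplete records
        # Rename columns to a standard schema for downstream flows.
        cleaned.append({"country": row["Cntry"], "followers": row["Followers"]})
    return cleaned

def metrics_flow(clean_rows):
    """Flow 2: compute standard metrics from flow 1's output table."""
    totals = defaultdict(int)
    for row in clean_rows:
        totals[row["country"]] += row["followers"]
    return dict(totals)

raw = [
    {"Cntry": "FR", "Followers": 10},
    {"Cntry": "FR", "Followers": 5},
    {"Cntry": "US", "Followers": None},  # simulated bad record
]
print(metrics_flow(clean_flow(raw)))  # {'FR': 15}
```

Because each stage only depends on the previous stage's output, a stage that rarely changes (like the historical flow) can be run once, while downstream stages rerun on their own schedule.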

Seamless Collaboration

Because all node operations are contained in a single Data Flow, it is very easy to share flows with other users, copy entire processes, and simply change the input and/or output tables.

Each node can be given a specific name and description, making it easy to maintain a clear understanding of the data processing with minimal documentation effort.

The same security system as for other objects in the platform applies to Data Flow: users can be editors or viewers of a data flow, or have no access to some or all data flows.

Time to Embrace the Power of Data Flow

First of all, rest assured that Views, Fusions, and Merges will remain active for a few more years for customers already using them. You will have time to master flows and experience their benefits for yourself. Eventually, though, Data Flow is meant to replace the former system.

Data Flow will continue to be enhanced in upcoming releases, adding more features such as new nodes, debugging help, and UI improvements.

Start today!

Data Flow is available to all users in ClicData. If you don’t see the access in your Main Menu under the Data section, please reach out to your admin who can give you permission. If you need help from us, don’t hesitate to reach out to our Support team.

On January 10, 2023, we will broadcast a live webinar dedicated to Data Flow, where we will show some real-life examples of building a full data flow. You will get the chance to ask questions directly to the ClicData product team. You can register here.

We hope you enjoy using Data Flow as much as we do.