Workflow
Incremental Data Processing with Parquet
In this workflow, we use the NYC taxi dataset to showcase continuous preprocessing and publishing of event data. Instead of using the Group Loop Start node, this workflow could be executed once per week to preprocess and publish all data that has arrived during that week. The result of each run is written as a separate Parquet file within the same folder. To ensure the uniqueness of each run's file, we use the year and week of the run as a file prefix, set via a flow variable.
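As an illustration of what such a flow variable computes, the year-and-week prefix can be derived from the run date's ISO calendar. This is a minimal sketch in plain Python (the function name `weekly_prefix` is hypothetical, not part of the workflow):

```python
from datetime import date

def weekly_prefix(run_date: date) -> str:
    # ISO calendar: (year, week number, weekday); the ISO year can
    # differ from the calendar year around New Year, which keeps
    # week-based prefixes consistent.
    iso_year, iso_week, _ = run_date.isocalendar()
    return f"{iso_year}-W{iso_week:02d}"

# A run on Friday, 15 Jan 2021 falls in ISO week 2:
print(weekly_prefix(date(2021, 1, 15)))  # prints "2021-W02"
```

Prefixing each output file with this value guarantees that successive weekly runs never overwrite one another.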
Since the folder stays the same and Parquet readers consume all files within a folder regardless of their file names, this folder can be exposed as an external table (e.g. in Hive or Impala) to power further analysis processes.
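The folder layout this produces can be sketched as follows: each weekly run drops one uniquely prefixed file into the shared folder, and any Parquet-aware reader (Hive, Impala, or a library such as pyarrow) then treats the folder's contents as a single dataset. The file names and the empty placeholder files below are illustrative only:

```python
import pathlib
import tempfile

# Stand-in for the shared output folder backing the external table.
folder = pathlib.Path(tempfile.mkdtemp())

# Each weekly run writes one file; the week prefix keeps names unique.
for prefix in ["2021-W01", "2021-W02", "2021-W03"]:
    # Placeholder for the real Parquet output of that week's run.
    (folder / f"{prefix}_taxi.parquet").write_bytes(b"")

# A Parquet reader pointed at the folder picks up every file,
# independent of the individual file names.
files = sorted(p.name for p in folder.glob("*.parquet"))
print(files)
# prints "['2021-W01_taxi.parquet', '2021-W02_taxi.parquet', '2021-W03_taxi.parquet']"
```

Because new files simply accumulate in the same folder, the external table grows week by week without any schema or table redefinition.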
Used extensions & nodes
Created with KNIME Analytics Platform version 4.3.0