



Description

Transform CSV files to Parquet format.



Available Actions

Transform Format

Read CSV files from an S3 bucket path source, create new files with the content transformed to Parquet format and save them in an S3 bucket path destination.
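The core of this action can be sketched with pandas. This is an assumption for illustration only; the doc does not state how the action is implemented, and the sample columns and file name are hypothetical:

```python
# Minimal sketch of the Transform Format action, assuming a pandas-style
# implementation: read CSV content, write the same data out as Parquet.
import io

import pandas as pd

# Stand-in for a CSV file downloaded from the source S3 bucket.
csv_text = "order_id,region,total\n1,east,10.5\n2,west,20.0\n"

df = pd.read_csv(io.StringIO(csv_text))

# Writing Parquet requires a Parquet engine (pyarrow or fastparquet);
# guard the call so the sketch still runs without one installed.
buf = io.BytesIO()
try:
    df.to_parquet(buf, compression="snappy")
    wrote_parquet = True
except ImportError:
    wrote_parquet = False
```

In the real Task, the input would be streamed from the source S3 location and the Parquet output uploaded to the destination S3 location rather than held in memory.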


Variables

Source Location
source_s3_bucket_connection required

Obtain source files from this S3 Bucket location.

source_s3_folder_path

Folder path from the Folder Path Prefix in the Source S3 Bucket Connection to the folder where the source files are located.


Source Options
reprocess required

If enabled, the next run of this Task will re-transform source files that were previously recorded as successfully transformed. Any existing Parquet files created from those source files will be deleted. After the next successful run of this Task, this field will automatically revert to disabled.

columns

List which columns in the source files to include in the output Parquet files. Separate column names with commas. Column numbers can be used in place of names; the first column is 0. Leave this field blank to include all columns.
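Selecting columns by name versus by zero-based number can be illustrated with pandas' `usecols` parameter (an assumed stand-in; the Task's internals are not documented here):

```python
import io

import pandas as pd

csv_text = "a,b,c\n1,2,3\n4,5,6\n"

# Select columns by name...
by_name = pd.read_csv(io.StringIO(csv_text), usecols=["a", "c"])
# ...or by zero-based position; the first column is 0.
by_index = pd.read_csv(io.StringIO(csv_text), usecols=[0, 2])
```

Both reads produce the same two-column result.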

header required

If enabled, specify which row to treat as header values in the Header Row Number field.

header_num

Specify which row number to treat as header values. The first row in a file is referenced as 0, not 1. This field is ignored if Header Row Exists is not enabled.

delimiter

Specify the character to treat as the column delimiter in the source files. Comma is the default.
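How the Header Row Number and delimiter settings interact can be sketched with pandas (again an assumed implementation; the pipe-delimited sample is hypothetical):

```python
import io

import pandas as pd

# A pipe-delimited file whose real header sits on row 1 (rows count from 0).
csv_text = "exported|2024-01-01\ncity|population\nOslo|709000\n"

# header=1 treats the second physical row as the header and skips the rows
# above it; sep="|" overrides the default comma delimiter.
df = pd.read_csv(io.StringIO(csv_text), sep="|", header=1)
```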

pattern

If this field is not blank, only files under the Source S3 Folder Path that match the specified Jinja pattern will be included for processing. If date-based variables are included, they will be used when combining data from multiple files. Do not include the Source S3 Folder Path in the pattern.

date_col

If the Source Pattern does not contain date-based variables and this field is not blank, the column name or column number specified will become the index for the combined data and will be used when the data is split based on Combination Granularity. The first column is 0.

lookback

If variables representing a full date are included in the Source Pattern field, optionally specify a number of days in this field. That number is subtracted from the current day when the Task runs to create a range of dates; only files whose file path contains a full date within that range will be included for potential transformation.
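The date-range arithmetic can be sketched in plain Python. Treating the range as inclusive of both endpoints is an assumption, since the doc does not specify boundary behavior:

```python
from datetime import date, timedelta

def lookback_window(run_date, lookback_days):
    """All dates from run_date - lookback_days through run_date, inclusive
    (inclusiveness assumed for illustration)."""
    return [run_date - timedelta(days=n) for n in range(lookback_days, -1, -1)]

# With a 2-day lookback on 2024-03-10, files dated 03-08 through 03-10 qualify.
window = lookback_window(date(2024, 3, 10), 2)
```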


Destination Location
dest_s3_bucket_connection required

Create Parquet formatted files in this S3 Bucket location.

dest_s3_folder_path

Folder path from the Folder Path Prefix in the Destination S3 Bucket Connection to the folder where the Parquet formatted files will be created.


Destination Options
comb_level required

If date-based variables are included in the Source Pattern or a Date Column is specified, data read from multiple source files will be combined, then split and recombined into Parquet files based on the selection in this field. If None is selected, no combination occurs and each source file becomes a separate Parquet file.
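One way to picture this re-splitting, assuming a pandas-style implementation and a hypothetical monthly granularity setting:

```python
import pandas as pd

# Rows combined from several source files, keyed by a date column.
df = pd.DataFrame({
    "event_date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-01"]),
    "value": [1, 2, 3],
})

# A hypothetical "Monthly" granularity: re-split the combined data so each
# month's rows would become their own Parquet file.
groups = {str(p): g for p, g in df.groupby(df["event_date"].dt.to_period("M"))}
```

Here the three rows would yield two Parquet files, one per month.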

deduplicate required

If enabled, when combining data from source files with Parquet data from previous Task runs, any new rows that are identical to existing rows will be removed.
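The deduplication step amounts to dropping exact-duplicate rows when new and existing data are combined; a sketch assuming pandas:

```python
import pandas as pd

existing = pd.DataFrame({"id": [1, 2], "value": [10, 20]})
incoming = pd.DataFrame({"id": [2, 3], "value": [20, 30]})  # id=2 repeats exactly

# New rows identical to existing rows are removed; only genuinely new rows survive.
combined = pd.concat([existing, incoming], ignore_index=True).drop_duplicates()
```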

parquet_compression required

Select the type of compression (if any) to apply to all columns in the Parquet file.


Tracking
tracking_db required

Connection for the MySQL database used to record which files this Task has processed in the past.

tracking_table required

Schema and table name where processed file information is stored within the MySQL database.