KeepColumns

Keeps selected columns in the dataset.

KeepColumns is a Step whose process method keeps only the columns listed in the columns attribute. The columns attribute also overrides the default values of the inputs and outputs properties.

Note

The order in which the columns are provided matters: the output columns follow that same order. This is useful before pushing either a datasets.Dataset via the PushToHub step or a distilabel.Distiset via the Pipeline.run output variable.
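
The selection and ordering behavior described above can be illustrated with a minimal plain-Python sketch. This is not the library's implementation: the helper keep_columns and the sample row are hypothetical, and the sketch assumes every requested column is present in each row (a missing column raises KeyError here).

```python
from typing import Any, Dict, List


def keep_columns(rows: List[Dict[str, Any]], columns: List[str]) -> List[Dict[str, Any]]:
    """Hypothetical helper: keep only `columns` in each row, in the given order."""
    # Dicts preserve insertion order (Python 3.7+), so iterating over
    # `columns` fixes the column order of every output row.
    return [{col: row[col] for col in columns} for row in rows]


# Illustrative input row; the column names are made up for this sketch.
rows = [
    {"instruction": "What is 2 + 2?", "generation": "4", "model_name": "my_model"},
]

# Request the columns in a different order than they appear in the input.
kept = keep_columns(rows, columns=["generation", "instruction"])
print(kept)
# [{'generation': '4', 'instruction': 'What is 2 + 2?'}]
```

The model_name column is dropped, and the remaining columns appear in the order they were requested rather than the order they had in the input.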

Attributes

  • columns: List of strings with the names of the columns to keep.

Input & Output Columns

Inputs

  • dynamic (determined by columns attribute): The columns to keep.

Outputs

  • dynamic (determined by columns attribute): The columns that were kept.