
Quickstart

To start off, distilabel is a framework for building pipelines that generate synthetic data with LLMs. It defines a Pipeline that orchestrates the execution of Step subclasses, which are connected as nodes in a Directed Acyclic Graph (DAG).

In this guide we will walk you through the process of creating a simple pipeline that uses the OpenAILLM class to generate text. The Pipeline will load a dataset containing a column named prompt from the Hugging Face Hub via the LoadDataFromHub step, and then generate text based on that dataset with the TextGeneration task, which uses the OpenAILLM class.

from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration

with Pipeline(  # the Pipeline works as a context manager
    name="simple-text-generation-pipeline",
    description="A simple text generation pipeline",
) as pipeline:  # every step created inside the context is added to the pipeline
    load_dataset = LoadDataFromHub(  # step that loads a dataset from the Hugging Face Hub
        name="load_dataset",
        output_mappings={"prompt": "instruction"},
    )

    text_generation = TextGeneration(  # task that generates text for each row of the dataset
        name="text_generation",
        llm=OpenAILLM(model="gpt-3.5-turbo"),  # the LLM used by the task to generate the text
    )

    load_dataset >> text_generation  # connect the two steps as nodes of the DAG

if __name__ == "__main__":
    distiset = pipeline.run(  # run the pipeline, providing the runtime parameters of the steps
        parameters={
            load_dataset.name: {
                "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                "split": "test",
            },
            text_generation.name: {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 512,
                    }
                }
            },
        },
    )
    distiset.push_to_hub(repo_id="distilabel-example")  # push the resulting Distiset to the Hugging Face Hub
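
Note that the OpenAILLM class needs a valid OpenAI API key. By default it is read from the OPENAI_API_KEY environment variable, although it can also be passed explicitly. A minimal sketch of both options (the key shown is a placeholder, and passing the key as an argument is an assumption to adapt to your setup):

import os

from distilabel.llms import OpenAILLM

# Option 1: set the environment variable before creating the LLM.
os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder, use your own key

# Option 2 (assumption): pass the key explicitly when instantiating the LLM.
llm = OpenAILLM(model="gpt-3.5-turbo", api_key=os.environ["OPENAI_API_KEY"])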

Minimal example

distilabel gives you a lot of flexibility to create your pipelines, but to start right away you can omit many of the details and rely on the default values:

from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration
from datasets import load_dataset


dataset = load_dataset("distilabel-internal-testing/instruction-dataset-mini", split="test")

with Pipeline() as pipeline:  # a Pipeline with default name and description
    TextGeneration(llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"))  # a single TextGeneration task, added to the pipeline by the context manager


if __name__ == "__main__":
    distiset = pipeline.run(dataset=dataset)  # the dataset is passed directly to the run method
    distiset.push_to_hub(repo_id="distilabel-example")
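
pipeline.run returns a Distiset, a dictionary-like object holding one dataset per leaf step of the pipeline, so you can inspect the generated data before (or instead of) pushing it to the Hub. A minimal sketch continuing from the distiset returned above, assuming a single leaf step whose output lands under the "default" configuration:

# Continuing from the distiset returned by pipeline.run above.
print(distiset)  # overview of the configurations and splits
# Assumption: with a single leaf step, the data lives under the "default" key.
print(distiset["default"]["train"][0])  # first generated row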