Quickstart
distilabel is a framework for building pipelines that generate synthetic data using LLMs. It defines a Pipeline that orchestrates the execution of Step subclasses, which are connected as nodes in a Directed Acyclic Graph (DAG).

This guide walks you through creating a simple pipeline that uses the OpenAILLM class to generate text. The Pipeline loads a dataset containing a column named prompt from the Hugging Face Hub via the LoadDataFromHub step, then uses the OpenAILLM class to generate text from that dataset with the TextGeneration task.
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration

# Steps created inside the context manager are registered with the pipeline.
with Pipeline(
    name="simple-text-generation-pipeline",
    description="A simple text generation pipeline",
) as pipeline:
    # Load the dataset from the Hub, renaming the prompt column to instruction.
    load_dataset = LoadDataFromHub(
        name="load_dataset",
        output_mappings={"prompt": "instruction"},
    )

    # Generate text for each instruction with the given LLM.
    text_generation = TextGeneration(
        name="text_generation",
        llm=OpenAILLM(model="gpt-3.5-turbo"),
    )

    # Connect the steps: the loader's output feeds the generation task.
    load_dataset >> text_generation

if __name__ == "__main__":
    # Runtime parameters are keyed by step name.
    distiset = pipeline.run(
        parameters={
            load_dataset.name: {
                "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                "split": "test",
            },
            text_generation.name: {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 512,
                    }
                }
            },
        },
    )
    # Publish the generated dataset to the Hugging Face Hub.
    distiset.push_to_hub(repo_id="distilabel-example")
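
To make the DAG idea more concrete, the sketch below shows one way a >> operator can record a dependency between two steps and how a topological sort then yields a valid execution order. It is a toy illustration only: ToyStep and the graph dictionary are hypothetical names invented here, and this is not how distilabel implements Step or Pipeline internally.

```python
from graphlib import TopologicalSorter


class ToyStep:
    """Hypothetical stand-in for a pipeline step; not distilabel's Step class."""

    def __init__(self, name, registry):
        self.name = name
        self.registry = registry  # maps step name -> set of upstream step names
        registry.setdefault(name, set())

    def __rshift__(self, other):
        # `a >> b` records that b depends on a, then returns b so calls can chain.
        self.registry[other.name].add(self.name)
        return other


graph = {}
load_dataset = ToyStep("load_dataset", graph)
text_generation = ToyStep("text_generation", graph)

# Same connection syntax as in the example above.
load_dataset >> text_generation

# A topological sort yields an order in which every step runs after its inputs.
execution_order = list(TopologicalSorter(graph).static_order())
print(execution_order)  # ['load_dataset', 'text_generation']
```

Because the graph is acyclic, such an order always exists; a cycle would make TopologicalSorter raise a CycleError instead.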
Minimal example
distilabel gives you a lot of flexibility when creating pipelines, but to get started right away you can omit most of the details and rely on the default values:
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration
from datasets import load_dataset

dataset = load_dataset("distilabel-internal-testing/instruction-dataset-mini", split="test")

# No name or description is given, so the defaults are used.
with Pipeline() as pipeline:
    # A single task with default settings.
    TextGeneration(llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"))

if __name__ == "__main__":
    # Pass the dataset directly instead of using a loader step.
    distiset = pipeline.run(dataset=dataset)
    distiset.push_to_hub(repo_id="distilabel-example")
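
The first example relied on output_mappings={"prompt": "instruction"} to rename a column before it reached the task. When a dataset is passed directly to pipeline.run, there is no loader step to do that renaming, so if your own data uses a different column name you may want to rename it beforehand. The sketch below shows the renaming idea on plain Python rows; rename_columns and the sample rows are hypothetical and are not part of distilabel.

```python
def rename_columns(rows, mappings):
    """Return new rows with keys renamed according to `mappings` (old -> new)."""
    return [
        {mappings.get(key, key): value for key, value in row.items()}
        for row in rows
    ]


# Hypothetical rows shaped like a dataset with a `prompt` column.
rows = [
    {"prompt": "Write a haiku about autumn."},
    {"prompt": "Explain what a DAG is in one sentence."},
]

renamed = rename_columns(rows, {"prompt": "instruction"})
print(renamed[0])  # {'instruction': 'Write a haiku about autumn.'}
```

With a Hugging Face dataset object, dataset.rename_column("prompt", "instruction") performs the equivalent rename without any custom helper.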