Dataset management¶
This guide provides an overview of datasets, explaining the basics of how to set them up and manage them in Argilla.
A dataset is a collection of records that you can configure for labelers to provide feedback using the UI. Depending on the specific requirements of your task, you may need various types of feedback. You can customize the dataset to include different kinds of questions, so the first step will be to define the aim of your project and the kind of data and feedback you will need. With this information, you can start configuring a dataset by defining fields, questions, metadata, vectors, and guidelines through settings.
Question: Who can manage datasets?
Only users with the owner role can manage (create, retrieve, update and delete) all the datasets.
The users with the admin role can manage (create, retrieve, update and delete) the datasets in the workspaces they have access to.
Main Classes
Check the Dataset - Python Reference to see the attributes, arguments, and methods of the Dataset class in detail.
rg.Settings(
    fields=[rg.TextField(name="text")],
    questions=[
        rg.LabelQuestion(
            name="label",
            labels=["label_1", "label_2", "label_3"]
        )
    ],
    metadata=[rg.TermsMetadataProperty(name="metadata")],
    vectors=[rg.VectorField(name="vector", dimensions=10)],
    guidelines="guidelines",
    allow_extra_metadata=True,
)
Check the Settings - Python Reference to see the attributes, arguments, and methods of the Settings class in detail.
Create a dataset¶
To create a dataset, define it with the Dataset class and then call the create method, which sends the dataset to the server so that it can be viewed in the UI. If the dataset does not appear in the UI, you may need to click the refresh button to update the view. For further configuration of the dataset, you can refer to the settings section.
The created dataset will be empty; to add records, refer to this how-to guide.
import argilla_sdk as rg
client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")
settings = rg.Settings(
    guidelines="These are some guidelines.",
    fields=[
        rg.TextField(
            name="text",
        ),
    ],
    questions=[
        rg.LabelQuestion(
            name="label",
            labels=["label_1", "label_2", "label_3"]
        ),
    ],
)
dataset = rg.Dataset(
    name="my_dataset",
    workspace="my_workspace",
    settings=settings,
    client=client,
)
dataset.create()
Accessing attributes
Access the attributes of a dataset by calling them directly on the dataset object. For example, dataset.id, dataset.name or dataset.settings. You can similarly access the fields, questions, metadata, vectors and guidelines. For instance, dataset.fields or dataset.questions.
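As a minimal sketch of this attribute access, the helper below collects a few attributes into a plain dictionary. The function name summarize_dataset is hypothetical, and the sketch assumes that dataset.fields and dataset.questions can be iterated and that each item exposes a name attribute.

```python
def summarize_dataset(dataset):
    """Collect a few dataset attributes into a plain dictionary.

    Assumes `dataset.fields` and `dataset.questions` are iterable and that
    each item has a `name` attribute (an assumption, not confirmed above).
    """
    return {
        "id": dataset.id,
        "name": dataset.name,
        "field_names": [field.name for field in dataset.fields],
        "question_names": [question.name for question in dataset.questions],
    }
```

Calling summarize_dataset(dataset) on a dataset retrieved from the server would then give a quick overview that is easy to print or log.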
Create multiple datasets with the same settings¶
To create multiple datasets with the same settings, define the settings once and pass them to each dataset.
import argilla_sdk as rg
settings = rg.Settings(
    guidelines="Select the sentiment of the prompt.",
    fields=[rg.TextField(name="prompt", use_markdown=True)],
    questions=[rg.LabelQuestion(name="sentiment", labels=["positive", "negative"])],
)
dataset1 = rg.Dataset(name="sentiment_analysis_1", settings=settings)
dataset2 = rg.Dataset(name="sentiment_analysis_2", settings=settings)
# Create the datasets on the server
dataset1.create()
dataset2.create()
Create a dataset with settings from an existing dataset¶
To create a new dataset with settings from an existing dataset, get the settings from the existing dataset and pass them to the new dataset.
import argilla_sdk as rg
# Connect to the server so that the existing dataset can be retrieved
client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")
# Get the settings from an existing dataset
existing_dataset = client.datasets("sentiment_analysis")
# Create a new dataset with the same settings
dataset = rg.Dataset(name="sentiment_analysis_copy", settings=existing_dataset.settings)
# Create the dataset on the server
dataset.create()
Define dataset settings¶
Fields¶
The fields in a dataset consist of one or more data items requiring annotation. Currently, Argilla only supports plain text and markdown through the TextField, though we plan to introduce additional field types in future updates.
A field is defined in the TextField class that has the following arguments:
- name: The name of the field.
- title (optional): The name of the field, as it will be displayed in the UI. Defaults to the name value.
- required (optional): Whether the field is required or not. Defaults to True. At least one field must be required.
- use_markdown (optional): Specify whether you want markdown rendered in the UI. Defaults to False. If you set it to True, you will be able to use all the Markdown features for text formatting, as well as embed multimedia content and PDFs.
Note
The order of the fields in the UI follows the order in which these are added to the fields attribute in the Python SDK.
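The following sketch shows one way to prepare several fields using only the arguments listed above. The field names prompt and context and the helpers has_required_field and build_fields are hypothetical; the SDK import is deferred so that the check can run without a server or the SDK installed.

```python
# Keyword arguments for two illustrative fields. The list order is the order
# in which the fields would be shown in the UI.
field_specs = [
    {"name": "prompt", "title": "Prompt", "required": True, "use_markdown": False},
    {"name": "context", "title": "Context", "required": False, "use_markdown": True},
]


def has_required_field(specs):
    """Return True if at least one field is required."""
    # `required` defaults to True when it is omitted, as described above.
    return any(spec.get("required", True) for spec in specs)


def build_fields(specs):
    """Create one TextField per spec, preserving the list order."""
    import argilla_sdk as rg  # deferred import: only needed to build the fields

    if not has_required_field(specs):
        raise ValueError("At least one field must be required.")
    return [rg.TextField(**spec) for spec in specs]
```

The result of build_fields(field_specs) can be passed as the fields argument of rg.Settings.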
Questions¶
To collect feedback for your dataset, you need to formulate questions that annotators will be asked to answer. Currently, Argilla supports the following types of questions: LabelQuestion, MultiLabelQuestion, RankingQuestion, RatingQuestion, SpanQuestion, and TextQuestion.
A LabelQuestion asks annotators to choose exactly one label from a list of options. This type is useful for text classification tasks. In the UI, the labels have a rounded shape. It has the following configuration:
- name: The name of the question.
- title (optional): The name of the question, as it will be displayed in the UI. Defaults to the name value.
- description (optional): The text to be displayed in the question tooltip in the UI. You can use it to give more context or information to annotators.
- required (optional): Whether the question is required or not. Defaults to True. At least one question must be required.
- labels: A list of strings with the options for this question. If you'd like the text of the labels to be different in the UI and internally, you can pass a dictionary instead where the key is the internal name and the value will be the text displayed in the UI.
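As a hedged illustration of the two accepted shapes for labels, the sketch below normalizes a list or a dictionary into a mapping from internal name to displayed text, and then builds a question from a dictionary. The helper label_display_names and the label values are hypothetical, and the helper only illustrates the convention described above, not how the SDK handles labels internally.

```python
def label_display_names(labels):
    """Map each internal label name to the text shown in the UI.

    A list means the internal name and the displayed text are identical;
    a dictionary maps internal name -> displayed text.
    """
    if isinstance(labels, dict):
        return dict(labels)
    return {label: label for label in labels}


# Illustrative labels: short internal names with friendlier UI text.
sentiment_labels = {"pos": "Positive", "neg": "Negative", "neu": "Neutral"}


def build_label_question():
    """Build a LabelQuestion using only the arguments listed above."""
    import argilla_sdk as rg  # deferred import: only needed to build the question

    return rg.LabelQuestion(
        name="sentiment",
        title="What is the sentiment of the text?",
        description="Choose the single option that fits best.",
        required=True,
        labels=sentiment_labels,
    )
```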
A MultiLabelQuestion asks annotators to choose all applicable labels from a list of options. This type is useful for multi-label text classification tasks. In the UI, the labels have a squared shape. It has the following configuration:
- name: The name of the question.
- title (optional): The name of the question, as it will be displayed in the UI. Defaults to the name value.
- description (optional): The text to be displayed in the question tooltip in the UI. You can use it to give more context or information to annotators.
- required (optional): Whether the question is required or not. Defaults to True. At least one question must be required.
- labels: A list of strings with the options for this question. If you'd like the text of the labels to be different in the UI and internally, you can pass a dictionary instead where the key is the internal name and the value will be the text displayed in the UI.
- visible_labels (optional): The number of labels that will be visible at first sight in the UI. By default, the UI will show 20 labels and collapse the rest. Set your preferred number to change this limit or set visible_labels=None to show all options.
rg.MultiLabelQuestion(
    name="multi_label",
    title="Does the response include any of the following?",
    description="Select all that apply.",
    required=True,
    labels={
        "hate": "Hate Speech",
        "sexual": "Sexual content",
        "violent": "Violent content",
        "pii": "Personal information",
        "untruthful": "Untruthful info",
        "not_english": "Not English",
        "inappropriate": "Inappropriate content"
    },  # or ["hate", "sexual", "violent", "pii", "untruthful", "not_english", "inappropriate"]
    visible_labels=4
)
A RankingQuestion asks annotators to order a list of options. It is useful to gather information on the preference or relevance of a set of options. Ties are allowed and all options will need to be ranked. It has the following configuration:
- name: The name of the question.
- title (optional): The name of the question, as it will be displayed in the UI. Defaults to the name value.
- description (optional): The text to be displayed in the question tooltip in the UI. You can use it to give more context or information to annotators.
- required (optional): Whether the question is required or not. Defaults to True. At least one question must be required.
- values: A list of strings with the options they will need to rank. If you'd like the text of the options to be different in the UI and internally, you can pass a dictionary instead where the key is the internal name and the value is the text to display in the UI.
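The sketch below illustrates the two ranking rules stated above: every option must receive a rank, and ties are allowed. Representing an answer as a dictionary from option name to rank position is an assumption made for illustration only, and the helper is_complete_ranking and the option names are hypothetical.

```python
# Internal option names mapped to the text displayed in the UI (illustrative).
ranking_values = {"reply-1": "Reply 1", "reply-2": "Reply 2", "reply-3": "Reply 3"}


def is_complete_ranking(ranks, values):
    """Return True if every option has a rank and no unknown option is ranked.

    `ranks` maps an internal option name to its rank position. Equal rank
    positions (ties) are accepted.
    """
    return set(ranks) == set(values)


def build_ranking_question():
    """Build a RankingQuestion using only the arguments listed above."""
    import argilla_sdk as rg  # deferred import: only needed to build the question

    return rg.RankingQuestion(
        name="preference",
        title="Order the replies from best to worst.",
        description="Ties are allowed, but every reply must be ranked.",
        required=True,
        values=ranking_values,
    )
```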
A RatingQuestion asks annotators to select one option from a list of integer values. This type is useful for collecting numerical scores. It has the following configuration:
- name: The name of the question.
- title (optional): The name of the question, as it will be displayed in the UI. Defaults to the name value.
- description (optional): The text to be displayed in the question tooltip in the UI. You can use it to give more context or information to annotators.
- required (optional): Whether the question is required or not. Defaults to True. At least one question must be required.
- values: A list of unique integers representing the scores that annotators can select from. The values must be within the range [1, 10].
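The constraints on values can be checked before the question is created. The following is a minimal sketch of such a check based on the rules stated above (unique integers within [1, 10]); the helper rating_values_are_valid is hypothetical, and rejecting an empty list is an added assumption.

```python
def rating_values_are_valid(values):
    """Return True if `values` are unique integers within the range [1, 10]."""
    values = list(values)
    # Booleans are a subclass of int in Python, so they are excluded explicitly.
    all_ints = all(isinstance(v, int) and not isinstance(v, bool) for v in values)
    return (
        len(values) > 0  # assumption: an empty list offers nothing to select
        and all_ints
        and len(set(values)) == len(values)
        and all(1 <= v <= 10 for v in values)
    )


def build_rating_question(values):
    """Build a RatingQuestion after checking the values."""
    import argilla_sdk as rg  # deferred import: only needed to build the question

    if not rating_values_are_valid(values):
        raise ValueError("Rating values must be unique integers within [1, 10].")
    return rg.RatingQuestion(
        name="quality",
        title="How would you rate the response?",
        values=list(values),
    )
```

For example, build_rating_question([1, 2, 3, 4, 5]) would define a five-point scale.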
A SpanQuestion asks annotators to select a portion of the text of a specific field and apply a label to it. This type of question is useful for named entity recognition or information extraction tasks. It has the following configuration:
- name: The name of the question.
- title (optional): The name of the question, as it will be displayed in the UI. Defaults to the name value, but capitalized.
- description (optional): The text to be displayed in the question tooltip in the UI. You can use it to give more context or information to annotators.
- required (optional): Whether the question is required or not. Defaults to True. At least one question must be required.
- labels: A list of strings with the options for this question. If you'd like the text of the labels to be different in the UI and internally, you can pass a dictionary instead where the key is the internal name and the value will be the text to display in the UI.
- field: This question is always attached to a specific field. You should pass a string with the name of the field where the labels of the SpanQuestion should be used.
- allow_overlapping: This value specifies whether overlapping spans are allowed or not. Defaults to False.
- visible_labels (optional): The number of labels that will be visible at first sight in the UI. By default, the UI will show 20 labels and collapse the rest. Set your preferred number to change this limit or set visible_labels=None to show all options.
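Because a SpanQuestion is always attached to a field, it can help to confirm that the referenced field name exists before building the settings. The sketch below shows this under stated assumptions: the field name text, the entity labels, and the helper span_targets_existing_field are all hypothetical.

```python
# Names of the fields defined in the dataset settings (illustrative).
field_names = ["text"]

# Keyword arguments for the question, using only the arguments listed above.
span_kwargs = {
    "name": "entities",
    "title": "Highlight the entities in the text",
    "field": "text",
    "labels": {"PER": "Person", "ORG": "Organization", "LOC": "Location"},
    "allow_overlapping": False,
}


def span_targets_existing_field(kwargs, names):
    """Return True if the question's `field` refers to a defined field name."""
    return kwargs["field"] in names


def build_span_question():
    """Build a SpanQuestion after checking that its field exists."""
    import argilla_sdk as rg  # deferred import: only needed to build the question

    if not span_targets_existing_field(span_kwargs, field_names):
        raise ValueError(f"Unknown field: {span_kwargs['field']!r}")
    return rg.SpanQuestion(**span_kwargs)
```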
A TextQuestion offers annotators a free-text area where they can enter any text. This type is useful for collecting natural language data, such as corrections or explanations. It has the following configuration:
- name: The name of the question.
- title (optional): The name of the question, as it will be displayed in the UI. Defaults to the name value, but capitalized.
- description (optional): The text to be displayed in the question tooltip in the UI. You can use it to give more context or information to annotators.
- required (optional): Whether the question is required or not. Defaults to True. At least one question must be required.
- use_markdown (optional): Define whether the field should render markdown text. Defaults to False. If you set it to True, you will be able to use all the Markdown features for text formatting, as well as embed multimedia content and PDFs.
Metadata¶
Metadata properties allow you to configure the use of metadata information for the filtering and sorting features available in the UI and Python SDK. You can add three types of metadata properties: TermsMetadataProperty, IntegerMetadataProperty and FloatMetadataProperty.
A TermsMetadataProperty lets you add a list of strings as metadata options. It has the following configuration:
- name: The name of the metadata property.
- title (optional): The name of the metadata property, as it will be displayed in the UI. Defaults to the name value, but capitalized.
- options (optional): You can pass a list of valid values for this metadata property, in case you want to run any validation.
An IntegerMetadataProperty lets you add integer values as metadata. It has the following configuration:
- name: The name of the metadata property.
- title (optional): The name of the metadata property, as it will be displayed in the UI. Defaults to the name value, but capitalized.
- min (optional): You can pass a minimum valid value. If none is provided, the minimum value will be computed from the values provided in the records.
- max (optional): You can pass a maximum valid value. If none is provided, the maximum value will be computed from the values provided in the records.
A FloatMetadataProperty lets you add float values as metadata. It has the following configuration:
- name: The name of the metadata property.
- title (optional): The name of the metadata property, as it will be displayed in the UI. Defaults to the name value, but capitalized.
- min (optional): You can pass a minimum valid value. If none is provided, the minimum value will be computed from the values provided in the records.
- max (optional): You can pass a maximum valid value. If none is provided, the maximum value will be computed from the values provided in the records.
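The sketch below mirrors the behavior described above in plain Python: explicit bounds take precedence, missing bounds fall back to the values found in the records, and terms are only validated when options are given. The helpers resolve_bounds and is_allowed_term and the property names are hypothetical; this illustrates the described rules and is not the server's actual computation.

```python
def resolve_bounds(record_values, minimum=None, maximum=None):
    """Return (min, max): explicit bounds win, otherwise use the record values.

    Raises ValueError if a bound is missing and `record_values` is empty.
    """
    lower = min(record_values) if minimum is None else minimum
    upper = max(record_values) if maximum is None else maximum
    return lower, upper


def is_allowed_term(value, options=None):
    """Return True if `value` is valid; without options, any value is accepted."""
    return options is None or value in options


def build_metadata_properties():
    """Build one property of each type using only the arguments listed above."""
    import argilla_sdk as rg  # deferred import: only needed to build the properties

    return [
        rg.TermsMetadataProperty(name="source", title="Source", options=["news", "blog"]),
        rg.IntegerMetadataProperty(name="token_count", min=0),
        rg.FloatMetadataProperty(name="confidence", min=0.0, max=1.0),
    ]
```

The list returned by build_metadata_properties() can be passed as the metadata argument of rg.Settings.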
Vectors¶
To use the similarity search in the UI and the Python SDK, you will need to configure vectors using the VectorField class. It has the following configuration:
- name: The name of the vector.
- title (optional): A name for the vector to display in the UI for better readability.
- dimensions: The dimensions of the vectors used in this setting.
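As a small sketch, the check below confirms that a vector has the number of dimensions declared in the setting before it is attached to a record. The helper vector_matches_dimensions and the vector name are hypothetical, and the sketch assumes that dimensions refers to the length of each vector.

```python
VECTOR_DIMENSIONS = 10  # must match the size of the embeddings you plan to add


def vector_matches_dimensions(vector, dimensions):
    """Return True if the vector has exactly `dimensions` entries."""
    return len(vector) == dimensions


def build_vector_field():
    """Build a VectorField using only the arguments listed above."""
    import argilla_sdk as rg  # deferred import: only needed to build the field

    return rg.VectorField(
        name="sentence_embedding",
        title="Sentence embedding",
        dimensions=VECTOR_DIMENSIONS,
    )
```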
Guidelines¶
Once you have decided on the data to show and the questions to ask, it's important to provide clear guidelines to the annotators. These guidelines help them understand the task and answer the questions consistently. You can provide guidelines in two ways:
- In the dataset guidelines: this is added as an argument when you create your dataset in the Python SDK. It will appear in the dataset settings in the UI.
- As question descriptions: these are added as an argument when you create questions in the Python SDK. This text will appear in a tooltip next to the question in the UI.
It is good practice to use at least the dataset guidelines if not both methods. Question descriptions should be short and provide context to a specific question. They can be a summary of the guidelines to that question, but often that is not sufficient to align the whole annotation team. In the guidelines, you can include a description of the project, details on how to answer each question with examples, instructions on when to discard a record, etc.
Tip
If you want further guidance on good practices for guidelines during the project development, check our blog post.
List datasets¶
You can list all the datasets available in a workspace using the datasets attribute of the Workspace class. You can also use len(workspace.datasets) to get the number of datasets in a workspace.
import argilla_sdk as rg
client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")
workspace = client.workspaces("my_workspace")
datasets = workspace.datasets
for dataset in datasets:
    print(dataset)
Retrieve a dataset¶
You can retrieve a dataset by calling the datasets method on the Argilla class and passing the name of the dataset as an argument. By default, this method attempts to retrieve the dataset from the first workspace. If the dataset is in a different workspace, you must specify either the workspace name or id as an argument.
import argilla_sdk as rg
client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")
workspace = client.workspaces("my_workspace")
# Retrieve the dataset from the first workspace
retrieved_dataset = client.datasets(name="my_dataset")
# Retrieve the dataset from the specified workspace
retrieved_dataset = client.datasets(name="my_dataset", workspace=workspace)
Check dataset existence¶
You can check if a dataset exists by calling the exists method on the Dataset class. This method returns a boolean value.
import argilla_sdk as rg
client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")
dataset = client.datasets(name="my_dataset")
dataset_existed = dataset.exists()
Update a dataset¶
You can update a dataset by assigning the new settings to its settings attribute and then calling the update method on the Dataset class.
Note
Keep in mind that once your dataset is published, only the guidelines can be updated.
import argilla_sdk as rg
client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")
dataset_to_update = client.datasets(name="my_dataset")
settings_to_update = rg.Settings(
    guidelines="These are some updated guidelines.",
    fields=[
        rg.TextField(
            name="text",
        ),
    ],
    questions=[
        rg.LabelQuestion(
            name="label",
            labels=["label_4", "label_5", "label_6"]
        ),
    ],
)
dataset_to_update.settings = settings_to_update
dataset_updated = dataset_to_update.update()
Delete a dataset¶
You can delete an existing dataset by calling the delete method on the Dataset class.
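A minimal sketch of a guarded deletion follows. It combines the retrieval and existence check described in the earlier sections with the delete method, and assumes that client is an rg.Argilla instance. The helper name delete_dataset is hypothetical.

```python
def delete_dataset(client, name):
    """Delete the dataset called `name` if it exists.

    Returns True if the dataset was deleted and False if it was not found.
    """
    dataset = client.datasets(name=name)
    if not dataset.exists():
        return False
    dataset.delete()
    return True


# Usage against a real server (requires the SDK and valid credentials):
# import argilla_sdk as rg
# client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")
# delete_dataset(client, "my_dataset")
```

Checking for existence first avoids calling delete on a dataset that is not on the server. Deleting a dataset is not something this guide describes as reversible, so it is worth double-checking the name before running it.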