Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Session 1: Data Acquisition

For executing the notebook, run the requirements file (see Downloads).

Learning goals

By the end of this notebook you will know how to:

  • Get data from OpenML

  • Scrape data from web sources

  • Build a human-in-the-loop annotation pipeline

  • Aggregate crowdsourcing data

  • Prepare DataSheets for your data set

Public Data APIs

Often data for a given task can be found in public data sources, such as:

Direct downloads are often the fastest first step before scraping.

OpenML

Open Machine Learning Initiative

OpenML is an open, collaborative platform for sharing and organizing machine learning research, founded in 2012.

ComponentDescription
DatasetsCurated tabular, image, and text datasets
TasksStandardized ML problems with fixed splits and metrics
FlowsSerialized ML pipelines and algorithm descriptions
RunsLogged results of applying a Flow to a Task

As of 2024, OpenML hosts 4,000+ datasets, 10,000+ tasks, and 1,000,000+ experimental runs.

import openml
import pandas as pd

# OpenML ID for the classic Titanic dataset
dataset = openml.datasets.get_dataset(40945)

# Extract features and target
# categorical_indicator helps identify which columns to treat as categories
X, y, categorical_indicator, attribute_names = dataset.get_data(
    dataset_format="dataframe", 
    target=dataset.default_target_attribute
)

print(f"Dataset: {dataset.name} | Rows: {X.shape[0]}, Cols: {X.shape[1]}")
X.head(3)
Dataset: Titanic | Rows: 1309, Cols: 13
Loading...

Data Scraping

When data is not available via APIs or direct downloads, we can often still get it by scraping it from the web.

Tabular scraping with pandas

Just like in some previous lectures, we start with Wikipedia tables.

Run the next cell, then choose a URL and inspect available tables interactively.

import pandas as pd
import requests
from io import StringIO

# Wikipedia started blocking requests without a user-agent header, so we need to set it to mimic a browser
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}
response = requests.get("https://en.wikipedia.org/wiki/Berlin_population_statistics", headers=headers, timeout=20)
response.raise_for_status()

# Parse HTML text
tables = pd.read_html(StringIO(response.text))
print(f"Found {len(tables)} tables. Showing table 0 ({tables[0].shape[0]} rows, {tables[0].shape[1]} columns):")
tables[0].head(3)
Found 7 tables. Showing table 0 (13 rows, 4 columns):
Loading...

HTML parsing with BeautifulSoup

Now we move to more sophisticated data scraping not just from tables, but from the HTML page structure.

import requests
from bs4 import BeautifulSoup

url = "https://books.toscrape.com"
response = requests.get(url, timeout=20)
response.raise_for_status()

# Build the soup object parsing the HTML
soup = BeautifulSoup(response.text, "html.parser")

print(f"Title: {soup.title.get_text(strip=True)}")
Title: All products | Books to Scrape - Sandbox
# Extract product information
product_pods = soup.select("article.product_pod")

single_product = product_pods[0] # Select first product
print(single_product.prettify())
<article class="product_pod">
 <div class="image_container">
  <a href="catalogue/a-light-in-the-attic_1000/index.html">
   <img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/>
  </a>
 </div>
 <p class="star-rating Three">
  <i class="icon-star">
  </i>
  <i class="icon-star">
  </i>
  <i class="icon-star">
  </i>
  <i class="icon-star">
  </i>
  <i class="icon-star">
  </i>
 </p>
 <h3>
  <a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">
   A Light in the ...
  </a>
 </h3>
 <div class="product_price">
  <p class="price_color">
   £51.77
  </p>
  <p class="instock availability">
   <i class="icon-ok">
   </i>
   In stock
  </p>
  <form>
   <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">
    Add to basket
   </button>
  </form>
 </div>
</article>

<article class="product_pod">
 <div class="image_container">
  <a href="catalogue/a-light-in-the-attic_1000/index.html">
   <img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/>
  </a>
 </div>
 <p class="star-rating Three">
  <i class="icon-star">
  </i>
  <i class="icon-star">
  </i>
  <i class="icon-star">
  </i>
  <i class="icon-star">
  </i>
  <i class="icon-star">
  </i>
 </p>
 <h3>
  <a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">
   A Light in the ...
  </a>
 </h3>
 <div class="product_price">
  <p class="price_color">
   £51.77
  </p>
  <p class="instock availability">
   <i class="icon-ok">
   </i>
   In stock
  </p>
  <form>
   <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">
    Add to basket
   </button>
  </form>
 </div>
</article>

Selector Syntax

Basics:

  • tag -> all elements of that tag
    Example: p

  • .class -> elements with a class
    Example: .price_color

  • #id -> element with a specific id
    Example: #product_description

Combinations:

  • A B -> descendant of A (any depth)
    Example: article p

  • A > B -> direct child of A
    Example: article > p

  • A + B -> immediate next sibling
    Example: h2 + p

  • A ~ B -> any following sibling
    Example: #product_description ~ p

Attribute Matching:

  • [attr="value"] -> has attribute
    Example: a[id="product_description"]

# Extract title information
title_element = single_product.select_one("h3 a")
title_element
<a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>
text_title = title_element.get_text() # Get content within the <a> tag
attr_title = title_element.get("title") # Get value of an attribute

print(f"Title from get_text(): {text_title}")
print(f"Title from title attribute: {attr_title}")
Title from get_text(): A Light in the ...
Title from title attribute: A Light in the Attic
# Get description from subpage
href = title_element.get("href")
response = requests.get(f"{url}/{href}", timeout=20)
response.raise_for_status()

subsoup = BeautifulSoup(response.text, "html.parser")
description = subsoup.select_one("div[id='product_description'] + p").get_text(strip=True)
description
"It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for you. Shel, you never sounded so good. ...more"

Lets bring it together now:

import requests
from bs4 import BeautifulSoup
import pandas as pd
from IPython.display import display

url = "https://books.toscrape.com"
response = requests.get(url, timeout=20)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

def extract_books_single_page(soup: BeautifulSoup, url: str):
    # Retrieve all books
    product_pods = soup.select("article.product_pod")

    # Dictionary in which we fill our data
    scraped_data = {"Title": [], "Price": [], "Availability": [], "Rating": [], "Description": []}

    # Loop over the products to extract
    for product in product_pods:
        # Info directly on main page
        title = product.select_one("h3 a").get("title")
        price = product.select_one("div.product_price p.price_color").get_text()[1:]
        stock = product.select_one("div.product_price p.instock.availability").get_text().strip()
        rating = product.select_one("p").get("class")[1]

        # Info on subpage
        href = product.select_one("h3 a").get("href")
        response = requests.get(url=f"{url}/{href}", timeout=3)
        response.raise_for_status()
        subsoup = BeautifulSoup(response.text, "html.parser")
        description = subsoup.select_one("article.product_page > p").get_text() # Sometime you might need ">" to enforce direct child selection
        
        scraped_data["Title"].append(title)
        scraped_data["Price"].append(price)
        scraped_data["Availability"].append(stock)
        scraped_data["Rating"].append(rating)
        scraped_data["Description"].append(description)
    return scraped_data

scraped_data = pd.DataFrame(extract_books_single_page(soup, url))

display(scraped_data.head())
Loading...
scraped_data.to_csv("../downloads/books.csv", index=False)

Selenium

Some websites use measures, that make it difficult to scrape simply with requests and BeautifulSoup. For example, by:

  • dynamic content loading

  • pagination

  • bot detection

For this, we can use Selenium, to automate a browser that can interact with the page as a human would. Selenium has many features such as:

  • waiting for elements to load

  • clicking buttons and links

  • filling out forms

  • handling cookies and sessions

pip install selenium

Human-in-the-loop annotation

For many tasks, labels are not available and must be created manually. For this we can use a tool like Label Studio (https://labelstud.io/) to set up a simple annotation pipeline. With Label Studio, you can define labeling schemas for various data types (text, images, tables).

Install and run Label Studio

pip install label-studio
label-studio

(runs on port 8080 by default: http://localhost:8080)

Then open the shown URL, create a new project, and import tasks from the next cell output.

Labeling Demonstration

Download the scraped book data from the Download section and import it as tasks in Label Studio. Use the following label configuration:

<View>
  <Header value="Book Record"/>

  <HyperText name="table" value="&lt;div style='font-family: -apple-system, BlinkMacSystemFont, Segoe UI, Roboto, Helvetica Neue, Arial, sans-serif;'&gt;
    &lt;table style='width:100%; border-collapse:collapse; font-size:14px;'&gt;
      &lt;tr&gt;&lt;th style='text-align:left; padding:8px; border:1px solid #ddd; width:160px;'&gt;Field&lt;/th&gt;&lt;th style='text-align:left; padding:8px; border:1px solid #ddd;'&gt;Value&lt;/th&gt;&lt;/tr&gt;
      &lt;tr&gt;&lt;td style='padding:8px; border:1px solid #ddd;'&gt;&lt;strong&gt;Title&lt;/strong&gt;&lt;/td&gt;&lt;td style='padding:8px; border:1px solid #ddd;'&gt;$Title&lt;/td&gt;&lt;/tr&gt;
      &lt;tr&gt;&lt;td style='padding:8px; border:1px solid #ddd;'&gt;&lt;strong&gt;Price&lt;/strong&gt;&lt;/td&gt;&lt;td style='padding:8px; border:1px solid #ddd;'&gt;$Price&lt;/td&gt;&lt;/tr&gt;
      &lt;tr&gt;&lt;td style='padding:8px; border:1px solid #ddd;'&gt;&lt;strong&gt;Availability&lt;/strong&gt;&lt;/td&gt;&lt;td style='padding:8px; border:1px solid #ddd;'&gt;$Availability&lt;/td&gt;&lt;/tr&gt;
      &lt;tr&gt;&lt;td style='padding:8px; border:1px solid #ddd;'&gt;&lt;strong&gt;Rating&lt;/strong&gt;&lt;/td&gt;&lt;td style='padding:8px; border:1px solid #ddd;'&gt;$Rating&lt;/td&gt;&lt;/tr&gt;
      &lt;tr&gt;&lt;td style='padding:8px; border:1px solid #ddd; vertical-align:top;'&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/td&gt;&lt;td style='padding:8px; border:1px solid #ddd;'&gt;$Description&lt;/td&gt;&lt;/tr&gt;
    &lt;/table&gt;
  &lt;/div&gt;"/>

  <Header value="Target Group"/>
  <Choices name="target_group" toName="table" choice="single" showInline="true">
    <Choice value="Kids"/>
    <Choice value="Teens"/>
    <Choice value="Adults"/>
    <Choice value="Unclear"/>
  </Choices>

  <Header value="Compellingness"/>
  <Choices name="compellingness" toName="table" choice="single" showInline="true">
    <Choice value="Compelling"/>
    <Choice value="Not Compelling"/>
  </Choices>
</View>

This is how it looks in Label Studio:

Label Studio annotation interface

Crowdsourcing Data Annotation

Definition: Collecting labeled data from multiple non-expert annotators, typically via platforms like Amazon Mechanical Turk, Prolific, or Label Studio.

AspectTraditionalCrowdsourced
AnnotatorsFew expertsMany non-experts
CostHigh ($50-200/hour)Low ($5-15/hour)
ScaleLimited (100s)Massive (1000s-millions)
QualityHigh per-labelVariable, needs aggregation

Key Insight: Multiple noisy labels can yield better ground truth than a single expert label when properly aggregated.

Challenges and Solutions

Benefits:

  • Rapid scaling for large datasets (ImageNet: 14M images via crowdsourcing)

  • Lower cost per annotation

  • Diverse perspectives reduce single-annotator bias

Challenges:

ChallengeImpactMitigation Strategy
Noisy labelsReduced model accuracyAggregate multiple labels
SpammersWasted budgetQuality control (gold standard questions)
Ambiguous tasksLow agreementClear instructions, examples
Annotator biasSkewed distributionsModel annotator reliability (Dawid-Skene)

Best practice: Collect 3-5 labels per instance and aggregate statistically.

Inter-Annotator Agreement Metrics

Measure consistency across annotators to assess task clarity and label quality.

MetricTypeFormulaInterpretation
Percent Agreement2 annotators
matchestotal\frac{\text{matches}}{\text{total}}
Simple, ignores chance
Cohen’s Kappa2 annotators
κ=pope1pe\kappa = \frac{p_o - p_e}{1 - p_e}
Corrects for chance agreement
Fleiss’ Kappa3+ annotatorsExtension of Cohen’sMulti-annotator, assumes fixed set
Krippendorff’s AlphaAny setupMost generalHandles missing data, any scale

Kappa Scale: < 0 (worse than chance), 0.0-0.2 (slight), 0.2-0.4 (fair), 0.4-0.6 (moderate), 0.6-0.8 (substantial), 0.8-1.0 (near perfect)

# Code Example 1: Computing Inter-Annotator Agreement

import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import fleiss_kappa, aggregate_raters

# Simulated annotations: 50 items, 3 annotators, 4 classes
np.random.seed(42)
n_items, n_annotators, n_classes = 50, 3, 4

# Generate ground truth and noisy labels
true_labels = np.random.randint(0, n_classes, n_items)
annotations = np.array([
    np.where(np.random.random(n_items) > 0.15, true_labels, 
             np.random.randint(0, n_classes, n_items))
    for _ in range(n_annotators)
]).T  # Shape: (50 items, 3 annotators)

# Cohen's Kappa (pairwise)
kappa_01 = cohen_kappa_score(annotations[:, 0], annotations[:, 1])
kappa_02 = cohen_kappa_score(annotations[:, 0], annotations[:, 2])
kappa_12 = cohen_kappa_score(annotations[:, 1], annotations[:, 2])

print("Inter-Annotator Agreement")
print("=" * 45)
print(f"Cohen's Kappa (A0 vs A1): {kappa_01:.3f}")
print(f"Cohen's Kappa (A0 vs A2): {kappa_02:.3f}")
print(f"Cohen's Kappa (A1 vs A2): {kappa_12:.3f}")
print(f"Mean Kappa (pairwise)   : {np.mean([kappa_01, kappa_02, kappa_12]):.3f}")

# Fleiss' Kappa (all annotators)
agg_table, _ = aggregate_raters(annotations)
fleiss = fleiss_kappa(agg_table)
print(f"Fleiss' Kappa (3 raters): {fleiss:.3f}")
Inter-Annotator Agreement
=============================================
Cohen's Kappa (A0 vs A1): 0.677
Cohen's Kappa (A0 vs A2): 0.728
Cohen's Kappa (A1 vs A2): 0.703
Mean Kappa (pairwise)   : 0.703
Fleiss' Kappa (3 raters): 0.702

Label Aggregation: Majority Voting

Simplest aggregation: Choose the label that appears most often.

y^i=argmaxcj=1J1[yij=c]\hat{y}_i = \arg\max_c \sum_{j=1}^{J} \mathbb{1}[y_{ij} = c]

Where yijy_{ij} is the label given by annotator jj for item ii.

ProsCons
Simple, interpretableTreats all annotators equally
Fast, no training neededIgnores annotator reliability
Works well with high agreementFails with systematic bias

When to use: High-quality crowd, clear task, Fleiss’ Kappa > 0.6

# Code Example 2: Majority Voting with Tie-Breaking

from scipy.stats import mode

def majority_vote(annotations):
    """
    annotations: (n_items, n_annotators) array of integer labels
    Returns: (n_items,) array of aggregated labels
    """
    result = mode(annotations, axis=1, keepdims=False)
    return result.mode

# Apply majority voting
mv_labels = majority_vote(annotations)

# Evaluate against ground truth
mv_accuracy = (mv_labels == true_labels).mean()
individual_accuracies = [
    (annotations[:, j] == true_labels).mean() 
    for j in range(n_annotators)
]

print("Majority Voting Results")
print("=" * 45)
for j, acc in enumerate(individual_accuracies):
    print(f"Annotator {j} accuracy     : {acc:.3f}")
print("-" * 45)
print(f"Majority vote accuracy     : {mv_accuracy:.3f}")
print(f"Improvement over best solo : {mv_accuracy - max(individual_accuracies):.3f}")

# Show first 10 items
print("\nFirst 10 items (True | A0 A1 A2 | MV):")
for i in range(10):
    ann_str = ' '.join(map(str, annotations[i]))
    print(f"  {i:2d}: {true_labels[i]} | {ann_str} | {mv_labels[i]}")
Majority Voting Results
=============================================
Annotator 0 accuracy     : 0.840
Annotator 1 accuracy     : 0.920
Annotator 2 accuracy     : 0.860
---------------------------------------------
Majority vote accuracy     : 0.960
Improvement over best solo : 0.040

First 10 items (True | A0 A1 A2 | MV):
   0: 2 | 2 0 2 | 2
   1: 3 | 3 3 3 | 3
   2: 0 | 0 0 0 | 0
   3: 2 | 2 2 2 | 2
   4: 2 | 0 2 0 | 0
   5: 3 | 3 3 3 | 3
   6: 0 | 0 0 1 | 0
   7: 0 | 1 0 0 | 0
   8: 2 | 2 2 2 | 2
   9: 1 | 1 1 1 | 1

Dawid-Skene Model (1979)

Key idea: Model each annotator’s confusion matrix and the true (latent) labels jointly.

Probabilistic formulation:

  • ZiZ_i = true label for item ii (latent)

  • YijY_{ij} = label from annotator jj on item ii (observed)

  • πkl(j)\pi^{(j)}_{kl} = P(annotator jj labels class kk as ll)

Estimation via EM algorithm:

  1. E-step: Compute posterior P(Zi=kY,π)P(Z_i = k | Y, \pi) for each item

  2. M-step: Update annotator confusion matrices π(j)\pi^{(j)}

  3. Repeat until convergence

Benefits: Discovers low-quality annotators automatically, weights them appropriately.

# Code Example 3: Dawid-Skene Implementation (Simplified EM)

def dawid_skene(annotations, n_classes, max_iter=20, tol=1e-4):
    """
    annotations: (n_items, n_annotators)
    Returns: aggregated labels, annotator error rates
    """
    n_items, n_ann = annotations.shape
    
    # Initialize with majority vote
    labels = majority_vote(annotations)
    
    # Confusion matrices: (n_annotators, n_classes, n_classes)
    confusion = np.zeros((n_ann, n_classes, n_classes))
    
    for iteration in range(max_iter):
        old_labels = labels.copy()
        
        # M-step: estimate confusion matrices
        for j in range(n_ann):
            for k in range(n_classes):
                mask = (labels == k)
                if mask.sum() > 0:
                    for l in range(n_classes):
                        confusion[j, k, l] = ((annotations[mask, j] == l).sum() + 1) / (mask.sum() + n_classes)
        
        # E-step: estimate labels from weighted votes
        for i in range(n_items):
            log_probs = np.zeros(n_classes)
            for k in range(n_classes):
                for j in range(n_ann):
                    log_probs[k] += np.log(confusion[j, k, annotations[i, j]] + 1e-10)
            labels[i] = np.argmax(log_probs)
        
        # Check convergence
        if np.all(labels == old_labels):
            break
    
    # Compute error rates (1 - accuracy on diagonal)
    error_rates = [1 - confusion[j].diagonal().mean() for j in range(n_ann)]
    
    return labels, error_rates, confusion
# Code Example 4: Comparing Aggregation Methods

# Run Dawid-Skene
ds_labels, error_rates, confusion_matrices = dawid_skene(annotations, n_classes)
ds_accuracy = (ds_labels == true_labels).mean()

print("Dawid-Skene Results")
print("=" * 45)
for j, err in enumerate(error_rates):
    print(f"Annotator {j} error rate   : {err:.3f}")
print("-" * 45)
print(f"Dawid-Skene accuracy       : {ds_accuracy:.3f}")
print(f"Majority vote accuracy     : {mv_accuracy:.3f}")
print(f"Best individual accuracy   : {max(individual_accuracies):.3f}")
print("-" * 45)
print(f"DS improvement over MV     : {ds_accuracy - mv_accuracy:.3f}")

# Show confusion matrix for worst annotator
worst_ann = np.argmax(error_rates)
print(f"\nConfusion Matrix for Annotator {worst_ann} (worst):")
print(confusion_matrices[worst_ann].round(2))

# Count disagreements
disagree = (mv_labels != ds_labels).sum()
print(f"\nMV vs DS disagreements: {disagree}/{n_items} items ({disagree/n_items*100:.1f}%)")
Dawid-Skene Results
=============================================
Annotator 0 error rate   : 0.276
Annotator 1 error rate   : 0.271
Annotator 2 error rate   : 0.269
---------------------------------------------
Dawid-Skene accuracy       : 0.960
Majority vote accuracy     : 0.960
Best individual accuracy   : 0.920
---------------------------------------------
DS improvement over MV     : 0.000

Confusion Matrix for Annotator 0 (worst):
[[0.67 0.13 0.13 0.07]
 [0.07 0.73 0.13 0.07]
 [0.07 0.07 0.73 0.13]
 [0.05 0.05 0.14 0.76]]

MV vs DS disagreements: 0/50 items (0.0%)

Best Practices

MethodWhen to UseComplexity
Majority VotingHigh agreement (Kappa > 0.6), trusted crowdO(n)
Dawid-SkeneVariable annotator quality, systematic biasO(n × iter)
MACE, GLADAdversarial annotators, item difficulty variesHigher

Best Practices:

  1. Collect 3-5 labels per item (diminishing returns after 5)

  2. Use qualification tasks to filter low-quality workers

  3. Measure IAA early; if Kappa < 0.4, redesign the task

  4. Start with majority vote, upgrade to Dawid-Skene if annotator quality varies

  5. Report both aggregated accuracy and IAA in papers

References:

  • Dawid & Skene (1979). Maximum likelihood estimation from incomplete data. JRSS-C.

  • Snow et al. (2008). Cheap and Fast—But is it Good? EMNLP.

Datasheets for Datasets

Core Idea: Every dataset should be accompanied by a datasheet documenting its motivation, composition, collection process, and recommended uses.

Analogy: Electronics components come with datasheets — why not datasets?

ProblemSolution
Datasets used without contextDocumented motivation and scope
Unknown biases in training dataTransparent collection process
Models deployed in harmful waysClear intended and out-of-scope uses

Goal: Improve transparency, accountability, and reproducibility in ML.

The Seven Core Sections

Every datasheet should answer questions in seven areas:

SectionKey Questions
1. MotivationWhy was the dataset created? Who funded it?
2. CompositionWhat data does it contain? How many instances?
3. CollectionHow was data acquired? Was consent obtained?
4. PreprocessingWhat cleaning was done? Who labeled the data?
5. UsesWhat tasks is it intended for? What should be avoided?
6. DistributionHow is it shared? Under what license?
7. MaintenanceWho maintains it? Will it be updated?

Why Datasheets Matter

Undocumented datasets have caused measurable harm:

Case StudyHarmMissing Documentation
Facial recognition systemsHigher error rates for darker-skinned womenNo demographic breakdown
Hate speech classifiersOver-flagging African-American EnglishNo dialect information
Medical imaging AIPoor performance on non-Western patientsGeographic composition undisclosed
Recidivism predictionDiscriminatory against Black defendantsHistorical bias not documented

Datasheets shift responsibility from implicit to explicit.

# Code Example 1: Datasheet as a Structured Object

datasheet = {
    "dataset_name": "FacialExpressions-5K",
    "version": "1.0",
    "motivation": {
        "purpose": "Expression recognition for assistive technology",
        "funded_by": "EU Horizon Grant #12345"
    },
    "composition": {
        "instances": 5000,
        "type": "Facial images (224x224 JPEG)",
        "labels": ["happy", "sad", "neutral", "surprised", "angry"],
        "sensitive_data": "Contains biometric data; consent obtained"
    },
    "uses": {
        "intended": ["Affective computing research"],
        "prohibited": ["Surveillance", "Hiring decisions", "Law enforcement"]
    },
    "distribution": {
        "license": "CC BY-NC 4.0",
        "url": "https://example.org/dataset"
    }
}

import json
print(json.dumps(datasheet, indent=2))
{
  "dataset_name": "FacialExpressions-5K",
  "version": "1.0",
  "motivation": {
    "purpose": "Expression recognition for assistive technology",
    "funded_by": "EU Horizon Grant #12345"
  },
  "composition": {
    "instances": 5000,
    "type": "Facial images (224x224 JPEG)",
    "labels": [
      "happy",
      "sad",
      "neutral",
      "surprised",
      "angry"
    ],
    "sensitive_data": "Contains biometric data; consent obtained"
  },
  "uses": {
    "intended": [
      "Affective computing research"
    ],
    "prohibited": [
      "Surveillance",
      "Hiring decisions",
      "Law enforcement"
    ]
  },
  "distribution": {
    "license": "CC BY-NC 4.0",
    "url": "https://example.org/dataset"
  }
}

Datasheets are part of a broader ecosystem of responsible AI documentation:

FrameworkFocusProposed By
Datasheets for DatasetsDataset documentationGebru et al., 2021
Model CardsML model reportingMitchell et al., 2019
Data Nutrition LabelsQuick dataset overviewHolland et al., 2018
FactSheetsAI service transparencyArnold et al., 2019

Together they create a full provenance chain from data collection to deployment.

Best Practices Data Sheets

  1. Complete all seven sections before publishing a dataset

  2. Document demographic breakdowns in the Composition section

  3. Be explicit about prohibited uses, not just intended ones

  4. Report inter-annotator agreement for labeled data

  5. Version datasheets alongside datasets

Reference: Gebru, T. et al. (2021). Datasheets for Datasets. Communications of the ACM, 64(12), 86-92. Gebru et al. (2021)

Summary Session 1

  • Many data sets can be obtained from APIs / platforms

  • Scraping is easy

  • Annotating data remains challenging

  • High quality annotations require aggregations accounting for low quality labels

  • Prepare DataSheets for your data set

Preparation for next session

Next session, we will start working with our cluster interactively, so please prepare by:

References
  1. Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., III, H. D., & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86–92. 10.1145/3458723