Session 1: Data Acquisition

To run this notebook, first install the packages listed in the requirements file (see Downloads).

Learning goals¶

By the end of this notebook you will know how to:

Get data from OpenML
Scrape data from web sources
Build a human-in-the-loop annotation pipeline
Aggregate crowdsourcing data
Prepare datasheets for your dataset

Public Data APIs¶

Often data for a given task can be found in public data sources, such as:

Direct downloads are often the fastest first step before scraping.

OpenML¶

Open Machine Learning Initiative¶

OpenML is an open, collaborative platform for sharing and organizing machine learning research, founded in 2012.

Component	Description
Datasets	Curated tabular, image, and text datasets
Tasks	Standardized ML problems with fixed splits and metrics
Flows	Serialized ML pipelines and algorithm descriptions
Runs	Logged results of applying a Flow to a Task

As of 2024, OpenML hosts 4,000+ datasets, 10,000+ tasks, and 1,000,000+ experimental runs.

import openml
import pandas as pd

# OpenML ID for the classic Titanic dataset
dataset = openml.datasets.get_dataset(40945)

# Extract features and target
# categorical_indicator helps identify which columns to treat as categories
X, y, categorical_indicator, attribute_names = dataset.get_data(
    dataset_format="dataframe", 
    target=dataset.default_target_attribute
)

print(f"Dataset: {dataset.name} | Rows: {X.shape[0]}, Cols: {X.shape[1]}")
X.head(3)

Dataset: Titanic | Rows: 1309, Cols: 13

Data Scraping¶

When data is not available via APIs or direct downloads, we can often still get it by scraping it from the web.

Tabular scraping with pandas¶

Just like in some previous lectures, we start with Wikipedia tables.

Run the next cell, then choose a URL and inspect available tables interactively.

import pandas as pd
import requests
from io import StringIO

# Wikipedia started blocking requests without a user-agent header, so we need to set it to mimic a browser
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}
response = requests.get("https://en.wikipedia.org/wiki/Berlin_population_statistics", headers=headers, timeout=20)
response.raise_for_status()

# Parse HTML text
tables = pd.read_html(StringIO(response.text))
print(f"Found {len(tables)} tables. Showing table 0 ({tables[0].shape[0]} rows, {tables[0].shape[1]} columns):")
tables[0].head(3)

Found 7 tables. Showing table 0 (13 rows, 4 columns):

HTML parsing with BeautifulSoup¶

Next, we scrape data not only from tables but from the HTML page structure itself.

A great playground to test scraping: https://toscrape.com
We will use: https://books.toscrape.com to extract book titles, prices, ratings and descriptions.

import requests
from bs4 import BeautifulSoup

url = "https://books.toscrape.com"
response = requests.get(url, timeout=20)
response.raise_for_status()

# Build the soup object parsing the HTML
soup = BeautifulSoup(response.text, "html.parser")

print(f"Title: {soup.title.get_text(strip=True)}")

Title: All products | Books to Scrape - Sandbox

# Extract product information
product_pods = soup.select("article.product_pod")

single_product = product_pods[0] # Select first product
print(single_product.prettify())

<article class="product_pod">
 <div class="image_container">
  <a href="catalogue/a-light-in-the-attic_1000/index.html">
   <img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/>
  </a>
 </div>
 <p class="star-rating Three">
  <i class="icon-star">
  </i>
  <i class="icon-star">
  </i>
  <i class="icon-star">
  </i>
  <i class="icon-star">
  </i>
  <i class="icon-star">
  </i>
 </p>
 <h3>
  <a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">
   A Light in the ...
  </a>
 </h3>
 <div class="product_price">
  <p class="price_color">
   Â£51.77
  </p>
  <p class="instock availability">
   <i class="icon-ok">
   </i>
   In stock
  </p>
  <form>
   <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">
    Add to basket
   </button>
  </form>
 </div>
</article>

Selector Syntax¶

Basics:

tag -> all elements of that tag
Example: p
.class -> elements with a class
Example: .price_color
#id -> element with a specific id
Example: #product_description

Combinations:

A B -> descendant of A (any depth)
Example: article p
A > B -> direct child of A
Example: article > p
A + B -> immediate next sibling
Example: h2 + p
A ~ B -> any following sibling
Example: #product_description ~ p

Attribute Matching:

[attr="value"] -> elements whose attribute attr equals "value"
Example: div[id="product_description"]
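To try the selectors without touching the network, the following sketch runs them against a small hand-written HTML snippet. The snippet, its contents, and the helper `texts` are made up for illustration and only loosely mimic a books.toscrape.com product page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet (hand-written, not fetched from any site)
html = """
<article class="product_page">
  <h2>Sample Book</h2>
  <p class="price_color">£10.00</p>
  <div id="product_description"><h2>Product Description</h2></div>
  <p>A short description.</p>
  <p>A second paragraph.</p>
  <div class="extra"><p>Nested paragraph.</p></div>
</article>
"""
soup = BeautifulSoup(html, "html.parser")


def texts(selector):
    """Return the stripped text of every element matching the CSS selector."""
    return [el.get_text(strip=True) for el in soup.select(selector)]


descendants = texts("article p")                 # every <p> at any depth
children = texts("article > p")                  # only <p> directly under <article>
next_sibling = texts("#product_description + p") # the <p> right after the div
following = texts("#product_description ~ p")    # all later sibling <p> elements
by_attr = texts('div[id="product_description"]') # attribute equals value

print(len(descendants), len(children), next_sibling, following, by_attr)
```

Note how `article p` also picks up the nested paragraph, while `article > p` does not.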

# Extract title information
title_element = single_product.select_one("h3 a")
title_element

<a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>

text_title = title_element.get_text() # Get content within the <a> tag
attr_title = title_element.get("title") # Get value of an attribute

print(f"Title from get_text(): {text_title}")
print(f"Title from title attribute: {attr_title}")

Title from get_text(): A Light in the ...
Title from title attribute: A Light in the Attic

# Get description from subpage
href = title_element.get("href")
response = requests.get(f"{url}/{href}", timeout=20)
response.raise_for_status()

subsoup = BeautifulSoup(response.text, "html.parser")
description = subsoup.select_one("div[id='product_description'] + p").get_text(strip=True)
description

"It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for you. Shel, you never sounded so good. ...more"

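The subpage URL above is built with an f-string, which works because the link is relative to the site root. As a hedged alternative, `urllib.parse.urljoin` from the standard library resolves a relative link against the page it was found on. The catalogue page URL and the `../media/cover.jpg` path below are illustrative assumptions, not values taken from the scraped output.

```python
from urllib.parse import urljoin

base = "https://books.toscrape.com"
href = "catalogue/a-light-in-the-attic_1000/index.html"

# Link found on the start page: resolved against the site root
from_root = urljoin(base, href)

# Hypothetical: the same kind of link found on a catalogue page
page = "https://books.toscrape.com/catalogue/page-2.html"
from_catalogue = urljoin(page, "a-light-in-the-attic_1000/index.html")

# "../" steps one directory up from /catalogue/
image = urljoin(page, "../media/cover.jpg")

print(from_root)
print(from_catalogue)
print(image)
```

Both book links resolve to the same absolute URL, which is what makes `urljoin` safer than string concatenation once you follow pagination.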
Let's bring it all together:

import requests
from bs4 import BeautifulSoup
import pandas as pd
from IPython.display import display

url = "https://books.toscrape.com"
response = requests.get(url, timeout=20)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

def extract_books_single_page(soup: BeautifulSoup, url: str):
    # Retrieve all books
    product_pods = soup.select("article.product_pod")

    # Dictionary in which we fill our data
    scraped_data = {"Title": [], "Price": [], "Availability": [], "Rating": [], "Description": []}

    # Loop over the products to extract
    for product in product_pods:
        # Info directly on main page
        title = product.select_one("h3 a").get("title")
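        # Strip the currency symbol; lstrip also drops the stray "Â" that shows up
        # when the UTF-8 page is decoded with the wrong encoding (see output above)
        price = product.select_one("div.product_price p.price_color").get_text(strip=True).lstrip("Â£")
        stock = product.select_one("div.product_price p.instock.availability").get_text().strip()
        # The rating is the second CSS class, e.g. class="star-rating Three"
        rating = product.select_one("p.star-rating").get("class")[1]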

        # Info on subpage
        href = product.select_one("h3 a").get("href")
        response = requests.get(url=f"{url}/{href}", timeout=3)
        response.raise_for_status()
        subsoup = BeautifulSoup(response.text, "html.parser")
        description = subsoup.select_one("article.product_page > p").get_text() # Sometimes you need ">" to enforce direct child selection
        
        scraped_data["Title"].append(title)
        scraped_data["Price"].append(price)
        scraped_data["Availability"].append(stock)
        scraped_data["Rating"].append(rating)
        scraped_data["Description"].append(description)
    return scraped_data

scraped_data = pd.DataFrame(extract_books_single_page(soup, url))

display(scraped_data.head())

scraped_data.to_csv("../downloads/books.csv", index=False)
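The scraped table stores the price as text and the rating as a word such as "Three". A possible follow-up cleaning step, sketched under the assumption that ratings are always one of the five English number words, could look like this. The names `RATING_WORDS`, `clean_price`, and `clean_rating` are hypothetical helpers, not part of the notebook.

```python
# Hypothetical mapping from the rating word in the CSS class to a number
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def clean_price(raw):
    """Keep only digits and the decimal point, then convert to float."""
    return float("".join(ch for ch in raw if ch.isdigit() or ch == "."))


def clean_rating(word):
    """Map a rating word to an integer; unknown words become None."""
    return RATING_WORDS.get(word)


print(clean_price("£51.77"), clean_price("Â£51.77"), clean_rating("Three"))
```

Filtering characters instead of slicing off a fixed prefix keeps the conversion working whether or not the currency symbol was decoded correctly.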

Selenium¶

Some websites use techniques that make scraping with requests and BeautifulSoup alone difficult, for example:

dynamic content loading
pagination
bot detection

In these cases we can use Selenium to automate a browser that interacts with the page as a human would. Selenium offers features such as:

waiting for elements to load
clicking buttons and links
filling out forms
handling cookies and sessions

pip install selenium

Human-in-the-loop annotation¶

For many tasks, labels are not available and must be created manually. For this we can use a tool like Label Studio (https://labelstud.io/) to set up a simple annotation pipeline. With Label Studio, you can define labeling schemas for various data types (text, images, tables).

Install and run Label Studio¶

pip install label-studio
label-studio

(runs on port 8080 by default: http://localhost:8080)

Then open the shown URL, create a new project, and import tasks from the next cell output.

Labeling Demonstration¶

Download the scraped book data from the Download section and import it as tasks in Label Studio. Use the following label configuration:

<View>
  <Header value="Book Record"/>

  <HyperText name="table" value="&lt;div style='font-family: -apple-system, BlinkMacSystemFont, Segoe UI, Roboto, Helvetica Neue, Arial, sans-serif;'&gt;
    &lt;table style='width:100%; border-collapse:collapse; font-size:14px;'&gt;
      &lt;tr&gt;&lt;th style='text-align:left; padding:8px; border:1px solid #ddd; width:160px;'&gt;Field&lt;/th&gt;&lt;th style='text-align:left; padding:8px; border:1px solid #ddd;'&gt;Value&lt;/th&gt;&lt;/tr&gt;
      &lt;tr&gt;&lt;td style='padding:8px; border:1px solid #ddd;'&gt;&lt;strong&gt;Title&lt;/strong&gt;&lt;/td&gt;&lt;td style='padding:8px; border:1px solid #ddd;'&gt;$Title&lt;/td&gt;&lt;/tr&gt;
      &lt;tr&gt;&lt;td style='padding:8px; border:1px solid #ddd;'&gt;&lt;strong&gt;Price&lt;/strong&gt;&lt;/td&gt;&lt;td style='padding:8px; border:1px solid #ddd;'&gt;$Price&lt;/td&gt;&lt;/tr&gt;
      &lt;tr&gt;&lt;td style='padding:8px; border:1px solid #ddd;'&gt;&lt;strong&gt;Availability&lt;/strong&gt;&lt;/td&gt;&lt;td style='padding:8px; border:1px solid #ddd;'&gt;$Availability&lt;/td&gt;&lt;/tr&gt;
      &lt;tr&gt;&lt;td style='padding:8px; border:1px solid #ddd;'&gt;&lt;strong&gt;Rating&lt;/strong&gt;&lt;/td&gt;&lt;td style='padding:8px; border:1px solid #ddd;'&gt;$Rating&lt;/td&gt;&lt;/tr&gt;
      &lt;tr&gt;&lt;td style='padding:8px; border:1px solid #ddd; vertical-align:top;'&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/td&gt;&lt;td style='padding:8px; border:1px solid #ddd;'&gt;$Description&lt;/td&gt;&lt;/tr&gt;
    &lt;/table&gt;
  &lt;/div&gt;"/>

  <Header value="Target Group"/>
  <Choices name="target_group" toName="table" choice="single" showInline="true">
    <Choice value="Kids"/>
    <Choice value="Teens"/>
    <Choice value="Adults"/>
    <Choice value="Unclear"/>
  </Choices>

  <Header value="Compellingness"/>
  <Choices name="compellingness" toName="table" choice="single" showInline="true">
    <Choice value="Compelling"/>
    <Choice value="Not Compelling"/>
  </Choices>
</View>

This is how it looks in Label Studio:

Crowdsourcing Data Annotation¶

Definition: Collecting labeled data from multiple non-expert annotators, typically via platforms like Amazon Mechanical Turk, Prolific, or Label Studio.

Aspect	Traditional	Crowdsourced
Annotators	Few experts	Many non-experts
Cost	High ($50-200/hour)	Low ($5-15/hour)
Scale	Limited (100s)	Massive (1000s-millions)
Quality	High per-label	Variable, needs aggregation

Key Insight: Multiple noisy labels can yield better ground truth than a single expert label when properly aggregated.

Challenges and Solutions¶

Benefits:

Rapid scaling for large datasets (ImageNet: 14M images via crowdsourcing)
Lower cost per annotation
Diverse perspectives reduce single-annotator bias

Challenges:

Challenge	Impact	Mitigation Strategy
Noisy labels	Reduced model accuracy	Aggregate multiple labels
Spammers	Wasted budget	Quality control (gold standard questions)
Ambiguous tasks	Low agreement	Clear instructions, examples
Annotator bias	Skewed distributions	Model annotator reliability (Dawid-Skene)

Best practice: Collect 3-5 labels per instance and aggregate statistically.
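The gold standard mitigation from the table can be sketched as follows: score each worker on items whose answer is already known and keep only workers above a threshold. The worker IDs, the answers, and the 0.7 threshold are made-up assumptions for illustration.

```python
def gold_accuracy(worker_answers, gold):
    """Fraction of gold items the worker answered correctly (None if none seen)."""
    seen = [item for item in gold if item in worker_answers]
    if not seen:
        return None
    return sum(worker_answers[i] == gold[i] for i in seen) / len(seen)


# Hypothetical gold questions with known answers
gold = {"q1": "Kids", "q2": "Adults", "q3": "Teens", "q4": "Adults"}

# Hypothetical worker answers; q5 is a regular (non-gold) item
answers = {
    "w1": {"q1": "Kids", "q2": "Adults", "q3": "Teens", "q4": "Adults", "q5": "Kids"},
    "w2": {"q1": "Kids", "q2": "Kids", "q3": "Kids", "q4": "Kids", "q5": "Kids"},
    "w3": {"q1": "Kids", "q2": "Adults", "q4": "Teens"},
}

threshold = 0.7  # assumption: tune per task
scores = {w: gold_accuracy(a, gold) for w, a in answers.items()}
trusted = [w for w, s in scores.items() if s is not None and s >= threshold]

print(scores)
print(trusted)
```

Here `w2` always answers "Kids" and is filtered out as a likely spammer, while `w3` falls just below the threshold with two of three gold items correct.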

Inter-Annotator Agreement Metrics¶

Measure consistency across annotators to assess task clarity and label quality.

Metric	Type	Formula	Interpretation
Percent Agreement	2 annotators	$\frac{\text{matches}}{\text{total}}$	Simple, ignores chance
Cohen’s Kappa	2 annotators	$\kappa = \frac{p_o - p_e}{1 - p_e}$	Corrects for chance agreement ($p_o$: observed agreement, $p_e$: agreement expected by chance)
Fleiss’ Kappa	3+ annotators	Extension of Cohen’s	Multi-annotator, assumes a fixed number of ratings per item
Krippendorff’s Alpha	Any setup	Most general	Handles missing data, any scale

Kappa Scale: < 0 (worse than chance), 0.0-0.2 (slight), 0.2-0.4 (fair), 0.4-0.6 (moderate), 0.6-0.8 (substantial), 0.8-1.0 (near perfect)
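To make the Cohen's kappa formula concrete, here is a small hand-rolled sketch that computes the observed agreement $p_o$, the chance agreement $p_e$, and $\kappa$ for two annotators. The ten labels are invented for illustration, and the function is a teaching aid rather than a replacement for a library implementation.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Return (p_o, p_e, kappa) for two equally long label sequences."""
    n = len(a)
    # Observed agreement: share of items with identical labels
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / n**2
    return p_o, p_e, (p_o - p_e) / (1 - p_e)


# Hypothetical labels from two annotators on ten items
ann_a = ["Kids", "Kids", "Adults", "Adults", "Teens", "Kids", "Adults", "Teens", "Kids", "Adults"]
ann_b = ["Kids", "Teens", "Adults", "Adults", "Teens", "Kids", "Kids", "Teens", "Kids", "Adults"]

p_o, p_e, kappa = cohen_kappa(ann_a, ann_b)
print(f"p_o={p_o:.2f}, p_e={p_e:.2f}, kappa={kappa:.2f}")
```

The annotators agree on 8 of 10 items ($p_o = 0.80$), but 0.34 of that agreement is expected by chance, so $\kappa \approx 0.70$, which the scale above rates as substantial.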

# Code Example 1: Computing Inter-Annotator Agreement

import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import fleiss_kappa, aggregate_raters

# Simulated annotations: 50 items, 3 annotators, 4 classes
np.random.seed(42)
n_items, n_annotators, n_classes = 50, 3, 4

# Generate ground truth and noisy labels
true_labels = np.random.randint(0, n_classes, n_items)
annotations = np.array([
    np.where(np.random.random(n_items) > 0.15, true_labels, 
             np.random.randint(0, n_classes, n_items))
    for _ in range(n_annotators)
]).T  # Shape: (50 items, 3 annotators)

# Cohen's Kappa (pairwise)
kappa_01 = cohen_kappa_score(annotations[:, 0], annotations[:, 1])
kappa_02 = cohen_kappa_score(annotations[:, 0], annotations[:, 2])
kappa_12 = cohen_kappa_score(annotations[:, 1], annotations[:, 2])

print("Inter-Annotator Agreement")
print("=" * 45)
print(f"Cohen's Kappa (A0 vs A1): {kappa_01:.3f}")
print(f"Cohen's Kappa (A0 vs A2): {kappa_02:.3f}")
print(f"Cohen's Kappa (A1 vs A2): {kappa_12:.3f}")
print(f"Mean Kappa (pairwise)   : {np.mean([kappa_01, kappa_02, kappa_12]):.3f}")

# Fleiss' Kappa (all annotators)
agg_table, _ = aggregate_raters(annotations)
fleiss = fleiss_kappa(agg_table)
print(f"Fleiss' Kappa (3 raters): {fleiss:.3f}")

Inter-Annotator Agreement
=============================================
Cohen's Kappa (A0 vs A1): 0.677
Cohen's Kappa (A0 vs A2): 0.728
Cohen's Kappa (A1 vs A2): 0.703
Mean Kappa (pairwise)   : 0.703
Fleiss' Kappa (3 raters): 0.702

Label Aggregation: Majority Voting¶

Simplest aggregation: Choose the label that appears most often.

$\hat{y}_i = \arg\max_c \sum_{j=1}^{J} \mathbb{1}[y_{ij} = c]$

Where $y_{ij}$ is the label given by annotator $j$ for item $i$, $J$ is the number of annotators, and $c$ ranges over the classes.

Pros	Cons
Simple, interpretable	Treats all annotators equally
Fast, no training needed	Ignores annotator reliability
Works well with high agreement	Fails with systematic bias

When to use: High-quality crowd, clear task, Fleiss’ Kappa > 0.6

# Code Example 2: Majority Voting with Tie-Breaking

from scipy.stats import mode

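def majority_vote(annotations):
    """
    annotations: (n_items, n_annotators) array of integer labels
    Returns: (n_items,) array of aggregated labels

    Ties are not handled explicitly: scipy.stats.mode returns a single one of
    the tied labels (in practice the smallest).
    """
    result = mode(annotations, axis=1, keepdims=False)
    return result.mode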

# Apply majority voting
mv_labels = majority_vote(annotations)

# Evaluate against ground truth
mv_accuracy = (mv_labels == true_labels).mean()
individual_accuracies = [
    (annotations[:, j] == true_labels).mean() 
    for j in range(n_annotators)
]

print("Majority Voting Results")
print("=" * 45)
for j, acc in enumerate(individual_accuracies):
    print(f"Annotator {j} accuracy     : {acc:.3f}")
print("-" * 45)
print(f"Majority vote accuracy     : {mv_accuracy:.3f}")
print(f"Improvement over best solo : {mv_accuracy - max(individual_accuracies):.3f}")

# Show first 10 items
print("\nFirst 10 items (True | A0 A1 A2 | MV):")
for i in range(10):
    ann_str = ' '.join(map(str, annotations[i]))
    print(f"  {i:2d}: {true_labels[i]} | {ann_str} | {mv_labels[i]}")

Majority Voting Results
=============================================
Annotator 0 accuracy     : 0.840
Annotator 1 accuracy     : 0.920
Annotator 2 accuracy     : 0.860
---------------------------------------------
Majority vote accuracy     : 0.960
Improvement over best solo : 0.040

First 10 items (True | A0 A1 A2 | MV):
   0: 2 | 2 0 2 | 2
   1: 3 | 3 3 3 | 3
   2: 0 | 0 0 0 | 0
   3: 2 | 2 2 2 | 2
   4: 2 | 0 2 0 | 0
   5: 3 | 3 3 3 | 3
   6: 0 | 0 0 1 | 0
   7: 0 | 1 0 0 | 0
   8: 2 | 2 2 2 | 2
   9: 1 | 1 1 1 | 1
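With three annotators and four classes, an item can receive three different labels, in which case no label has a majority. A minimal sketch of explicit tie handling, assuming we would rather flag such items for review than pick a label silently, could look like this. The function name `majority_vote_with_ties` and the `tie_value` parameter are hypothetical.

```python
from collections import Counter


def majority_vote_with_ties(row, tie_value=None):
    """Return (label, tied_labels) for one item's annotations.

    If several labels share the top count, return tie_value and the
    sorted list of tied labels so the item can be reviewed.
    """
    counts = Counter(row).most_common()
    top_label, top_count = counts[0]
    tied = [label for label, c in counts if c == top_count]
    if len(tied) > 1:
        return tie_value, sorted(tied)
    return top_label, []


print(majority_vote_with_ties([0, 2, 0]))   # clear majority
print(majority_vote_with_ties([1, 2, 3]))   # three-way tie
```

Flagged items can be sent to an additional annotator or resolved by an expert instead of being assigned an arbitrary label.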

Dawid-Skene Model (1979)¶

Key idea: Model each annotator’s confusion matrix and the true (latent) labels jointly.

Probabilistic formulation:

$Z_i$ = true label for item $i$ (latent)
$Y_{ij}$ = label from annotator $j$ on item $i$ (observed)
$\pi^{(j)}_{kl}$ = P(annotator $j$ assigns label $l$ | true class is $k$)

Estimation via EM algorithm:

E-step: Compute posterior $P(Z_i = k | Y, \pi)$ for each item
M-step: Update annotator confusion matrices $\pi^{(j)}$
Repeat until convergence

Benefits: Discovers low-quality annotators automatically, weights them appropriately.
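The E-step can be illustrated for a single item: the posterior over the true class is proportional to a class prior times the product of each annotator's confusion-matrix entry for the label they gave. The sketch below assumes a class prior (part of the standard model, though not listed in the formulation above) and uses invented two-class confusion matrices.

```python
def posterior(prior, confusions, labels):
    """P(Z_i = k | labels), proportional to prior[k] * prod_j confusions[j][k][labels[j]]."""
    unnorm = []
    for k, p_k in enumerate(prior):
        likelihood = p_k
        for conf, y in zip(confusions, labels):
            likelihood *= conf[k][y]
        unnorm.append(likelihood)
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical two-class setup: rows = true class, columns = given label
prior = [0.5, 0.5]
reliable = [[0.9, 0.1], [0.1, 0.9]]
noisy = [[0.6, 0.4], [0.4, 0.6]]
confusions = [reliable, noisy, noisy]

# The reliable annotator says 0, both noisy annotators say 1
labels = [0, 1, 1]
post = posterior(prior, confusions, labels)
print([round(p, 3) for p in post])
```

Majority voting would pick class 1 here, whereas the posterior favours class 0 with probability 0.8 because the single reliable annotator carries more weight than the two noisy ones.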

# Code Example 3: Dawid-Skene Implementation (Simplified EM)

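def dawid_skene(annotations, n_classes, max_iter=20):
    """
    Simplified Dawid-Skene with hard label assignments ("hard EM").

    annotations: (n_items, n_annotators) array of integer labels
    Returns: (aggregated labels, per-annotator error rates, confusion matrices)
    """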
    n_items, n_ann = annotations.shape
    
    # Initialize with majority vote
    labels = majority_vote(annotations)
    
    # Confusion matrices: (n_annotators, n_classes, n_classes)
    confusion = np.zeros((n_ann, n_classes, n_classes))
    
    for iteration in range(max_iter):
        old_labels = labels.copy()
        
        # M-step: estimate confusion matrices
        for j in range(n_ann):
            for k in range(n_classes):
                mask = (labels == k)
                if mask.sum() > 0:
                    for l in range(n_classes):
                        confusion[j, k, l] = ((annotations[mask, j] == l).sum() + 1) / (mask.sum() + n_classes)
        
        # E-step: estimate labels from weighted votes
        for i in range(n_items):
            log_probs = np.zeros(n_classes)
            for k in range(n_classes):
                for j in range(n_ann):
                    log_probs[k] += np.log(confusion[j, k, annotations[i, j]] + 1e-10)
            labels[i] = np.argmax(log_probs)
        
        # Check convergence
        if np.all(labels == old_labels):
            break
    
    # Compute error rates (1 - accuracy on diagonal)
    error_rates = [1 - confusion[j].diagonal().mean() for j in range(n_ann)]
    
    return labels, error_rates, confusion

# Code Example 4: Comparing Aggregation Methods

# Run Dawid-Skene
ds_labels, error_rates, confusion_matrices = dawid_skene(annotations, n_classes)
ds_accuracy = (ds_labels == true_labels).mean()

print("Dawid-Skene Results")
print("=" * 45)
for j, err in enumerate(error_rates):
    print(f"Annotator {j} error rate   : {err:.3f}")
print("-" * 45)
print(f"Dawid-Skene accuracy       : {ds_accuracy:.3f}")
print(f"Majority vote accuracy     : {mv_accuracy:.3f}")
print(f"Best individual accuracy   : {max(individual_accuracies):.3f}")
print("-" * 45)
print(f"DS improvement over MV     : {ds_accuracy - mv_accuracy:.3f}")

# Show confusion matrix for worst annotator
worst_ann = np.argmax(error_rates)
print(f"\nConfusion Matrix for Annotator {worst_ann} (worst):")
print(confusion_matrices[worst_ann].round(2))

# Count disagreements
disagree = (mv_labels != ds_labels).sum()
print(f"\nMV vs DS disagreements: {disagree}/{n_items} items ({disagree/n_items*100:.1f}%)")

Dawid-Skene Results
=============================================
Annotator 0 error rate   : 0.276
Annotator 1 error rate   : 0.271
Annotator 2 error rate   : 0.269
---------------------------------------------
Dawid-Skene accuracy       : 0.960
Majority vote accuracy     : 0.960
Best individual accuracy   : 0.920
---------------------------------------------
DS improvement over MV     : 0.000

Confusion Matrix for Annotator 0 (worst):
[[0.67 0.13 0.13 0.07]
 [0.07 0.73 0.13 0.07]
 [0.07 0.07 0.73 0.13]
 [0.05 0.05 0.14 0.76]]

MV vs DS disagreements: 0/50 items (0.0%)

Best Practices¶

Method	When to Use	Complexity
Majority Voting	High agreement (Kappa > 0.6), trusted crowd	O(n)
Dawid-Skene	Variable annotator quality, systematic bias	O(n × iter)
MACE, GLAD	Adversarial annotators, item difficulty varies	Higher

Best Practices:

Collect 3-5 labels per item (diminishing returns after 5)
Use qualification tasks to filter low-quality workers
Measure IAA early; if Kappa < 0.4, redesign the task
Start with majority vote, upgrade to Dawid-Skene if annotator quality varies
Report both aggregated accuracy and IAA in papers

References:

Dawid & Skene (1979). Maximum likelihood estimation of observer error-rates using the EM algorithm. JRSS-C.
Snow et al. (2008). Cheap and Fast—But is it Good? EMNLP.

Datasheets for Datasets¶

Core Idea: Every dataset should be accompanied by a datasheet documenting its motivation, composition, collection process, and recommended uses.

Analogy: Electronics components come with datasheets — why not datasets?

Problem	Solution
Datasets used without context	Documented motivation and scope
Unknown biases in training data	Transparent collection process
Models deployed in harmful ways	Clear intended and out-of-scope uses

Goal: Improve transparency, accountability, and reproducibility in ML.

The Seven Core Sections¶

Every datasheet should answer questions in seven areas:

Section	Key Questions
1. Motivation	Why was the dataset created? Who funded it?
2. Composition	What data does it contain? How many instances?
3. Collection	How was data acquired? Was consent obtained?
4. Preprocessing	What cleaning was done? Who labeled the data?
5. Uses	What tasks is it intended for? What should be avoided?
6. Distribution	How is it shared? Under what license?
7. Maintenance	Who maintains it? Will it be updated?
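If a datasheet is kept as a structured object, the seven sections can be checked mechanically before a dataset is published. The following is a minimal sketch; the lowercase key names and the `missing_sections` helper are assumptions chosen for illustration, not a standard schema.

```python
# Assumed key names, one per datasheet section
REQUIRED_SECTIONS = [
    "motivation",
    "composition",
    "collection",
    "preprocessing",
    "uses",
    "distribution",
    "maintenance",
]


def missing_sections(datasheet):
    """Return the required sections that are absent or empty."""
    return [s for s in REQUIRED_SECTIONS if not datasheet.get(s)]


# Hypothetical draft that only covers four of the seven sections
draft = {
    "motivation": {"purpose": "Expression recognition research"},
    "composition": {"instances": 5000},
    "uses": {"intended": ["Affective computing research"]},
    "distribution": {"license": "CC BY-NC 4.0"},
}

print(missing_sections(draft))
```

A check like this can run in a release script so that incomplete datasheets are caught before the data is shared.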

Why Datasheets Matter¶

Undocumented datasets have caused measurable harm:

Case Study	Harm	Missing Documentation
Facial recognition systems	Higher error rates for darker-skinned women	No demographic breakdown
Hate speech classifiers	Over-flagging African-American English	No dialect information
Medical imaging AI	Poor performance on non-Western patients	Geographic composition undisclosed
Recidivism prediction	Discriminatory against Black defendants	Historical bias not documented

Datasheets shift responsibility from implicit to explicit.

# Code Example 1: Datasheet as a Structured Object

datasheet = {
    "dataset_name": "FacialExpressions-5K",
    "version": "1.0",
    "motivation": {
        "purpose": "Expression recognition for assistive technology",
        "funded_by": "EU Horizon Grant #12345"
    },
    "composition": {
        "instances": 5000,
        "type": "Facial images (224x224 JPEG)",
        "labels": ["happy", "sad", "neutral", "surprised", "angry"],
        "sensitive_data": "Contains biometric data; consent obtained"
    },
    "uses": {
        "intended": ["Affective computing research"],
        "prohibited": ["Surveillance", "Hiring decisions", "Law enforcement"]
    },
    "distribution": {
        "license": "CC BY-NC 4.0",
        "url": "https://example.org/dataset"
    }
}

import json
print(json.dumps(datasheet, indent=2))

{
  "dataset_name": "FacialExpressions-5K",
  "version": "1.0",
  "motivation": {
    "purpose": "Expression recognition for assistive technology",
    "funded_by": "EU Horizon Grant #12345"
  },
  "composition": {
    "instances": 5000,
    "type": "Facial images (224x224 JPEG)",
    "labels": [
      "happy",
      "sad",
      "neutral",
      "surprised",
      "angry"
    ],
    "sensitive_data": "Contains biometric data; consent obtained"
  },
  "uses": {
    "intended": [
      "Affective computing research"
    ],
    "prohibited": [
      "Surveillance",
      "Hiring decisions",
      "Law enforcement"
    ]
  },
  "distribution": {
    "license": "CC BY-NC 4.0",
    "url": "https://example.org/dataset"
  }
}

Datasheets are part of a broader ecosystem of responsible AI documentation:

Framework	Focus	Proposed By
Datasheets for Datasets	Dataset documentation	Gebru et al., 2021
Model Cards	ML model reporting	Mitchell et al., 2019
Data Nutrition Labels	Quick dataset overview	Holland et al., 2018
FactSheets	AI service transparency	Arnold et al., 2019

Together they create a full provenance chain from data collection to deployment.

Best Practices for Datasheets¶

Complete all seven sections before publishing a dataset
Document demographic breakdowns in the Composition section
Be explicit about prohibited uses, not just intended ones
Report inter-annotator agreement for labeled data
Version datasheets alongside datasets

Reference: Gebru, T. et al. (2021). Datasheets for Datasets. Communications of the ACM, 64(12), 86-92.

Summary Session 1¶

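Many datasets can be obtained from public APIs and platforms such as OpenML
Scraping static pages takes only a few lines with requests and BeautifulSoup; dynamic pages may need Selenium
Annotating data remains challenging
High-quality annotations require aggregation methods that account for low-quality labels
Prepare datasheets for your dataset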

Preparation for next session¶

Next session, we will start working with our cluster interactively, so please prepare by:

Setting up VPN access with FortiClient to the BHT, see documentation: VPN Zugang einrichten (use a translator if needed)
Working through the Quickstart Guide of our cluster: Quickstart (Cluster Docs)

References¶

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86–92. doi:10.1145/3458723