Why Semantics Matter for R Data Frames

R users love data.frames and tibbles for tidy, rectangular data. But tidy data isn’t always meaningful data. What does a column labelled gdp actually represent? Euros? Millions? Per capita? Current prices? Constant 2010 prices? These questions matter—especially in statistics, open data publishing, and knowledge graph integration.

The dataset_df class extends the familiar data.frame structure with lightweight, semantically meaningful metadata. It’s built for:

dataset_df helps you preserve the meaning of variables, units, identifiers, and dataset-level context.

From Tidy to Meaningful: An Example

Let’s start with a basic data frame and upgrade it to a dataset_df with semantically enriched columns using defined():

small_country_dataset <- dataset_df(
  country_name = defined(c("AD", "LI"),
    label = "Country name",
    concept = "http://data.europa.eu/bna/c_6c2bb82d",
    namespace = "https://www.geonames.org/countries/$1/"
  ),
  gdp = defined(c(3897, 7365),
    label = "Gross Domestic Product",
    unit = "million dollars",
    concept = "http://data.europa.eu/83i/aa/GDP"
  ),
  dataset_bibentry = dublincore(
    title = "Small Country Dataset",
    creator = person("Jane", "Doe"),
    publisher = "Example Inc."
  )
)

The defined() vectors attach metadata to each column:

label: a human-readable name
unit: an explicit measurement unit
concept: a URI identifying the concept measured
namespace: for generating full subject URIs when exporting to RDF

The dataset_df() call also allows bibliographic metadata:

dataset_bibentry: Dublin Core metadata for citation, reuse, and provenance

Why Units Matter

Many statistical errors begin with a silent assumption about units. In Eurostat data, it’s common to see:

EUR: Euros
MIO_EUR: Millions of euros
PPS: Purchasing Power Standards

By making units explicit at the column level, you:

Prevent decimal-scale mistakes (e.g., thousands vs millions)
Avoid joining or averaging incompatible series
Gain confidence in your data exports (CSV, RDF, JSON-LD, etc.)

This is especially important in multi-currency and multi-country datasets such as those published by Eurostat, where harmonization is crucial.

A Final Structure, Ready for Export

The enriched dataset_df object can be serialized to RDF using:

triples <- dataset_to_triples(small_country_dataset)

n_triples(mapply(n_triple, triples$s, triples$p, triples$o))
#> [1] "\"1\"^^<http://www.w3.org/2001/XMLSchema#string> \"rowid\"^^<http://www.w3.org/2001/XMLSchema#string> \"eg:1\"^^<http://www.w3.org/2001/XMLSchema#string> ."     
#> [2] "\"2\"^^<http://www.w3.org/2001/XMLSchema#string> \"rowid\"^^<http://www.w3.org/2001/XMLSchema#string> \"eg:2\"^^<http://www.w3.org/2001/XMLSchema#string> ."     
#> [3] "\"1\"^^<http://www.w3.org/2001/XMLSchema#string> \"country_name\"^^<http://www.w3.org/2001/XMLSchema#string> \"AD\"^^<http://www.w3.org/2001/XMLSchema#string> ."
#> [4] "\"2\"^^<http://www.w3.org/2001/XMLSchema#string> \"country_name\"^^<http://www.w3.org/2001/XMLSchema#string> \"LI\"^^<http://www.w3.org/2001/XMLSchema#string> ."
#> [5] "\"1\"^^<http://www.w3.org/2001/XMLSchema#string> \"gdp\"^^<http://www.w3.org/2001/XMLSchema#string> \"3897\"^^<http://www.w3.org/2001/XMLSchema#string> ."       
#> [6] "\"2\"^^<http://www.w3.org/2001/XMLSchema#string> \"gdp\"^^<http://www.w3.org/2001/XMLSchema#string> \"7365\"^^<http://www.w3.org/2001/XMLSchema#string> ."

This supports export to:

Wikibase via wbdataset
RDF Data Cube via datacube
DataCite or DCAT metadata formats

This vignette represents the final conceptual structure for dataset_df before its rOpenSci submission. Future work will build on this foundation without breaking it.

Summary: Why Use `dataset_df`

Feature	What It Adds
`label`	Human-readable variable name
`unit`	Explicit unit (e.g., `MIO_EUR`)
`concept`	URI identifying what is measured
`subject`	Dataset-level topical classification
`namespace`	Base URI for RDF subject identifiers
`dataset_bibentry`	Bibliographic metadata via Dublin Core

The dataset_df class is designed to remain fully compatible with the tidyverse data workflow, while offering a metadata structure suitable for:

Receiving SDMX-style statistical data into R
Exporting semantically meaningful datasets to DCAT, RDF, or Wikibase
Complying with open science repository requirements (e.g., DataCite, Zenodo)

Start tidy. Stay meaningful. Embrace dataset_df.

Why Semantics Matter for R Data Frames

From Tidy to Meaningful: An Example

Why Units Matter

A Final Structure, Ready for Export

Summary: Why Use dataset_df

Summary: Why Use `dataset_df`