Why Semantics Matter for R Data Frames

library(dataset)

R users love data.frames and tibbles for tidy, rectangular data. But tidy data isn’t always meaningful data. What does a column labelled gdp actually represent? Euros? Millions? Per capita? Current prices? Constant 2010 prices? These questions matter—especially in statistics, open data publishing, and knowledge graph integration.

The dataset_df class extends the familiar data.frame structure with lightweight, semantically meaningful metadata. It helps you preserve the meaning of variables, units, identifiers, and dataset-level context.

From Tidy to Meaningful: An Example

Let’s build a small dataset_df directly, with semantically enriched columns created by defined():

small_country_dataset <- dataset_df(
  country_name = defined(c("AD", "LI"),
    label = "Country name",
    concept = "http://data.europa.eu/bna/c_6c2bb82d",
    namespace = "https://www.geonames.org/countries/$1/"
  ),
  gdp = defined(c(3897, 7365),
    label = "Gross Domestic Product",
    unit = "million dollars",
    concept = "http://data.europa.eu/83i/aa/GDP"
  ),
  dataset_bibentry = dublincore(
    title = "Small Country Dataset",
    creator = person("Jane", "Doe"),
    publisher = "Example Inc."
  )
)

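The defined() vectors attach metadata to each column: a human-readable label and, where applicable, a unit, a concept URI that identifies what is measured, and a namespace that serves as the base URI for identifiers.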
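The column-level metadata can be read back after it has been attached. The sketch below rebuilds the gdp column from the example and assumes the accessor functions var_label(), var_unit() and var_concept() as exported by recent versions of the dataset package; check your installed version if any of them is missing.

```r
library(dataset)

# Rebuild the gdp column from the example above
gdp <- defined(
  c(3897, 7365),
  label = "Gross Domestic Product",
  unit = "million dollars",
  concept = "http://data.europa.eu/83i/aa/GDP"
)

# Assumed accessors: each returns one piece of the attached metadata
var_label(gdp)
var_unit(gdp)
var_concept(gdp)
```

The same calls should work on a column pulled out of a dataset_df, for example small_country_dataset$gdp.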

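The dataset_df() call also accepts bibliographic metadata through the dataset_bibentry argument; here, dublincore() records the title, creator, and publisher.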

Why Units Matter

Many statistical errors begin with a silent assumption about units. In Eurostat data, it’s common to see unit codes such as MIO_EUR, which a bare numeric column cannot carry on its own.

Making units explicit at the column level keeps that information attached to the values it describes. This is especially important in multi-currency and multi-country datasets such as those published by Eurostat, where harmonization is crucial.
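To illustrate why an explicit unit helps, here is a minimal base-R sketch that refuses to add two vectors unless their units agree. It does not use the dataset package: add_same_unit() is a hypothetical helper, the unit is stored as a plain attribute, and the MIO_USD code and the second and third vectors are made up for the illustration.

```r
# Hypothetical helper: add two numeric vectors only if both carry
# the same "unit" attribute; otherwise stop with an error.
add_same_unit <- function(x, y) {
  ux <- attr(x, "unit")
  uy <- attr(y, "unit")
  if (is.null(ux) || is.null(uy) || !identical(ux, uy)) {
    stop("Unit mismatch or missing unit")
  }
  # as.numeric() drops attributes, so the unit is re-attached explicitly
  structure(as.numeric(x) + as.numeric(y), unit = ux)
}

gdp_eur  <- structure(c(3897, 7365), unit = "MIO_EUR")
more_eur <- structure(c(100, 200), unit = "MIO_EUR")   # illustrative values
gdp_usd  <- structure(c(3897, 7365), unit = "MIO_USD") # illustrative unit code

# Same unit: the sum is computed and keeps its unit
total <- add_same_unit(gdp_eur, more_eur)

# Different units: the error is caught and its message returned
mixed <- tryCatch(
  add_same_unit(gdp_eur, gdp_usd),
  error = function(e) conditionMessage(e)
)
```

A check like this turns a silent unit assumption into a visible failure, which is the behaviour column-level units are meant to enable.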

A Final Structure, Ready for Export

The enriched dataset_df object can be serialized to RDF using:

triples <- dataset_to_triples(small_country_dataset)

n_triples(mapply(n_triple, triples$s, triples$p, triples$o))
#> [1] "\"1\"^^<http://www.w3.org/2001/XMLSchema#string> \"rowid\"^^<http://www.w3.org/2001/XMLSchema#string> \"eg:1\"^^<http://www.w3.org/2001/XMLSchema#string> ."     
#> [2] "\"2\"^^<http://www.w3.org/2001/XMLSchema#string> \"rowid\"^^<http://www.w3.org/2001/XMLSchema#string> \"eg:2\"^^<http://www.w3.org/2001/XMLSchema#string> ."     
#> [3] "\"1\"^^<http://www.w3.org/2001/XMLSchema#string> \"country_name\"^^<http://www.w3.org/2001/XMLSchema#string> \"AD\"^^<http://www.w3.org/2001/XMLSchema#string> ."
#> [4] "\"2\"^^<http://www.w3.org/2001/XMLSchema#string> \"country_name\"^^<http://www.w3.org/2001/XMLSchema#string> \"LI\"^^<http://www.w3.org/2001/XMLSchema#string> ."
#> [5] "\"1\"^^<http://www.w3.org/2001/XMLSchema#string> \"gdp\"^^<http://www.w3.org/2001/XMLSchema#string> \"3897\"^^<http://www.w3.org/2001/XMLSchema#string> ."       
#> [6] "\"2\"^^<http://www.w3.org/2001/XMLSchema#string> \"gdp\"^^<http://www.w3.org/2001/XMLSchema#string> \"7365\"^^<http://www.w3.org/2001/XMLSchema#string> ."
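Because the statements are held in a long table with s, p and o columns, they can be queried with ordinary data-frame operations. The sketch below does not call the package; triples_demo is a hand-built stand-in that copies the six subject, predicate and object values printed above.

```r
# Stand-in for the s/p/o table, copied from the printed statements
triples_demo <- data.frame(
  s = c("1", "2", "1", "2", "1", "2"),
  p = c("rowid", "rowid", "country_name", "country_name", "gdp", "gdp"),
  o = c("eg:1", "eg:2", "AD", "LI", "3897", "7365"),
  stringsAsFactors = FALSE
)

# All statements whose predicate is gdp
gdp_triples <- triples_demo[triples_demo$p == "gdp", ]

# Everything stated about the first observation, as a named vector
row1 <- triples_demo[triples_demo$s == "1", c("p", "o")]
row1_values <- setNames(row1$o, row1$p)
```

Note that every object is a character string at this stage, which matches the xsd:string typing in the output above.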

This supports export to RDF serializations such as the N-Triples statements shown above.

This vignette represents the final conceptual structure for dataset_df before its rOpenSci submission. Future work will build on this foundation without breaking it.

Summary: Why Use dataset_df

| Feature | What It Adds |
|---|---|
| label | Human-readable variable name |
| unit | Explicit unit (e.g., MIO_EUR) |
| concept | URI identifying what is measured |
| subject | Dataset-level topical classification |
| namespace | Base URI for RDF subject identifiers |
| dataset_bibentry | Bibliographic metadata via Dublin Core |
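The namespace entry is easier to picture with a concrete expansion. The base-R sketch below assumes that the $1 placeholder in the namespace pattern stands for the cell value; expand_namespace() is a hypothetical helper written for this illustration and is not the package's own implementation.

```r
# Hypothetical helper: substitute each value for the "$1" placeholder
expand_namespace <- function(values, namespace) {
  vapply(
    values,
    function(v) sub("$1", v, namespace, fixed = TRUE),
    character(1),
    USE.NAMES = FALSE
  )
}

# Country codes and namespace pattern taken from the example dataset
country_uris <- expand_namespace(
  c("AD", "LI"),
  "https://www.geonames.org/countries/$1/"
)
```

Under that assumption, each country code resolves to its own URI, which is what lets the values act as identifiers rather than plain strings.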

The dataset_df class is designed to remain fully compatible with the tidyverse data workflow, while offering a metadata structure suited to statistics, open data publishing, and knowledge graph integration.

Start tidy. Stay meaningful. Embrace dataset_df.