R users love data.frames and tibbles for tidy, rectangular data. But tidy data isn’t always meaningful data. What does a column labelled gdp actually represent? Euros? Millions? Per capita? Current prices? Constant 2010 prices? These questions matter, especially in statistics, open data publishing, and knowledge graph integration.
The dataset_df class extends the familiar data.frame structure with lightweight, semantically meaningful metadata. It’s built for:
Tidyverse lovers who want better documentation and safer analysis
Open science workflows that need interoperable metadata
Semantic web users who want to export structured RDF data from R
dataset_df helps you preserve the meaning of variables, units, identifiers, and dataset-level context.
Let’s start with a basic data frame and upgrade it to a dataset_df with semantically enriched columns using defined():
library(dataset)

# Each column is a defined() vector carrying its own metadata.
small_country_dataset <- dataset_df(
  country_name = defined(c("AD", "LI"),
    label = "Country name",
    concept = "http://data.europa.eu/bna/c_6c2bb82d",
    namespace = "https://www.geonames.org/countries/$1/"
  ),
  gdp = defined(c(3897, 7365),
    label = "Gross Domestic Product",
    unit = "million dollars",
    concept = "http://data.europa.eu/83i/aa/GDP"
  ),
  # Dataset-level bibliographic metadata (Dublin Core).
  dataset_bibentry = dublincore(
    title = "Small Country Dataset",
    creator = person("Jane", "Doe"),
    publisher = "Example Inc."
  )
)
The defined() vectors attach metadata to each column:
label: a human-readable name
unit: an explicit measurement unit
concept: a URI identifying the concept measured
namespace: for generating full subject URIs when exporting to RDF
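To illustrate what the namespace template is for, here is a minimal base R sketch. It assumes that the "$1" placeholder is replaced by each cell value to form a full URI; expand_namespace() is a hypothetical helper written for this illustration and is not a function of the dataset package.

```r
# Hypothetical helper, not part of the dataset package:
# replace the literal "$1" placeholder in a namespace template
# with each value to build a full URI.
expand_namespace <- function(values, namespace) {
  vapply(
    values,
    function(v) sub("$1", v, namespace, fixed = TRUE),
    character(1),
    USE.NAMES = FALSE
  )
}

country_uris <- expand_namespace(
  c("AD", "LI"),
  "https://www.geonames.org/countries/$1/"
)
print(country_uris)
```

Under that assumption, the two country codes from the example expand to one URI each, which is what lets an RDF consumer resolve "AD" to a specific resource rather than an ambiguous string.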
The dataset_df() call also allows bibliographic metadata:
dataset_bibentry: Dublin Core metadata for citation, reuse, and provenance
Many statistical errors begin with a silent assumption about units. In Eurostat data, it’s common to see:
EUR: Euros
MIO_EUR: Millions of euros
PPS: Purchasing Power Standards
By making units explicit at the column level, you:
Prevent decimal-scale mistakes (e.g., thousands vs millions)
Avoid joining or averaging incompatible series
Gain confidence in your data exports (CSV, RDF, JSON-LD, etc.)
This is especially important in multi-currency and multi-country datasets such as those published by Eurostat, where harmonization is crucial.
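The idea of refusing to combine incompatible series can be sketched in base R with plain attributes. This is only an illustration of the principle: with_unit() and combine_same_unit() are hypothetical helpers, the numbers are made up, and the dataset package stores units through defined() rather than through this mechanism.

```r
# Hypothetical helpers built on base R attributes; illustrative only.
with_unit <- function(x, unit) {
  attr(x, "unit") <- unit
  x
}

combine_same_unit <- function(a, b) {
  unit_a <- attr(a, "unit")
  unit_b <- attr(b, "unit")
  # Stop early when the two series are measured in different units.
  if (!identical(unit_a, unit_b)) {
    stop("Unit mismatch: ", unit_a, " vs ", unit_b)
  }
  with_unit(c(as.numeric(a), as.numeric(b)), unit_a)
}

# Made-up values: the same magnitudes expressed in two different units.
series_eur     <- with_unit(c(1500e6, 2500e6), "EUR")
series_mio_eur <- with_unit(c(1500, 2500), "MIO_EUR")

# Combining series that share a unit works and keeps the unit.
ok <- combine_same_unit(series_mio_eur, series_mio_eur)

# Combining EUR with MIO_EUR is refused instead of silently mixing scales.
result <- tryCatch(
  combine_same_unit(series_eur, series_mio_eur),
  error = function(e) conditionMessage(e)
)
print(result)
```

Without the unit attribute, both vectors would be ordinary numerics and c() would happily concatenate values that differ by a factor of one million.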
The enriched dataset_df object can be serialized to RDF using:
triples <- dataset_to_triples(small_country_dataset)
n_triples(mapply(n_triple, triples$s, triples$p, triples$o))
#> [1] "\"1\"^^<http://www.w3.org/2001/XMLSchema#string> \"rowid\"^^<http://www.w3.org/2001/XMLSchema#string> \"eg:1\"^^<http://www.w3.org/2001/XMLSchema#string> ."
#> [2] "\"2\"^^<http://www.w3.org/2001/XMLSchema#string> \"rowid\"^^<http://www.w3.org/2001/XMLSchema#string> \"eg:2\"^^<http://www.w3.org/2001/XMLSchema#string> ."
#> [3] "\"1\"^^<http://www.w3.org/2001/XMLSchema#string> \"country_name\"^^<http://www.w3.org/2001/XMLSchema#string> \"AD\"^^<http://www.w3.org/2001/XMLSchema#string> ."
#> [4] "\"2\"^^<http://www.w3.org/2001/XMLSchema#string> \"country_name\"^^<http://www.w3.org/2001/XMLSchema#string> \"LI\"^^<http://www.w3.org/2001/XMLSchema#string> ."
#> [5] "\"1\"^^<http://www.w3.org/2001/XMLSchema#string> \"gdp\"^^<http://www.w3.org/2001/XMLSchema#string> \"3897\"^^<http://www.w3.org/2001/XMLSchema#string> ."
#> [6] "\"2\"^^<http://www.w3.org/2001/XMLSchema#string> \"gdp\"^^<http://www.w3.org/2001/XMLSchema#string> \"7365\"^^<http://www.w3.org/2001/XMLSchema#string> ."
This supports export to:
Wikibase via wbdataset
RDF Data Cube via datacube
DataCite or DCAT metadata formats
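To make the subject-predicate-object layout of the triples above more concrete, the following base R sketch reshapes a plain data frame into long form, one row per cell. It is not the implementation of dataset_to_triples(); to_long_triples() is a hypothetical helper, and it skips the rowid statements and datatype annotations that the real output contains.

```r
# Illustrative only: a plain data frame reshaped into
# subject-predicate-object rows with base R.
df <- data.frame(
  country_name = c("AD", "LI"),
  gdp = c(3897, 7365),
  stringsAsFactors = FALSE
)

to_long_triples <- function(data) {
  # Use the row number as the subject of every statement about that row.
  subjects <- as.character(seq_len(nrow(data)))
  pieces <- lapply(names(data), function(column) {
    data.frame(
      s = subjects,                      # subject: row identifier
      p = column,                        # predicate: column name
      o = as.character(data[[column]]),  # object: cell value as text
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, pieces)
}

triples_sketch <- to_long_triples(df)
print(triples_sketch)
```

Each of the four cells becomes one statement, which is the same column-by-column ordering visible in the serialized N-Triples output.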
This vignette represents the final conceptual structure for dataset_df before its rOpenSci submission. Future work will build on this foundation without breaking it.
dataset_df
Feature | What It Adds |
---|---|
label | Human-readable variable name |
unit | Explicit unit (e.g., MIO_EUR) |
concept | URI identifying what is measured |
subject | Dataset-level topical classification |
namespace | Base URI for RDF subject identifiers |
dataset_bibentry | Bibliographic metadata via Dublin Core |
The dataset_df class is designed to remain fully compatible with the tidyverse data workflow, while offering a metadata structure suitable for:
Receiving SDMX-style statistical data into R
Exporting semantically meaningful datasets to DCAT, RDF, or Wikibase
Complying with open science repository requirements (e.g., DataCite, Zenodo)
Start tidy. Stay meaningful. Embrace dataset_df.