Semantically Enriched Vectors with defined()

library(dataset)

The defined() function in the dataset package allows you to create semantically enriched vectors that retain human-readable metadata — including labels, measurement units, definitions (e.g. URIs), and namespaces.

This vignette demonstrates how to create, manipulate, and interpret defined vectors, and how they integrate seamlessly into data frames and tidy workflows.

Introducting the defined Class

The defined() constructor enriches a vector by attaching additional attributes that convey semantic meaning. It builds upon the foundation of labelled vectors and introduces three further metadata elements:

Let’s inspect the metadata attached to a defined vector representing GDP values:

gdp_1 <- defined(
  c(3897, 7365),
  label = "Gross Domestic Product",
  unit = "million dollars",
  concept = "http://data.europa.eu/83i/aa/GDP"
)

cat("The print method:\n")
#> The print method:
print(gdp_1)
#> gdp_1: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in million dollars 
#> [1] 3897 7365
cat("And the summary:\n")
#> And the summary:
summary(gdp_1)
#> Gross Domestic Product (million dollars)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    3897    4764    5631    5631    6498    7365

When summary() is called on a defined vector, its label and unit (if available) are displayed above the summary statistics.

The defined() class extends the attributes of a labelled vector with a unit (of measure), a concept definition and a namespace.

attributes(gdp_1)
#> $label
#> [1] "Gross Domestic Product"
#> 
#> $class
#> [1] "haven_labelled_defined" "haven_labelled"         "vctrs_vctr"            
#> [4] "double"                
#> 
#> $unit
#> [1] "million dollars"
#> 
#> $concept
#> [1] "http://data.europa.eu/83i/aa/GDP"
cat("Get the label only: ")
#> Get the label only:
var_label(gdp_1)
#> [1] "Gross Domestic Product"
cat("Get the unit only: ")
#> Get the unit only:
var_unit(gdp_1)
#> [1] "million dollars"
cat("Get the concept definition only: ")
#> Get the concept definition only:
var_concept(gdp_1)
#> [1] "http://data.europa.eu/83i/aa/GDP"

What happens if we try to concatenate a semantically under-specified new vector to the GDP vector?

Semantic Consistency in Concatenation

a <- defined(1:3, label = "Length", unit = "metres")
b <- defined(4:6, label = "Length", unit = "metres")

c(a, b)
#> x: Length
#> Measured in metres 
#> [1] 1 2 3 4 5 6
gdp_2 <- defined(2034, label = "Gross Domestic Product")

You will get an intended error message that some attributes are not compatible. You certainly want to avoid that you are concatenating figures in euros and dollars, for example.

Attempting to concatenate the under-specified gdp_2 vector with gdp_1 will trigger an error:

c(gdp_1, gdp_2)
Error in `vec_c()`:
! Can't combine `..1` <haven_labelled_defined> and `..2` <haven_labelled_defined>.
✖ Some attributes are incompatible.

This error is intentional — it ensures that values with mismatched or incomplete semantic context (e.g., a different currency unit or an undefined concept) do not silently contaminate the dataset.

Aligning Metadata Before Concatenation

We can resolve this by explicitly defining the missing unit and definition for gdp_2 so that it matches gdp_1:

var_unit(gdp_2) <- "million dollars"
var_concept(gdp_2) <- "http://data.europa.eu/83i/aa/GDP"

With matching metadata, concatenation now succeeds:

summary(c(gdp_1, gdp_2))
#> Gross Domestic Product (million dollars)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    2034    2966    3897    4432    5631    7365

Semantic Identifiers via Namespaces

Namespaces allow defined values — such as country codes — to be expanded into resolvable URIs. This is especially powerful for linked data and machine-readable classification systems.

country <- defined(c("AD", "LI", "SM"),
  label = "Country name",
  concept = "http://data.europa.eu/bna/c_6c2bb82d",
  namespace = "https://www.geonames.org/countries/$1/"
)

The namespace attribute allows each value in a vector to become a resolvable URI — useful in linked data and semantic web contexts.

The point of using a namespace is that it can point to a both human- and machine readable definition of the ID column, or any attribute column in the datasets. (Attributes in a statistical datasets are characteristics of the observations or the measured variables.)

The namespace acts as a template: $1 is replaced by the actual value of each element, producing links like: - https://www.geonames.org/countries/AD/ in the case of Andorra, - https://www.geonames.org/countries/LI/ for Lichtenstein, and - https://www.geonames.org/countries/SM/ for San Marino.

In addition, the definition URI — http://publications.europa.eu/resource/authority/bna/c_6c2bb82d — resolves to a machine-readable classification of country names, helping to align datasets with official vocabularies and standards. ## Basic Usage

Working with character vectors:

countries <- defined(
  c("AD", "LI"),
  label = "Country code",
  namespace = "https://www.geonames.org/countries/$1/"
)

countries
#> x: Country code
#> Defined vector 
#> [1] "AD" "LI"
as_character(countries)
#> [1] "AD" "LI"

Subsetting and coercion

gdp_1[1:2]
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in million dollars 
#> [1] 3897 7365
gdp_1 > 5000
#> [1] FALSE  TRUE
as.vector(gdp_1)
#> [1] 3897 7365
as.list(gdp_1)
#> [[1]]
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in million dollars 
#> [1] 3897
#> 
#> [[2]]
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in million dollars 
#> [1] 7365

Coerce to base R types

Coerce back the labelled country vector to a character vector:

as_character(country)
#> [1] "AD" "LI" "SM"
as_character(c(gdp_1, gdp_2))
#> [1] "3897" "7365" "2034"

And to numeric:

as_numeric(c(gdp_1, gdp_2))
#> [1] 3897 7365 2034