defined()
The defined() function in the dataset package allows you to create semantically enriched vectors that retain human-readable metadata, including labels, measurement units, definitions (e.g. URIs), and namespaces.
This vignette demonstrates how to create, manipulate, and interpret defined vectors, and how they integrate seamlessly into data frames and tidy workflows.
## defined Class
The defined() constructor enriches a vector by attaching additional attributes that convey semantic meaning. It builds upon the foundation of labelled vectors and introduces three further metadata elements:
- A unit of measurement (e.g. “million dollars”)
- A concept, which can be a textual reference or ideally a URI
- A namespace, which enables the construction of meaningful, resolvable identifiers for values or categories
Let’s inspect the metadata attached to a defined vector representing GDP values:
gdp_1 <- defined(
  c(3897, 7365),
  label = "Gross Domestic Product",
  unit = "million dollars",
  concept = "http://data.europa.eu/83i/aa/GDP"
)
cat("The print method:\n")
#> The print method:
print(gdp_1)
#> gdp_1: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in million dollars
#> [1] 3897 7365
cat("And the summary:\n")
#> And the summary:
summary(gdp_1)
#> Gross Domestic Product (million dollars)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>    3897    4764    5631    5631    6498    7365
When summary() is called on a defined vector, its label and unit (if available) are displayed above the summary statistics.
The defined class extends the attributes of a labelled vector with a unit (of measure), a concept definition and a namespace.
attributes(gdp_1)
#> $label
#> [1] "Gross Domestic Product"
#>
#> $class
#> [1] "haven_labelled_defined" "haven_labelled" "vctrs_vctr"
#> [4] "double"
#>
#> $unit
#> [1] "million dollars"
#>
#> $concept
#> [1] "http://data.europa.eu/83i/aa/GDP"
cat("Get the label only: ")
#> Get the label only:
var_label(gdp_1)
#> [1] "Gross Domestic Product"
cat("Get the unit only: ")
#> Get the unit only:
var_unit(gdp_1)
#> [1] "million dollars"
cat("Get the concept definition only: ")
#> Get the concept definition only:
var_concept(gdp_1)
#> [1] "http://data.europa.eu/83i/aa/GDP"
Two defined vectors whose metadata match can be concatenated with c():
a <- defined(1:3, label = "Length", unit = "metres")
b <- defined(4:6, label = "Length", unit = "metres")
c(a, b)
#> x: Length
#> Measured in metres
#> [1] 1 2 3 4 5 6
What happens if we try to concatenate a semantically under-specified new vector to the GDP vector? You get an intentional error message saying that some attributes are not compatible. This guards against, for example, concatenating figures in euros with figures in dollars.
Attempting to concatenate an under-specified gdp_2 vector with gdp_1 will trigger an error:
Error in `vec_c()`:
! Can't combine `..1` <haven_labelled_defined> and `..2` <haven_labelled_defined>.
✖ Some attributes are incompatible.
This error is intentional — it ensures that values with mismatched or incomplete semantic context (e.g., a different currency unit or an undefined concept) do not silently contaminate the dataset.
We can resolve this by explicitly defining the missing unit and concept definition for gdp_2 so that they match those of gdp_1. With matching metadata, concatenation succeeds.
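Both cases can be sketched as follows. This is a minimal sketch that assumes the dataset package is installed and uses only the defined() arguments shown earlier. The value 2034 for gdp_2 is a hypothetical figure chosen for illustration, and tryCatch() is used only so that the script keeps running after the intended error.

```r
library(dataset)

gdp_1 <- defined(
  c(3897, 7365),
  label = "Gross Domestic Product",
  unit = "million dollars",
  concept = "http://data.europa.eu/83i/aa/GDP"
)

# Hypothetical under-specified vector: it has a label, but no unit or concept.
gdp_2 <- defined(2034, label = "Gross Domestic Product")

# Capture the intended error message instead of stopping the script.
msg <- tryCatch(
  {
    c(gdp_1, gdp_2)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

# Re-create gdp_2 with the same unit and concept as gdp_1.
gdp_2 <- defined(
  2034,
  label = "Gross Domestic Product",
  unit = "million dollars",
  concept = "http://data.europa.eu/83i/aa/GDP"
)

# With matching metadata the two vectors can be combined.
gdp_all <- c(gdp_1, gdp_2)
as.vector(gdp_all)
```

Re-creating gdp_2 through the constructor keeps the sketch within the calls shown in this vignette.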
Namespaces allow defined values — such as country codes — to be expanded into resolvable URIs. This is especially powerful for linked data and machine-readable classification systems.
country <- defined(
  c("AD", "LI", "SM"),
  label = "Country name",
  concept = "http://data.europa.eu/bna/c_6c2bb82d",
  namespace = "https://www.geonames.org/countries/$1/"
)
With the namespace attribute, each value in the vector becomes a resolvable URI. The point of using a namespace is that it can point to a definition of the ID column, or of any attribute column in the dataset, that is both human- and machine-readable. (Attributes in a statistical dataset are characteristics of the observations or of the measured variables.)
The namespace acts as a template: $1 is replaced by the actual value of each element, producing links like:
- https://www.geonames.org/countries/AD/ for Andorra
- https://www.geonames.org/countries/LI/ for Liechtenstein
- https://www.geonames.org/countries/SM/ for San Marino
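The template substitution described above can be illustrated in base R. This is a sketch of the idea only, not the package's implementation: expand_namespace() is a hypothetical helper introduced here for illustration.

```r
# Hypothetical helper: replace the literal "$1" placeholder in a
# namespace template with each value in turn.
expand_namespace <- function(values, namespace) {
  vapply(
    values,
    function(v) sub("$1", v, namespace, fixed = TRUE),
    character(1),
    USE.NAMES = FALSE
  )
}

uris <- expand_namespace(
  c("AD", "LI", "SM"),
  "https://www.geonames.org/countries/$1/"
)
print(uris)
```

Using fixed = TRUE treats "$1" as a literal string rather than a regular expression, so the dollar sign needs no escaping.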
In addition, the definition URI (http://publications.europa.eu/resource/authority/bna/c_6c2bb82d) resolves to a machine-readable classification of country names, helping to align datasets with official vocabularies and standards.
## Basic Usage
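Defined vectors support the usual vector operations. Subsetting and as.list() keep the metadata, while comparison and as.vector() return plain values: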
gdp_1[1:2]
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in million dollars
#> [1] 3897 7365
gdp_1 > 5000
#> [1] FALSE TRUE
as.vector(gdp_1)
#> [1] 3897 7365
as.list(gdp_1)
#> [[1]]
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in million dollars
#> [1] 3897
#>
#> [[2]]
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in million dollars
#> [1] 7365