For some categorical variables rarity can reflect structural features. For instance with United States Zip codes rare zip codes often represent low population density regions.
When this is the case it can make sense to pool the rare levels into a new re-coded level called ``rare.’’ If this new level is statistically significant it can be a usable modeling feature. This sort of pooling is only potentially useful if below a given training count behave similarly.
This capability was more of an experimental demonstration of possible
extensions of vtreat
to have more inference capabilities
about rare level than a commonly useful feature. Most of this power has
since been captured in the more useful catP
feature (also
demonstrated here). Even more power is found in using an interaction of
catN
or catB
with catP
.
An example of the rare level feature using vtreat
is
given below.
First we set up some data by defining a set of population centers
(populationFrame
) and code to observe individuals (with
replacement) uniformly from the combined population with a rare
condition (inClass
) that has elevated occurrence in
observations coming from the small population centers
(rareCodes
).
library('vtreat')
set.seed(2325)
populationFrame <- data.frame(
popsize = round(rlnorm(100,meanlog=log(4000),sdlog=1)),
stringsAsFactors = FALSE)
populationFrame$code <- paste0('z',formatC(sample.int(100000,
size=nrow(populationFrame),
replace=FALSE),width=5,flag='0'))
rareCodes <- populationFrame$code[populationFrame$popsize<1000]
# Draw individuals from code-regions proportional to size of code region
# (or uniformly over all individuals labeled by code region).
# Also add the outcome which has altered conditional probability for rareCodes.
drawIndividualsAndReturnCodes <- function(n) {
ords <- sort(sample.int(sum(populationFrame$popsize),size=n,replace=TRUE))
cs <- cumsum(populationFrame$popsize)
indexes <- findInterval(ords,cs)+1
indexes <- indexes[sample.int(n,size=n,replace=FALSE)]
samp <- data.frame(code=populationFrame$code[indexes],
stringsAsFactors = FALSE)
samp$inClass <- runif(n) < ifelse(samp$code %in% rareCodes,0.3,0.01)
samp
}
We then draw a sample we want to make some observations on.
testSet <- drawIndividualsAndReturnCodes(2000)
table(generatedAsRare=testSet$code %in% rareCodes,inClass=testSet$inClass)
## inClass
## generatedAsRare FALSE TRUE
## FALSE 1957 19
## TRUE 17 7
Notice that in the sample we can observe the elevated rate of
inClass==TRUE
conditioned on coming from a
code
that is one of the rareCodes
.
We could try to learn this relation using vtreat
. To do
this we set up another sample (designSet
) to work on, so we
are not inferring from testSet
(where we will evaluate
results).
designSet <- drawIndividualsAndReturnCodes(2000)
treatments <- vtreat::designTreatmentsC(designSet,'code','inClass',TRUE,
rareCount=5,rareSig=NULL,
verbose=FALSE)
treatments$scoreFrame[,c('varName','sig'),drop=FALSE]
## varName sig
## 1 code_catP 0.035934754
## 2 code_catB 0.025765020
## 3 code_lev_rare 0.006440297
## 4 code_lev_x_z01318 0.944465050
## 5 code_lev_x_z05023 0.255244077
## 6 code_lev_x_z05141 0.932425672
## 7 code_lev_x_z13059 0.766518335
## 8 code_lev_x_z22752 0.168315399
## 9 code_lev_x_z27896 0.249886934
## 10 code_lev_x_z37337 0.999706031
## 11 code_lev_x_z45874 0.261182774
## 12 code_lev_x_z46558 0.213118876
## 13 code_lev_x_z54516 0.859663802
## 14 code_lev_x_z59854 0.031292097
## 15 code_lev_x_z60281 0.222826006
## 16 code_lev_x_z71826 0.249467981
## 17 code_lev_x_z79197 0.255244077
## 18 code_lev_x_z86061 0.944465050
## 19 code_lev_x_z86248 0.178878966
We see in treatments$scoreFrame
we have a level called
code_lev_rare
, which is where a number of rare levels are
re-coding. We can also confirm levels that occur rareCount
or fewer times are eligible to code to to
code_lev_rare
.
## Warning in prepare.treatmentplan(treatments, designSet, pruneSig = 0.5):
## possibly called prepare() on same data frame as
## designTreatments*()/mkCrossFrame*Experiment(), this can lead to over-fit. To
## avoid this, please use mkCrossFrame*Experiment$crossFrame.
designSetTreated$code <- designSet$code
summary(as.numeric(table(designSetTreated$code[designSetTreated$code_lev_rare==1])))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 2.000 2.655 4.000 5.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 10.25 17.00 27.47 38.75 143.00
We can now apply this treatment to testSet
to see how
this inferred rare level performs. Notice also the
code_catP
which directly encodes prevalence or frequency of
the level during training also gives usable estimate of size (likely a
more useful one then the rare-level code itself).
As we can see below the code_lev_rare
correlates with
the condition, and usefully re-codes novel levels (levels in
testSet
that were not seen in designSet
) to
rare.
testSetTreated <- vtreat::prepare(treatments,testSet,pruneSig=0.5)
testSetTreated$code <- testSet$code
testSetTreated$newCode <- !(testSetTreated$code %in% unique(designSet$code))
testSetTreated$generatedAsRareCode <- testSetTreated$code %in% rareCodes
# Show code_lev_rare==1 corresponds to a subset of rows with elevated inClass==TRUE rate.
table(code_lev_rare=testSetTreated$code_lev_rare,
inClass=testSetTreated$inClass)
## inClass
## code_lev_rare FALSE TRUE
## 0 1894 18
## 1 80 8
# Show newCodes get coded with code_level_rare==1.
table(newCode=testSetTreated$newCode,code_lev_rare=testSetTreated$code_lev_rare)
## code_lev_rare
## newCode 0 1
## FALSE 1912 88
# Show newCodes tend to come from defined rareCodes.
table(newCode=testSetTreated$newCode,
generatedAsRare=testSetTreated$generatedAsRareCode)
## generatedAsRare
## newCode FALSE TRUE
## FALSE 1976 24
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00050 0.00950 0.01950 0.02541 0.03450 0.07150
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000500 0.001000 0.001500 0.001398 0.002000 0.002500
## Min. 1st Qu. Median Mean 3rd Qu. Max.
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0005000 0.0005000 0.0010000 0.0009792 0.0015000 0.0020000