Owing to its extensive functionality and its roots in a robust grammar of graphics (Wilkinson 2013), ggplot2 (Wickham 2016) has become very popular among R users. The general approach to generating plots with ggplot2 is to first conceive of the plot and then use our understanding of ggplot2 to create the required mappings, layers and other adjustments. This requires us to divide our cognitive resources between the main task at hand (extracting intelligence from the data) and the process of generating the requisite plot. The guiding insight behind ezEDA is that exploratory data analysis can be made more productive by increasing the proportion of cognitive resources devoted to imagining the kinds of analysis we could perform, and reducing the proportion devoted to the mechanics of generating the plot.
Although exploratory analysis of data differs from dataset to dataset, some recurrent themes or patterns apply to many situations. ezEDA identifies such common patterns and, for each one, provides a single convenience function that relieves us of the mechanics of generating the plot. With ezEDA we aim to ease the task of generating ggplot2-based visualizations by allowing users to think in terms of their problem domain rather than the details of how to achieve the plot that they have in mind. This approach is particularly useful when the visualization task involves standard themes or patterns. The ezEDA package currently provides functions for twelve patterns and we aim to incorporate more in future versions.
ezEDA is only beneficial in situations where the analyst is able to exploit a common pattern that ezEDA has already identified. For other situations, the analyst has to use other means like constructing a ggplot plot from the ground up. Here are some general features of ezEDA functions:
Once they have completed their initial explorations and obtained the necessary insights, it could very well be that users will then dive into ggplot and fine-tune their plots where needed.
Currently, ezEDA provides functions under five categories:
ezEDA is partly motivated by the work of Stephen Few (Few 2015) who described a general framework for Exploratory Data Analysis. Such a framework would be useful when we are presented with a dataset and would like to generate questions to ask of the dataset – questions that could potentially generate interesting results or confirm what we already knew or suspected. Few’s framework is driven by the observation that any dataset has columns of four types:
Once we establish this terminology, the list below shows the patterns and the corresponding ezEDA functions:
Trend of measures over time: measure_change_over_time_wide and measure_change_over_time_long
Distribution of a measure: measure_distribution, measure_distribution_by_category, measure_distribution_by_two_categories and measure_distribution_over_time
Relationship between measures: two_measures_relationship and multi_measures_relationship
Tally of categories: category_tally and two_category_tally
Contribution of categories to a measure: category_contribution and two_category_contribution
ezEDA provides functions for these tasks. The package currently has twelve functions and we hope to add more as we identify more patterns.
ezEDA is available on CRAN and can be installed as usual by calling the install.packages function:
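A minimal form of that call, assuming a CRAN mirror is already configured for the session, would be:

```r
# Install ezEDA from CRAN; packages it depends on are installed as needed
install.packages("ezEDA")
```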
ezEDA depends on several other packages, including ggplot2, dplyr and tidyr. If any of them are not installed on your system, install.packages installs the missing ones as well.

# Using ezEDA

As always, it is a good idea to make the package namespace available via the library function:
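Assuming the package has already been installed from CRAN, attaching it looks like this:

```r
# Attach ezEDA so its functions can be called without the ezEDA:: prefix
library(ezEDA)
```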
We discuss each of the functions below, organized into groups.
If a dataset has one or more measure columns and a time column, it can be useful to study how the measures move over time. Each of the two functions in this group helps us simultaneously visualize the trends of up to 6 measures. Which function to use depends on whether the data is in wide form (different measures in different columns) or long form (all measure values in one column, with another column identifying each measure).
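To make the wide/long distinction concrete, here is a small base-R sketch using a made-up data frame. The column names (`date`, `pop`, `unemploy`, `variable`, `value`) are illustrative choices that mirror ggplot2::economics_long; they are not something ezEDA requires.

```r
# Wide form: one row per date, one column per measure
wide <- data.frame(
  date     = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01")),
  pop      = c(100, 101, 102),
  unemploy = c(5, 6, 7)
)

# Long form: one row per (date, measure) pair.
# 'variable' names the measure and 'value' holds its value.
measures <- c("pop", "unemploy")
long <- data.frame(
  date     = rep(wide$date, times = length(measures)),
  variable = rep(measures, each = nrow(wide)),
  value    = unlist(wide[measures], use.names = FALSE)
)

print(wide)
print(long)
```

Data shaped like `wide` suits measure_change_over_time_wide, while data shaped like `long` suits measure_change_over_time_long.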
Simultaneously visualize the change over time of several measures (up to 6) when the data is in wide form (see above).
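By analogy with the long-form example in this section, a call on the wide-form ggplot2::economics dataset might look as follows. The argument order (dataset, time column, then the measure columns) is an assumption based on the long-form call; check ?measure_change_over_time_wide before relying on it.

```r
library(ezEDA)

## For ggplot2::economics (wide form), plot the trend of population
## and number unemployed.
## Assumed argument order: data, time column, measure columns.
p_wide <- measure_change_over_time_wide(ggplot2::economics, date, pop, unemploy)
p_wide
```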
Simultaneously visualize the change over time of several measures (up to 6) when the data is in long form (see above).
## For ggplot2::economics_long, plot the trend of population and number unemployed.
## In this dataset, all measures are in the column named value and
## the names of the measures are in the column named variable
measure_change_over_time_long(ggplot2::economics_long, date, variable, value, pop, unemploy)
Very often we study the distributions of numeric columns (measures). Within this broad theme, we are often also interested in how the distribution changes based on one or two categories or over time. ezEDA provides four functions for these tasks.
Distribution of a measure: default is histogram.
## For ggplot2::mpg, plot the distribution of highway mileage
measure_distribution(ggplot2::mpg, hwy)
#> Using binwidth corresponding to 30 bins. Use bwidth argument to fix if needed.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Distribution of a measure: default is histogram. By default the function uses a bin width corresponding to 30 bins. Use the bwidth argument to specify the desired value for the width of each bin.
## For ggplot2::mpg, plot the distribution of highway mileage
measure_distribution(ggplot2::mpg, hwy, bwidth = 2)
Distribution of a measure. Get a boxplot instead of histogram.
Distribution of a measure with highlighting of different values of a single category.
## For ggplot2::diamonds, plot the distribution of price while highlighting
## the counts of diamonds of different cuts
measure_distribution_by_category(ggplot2::diamonds, price, cut)
#> Using binwidth corresponding to 30 bins. Use bwidth argument to fix if needed.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Distribution of a measure with highlighting of different values of a single category across facets.
## For ggplot2::diamonds, plot the distribution of price showing
## the distribution for each kind of cut in a different facet
measure_distribution_by_category(ggplot2::diamonds, price, cut, separate = TRUE)
#> Using binwidth corresponding to 30 bins. Use bwidth argument to fix if needed.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Distribution of a measure for a combination of two categories within a facet grid.
## For ggplot2::diamonds, plot the distribution of carat separately
## for each unique combination of cut and clarity
measure_distribution_by_two_categories(ggplot2::diamonds, carat, cut, clarity)
#> Using binwidth corresponding to 30 bins. Use bwidth argument to fix if needed.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Study the change of distribution of a measure over time.
## 50 random heights for each of 1999, 2000 and 2001
h1 <- round(rnorm(50, 60, 8), 0)
h2 <- round(rnorm(50, 65, 8), 0)
h3 <- round(rnorm(50, 70, 8), 0)
h <- c(h1, h2, h3)
y <- c(rep(1999, 50), rep(2000, 50), rep(2001, 50))
df <- data.frame(height = h, year = y)
## Pass the column name (height), not the name of the helper vector h
measure_distribution_over_time(df, height, year)
#> Using binwidth corresponding to 30 bins. Use bwidth argument to fix if needed.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
When a dataset has two or more measures, studying their pairwise relationships often yields useful insights. ezEDA provides two relevant functions.
Scatterplot of the relationship between two measures with optional coloring of points by a category.
## For ggplot2::mpg, plot the highway mileage against the displacement
two_measures_relationship(ggplot2::mpg, displ, hwy)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Below is a variant showing the relationship separately for each value of a category.
The two functions in this group satisfy a common need to generate counts of categories. For example, in a dataset of students, we might want to plot the counts of students by gender; in the diamonds dataset from ggplot2, we might want to generate a plot of the number of diamonds by color, and so on.
Barplot of the counts based on a category column. Often when a barplot has many bars, we run the risk of the x-axis labels running onto each other. To handle this issue, the function flips the coordinates. The first argument is the dataset to be used and the second argument is the unquoted name of the relevant category column.
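Following the argument description above (dataset first, then the unquoted category column), a tally of diamonds by color could be sketched as:

```r
library(ezEDA)

## For ggplot2::diamonds, plot the number of diamonds of each color
p_tally <- category_tally(ggplot2::diamonds, color)
p_tally
```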
Barplot of one category showing its composition in terms of another category. The arguments are the dataset to be used and the unquoted names of the two category columns.
## For ggplot2::diamonds plot the tallies for different types of
## cut and clarity
two_category_tally(ggplot2::diamonds, cut, clarity)
Below is a variant with facets.
## For ggplot2::diamonds, plot the tallies for different types of
## cut and clarity and show the plots for each value of the
## second category in a separate facet
two_category_tally(ggplot2::diamonds, cut, clarity, separate = TRUE)
#> Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
#> "none")` instead.
The two functions in this group satisfy a common need to analyze the extent to which specific categories contribute to specific measures. For example, in a dataset of people, we might want to plot the total income by gender; in the diamonds dataset from ggplot2, we might want to generate a plot of the contribution to the price by diamonds of various kinds of cut, and so on.
Barplot of the total of a measure based on a category column.
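As an illustration, the total price by cut mentioned above might be plotted as follows. The argument order (dataset, category column, measure column) is an assumption made by analogy with two_category_contribution; consult ?category_contribution to confirm it.

```r
library(ezEDA)

## For ggplot2::diamonds, plot the total price for each kind of cut.
## Assumed argument order: data, category column, measure column.
p_contrib <- category_contribution(ggplot2::diamonds, cut, price)
p_contrib
```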
Barplot of the total of a measure based on a category column, showing the contribution of a second category within each bar.
## For ggplot2::diamonds, plot the total price for each kind of cut
## while also showing the contribution of each kind of clarity
## within each kind of cut
two_category_contribution(ggplot2::diamonds, cut, clarity, price)
Below is a variant with facets.
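A sketch of the faceted variant, assuming two_category_contribution accepts the same separate = TRUE switch that two_category_tally uses earlier in this document (check ?two_category_contribution to confirm):

```r
library(ezEDA)

## For ggplot2::diamonds, plot the total price for each kind of cut,
## with each clarity value in its own facet.
## separate = TRUE is assumed to behave as it does for two_category_tally.
p_facets <- two_category_contribution(ggplot2::diamonds, cut, clarity, price,
                                      separate = TRUE)
p_facets
```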
Few, Stephen. 2015. Signal: Understanding What Matters in a World of Noise. Analytics Press.
Wickham, Hadley. 2016. ggplot2: Elegant Graphics for Data Analysis. Springer.
Wilkinson, Leland. 2013. The Grammar of Graphics. Springer Science & Business Media.