Introduction to lessR

library(lessR)
#> 
#> lessR 4.3.8                         feedback: gerbing@pdx.edu 
#> --------------------------------------------------------------
#> > d <- Read("")   Read text, Excel, SPSS, SAS, or R data file
#>   d is default data frame, data= in analysis routines optional
#> 
#> Many examples of reading, writing, and manipulating data, 
#> graphics, testing means and proportions, regression, factor analysis,
#> customization, and descriptive statistics from pivot tables
#>   Enter: browseVignettes("lessR")
#> 
#> View lessR updates, now including time series forecasting
#>   Enter: news(package="lessR")
#> 
#> Interactive data analysis
#>   Enter: interact()
#> 
#> Attaching package: 'lessR'
#> The following object is masked from 'package:base':
#> 
#>     sort_by

The vignette examples of using lessR became so extensive that lessR exceeded the maximum R package installation size. Find some examples below, and many more in the vignettes at:

lessR examples

Read Data

Many of the following examples analyze data in the Employee data set, included with lessR. To read an internal lessR data set, just pass the name of the data set to the lessR function Read(). Read the Employee data into the data frame d. For data sets other than those provided by lessR, enter the path name or URL between the quotes, or leave the quotes empty to browse for the data file on your computer system. See the Read and Write vignette for more details.

d <- Read("Employee")
#> 
#> >>> Suggestions
#> Recommended binary format for data files: feather
#>   Create with Write(d, "your_file", format="feather")
#> More details about your data, Enter:  details()  for d, or  details(name)
#> 
#> Data Types
#> ------------------------------------------------------------
#> character: Non-numeric data values
#> integer: Numeric data values, integers only
#> double: Numeric data values with decimal digits
#> ------------------------------------------------------------
#> 
#>     Variable                  Missing  Unique 
#>         Name     Type  Values  Values  Values   First and last values
#> ------------------------------------------------------------------------------------------
#>  1     Years   integer     36       1      16   7  NA  7 ... 1  2  10
#>  2    Gender character     37       0       2   M  M  W ... W  W  M
#>  3      Dept character     36       1       5   ADMN  SALE  FINC ... MKTG  SALE  FINC
#>  4    Salary    double     37       0      37   53788.26  94494.58 ... 56508.32  57562.36
#>  5    JobSat character     35       2       3   med  low  high ... high  low  high
#>  6      Plan   integer     37       0       3   1  1  2 ... 2  2  1
#>  7       Pre   integer     37       0      27   82  62  90 ... 83  59  80
#>  8      Post   integer     37       0      22   92  74  86 ... 90  71  87
#> ------------------------------------------------------------------------------------------

d is the default name of the data frame for the lessR data analysis functions. Explicitly access the data frame with the data parameter in the analysis functions.

As an option, also read the table of variable labels. Create the table formatted as two columns. The first column is the variable name and the second column is the corresponding variable label. Not all variables need be entered into the table. The table can be a csv file or an Excel file.

Read the file of variable labels into the l data frame, currently the only permitted name. The labels will be displayed on both the text and visualization output. Each displayed label is the variable name juxtaposed with the corresponding label, as shown in the following output.
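
The two-column structure of a labels file can be sketched with base R alone; the file name and the two label entries below are hypothetical, for illustration only:

```r
# A minimal sketch of a variable labels file: first column is the
# variable name, second column is the corresponding label
lbl <- data.frame(
  name  = c("Years", "Salary"),
  label = c("Time of Company Employment", "Annual Salary (USD)")
)
f <- file.path(tempdir(), "Employee_lbl.csv")
write.csv(lbl, f, row.names = FALSE)

# Read it back to confirm the two-column structure
read.csv(f)
```

A csv built this way, or the equivalent Excel file, is what the labels reader expects.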

l <- rd("Employee_lbl")
#> 
#> >>> Suggestions
#> Recommended binary format for data files: feather
#>   Create with Write(d, "your_file", format="feather")
#> More details about your data, Enter:  details()  for d, or  details(name)
#> 
#> Data Types
#> ------------------------------------------------------------
#> character: Non-numeric data values
#> ------------------------------------------------------------
#> 
#>     Variable                  Missing  Unique 
#>         Name     Type  Values  Values  Values   First and last values
#> ------------------------------------------------------------------------------------------
#>  1     label character      8       0       8   Time of Company Employment ... Test score on legal issues after instruction
#> ------------------------------------------------------------------------------------------
l
#>                                                label
#> Years                     Time of Company Employment
#> Gender                                  Man or Woman
#> Dept                             Department Employed
#> Salary                           Annual Salary (USD)
#> JobSat            Satisfaction with Work Environment
#> Plan             1=GoodHealth, 2=GetWell, 3=BestCare
#> Pre    Test score on legal issues before instruction
#> Post    Test score on legal issues after instruction

Bar Chart

Consider the categorical variable Dept in the Employee data table. Use BarChart() to tabulate and visualize the number of employees in each department, here relying upon the default data frame (table) named d. Otherwise, add the data= option for a data frame with another name.

BarChart(Dept)
Bar chart of tabulated counts of employees in each department.

#> >>> Suggestions
#> BarChart(Dept, horiz=TRUE)  # horizontal bar chart
#> BarChart(Dept, fill="reds")  # red bars of varying lightness
#> PieChart(Dept)  # doughnut (ring) chart
#> Plot(Dept)  # bubble plot
#> Plot(Dept, stat="count")  # lollipop plot 
#> 
#> --- Dept --- 
#> 
#> Missing Values: 1 
#> 
#>                 ACCT   ADMN   FINC   MKTG   SALE    Total 
#> Frequencies:       5      6      4      6     15       36 
#> Proportions:   0.139  0.167  0.111  0.167  0.417    1.000 
#> 
#> Chi-squared test of null hypothesis of equal probabilities 
#>   Chisq = 10.944, df = 4, p-value = 0.027
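
The chi-squared test in this output can be reproduced with base R's chisq.test() applied to the five cell counts; a quick cross-check:

```r
# Frequencies from the BarChart() output above
counts <- c(ACCT = 5, ADMN = 6, FINC = 4, MKTG = 6, SALE = 15)

# Default null hypothesis: equal probabilities across the five departments
chisq.test(counts)
# X-squared matches the Chisq = 10.944, df = 4 reported above
```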

Specify a single fill color with the fill parameter, the edge color of the bars with color. Set the transparency level with transparency. Against a lighter background, display the value for each bar with a darker color using the labels_color parameter. To specify a color, use color names, specify a color with either its rgb() or hcl() color space coordinates, or use the lessR custom color palette function getColors().

BarChart(Dept, fill="darkred", color="black", transparency=.8,
         labels_color="black")

#> >>> Suggestions
#> BarChart(Dept, horiz=TRUE)  # horizontal bar chart
#> BarChart(Dept, fill="reds")  # red bars of varying lightness
#> PieChart(Dept)  # doughnut (ring) chart
#> Plot(Dept)  # bubble plot
#> Plot(Dept, stat="count")  # lollipop plot 
#> 
#> --- Dept --- 
#> 
#> Missing Values: 1 
#> 
#>                 ACCT   ADMN   FINC   MKTG   SALE    Total 
#> Frequencies:       5      6      4      6     15       36 
#> Proportions:   0.139  0.167  0.111  0.167  0.417    1.000 
#> 
#> Chi-squared test of null hypothesis of equal probabilities 
#>   Chisq = 10.944, df = 4, p-value = 0.027

Use the theme parameter to change the entire color theme: "colors", "lightbronze", "dodgerblue", "slatered", "darkred", "gray", "gold", "darkgreen", "blue", "red", "rose", "green", "purple", "sienna", "brown", "orange", "white", and "light". In this example, changing the full theme accomplishes the same as changing the fill color. Turn off the displayed value on each bar with the parameter labels set to "off". Specify a horizontal bar chart with base R parameter horiz.

BarChart(Dept, theme="gray", labels="off", horiz=TRUE)

#> >>> Suggestions
#> BarChart(Dept, horiz=TRUE)  # horizontal bar chart
#> BarChart(Dept, fill="reds")  # red bars of varying lightness
#> PieChart(Dept)  # doughnut (ring) chart
#> Plot(Dept)  # bubble plot
#> Plot(Dept, stat="count")  # lollipop plot 
#> 
#> --- Dept --- 
#> 
#> Missing Values: 1 
#> 
#>                 ACCT   ADMN   FINC   MKTG   SALE    Total 
#> Frequencies:       5      6      4      6     15       36 
#> Proportions:   0.139  0.167  0.111  0.167  0.417    1.000 
#> 
#> Chi-squared test of null hypothesis of equal probabilities 
#>   Chisq = 10.944, df = 4, p-value = 0.027

Histogram

Consider the continuous variable Salary in the Employee data table. Use Histogram() to bin the values of Salary and display the resulting frequency distribution, here relying upon the default data frame (table) named d, so the data= parameter is not needed.

Histogram(Salary)
Histogram of tabulated counts for the bins of Salary.

#> >>> Suggestions 
#> bin_width: set the width of each bin 
#> bin_start: set the start of the first bin 
#> bin_end: set the end of the last bin 
#> Histogram(Salary, density=TRUE)  # smoothed curve + histogram 
#> Plot(Salary)  # Violin/Box/Scatterplot (VBS) plot 
#> 
#> --- Salary --- 
#>  
#>      n   miss         mean           sd          min          mdn          max 
#>      37      0    73795.557    21799.533    46124.970    69547.600   134419.230 
#> 
#>   
#> --- Outliers ---     from the box plot: 1 
#>  
#> Small      Large 
#> -----      ----- 
#>             134419.2 
#> 
#> 
#> Bin Width: 10000 
#> Number of Bins: 10 
#>  
#>              Bin  Midpnt  Count    Prop  Cumul.c  Cumul.p 
#> --------------------------------------------------------- 
#>   40000 >  50000   45000      4    0.11        4     0.11 
#>   50000 >  60000   55000      8    0.22       12     0.32 
#>   60000 >  70000   65000      8    0.22       20     0.54 
#>   70000 >  80000   75000      5    0.14       25     0.68 
#>   80000 >  90000   85000      3    0.08       28     0.76 
#>   90000 > 100000   95000      5    0.14       33     0.89 
#>  100000 > 110000  105000      1    0.03       34     0.92 
#>  110000 > 120000  115000      1    0.03       35     0.95 
#>  120000 > 130000  125000      1    0.03       36     0.97 
#>  130000 > 140000  135000      1    0.03       37     1.00

By default, the Histogram() function colors the bars according to the current, active color theme. The function also provides the corresponding frequency distribution, summary statistics, and the table that lists the count in each bin, from which the histogram is constructed, as well as an outlier analysis based on Tukey's outlier detection rules for box plots.

Use the parameters bin_start, bin_width, and bin_end to customize the histogram.

It is easy to change the color, either by changing the color theme with style(), or by changing the fill color directly with fill. Refer to standard R colors, as shown with the lessR function showColors(), or implicitly invoke the lessR color palette generating function getColors(). Each 30 degrees of the color wheel is named, such as "greens", "rusts", etc., and implements a sequential color palette.
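
The standard R color names accepted by fill can be checked against the names base R itself recognizes, via grDevices::colors(); a small sketch:

```r
# Base R ships with over 600 built-in color names; lessR accepts any of them
head(colors())

# Verify that color names used in these examples are standard R colors
c("darkred", "black") %in% colors()
```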

Histogram(Salary, bin_start=35000, bin_width=14000, fill="reds")
Customized histogram.

#> >>> Suggestions 
#> bin_end: set the end of the last bin 
#> Histogram(Salary, density=TRUE)  # smoothed curve + histogram 
#> Plot(Salary)  # Violin/Box/Scatterplot (VBS) plot 
#> 
#> --- Salary --- 
#>  
#>      n   miss         mean           sd          min          mdn          max 
#>      37      0    73795.557    21799.533    46124.970    69547.600   134419.230 
#> 
#>   
#> --- Outliers ---     from the box plot: 1 
#>  
#> Small      Large 
#> -----      ----- 
#>             134419.2 
#> 
#> 
#> Bin Width: 14000 
#> Number of Bins: 8 
#>  
#>              Bin  Midpnt  Count    Prop  Cumul.c  Cumul.p 
#> --------------------------------------------------------- 
#>   35000 >  49000   42000      1    0.03        1     0.03 
#>   49000 >  63000   56000     14    0.38       15     0.41 
#>   63000 >  77000   70000      9    0.24       24     0.65 
#>   77000 >  91000   84000      4    0.11       28     0.76 
#>   91000 > 105000   98000      5    0.14       33     0.89 
#>  105000 > 119000  112000      2    0.05       35     0.95 
#>  119000 > 133000  126000      1    0.03       36     0.97 
#>  133000 > 147000  140000      1    0.03       37     1.00

Scatterplot

Specify an x-variable and a y-variable with the Plot() function to obtain a scatterplot. In this example, both variables are continuous, though any combination of continuous or categorical variables is possible, including specifying only one variable.

Plot(Years, Salary)

#> >>> Suggestions  or  enter: style(suggest=FALSE)
#> Plot(Years, Salary, enhance=TRUE)  # many options
#> Plot(Years, Salary, fill="skyblue")  # interior fill color of points
#> Plot(Years, Salary, fit="lm", fit_se=c(.90,.99))  # fit line, stnd errors
#> Plot(Years, Salary, MD_cut=6)  # Mahalanobis distance from center > 6 is an outlier 
#> 
#> 
#> >>> Pearson's product-moment correlation 
#>  
#> Years: Time of Company Employment 
#> Salary: Annual Salary (USD) 
#>  
#> Number of paired values with neither missing, n = 36 
#> Sample Correlation of Years and Salary: r = 0.852 
#>   
#> Hypothesis Test of 0 Correlation:  t = 9.501,  df = 34,  p-value = 0.000 
#> 95% Confidence Interval for Correlation:  0.727 to 0.923 
#> 
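
The t statistic and confidence interval above follow from the reported r = 0.852 with n = 36 pairs: t = r√(n−2)/√(1−r²), and the 95% CI from the Fisher z transformation. A quick check in base R (small discrepancies reflect rounding of r to three digits):

```r
r <- 0.852; n <- 36

# t statistic for the hypothesis test of 0 correlation
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
t_stat   # about 9.5, matching the reported t = 9.501

# 95% confidence interval via the Fisher z transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
tanh(z + c(-1, 1) * qnorm(0.975) * se)   # about 0.727 to 0.923
```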

Enhance the default scatterplot with parameter enhance. The visualization includes the mean of each variable indicated by the respective line through the scatterplot, the 95% confidence ellipse, labeled outliers, least-squares regression line with 95% confidence interval, and the corresponding regression line with the outliers removed.

Plot(Years, Salary, enhance=TRUE)
#> [Ellipse with Murdoch and Chow's function ellipse from their ellipse package]

#> >>> Suggestions  or  enter: style(suggest=FALSE)
#> Plot(Years, Salary, color="red")  # exterior edge color of points
#> Plot(Years, Salary, fit="lm", fit_se=c(.90,.99))  # fit line, stnd errors
#> Plot(Years, Salary, out_cut=.10)  # label top 10% from center as outliers 
#> 
#> 
#> >>> Pearson's product-moment correlation 
#>  
#> Years: Time of Company Employment 
#> Salary: Annual Salary (USD) 
#>  
#> Number of paired values with neither missing, n = 36 
#> Sample Correlation of Years and Salary: r = 0.852 
#>   
#> Hypothesis Test of 0 Correlation:  t = 9.501,  df = 34,  p-value = 0.000 
#> 95% Confidence Interval for Correlation:  0.727 to 0.923 
#>   
#> 
#> >>> Outlier analysis with Mahalanobis Distance 
#>  
#>   MD  ID 
#> ----- ----- 
#> 8.14  18 
#> 7.84  34 
#>  
#> 5.63  31 
#> 5.58  19 
#> 3.75   4 
#> ...  ...

The default plot for a single continuous variable includes not only the scatterplot, but also the superimposed violin plot and box plot, with outliers identified. Call this plot the VBS plot.

Plot(Salary)
#> [Violin/Box/Scatterplot graphics from Deepayan Sarkar's lattice package]
#> 
#> >>> Suggestions
#> Plot(Salary, out_cut=2, fences=TRUE, vbs_mean=TRUE)  # Label two outliers ...
#> Plot(Salary, box_adj=TRUE)  # Adjust boxplot whiskers for asymmetry

#> --- Salary --- 
#> Present: 37 
#> Missing: 0 
#> Total  : 37 
#>  
#> Mean         : 73795.557 
#> Stnd Dev     : 21799.533 
#> IQR          : 31012.560 
#> Skew         : 0.190   [medcouple, -1 to 1] 
#>  
#> Minimum      : 46124.970 
#> Lower Whisker: 46124.970 
#> 1st Quartile : 56772.950 
#> Median       : 69547.600 
#> 3rd Quartile : 87785.510 
#> Upper Whisker: 122563.380 
#> Maximum      : 134419.230 
#> 
#>   
#> --- Outliers ---     from the box plot: 1 
#>  
#> Small      Large 
#> -----      ----- 
#>             134419.23 
#> 
#> Number of duplicated values: 0 
#> 
#> Parameter values (can be manually set) 
#> ------------------------------------------------------- 
#> size: 0.61      size of plotted points 
#> out_size: 0.82  size of plotted outlier points 
#> jitter_y: 0.45 random vertical movement of points 
#> jitter_x: 0.00  random horizontal movement of points 
#> bw: 9529.04       set bandwidth higher for smoother edges
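
The single large outlier flagged above follows from Tukey's box plot rule: a value beyond 1.5 × IQR past a quartile is an outlier. Using the quartiles reported in this output:

```r
# Quartiles from the Plot(Salary) output above
q1  <- 56772.95
q3  <- 87785.51
iqr <- q3 - q1                 # 31012.56, as reported

upper_fence <- q3 + 1.5 * iqr  # about 134304
upper_fence

# The maximum salary exceeds the fence, so it is flagged as an outlier;
# the upper whisker (122563.38) does not
134419.23 > upper_fence        # TRUE
```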

Regression Analysis

The full output is extensive: Summary of the analysis, estimated model, fit indices, ANOVA, correlation matrix, collinearity analysis, best subset regression, residuals and influence statistics, and prediction intervals. The motivation is to provide virtually all of the information needed for a proper regression analysis.

reg(Salary ~ Years + Pre)

#> >>> Suggestion
#> # Create an R markdown file for interpretative output with  Rmd = "file_name"
#> reg(Salary ~ Years + Pre, Rmd="eg")  
#> 
#> 
#>   BACKGROUND 
#> 
#> Data Frame:  d 
#>  
#> Response Variable: Salary 
#> Predictor Variable 1: Years 
#> Predictor Variable 2: Pre 
#>  
#> Number of cases (rows) of data:  37 
#> Number of cases retained for analysis:  36 
#> 
#> 
#>   BASIC ANALYSIS 
#> 
#>              Estimate    Std Err  t-value  p-value   Lower 95%   Upper 95% 
#> (Intercept) 44140.971  13666.115    3.230    0.003   16337.052   71944.891 
#>       Years  3251.408    347.529    9.356    0.000    2544.355    3958.462 
#>         Pre   -18.265    167.652   -0.109    0.914    -359.355     322.825 
#> 
#> Standard deviation of Salary: 21,822.372 
#>  
#> Standard deviation of residuals:  11,753.478 for df=33 
#> 95% range of residuals:  47,825.260 = 2 * (2.035 * 11,753.478) 
#>  
#> R-squared: 0.726    Adjusted R-squared: 0.710    PRESS R-squared: 0.659 
#> 
#> Null hypothesis of all 0 population slope coefficients:
#>   F-statistic: 43.827     df: 2 and 33     p-value:  0.000 
#> 
#> -- Analysis of Variance 
#>  
#>             df           Sum Sq          Mean Sq   F-value   p-value 
#>     Years    1  12107157290.292  12107157290.292    87.641     0.000 
#>       Pre    1      1639658.444      1639658.444     0.012     0.914 
#>  
#> Model        2  12108796948.736   6054398474.368    43.827     0.000 
#> Residuals   33   4558759843.773    138144237.690 
#> Salary      35  16667556792.508    476215908.357 
#> 
#> 
#>   K-FOLD CROSS-VALIDATION 
#> 
#> 
#>   RELATIONS AMONG THE VARIABLES 
#> 
#>          Salary Years  Pre 
#>   Salary   1.00  0.85 0.03 
#>    Years   0.85  1.00 0.05 
#>      Pre   0.03  0.05 1.00 
#> 
#>         Tolerance       VIF 
#>   Years     0.998     1.002 
#>     Pre     0.998     1.002 
#> 
#>  Years Pre    R2adj    X's 
#>      1   0    0.718      1 
#>      1   1    0.710      2 
#>      0   1   -0.028      1 
#>  
#> [based on Thomas Lumley's leaps function from the leaps package] 
#> 
#> 
#>   RESIDUALS AND INFLUENCE 
#> 
#> -- Data, Fitted, Residual, Studentized Residual, Dffits, Cook's Distance 
#>    [sorted by Cook's Distance] 
#>    [n_res_rows = 20, out of 36 rows of data, or do n_res_rows="all"] 
#> ----------------------------------------------------------------------------------------- 
#>                        Years     Pre     Salary     fitted      resid rstdnt dffits cooks 
#>       Correll, Trevon     21      97 134419.230 110648.843  23770.387  2.424  1.217 0.430 
#>         James, Leslie     18      70 122563.380 101387.773  21175.607  1.998  0.714 0.156 
#>         Capelle, Adam     24      83 108138.430 120658.778 -12520.348 -1.211 -0.634 0.132 
#>           Hoang, Binh     15      96 111074.860  91158.659  19916.201  1.860  0.649 0.131 
#>    Korhalkar, Jessica      2      74  72502.500  49292.181  23210.319  2.171  0.638 0.122 
#>        Billing, Susan      4      91  72675.260  55484.493  17190.767  1.561  0.472 0.071 
#>          Singh, Niral      2      59  61055.440  49566.155  11489.285  1.064  0.452 0.068 
#>        Skrotzki, Sara     18      63  91352.330 101515.627 -10163.297 -0.937 -0.397 0.053 
#>      Saechao, Suzanne      8      98  55545.250  68362.271 -12817.021 -1.157 -0.390 0.050 
#>         Kralik, Laura     10      74  92681.190  75303.447  17377.743  1.535  0.287 0.026 
#>   Anastasiou, Crystal      2      59  56508.320  49566.155   6942.165  0.636  0.270 0.025 
#>     Langston, Matthew      5      94  49188.960  58681.106  -9492.146 -0.844 -0.268 0.024 
#>        Afshari, Anbar      6     100  69441.930  61822.925   7619.005  0.689  0.264 0.024 
#>   Cassinelli, Anastis     10      80  57562.360  75193.857 -17631.497 -1.554 -0.265 0.022 
#>      Osterman, Pascal      5      69  49704.790  59137.730  -9432.940 -0.826 -0.216 0.016 
#>   Bellingar, Samantha     10      67  66337.830  75431.301  -9093.471 -0.793 -0.198 0.013 
#>          LaRoe, Maria     10      80  61961.290  75193.857 -13232.567 -1.148 -0.195 0.013 
#>      Ritchie, Darnell      7      82  53788.260  65403.102 -11614.842 -1.006 -0.190 0.012 
#>        Sheppard, Cory     14      66  95027.550  88455.199   6572.351  0.579  0.176 0.011 
#>        Downs, Deborah      7      90  57139.900  65256.982  -8117.082 -0.706 -0.174 0.010 
#> 
#> 
#>   PREDICTION ERROR 
#> 
#> -- Data, Predicted, Standard Error of Prediction, 95% Prediction Intervals 
#>    [sorted by lower bound of prediction interval] 
#>    [to see all intervals add n_pred_rows="all"] 
#>  ---------------------------------------------- 
#> 
#>                        Years    Pre     Salary       pred    s_pred    pi.lwr     pi.upr     width 
#>          Hamide, Bita      1     83  51036.850  45876.388 12290.483 20871.211  70881.564 50010.352 
#>          Singh, Niral      2     59  61055.440  49566.155 12619.291 23892.014  75240.296 51348.281 
#>   Anastasiou, Crystal      2     59  56508.320  49566.155 12619.291 23892.014  75240.296 51348.281 
#> ... 
#>          Link, Thomas     10     83  66312.890  75139.062 11933.518 50860.137  99417.987 48557.849 
#>          LaRoe, Maria     10     80  61961.290  75193.857 11918.048 50946.405  99441.308 48494.903 
#>   Cassinelli, Anastis     10     80  57562.360  75193.857 11918.048 50946.405  99441.308 48494.903 
#> ... 
#>       Correll, Trevon     21     97 134419.230 110648.843 12881.876 84440.470 136857.217 52416.747 
#>         Capelle, Adam     24     83 108138.430 120658.778 12955.608 94300.394 147017.161 52716.767 
#> 
#> ---------------------------------- 
#> Plot 1: Distribution of Residuals 
#> Plot 2: Residuals vs Fitted Values 
#> ----------------------------------
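
The fit indices above are related by standard formulas. For example, adjusted R-squared follows from R-squared, the number of retained cases n = 36, and the number of predictors p = 2; a quick check:

```r
R2 <- 0.726; n <- 36; p <- 2

# Adjusted R-squared penalizes R-squared for the number of predictors
R2_adj <- 1 - (1 - R2) * (n - 1) / (n - p - 1)
round(R2_adj, 3)   # 0.709, matching the reported 0.710 up to rounding of R2
```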

Time Series

The time series plot, which plots the values of a variable across time, is a special case of a scatterplot, potentially with points of size 0 and adjacent points connected by a line segment. Indicate a time series by specifying the x-variable, the first variable listed, as a variable of type Date. This conversion to Date data values occurs automatically for dates specified in a standard numeric format, such as 18/8/2024.
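
The underlying conversion relies on R's standard date machinery; base R's as.Date() makes it explicit. A sketch for a day/month/year value such as 18/8/2024:

```r
# Convert a day/month/year string to an R Date value
x <- as.Date("18/8/2024", format = "%d/%m/%Y")
x          # 2024-08-18
class(x)   # "Date"
```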

d <- Read("StockPrice")
#> 
#> >>> Suggestions
#> Recommended binary format for data files: feather
#>   Create with Write(d, "your_file", format="feather")
#> More details about your data, Enter:  details()  for d, or  details(name)
#> 
#> Data Types
#> ------------------------------------------------------------
#> character: Non-numeric data values
#> Date: Date with year, month and day
#> double: Numeric data values with decimal digits
#> ------------------------------------------------------------
#> 
#>     Variable                  Missing  Unique 
#>         Name     Type  Values  Values  Values   First and last values
#> ------------------------------------------------------------------------------------------
#>  1     Month      Date   1419       0     473   1985-01-01 ... 2024-05-01
#>  2   Company character   1419       0       3   Apple  Apple ... Intel  Intel
#>  3     Price    double   1419       0    1400   0.100055  0.085392 ... 30.346739  30.555891
#>  4    Volume    double   1419       0    1419   6366416000 ... 229147100
#> ------------------------------------------------------------------------------------------
head(d)
#>        Month Company    Price     Volume
#> 1 1985-01-01   Apple 0.100055 6366416000
#> 2 1985-02-01   Apple 0.085392 4733388800
#> 3 1985-03-01   Apple 0.076335 4615587200
#> 4 1985-04-01   Apple 0.073316 2868028800
#> 5 1985-05-01   Apple 0.059947 4639129600
#> 6 1985-06-01   Apple 0.062103 5811388800

We have the date as Month, and also Company and stock Price.

Plot(Month, Price, filter=(Company=="Apple"), area_fill="on")
#> 
#> filter:  (Company == "Apple") 
#> -----
#> Rows of data before filtering:  1419 
#> Rows of data after filtering:   473

#> >>> Suggestions  or  enter: style(suggest=FALSE)
#> Plot(Month, Price, time_ahead=4)  # exponential smoothing forecast 4 time units
#> Plot(Month, Price, time_unit="years")  # aggregate time by yearly sum
#> Plot(Month, Price, time_unit="years", time_agg="mean")  # aggregate by yearly mean

With the by parameter, plot all three companies on the same panel.

Plot(Month, Price, by=Company)

#> >>> Suggestions  or  enter: style(suggest=FALSE)
#> Plot(Month, Price, time_ahead=4)  # exponential smoothing forecast 4 time units
#> Plot(Month, Price, time_unit="years")  # aggregate time by yearly sum
#> Plot(Month, Price, time_unit="years", time_agg="mean")  # aggregate by yearly mean

Here, aggregate over time from months to quarters, plotting the mean Price within each quarter.

Plot(Month, Price, time_unit="quarters", time_agg="mean")
#> >>> Warning
#> The  Date  variable is not sorted in Increasing Order.
#> 
#> For a data frame named d, enter: 
#>     d <- sort_by(d, Month)
#> Maybe you have a  by  variable with repeating Date values?
#> Enter  ?sort_by  for more information and examples.
#> [with functions from Ryan, Ulrich, Bennett, and Joy's xts package]

#> >>> Suggestions  or  enter: style(suggest=FALSE)
#> Plot(Month, Price, time_ahead=4)  # exponential smoothing forecast 4 time units

Plot() implements exponential smoothing forecasting with accompanying visualization. New parameters include time_ahead for the number of time_units to forecast into the future, and time_format to provide a specific format for the date variable if not detected correctly by default. Control aspects of the exponential smoothing estimation and prediction algorithms with parameters es_level (alpha), es_trend (beta), es_seasons (gamma), es_type for additive or multiplicative seasonality, and es_PIlevel for the level of the prediction intervals.

To forecast Apple's stock price, focus here on the last several years of the data, beginning with Row 400 through Row 473, the last row of data for Apple. In this example, forecast ahead 24 months.

d <- d[400:473,]
Plot(Month, Price, time_unit="months", time_agg="mean", time_ahead=24)

#> >>> Suggestions  or  enter: style(suggest=FALSE)
#> Plot(Month, Price, time_ahead=4, es_seasons=FALSE)  # turn off exponential smoothing seasonal effect 
#> 
#> Sum of squared fit errors: 7,753.551 
#> 
#>          predicted      upr      lwr
#> Jun 2024  184.7528 206.6612 162.8443
#> Jul 2024  191.2191 219.2801 163.1582
#> Aug 2024  187.1104 220.2061 154.0147
#> Sep 2024  177.5939 215.0600 140.1277
#> Oct 2024  188.4028 229.7860 147.0195
#> Nov 2024  195.4951 240.4607 150.5294
#> Dec 2024  190.8555 239.1433 142.5676
#> Jan 2025  188.8448 240.2453 137.4444
#> Feb 2025  184.8265 239.1659 130.4871
#> Mar 2025  189.2376 246.3691 132.1061
#> Apr 2025  192.6431 252.4405 132.8457
#> May 2025  195.3299 257.6831 132.9767
#> Jun 2025  197.6827 263.8267 131.5386
#> Jul 2025  204.1490 272.6194 135.6786
#> Aug 2025  200.0403 270.7638 129.3167
#> Sep 2025  190.5237 263.4342 117.6133
#> Oct 2025  201.3327 276.3694 126.2959
#> Nov 2025  208.4249 285.5326 131.3173
#> Dec 2025  203.7853 282.9127 124.6580
#> Jan 2026  201.7747 282.8745 120.6750
#> Feb 2026  197.7564 280.7845 114.7282
#> Mar 2026  202.1675 287.0831 117.2518
#> Apr 2026  205.5730 292.3378 118.8081
#> May 2026  208.2598 296.8380 119.6816
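
Base R's HoltWinters() fits the same family of exponential smoothing models (level, trend, and seasonal components). A sketch on the built-in AirPassengers monthly series, forecasting 24 months ahead with prediction intervals, since the stock data itself requires lessR to read:

```r
# Fit an exponential smoothing model with level, trend, and seasonal
# components to a built-in monthly time series
fit <- HoltWinters(AirPassengers)

# Forecast 24 months ahead with 95% prediction intervals
fc <- predict(fit, n.ahead = 24, prediction.interval = TRUE, level = 0.95)
head(fc)   # columns: fit, upr, lwr
```

The predicted, upr, and lwr columns of the lessR output above play the same roles as fit, upr, and lwr here.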

Pivot Tables

Aggregate with pivot(). Any function that processes a single vector of data, such as a column of data values for a variable in a data frame, and outputs a single computed value, the statistic, can be passed to pivot(). Functions can be user-defined or built-in.

Here, compute the mean and standard deviation of the stock Price for each company in the StockPrice data set, downloaded with lessR.

d <- Read("StockPrice", quiet=TRUE)
pivot(d, c(mean, sd), Price, by=Company)
#>   Company Price_n Price_na Price_mean Price_sd
#> 1   Apple     473        0     23.157   46.248
#> 2     IBM     473        0     60.010   43.547
#> 3   Intel     473        0     16.725   14.689
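
For comparison, base R's aggregate() computes the same group statistics, without lessR's extra count (n) and missing-value (na) columns. A sketch on a small simulated data frame with hypothetical prices, since the StockPrice data requires lessR to read:

```r
# Simulated stand-in for the StockPrice data (hypothetical values)
d2 <- data.frame(
  Company = rep(c("Apple", "IBM", "Intel"), each = 3),
  Price   = c(10, 20, 30, 55, 60, 65, 15, 16, 17)
)

# Mean and standard deviation of Price by Company, base R style
aggregate(Price ~ Company, data = d2, FUN = mean)
aggregate(Price ~ Company, data = d2, FUN = sd)
```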

Interpret this call to pivot() as: for each value of Company, compute the mean and standard deviation of Price.

Select any two of the three possibilities for multiple parameter values: multiple compute functions, multiple variables over which to compute, and multiple categorical variables by which to define groups for aggregation.
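
As noted, any function that maps a vector of values to a single statistic can serve as the compute function. A base R sketch of the same pattern with a user-defined statistic (coefficient of variation), applied per group with tapply() on hypothetical data:

```r
# User-defined statistic: one vector in, one value out
cv <- function(x) sd(x) / mean(x)

# Apply it per group, the pattern that pivot() generalizes
d2 <- data.frame(
  grp = c("a", "a", "b", "b"),
  val = c(1, 3, 10, 30)
)
tapply(d2$val, d2$grp, cv)
```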