R CMD check
R CMD check passes cleanly in future R-devel.
R CMD check passes cleanly in future R-devel.
R CMD check passes cleanly on R and R-devel.
R CMD check passes cleanly on R and R-devel.
R CMD check passes cleanly on R and R-devel.
loop_apply(): the Rcpp version appears to be having PROTECTion problems. (Also fixes #256)
Update for changes in R namespace best-practices.
New parameter .id to adply() that specifies the name(s) of the index column(s). (Thanks to Kirill Müller, #191)
Fix bug in split_indices() when n isn’t supplied.
Fix bug in .id parameter to ldply() and rdply() allowing for .id = NULL to work as described in the help. (Thanks to Doug Mitarotonda, #207, and Marek, #224 and #225)
Deprecate exotic functions liply() and isplit2(), remove unused and unexported functions dots() and parallel_fe(). (Thanks to Kirill Müller, #242, #248)
Warn on duplicate names that cause certain array functions to fail. (Thanks to Kirill Müller, #211)
Parameter .inform is now honored for ?_ply() calls. (Thanks to Kirill Müller, #209)
New parameter .id to ldply() and rdply() that specifies the name of the index column. (Thanks to Kirill Müller, #107, #140, #142)
The .id column in ldply() is generated as a factor to preserve the sort order, but only if the new .id parameter is set. (Thanks to Kirill Müller, #137)
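A minimal sketch of the .id behaviour described above, assuming a plyr version that includes these changes. The list `scores` and the column name "group" are illustrative only.

```r
library(plyr)

# Illustrative input: a named list; the names feed the index column.
scores <- list(a = 1:3, b = 4:6)

# .id = "group" names the index column. Because .id is set explicitly,
# the column is generated as a factor.
res <- ldply(scores, function(v) data.frame(mean = mean(v)), .id = "group")
res
```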
rbind.fill now silently drops NULL inputs (#138)
rbind.fill avoids array copying which had produced quadratic time complexity. *dply of large numbers of groups should be faster. (Contributed by Peter Meilstrup)
rbind.fill handles non-numeric matrix columns (i.e. factor arrays, character arrays, list arrays); also arrays with more than 2 dimensions can be used. Dimnames of array columns are now preserved. (Contributed by Peter Meilstrup)
rbind.fill(x,y) converts factor columns of Y to character when columns of X are character. join(x,y) and match_df(x,y) now work when the key column in X is character and Y is factor. (Contributed by Peter Meilstrup)
Fix faulty array allocation which caused problems when using split_indices with large (> 2^24) vectors. (Fixes #131)
list_to_array() incorrectly determined dimensions if column of labels contained any missing values (#169).
r*ply expression is evaluated exactly .n times, evaluation results are consistent with side effects. (#158, thanks to Kirill Müller)
**ply gain a .inform argument (previously only available in llply) - this gives more useful debugging information at the cost of some speed. (Thanks to Brian Diggs, #57)
if .dims = TRUE alply’s output gains dimensions and dimnames, similar to apply. Sequential indexing of a list produced by alply should be unaffected. (Peter Meilstrup)
colwise, numcolwise and catcolwise now all accept additional arguments in ... (Thanks to Stavros Macrakis, #62)
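A small sketch of passing extra arguments through numcolwise, assuming plyr is installed. The data frame and its column names are made up for illustration.

```r
library(plyr)

# Illustrative data: two numeric columns (one with a missing value)
# and one character column that numcolwise should skip.
df <- data.frame(a = c(1, NA, 3), b = c(2, 4, 6), label = c("x", "y", "z"))

# Arguments after the function (here na.rm = TRUE) are forwarded to it
# for every numeric column.
col_means <- numcolwise(mean, na.rm = TRUE)(df)
col_means
```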
here makes it possible to use **ply + a function that uses non-standard evaluation (e.g. summarise, mutate, subset, arrange) inside a function. (Thanks to Peter Meilstrup, #3)
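The following sketch shows the kind of situation here is meant for, assuming plyr is installed. The wrapper `add_offset` and its arguments are hypothetical names.

```r
library(plyr)

# Hypothetical wrapper: `offset` only exists inside this function, so the
# expression given to mutate must be evaluated in this function's environment.
add_offset <- function(df, offset) {
  ddply(df, "g", here(mutate), y = x + offset)
}

df <- data.frame(g = c("a", "a", "b"), x = 1:3)
out <- add_offset(df, 10)
out
```

Wrapping mutate in here() is what lets the expression see `offset`. Passing mutate directly would look for it in the wrong environment.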
join_all recursively joins a list of data frames. (Fixes #29)
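A short illustration of join_all, assuming plyr is installed. The three data frames and the key column "id" are invented for the example.

```r
library(plyr)

# Illustrative inputs sharing a key column "id".
dfs <- list(
  data.frame(id = 1:3, a = c("x", "y", "z")),
  data.frame(id = 1:3, b = c(10, 20, 30)),
  data.frame(id = 1:3, flag = c(TRUE, FALSE, TRUE))
)

# Joins the first two frames, then joins the result with the third.
wide <- join_all(dfs, by = "id")
wide
```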
name_rows provides a convenient way of saving and then restoring row names so that you can preserve them if you need to. (#61)
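One way this can be used, sketched under the assumption that plyr is installed: stash row names before an operation that drops them, then restore them afterwards. The data frame is illustrative.

```r
library(plyr)

# Illustrative data frame with meaningful row names.
df <- data.frame(x = c(3, 1, 2), row.names = c("c", "a", "b"))

# First call stores the row names in a .rownames column; arrange() reorders
# the rows; the second call moves .rownames back into the row names.
sorted <- name_rows(arrange(name_rows(df), x))
sorted
```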
progress_time (used with .progress = "time") estimates the amount of time remaining before the job is completed. (Thanks to Mike Lawrence, #78)
summarise now works iteratively so that later columns can refer to earlier ones. (Thanks to Jim Hester, #44)
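A brief sketch of the iterative behaviour, assuming plyr is installed (and that plyr's summarise is not masked by another package). Column names are illustrative.

```r
library(plyr)

df <- data.frame(x = c(2, 3, 5))

# `total` is computed first, so the second expression can refer to it.
res <- summarise(df, total = sum(x), first_share = x[1] / total)
res
```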
take makes it easy to subset along an arbitrary dimension.
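For illustration, a small sketch of take on a 3-d array, assuming plyr is installed. The array is arbitrary.

```r
library(plyr)

x <- array(seq_len(3 * 4 * 5), c(3, 4, 5))

# Select index 1 along the third dimension; dimensions are kept by default.
slab <- take(x, 3, 1)
dim(slab)

# With drop = TRUE the selected dimension is dropped, like x[, , 1].
slice <- take(x, 3, 1, drop = TRUE)
dim(slice)
```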
Improved documentation thanks to patches from Tim Bates.
**ply gains a .paropts argument, a list of options that is passed on to foreach for controlling parallel computation.
*_ply now accepts .parallel argument to enable parallel processing. (Fixes #60)
Progress bars are disabled when using parallel plyr (Fixes #32)
a*ply: 25x speedup when indexing array objects, 3x speedup when indexing data frames. This should substantially reduce the overhead of using a*ply.
d*ply subsetting has been considerably optimised: this will have a small impact unless you have a very large number of groups, in which case it will be considerably faster.
idata.frame: Subsetting immutable data frames with [.idf is now faster (Peter Meilstrup)
quickdf is around 20% faster
split_indices, which powers much internal splitting code (like vaggregate, join and d*ply) is about 2x faster. It was already incredibly fast (~0.2s for 1,000,000 obs), so this won’t have much impact on overall performance.
*aply functions now bind list mode results into a list-array (Peter Meilstrup)
*aply now accepts 0-dimension arrays as inputs. (#88)
count now works correctly for factor and Date inputs. (Fixes #130)
*dply now deals better with matrix results, converting them to data frames, rather than vectors. (Fixes #12)
d*ply will now preserve input factor levels if drop = FALSE (#81)
join works correctly when there are no common rows (Fixes #74), or when one input has no rows (Fixes #48). It also consistently orders the columns: common columns, then x cols, then y cols (Fixes #40).
quickdf correctly handles NA variable names. (Fixes #66. Thanks to Scott Kostyshak)
rbind.fill and rbind.fill.matrix work consistently with matrices and data frames with zero rows. Fixes #79. (Peter Meilstrup)
rbind.fill now stops if inputs are not data frames. (Fixes #51)
rbind.fill now works consistently with 0 column data frames
round_any now works with POSIXct objects, thanks to Jean-Olivier Irisson (#76)
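A quick sketch of round_any on numbers and on a POSIXct value, assuming plyr is installed. The timestamp is illustrative; for POSIXct input the accuracy is taken in seconds.

```r
library(plyr)

# Round to the nearest multiple of 10, or always downwards with floor.
nearest <- round_any(135, 10)
down <- round_any(135, 10, floor)

# Illustrative timestamp, floored to the start of its minute.
stamp <- as.POSIXct("2020-01-01 00:00:07", tz = "UTC")
floored <- round_any(stamp, 60, floor)
floored
```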
rbind.fill: if a column contains both factors and characters (in different inputs), the resulting column will be coerced to character
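A minimal sketch of this coercion, assuming plyr is installed. The two data frames are invented for the example.

```r
library(plyr)

# Column x is a factor in one input and a character vector in the other.
a <- data.frame(x = factor("p"))
b <- data.frame(x = "q", y = 1, stringsAsFactors = FALSE)

# The combined x column is coerced to character; y is filled with NA
# for rows coming from the input that lacks it.
out <- rbind.fill(a, b)
str(out)
```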
When there are more than 2^31 distinct combinations, id switches to a slower fallback strategy using strings (inspired by merge) that guarantees correct results. This fixes problems with join when joining across many columns. (Fixes #63)
split_indices checks input more aggressively to prevent segfaults. Fixes #43.
fix small bug in loop_apply which led to segfaults in certain circumstances. (Thanks to Pål Westermark for patch)
itertools and iterators moved to suggests from imports so that plyr now only depends on base R.
documentation improved using new features of roxygen2
fixed namespacing issue which led to lost labels when subsetting the results of *lply
colwise automatically strips off split variables.
rlply now correctly deals with rlply(4, NULL) (thanks to bug report from Eric Goldlust)
rbind.fill tries harder to keep attributes, retaining the attributes from the first occurrence of each column it finds. It also now works with variables of class POSIXlt and preserves the ordered status of factors.
arrange now works with one column data frames
d*ply returns correct number of rows when function returns vector
fix NAMESPACE bug which was causing problems with ggplot2
rbind.fill now treats 1d arrays in the same way as rbind (i.e. it turns them into ordinary vectors)
fix bug in rename when renaming multiple columns
new strip_splits function removes splitting variables from the data frames returned by ddply.
rename moved in from reshape, and rewritten.
new match_df function makes it easy to subset a data frame to only contain values matching another data frame. Inspired by http://stackoverflow.com/questions/4693849.
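To illustrate match_df, a small sketch assuming plyr is installed. The data frames and the key column "id" are illustrative.

```r
library(plyr)

df <- data.frame(id = c(1, 2, 3, 4), v = c("a", "b", "c", "d"))
keep <- data.frame(id = c(2, 4))

# Keep only the rows of df whose id also appears in keep.
out <- match_df(df, keep, on = "id")
out
```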
**ply now works when passed a list of functions
*dply now correctly names output even when some output combinations are missing (NULL) (Thanks to bug report from Karl Ove Hufthammer)
*dply preserves the class of many more object types.
a*ply now correctly works with zero length margins, operating on the entire object (Thanks to bug report from Stavros Macrakis)
join now implements joins in a more SQL like way, returning all possible matches, not just the first one. It is still (a little) faster than merge. The previous behaviour is accessible with match = "first".
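The difference between the two matching modes can be sketched as follows, assuming plyr is installed. The data frames are illustrative; key 1 appears twice in `y`.

```r
library(plyr)

x <- data.frame(k = c(1, 2))
y <- data.frame(k = c(1, 1, 2), v = c("a", "b", "c"))

# Default: every matching row of y is returned, as in SQL.
all_matches <- join(x, y, by = "k")

# Previous behaviour: only the first match per row of x.
first_match <- join(x, y, by = "k", match = "first")

nrow(all_matches)
nrow(first_match)
```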
join is now more symmetric so that join(x, y, "left") is closer to join(y, x, "right"), modulo column ordering
named.quoted failed when quoted expressions were longer than 50 characters. (Thanks to bug report from Eric Goldlust)
rbind.fill now correctly maintains POSIXct tzone attributes and preserves missing factor levels
split_labels correctly preserves empty factor levels, which means that drop = FALSE should work in more places. Use base::droplevels to remove levels that don’t occur in the data, and drop = T to remove combinations of levels that don’t occur.
vaggregate now passes ... to the aggregation function when working out the output type (thanks to bug report by Pavan Racherla)
count now takes an additional parameter wt_var which allows you to compute weighted sums. This is as fast, or faster than, tapply or xtabs.
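A short sketch of weighted counting, assuming plyr is installed. The grouping column `g` and weight column `w` are illustrative names.

```r
library(plyr)

df <- data.frame(g = c("a", "a", "b"), w = c(2, 3, 5))

# Plain counts: number of rows per group.
plain <- count(df, "g")

# Weighted counts: sum of w per group.
weighted <- count(df, "g", wt_var = "w")

plain
weighted
```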
Really fix bug in names.quoted.
The . function now captures the environment in which it was evaluated. This should fix an esoteric class of bugs which no-one probably ever encountered, but will form the basis for an improved version of ggplot2::aes.
Fix bug in names.quoted that interfered with ggplot2
mutate works like transform to add new columns or overwrite existing columns, but computes new columns iteratively so later transformations can use columns created by earlier transformations. (It’s also about 10x faster) (Fixes #21)
split column names are no longer coerced to valid R names.
quickdf now adds names if missing
summarise preserves variable names if explicit names not provided (Fixes #17)
arrays with names should be sorted correctly once again (also fixed a bug in the test case that prevented me from catching this automatically)
m_ply no longer possesses .parallel argument (mistakenly added)
ldply (and hence adply and ddply) now correctly passes on .parallel argument (Fixes #16)
id uses a better strategy for converting to integers, making it possible to use for cases with larger potential numbers of combinations
l*ply, d*ply, a*ply and m*ply all gain a .parallel argument that when TRUE, applies functions in parallel using a parallel backend registered with the foreach package:
x <- seq_len(20)
wait <- function(i) Sys.sleep(0.1)
system.time(llply(x, wait))
#  user  system elapsed
# 0.007   0.005   2.005

doParallel::registerDoParallel(2)
system.time(llply(x, wait, .parallel = TRUE))
#  user  system elapsed
# 0.020   0.011   1.038
This work has been generously supported by BD (Becton Dickinson).
a*ply and m*ply gain an .expand argument that controls whether data frames produce a single output dimension (one element for each row), or an output dimension for each variable.
new vaggregate (vector aggregate) function, which is equivalent to tapply, but much faster (~ 10x), since it avoids copying the data.
llply: for simple lists and vectors, with no progress bar, no extra info, and no parallelisation, llply calls lapply directly to avoid all the overhead associated with those unused extra features.
llply: in serial case, for loop replaced with custom C function that takes about 40% less time (or about 20% less time than lapply). Note that as a whole, llply still has much more overhead than lapply.
round_any now lives in plyr instead of reshape
list_to_array works correctly even when there are missing values in the array. This is particularly important for daply.
*dply deals more gracefully with the case when all results are NULL (fixes #10)
*aply correctly orders output regardless of dimension names (fixes #11)
join gains type = "full" which preserves all x and y rows
experimental immutable data frame (idata.frame) that vastly speeds up subsetting - for large datasets with large numbers of groups, this can yield 10-fold speed ups. See examples in ?idata.frame to see how to use it.
rbind.fill rewritten again to increase speed and work with more data types
d*ply now much faster with nested groups
This work has been generously supported by BD (Becton Dickinson).
d*ply when .drop = FALSE
a*ply now works correctly with array-lists
r*ply now works with ...