Dplyr
dplyr is an R package whose functions are designed to make dataframe manipulation intuitive and user-friendly. It is one of the core packages of the tidyverse, a popular collection of packages for the R programming language. Data analysts typically use dplyr to transform existing datasets into a format better suited to a particular type of analysis or data visualization.
For instance, someone analyzing a large dataset may wish to view only a smaller subset of the data. Alternatively, a user may wish to rearrange the data to see the rows ranked by some numerical value, or by a combination of values from the original dataset. Functions within the dplyr package allow a user to perform such tasks.
dplyr was launched in 2014. On the dplyr web page, the package is described as "a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges."
The five core verbs
While dplyr actually includes several dozen functions that enable various forms of data manipulation, the package features five primary verbs or actions:
- filter, which is used to extract rows from a dataframe, based on conditions specified by a user;
- select, which is used to subset a dataframe by its columns;
- arrange, which is used to sort rows in a dataframe based on attributes held by particular columns;
- mutate, which is used to create new variables, by altering and/or combining values from existing columns; and
- summarize, also spelled summarise, which is used to collapse values from a dataframe into a single summary.
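The five verbs can be sketched on a small example. The `scores` dataframe and its column names below are hypothetical, invented purely for illustration; only the dplyr functions themselves come from the package.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration
scores <- data.frame(
  student = c("Ana", "Ben", "Cy", "Dee"),
  math    = c(90, 72, 85, 60),
  art     = c(70, 88, 95, 65)
)

# filter(): keep only the rows that satisfy a condition
passed <- filter(scores, math >= 70)

# select(): keep only the named columns
trimmed <- select(passed, student, math, art)

# arrange(): sort rows; desc() sorts from highest to lowest
ranked <- arrange(trimmed, desc(math))

# mutate(): add a new column computed from existing columns
totals <- mutate(ranked, total = math + art)

# summarize(): collapse all rows into a single summary row
overview <- summarize(totals, mean_total = mean(total), n = n())

# The same steps chained with the pipe operator, which dplyr re-exports
overview_piped <- scores %>%
  filter(math >= 70) %>%
  select(student, math, art) %>%
  arrange(desc(math)) %>%
  mutate(total = math + art) %>%
  summarize(mean_total = mean(total), n = n())
```

Each verb takes a dataframe as its first argument and returns a new dataframe, which is what allows the calls to be chained with the pipe.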
Additional functions
- count, which is used to count the number of observations for each unique value of a variable, or for each combination of values of several variables;
- rename, which enables a user to alter the column names for variables, often to improve ease of use and intuitive understanding of a dataset;
- slice_max, which returns a data subset that contains the rows with the highest values of some particular variable;
- slice_min, which returns a data subset that contains the rows with the lowest values of some particular variable.
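A minimal sketch of these four functions follows. The `pets` dataframe and its columns are hypothetical and exist only for this illustration.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration
pets <- data.frame(
  nm      = c("Rex", "Tom", "Ace", "Kit", "Bo"),
  species = c("dog", "cat", "dog", "cat", "dog"),
  weight  = c(30, 4, 25, 5, 12)
)

# count(): one row per unique species, with the tally in a column named n
per_species <- count(pets, species)

# rename(): the syntax is new_name = old_name
renamed <- rename(pets, name = nm)

# slice_max(): the n rows with the highest values of weight
heaviest <- slice_max(renamed, weight, n = 2)

# slice_min(): the n rows with the lowest values of weight
lightest <- slice_min(renamed, weight, n = 1)
```

By default, slice_max and slice_min keep tied rows, so the result can contain more than n rows when values are tied; passing with_ties = FALSE limits the output to exactly n rows.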
Built-in datasets
The package ships with the following example datasets: band_instruments, band_instruments2, band_members, starwars and storms.
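These datasets become available once the package is loaded. The sketch below only inspects them; the choice of which ones to look at is arbitrary.

```r
library(dplyr)

# The datasets are available by name once dplyr is attached
band_members

# glimpse() prints a compact, column-by-column overview of a dataframe
glimpse(starwars)

# The verbs described above work on these datasets like on any dataframe
human_characters <- filter(starwars, species == "Human")
storm_names <- select(storms, name)
```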