A vector is a simple sequence of items. All items in a vector must be of the same type (no mixing): logical, integer, numeric, complex, character, or raw.
Notice that values of the character type can be enclosed either in single or double quotes. There is no meaningful difference. This can be confusing. Good luck.
To create a vector, use the function c().
fruit = c('apple', 'banana', 'pineapple', 'kumquat')
print(fruit)
## [1] "apple" "banana" "pineapple" "kumquat"
class(fruit)
## [1] "character"
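To see what "without mixing types" means in practice, the sketch below builds a few vectors and inspects them with class(). The variable names are made up for illustration. Note that mixing types inside c() does not raise an error; R quietly converts everything to the most general type present.

```r
# Hypothetical example vectors, one per common type
flags  = c(TRUE, FALSE, TRUE)   # logical
counts = c(1L, 2L, 3L)          # integer (note the L suffix)
sizes  = c(1.5, 2, 3.25)        # numeric (double)

class(flags)    # "logical"
class(counts)   # "integer"
class(sizes)    # "numeric"

# Mixing types: every element is coerced to character here
mixed = c(1, 'two', TRUE)
class(mixed)    # "character"
print(mixed)    # "1" "two" "TRUE"
```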
Vectors can be combined into “data frames” (see below). They then become the columns of the data frame.
Lists are like vectors, but can mix different data types.
mylist = list("banana", 21.7, 11L)
print(mylist)
## [[1]]
## [1] "banana"
##
## [[2]]
## [1] 21.7
##
## [[3]]
## [1] 11
class(mylist)
## [1] "list"
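As a short sketch of how to get values back out of a list: double brackets return the element itself, while single brackets return a smaller list. The named list at the end is a hypothetical variation, not part of the example above.

```r
mylist = list("banana", 21.7, 11L)

# Each element keeps its own type
sapply(mylist, class)   # "character" "numeric" "integer"

# [[ ]] extracts the element itself; [ ] returns a sub-list
second = mylist[[2]]
class(second)           # "numeric"
class(mylist[2])        # "list"

# Elements can also be named and then accessed with $
named = list(fruit = "banana", price = 21.7, count = 11L)
named$price             # 21.7
```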
A matrix is a two-dimensional rectangular data set. It can be created using a vector input to the matrix function.
mymatrix = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(mymatrix)
##      [,1] [,2] [,3]
## [1,] "a"  "a"  "b"
## [2,] "c"  "b"  "a"
class(mymatrix)
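## [1] "matrix"
(In R 4.0.0 and later, this prints "matrix" "array", because a matrix is also a two-dimensional array.)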
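The byrow argument decides how the input vector is laid out, which is easy to get wrong. The following sketch fills the same vector both ways and indexes single cells with [row, column]. The names by_row and by_col are only for illustration.

```r
letters_in = c('a', 'a', 'b', 'c', 'b', 'a')

by_row = matrix(letters_in, nrow = 2, ncol = 3, byrow = TRUE)
by_col = matrix(letters_in, nrow = 2, ncol = 3)   # byrow defaults to FALSE

dim(by_row)         # 2 3
by_row[2, 1]        # "c": the second row starts with the 4th input value
by_col[2, 1]        # "a": the first column holds the 1st and 2nd input values
by_row[1, ]         # the whole first row: "a" "a" "b"
is.matrix(by_row)   # TRUE
```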
Factors are created from a vector. They store the vector along with the distinct values of its elements as labels (called levels). The labels are always stored as character strings, regardless of whether the input vector is numeric, character, logical, etc.
Factors represent categorical data (sometimes called “nominal” data). They are useful in statistical modeling.
choice = c('stb', 'choc', 'stb', 'van', 'stb', 'choc', 'van')
print(choice)
## [1] "stb" "choc" "stb" "van" "stb" "choc" "van"
class(choice)
## [1] "character"
So far, choice is a vector of character-type entries. But we can turn it into a factor:
choice = as.factor(choice)
print(choice)
## [1] stb choc stb van stb choc van
## Levels: choc stb van
class(choice)
## [1] "factor"
As you see above, when you call print() on a factor, its levels are listed along with its content. There is also a dedicated function that returns just the levels of a factor:
levels(choice)
## [1] "choc" "stb" "van"
In day-to-day practice, the levels() function is quite useful because many problems, such as misspelled or inconsistently capitalized categories, show up as soon as you look at the levels of a factor.
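Here is a minimal sketch of that kind of check, using made-up answers in which one entry is capitalized differently. The stray label shows up as an extra level, and assigning it the name of an existing level merges the two.

```r
# Hypothetical answers with an inconsistent label ('Van' vs 'van')
answers = as.factor(c('stb', 'choc', 'Van', 'van', 'stb'))

nlevels(answers)   # 4: one more level than expected
levels(answers)    # the stray 'Van' is listed as its own level

# Renaming a level to an existing level name merges the two
levels(answers)[levels(answers) == 'Van'] = 'van'

nlevels(answers)   # 3
table(answers)     # counts per level after the fix
```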
The king of the data types in R, data frames are what you’re most likely to find yourself working with. They are like 2D-arrays, but with column and row headers.
BMI <- data.frame(
gender = c("Male", "Male","Female"),
height = c(152, 171.5, 165),
weight = c(81,93, 78),
Age = c(42,38,26)
)
print(BMI)
##   gender height weight Age
## 1   Male  152.0     81  42
## 2   Male  171.5     93  38
## 3 Female  165.0     78  26
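As a sketch of how to work with such a data frame: a single column is an ordinary vector, and rows and columns can be picked with [row, column]. The last step computes a body mass index and assumes that height is in centimeters and weight in kilograms, which the example above does not actually state.

```r
BMI = data.frame(
  gender = c("Male", "Male", "Female"),
  height = c(152, 171.5, 165),
  weight = c(81, 93, 78),
  Age    = c(42, 38, 26)
)

# One column, returned as a vector
BMI$height

# One cell, and all rows that match a condition
BMI[2, 'weight']                  # 93
BMI[BMI$gender == 'Female', ]

# Add a new column (assumption: height in cm, weight in kg)
BMI$bmi = BMI$weight / (BMI$height / 100)^2
colnames(BMI)
```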
As you saw above, you can “make” data frames by hand. More likely, however, you will want to analyze existing data stored in a spreadsheet-style format such as .csv, in a structured format such as .json, or in a database such as MySQL.
Let us start with a file in .csv format. Make sure the file is in your working directory. Use the following command to open one of our data frames:
# header = TRUE: the first line of the file holds the column names
mydf = read.csv('relatives.csv', header = TRUE)
print(colnames(mydf))
Notice I am only calling the column names here with the print() command instead of the whole data frame. (The colnames() function is nested inside the print() function.) If you want a peek at the whole data frame, use the functions that RStudio (or Excel) provides for this purpose.
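If you do not have the practice file at hand, the same steps can be tried on a throwaway file. The sketch below writes a tiny CSV to a temporary location and reads it back. The column names and values are invented for illustration and are not those of relatives.csv.

```r
# Write a small stand-in CSV to a temporary file (made-up content)
tmp = tempfile(fileext = '.csv')
writeLines(c('name,relation,age',
             'Ana,aunt,61',
             'Ben,cousin,29'), tmp)

demo_df = read.csv(tmp, header = TRUE)

print(colnames(demo_df))   # "name" "relation" "age"
head(demo_df, n = 1)       # only the first row
str(demo_df)               # column types as R interpreted them
```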
For other file and database formats, useful packages are readily available. A quick web search will usually turn up a recommendation.
For example, a .json file can be read into R using the jsonlite package.
install.packages("jsonlite")  # only needed once
library(jsonlite)
After this, we can read a .json file from our working directory (in this example, ‘winners.json’) into R as a data frame. (Nested objects are ‘flattened’ into regular columns by this call.)
winners <- fromJSON('winners.json', flatten=TRUE)
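To make the effect of flatten visible without a file, fromJSON() can also parse JSON text directly. The JSON below is invented for illustration and says nothing about the actual contents of winners.json. The sketch assumes jsonlite is installed.

```r
library(jsonlite)

# Made-up JSON text with one nested object per record
json_text = '[
  {"name": "Ada", "prize": {"year": 2001, "field": "math"}},
  {"name": "Bo",  "prize": {"year": 2005, "field": "physics"}}
]'

nested = fromJSON(json_text)                   # 'prize' is a data frame inside a column
flat   = fromJSON(json_text, flatten = TRUE)   # nested fields become ordinary columns

colnames(nested)   # "name" "prize"
colnames(flat)     # "name" "prize.year" "prize.field"
```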
And here is one for SPSS-format files:
install.packages("foreign")  # only needed once
library(foreign)
mydata = read.spss("myfile", to.data.frame=TRUE)
Once you have your data frame in R, and it has a variable name such as mydata or similar, there are several functions you can use to explore the data.
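As a small sketch of what such exploration looks like, the lines below use iris, a data set that ships with R and is not part of this tutorial's files, as a stand-in for your own data.

```r
# iris is built into R; it stands in for your own data frame here
mydata = iris

dim(mydata)        # 150 5
nrow(mydata)       # 150
ncol(mydata)       # 5
length(mydata)     # 5: for a data frame, length() counts columns
colnames(mydata)
head(mydata, 3)    # first 3 rows instead of the default 6
```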
Try each of the following by applying the function to a dataset in your working memory:
dim() to get the dimensions (number of rows, number of columns)
length() counts the columns
ncol() counts the columns
nrow() counts the rows
colnames() gives a numbered list of column names
rownames() gives a numbered list of row names (if present)
str() stands for “structure”… see what it does
summary() see what it does, and note differences to str()
head() prints the “head” (first 6 lines) of the data
tail() prints the “tail” (last 6 lines) of the data
Cleaning and wrangling are theoretically separate, but practically intertwined processes.
Sometimes, when importing a dataset and coercing it into data frame format, R may misinterpret data types. For example, if there is a factor whose levels are labeled by numbers, R will most likely read it as a numeric vector, not as a factor.
In the practice dataset relatives, I have included one such column: it is coded with numbers, but we want it to be treated as a factor.
relatives = read.csv('relatives.csv', header = TRUE)
str(relatives)
The column which_old is in a binary numeric format. In order to tell R that this is, in fact, a factor, you can do either of two things.
First, you can simply tell R to re-parse the column as a factor:
relatives$which_old = as.factor(relatives$which_old)
str(relatives)
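The effect of this conversion can be checked on a small stand-in. The data frame below is made up and only borrows the column name which_old. The last two lines show a common pitfall: as.numeric() on a factor returns the internal level codes, not the original numbers.

```r
# Made-up stand-in for the numerically coded column
relatives_demo = data.frame(which_old = c(0, 1, 1, 0, 1))
class(relatives_demo$which_old)      # "numeric"

relatives_demo$which_old = as.factor(relatives_demo$which_old)
class(relatives_demo$which_old)      # "factor"
levels(relatives_demo$which_old)     # "0" "1"
summary(relatives_demo$which_old)    # counts per level instead of mean, median, etc.

# Pitfall when converting back to numbers
as.numeric(relatives_demo$which_old)                 # 1 2 2 1 2 (level codes)
as.numeric(as.character(relatives_demo$which_old))   # 0 1 1 0 1 (original values)
```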
Alternatively, you can write a new factor in which the levels are labeled by character values. In order to do this, we use the ifelse() function.
relatives$which_new = as.factor(ifelse(relatives$which_old == 0, "no", "yes"))
str(relatives)
# If you wanted to remove the old version of the factor from your data frame:
# relatives$which_old = NULL
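The relabeling step can be tried on made-up data as well. The sketch below repeats the ifelse() approach and, as an alternative that the text above does not use, builds the same factor with factor() and explicit levels and labels. The data frame and the column name which_alt are hypothetical.

```r
# Made-up stand-in for the numerically coded column
recode_demo = data.frame(which_old = c(0, 1, 1, 0, 1))

# Approach from the text: relabel with ifelse(), then convert to a factor
recode_demo$which_new = as.factor(ifelse(recode_demo$which_old == 0, "no", "yes"))
levels(recode_demo$which_new)   # "no" "yes"

# Cross-check that every 0 became "no" and every 1 became "yes"
table(recode_demo$which_old, recode_demo$which_new)

# Alternative: factor() with explicit levels and labels
recode_demo$which_alt = factor(recode_demo$which_old,
                               levels = c(0, 1),
                               labels = c("no", "yes"))
identical(as.character(recode_demo$which_new),
          as.character(recode_demo$which_alt))   # TRUE
```

The factor() variant spells out which number maps to which label, which may be easier to extend when a column has more than two codes.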