Part 2


Data types in R

Vectors

A vector is a simple sequence of items. All items in a vector must be of the same type, which can be any of the following:

  • Logical: TRUE, FALSE (can be abbreviated to T, F)
  • Numeric: 12.3, 5, 999
  • Integer: 2L, 34L, 0L
  • Character: 'a', "good", "TRUE", '23.4'

Notice that values of the character type can be enclosed either in single or double quotes. There is no meaningful difference. This can be confusing. Good luck.
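Here is a minimal sketch of the four types, and of what happens if you try to mix them anyway. The variable names are made up for illustration.

```r
# One vector per basic type (names are illustrative only)
flags  = c(TRUE, FALSE, TRUE)
sizes  = c(12.3, 5, 999)
counts = c(2L, 34L, 0L)
words  = c('a', "good", "TRUE", '23.4')

class(flags)   # "logical"
class(sizes)   # "numeric"
class(counts)  # "integer"
class(words)   # "character"

# Single and double quotes create the same string
identical('good', "good")   # TRUE

# Mixing types does not raise an error: R silently converts
# every item to the most general type present (here, character)
mixed = c(1.5, TRUE, 'a')
class(mixed)   # "character"
print(mixed)   # "1.5"  "TRUE" "a"
```

The silent conversion is worth remembering: a single stray character value is enough to turn a whole numeric vector into text.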

To create a vector, use the function c().

fruit = c('apple', 'banana', 'pineapple', 'kumquat')

print(fruit)
## [1] "apple"     "banana"    "pineapple" "kumquat"
class(fruit)
## [1] "character"

Vectors can be combined into “data frames” (see below). They then become columns of the data frame.

Lists

Lists are like vectors, but can mix different data types.

mylist = list("banana", 21.7, 11L)

print(mylist)
## [[1]]
## [1] "banana"
## 
## [[2]]
## [1] 21.7
## 
## [[3]]
## [1] 11
class(mylist)
## [1] "list"
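To get items back out of a list, you can use double or single square brackets. The short sketch below reuses the same list and shows the difference between the two.

```r
mylist = list("banana", 21.7, 11L)

# Double brackets return the item itself
mylist[[2]]           # 21.7
class(mylist[[2]])    # "numeric"

# Single brackets return a (shorter) list containing the item
class(mylist[2])      # "list"

# Each item keeps its own type
sapply(mylist, class) # "character" "numeric" "integer"
```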

Matrices

A matrix is a two-dimensional rectangular data set. It can be created using a vector input to the matrix function.

mymatrix = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)

print(mymatrix)
##      [,1] [,2] [,3]
## [1,] "a"  "a"  "b" 
## [2,] "c"  "b"  "a"
class(mymatrix)
## [1] "matrix"
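The byrow argument decides how the input vector is poured into the matrix. The sketch below builds the same matrix both ways and pulls out single cells with [row, column] indexing. The names letters_in, by_row and by_col are made up for this example.

```r
letters_in = c('a','a','b','c','b','a')

# Fill row by row (as above) versus column by column (the default)
by_row = matrix(letters_in, nrow = 2, ncol = 3, byrow = TRUE)
by_col = matrix(letters_in, nrow = 2, ncol = 3)

by_row[2, 1]   # "c"  (row 2, column 1)
by_col[2, 1]   # "a"  (same position, different fill order)
by_row[1, ]    # the whole first row: "a" "a" "b"
dim(by_row)    # 2 3
```

Note that in R 4.0.0 and later, class() applied to a matrix returns both "matrix" and "array", so your output may differ slightly from the one printed above.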

Factors

Factors are created from a vector. A factor stores the vector along with a list of its distinct values, called levels, which serve as labels. The labels are always stored as character strings, regardless of whether the input vector is numeric, character, logical, etc.

Factors represent categorical data (sometimes called “nominal” data). They are useful in statistical modeling.
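The point about labels always being character has a practical consequence that is easy to trip over. A small sketch with made-up numbers:

```r
scores = c(10, 20, 10, 30)       # invented numeric input
score_factor = factor(scores)

# The levels are text, even though the input was numeric
levels(score_factor)                     # "10" "20" "30"

# Converting straight to a number returns the internal level codes ...
as.integer(score_factor)                 # 1 2 1 3

# ... so go through character first to recover the original values
as.numeric(as.character(score_factor))   # 10 20 10 30
```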

choice = c('stb', 'choc', 'stb', 'van', 'stb', 'choc', 'van')

print(choice)
## [1] "stb"  "choc" "stb"  "van"  "stb"  "choc" "van"
class(choice)
## [1] "character"

So far, choice is a vector of character-type entries. But we can turn it into a factor:

choice = as.factor(choice)

print(choice)
## [1] stb  choc stb  van  stb  choc van 
## Levels: choc stb van
class(choice)
## [1] "factor"

As you see above, when you call print() on a factor, its levels are listed along with its content. There is also a dedicated function that returns just the levels of a factor:

levels(choice)
## [1] "choc" "stb"  "van"

In day-to-day practice, the levels() function is quite useful because many problems, such as misspelled categories or unexpected extra levels, become visible as soon as you look at the levels of a factor.
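As an illustration of the kind of problem levels() can reveal, here is a sketch with invented survey answers that contain inconsistent capitalization and a trailing space. The cleaning step shown is only one possible approach.

```r
# Invented data: 'Yes' and 'yes ' are meant to be the same as 'yes'
answers = as.factor(c('yes', 'no', 'Yes', 'no', 'yes '))

levels(answers)    # four levels instead of the expected two
nlevels(answers)   # 4

# One way to tidy up: trim whitespace, lower-case, re-create the factor
answers_clean = as.factor(tolower(trimws(as.character(answers))))

levels(answers_clean)   # "no"  "yes"
table(answers_clean)    # counts per level: no 2, yes 3
```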

Data frames

The king of the data types in R, data frames are what you’re most likely to find yourself working with. They are like 2D-arrays, but with column and row headers.

BMI <- data.frame(
   gender = c("Male", "Male", "Female"),
   height = c(152, 171.5, 165),
   weight = c(81, 93, 78),
   Age = c(42, 38, 26)
)

print(BMI)
##   gender height weight Age
## 1   Male  152.0     81  42
## 2   Male  171.5     93  38
## 3 Female  165.0     78  26
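Once a data frame exists, you will mostly work with its columns and rows. The sketch below rebuilds the same data frame and shows a few common ways of reaching into it. The new column bmi is my own addition, and it assumes that height is given in centimeters and weight in kilograms, which is not stated above.

```r
BMI <- data.frame(
  gender = c("Male", "Male", "Female"),
  height = c(152, 171.5, 165),
  weight = c(81, 93, 78),
  Age    = c(42, 38, 26)
)

BMI$height            # one column, returned as a numeric vector
BMI[2, "weight"]      # a single cell: 93
BMI[BMI$Age < 40, ]   # only the rows where Age is below 40

# Assumption: height in cm, weight in kg
BMI$bmi <- BMI$weight / (BMI$height / 100)^2

print(BMI)
```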

Inputting data

As you saw above, you can “make” data frames by hand. More likely, however, you will want to analyze existing data that is stored in a spreadsheet-friendly format such as .csv, in another exchange format such as .json, or in a database such as MySQL.

.csv

Let us start with a file in .csv format. Make sure the file is in your working directory. Use the following command to open one of our data frames:

mydf =
  read.csv('relatives.csv', header = TRUE)

print(colnames(mydf))

Notice I am only calling the column names here with the print() command instead of the whole data frame. (The colnames() function is nested inside the print() function.) If you want a peek at the whole data frame, use the functions that RStudio (or Excel) provides for this purpose.
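If you do not have a .csv file at hand, you can try out read.csv() on a throwaway file. The sketch below writes a tiny file to a temporary location and reads it back. The file contents and the name demo_df are invented for this example.

```r
# Write a tiny example file to a temporary location
tmp = tempfile(fileext = '.csv')
writeLines(c('name,age,city',
             'Ann,34,Oslo',
             'Bob,51,Lima'), tmp)

# Read it back; the first line is used as the column names
demo_df = read.csv(tmp, header = TRUE)

print(colnames(demo_df))   # "name" "age"  "city"
dim(demo_df)               # 2 3

# Two helpers for checking where R is looking for your own files
getwd()                          # the current working directory
file.exists('relatives.csv')     # TRUE only if the file is in that directory
```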

For database formats, there are useful and functional packages readily available. The best idea is to Google a recommendation.

.json

For example, a .json file can be read into R using the jsonlite package.

install.packages("jsonlite")

library(jsonlite)

After this, we can read a .json file that sits in our working directory (in this example, 'winners.json') into R as a data frame. (Nested objects are “flattened” into ordinary columns by this call.)

winners <- fromJSON('winners.json', flatten=TRUE)
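To see what flattening does, here is a small sketch that parses a JSON string directly instead of a file. The JSON content is invented, and the example assumes the jsonlite package is installed.

```r
library(jsonlite)

# Invented JSON text with one nested object per record
json_text = '[
  {"name": "Ann", "prize": {"year": 2001, "field": "physics"}},
  {"name": "Bob", "prize": {"year": 2010, "field": "chemistry"}}
]'

# Without flattening, "prize" is a data frame tucked inside one column
nested = fromJSON(json_text)
colnames(nested)   # "name"  "prize"

# With flattening, the nested fields become ordinary columns
flat = fromJSON(json_text, flatten = TRUE)
colnames(flat)     # "name" "prize.year" "prize.field"
```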

SPSS

And here is one for SPSS-format files:

install.packages("foreign")

library(foreign)

mydata = read.spss("myfile", to.data.frame=TRUE)

A first look at your data frame

Once you have your data frame in R, and it has a variable name such as mydata or similar, there are several functions you can use to explore the data.

Try each of the following by applying the function to a dataset in your working memory:

  • dim() gives the dimensions (number of rows, number of columns)
  • length() counts the columns
  • ncol() counts the columns
  • nrow() counts the rows
  • colnames() gives a numbered list of column names
  • rownames() gives a numbered list of row names (if present)
  • str() stands for “structure”… see what it does
  • summary() see what it does, and note differences to str()
  • head() prints the “head” (first 6 lines) of the data
  • tail() prints the “tail” (last 6 lines) of the data
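If you have no data frame loaded yet, you can try these functions on mtcars, one of the practice datasets that ships with R. It is used here only as a stand-in for your own data.

```r
dim(mtcars)        # 32 11  (rows, columns)
length(mtcars)     # 11     (columns)
ncol(mtcars)       # 11
nrow(mtcars)       # 32

colnames(mtcars)            # all column names
head(rownames(mtcars), 3)   # the first three row names

str(mtcars)            # type and first few values of every column
summary(mtcars$mpg)    # minimum, quartiles, mean, maximum of one column

head(mtcars)           # first 6 rows
tail(mtcars, 3)        # last 3 rows (the second argument sets the count)
```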

Cleaning and wrangling data

Cleaning and wrangling are theoretically separate, but practically intertwined processes.

Controlling data type

Sometimes, when importing a dataset and coercing it into data frame format, R may misinterpret data types. For example, if there is a factor whose levels are labeled by numbers, R will most likely read it as a numeric vector, not as a factor.

In the practice dataset relatives, I have included one such column – it is coded with numbers, but we want it to be treated as a factor.
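You can reproduce the problem without any file. The sketch below reads a few lines of invented .csv text, shows that the number-coded column arrives as integers, and then shows one possible way to request a factor at import time via the colClasses argument of read.csv(). The column names id and group are made up.

```r
# Invented data: 'group' is a category coded as 0/1
csv_text = 'id,group
1,0
2,1
3,1'

# Default import: the coded column is read as a number
toy = read.csv(text = csv_text)
class(toy$group)     # "integer"

# Asking for a factor while importing
toy2 = read.csv(text = csv_text, colClasses = c(group = 'factor'))
class(toy2$group)    # "factor"
levels(toy2$group)   # "0" "1"
```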

relatives = read.csv('relatives.csv', header = TRUE)

str(relatives)

The column which_old is in a binary numeric format. In order to tell R that this is, in fact, a factor, you can do either of two things.

First, you can simply tell R to re-parse the column as a factor:

relatives$which_old =
  as.factor(relatives$which_old)

str(relatives)

Alternatively, you can create a new factor whose levels are labeled by character values. To do this, we use the ifelse() function.

relatives$which_new = 
  as.factor(ifelse(relatives$which_old == 0, "no", "yes"))

str(relatives)

# If you wanted to remove the old version of the factor from your data frame:

# relatives$which_old = NULL
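Here is a self-contained sketch of both approaches on a toy data frame, so that you can run it without the practice file. The values are invented. The last variant, which uses the labels argument of factor(), is an alternative to ifelse() that I am adding for comparison.

```r
toy = data.frame(which_old = c(0, 1, 1, 0))   # invented values

# Option 1: re-interpret the numbers as a factor
toy$as_is = as.factor(toy$which_old)
levels(toy$as_is)        # "0" "1"

# Option 2: new factor with readable labels via ifelse()
toy$relabeled = as.factor(ifelse(toy$which_old == 0, 'no', 'yes'))
levels(toy$relabeled)    # "no"  "yes"

# Option 2b: same result with factor(), which also scales to more than two levels
toy$relabeled2 = factor(toy$which_old, levels = c(0, 1), labels = c('no', 'yes'))

# Both relabeled versions agree
all(as.character(toy$relabeled) == as.character(toy$relabeled2))   # TRUE

str(toy)
```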