Session 1: Wednesday, 10 a.m. - 12 p.m.

2. Goals of the workshop

Following this workshop, you will be able to:

  • Systematically retrieve social media data from the web and thus build a corpus for linguistic study,
  • Prepare your corpus by cleaning and ordering it,
  • Mine your corpus for linguistic phenomena that you may want to study,
  • Create a dataset with numeric information about the phenomenon you are studying,
  • Conduct statistical analyses of your data,
  • Visualize your findings,
  • Report your statistics.

In this first session, we will get acquainted with R.

R is software that you interact with via a programming language of the same name. So, in short, you can say that R is a programming language.

As with any other programming language, the logic of interacting with R is that you give it input, and it gives you output. Or: you enter expressions and R evaluates them.


3. R as calculator

Many things are straightforward. You can use R as a calculator, for example. What happens if you enter this in R (and then hit Return)?

4 + 2

Try to have R perform a few other simple mathematical operations, e.g.

  • six times eleven
  • six point nine times eleven
  • two minus three
  • four divided by two

Notice that you must of course figure out how to communicate these mathematical operations to R in a way that it understands. If you use words instead of numbers, you may not succeed.
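For reference, the four operations listed above could be entered as follows. This is only one possible spelling, using R's standard arithmetic operators; the comments after # are ignored by R.

```r
# Arithmetic in R is written with symbols, not words
6 * 11     # six times eleven
6.9 * 11   # six point nine times eleven
2 - 3      # two minus three
4 / 2      # four divided by two
```

Note that multiplication is written with an asterisk (*) and division with a forward slash (/).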


4. Assigning variables

The most fundamental functionality in computer programming is assigning variables, that is to say, linking content to placeholders. In the following example, x and y are variable names, and each is assigned numeric content.

x = 7
y = 2

Once you have assigned variables, the variables can be used just like their actual content. In other words: you can

  • add x to y,
  • divide y by x,
  • divide x by 2 and then add y,
  • etc.

Try those operations on your own.
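As a check on your own attempts, here is a minimal sketch of the three operations listed above, assuming x and y still hold the values 7 and 2.

```r
x = 7
y = 2

x + y       # add x to y
y / x       # divide y by x
x / 2 + y   # divide x by 2 and then add y
```

In the last line, R follows the usual order of operations: the division is carried out before the addition.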


5. Data types

Above, we have assigned numeric variables. But variables can easily be of a different type. For example, x can have a word as its content. Notice that text content needs to be written in quotation marks in order to be understood.

x = 'alfred'
x
## [1] "alfred"
y = 'mildred'
y
## [1] "mildred"

But what happens when we try to add x and y? Try it.

x + y
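Adding two strings produces an error, because the + operator is only defined for numbers. The sketch below is an illustration rather than part of the workshop material: it uses class() to show the data type of x, wraps the addition in try() so that the error is printed without stopping the script, and uses paste() as one possible way to join two strings.

```r
x = 'alfred'
y = 'mildred'

# class() reports the data type of a variable
class(x)

# x + y fails; try() prints the error message and lets the script continue
try(x + y)

# paste() joins strings, separated by a space by default
joined = paste(x, y)
joined
```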

There are several other data types besides numeric and text (“string”) variables. At this point we will introduce only one more: so-called Booleans.

First, try this:

a = 3.5
b = 4
a + b

What happens when you do this

a = b

and then add a + b?

What happened is that we assigned the value of b to a.

This is important because it tells us about the nature of the equal sign (=) in R. It does not mean equal in the usual mathematical sense. It is an operator that assigns content to a variable, in right-to-left order.
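The assignment above can be retraced step by step. The sketch below also shows the arrow operator (<-), which is the other common way of writing an assignment in R; the variable name d is only an illustration and does not appear elsewhere in this workshop.

```r
a = 3.5
b = 4

a = b    # the value of b (4) is copied into a; b itself is unchanged
a + b    # now 4 + 4

# The arrow operator performs the same kind of assignment
d <- b
d
```

You will see both = and <- in R code written by others; at the top level of a script they do the same thing.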

There is, however, another equal sign: the “double equal” sign (==). It tests whether two values are equal, and the answer is a Boolean: either TRUE or FALSE.

# FIRST TEST OF THE == SIGN
a = 9
b = 12
a + b
## [1] 21
a == b
## [1] FALSE
# SECOND TEST
a = b
a + b
## [1] 24
a == b
## [1] TRUE
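
The outcome of a comparison is itself a value that can be stored in a variable. The following sketch, with the illustrative variable name result, shows that R calls the Boolean type "logical" and that other comparison operators work in the same way as ==.

```r
a = 9
b = 12

result = a == b   # store the outcome of the comparison
result
class(result)     # "logical" is R's name for the Boolean type

a != b   # is a not equal to b?
a < b    # is a less than b?
```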

6. Who Hadley Wickham is

Hadley is a statistician, and a regular man, with admirers who think he is from a higher sphere, and who live in the “Hadleyverse.”

[Image: Hadley]

[Image: Hadley as Obama poster]

Hadley grew up in New Zealand, and has the accent to prove it (look him up on YouTube). He now lives in Houston, TX, where he used to be a professor of statistics at Rice University. He’s not a regular professor anymore, because he doesn’t need to be. What did Hadley do to become this famous?

In everything he does, he tries to make R much, much easier to use by writing packages that simplify core functions of the data science workflow.

Hadley wrote packages for R that have become very widely used, such as

  • ggplot2
  • dplyr
  • tidyr

and many others. He also wrote books about writing packages for R; they are all freely and legally available in PDF format online. In addition, he is the “chief scientist” at RStudio.


7. Setting up a working environment

RStudio

  • the console
  • scripts
  • data files
  • viewing options