Coercion & Vectorization

Now that we’ve introduced atomic vectors, we can demonstrate the interesting ways they are used in R. In particular, we’ll introduce coercion of both type and length. Once we understand coercion, we can then introduce missing values. Lastly, we’ll further discuss using vectors with vectorization in mind.

After reading these notes you should be able to:

Type Coercion

Atomic vectors are homogeneous objects, that is, every element has the same type. So what happens if you try to make an atomic vector from values of two different types?

typeof(42)
[1] "double"
typeof(TRUE)
[1] "logical"

Clearly, 42 is a double vector and TRUE is a logical vector.

c(42, TRUE)
[1] 42  1

Trying to combine them into a single atomic vector produces an interesting result. TRUE has become 1. What’s going on here?

typeof(c(42, TRUE))
[1] "double"

This is our first example of type coercion. Type coercion is the act of changing a vector’s type. In the above example, the logical vector TRUE was coerced to a double vector so that it could be combined with another double vector, 42.

Coercion can occur both explicitly and implicitly. Implicit coercion is often the root cause of errors that you will encounter, so understanding when and how it occurs is important.

Let’s start with explicit coercion. R has several functions for explicit coercion:

  • as.logical(), coerce a vector to be a logical vector.
  • as.integer(), coerce a vector to be an integer vector.
  • as.double(), coerce a vector to be a double vector.
  • as.character(), coerce a vector to be a character vector.

To illustrate their usage, let’s apply each to a vector of the four main types we will encounter.

To Logical

c(
  as.logical(TRUE),    # logical
  as.logical(42L),     # integer
  as.logical(42.3),    # double
  as.logical("string") # character
)
[1] TRUE TRUE TRUE   NA

Notice that in this case, R does not know how to coerce this character string to logical, so it returns NA.¹

It is possible to coerce some very specific strings to logical without producing NA values. For example:

as.logical("TRUE")
[1] TRUE
as.logical("FALSE")
[1] FALSE
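
As a rough sketch of which strings qualify: as far as we know, as.logical() only recognizes a handful of spellings of true and false. The unrecognized strings tried below ("yes", "1", "tRUE") are arbitrary choices for illustration.

```r
# Spellings that as.logical() recognizes
as.logical(c("TRUE", "true", "True", "T"))
# [1] TRUE TRUE TRUE TRUE
as.logical(c("FALSE", "false", "False", "F"))
# [1] FALSE FALSE FALSE FALSE

# Strings that merely look truthy are not recognized and become NA
as.logical(c("yes", "1", "tRUE"))
# [1] NA NA NA
```

Note that the string "1" becomes NA even though the number 1 becomes TRUE.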

For numeric values, 0 is coerced to FALSE and non-zero values are coerced to TRUE.

as.logical(42)
[1] TRUE
as.logical(0)
[1] FALSE

To Integer

c(
  as.integer(TRUE),    # logical
  as.integer(42L),     # integer
  as.integer(42.3),    # double
  as.integer("string") # character
)
Warning: NAs introduced by coercion
[1]  1 42 42 NA

Unlike coercion to logical, here we receive a warning. This is due to the nature of NA values, which we will discuss in the Missing Values section below.

As before, it is possible to coerce some very specific strings to integers without producing NA values. For example:

as.integer("42")
[1] 42

When coercing a double to integer, R simply keeps the integer portion. That is, it truncates toward zero rather than rounding.

as.integer(1.9)
[1] 1
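
To make the difference between truncating and rounding concrete, here is a small sketch. The negative value is included only to show the direction of truncation, and round() is brought in purely for comparison.

```r
# Truncation drops the fractional part, moving toward zero
as.integer(c(1.9, -1.9))
# [1]  1 -1

# round() goes to the nearest whole number, but still returns a double
round(c(1.9, -1.9))
# [1]  2 -2
typeof(round(1.9))
# [1] "double"
```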

When coercing from logical to integer, FALSE becomes 0, and TRUE becomes 1.

as.integer(FALSE)
[1] 0
as.integer(TRUE)
[1] 1

To Double

c(
  as.double(TRUE),    # logical
  as.double(42L),     # integer
  as.double(42.3),    # double
  as.double("string") # character
)
Warning: NAs introduced by coercion
[1]  1.0 42.0 42.3   NA

Coercing to double is rather similar to coercion to integer.

as.double("123")
[1] 123
as.double(FALSE)
[1] 0
as.double(TRUE)
[1] 1

Note that while is.numeric() checks for numeric mode, that is, a type of either integer or double, as.numeric() will always create a double vector.

typeof(as.numeric(1L))
[1] "double"
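
A quick check of both halves of that statement, as a minimal sketch:

```r
is.numeric(1L)    # integer
# [1] TRUE
is.numeric(1.5)   # double
# [1] TRUE
is.numeric("1")   # character
# [1] FALSE

# as.numeric() and as.double() give identical results
identical(as.numeric(1L), as.double(1L))
# [1] TRUE
```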

To Character

c(
  as.character(TRUE),    # logical
  as.character(42L),     # integer
  as.character(42.3),    # double
  as.character("string") # character
)
[1] "TRUE"   "42"     "42.3"   "string"

Coercion to character is the easiest!

Implicit Coercion

So far, we’ve been performing explicit coercion. However, R is a dynamically typed language.² As such, coercion will often happen implicitly.

3 + TRUE
[1] 4

What on earth just happened? Operator coercion! A type of implicit coercion.

You see, you can’t add TRUE to 3. That doesn’t make sense. And R understands this. So, since 3 is a double, which is a number, and addition makes sense on numbers, R first coerces TRUE to double.

3 + as.double(TRUE)
[1] 4

Sometimes we will use this to our advantage. Often, it will happen when you least expect it and it will cause issues.
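
One common way to use implicit coercion to our advantage is counting. The sketch below uses a made-up vector named scores. Because TRUE becomes 1 and FALSE becomes 0, summing a logical vector counts the TRUE values, and averaging it gives the proportion of them.

```r
scores = c(2, 5, 8, 1, 9)   # made-up data for illustration

scores > 3
# [1] FALSE  TRUE  TRUE FALSE  TRUE
sum(scores > 3)    # how many elements are greater than 3?
# [1] 3
mean(scores > 3)   # what proportion are greater than 3?
# [1] 0.6
```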

Some additional examples:

4.2 + 3L
[1] 7.2
log(TRUE)
[1] 0

Sometimes R will give up, thankfully.

42 + "foo"
Error in 42 + "foo" : non-numeric argument to binary operator

Here we see a very common error: supplying a non-numeric argument, in this case a character vector, to a binary operator. The arithmetic operators (+, -, *, etc.) work with numbers. This won’t be the last time you see this error message.
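
If the character value really does hold a number, one possible fix is to coerce it explicitly before doing arithmetic. A minimal sketch, where the string "8" is a made-up example and suppressWarnings() is used only to keep the output short:

```r
# Explicitly coerce the character value first, then add
42 + as.numeric("8")
# [1] 50

# Coercion cannot rescue a string that is not a number:
# the result is NA, and as.numeric() would normally warn
42 + suppressWarnings(as.numeric("foo"))
# [1] NA
```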

How does this work? How can you predict when it will happen? There are a ton of little details, but it mostly comes down to practice and understanding one key idea. The following four examples will help give you some intuition.

c(TRUE, 1L, 4.2, "string") # character result
[1] "TRUE"   "1"      "4.2"    "string"
c(TRUE, 1L, 4.2) # double result
[1] 1.0 1.0 4.2
c(TRUE, 1L) # integer result
[1] 1 1
c(TRUE) # logical result
[1] TRUE

The idea here is that character > double > integer > logical. So if you try to mix a double and a logical, you get a double. Integer and double? Double. Anything and character? Character.
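
The same ordering appears to apply when comparing values with ==, not only when combining them with c(). The sketch below shows the lower-ranked type being coerced to the higher-ranked one before the comparison is made.

```r
1 == "1"          # 1 is coerced to "1" before comparing
# [1] TRUE
TRUE == "TRUE"    # TRUE is coerced to "TRUE"
# [1] TRUE
TRUE == 1         # TRUE is coerced to 1
# [1] TRUE
TRUE == "1"       # TRUE becomes "TRUE", which is not "1"
# [1] FALSE
```

The last line is a good reminder that coercion follows the ordering, not your intent.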

Missing Values

By default, NA is a logical value.

typeof(NA)
[1] "logical"

Then how can we have an NA in the following?

c(1, NA, 3)
[1]  1 NA  3

How does this vector still have type double?

typeof(c(1, NA, 3))
[1] "double"

The answer is that there are actually many NA values.

NA_character_ # character
[1] NA
NA_real_ # double
[1] NA
NA_integer_ # integer
[1] NA
NA # logical
[1] NA

We can verify this, which is necessary since the printed output of each looks the same.

typeof(NA_character_)
[1] "character"
typeof(NA_real_)
[1] "double"
typeof(NA_integer_)
[1] "integer"
typeof(NA)
[1] "logical"

Now we can see them all in action, being coerced.

typeof(c(NA, "foo"))
[1] "character"
typeof(c(NA, 4.2))
[1] "double"
typeof(c(NA, 4L))
[1] "integer"
typeof(c(NA, TRUE))
[1] "logical"

Essentially, NA will often be implicitly coerced to be the type you need.

NA values represent missing values. Think of NA as a placeholder for “I don’t know.”

NA + 1
[1] NA
42 / NA
[1] NA
NA * 4
[1] NA
log(NA)
[1] NA
sum(c(1:10, NA))
[1] NA

Why are all of these also NA? Because one plus “I don’t know” equals “I don’t know.”
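
The same logic explains why you cannot find missing values with ==. As a brief sketch that goes slightly beyond what is covered above, is.na() tests for missingness, and the na.rm argument of sum() drops missing values before summing.

```r
x = c(1:10, NA)

NA == NA                # comparing two unknowns is also unknown
# [1] NA
sum(is.na(x))           # is.na() identifies missing values; here there is one
# [1] 1
sum(x, na.rm = TRUE)    # remove the missing value before summing
# [1] 55
```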

Vectorization

Many operations in R are vectorized. Vectorized operations are essentially operations that are performed in an element by element fashion.

For example, consider two vectors of the same length.

(x = 1:5)
[1] 1 2 3 4 5
(y = 5:1)
[1] 5 4 3 2 1

Wrapping the assignment in parentheses still performs the assignment, but also immediately prints the value that was assigned.

Let’s try adding these two vectors.

x + y
[1] 6 6 6 6 6

What has occurred here is element by element addition. Essentially the following:

c(1 + 5, 2 + 4, 3 + 3, 4 + 2, 5 + 1)
[1] 6 6 6 6 6

The addition (+), subtraction (-), multiplication (*), division (/), and exponentiation (^) operators are vectorized.

x + y
[1] 6 6 6 6 6
x - y
[1] -4 -2  0  2  4
x * y
[1] 5 8 9 8 5
x / y
[1] 0.2 0.5 1.0 2.0 5.0
x ^ y
[1]  1 16 27 16  5

Additionally, many functions are applied to each element of a vector.

log(x)
[1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379
sqrt(y)
[1] 2.236068 2.000000 1.732051 1.414214 1.000000
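
To see what vectorization saves us from writing, here is a sketch of the same element by element addition as an explicit loop. The name result is just an illustrative choice.

```r
x = 1:5
y = 5:1

# Element-by-element addition written as an explicit loop
result = integer(length(x))
for (i in seq_along(x)) {
  result[i] = x[i] + y[i]
}
result
# [1] 6 6 6 6 6

# The vectorized form gives the same answer in one expression
identical(result, x + y)
# [1] TRUE
```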

Length Coercion

We’ve seen how operators can silently perform type coercion, but often, they will perform length coercion as well. Let’s start with the vector containing the numbers 1 through 5.

1:5
[1] 1 2 3 4 5

Now let’s try adding two to each element of this vector:

1:5 + 2
[1] 3 4 5 6 7

Seems like this did the job, right? Yes, but don’t be fooled. What’s really happening is some length coercion. The addition operator that we are attempting to use here expects the vectors on the left and right of the operator to be of equal length. But clearly they are not. So how does it work?

R will force whichever vector is shorter to match the length of the longer vector through a process called recycling. Essentially, R will repeat the shorter vector until it is the same length as the longer vector.

So, what’s really happening above is the following:

1:5 + c(2, 2, 2, 2, 2)
[1] 3 4 5 6 7

Simple enough when dealing with a length one vector. Let’s look at something more interesting.

1:6 + c(0, 5)
[1]  1  7  3  9  5 11

What happened this time? Now we have vectors of length six and two, so the length two vector needs to be recycled a few times.

1:6 + c(c(0, 5), c(0, 5), c(0, 5))
[1]  1  7  3  9  5 11

As an aside, that’s a bit of a pain to type. Thankfully, there is the rep() function.³ The following both create the same vector.

c(c(0, 5), c(0, 5), c(0, 5))
[1] 0 5 0 5 0 5
rep(x = c(0, 5), times = 3)
[1] 0 5 0 5 0 5

OK, so far so good. But what if the vectors have some unfortunate lengths, like three and ten?

1:3 + 1:10
Warning in 1:3 + 1:10: longer object length is not a multiple of shorter object
length
 [1]  2  4  6  5  7  9  8 10 12 11

How do you repeat something of length three to get to a length ten vector? Well, you try your best. First, notice that the above creates a warning, not an error. This means the code ran, but R is letting you know something weird happened.

The following will produce the same result, but without the warning. Notice that on the left-hand side we are doing the recycling manually. To make the lengths compatible, we had to do some partial recycling.

c(1:3, 1:3, 1:3, 1) + 1:10
 [1]  2  4  6  5  7  9  8 10 12 11
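
The manual partial recycling can also be written with rep(), using its length.out argument to stop partway through a repetition. A minimal sketch:

```r
# Repeat 1:3 until the result has exactly ten elements
rep(1:3, length.out = 10)
#  [1] 1 2 3 1 2 3 1 2 3 1

# Same sum as before, with the recycling made explicit and no warning
rep(1:3, length.out = 10) + 1:10
#  [1]  2  4  6  5  7  9  8 10 12 11
```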

What if one of the vectors is length zero? R will make both length zero!

integer(0) + 1:10
integer(0)

Because NULL has length zero, it functions in a similar manner, with some added type coercion.

NULL + 1
numeric(0)

Much like implicit type coercion, length coercion often occurs implicitly and silently. It will be the source of many future frustrations!
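
If silent recycling worries you, one option is to check lengths yourself before operating. The function add_strict() below is a hypothetical helper written for this illustration, not part of R. It is a sketch of one possible policy: allow equal lengths or a length-one vector, and refuse everything else.

```r
# Hypothetical helper: add two vectors, refusing to recycle anything
# other than a length-one vector
add_strict = function(a, b) {
  same_length = length(a) == length(b)
  has_scalar = length(a) == 1 || length(b) == 1
  if (!same_length && !has_scalar) {
    stop("lengths differ: ", length(a), " vs ", length(b))
  }
  a + b
}

add_strict(1:5, 2)   # a length-one vector is still recycled
# [1] 3 4 5 6 7
add_strict(1:5, 5:1) # equal lengths are fine
# [1] 6 6 6 6 6
```

With this policy, add_strict(1:6, c(0, 5)) and add_strict(1:3, 1:10) both stop with an error instead of recycling.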

Summary

  • TODO: You learned a lot in this chapter…
  • TODO: these are fundamental / core ideas that you will likely want to return to often…

What’s Next?

  • TODO: generic vectors!
  • TODO: class

Footnotes

  1. It does so without a warning because NA is of logical type by default. This will not be true for some other coercions.↩︎

  2. There are also statically typed languages. Computer Science educators often prefer a statically typed language, like C or Java, as a first language instead of a dynamically typed language, like R or Python. As we discuss implicit coercion, you might come to agree with them.↩︎

  3. Thankfully, we also wouldn’t actually do the recycling. R will do it for us. It’s a pain to type because you shouldn’t type it.↩︎