Subsetting

After reading these notes you should be able to:

Subsetting Operators and Types

R has three subsetting operators: [ (single bracket), [[ (double bracket), and $ (dollar sign). Their behavior depends on the type of object they are applied to.

Often, but very much not always, they will be used as follows:

  • [: Create a subset that is the same type as the object being subset.
  • [[ and $: Extract a single element, which could be a different type than the object being subset.
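As a quick preview of the distinction in the list above, here is a minimal sketch using a small list. The name items and its contents are made up for illustration.

```r
# A small named list (hypothetical example data)
items = list(a = 1:3, b = "text")

sub_single = items["a"]    # single bracket: a list of length one
sub_double = items[["a"]]  # double bracket: the element itself
sub_dollar = items$a       # dollar sign: also the element itself

typeof(sub_single)                 # "list", the same type as items
typeof(sub_double)                 # "integer", the type of the element
identical(sub_double, sub_dollar)  # TRUE
```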

Additionally, these operators can often be combined with one of the six types of subsetting allowed in R:

  • Positive integer vectors
  • Negative integer vectors
  • Logical vectors
  • Empty
  • Zero valued
  • Character vectors (Object names)

We’ll demonstrate these with each of the three key objects that we have discussed so far: atomic vectors, lists, and data frames. Recall, each of these is a vector.

Atomic Vectors

Let’s start with possibly the most important, using the single bracket with atomic vectors. We’ll demonstrate each of the six types.

To demonstrate, we’ll start with a simple atomic vector x.

x = 10.1:1.1
x
 [1] 10.1  9.1  8.1  7.1  6.1  5.1  4.1  3.1  2.1  1.1
typeof(x)
[1] "double"

Single Bracket, Positive Integer

First, we’ll demonstrate using a vector of integers for subsetting. Note that any numeric vector used to subset is coerced to be integer.

x[c(3, 2, 1)]
[1]  8.1  9.1 10.1
x[c(1, 2, 4)]
[1] 10.1  9.1  7.1
x[10:1]
 [1]  1.1  2.1  3.1  4.1  5.1  6.1  7.1  8.1  9.1 10.1

In each of the above, an atomic vector with the same type as x and the same length as the vector used to subset is returned. The elements of the vector returned correspond to the elements of the original vector at the indexes of the integers supplied.
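The coercion to integer mentioned above truncates toward zero rather than rounding. A small sketch, reusing the same x, illustrates this:

```r
x = 10.1:1.1

x[2.9]            # same as x[2]: the fractional part is dropped, not rounded
x[c(1.2, 3.999)]  # same as x[c(1, 3)]

identical(x[2.9], x[2])  # TRUE
```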

Note that you can repeat integers.

x[c(1, 1, 1, 10, 10, 10)]
[1] 10.1 10.1 10.1  1.1  1.1  1.1

Single Bracket, Negative Integer

Negative integers can be used to remove indexes from the original vector.

x[-1]
[1] 9.1 8.1 7.1 6.1 5.1 4.1 3.1 2.1 1.1
x[-10]
[1] 10.1  9.1  8.1  7.1  6.1  5.1  4.1  3.1  2.1
x[-c(1, 10)]
[1] 9.1 8.1 7.1 6.1 5.1 4.1 3.1 2.1
x[c(-1, -10)]
[1] 9.1 8.1 7.1 6.1 5.1 4.1 3.1 2.1

Note that you cannot mix positive and negative integers.

x[c(1, -10)]
Error in x[c(1, -10)] : only 0's may be mixed with negative subscripts

Single Bracket, Logical

Perhaps the most useful, logical subsetting allows us to use a logical vector of the same length as the vector being subset. It returns the elements at the same indexes as the TRUE values.

x[c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)]
[1] 10.1  8.1  6.1  4.1  2.1
x[c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)]
[1] 10.1  9.1  8.1  7.1  6.1
x[c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)]
[1] 10.1
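In practice, the logical vector is usually produced by a comparison rather than typed out by hand. The sketch below assumes the same x as above; the name keep is only for illustration.

```r
x = 10.1:1.1

keep = x > 5       # a logical vector with the same length as x
x[keep]            # the elements of x that are greater than 5

x[x > 5 & x < 8]   # conditions can be combined before subsetting
```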

If you do not supply a logical vector of the same length, expect recycling.

x[c(TRUE, FALSE)]
[1] 10.1  8.1  6.1  4.1  2.1
x[c(FALSE, TRUE)]
[1] 9.1 7.1 5.1 3.1 1.1

But beware: R will not warn you if the length of the logical vector does not evenly divide the length of the vector being subset.

x[c(FALSE, TRUE, FALSE)]
[1] 9.1 6.1 3.1
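To see which positions were kept, we can spell out the recycling by hand. This is only a sketch of the idea; rep_len() and which() are used here to mimic what the subsetting appears to do, and the names pattern and recycled are illustrative.

```r
x = 10.1:1.1
pattern = c(FALSE, TRUE, FALSE)

# Repeat the pattern until it has length 10, cutting off the last partial copy
recycled = rep_len(pattern, length(x))
which(recycled)   # positions 2, 5, and 8

identical(x[pattern], x[which(recycled)])  # TRUE
```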

A missing value in the logical vector will create a missing value in the result.

x[c(NA, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)]
[1]  NA 8.1 6.1 4.1 2.1
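If the missing values are not wanted in the result, one common approach is to wrap the logical vector in which(), which returns only the positions of the TRUE values. A minimal sketch, with keep as an illustrative name:

```r
x = 10.1:1.1
keep = c(NA, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)

x[keep]         # the NA in keep becomes an NA in the result
x[which(keep)]  # which() ignores NA and FALSE, so no NA appears
```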

Single Bracket, Nothing

If you use the brackets with nothing, R will return the entire vector. This might seem useless, but we will demonstrate its power later.

x[]
 [1] 10.1  9.1  8.1  7.1  6.1  5.1  4.1  3.1  2.1  1.1

Single Bracket, Zero

Subsetting using 0 returns a vector of length zero with the same type as the vector being subset.

x[0]
numeric(0)

This is the same as subsetting with NULL.

x[NULL]
numeric(0)

Single Bracket, Character

Using a character vector to subset will only work if the vector being subset has names.

x["foo"]
[1] NA
x_named = c(a = 1, b = 2, c = 3)
x_named["b"]
b 
2 
x_named[c("a", "c")]
a c 
1 3 

Double Bracket

In general, single brackets return an object of the same type with some number of elements, while double brackets are said to extract a single element.

This can sometimes be hard to notice with atomic vectors.

x
 [1] 10.1  9.1  8.1  7.1  6.1  5.1  4.1  3.1  2.1  1.1

Recall our example vector. Now let’s subset using an integer with both single and double brackets.

x[2]
[1] 9.1
x[[2]]
[1] 9.1

What’s the difference between the code examples above? In this case, nothing.

x_named
a b c 
1 2 3 

Let’s try with a named vector.

x_named[2]
b 
2 
x_named[[2]]
[1] 2

Here, there is a subtle difference. The former preserves the names, while the latter does not. This is because the double bracket is only extracting the element. It retains none of the information about the original vector, in this case, the names.1

With atomic vectors, double brackets can only be used with a positive integer (an index) or a character vector (a name) of length one.2

x_named[["c"]]
[1] 3

Dollar Sign

The dollar sign operator, $, cannot be used with atomic vectors.
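A short sketch of what happens if we try anyway. Here tryCatch() is used only so that the example keeps running after the error; the string "error" is a placeholder we chose, not R's actual message.

```r
x_named = c(a = 1, b = 2, c = 3)

# Attempt $ on an atomic vector and capture the error instead of stopping
result = tryCatch(x_named$b, error = function(e) "error")
result           # "error"

x_named[["b"]]   # double brackets with the name work instead
```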

Lists

Much of subsetting a list is done in a very similar fashion to atomic vectors. However, because with single brackets the object returned is a list, sometimes this creates confusion.

y = list(a = 1:10,
         b = "Hello, World!",
         c = log,
         d = list(a = 1, b = "z"))

Single Bracket

Each of the six types of subsetting using a single bracket also works with lists.

y[1:2]
$a
 [1]  1  2  3  4  5  6  7  8  9 10

$b
[1] "Hello, World!"
y[-1]
$b
[1] "Hello, World!"

$c
function (x, base = exp(1))  .Primitive("log")

$d
$d$a
[1] 1

$d$b
[1] "z"
y[c(TRUE, FALSE, TRUE, FALSE)]
$a
 [1]  1  2  3  4  5  6  7  8  9 10

$c
function (x, base = exp(1))  .Primitive("log")
y[]
$a
 [1]  1  2  3  4  5  6  7  8  9 10

$b
[1] "Hello, World!"

$c
function (x, base = exp(1))  .Primitive("log")

$d
$d$a
[1] 1

$d$b
[1] "z"
y[0]
named list()
y[c("a", "c")]
$a
 [1]  1  2  3  4  5  6  7  8  9 10

$c
function (x, base = exp(1))  .Primitive("log")

Notice, for each, a list is returned. Here a single bracket preserves the list type.3

Where this might cause confusion is a subset using a single bracket that returns a list of length one.

y[1]
$a
 [1]  1  2  3  4  5  6  7  8  9 10

The important thing to note here: This is a length one list. It is not simply the atomic vector contained in the first element. It is a list containing that atomic vector.

Double Bracket

If you want to extract a particular element of a list, this is done with double brackets.

y[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10

The result here is not a list, but the first element of the list itself, which in this case is an atomic vector.
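The difference between the length one list and the extracted element matters as soon as we try to compute with the result. A minimal sketch, using a shortened version of y and illustrative names:

```r
y = list(a = 1:10, b = "Hello, World!")

one_element_list = y[1]   # still a list
element = y[[1]]          # the integer vector itself

is.list(one_element_list)  # TRUE
length(one_element_list)   # 1
is.list(element)           # FALSE
length(element)            # 10

sum(element)               # 55; sum(one_element_list) would be an error
```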

y[[1:2]]
[1] 2

What happened here? This is equivalent to the following:

y[[1]][[2]]
[1] 2

Extract the first element of the list, then extract the second element of the extracted element.
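The same recursive behavior applies to names. The sketch below uses a shortened version of y that keeps only the elements a and d.

```r
y = list(a = 1:10, d = list(a = 1, b = "z"))

y[[c("d", "b")]]   # same as y[["d"]][["b"]]
y[[c(2, 1)]]       # same as y[[2]][[1]]
```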

Dollar Sign

The dollar sign operator is essentially a shortcut to using double brackets for a named list.

y[["a"]]
 [1]  1  2  3  4  5  6  7  8  9 10
y$a
 [1]  1  2  3  4  5  6  7  8  9 10

As such, it also extracts the element. It does not return a list, unless of course the element you’re extracting is itself a list.

y$d
$a
[1] 1

$b
[1] "z"
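One practical difference between the two: the dollar sign takes the name literally, while double brackets evaluate whatever is inside them. A small sketch, where element_name is a hypothetical variable holding a name:

```r
y = list(a = 1:10, b = "Hello, World!")

element_name = "a"   # hypothetical variable holding the name we want

y[[element_name]]    # looks up the element named "a"
y$element_name       # looks for an element literally named "element_name": NULL
```

So when the name is stored in a variable, double brackets are the safer choice.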

Data Frames

Recall, data frames are also vectors, and in particular, lists.

z = data.frame(
  a = 5:1,
  b = rep("a", times = 5),
  c = c(TRUE, FALSE, TRUE, FALSE, TRUE),
  d = c(1, 1, 1, 1, 1)
)

As such, everything that applies to a list also applies to a data frame. Just think of it as a list with named elements.

z[1:2]
  a b
1 5 a
2 4 a
3 3 a
4 2 a
5 1 a

It looks like this is something different, subsetting columns, but remember that the elements of the data frame are the elements of a list. It just so happens that we interpret them as columns.

z[-1]
  b     c d
1 a  TRUE 1
2 a FALSE 1
3 a  TRUE 1
4 a FALSE 1
5 a  TRUE 1
z[c(TRUE, FALSE, TRUE, FALSE)]
  a     c
1 5  TRUE
2 4 FALSE
3 3  TRUE
4 2 FALSE
5 1  TRUE
z[]
  a b     c d
1 5 a  TRUE 1
2 4 a FALSE 1
3 3 a  TRUE 1
4 2 a FALSE 1
5 1 a  TRUE 1
z[0]
data frame with 0 columns and 5 rows
z[c("a", "c")]
  a     c
1 5  TRUE
2 4 FALSE
3 3  TRUE
4 2 FALSE
5 1  TRUE

The one oddity here is the use of 0 to subset.

z[0]
data frame with 0 columns and 5 rows

Note that this suggests the data frame still has five rows. This is due to the preserving nature of single brackets. But importantly, this object is still length zero.

length(z[0])
[1] 0
nrow(z[0])
[1] 5
ncol(z[0])
[1] 0

Double brackets also remain the same.

z[[2]]
[1] "a" "a" "a" "a" "a"
z[["d"]]
[1] 1 1 1 1 1

And again, the dollar sign operates the same as well.

z$a
[1] 5 4 3 2 1
z$b
[1] "a" "a" "a" "a" "a"

Rows and Columns

The interesting addition to subsetting methods for data frames involves an addition to the single bracket syntax. Like other single bracket operations, it will mostly return a data frame. In general, the syntax is:

some_df[rows, cols]

Let’s look at some examples.

z
  a b     c d
1 5 a  TRUE 1
2 4 a FALSE 1
3 3 a  TRUE 1
4 2 a FALSE 1
5 1 a  TRUE 1

Recall the data frame we had assigned the name z.

z[1:2, ]
  a b     c d
1 5 a  TRUE 1
2 4 a FALSE 1

Here, we are subsetting the original data frame to only the first two rows. By leaving a blank after the comma, we get all of the columns.

z[, 3:4]
      c d
1  TRUE 1
2 FALSE 1
3  TRUE 1
4 FALSE 1
5  TRUE 1

Here, we leave a blank before the comma, so we get all rows, but only the third and fourth columns.

We can also put these together:

z[c(1, 4), 3:4]
      c d
1  TRUE 1
4 FALSE 1

Since we’re using single brackets, we can also use negative integers and more.

z[-1, ] # everything except the first row
  a b     c d
2 4 a FALSE 1
3 3 a  TRUE 1
4 2 a FALSE 1
5 1 a  TRUE 1
z[, -4] # everything except the fourth column
  a b     c
1 5 a  TRUE
2 4 a FALSE
3 3 a  TRUE
4 2 a FALSE
5 1 a  TRUE
z[-1, -4] # exclude the first row and fourth column
  a b     c
2 4 a FALSE
3 3 a  TRUE
4 2 a FALSE
5 1 a  TRUE
z[c(TRUE, TRUE, TRUE, FALSE, FALSE), ] # subset to first three rows
  a b     c d
1 5 a  TRUE 1
2 4 a FALSE 1
3 3 a  TRUE 1
z[0, 0]
data frame with 0 columns and 0 rows
z[1:3, c("a", "d")]
  a d
1 5 1
2 4 1
3 3 1
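Logical vectors are especially handy in the row position, since a condition on a column can be used to pick rows. A minimal sketch with the same z as defined earlier:

```r
z = data.frame(
  a = 5:1,
  b = rep("a", times = 5),
  c = c(TRUE, FALSE, TRUE, FALSE, TRUE),
  d = c(1, 1, 1, 1, 1)
)

z[z$c, ]                  # rows where column c is TRUE, all columns
z[z$a > 2, c("a", "c")]   # rows where a is greater than 2, columns a and c only
```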

Beware! The following breaks a rule we’ve seen so far:

z[, 1]
[1] 5 4 3 2 1

You may have hoped this returned a data frame. However, because only one column was selected, R has simplified the result to a vector. To avoid this behavior:

z[, 1, drop = FALSE]
  a
1 5
2 4
3 3
4 2
5 1

This behavior can cause trouble since you can’t always predict it. Later, we’ll introduce tibbles which are a more-or-less drop-in replacement for data frames that avoid this behavior.
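To make the unpredictability concrete, the sketch below checks the class of a few results. The variable cols is hypothetical; it stands in for a column selection computed elsewhere, whose length we might not know in advance.

```r
z = data.frame(
  a = 5:1,
  b = rep("a", times = 5),
  c = c(TRUE, FALSE, TRUE, FALSE, TRUE),
  d = c(1, 1, 1, 1, 1)
)

class(z[, 1])                # "integer": one column, simplified to a vector
class(z[, 1:2])              # "data.frame": two columns, not simplified
class(z[, 1, drop = FALSE])  # "data.frame"

cols = 1                                # hypothetical selection of unknown length
is.data.frame(z[, cols, drop = FALSE])  # TRUE regardless of length(cols)
```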

Preserving versus Simplifying

A theme has emerged. Until this recent exception, single brackets were a preserving operation. That is, they return an object of the same type and keep attributes.4

z
  a b     c d
1 5 a  TRUE 1
2 4 a FALSE 1
3 3 a  TRUE 1
4 2 a FALSE 1
5 1 a  TRUE 1
z[0]
data frame with 0 columns and 5 rows
attributes(z[0])
$names
character(0)

$row.names
[1] 1 2 3 4 5

$class
[1] "data.frame"

In contrast, double brackets and dollar signs are simplifying operations. They extract an individual element and do not keep attributes.

typeof(z)
[1] "list"
attributes(z)
$names
[1] "a" "b" "c" "d"

$class
[1] "data.frame"

$row.names
[1] 1 2 3 4 5
z[[1]]
[1] 5 4 3 2 1
typeof(z[[1]])
[1] "integer"
attributes(z[[1]])
NULL

The following table summarizes what we have seen.

Type            Simplifying    Preserving
Atomic Vector   x[[1]]         x[1]
List            x[[1]]         x[1]
Data Frame      x[[1]]         x[1]
Data Frame      x[, 1]         x[, 1, drop = FALSE]

Subset and Replace

If we mix subsetting and assignment, we can replace elements.

x
 [1] 10.1  9.1  8.1  7.1  6.1  5.1  4.1  3.1  2.1  1.1
x[c(1, 3, 5)] = c(42, 42, 42)
x
 [1] 42.0  9.1 42.0  7.1 42.0  5.1  4.1  3.1  2.1  1.1

We could do something like the above, but also utilize recycling.

x[c(8, 9, 10)] = 0
x
 [1] 42.0  9.1 42.0  7.1 42.0  5.1  4.1  0.0  0.0  0.0
y
$a
 [1]  1  2  3  4  5  6  7  8  9 10

$b
[1] "Hello, World!"

$c
function (x, base = exp(1))  .Primitive("log")

$d
$d$a
[1] 1

$d$b
[1] "z"
y[["d"]] = 5:1
y
$a
 [1]  1  2  3  4  5  6  7  8  9 10

$b
[1] "Hello, World!"

$c
function (x, base = exp(1))  .Primitive("log")

$d
[1] 5 4 3 2 1
z
  a b     c d
1 5 a  TRUE 1
2 4 a FALSE 1
3 3 a  TRUE 1
4 2 a FALSE 1
5 1 a  TRUE 1
z$a = 42
z
   a b     c d
1 42 a  TRUE 1
2 42 a FALSE 1
3 42 a  TRUE 1
4 42 a FALSE 1
5 42 a  TRUE 1
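Replacement also combines with logical subsetting, which lets us change only the elements that satisfy a condition. A minimal sketch, starting again from the original x:

```r
x = 10.1:1.1

x[x < 5] = 0   # replace every element smaller than 5 with 0
x
```

The single 0 on the right is recycled to fill every selected position.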

This is where empty subsetting can become useful.

foo = 1:10
foo[] = 42
foo
 [1] 42 42 42 42 42 42 42 42 42 42

Here, we’ve replaced all elements with the value 42. The empty subsetting allows us to do this, as foo = 42 would simply assign the name foo to the object 42.
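The two assignments can be compared side by side. In this sketch, bar is a second, made-up name used only for the contrast.

```r
foo = 1:10
foo[] = 42    # replace every element, keeping the length
length(foo)   # 10

bar = 1:10
bar = 42      # rebind the name to a new length one vector
length(bar)   # 1
```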

We can also use more interesting subsets, for example with data frames.

z[2, ] = data.frame(a = 0, b = "z", c = FALSE, d = 42)
z
   a b     c  d
1 42 a  TRUE  1
2  0 z FALSE 42
3 42 a  TRUE  1
4 42 a FALSE  1
5 42 a  TRUE  1

Notice we have to be careful here. We’re attempting to replace rows, but because rows span multiple columns, hence multiple types, we need to make sure those types are present in the replacement object. In other words, a row of a data frame is a data frame, so we need to replace it with a data frame.

Or, we could deal with a lot of coercion.

str(z)
'data.frame':   5 obs. of  4 variables:
 $ a: num  42 0 42 42 42
 $ b: chr  "a" "z" "a" "a" ...
 $ c: logi  TRUE FALSE TRUE FALSE TRUE
 $ d: num  1 42 1 1 1
z[2, ] = c(a = 0, b = "z", c = FALSE, d = 42)
z
   a b     c  d
1 42 a  TRUE  1
2  0 z FALSE 42
3 42 a  TRUE  1
4 42 a FALSE  1
5 42 a  TRUE  1
str(z)
'data.frame':   5 obs. of  4 variables:
 $ a: chr  "42" "0" "42" "42" ...
 $ b: chr  "a" "z" "a" "a" ...
 $ c: chr  "TRUE" "FALSE" "TRUE" "FALSE" ...
 $ d: chr  "1" "42" "1" "1" ...

Why did coercion happen here? Hint: Remember how atomic vectors work.5
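One way to investigate is to look at the replacement value on its own. The sketch below also shows a list as the replacement, which, like a data frame, can hold a different type in each position. The name row_vector is illustrative, and z is rebuilt from its original definition.

```r
row_vector = c(a = 0, b = "z", c = FALSE, d = 42)
typeof(row_vector)   # "character": c() creates an atomic vector with one type

z = data.frame(
  a = 5:1,
  b = rep("a", times = 5),
  c = c(TRUE, FALSE, TRUE, FALSE, TRUE),
  d = c(1, 1, 1, 1, 1)
)

# A list keeps each value's own type, so the columns are not turned into character
z[2, ] = list(0, "z", FALSE, 42)
str(z)
```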

Lastly, if you replace an element with NULL, it will be removed.

y
$a
 [1]  1  2  3  4  5  6  7  8  9 10

$b
[1] "Hello, World!"

$c
function (x, base = exp(1))  .Primitive("log")

$d
[1] 5 4 3 2 1
y[3:4] = NULL
y
$a
 [1]  1  2  3  4  5  6  7  8  9 10

$b
[1] "Hello, World!"
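If the goal is to keep an element but set its value to NULL, rather than remove it, one option is to assign list(NULL) using single brackets. A minimal sketch with a shortened y:

```r
y = list(a = 1:10, b = "Hello, World!")

y["b"] = list(NULL)   # keep the element b, but make its value NULL
length(y)             # 2

y[["a"]] = NULL       # assigning NULL removes the element a
length(y)             # 1
names(y)              # "b"
```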

Summary

  • TODO: You’ve learned to…

What’s Next?

  • TODO: programming. logical and Boolean operators. how they are super useful for subsetting.

TODO

  • TODO: out of bounds?
  • TODO: visual explanations of object types and their subsetting

Footnotes

  1. More generally, attributes.

  2. A single logical value will appear to work, but it is really first being coerced to integer.

  3. Other options might simplify.

  4. This is why there are still five rows in the odd example we saw.

  5. Rows of data frames are not atomic vectors. Columns of data frames are (most often) atomic vectors.