In a single pipeline for each condition, find all flights that meet the condition:
The following libraries will be needed:
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.3 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.3 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Had an arrival delay of two or more hours
The column arr_delay
looks like the column to use but to confirm what type of data it contains we can see the package documentation with the following code:
arr_delay
is in minutes and the minus time shows when the departures were early so code would be:
flights |>
filter (arr_delay > 120 )
# A tibble: 10,034 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 811 630 101 1047 830
2 2013 1 1 848 1835 853 1001 1950
3 2013 1 1 957 733 144 1056 853
4 2013 1 1 1114 900 134 1447 1222
5 2013 1 1 1505 1310 115 1638 1431
6 2013 1 1 1525 1340 105 1831 1626
7 2013 1 1 1549 1445 64 1912 1656
8 2013 1 1 1558 1359 119 1718 1515
9 2013 1 1 1732 1630 62 2028 1825
10 2013 1 1 1803 1620 103 2008 1750
# ℹ 10,024 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Flew to Houston (IAH
or HOU
)
Looking at the column data with unique()
which is a base R function to show the distinct values:
[1] "IAH" "MIA" "BQN" "ATL" "ORD" "FLL" "IAD" "MCO" "PBI" "TPA" "LAX" "SFO"
[13] "DFW" "BOS" "LAS" "MSP" "DTW" "RSW" "SJU" "PHX" "BWI" "CLT" "BUF" "DEN"
[25] "SNA" "MSY" "SLC" "XNA" "MKE" "SEA" "ROC" "SYR" "SRQ" "RDU" "CMH" "JAX"
[37] "CHS" "MEM" "PIT" "SAN" "DCA" "CLE" "STL" "MYR" "JAC" "MDW" "HNL" "BNA"
[49] "AUS" "BTV" "PHL" "STT" "EGE" "AVL" "PWM" "IND" "SAV" "CAK" "HOU" "LGB"
[61] "DAY" "ALB" "BDL" "MHT" "MSN" "GSO" "CVG" "BUR" "RIC" "GSP" "GRR" "MCI"
[73] "ORF" "SAT" "SDF" "PDX" "SJC" "OMA" "CRW" "OAK" "SMF" "TUL" "TYS" "OKC"
[85] "PVD" "DSM" "PSE" "BHM" "CAE" "HDN" "BZN" "MTJ" "EYW" "PSP" "ACK" "BGR"
[97] "ABQ" "ILM" "MVY" "SBN" "LEX" "CHO" "TVC" "ANC" "LGA"
Also possible in {dplyr} using distinct()
:
flights |>
distinct (dest)
# A tibble: 105 × 1
dest
<chr>
1 IAH
2 MIA
3 BQN
4 ATL
5 ORD
6 FLL
7 IAD
8 MCO
9 PBI
10 TPA
# ℹ 95 more rows
flights |>
filter (dest %in% c ("IAH" , "HOU" ))
# A tibble: 9,313 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 623 627 -4 933 932
4 2013 1 1 728 732 -4 1041 1038
5 2013 1 1 739 739 0 1104 1038
6 2013 1 1 908 908 0 1228 1219
7 2013 1 1 1028 1026 2 1350 1339
8 2013 1 1 1044 1045 -1 1352 1351
9 2013 1 1 1114 900 134 1447 1222
10 2013 1 1 1205 1200 5 1503 1505
# ℹ 9,303 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Another way to do this, but less efficienct in code is to write:
flights |>
filter (dest == "IAH" | dest == "HOU" )
# A tibble: 9,313 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 623 627 -4 933 932
4 2013 1 1 728 732 -4 1041 1038
5 2013 1 1 739 739 0 1104 1038
6 2013 1 1 908 908 0 1228 1219
7 2013 1 1 1028 1026 2 1350 1339
8 2013 1 1 1044 1045 -1 1352 1351
9 2013 1 1 1114 900 134 1447 1222
10 2013 1 1 1205 1200 5 1503 1505
# ℹ 9,303 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
with the |
representing or
.
Were operated by United, American, or Delta
Checking the carrier
column for information that matches:
[1] "UA" "AA" "B6" "DL" "EV" "MQ" "US" "WN" "VX" "FL" "AS" "9E" "F9" "HA" "YV"
[16] "OO"
These look like codes rather than names.
Checking the package information online https://github.com/tidyverse/nycflights13 there is a dataset for airlines
:
This section of the book doesn’t refer to joins so presuming the exercise is requiring looking for the right codes:
airlines |>
filter (name %in% c ("United" , "American" , "Delta" ))
# A tibble: 0 × 2
# ℹ 2 variables: carrier <chr>, name <chr>
This doesn’t bring back the right carriers but there are only 16 rows and the names should be:
airlines |>
filter (name %in% c ("United Air Lines Inc." , "American Airlines Inc." , "Delta Air Lines Inc." ))
# A tibble: 3 × 2
carrier name
<chr> <chr>
1 AA American Airlines Inc.
2 DL Delta Air Lines Inc.
3 UA United Air Lines Inc.
flights |>
filter (carrier %in% c ("AA" , "DL" , "UA" ))
# A tibble: 139,504 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 554 600 -6 812 837
5 2013 1 1 554 558 -4 740 728
6 2013 1 1 558 600 -2 753 745
7 2013 1 1 558 600 -2 924 917
8 2013 1 1 558 600 -2 923 937
9 2013 1 1 559 600 -1 941 910
10 2013 1 1 559 600 -1 854 902
# ℹ 139,494 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Departed in summer (July, August, and September)
flights |>
filter (month %in% c (7 , 8 , 9 ))
# A tibble: 86,326 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 7 1 1 2029 212 236 2359
2 2013 7 1 2 2359 3 344 344
3 2013 7 1 29 2245 104 151 1
4 2013 7 1 43 2130 193 322 14
5 2013 7 1 44 2150 174 300 100
6 2013 7 1 46 2051 235 304 2358
7 2013 7 1 48 2001 287 308 2305
8 2013 7 1 58 2155 183 335 43
9 2013 7 1 100 2146 194 327 30
10 2013 7 1 100 2245 135 337 135
# ℹ 86,316 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Because these are numbers the filter can be reduced to:
flights |>
filter (month %in% 7 : 9 )
# A tibble: 86,326 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 7 1 1 2029 212 236 2359
2 2013 7 1 2 2359 3 344 344
3 2013 7 1 29 2245 104 151 1
4 2013 7 1 43 2130 193 322 14
5 2013 7 1 44 2150 174 300 100
6 2013 7 1 46 2051 235 304 2358
7 2013 7 1 48 2001 287 308 2305
8 2013 7 1 58 2155 183 335 43
9 2013 7 1 100 2146 194 327 30
10 2013 7 1 100 2245 135 337 135
# ℹ 86,316 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Arrived more than two hours late, but didn’t leave late
flights |>
filter (dep_delay <= 0 ,
arr_delay > 120 )
# A tibble: 29 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 27 1419 1420 -1 1754 1550
2 2013 10 7 1350 1350 0 1736 1526
3 2013 10 7 1357 1359 -2 1858 1654
4 2013 10 16 657 700 -3 1258 1056
5 2013 11 1 658 700 -2 1329 1015
6 2013 3 18 1844 1847 -3 39 2219
7 2013 4 17 1635 1640 -5 2049 1845
8 2013 4 18 558 600 -2 1149 850
9 2013 4 18 655 700 -5 1213 950
10 2013 5 22 1827 1830 -3 2217 2010
# ℹ 19 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Were delayed by at least an hour, but made up over 30 minutes in flight
flights |>
filter (dep_delay >= 60 ,
arr_delay < dep_delay - 30 )
# A tibble: 1,844 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 2205 1720 285 46 2040
2 2013 1 1 2326 2130 116 131 18
3 2013 1 3 1503 1221 162 1803 1555
4 2013 1 3 1839 1700 99 2056 1950
5 2013 1 3 1850 1745 65 2148 2120
6 2013 1 3 1941 1759 102 2246 2139
7 2013 1 3 1950 1845 65 2228 2227
8 2013 1 3 2015 1915 60 2135 2111
9 2013 1 3 2257 2000 177 45 2224
10 2013 1 4 1917 1700 137 2135 1950
# ℹ 1,834 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
At least an hour means 60 minutes or more so >=
.
Sort flights
to find the flights with longest departure delays. Find the flights that left earliest in the morning.
Rather than ,
between the two parts this answer will require the &
flights |>
filter (dep_time < 0900 ) |>
arrange (desc (dep_delay))
# A tibble: 79,314 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 9 641 900 1301 1242 1530
2 2013 7 22 845 1600 1005 1044 1815
3 2013 12 5 756 1700 896 1058 2020
4 2013 1 1 848 1835 853 1001 1950
5 2013 5 19 713 1700 853 1007 1955
6 2013 12 19 734 1725 849 1046 2039
7 2013 12 17 705 1700 845 1026 2020
8 2013 12 14 830 1845 825 1210 2154
9 2013 6 27 753 1830 803 937 2015
10 2013 11 3 603 1645 798 829 1913
# ℹ 79,304 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Sort flights
to find the fastest flights. (Hint: Try including a math calculation inside of your function.)
Was there a flight on every day of 2013?
Which flights traveled the farthest distance? Which traveled the least distance?
Does it matter what order you used filter()
and arrange()
if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.