This is a quick example of how I personally would break down data.

First, I load relevant libraries and define my own helpful “not in” operator.

library(tidyverse) #for dplyr, ggplot2, so on.
library(lubridate) #for good date handling.
`%ni%` = Negate(`%in%`)

The data I’m pulling in here as “summary2.csv” is a freely available list of reported storm incidents from the Bureau of Meterology.

stormData = read_csv("summary2.csv", quote = "\"")
## New names:
## • `` -> `...10`
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 20779 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): Database, NearestTown, State, Comments
## dbl  (4): EventID, ID, Latitude, Longitude
## lgl  (1): ...10
## dttm (1): DateTime
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

I’m told the data has problems. That’s quite common. What problems?

problems(stormData)
## # A tibble: 110 × 5
##      row   col expected           actual                                   file 
##    <int> <int> <chr>              <chr>                                    <chr>
##  1  4060    10 1/0/T/F/TRUE/FALSE "narrow and high in the south\".\","     C:/U…
##  2  4060    11 10 columns         "11 columns"                             C:/U…
##  3  4074    10 1/0/T/F/TRUE/FALSE "trees snapped, uprooted\" on spotter c… C:/U…
##  4  4074    12 10 columns         "12 columns"                             C:/U…
##  5  4101    10 1/0/T/F/TRUE/FALSE "creeks/storm-water drains overflowing\… C:/U…
##  6  4101    11 10 columns         "11 columns"                             C:/U…
##  7  4104    10 1/0/T/F/TRUE/FALSE "creeks/storm-water drains overflowing\… C:/U…
##  8  4104    11 10 columns         "11 columns"                             C:/U…
##  9  4437    10 1/0/T/F/TRUE/FALSE "but just local\".\","                   C:/U…
## 10  4437    11 10 columns         "11 columns"                             C:/U…
## # ℹ 100 more rows

There’s 110 rows where somebody has put a bunch of slashes in the comments, which are hard to parse due to their use as escape characters. In a future version of this example I will demonstrate how I’d clean this up if necessary (and it’s often necessary in pipelines); but in an exploratory analysis where most of the data is fine I would filter then out like so.

badRows = problems(stormData) %>%
  pull(row) #Get rows with problems
badEventIDs = slice(stormData, badRows) %>%
  pull(EventID) %>%
  unique() #Get the ID of rows with problems

goodRecords = stormData %>%
  filter(EventID %ni% badEventIDs) #Remove all rows with problems

goodRecords %>%
  count(Database) #count() being a wrapper around group_by() and summarize().
## # A tibble: 6 × 2
##   Database       n
##   <chr>      <int>
## 1 Hail        4344
## 2 Lighting     323
## 3 Rain        6451
## 4 Tornado     1317
## 5 Waterspout    82
## 6 Wind        8164

I have the basic setup. Now, how do I want to slice it? Maybe I want to see how they’re broken down over time. I’ll extract the year from the timestamp so it can be seen year on year; and add a little check to make sure every row has a year.

records2 = goodRecords %>%
  mutate(Year = year(parse_date_time(DateTime, orders = "ymdHMS", truncated = 3))) #Scope for performance improvement with vectorization if the database was really big. 

naCountStorms = pull(records2, Year) %>%
  is.na() %>%
  sum() 


if(naCountStorms > 0){
  warning("NAs generated in year mutation.")
}

The warning isn’t triggered, so the code worked. Here’s the graph of reports by year.

records2 %>%
count(Year) %>%
ggplot(aes(x = Year, y = n)) +
  geom_col()

Reports go way up in number as communication improves. I am most interested in pre-1960 reports. There’s an intriguing spike in the 1930s. What is that? Let’s investigate.

records2 %>%
  count(Year) %>%
  filter(Year >= 1900 & Year <= 1950) %>%
  print(n = 51)
## # A tibble: 51 × 2
##     Year     n
##    <dbl> <int>
##  1  1900    18
##  2  1901    28
##  3  1902    38
##  4  1903    41
##  5  1904    25
##  6  1905    23
##  7  1906    38
##  8  1907    24
##  9  1908    22
## 10  1909    41
## 11  1910    45
## 12  1911    39
## 13  1912    54
## 14  1913    89
## 15  1914    58
## 16  1915    56
## 17  1916    11
## 18  1917    19
## 19  1918    24
## 20  1919    21
## 21  1920    17
## 22  1921    17
## 23  1922    15
## 24  1923    32
## 25  1924    40
## 26  1925    36
## 27  1926    28
## 28  1927    13
## 29  1928    15
## 30  1929    28
## 31  1930    26
## 32  1931    27
## 33  1932    38
## 34  1933    38
## 35  1934    47
## 36  1935   139
## 37  1936    58
## 38  1937    35
## 39  1938    37
## 40  1939    14
## 41  1940    23
## 42  1941    14
## 43  1942    18
## 44  1943    14
## 45  1944    12
## 46  1945    21
## 47  1946    23
## 48  1947    23
## 49  1948    12
## 50  1949    25
## 51  1950    15

1935 jumps out as the only one in this period with a triple digit number of reports. What happened?

Let’s look at it by type and year together.

records2 %>%
  count(Year, Database) %>%
  filter(Year >= 1900 & Year <= 1950) %>%
  ggplot(aes(x = Year, y = n)) +
  geom_col() +
  facet_grid(rows = vars(Database))

A large number of wind and hail reports were gathered in 1935. Was there a massive hailstorm? Multiple hailstorms? Let’s isolate 1935.

records2 %>%
  filter(Year == 1935) %>%
  count(Database, State)
## # A tibble: 13 × 3
##    Database   State     n
##    <chr>      <chr> <int>
##  1 Hail       QLD      57
##  2 Hail       VIC       3
##  3 Lighting   QLD       5
##  4 Lighting   VIC       9
##  5 Rain       QLD      12
##  6 Rain       VIC       2
##  7 Tornado    NSW       1
##  8 Tornado    QLD       1
##  9 Tornado    VIC       3
## 10 Waterspout <NA>      3
## 11 Wind       NSW       4
## 12 Wind       QLD      32
## 13 Wind       VIC       7

This is now small enough that viewing it as a table makes more sense than viewing it as a graph. What happened to make an exceptional number of reports is clear - wind and hail across Queensland. Enormous wind and noise. What do the comments say?

records2 %>%
  filter(Year == 1935, State == "QLD") %>%
  select(State, Comments) %>%
  unique()
## # A tibble: 1 × 2
##   State Comments
##   <chr> <chr>   
## 1 QLD   NULL

No comments in Queensland. What about Victoria?

records2 %>%
  filter(Year == 1935, State == "VIC") %>%
  select(DateTime, Comments) %>%
  unique() %>%
  print(n = 24)
## # A tibble: 18 × 2
##    DateTime            Comments                          
##    <dttm>              <chr>                             
##  1 1935-01-02 00:00:00 Crops damaged.                    
##  2 1935-01-02 00:00:00 Wheat crops destroyed.            
##  3 1935-01-02 00:00:00 <NA>                              
##  4 1935-01-07 00:00:00 <NA>                              
##  5 1935-01-14 00:00:00 Great damage to orchards.         
##  6 1935-02-20 00:00:00 <NA>                              
##  7 1935-06-23 00:00:00 Path width 3600                   
##  8 1935-09-13 00:00:00 <NA>                              
##  9 1935-09-15 00:00:00 <NA>                              
## 10 1935-09-16 00:00:00 5 cows struck by lightning.       
## 11 1935-09-30 00:00:00 Man struck by lightning.          
## 12 1935-09-30 00:00:00 Woman in shed struck by lightning.
## 13 1935-10-21 00:00:00 100 sheep drowned.                
## 14 1935-12-02 00:00:00 Crops and gardens destroyed.      
## 15 1935-12-03 00:00:00 Roads blocked by broken timber.w  
## 16 1935-12-03 00:00:00 <NA>                              
## 17 1935-12-10 00:00:00 <NA>                              
## 18 1935-12-10 00:00:00 Crops ruined.

Victoria in late 1935: People struck by lightning. Cows struck by lightning. Sheep drowning (and presumably struck by lightning, though no report here). A tumultuous time.