This is a quick example of how I personally would break down data.
First, I load relevant libraries and define my own helpful “not in” operator.
library(tidyverse) #for dplyr, ggplot2, so on.
library(lubridate) #for good date handling.
`%ni%` = Negate(`%in%`)
The data I’m pulling in here as “summary2.csv” is a freely available list of reported storm incidents from the Bureau of Meterology.
stormData = read_csv("summary2.csv", quote = "\"")
## New names:
## • `` -> `...10`
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 20779 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Database, NearestTown, State, Comments
## dbl (4): EventID, ID, Latitude, Longitude
## lgl (1): ...10
## dttm (1): DateTime
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
I’m told the data has problems. That’s quite common. What problems?
problems(stormData)
## # A tibble: 110 × 5
## row col expected actual file
## <int> <int> <chr> <chr> <chr>
## 1 4060 10 1/0/T/F/TRUE/FALSE "narrow and high in the south\".\"," C:/U…
## 2 4060 11 10 columns "11 columns" C:/U…
## 3 4074 10 1/0/T/F/TRUE/FALSE "trees snapped, uprooted\" on spotter c… C:/U…
## 4 4074 12 10 columns "12 columns" C:/U…
## 5 4101 10 1/0/T/F/TRUE/FALSE "creeks/storm-water drains overflowing\… C:/U…
## 6 4101 11 10 columns "11 columns" C:/U…
## 7 4104 10 1/0/T/F/TRUE/FALSE "creeks/storm-water drains overflowing\… C:/U…
## 8 4104 11 10 columns "11 columns" C:/U…
## 9 4437 10 1/0/T/F/TRUE/FALSE "but just local\".\"," C:/U…
## 10 4437 11 10 columns "11 columns" C:/U…
## # ℹ 100 more rows
There’s 110 rows where somebody has put a bunch of slashes in the comments, which are hard to parse due to their use as escape characters. In a future version of this example I will demonstrate how I’d clean this up if necessary (and it’s often necessary in pipelines); but in an exploratory analysis where most of the data is fine I would filter then out like so.
badRows = problems(stormData) %>%
pull(row) #Get rows with problems
badEventIDs = slice(stormData, badRows) %>%
pull(EventID) %>%
unique() #Get the ID of rows with problems
goodRecords = stormData %>%
filter(EventID %ni% badEventIDs) #Remove all rows with problems
goodRecords %>%
count(Database) #count() being a wrapper around group_by() and summarize().
## # A tibble: 6 × 2
## Database n
## <chr> <int>
## 1 Hail 4344
## 2 Lighting 323
## 3 Rain 6451
## 4 Tornado 1317
## 5 Waterspout 82
## 6 Wind 8164
I have the basic setup. Now, how do I want to slice it? Maybe I want to see how they’re broken down over time. I’ll extract the year from the timestamp so it can be seen year on year; and add a little check to make sure every row has a year.
records2 = goodRecords %>%
mutate(Year = year(parse_date_time(DateTime, orders = "ymdHMS", truncated = 3))) #Scope for performance improvement with vectorization if the database was really big.
naCountStorms = pull(records2, Year) %>%
is.na() %>%
sum()
if(naCountStorms > 0){
warning("NAs generated in year mutation.")
}
The warning isn’t triggered, so the code worked. Here’s the graph of reports by year.
records2 %>%
count(Year) %>%
ggplot(aes(x = Year, y = n)) +
geom_col()
Reports go way up in number as communication improves. I am most interested in pre-1960 reports. There’s an intriguing spike in the 1930s. What is that? Let’s investigate.
records2 %>%
count(Year) %>%
filter(Year >= 1900 & Year <= 1950) %>%
print(n = 51)
## # A tibble: 51 × 2
## Year n
## <dbl> <int>
## 1 1900 18
## 2 1901 28
## 3 1902 38
## 4 1903 41
## 5 1904 25
## 6 1905 23
## 7 1906 38
## 8 1907 24
## 9 1908 22
## 10 1909 41
## 11 1910 45
## 12 1911 39
## 13 1912 54
## 14 1913 89
## 15 1914 58
## 16 1915 56
## 17 1916 11
## 18 1917 19
## 19 1918 24
## 20 1919 21
## 21 1920 17
## 22 1921 17
## 23 1922 15
## 24 1923 32
## 25 1924 40
## 26 1925 36
## 27 1926 28
## 28 1927 13
## 29 1928 15
## 30 1929 28
## 31 1930 26
## 32 1931 27
## 33 1932 38
## 34 1933 38
## 35 1934 47
## 36 1935 139
## 37 1936 58
## 38 1937 35
## 39 1938 37
## 40 1939 14
## 41 1940 23
## 42 1941 14
## 43 1942 18
## 44 1943 14
## 45 1944 12
## 46 1945 21
## 47 1946 23
## 48 1947 23
## 49 1948 12
## 50 1949 25
## 51 1950 15
1935 jumps out as the only one in this period with a triple digit number of reports. What happened?
Let’s look at it by type and year together.
records2 %>%
count(Year, Database) %>%
filter(Year >= 1900 & Year <= 1950) %>%
ggplot(aes(x = Year, y = n)) +
geom_col() +
facet_grid(rows = vars(Database))
A large number of wind and hail reports were gathered in 1935. Was there a massive hailstorm? Multiple hailstorms? Let’s isolate 1935.
records2 %>%
filter(Year == 1935) %>%
count(Database, State)
## # A tibble: 13 × 3
## Database State n
## <chr> <chr> <int>
## 1 Hail QLD 57
## 2 Hail VIC 3
## 3 Lighting QLD 5
## 4 Lighting VIC 9
## 5 Rain QLD 12
## 6 Rain VIC 2
## 7 Tornado NSW 1
## 8 Tornado QLD 1
## 9 Tornado VIC 3
## 10 Waterspout <NA> 3
## 11 Wind NSW 4
## 12 Wind QLD 32
## 13 Wind VIC 7
This is now small enough that viewing it as a table makes more sense than viewing it as a graph. What happened to make an exceptional number of reports is clear - wind and hail across Queensland. Enormous wind and noise. What do the comments say?
records2 %>%
filter(Year == 1935, State == "QLD") %>%
select(State, Comments) %>%
unique()
## # A tibble: 1 × 2
## State Comments
## <chr> <chr>
## 1 QLD NULL
No comments in Queensland. What about Victoria?
records2 %>%
filter(Year == 1935, State == "VIC") %>%
select(DateTime, Comments) %>%
unique() %>%
print(n = 24)
## # A tibble: 18 × 2
## DateTime Comments
## <dttm> <chr>
## 1 1935-01-02 00:00:00 Crops damaged.
## 2 1935-01-02 00:00:00 Wheat crops destroyed.
## 3 1935-01-02 00:00:00 <NA>
## 4 1935-01-07 00:00:00 <NA>
## 5 1935-01-14 00:00:00 Great damage to orchards.
## 6 1935-02-20 00:00:00 <NA>
## 7 1935-06-23 00:00:00 Path width 3600
## 8 1935-09-13 00:00:00 <NA>
## 9 1935-09-15 00:00:00 <NA>
## 10 1935-09-16 00:00:00 5 cows struck by lightning.
## 11 1935-09-30 00:00:00 Man struck by lightning.
## 12 1935-09-30 00:00:00 Woman in shed struck by lightning.
## 13 1935-10-21 00:00:00 100 sheep drowned.
## 14 1935-12-02 00:00:00 Crops and gardens destroyed.
## 15 1935-12-03 00:00:00 Roads blocked by broken timber.w
## 16 1935-12-03 00:00:00 <NA>
## 17 1935-12-10 00:00:00 <NA>
## 18 1935-12-10 00:00:00 Crops ruined.
Victoria in late 1935: People struck by lightning. Cows struck by lightning. Sheep drowning (and presumably struck by lightning, though no report here). A tumultuous time.