Quick EDA – Landslides dataset from NASA (Part I)
Context
Landslides are one of the most pervasive hazards in the world, causing injuries and fatalities in almost every country. There are several triggers, but one of the main ones is intense and prolonged rainfall over saturated soil on vulnerable slopes.
The Global Landslide Catalog (GLC) was developed with the goal of identifying rainfall-triggered landslide events around the world, regardless of size, impacts, or location. The GLC considers all types of mass movements triggered by rainfall, which have been reported in the media, disaster databases, scientific reports, or other sources. The GLC has been compiled since 2007 at NASA Goddard Space Flight Center. This is a unique data set with the ID tag “GLC” in the landslide editor.
Idea
We are working with a small dataset of around 11,000 records. In this first part, we build a basic exploratory analysis step by step to decide which variables we want to focus on. There are different ways to carry out an exploratory analysis:
Using basic functions
# Returns the dimensions of the data frame: number of rows and number of columns.
dim(df)
# Displays the first 10 rows.
head(df, 10)
# For a data frame, summary() is applied to each column and the results are collated into a table.
summary(df)
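For a compact view of each column's type together with a preview of its values, base R also offers str(). The following is a minimal sketch on a toy data frame: `toy` and its values are made up for illustration and only borrow column names from the catalog.

```r
# Toy data frame (hypothetical values; column names borrowed from the GLC).
toy <- data.frame(
  event_id       = c(1L, 2L, 3L),
  country_name   = c("Peru", NA, "Nepal"),
  fatality_count = c(0L, 11L, NA),
  stringsAsFactors = FALSE
)

# One line per column: name, type, and the first few values.
str(toy)

# Number of missing values per column, a quick complement to summary().
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)
```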
Using new libraries
I am speaking about skimr and DataExplorer. They are relatively new and offer excellent results for the time and effort invested, leaving room for more specific searches once we have an initial idea of the dataset.
Skimr
skimr is designed to provide summary statistics about variables. It is opinionated in its defaults, but easy to modify. In base R, the most similar functions are summary() for vectors and data frames and fivenum() for numeric vectors.
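To make the comparison concrete, here is a small sketch of the two base R functions on a numeric vector. The values in `x` are hypothetical, chosen only to be right-skewed like the count columns in the catalog.

```r
# Hypothetical counts, right-skewed on purpose.
x <- c(0, 0, 0, 1, 2, 5, 40)

# summary(): minimum, quartiles, median, mean, and maximum.
print(summary(x))

# fivenum(): Tukey's five-number summary
# (minimum, lower hinge, median, upper hinge, maximum).
five <- fivenum(x)
print(five)
```

Neither function reports missing values for character columns or draws an inline histogram, which is where skimr adds the most.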
DataExplorer
DataExplorer is designed to provide a graphical view of a data frame. There are 3 main goals for DataExplorer:
- Exploratory Data Analysis (EDA)
- Feature Engineering
- Data Reporting
Exploratory analysis using Skimr
Code:
library(skimr)
skim(df)
Results:
# Skim summary statistics
# n obs: 11033
# n variables: 31
#
# ── Variable type:character ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
# variable missing complete n min max empty n_unique
# admin_division_name 1637 9396 11033 3 36 0 887
# country_code 1564 9469 11033 2 2 0 139
# country_name 1562 9471 11033 4 32 0 141
# created_date 0 11033 11033 8 22 0 420
# event_date 0 11033 11033 22 22 0 6550
# event_description 862 10171 11033 3 1003 0 9401
# event_import_source 1563 9470 11033 3 80 0 3
# event_time 6021 5012 11033 4 7 0 25
# event_title 0 11033 11033 3 150 0 10546
# gazeteer_closest_point 1563 9470 11033 2 45 0 4389
# landslide_category 1 11032 11033 5 19 0 14
# landslide_setting 69 10964 11033 4 16 0 14
# landslide_size 9 11024 11033 5 12 0 6
# landslide_trigger 23 11010 11033 4 23 0 18
# last_edited_date 0 11033 11033 22 22 0 1
# location_accuracy 2 11031 11033 3 7 0 9
# location_description 102 10931 11033 3 412 0 10432
# notes 10716 317 11033 13 484 0 265
# photo_link 9537 1496 11033 28 292 0 1469
# source_link 846 10187 11033 23 386 0 8294
# source_name 0 11033 11033 2 172 0 3918
# storm_name 10456 577 11033 1 41 0 217
# submitted_date 9 11024 11033 3 22 0 3787
#
# ── Variable type:integer ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
# variable missing complete n mean sd p0 p25 p50 p75 p100 hist
# admin_division_population 1562 9471 11033 157760.05 829734.54 0 1963 7365 34021 1.3e+07 ▇▁▁▁▁▁▁▁
# event_id 0 11033 11033 5598.95 3249.23 1 2785 5563 8435 11221 ▇▇▇▇▇▇▇▇
# fatality_count 1385 9648 11033 3.22 59.89 0 0 0 1 5000 ▇▁▁▁▁▁▁▁
# injury_count 5674 5359 11033 0.75 8.46 0 0 0 0 374 ▇▁▁▁▁▁▁▁
#
# ── Variable type:numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
# variable missing complete n mean sd p0 p25 p50 p75 p100 hist
# event_import_id 1562 9471 11033 4798.56 2789.13 -111.17 2386.5 4773 7189.5 9669 ▇▇▇▇▇▇▇▇
# gazeteer_distance 1562 9471 11033 11.87 15.6 3e-05 2.36 6.25 15.82 215.45 ▇▁▁▁▁▁▁▁
# latitude 0 11033 11033 25.88 20.42 -46.77 13.92 30.53 40.87 72.63 ▁▁▁▃▅▇▅▁
# longitude 0 11033 11033 2.52 100.91 -179.98 -107.87 19.69 93.95 179.99 ▁▇▅▁▂▆▇▁
Let's analyze this result. The data below is the same; I prefer to use tables to make the analysis and visualization easier.
Results
n obs: 11033 – n variables: 31
Variable type: character
variable | missing | complete | n | min | max | empty | n_unique |
---|---|---|---|---|---|---|---|
admin_division_name | 1637 | 9396 | 11033 | 3 | 36 | 0 | 887 |
country_code | 1564 | 9469 | 11033 | 2 | 2 | 0 | 139 |
country_name | 1562 | 9471 | 11033 | 4 | 32 | 0 | 141 |
created_date | 0 | 11033 | 11033 | 8 | 22 | 0 | 420 |
event_date | 0 | 11033 | 11033 | 22 | 22 | 0 | 6550 |
event_description | 862 | 10171 | 11033 | 3 | 1003 | 0 | 9401 |
event_import_source | 1563 | 9470 | 11033 | 3 | 80 | 0 | 3 |
event_time | 6021 | 5012 | 11033 | 4 | 7 | 0 | 25 |
event_title | 0 | 11033 | 11033 | 3 | 150 | 0 | 10546 |
gazeteer_closest_point | 1563 | 9470 | 11033 | 2 | 45 | 0 | 4389 |
landslide_category | 1 | 11032 | 11033 | 5 | 19 | 0 | 14 |
landslide_setting | 69 | 10964 | 11033 | 4 | 16 | 0 | 14 |
landslide_size | 9 | 11024 | 11033 | 5 | 12 | 0 | 6 |
landslide_trigger | 23 | 11010 | 11033 | 4 | 23 | 0 | 18 |
last_edited_date | 0 | 11033 | 11033 | 22 | 22 | 0 | 1 |
location_accuracy | 2 | 11031 | 11033 | 3 | 7 | 0 | 9 |
location_description | 102 | 10931 | 11033 | 3 | 412 | 0 | 10432 |
notes | 10716 | 317 | 11033 | 13 | 484 | 0 | 265 |
photo_link | 9537 | 1496 | 11033 | 28 | 292 | 0 | 1469 |
source_link | 846 | 10187 | 11033 | 23 | 386 | 0 | 8294 |
source_name | 0 | 11033 | 11033 | 2 | 172 | 0 | 3918 |
storm_name | 10456 | 577 | 11033 | 1 | 41 | 0 | 217 |
submitted_date | 9 | 11024 | 11033 | 3 | 22 | 0 | 3787 |
Variable type: integer
variable | missing | complete | n | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|
admin_division_population | 1562 | 9471 | 11033 | 157760.05 | 829734.54 | 0 | 1963 | 7365 | 34021 | 1.3e+07 | ▇▁▁▁▁▁▁▁ |
event_id | 0 | 11033 | 11033 | 5598.95 | 3249.23 | 1 | 2785 | 5563 | 8435 | 11221 | ▇▇▇▇▇▇▇▇ |
fatality_count | 1385 | 9648 | 11033 | 3.22 | 59.89 | 0 | 0 | 0 | 1 | 5000 | ▇▁▁▁▁▁▁▁ |
injury_count | 5674 | 5359 | 11033 | 0.75 | 8.46 | 0 | 0 | 0 | 0 | 374 | ▇▁▁▁▁▁▁▁ |
Variable type: numeric
variable | missing | complete | n | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|
event_import_id | 1562 | 9471 | 11033 | 4798.56 | 2789.13 | -111.17 | 2386.5 | 4773 | 7189.5 | 9669 | ▇▇▇▇▇▇▇▇ |
gazeteer_distance | 1562 | 9471 | 11033 | 11.87 | 15.6 | 3E-05 | 2.36 | 6.25 | 15.82 | 215.45 | ▇▁▁▁▁▁▁▁ |
latitude | 0 | 11033 | 11033 | 25.88 | 20.42 | -46.77 | 13.92 | 30.53 | 40.87 | 72.63 | ▁▁▁▃▅▇▅▁ |
longitude | 0 | 11033 | 11033 | 2.52 | 100.91 | -179.98 | -107.87 | 19.69 | 93.95 | 179.99 | ▁▇▅▁▂▆▇▁ |
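When reading the character table, note that min and max are string lengths, not values. The sketch below reproduces the meaning of each column with base R on a hypothetical vector `trigger`; it is an approximation for illustration, and skimr's own implementation may differ in edge cases.

```r
# Hypothetical character column with one missing value and one empty string.
trigger <- c("Rain", "Downpour", NA, "", "Rain")

# Keep only the non-missing entries for the length-based statistics.
observed <- trigger[!is.na(trigger)]

char_summary <- c(
  missing  = sum(is.na(trigger)),      # NA entries
  complete = length(observed),         # non-NA entries
  n        = length(trigger),          # total entries
  min      = min(nchar(observed)),     # shortest string length
  max      = max(nchar(observed)),     # longest string length
  empty    = sum(observed == ""),      # zero-length strings
  n_unique = length(unique(observed))  # distinct non-NA values
)
print(char_summary)
```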
Possible points to investigate
In order to make the conclusion as clear as I can, I will enumerate the possible points to analyze:
1) Event date
First, “event_date” is complete for every record. We can group the events by year to see which years had the most landslides.
variable | missing | complete | n | min | max | empty | n_unique |
---|---|---|---|---|---|---|---|
event_date | 0 | 11033 | 11033 | 22 | 22 | 0 | 6550 |
event_description | 862 | 10171 | 11033 | 3 | 1003 | 0 | 9401 |
event_import_source | 1563 | 9470 | 11033 | 3 | 80 | 0 | 3 |
event_time | 6021 | 5012 | 11033 | 4 | 7 | 0 | 25 |
event_title | 0 | 11033 | 11033 | 3 | 150 | 0 | 10546 |
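Grouping by year could look like the sketch below. It assumes that “event_date” is stored as text in the form "MM/DD/YYYY HH:MM:SS AM", which would match the fixed length of 22 characters reported above; this format is an assumption and should be checked against the real column. The data frame `events` and its values are hypothetical.

```r
# Hypothetical sample; the date format is an assumption (see lead-in).
events <- data.frame(
  event_date = c("08/01/2008 12:00:00 AM",
                 "01/02/2009 02:00:00 AM",
                 "01/19/2009 12:00:00 AM"),
  stringsAsFactors = FALSE
)

# as.Date() parses the date part and ignores the trailing time string.
parsed <- as.Date(events$event_date, format = "%m/%d/%Y")

# Extract the year and count events per year.
events$event_year <- as.integer(format(parsed, "%Y"))
events_per_year <- table(events$event_year)
print(events_per_year)
```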
2) Country
Second, “country_code” is available for about 86% of the records (9,469 of 11,033), and latitude and longitude are complete for every record. We can create maps showing the geographic location of the landslides.
variable | missing | complete | n | min | max | empty | n_unique |
---|---|---|---|---|---|---|---|
admin_division_name | 1637 | 9396 | 11033 | 3 | 36 | 0 | 887 |
country_code | 1564 | 9469 | 11033 | 2 | 2 | 0 | 139 |
country_name | 1562 | 9471 | 11033 | 4 | 32 | 0 | 141 |
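Because the coordinates are complete even where the country is missing, a first map can be drawn from latitude and longitude alone. This is a minimal base R sketch with hypothetical values in `toy`; a real map would normally add a basemap from a mapping package, which is not shown here.

```r
# Hypothetical sample: the second event has coordinates but no country.
toy <- data.frame(
  country_name = c("India", NA, "Peru"),
  latitude     = c(30.5, 14.6, -12.0),
  longitude    = c(78.9, 121.0, -77.0),
  stringsAsFactors = FALSE
)

# Events that can be placed on a map, regardless of the country field.
n_plottable <- sum(!is.na(toy$latitude) & !is.na(toy$longitude))

# Plain scatter plot on a world-sized canvas; events without a country in red.
plot(toy$longitude, toy$latitude,
     xlim = c(-180, 180), ylim = c(-90, 90),
     xlab = "Longitude", ylab = "Latitude", pch = 19,
     col = ifelse(is.na(toy$country_name), "red", "black"))
```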
3) Injuries / Fatalities
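The variables “injury_count” and “fatality_count” let us determine which events had the greatest impact. We need to check carefully whether the available data is sufficient for an unbiased analysis: “injury_count” is missing for about half of the records (5,674 of 11,033) and “fatality_count” for about 13% (1,385 of 11,033).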
variable | missing | complete | n | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|
fatality_count | 1385 | 9648 | 11033 | 3.22 | 59.89 | 0 | 0 | 0 | 1 | 5000 | ▇▁▁▁▁▁▁▁ |
injury_count | 5674 | 5359 | 11033 | 0.75 | 8.46 | 0 | 0 | 0 | 0 | 374 | ▇▁▁▁▁▁▁▁ |
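Before ranking events by impact, it helps to quantify how much of each column is missing. A minimal sketch, using a hypothetical data frame `toy` with made-up counts:

```r
# Hypothetical sample with missing counts.
toy <- data.frame(
  event_id       = 1:5,
  fatality_count = c(0L, 3L, NA, 120L, 1L),
  injury_count   = c(NA, 2L, NA, NA, 0L)
)

# Share of missing values in each impact column.
missing_share <- colMeans(is.na(toy[c("fatality_count", "injury_count")]))
print(missing_share)

# Rank events by fatalities, dropping rows where the count is unknown.
known  <- toy[!is.na(toy$fatality_count), ]
ranked <- known[order(known$fatality_count, decreasing = TRUE), ]
print(ranked)
```

Dropping unknown counts is only one option; treating them as zero would understate the impact, so the choice should be stated in any conclusion drawn from these columns.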
4) Landslides: size / category / trigger
- The variable “landslide_size” lets us see the size of each event and group events by size. We can also check which landslide size is the most frequent.
- The variable “landslide_trigger” lets us see which triggers are the most common.
variable | missing | complete | n | min | max | empty | n_unique |
---|---|---|---|---|---|---|---|
landslide_category | 1 | 11032 | 11033 | 5 | 19 | 0 | 14 |
landslide_setting | 69 | 10964 | 11033 | 4 | 16 | 0 | 14 |
landslide_size | 9 | 11024 | 11033 | 5 | 12 | 0 | 6 |
landslide_trigger | 23 | 11010 | 11033 | 4 | 23 | 0 | 18 |
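Finding the most frequent size can be sketched with an ordered factor and a frequency table. The size labels below ("small", "medium", "large") are assumed for illustration; the real column has 6 distinct values whose labels should be read from the data first.

```r
# Hypothetical sample; the labels are assumptions, not taken from the catalog.
size <- c("medium", "small", "medium", "large", NA, "medium", "small")
size_levels <- c("small", "medium", "large")

# An ordered factor keeps the categories in a meaningful order in tables and plots.
size_f <- factor(size, levels = size_levels, ordered = TRUE)

# Frequency per size, with missing values shown explicitly.
size_counts <- table(size_f, useNA = "ifany")
print(size_counts)

# Most frequent size among the known values.
most_frequent <- names(which.max(table(size_f)))
print(most_frequent)
```

The same pattern applies to “landslide_trigger”, where sorting the table in decreasing order gives the most common triggers first.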
5) Relationship between variables
Just a few interesting relationships to explore:
- Injuries / Fatalities per country.
- Country of occurrence and size of the landslide.
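Both relationships can be sketched with base R. The data frame `toy` and its values are hypothetical; only the column names come from the catalog.

```r
# Hypothetical sample with one unknown fatality count.
toy <- data.frame(
  country_name   = c("India", "India", "Peru", "Peru", "Nepal"),
  landslide_size = c("small", "large", "small", "small", "medium"),
  fatality_count = c(2L, 10L, NA, 1L, 4L),
  stringsAsFactors = FALSE
)

# Total fatalities per country; the formula interface drops rows with NA.
fatalities_by_country <- aggregate(fatality_count ~ country_name,
                                   data = toy, FUN = sum)
print(fatalities_by_country)

# Cross-tabulation of country against landslide size.
size_by_country <- table(toy$country_name, toy$landslide_size)
print(size_by_country)
```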
Conclusions
With just one library we already have a much better idea of our dataset. Let's try DataExplorer in “Landslides Dataset from NASA (Part II)” to see what other paths we can find to explore.
Sources
[1] Kirschbaum, D. B., Adler, R., Hong, Y., Hill, S., & Lerner-Lam, A. (2010). A global landslide catalog for hazard applications: method, results, and limitations. Natural Hazards, 52(3), 561–575. doi:10.1007/s11069-009-9401-4.
[2] Kirschbaum, D. B., Stanley, T., & Zhou, Y. (In press, 2015). Spatial and temporal analysis of a global landslide catalog. Geomorphology. doi:10.1016/j.geomorph.2015.03.016.