Exploratory data analysis of factors that might determine the survival of Titanic passengers

Rapeephan Duangjanchot
Jan 29, 2021

What’s Exploratory Data Analysis?

Exploratory data analysis is a critical process used to investigate patterns, anomalies, hypotheses and assumptions in raw data. The process includes both statistical and graphical tools.

Data exploration

Raw data was downloaded from https://www.kaggle.com/c/titanic. There are 12 variables: Passenger ID, Survived, Ticket class (Pclass), Name, Sex, Age, Number of siblings and spouses (SibSp), Number of parents and children (Parch), Ticket number (Ticket), Passenger fare (Fare), Cabin number (Cabin) and Port of embarkation (Embarked).

Available factors from raw data

Then I created a data status table to check the type, quality and quantity of the data. The table was generated with the df_status function of the funModeling library in R. It shows three data types (integer, character and numeric) and 891 passengers. However, the q_na column reports 177 missing values for Age. I decided to remove the rows with a missing Age, filtering them out with the is.na function in R, which left 714 passengers for analysis. Next, I created a descriptive statistics table from the filtered data.
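As a rough illustration of the filtering step, here is a minimal sketch in base R. The `passengers` data frame and its values are made up for the example and stand in for the Kaggle file; the real script may be organised differently.

```r
# Toy stand-in for the Kaggle training data (values are made up for illustration).
passengers <- data.frame(
  PassengerId = 1:6,
  Survived    = c(0L, 1L, 1L, 0L, 0L, 1L),
  Age         = c(22, 38, NA, 35, NA, 4)
)

# Count missing values per column (the same kind of count that q_na reports).
na_counts <- colSums(is.na(passengers))
print(na_counts)

# Keep only the rows where Age is present.
filtered <- passengers[!is.na(passengers$Age), ]
print(nrow(filtered))
```

On the real data the same pattern would take the 891 rows down to the 714 with a known Age.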

Data status table

Descriptive statistics table

Descriptive statistics help summarize large amounts of data in a sensible way. This table was created with the describe function of the Hmisc library in R. Six variables were used (Age, Survived, Pclass, SibSp, Parch and Fare) because they are of numeric or integer type. The table shows that passenger age ranges from 0.42 to 80 years, with an average of 29.7 years, and that ages most frequently fall between about 15 and 43 years. Most passengers did not survive and travelled alone: Survived, SibSp and Parch all have a median of zero (0 = non-survivor or travelling alone) and positive skew. Fares across the three ticket classes (class 1, class 2 and class 3) range from 0 to 512.33 GBP. The trend in ticket class corresponds to the trend in passenger fare. Ticket class 3 (the cheapest class) has the most passengers, as indicated by the high median and negative skew of Pclass and the low median and positive skew of Fare.

Descriptive statistics table
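To make the median and skew reasoning concrete, here is a small sketch in base R. The `skewness` helper uses one common moment-based definition, which is an assumption and not necessarily the estimator behind the table, and the two vectors are made-up values shaped like SibSp and Pclass.

```r
# Moment-based sample skewness (one common definition; an assumption here,
# not necessarily the estimator used to build the table).
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Hypothetical values, made up for illustration.
sibsp  <- c(0, 0, 0, 0, 1, 0, 1, 0, 3, 0)   # mostly zero, a few larger values
pclass <- c(3, 3, 3, 2, 3, 1, 3, 3, 2, 3)   # mostly third class

sibsp_summary  <- c(median = median(sibsp),  skew = skewness(sibsp))
pclass_summary <- c(median = median(pclass), skew = skewness(pclass))
print(sibsp_summary)
print(pclass_summary)
```

A median of zero with positive skew means most values sit at zero with a tail of larger ones, while a high median with negative skew means most values sit at the top of the range.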

Multivariate graphical model

Multivariate graphical tools are used to present the relationship between two or more sets of data. I chose proportional bar charts to display the ratio of survivors to non-survivors for each factor. From these charts we can hypothesize which groups of passengers tended to survive.
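One possible way to build such a proportional bar chart in base R is sketched below. The `toy` data frame is made up for illustration, and the original charts may have been drawn with a different plotting library.

```r
# Hypothetical passengers, made up for illustration.
toy <- data.frame(
  Sex      = c("female", "female", "female", "male", "male", "male", "male", "female"),
  Survived = c(1, 1, 0, 0, 0, 1, 0, 1)
)

# Cross-tabulate outcome by group, then convert each column to proportions
# so every bar sums to 1 regardless of group size.
counts <- table(Survived = toy$Survived, Sex = toy$Sex)
props  <- prop.table(counts, margin = 2)
print(round(props, 2))

# Stacked bars of equal height: one bar per group, split by outcome.
barplot(props, legend.text = c("Did not survive", "Survived"),
        ylab = "Proportion of passengers", main = "Survival by Sex (toy data)")
```

Normalising within each group is what makes groups of very different sizes comparable at a glance.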

Factor: Pclass

Pclass is the ticket class, which reflects socio-economic status. From this chart, we can see that passengers in the upper class have the highest survival rate. Passengers in the upper class could have been famous or influential people.

Factor: Gender

This proportional plot shows that females tend to have a higher survival rate than males. Perhaps males gave up their seats in the lifeboats for females.

Factor: SibSp

The SibSp factor refers to the number of siblings and spouses who travelled with a passenger. Passengers with one sibling or spouse tend to have a high survival rate.

Factor: Parch

The Parch factor refers to the number of parents and children who travelled with a passenger. Passengers with one to three parents or children tend to have a high survival rate.

Factor: Embarked

Embarked refers to the port of embarkation. Passengers who embarked at Cherbourg tend to have a high survival rate.

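The five factors above are discrete (integer or character) data that can be plotted directly as bar charts. Next are the continuous numeric variables, Age and Fare, which need to be grouped into ranges first.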

Factor: Age

From this histogram, I decided to divide age into seven ranges.

Passengers aged about 0 to 10 years tend to have a high survival rate. It seems that adults gave up their seats in the lifeboats for children.
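Grouping a continuous variable into ranges can be sketched with base R's `cut` function. The breakpoints below are an assumption chosen to give seven ranges (the article does not list the ones actually used), and the ages and outcomes are made up for illustration.

```r
# Hypothetical ages and outcomes, made up for illustration.
age      <- c(0.42, 4, 9, 15, 22, 29, 35, 43, 58, 80)
survived <- c(1, 1, 1, 0, 0, 1, 0, 0, 0, 1)

# Assumed breakpoints giving seven ranges; the article does not list its own.
breaks    <- c(0, 10, 20, 30, 40, 50, 60, 80)
age_group <- cut(age, breaks = breaks, include.lowest = TRUE)

# Survival rate per range: the mean of a 0/1 outcome is the share of survivors.
rate_by_group <- tapply(survived, age_group, mean)
print(table(age_group))
print(round(rate_by_group, 2))
```

The same approach, with three breakpoints instead of seven, would apply to Fare.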

Factor: Fare

From this histogram, I decided to divide fare into three ranges.

Passengers who paid high fares tend to have a high survival rate. This trend corresponds to the one seen for ticket class.

In my opinion, the factors that might be considered survival factors are gender, socio-economic status and age.

Here is my R-script https://github.com/Rapeephan107/Titanic.git
