Titanic Data Analysis #1

Preprocessing:

library(ggplot2)  # provides ggplot() and geom_bar()
ggplot(train, aes(x = Survived)) + geom_bar(aes(fill = Sex))

The plot shows that a high proportion of the men died, while the proportion for women was lower → Sex affects Survived
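To put numbers behind the bar chart, the counts can be turned into per-sex survival proportions. The following is only a minimal sketch: it assumes a data frame with Sex and Survived columns like train, and the toy_train rows are made up for illustration (they are not the Titanic data).

```r
# Hypothetical toy data standing in for `train` (not the real Titanic rows)
toy_train <- data.frame(
  Sex      = c("male", "male", "male", "male",
               "female", "female", "female", "female"),
  Survived = c(0, 0, 0, 1, 1, 1, 1, 0)
)

# Cross-tabulate Sex against Survived, then normalise each row to sum to 1
sex_tab  <- table(toy_train$Sex, toy_train$Survived)
sex_prop <- prop.table(sex_tab, margin = 1)
print(round(sex_prop, 2))
```

Running the same two lines on train instead of toy_train would give the real proportions behind the plot.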


# Bin Age into four levels; rows with a missing Age stay NA
train$Age_level <- NA_character_
train$Age_level[train$Age < 10] <- "0-10"
train$Age_level[train$Age >= 10 & train$Age < 20] <- "10-20"
train$Age_level[train$Age >= 20 & train$Age < 30] <- "20-30"
train$Age_level[train$Age >= 30] <- "30-"
train$Age_level <- as.factor(train$Age_level)
ggplot(train, aes(x = Survived)) + geom_bar(aes(fill = Age_level))

The Age_level proportions are about the same for Survived = 0 and Survived = 1 → no effect
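The same age binning can also be written with base R's cut(), which avoids repeating the range conditions. This is a sketch under the assumption that the bins are meant to be left-closed (so an age of exactly 10 falls in "10-20"); toy_age is a hypothetical vector, not the real Age column.

```r
# Hypothetical ages, including one missing value
toy_age <- c(4, 15, 22, 29, 30, 45, NA)

# right = FALSE makes each bin [lower, upper), matching Age >= 10 & Age < 20 etc.
age_level <- cut(
  toy_age,
  breaks = c(-Inf, 10, 20, 30, Inf),
  right  = FALSE,
  labels = c("0-10", "10-20", "20-30", "30-")
)

# Count how many values fall in each bin; the missing age is reported as NA
print(table(age_level, useNA = "ifany"))
```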


Based on this analysis (the Embarked steps are omitted), Age and Embarked can be dropped for now.

Use kNN to fill in the missing values

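# DMwR provides knnImputation(); it has since been archived on CRAN,
# and its successor DMwR2 also exports knnImputation()
install.packages("DMwR")
library(DMwR)
library(lattice)
library(grid)
# rpart provides the decision tree
install.packages("rpart")
library(rpart)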

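# Treat Survived as a class label rather than a number
titanic$Survived <- as.factor(titanic$Survived)
# Fill the blank Embarked entries with "S"
titanic$Embarked[titanic$Embarked == ""] <- "S"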

knn_titanic <- knnImputation(titanic)
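To show the idea behind kNN imputation, here is a simplified base-R sketch. It is not DMwR's implementation: the helper knn_impute_column, its arguments, and the toy data frame are all hypothetical, it handles a single numeric column, it assumes the predictor columns have no missing values, and it fills each gap with the median of the k nearest rows.

```r
# Simplified kNN imputation for one numeric column (illustrative only;
# this is not how DMwR::knnImputation is implemented).
knn_impute_column <- function(df, target, predictors, k = 3) {
  # Scale the predictors so no single column dominates the distance.
  # Assumes the predictor columns themselves contain no NA.
  x <- scale(as.matrix(df[, predictors, drop = FALSE]))
  missing_rows  <- which(is.na(df[[target]]))
  complete_rows <- which(!is.na(df[[target]]))
  for (i in missing_rows) {
    # Euclidean distance from row i to every row with a known target value
    diffs <- sweep(x[complete_rows, , drop = FALSE], 2, x[i, ])
    d <- sqrt(rowSums(diffs^2))
    nearest <- complete_rows[order(d)[seq_len(min(k, length(d)))]]
    # Fill the gap with the median of the k nearest neighbours' values
    df[[target]][i] <- median(df[[target]][nearest])
  }
  df
}

# Hypothetical toy data: the third row is missing Age
toy <- data.frame(
  Pclass = c(1, 1, 1, 3, 3, 3),
  Fare   = c(80, 75, 90, 8, 7, 9),
  Age    = c(40, 42, NA, 20, 22, 24)
)
imputed <- knn_impute_column(toy, target = "Age",
                             predictors = c("Pclass", "Fare"), k = 2)
print(imputed$Age)
```

In the toy data the missing row is closest to the two other first-class rows, so its Age is filled from their values rather than from the third-class rows.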

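# Fit a classification tree on the imputed training data
tree_fit <- rpart(Survived ~ Pclass + Sex + SibSp + Parch + Fare, knn_titanic, method = "class")
# Predict a class label for each row of the test set
predicting <- predict(tree_fit, testing[, c("Pclass", "Sex", "SibSp", "Parch", "Fare")], type = "class")
# as.numeric() on a factor returns the level codes 1/2; subtract 1 to get back 0/1
predicting <- as.numeric(predicting)
to_submit <- data.frame(PassengerId = testing$PassengerId, Survived = predicting - 1)
write.csv(to_submit, "titanic_submit_4_preprocessing_knn&class.csv", row.names = FALSE)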
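The as.numeric() followed by "minus 1" step works because as.numeric() on a factor returns the internal level codes (1, 2, ...) rather than the labels. The sketch below uses a hypothetical pred factor to show the difference; converting through as.character() is a more defensive alternative that does not depend on the order of the levels.

```r
# Hypothetical predictions with levels "0" and "1", as rpart would return
pred <- factor(c("0", "1", "1", "0"), levels = c("0", "1"))

# Level codes: positions of each value in levels(pred), starting at 1
codes <- as.numeric(pred)

# Label values: convert to character first, then to numeric
labels <- as.numeric(as.character(pred))

# With levels "0","1" in this order, codes - 1 equals the labels
print(codes - 1)
print(labels)
```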

The decision tree's prediction scored 0.789
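If the 0.789 is read as an accuracy-style score (an assumption; the post does not say how it was computed), a comparable figure can be estimated locally by comparing predictions with known labels. The actual and predicted vectors below are hypothetical and only illustrate the calculation.

```r
# Hypothetical true labels and model predictions (not real results)
actual    <- c(0, 1, 1, 0, 1, 0, 0, 1, 0, 0)
predicted <- c(0, 1, 0, 0, 1, 0, 1, 1, 0, 0)

# Confusion matrix: rows are true labels, columns are predictions
confusion <- table(actual = actual, predicted = predicted)
print(confusion)

# Accuracy: the share of predictions that match the true label
accuracy <- mean(predicted == actual)
print(accuracy)
```

On real data this would be run on rows held out from training, since the labels of the test set are not available locally.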





