Titanic Data Analysis #1
Preprocessing:
library(ggplot2)
# Bar chart of Survived (0 = died, 1 = survived), with each bar split by Sex
ggplot(train, aes(x=Survived))+geom_bar(aes(fill=Sex))
The plot shows a high death rate among men and a lower one among women → Sex affects Survived
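The same comparison can be read off as numbers rather than from bar heights. The following is a minimal sketch on a made-up toy data frame (the values are illustrative, not taken from the Titanic data); it assumes only that the data has a Sex column and a 0/1 Survived column, as in the plot above.

```r
# Toy data: illustrative values only, not the real Titanic passengers
toy <- data.frame(
  Sex      = c("male", "male", "male", "male",
               "female", "female", "female", "female"),
  Survived = c(0, 0, 0, 1, 1, 1, 1, 0)
)

# Cross-tabulate Sex against Survived
counts <- table(toy$Sex, toy$Survived)

# margin = 1 normalises each row, giving the died/survived share within each sex
rates <- prop.table(counts, margin = 1)
print(round(rates, 2))
```

Running the same two calls on train$Sex and train$Survived would give the actual per-sex rates behind the chart.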
# Bin Age into coarse groups; rows with a missing Age keep Age_level = NA
train$Age_level[train$Age < 10] <- "0-10"
train$Age_level[train$Age >= 10 & train$Age <20] <- "10-20"
train$Age_level[train$Age >= 20 & train$Age <30] <- "20-30"
train$Age_level[train$Age >= 30 ] <- "30-"
train$Age_level <- as.factor(train$Age_level)
# Same bar chart as before, now split by age group
ggplot(train, aes(x=Survived))+geom_bar(aes(fill=Age_level))
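The Age_level proportions are about the same for Survived = 0 and Survived = 1 → no effect
Based on this analysis (the Embarked analysis is omitted), Age and Embarked can be dropped for now.
Use kNN to fill in the missing values: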
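As a side note, the age binning above could also be written with base R's cut(), which builds the factor in one step. This is a sketch on a made-up vector of ages, not the post's original code; the labels are chosen to match the ones used above.

```r
# Illustrative ages, including one missing value
age <- c(4, 10, 25, 30, 62, NA)

# right = FALSE makes the intervals [0,10), [10,20), [20,30), [30,Inf),
# matching the >= / < comparisons used in the manual binning
age_level <- cut(
  age,
  breaks = c(0, 10, 20, 30, Inf),
  right  = FALSE,
  labels = c("0-10", "10-20", "20-30", "30-")
)
print(age_level)
```

Missing ages stay NA here as well, so the two approaches should produce the same groups.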
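To make the idea concrete: kNN imputation fills a missing value from the rows that look most similar on the other columns. Below is a minimal base-R sketch of that idea on made-up toy data. The helper knn_impute_column is hypothetical and written only for illustration; it takes a plain mean of the k nearest rows and is not the implementation used by the DMwR package.

```r
# Toy data (made-up values): Age is missing in the last row
toy <- data.frame(
  Fare  = c(7, 8, 9, 70, 80, 8),
  SibSp = c(0, 0, 0, 1, 1, 0),
  Age   = c(22, 24, 26, 50, 54, NA)
)

# Hypothetical helper: fill NAs in `target` with the mean of the k nearest
# rows that have a known value, using Euclidean distance on scaled `features`
knn_impute_column <- function(df, target, features, k = 3) {
  x <- scale(as.matrix(df[, features, drop = FALSE]))
  known <- which(!is.na(df[[target]]))
  for (i in which(is.na(df[[target]]))) {
    # Difference between every known row and row i, feature by feature
    diffs <- sweep(x[known, , drop = FALSE], 2, x[i, ])
    d <- sqrt(rowSums(diffs^2))
    nearest <- known[order(d)[seq_len(min(k, length(known)))]]
    df[[target]][i] <- mean(df[[target]][nearest])
  }
  df
}

filled <- knn_impute_column(toy, "Age", c("Fare", "SibSp"), k = 3)
print(filled)
```

In this toy example the last row is closest to the three low-fare passengers, so its Age is filled from their ages rather than from the two high-fare ones.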
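# DMwR provides knnImputation(). Note: DMwR may no longer be installable
# directly from CRAN; its successor DMwR2 also provides knnImputation().
install.packages("DMwR")
library(DMwR)
library(lattice)
library(grid)
# rpart provides decision trees
install.packages("rpart")
library(rpart)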
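# Treat the outcome as categorical so that rpart fits a classification tree
titanic$Survived <- as.factor(titanic$Survived)
# Fill blank Embarked entries with "S"
titanic$Embarked[titanic$Embarked==""] <- "S"
# Impute the remaining missing values with k-nearest neighbours
knn_titanic <- knnImputation(titanic)
# Fit the tree on the retained predictors (Age and Embarked are left out)
tree_fit <- rpart(Survived ~ Pclass+Sex+SibSp+Parch+Fare, knn_titanic, method = "class")
# Predict a class label for every passenger in the test set
predicting <- predict(tree_fit, testing[,c("Pclass","Sex","SibSp","Parch","Fare")],type = "class")
# as.numeric() on a factor returns the level codes 1/2, hence the -1 below
predicting <- as.numeric(predicting)
to_submit <- cbind(PassengerId=testing$PassengerId, Survived=(predicting-1))
write.csv(to_submit, "titanic_submit_4_preprocessing_knn&class",row.names = FALSE)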
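The predicting-1 step above relies on how R stores factors: as.numeric() on a factor returns the internal level codes (1, 2, ...), not the labels. A small sketch with a made-up factor of predictions shows the conversion, along with an alternative that goes through the labels instead.

```r
# Made-up predictions with levels "0" and "1", like the output of predict(..., type = "class")
pred <- factor(c("0", "1", "1", "0"), levels = c("0", "1"))

# Level codes are 1 and 2, so subtracting 1 recovers the 0/1 labels
from_codes <- as.numeric(pred) - 1

# Alternative: convert the labels themselves, which does not depend on level order
from_labels <- as.numeric(as.character(pred))

print(from_codes)
print(from_labels)
```

Both give the same 0/1 vector here; the label-based version is the safer habit when the factor levels might not be exactly "0" and "1" in that order.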