
Titanic Data Analysis #4

##Read files
path_1 <- "D:/R language/Kaggle/Titanic/train.csv"
train <- read.csv(path_1)
path_2 <- "D:/R language/Kaggle/Titanic/test.csv"
test <- read.csv(path_2)

##Install packages
install.packages("DMwR")
library(DMwR)
install.packages("randomForest")
library(randomForest)
install.packages("party")
library(party)

##Data explore
test$Survived <- NA
combine <- rbind(train, test)
str(combine)

'data.frame': 1309 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ Si...
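The str() output above already shows NA values (for example in Age). As a small follow-up sketch of my own (not part of the original post), a quick way to count missing values per column, assuming combine was built as above:

colSums(is.na(combine))  # number of NA values in each column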

Classification Algorithms

CART (Classification and Regression Trees)

Each node splits into two branches; a good split leaves each branch with less data and higher homogeneity (more pure). Purity is usually pursued by maximizing accuracy or minimizing the misclassification rate, but higher accuracy does not by itself mean purity has been achieved: if the split conditions are drawn finely enough, accuracy can be pushed to the maximum (e.g. with 100 records, 99 split conditions give 100% classification), yet the resulting tree becomes overly complex and large. CART uses the Gini index: Gini = 1 - p1^2 - p2^2. A smaller Gini means the attribute is better suited as a node condition. Δinfo (information gain) is the difference between the impurity of the parent node before the split and the impurity of the child nodes after the split; the attribute with the larger gain is chosen as the split condition. (A sketch of both calculations follows this section.)

Greedy algorithm

The decision tree model performs node splitting by searching for the best solution from a given starting point, optimizing only with respect to the current node. Once a split produces new nodes, the decisions made at earlier nodes are never revisited.

Advantages of tree-based models
- The model is easy to build and easy to understand.
- Thanks to its logical structure, it handles variables of any type (numeric/nominal) without variable cleaning or transformation (pre-processing).
- It handles missing data (NA) effectively /how? does this refer to the many methods for imputing missing values?/ and performs feature selection automatically, a practical feature in many real modeling problems.

Note: modeling implicitly performs feature selection, but if two predictors are both important and highly correlated, the tree model picks between them at random as the split variable, so the apparent importance of both variables is weakened.

Selection bias: when deciding the split variable, tree models tend to favor variables with more distinct values...
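As a minimal sketch (my own illustration, not code from the post), the Gini index and information gain described above can be computed in R as follows; the example split on Sex at the end assumes the Titanic train data has been loaded as in the posts below.

gini <- function(y) {
  p <- table(y) / length(y)    # class proportions p1, p2, ...
  1 - sum(p^2)                 # Gini = 1 - p1^2 - p2^2 - ...
}

# Information gain (Δinfo) of a binary split: parent impurity minus the
# size-weighted impurity of the two child nodes.
info_gain <- function(y, left) {
  n  <- length(y)
  nl <- sum(left)
  gini(y) - (nl / n) * gini(y[left]) - ((n - nl) / n) * gini(y[!left])
}

# Example: candidate split on Sex; a larger Δinfo means a better split.
info_gain(train$Survived, train$Sex == "female")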

Titanic Data Analysis #3

#Read files
path_1 <- "D:/R language/Kaggle/Titanic/train.csv"
train <- read.csv(path_1)
path_2 <- "D:/R language/Kaggle/Titanic/test.csv"
test <- read.csv(path_2)

install.packages("DMwR")
library(DMwR)
install.packages("rpart")
library(rpart)

test$Survived <- NA
train$Embarked[train$Embarked==""] <- "S"
combine <- rbind(train, test)

#Impute the missing values with knn
combine <- knnImputation(combine)

#Pre-processing: find passengers from the same family to derive a new attribute, FamilyId
combine$Familymemb <- combine$SibSp + combine$Parch + 1
combine$Name <- as.character(combine$Name)
combine$Title <- sapply(combine$Name, FUN=function(x){strsplit(x, split="[,.]")[[1]][2]})
combine$Title <- sub(" ", "", combine$Title)  #remove the leading space
combine$Title[combine$Title %in% c('Mme', 'Mlle')] <- 'Mlle'
combine$Title[combine$Title %in% c('Capt', 'Don', 'Major', 'Sir')] ...
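To make the Title extraction above concrete, here is a small worked example of my own (the name is one in the dataset's format): strsplit splits the name on "," and ".", element [[1]][2] is the title with a leading space, and sub(" ", "", ...) strips that space.

name <- "Braund, Mr. Owen Harris"
strsplit(name, split="[,.]")[[1]]                     # "Braund" " Mr" " Owen Harris"
sub(" ", "", strsplit(name, split="[,.]")[[1]][2])    # "Mr"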

Titanic Data Analysis #2

2017/3/11

Today I practiced running my own accuracy test with the train data Kaggle provides, comparing which combination gives the higher accuracy: values imputed by knn or by mice, paired with a model built by CART or by random forest.

# Read files
path_1 <- "D:/R language/Kaggle/Titanic/train.csv"
train <- read.csv(path_1)
path_2 <- "D:/R language/Kaggle/Titanic/test.csv"
test <- read.csv(path_2)

# Install Packages
install.packages("DMwR")
library(DMwR)
install.packages("rpart")
library(rpart)
install.packages("randomForest")
library(randomForest)
install.packages("mice")
library(mice)

# Split train into training and testing data at a 7:3 ratio
# Build testing mechanism
train$Embarked[train$Embarked == ""] <- "S"
check_training <- subset(train[1:round(0.7*891), ])
original_check_testing <- subset(train[(round(0.7*891)+1):891, ])
check_testing <- subset(train[(round(0.7*891)+1):891, ], select = -c(Survived))
...
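The post is truncated here. As a minimal sketch of how the accuracy test described above could continue (my own assumed continuation, not the original code), one could fit a CART model on the 70% portion, predict on the held-out 30%, and compare against the true labels:

# Fit CART on the 70% training portion (rpart tolerates NA via surrogate splits)
fit_cart <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
                  data = check_training, method = "class")

# Predict on the held-out 30% and measure the proportion classified correctly
pred <- predict(fit_cart, check_testing, type = "class")
mean(pred == original_check_testing$Survived)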