[KFQ] KFQ 1차 프로젝트 / 국내 영화관람객수 예측

본 포스팅은 KFQ 한국품질재단 인공지능 개발자 교육과정 1차 프로젝트의 발표 ppt 입니다.

저희 조의 주제는 코로나19에 따른 국내 영화산업 동향 및 전망 분석이었습니다.

PPT 및 프로젝트 설명

1. 프로젝트 개요

코로나로 인해서 무너지고 있는 영화산업, 이에 대한 반등으로 증가하고 있는 국내 OTT 시장. 이러한 뉴스를 다들 접해 보셨을 거라 생각한다.

저희 조는 이러한 뉴스에서 아이디어를 얻어서 프로젝트 주제를 잡고, 다양한 분석 방법을 통해 국내 영화 산업이 나아가야할 방향성 등의 인사이트를 도출하고자 했다.

2. 프로젝트 요약 및 환경

저희는 간트차트에서 확인할 수 있듯이, 많은 데이터 테이블을 사용하여, 주제선정 이후 데이터 수집과 EDA, 전처리에서 많은 시간을 사용했다.

분석과 학습 등의 과정은 R에서 진행하였고, 국내통계 사이트 등의 OpenAPI를 Json 형식으로 크롤링, 핸들링할 때에는 파이썬을 같이 사용했다.

3. 진행과정 및 산출물

수집하고 실제 사용한 데이터는 총 6가지이다.

영화 관람객 수 데이터
- 일별 총 관객수 및 매출액
- 사용변수: 전체 상영 영화 수, 전국 스크린 수, 전체 관람객 수 등
- 변수 정제: 변수형 변경 (Char 변수 > 수치형 변수)
- 시각화: 국내/해외/전체영화, 코로나 전/후
- 이상치: box-cox 변환
- 출처: KOFIC KOBIS 영화관 입장권 통합 전산망
  코로나 이전과 이후 관람객 수 변화, 토-일-수의 순서로 관람객이 많은 것, 계절에 대한 패턴이 존재하는 것 등을 시각화 자료에서 확인 할 수 있다.
코로나 관련 데이터
- 결측치: 뉴스 기사와 비교 후 수기 작성
- 사용변수: 신규/누적 확진자 수, 신규/누적 사망자 수, 치료 중 환자 수 등
- 변수 정제: 신규 인원에 관한 파생 변수 생성, 치료 중 환자 수 > 생존한 치료 중 환자 수
- 출처: 공공데이터포털 보건복지부 코로나19 감염현황 조회서비스 Open API
영화 검색량
- 사용변수: 수집기간 기준 top100 영화명을 바탕으로 ‘영화’ 키워드 검색량
- 출처: Naver DataLab 통합 검색어 트렌드 API
  인기 상영작의 개봉, 수상 전후로 관람객 수가 급증하는 것을 시각화 자료에서 확인 할 수 있다.
날씨
- 결측치: 강수량/적설량은 0, 기온 및 습도 결측치는 제거
- 사용변수: 최고/최저/평균 기온, 평균 강수량, 평균/최고 풍속, 평균 습도, 평균 적설량 등
- 변수 정제: 일별 불쾌지수 파생변수 추가, 지역별로 구분된 데이터 group by로 일자별로 평균값 사용
- 출처: 기상청 지상 (종관, ASOS) 일자료 조회서비스 Open API
미세먼지
- 결측치: 평균값으로 대체
- 사용변수: 미세먼지농도
- 변수 정제: 지역별로 구분된 데이터 group by로 일자별로 평균값 사용
공휴일
- 사용변수: 공휴일 날짜
- 변수정제: 주말/공휴일을 묶어 휴일, 나머지 평일로 범주형 변수 추가

시간의 흐름에 따른 시각화 자료에서, 코로나 관련 변수들에 있어서는 확실하게 관람객수가 감소한 모습을 볼수 있다.

4. 모델 검증 및 분석

분석에 앞서 각 변수들의 상관관계 그래프에서 유의미한 변수들을 확인하고

산점도 그래프를 통해 선형회귀분석을 진행하기로 하였다.

모델링에 앞서, 전처리과정에서 정규성을 만족하지 못하는 전체 관람객 수에 대하여 box-cox 변환을 진행했다.

변환 이후 정규화가 된 것과 이상치가 제거된 모습을 확인 할 수 있다.

4가지 범주형 변수에 대하여 등분산성 검정을 진행하고, 등분산성을 만족치 못하는 변수에 대해

추가로 welch t 검정을 통해, 평균에는 차이가 존재하여 통계적으로 유의미한 변수라 판단했다.

Anova 검정을 통해 유의미한 변수들의 판단과 함께 각 변수들의 교호관계도 확인했다.

해당 변수들로 선형 회귀를 진행하였을 때 81%의 설명력이 나왔지만, 변수가 너무 많아 발생하는 차원문제를 해결하기 위해서

best subset, 전진선택법, 후진소거법, 단계적 선택법의 4가지 변수 선택법을 모두 진행하였고 모두 동일한 8개의 변수를 선택하였다.

선택된 변수를 사용한 선형 회귀 모델에서 설명력 80%, 다중공선성에 문제 없음을 확인하였고

추가로 이상치를 제거하여 설명력을 83%까지 올렸다.

x 변수들을 scaling 하였을 때, 변수를 많이 선택할 수록 오차율이 낮아지는 것을 볼 수 있었지만, 차원 문제로 8개의 변수를 선택하였고, 이 역시 동일한 결론을 얻었다.

하지만 해당 모델의 완성 후, durbin watson 검정을 통해 독립성이 만족하지 않는 것을 발견, 이를 보완하기 위한 비선형 모델과 시계열 모델을 추가로 고려했다.

비선형 모델 중, 비선형 데이터에 적합하며, 오버피팅과 다중공선성을 방지 할 수 있고, 변수 중요도를 파악할 수 있는 장점에서 랜덤포레스트 모델을 사용했다.

최적의 결과를 확인하였을 때, 치료 중 환자 수가 학습에 있어서 가장 중요한 변수로 파악되었고, 시각화 자료에서 확인 할 수 있듯이, 대유행 시기에 맞춰 급증하는 치료 중 환자 수와 함께 급감하는 관람객 수를 볼 수 있다.

시계열 분석을 통해 경향성, 계절성을 분리시킨 이동평균 모델, 자기상관함수 확인을 통한 자기회귀모델에서 유의 수준 이상의 p 값 확인을 통해 시계열 arima 모델을 생성했다.

3개월, 6개월, 12개월, 24개월에 대한 예측 그래프와 함께 예측된 2021년 5월의 평균 관객 92458명, 실제 평균관객 141230명에서 box-cox 정규화 이후 4.875%의 상대오차로 예측이 완료되었다.

해당 오차의 차이는, 분노의 질주, 크루엘라 등 인기 개봉작에 의한 관객수 영향과 코로나 백신의 접종이 진행되며 일어난 것으로 파악 중이다.

5. 결론

랜덤 포레스트 모델로부터 치료 중 환자 수가 영화 관객 수 예측에 있어서 압도적으로 중요한 변수라는 것과 함께

시계열 모델로 계절성과 트렌드에 민감한 향후 2년간의 관객 수를 예측했다.

앞서 독립성을 만족하지 못하여, 분석의 의미가 퇴색된 선형회귀모델과 앙상블하여 더 뛰어난 모델을 생성할 경우 더욱 뛰어난 예측이 가능할 것으로 보이며

백신의 접종이 꾸준히 진행 되고 있기에, 관람객 수가 회복되기까지 국내 영화산업계가 조금만 더 버텨준다면, 이전 수준으로 회복하여 다시 빛을 볼 수 있을 것으로 예상된다.

보고서

느낀점

협업에서의 양식 통일의 필요성을 느꼈다.

또한 소스 코드 업데이트 시, 기록이 남아있지 않아 좀 애먹었다. (사실 아직도 애먹고있다. 나중에 다시봐도 뭔말인지 읽기 힘들정도)

덕분에 Git을 사용해야겠다는 생각을 시작하였고, 흐르고 흘러 블로그까지 만들 수 있었다.

프로젝트 진행 기간

2021년 5월 7일 금요일 ~ 6월 7일 월요일

R 전체 코드 (주의, 좀 김, 정리가 안되어있음..)

  
# setting
rm(list=ls())
getwd()
setwd('C:/Users/eric3/Desktop/example/large')
setwd("C:\\19tak\\ai 개발자 과정\\prj1\\data")

library(car)
library(leaps)
library(tidyverse)
library(lubridate)
library(corrplot)
library(glmnet)
library(ggpubr)
library(rstatix)
library(onewaytests)
library(gganimate)
library(dplyr)

options(warn = -1)
par(mfrow=c(1,1))

# Start
# 열 재조정
# 일단은 전체 데이터에 대해서만
all = read.csv('all_final2.csv')
all = all[-1]
all = all %>% 
  mutate(pm = exp(log_pm),
         Date = as.Date(Date)) %>% 
  select(-c(log_pm)) %>% 
  select(c(1,26,25,2:11,27,12:27)) %>% 
  rename(research_volume = count)

# 미세먼지 
# 1. exp 취하기
# 2. 범주 만들기
dim(all)
      
## 전체 영화에 대한 분석 ## (all)
### EDA ###
str(all)
# 1. 결측치 확인 : 무
sum(!complete.cases(all))

### 2. 이상치 확인
all_num = all %>% select(-c(Date,covid,weekdays,weekdays_kind,d_index))
str(all_num)

# u+- 3sigma 밖은 이상치로 판단
rm_outlier <- function(x){
  mu = mean(x)
  sigma = sd(x)
  if (x >= mu-3*sigma & x <= mu+3*sigma){
    print('no outlier')
  } else{
    return(x)    
  }
}

apply(all_num,2,rm_outlier)
# 이상치 없음을 확인

# 상관관계 시각화
cor = all %>% select(-c(Date,covid, weekdays,weekdays_kind, d_index, deathCnt)) %>% 
  cor() %>% round(2)
par(mfrow=c(1,1))
# 유의성 matrix
cor.mtest <- function(mat, conf.level = 0.95) {
  mat <- as.matrix(mat)
  n <- ncol(mat)
  p.mat <- lowCI.mat <- uppCI.mat <- matrix(NA, n, n)
  diag(p.mat) <- 0
  diag(lowCI.mat) <- diag(uppCI.mat) <- 1
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      tmp <- cor.test(mat[, i], mat[, j], conf.level = conf.level)
      p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
      lowCI.mat[i, j] <- lowCI.mat[j, i] <- tmp$conf.int[1]
      uppCI.mat[i, j] <- uppCI.mat[j, i] <- tmp$conf.int[2]
    }
  }
  return(list(p.mat, lowCI.mat, uppCI.mat))
}

res1 <- cor.mtest(cor, 0.95)

## 유의수준에 따라 유의하지 않은 경우 X 표시하기
corrplot.mixed(cor,lower="color",upper="number",p.mat=res1[[1]],sig.level=0.05, mar = c(4,4,4,4),tl.cex = 0.5, tl.col= 'blue')
l=length(cor[1,])
for(i in 1:l){
  for(j in 1:l)
  {
    if(i >= j) next
    text(i,11-j+1,round(res1[[1]][i,j],2))
  }
}

# 산점도 파악
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor=0.1, ...)
{
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- cor(x, y)
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste0(prefix, txt)
  if(missing(cex.cor)) 
    cex.cor <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex = cex.cor)
}

# 수치형 변수에 대해서만
pairs(all_num, lower.panel = panel.cor, upper.panel = panel.smooth)
str(all2)

#####################################
# Visualization
# all2 : 연도/월/일로 분할 해보기
str(all)
all2 = all %>% mutate(year = year(Date),
                      month = month(Date),
                      day = day(Date))

# 시간이 지남에 따른 최고 풍속과 관객수간의 관계
hwind <- all2 %>% ggplot(aes(mean_high_wind, a_aud, color = covid))+
  geom_point(show.legend = T, alpha=1)+
  scale_y_log10()+
  theme_bw()+
  labs(x = "평균적인 최고 풍속", y = "관객수")
hwind
hwind + transition_time(year) +
  labs(title = "Year: {frame_time}")+
  ease_aes('sine-in-out')

# 치료중인 확진자 수과 관객수간의 관계
care <- all2 %>% ggplot(aes(careCnt,a_aud, color = covid))+
  geom_point(show.legend = T, alpha=1)+
  scale_y_log10()+
  theme_bw()+
  labs(x = "치료중인 확진자 수 ", y = "관객수")
care
care + transition_time(year) +
  labs(title = "Year: {frame_time}")+
  ease_aes('sine-in-out')

# 누적 확진자 수과 관객수간의 관계
cumu <- all2 %>% ggplot(aes(decideCnt,a_aud, color = covid))+
  geom_point(show.legend = T, alpha=1)+
  scale_y_log10()+
  theme_bw()+
  labs(x = "누적 확진자 수 ", y = "관객수")
cumu
cumu + transition_time(year) +
  labs(title = "Year: {frame_time}")+
  ease_aes('sine-in-out')

# 신규 확진자 수와 관객수 간의 관계
new <- all2 %>% ggplot(aes(newCnt,a_aud, color = covid))+
  geom_point(show.legend = T, alpha=1)+
  scale_y_log10()+
  theme_bw()+
  labs(x = "신규 확진자 수 ", y = "관객수")
new
new + transition_time(year) +
  labs(title = "Year: {frame_time}")+
  ease_aes('sine-in-out')

# research volume
research <- all2 %>% ggplot(aes(research_volume, a_aud, color = covid))+
  geom_point(show.legend = T, alpha=1)+
  scale_y_log10()+
  theme_bw()+
  labs(x = "검색량", y = "관객수")
research
research + transition_time(year) +
  labs(title = "Year: {frame_time}")+
  ease_aes('sine-in-out')

# 최고 풍속과 관객 수
## 미세먼지에는 영향을 받는것 같진 않다
# 미세먼지 범주 추가해서 봐보자
p <- ggplot(data=all2)+
  geom_point(aes(x=mean_high_wind, y=a_aud))
p
p+ facet_grid(covid~weekdays_kind)+
  scale_colour_viridis_d(option = "inferno")

# 불쾌지수도 딱히
p1 <- ggplot(data=all2)+
  geom_point(aes(x=mean_high_wind, y=a_aud,color= d_index))
p1+facet_grid(covid~weekdays_kind)+
  scale_colour_viridis_d(option = "inferno")

# 신규확진자 수와 관객 수
p2 <- ggplot(data=all2)+
  geom_point(aes(x=newCnt, y=a_aud,color= pm))
p2+facet_grid(covid~weekdays_kind)+
  scale_colour_viridis_d(option = "inferno")

# 뚜렷한 경향성을 보이는 변수는 찾지 못함

################# 월 단위 ############
# 18/19/20/21년 월 단위 평균 관객 수
all_2_2018 = all2 %>% filter(year == 2018) %>%
  group_by(month) %>%
  summarise(mean_audience = mean(a_aud))
all_2_2019 = all2 %>% filter(year == 2019) %>%
  group_by(month) %>%
  summarise(mean_audience = mean(a_aud))
all_2_2020 = all2 %>% filter(year == 2020) %>%
  group_by(month) %>%
  summarise(mean_audience = mean(a_aud))
all_2_2021 = all2 %>% filter(year == 2021) %>%
  group_by(month) %>%
  summarise(mean_audience = mean(a_aud))

all_2_2018 = all_2_2018 %>% mutate(year = 2018)
all_2_2019 = all_2_2019 %>% mutate(year = 2019)
all_2_2020 = all_2_2020 %>% mutate(year = 2020)
all_2_2021 = all_2_2021 %>% mutate(year = 2021)

month = rbind(all_2_2018,all_2_2019,all_2_2020,all_2_2021) %>% 
  arrange(month)
month

# 
c = month %>% ggplot(aes(month, mean_audience))+
  geom_bar(stat = 'identity')+
  theme_bw()
c = c+facet_wrap(~year)+ scale_x_continuous(breaks = c(1,2,3,4,5,6,7,8,9,10,11,12))+
  theme_minimal()+
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank())+ 
  ggtitle('월별 평균 관객 수')+
  theme(plot.title = element_text(colour = 'black', size=20))+
  theme(axis.title.x = element_text(colour = 'blue',size = 12),
        axis.title.y = element_text(color =  'blue', size = 12))
c
# animation 
c +transition_states(year,  state_length = 8, transition_length = 3) +
  ease_aes('quintic-in')

################# 요일 단위 ############
# 18/19/20/21년 요일 단위 평균 관객 수
all_2_2018 = all2 %>% filter(year == 2018) %>%
  group_by(weekdays) %>%
  summarise(mean_audience = mean(a_aud))
all_2_2019 = all2 %>% filter(year == 2019) %>%
  group_by(weekdays) %>%
  summarise(mean_audience = mean(a_aud))
all_2_2020 = all2 %>% filter(year == 2020) %>%
  group_by(weekdays) %>%
  summarise(mean_audience = mean(a_aud))
all_2_2021 = all2 %>% filter(year == 2021) %>%
  group_by(weekdays) %>%
  summarise(mean_audience = mean(a_aud))

all_2_2018 = all_2_2018 %>% mutate(year = 2018)
all_2_2019 = all_2_2019 %>% mutate(year = 2019)
all_2_2020 = all_2_2020 %>% mutate(year = 2020)
all_2_2021 = all_2_2021 %>% mutate(year = 2021)

weekdays = rbind(all_2_2018,all_2_2019,all_2_2020,all_2_2021)

# 
d = weekdays %>% ggplot(aes(weekdays, mean_audience))+
  geom_bar(stat = 'identity')+
  theme_bw()+facet_wrap(~year)+scale_x_discrete(labels = c('Fri','Mon','Sat','Sun','Thu','Tue','Wed'))
d = d+theme_minimal()+
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank())+ 
  ggtitle('요일별 평균 관객 수')+
  theme(plot.title = element_text(colour = 'black', size=20))+
  theme(axis.title.x = element_text(colour = 'blue',size = 12),
        axis.title.y = element_text(color =  'blue', size = 12))
d

# animation 
d +transition_states(year,  state_length = 8, transition_length = 3) +
  ease_aes('quintic-in')

# EDA 과정에서 통계적 분석
###### statistical test #####
### one-way anova
# 모집단의 분산이 다를 때 표본의 평균차이를 검정하는 Welch Test 확인
# p-value가 작다? : 등분산성 만족하지 않음
bartlett.test(a_aud ~ covid, data = all)
welch.test(a_aud~ covid, data = all)
bartlett.test(a_aud ~ weekdays, data = all)
welch.test(a_aud~ weekdays, data = all)
bartlett.test(a_aud ~ weekdays_kind, data = all)
welch.test(a_aud~ weekdays_kind, data = all)
bartlett.test(a_aud ~ d_index, data = all) # 등분산 만족
summary(aov(a_aud~ d_index, data=all)) # 평균간 차이가 없다!
# y를 scaling해보고 시도해보자
## box-cox 변환
hist(all$a_aud)
hist(log(all$a_aud))
boxCox(all$a_aud~1)
a = boxCox(all$a_aud~1)
a
which.max(a$y) # 55 index
a$x[55]
all_box = all %>% mutate(a_aud = (a_aud^0.1818182-1)/0.1818182)
hist(all_box$a_aud)
all_box = all %>% mutate(a_aud = (a_aud^0.1818182-1)/0.1818182)
bartlett.test(a_aud ~ covid, data = all_box)
bartlett.test(a_aud ~ weekdays, data = all_box)
bartlett.test(a_aud ~ weekdays_kind, data = all_box) # 얘만 등분산 만족하지 않음
bartlett.test(a_aud ~ d_index, data = all_box)

# 추가 test
welch.test(a_aud~ weekdays_kind, data = all_box)

summary(aov(a_aud ~ covid, data = all_box))
summary(aov(a_aud ~ weekdays, data = all_box))
summary(aov(a_aud ~ weekdays_kind, data = all_box))
summary(aov(a_aud ~ d_index, data = all_box)) # 혼자 중요하지 않네


# weekdays는 그룹이 7개이기 때문에 multiple comparison 필요
attach(all_box)
weekdays <- factor(weekdays)
levels(weekdays) <- c("mon",'tue','wed','thu','fri','sat','sun')
tapply(a_aud,weekdays,mean)
tapply(a_aud,weekdays,sd)

# one way anova - multiple comparison 자주 사용되는 (holm : 본페르니보다 덜 보수적)
pairwise.t.test(a_aud, weekdays, "bonferroni")
pairwise.t.test(a_aud, weekdays, "holm")

# 결론 : 두 방법 모두 (월요일,금요일),(월요일과 일요일),(화요일과 금요일),(화요일과 토요일),(수요일과 목요일),(금요일과 토요일),(금요일과 일요일)은 모평균의 차이가 없다 
# 차이가 있는 요일쌍은 (월,화) (월,수) (월,목) (월,토) (화,수) (화,목) (화,일) (수,금) (수,토) (수,일) (목,토) (목,금) (목,일) (토,일)
# 오늘 영화관에 사람이 많다면, 언제가야 사람이 적을까에 대한 고찰 가능

#1
# covid & 평일여부
interaction.plot(x.factor = all_box$covid, trace.factor = all_box$weekdays_kind, 
                 response = all_box$a_aud, fun = mean, 
                 type = "b", legend = TRUE, 
                 xlab = "covid", ylab="영화관객 수",
                 pch=c(1,19), col = c("#00AFBB", "#E7B800") , main = 'Covid & 평일여부',
                 trace.label = "평일 여부",lwd = 2)
#2
# covid & 요일
interaction.plot(x.factor = all_box$covid, trace.factor = all_box$weekdays, 
                 response = all_box$a_aud, fun = mean, 
                 type = "b", legend = TRUE, 
                 xlab = "covid", ylab="영화관객 수",
                 pch=c(1,19), col = 1:7, main = 'Covid & 요일',
                 trace.label = "요일",lwd = 2)

boxplot(a_aud ~ covid * weekdays, data=all_box, frame = F, 
        col=4:5, xlab = 'covid*요일',ylab="영화관객 수")
#3
# 요일 & 평일여부
interaction.plot(x.factor = all_box$weekdays_kind, trace.factor = all_box$weekdays, 
                 response = all_box$a_aud, fun = mean, 
                 type = "b", legend = TRUE, 
                 xlab = "요일", ylab="영화관객 수",
                 pch=c(1,19), col = 10:16, main = '요일 & 평일여부',
                 trace.label = "평일 여부",lwd = 2)
#4
# covid & 미세먼지
all_box_pm = all_box %>% mutate(pm = ifelse(pm<=30, '좋음',
                                    ifelse(pm<=80, '보통',
                                           ifelse(pm<=150,'나쁨','매우나쁨'))))
# 순서형 범주로 만들어야함
all_box_pm[,"pm"] <- factor(all_box_pm[, "pm"], ordered=TRUE)
levels(all_box_pm$pm) = c('좋음','보통','나쁨','매우 나쁨')
table(all_box_pm$pm)
# 미세먼지와 코로나 전후 변수 사이 교호작용 존재 X
all_box_pm %>% ggline(x = "covid", y = "a_aud", color = "pm",
                  add = c("mean_se"),
                  colour = c("#00AFBB", "#E7B800",'#FC4E07','#GA3E07'))
boxplot(a_aud ~ covid * pm, data=all_box_pm, frame = FALSE, 
        col = 3:4, xlab = 'covid*미세먼지',ylab="영화관객 수")
#5 평일여부 & 미세먼지
interaction.plot(x.factor = all_box$weekdays_kind, trace.factor = all_box_pm$pm, 
                 response = all_box$a_aud, fun = mean, 
                 type = "b", legend = TRUE, 
                 xlab = "평일 여부", ylab="영화관객 수",
                 pch=c(1,19), col = 5:8, main = '미세먼지 & 평일여부',
                 trace.label = "미세먼지 정도",lwd = 2)
#6 요일 & 미세먼지
interaction.plot(x.factor = all_box$weekdays, trace.factor = all_box_pm$pm, 
                 response = all_box$a_aud, fun = mean, 
                 type = "b", legend = TRUE, 
                 xlab = "요일", ylab="영화관객 수",
                 pch=c(1,19), col = 5:8, main = '미세먼지 & 요일',
                 trace.label = "미세먼지 정도",lwd = 2)
## two way anova test
#1
aov1 <- aov(a_aud ~ weekdays_kind + covid + weekdays_kind*covid, data = all_box)
summary(aov1) 
#2
aov2 <- aov(a_aud ~ weekdays + covid + weekdays*covid, data = all_box_pm)
summary(aov2)
#3
aov3 <- aov(a_aud ~ weekdays + weekdays_kind + weekdays*weekdays_kind, data = all_box_pm)
summary(aov3)
#4
aov4 <- aov(a_aud ~ pm + covid + pm*covid, data = all_box_pm)
summary(aov4)
#5
aov5 <- aov(a_aud ~ pm + weekdays_kind + pm*weekdays_kind, data = all_box_pm)
summary(aov5)
#6
aov6 <- aov(a_aud ~ pm + weekdays + pm*weekdays, data = all_box_pm)
summary(aov6)
#7
aov7 <- aov(a_aud ~ pm + weekdays + weekdays_kind+ covid, data = all_box_pm)
summary(aov7)
# 3-way anova
replications(a_aud~ covid*weekdays*weekdays_kind, data=all_box)
with(all_box, tapply(a_aud, list(covid, weekdays), mean));
with(all_box, tapply(a_aud, list(covid, weekdays_kind), mean));
with(all_box, tapply(a_aud, list(weekdays, weekdays_kind), mean));
#8
aov_v <- aov(a_aud ~ covid*weekdays*weekdays_kind, data=all_box)
summary(aov_v)
TukeyHSD(aov_v, conf.level=.95)
options('contrasts')
summary.lm(aov_v)

### Modeling ###
all_box = all_box %>% select(-c(Date))
m3 = lm(a_aud ~., data=all_box)
summary(m3) # 0.81 # 모든 변수포함
m4 = lm(a_aud ~ a_screen+covid+weekdays+weekdays_kind+covid*weekdays_kind + decideCnt + careCnt + d_val ,
        data = all_box) # 상호작용 텀 추가 : 0.805
summary(m4)
m5 = lm(a_aud ~ a_screen+covid+weekdays+weekdays_kind+decideCnt + careCnt + d_val ,
        data = all_box)
summary(m5)
m5 %>% durbinWatsonTest()

# 회귀 진단
par(mfrow = c(2,2))
m4 %>% plot
m4 %>%  durbinWatsonTest()
m4%>% residuals %>% durbinWatsonTest() # 독립성 위반  # -> 시계열모형 고려를 해야한다.
m4%>% residuals %>% shapiro.test() # 정규성 만족
a = all %>% select(-c(Date))
v = lm(a_aud~., data=a)
summary(v)

# 자기상관 모델
par(mfrow=c(1,1))
plot(a$a_aud, type="b")
z <- acf(a$a_aud, type=c('correlation'),plot=TRUE)
round(z$acf,4)

# User Defined Function of Autocorrelation
acf_func <- function(y, lag_k){
  # y: input vector
  # lag_k : Lag order of k
  N = length(y) # total number of observations
  y_bar = mean(y)
  # Variance
  var = sum((y - y_bar)^2) / N
  # Autocovariance
  auto_cov = sum((y[1:(N-lag_k)] - y_bar) * (y[(1+lag_k):(N)] - y_bar)) / N
  # Autocorrelation coefficient = Autocovariance / Variance
  r_k = auto_cov / var
  return(r_k)  
}

acf <- data.frame()
y = all_box$a_aud
for (k in 0:(length(y)-1)){  
  acf_k <- round(acf_func(y, lag_k = k), 4)
  acf[k+1, 'lag'] = k
  acf[k+1, 'ACF'] = acf_k
}
acf %>% head()
N <- length(y)
qnorm(0.975, mean=0, sd=1, lower.tail=TRUE)/ sqrt(N)
# 1.96 / 루트(1096)
qnorm(0.025, mean=0, sd=1, lower.tail=TRUE)/ sqrt(N)

# 시계열 분석 #
all %>% head()
before_c = a %>% filter(covid == 'before')
before_c %>% nrow()
after_c = a %>% filter(covid == 'after')
after_c %>% nrow()
plot.ts(cbind(before_c$Date, before_c$a_aud), main = '추세선')
plot.ts(cbind(after_c$Date, after_c$a_aud), main = '추세선')

# 전처리 간단한
# 시계열 .... y값만 가지고 y값을 예측한다 
# y값만 이용
a$a_aud
all2$a_aud
# 월마다 평균 (관객 수를 월마다 평균)
all2 %>% head()
all3 = all2 %>% mutate(ym = ifelse(nchar(month)==1,'one','two'))
nrow(all3)
p = all3 %>% filter(ym =='one') %>% mutate(month = paste0('0',month),
                                           ym = paste0(year,month)) 
q = all3 %>% filter(ym !='one') %>% mutate(ym = paste0(year,month)) 
all3 = rbind(p,q) 
# 2018년 5월 / 18년 6월 ... 21년 5월 .. 관객수의 평균
# 2018 + 05 -> 201805 .. 문자열 
all3
ts = all3 %>% group_by(ym) %>% 
  summarise(m = mean(a_aud)) 
ts
data = ts$m
data

# 시계열 자료형성
tsdata <- ts(data, start = c(2018,5), frequency = 12)
tsdata
# 추세선 확인
par(mfrow = c(1,1))
ts.plot(tsdata)

# 시계열 분해
plot(stl(tsdata, 'periodic'))

# 변동 요인제거
t <- decompose(tsdata)
attributes(t)
plot(t)

# 계절요인 제거 
rm_season = tsdata - t$seasonal
plot(rm_season)

# 추세요인 제거
rm_trend = tsdata - t$trend
plot(rm_trend)

# 불규칙 요인만 확인
random = t$random
plot(random)

# 자기상관 함수
acf(na.omit(tsdata, main = '자기상관함수', col='red'))
acf(na.omit(rm_season, main = '자기상관함수', col='red'))
acf(na.omit(rm_trend, main = '자기상관함수', col='red'))
acf(na.omit(random, main = '자기상관함수', col='red'))

# 부분 자기상관 함수
pacf(na.omit(tsdata, main = '자기상관함수', col='red'))
pacf(na.omit(rm_season, main = '자기상관함수', col='red'))
pacf(na.omit(rm_trend, main = '자기상관함수', col='red'))
pacf(na.omit(random, main = '자기상관함수', col='red'))

# 차분 시각화
plot(diff(tsdata, differnces=1))
plot(diff(rm_season, differnces=1))
par(mfrow= c(1,2))
ts.plot(tsdata)
diff <- diff(tsdata)
plot(diff)

library(forecast)
# 시계열 모형 식별
ts_model <- auto.arima(tsdata)
ts_model
model <- arima(tsdata, c(0,1,0), seasonal = list(order = c(1,0,0)))
model
tsdiag(model)
# 자기상관 없고, 규칙성 없음 
Box.test(model$residuals, lag=1, type = 'Ljung')
par(mfrow=c(2,2))
fore <- forecast(model,h=3)
plot(fore)
fore2 <- forecast(model, h=6)
fore2
plot(fore2)
fore3 <- forecast(model, h=12)
fore3
plot(fore3)
fore4 <- forecast(model, h=24)
fore4
plot(fore4)
fore
fore2
fore3
fore4

# 이동평균법 vs 지수평활법
#install.packages('TTR')
library(TTR)

#####
# graph #
#####
library(car)
library(MASS)
library(leaps)
library(corrplot)
library(glmnet)
########
# 본격적인 모델링
# best subset selection #
# 요일 포함 bs - 0.79
# 요일 포함하는게 나음
bs = regsubsets(a_aud~., data=all_box, nbest=1)
bs1 = summary(bs)
bs1$adjr2
bs1$bic
bs1$cp
round(coef(bs,5),5)
names(coef(bs,8))[2:9]
par(mfrow =c(2,2))
plot(bs1$adjr2, xlab = '# of Variables', ylab ='AdjR2', type = 'l')
l = which.max(bs1$adjr2)
points(l, bs1$adjr2[l], col= 'red', cex=2, pch=20)
plot(bs1$adjr2, xlab = '# of Variables', ylab ='rss', type = 'l')
l = which.max(bs1$adjr2)
points(l, bs1$adjr2[l], col= 'red', cex=2, pch=20)
plot(bs1$cp, xlab = '# of Variables', ylab ='cp', type = 'l')
l = which.min(bs1$cp)
points(l, bs1$cp[l], col= 'red', cex=2, pch=20)
plot(bs1$bic, xlab = '# of Variables', ylab ='bic', type = 'l')
l = which.min(bs1$bic)
points(l, bs1$bic[l], col= 'red', cex=2, pch=20)
par(mfrow =c(2,2))
plot(bs, scale='r2')
plot(bs, scale='adjr2')
plot(bs, scale='Cp')
plot(bs, scale='bic')

# forward
fw = regsubsets(a_aud~., data=all_box, nbest=1, method= 'forward')
fs = summary(fw)
fs$adjr2
fs$bic
fs$cp
par(mfrow=c(1,1))
plot(fs$adjr2, xlab = '# of Variables', ylab ='AdjR2', type = 'l')
l = which.max(fs$adjr2)
points(l, fs$adjr2[l], col= 'red', cex=2, pch=20)
round(coef(fw,5),5)
round(coef(fw,8),5)

# backward
bk = regsubsets(a_aud~., data=all_box, nbest=1, method= 'backward')
bk1= summary(bk)
bk1$adjr2
bk1$bic
bk1$cp
plot(bk1$adjr2, xlab = '# of Variables', ylab ='AdjR2', type = 'l')
l = which.max(bk1$adjr2)
points(l, bk1$adjr2[l], col= 'red', cex=2, pch=20)
coef(bk,8)

# stepwise
sw = regsubsets(a_aud~., data=all_box, nbest=1, method= 'seqrep')
ss = summary(sw)
ss$adjr2
ss$bic
ss$cp
round(coef(bs,8),5)
round(coef(fw,8),5)
round(coef(bk,8),5)
round(coef(sw,8),5)
# 모두 동일

# 회귀 진단 
d = lm(a_aud~ covid+a_screen+weekdays+weekdays_kind+decideCnt+careCnt+d_val, data = all_box)
summary(d) # 설명력 80.1퍼
# vif check
vif(d)
par(mfrow = c(3,2))
plot(d, which=1:6)

# cook's distance
cooks.distance(d)[cooks.distance(d)>1] # 0개의 influential point
length(rstudent(d)) # 1096개

# 표준잔차 >2 : 이상치라고 판단 (선형회귀분석에 있어서)
aa = rstudent(d)[rstudent(d)>2] 
length(aa)  # 길이 : 31
idx = rownames(as.data.frame(aa))
idx = as.numeric(idx)

# 이상치 제거
all_box = all_box[-idx,]

# 확인차
d2 = lm(a_aud~ covid+a_screen+weekdays+weekdays_kind+decideCnt+careCnt+d_val, data = all_box)
summary(d2)
vif(d2)

# 0.83의 설명력
plot(d2) # good fitting

# 모델 검증
dim(all_box) # 31개 항 제거

# train_test split
library(caTools)
set.seed(6)
train_idx = sample.split(all_box$a_aud, SplitRatio = 0.8)
train = all_box[train_idx,]
test = all_box[train_idx==F,]

# check
dim(all_box) #1065
nrow(train) #852
nrow(test) #213

# x,y split
y_train = train[,c('a_aud')]
x_train = train %>% select(-c(a_aud))
y_test = test[,c('a_aud')]
x_test = test %>% select(-c(a_aud))
predict(d2, newdata = test, interval = 'confidence')
base_yhat = predict(d2, newdata = test)
base_yhat

# test mse function
test_mse <- function(true_y, yhat) mean((true_y-yhat)^2)

# baseline : 변수 선택없이 회귀 # 18.10643
test_mse(base_yhat, y_test)
test_mse(base_yhat, y_test)

# 변수선택한 회귀
# 변수를 몇개 선택해야할까?
predict.regsubsets = function(object,newdata,id,...){
  form = as.formula(object$call[[2]]) # object은 regsubsets() object
  mat = model.matrix(form,newdata) # model.matrix(model fit한 formula, data)
  coefi = coef(object, id=id) # regsubsets() 결과에서 var이 ~개일 때의 coef 저장하기
  xvars = names(coefi) # coefi 칼럼명 뽑아내기
  mat[,xvars]%*%coefi
}

cc = regsubsets(a_aud~., data=all_box, nbest=1, nvmax=19)
val.error = rep(0,19)
for (i in 1:19){
  yhat = predict.regsubsets(cc,newdata = test, i)
  val.error[i] = test_mse(y_test,yhat)
}
which.min(val.error)
val.error # MSE
# 계속 줄어드는 양상... but 차원까지 생각해서
# id = 19, 15.70039 최저 ... id=5 , 22.03  / id=12, 16.63519

# y 변수 : box-cox
# x 변수 : 어떠한 변환도 없음
# 이 이후부터
## x 변수에 scaling 적용한다면?
# standardscaling
str(x_train)
# all_box2 : x변수 스케일링, y변수 box cox O
# all_box : x변수 스케일링 X, y변수 box cox O

# scaling 한 df 를 따로 만들기
all_box2 = all_box

for (i in 1:length(all_box2)){
  if (colnames(all_box2[i]) != 'covid' & colnames(all_box2[i]) != 'weekdays' & 
      colnames(all_box2[i]) != 'weekdays_kind' & colnames(all_box2[i]) != 'd_index' & colnames(all_box2[i]) != 'a_aud'){
    all_box2[,i] = scale(all_box2[,i])
    all_box2[,i] = as.vector(all_box2[,i])
  }
}
str(all_box2)

# 다시 스플릿
set.seed(6)
train_idx = sample.split(all_box2$a_aud, SplitRatio = 0.8)
train = all_box2[train_idx,]
test = all_box2[train_idx==F,]

# check
dim(all_box2) #1065
nrow(train) #852
nrow(test) #213

# x,y split
y_train = train[,c('a_aud')]
x_train = train %>% select(-c(a_aud))
y_test = test[,c('a_aud')]
x_test = test %>% select(-c(a_aud))
dd = regsubsets(a_aud~., data=all_box2, nbest=1, nvmax=19)
val.error = rep(0,19)
for (i in 1:19){
  yhat = predict.regsubsets(dd,newdata = test, i)
  val.error[i] = test_mse(y_test,yhat)
}
which.min(val.error)
val.error
round(coef(dd,5),5)
round(coef(cc,5),5) # 달라지지 않음 # 어차피 예측값도 같고, coef도 같아서!
# all_box로 해도 되고 all_box2로 해도 됨
#################################
# cv 
# Ridge, Lasso 돌리려면 더미변수로 만들어야한다
# cv 돌리기 전
# all_box = 더미 변수 만들기 전
# all_new2 = 더미변수 만든 후
str(all_box)
all_dummy = all_box %>% mutate(weekdays_kind = ifelse(weekdays_kind == 'on',1,0),
                               weekdays = ifelse(weekdays == 'Sunday',0,
                                                 ifelse(weekdays == 'Monday',1,
                                                        ifelse(weekdays == 'Tuesday',2,
                                                               ifelse(weekdays == 'Wednesday',3,
                                                                      ifelse(weekdays == 'Thursday',4,
                                                                             ifelse(weekdays == 'Friday',5,6)))))),
                               
                                   d_index = ifelse(d_index == 'comfy',0,1),
                               
                                   covid = ifelse(covid == '후',1,0)) 
table(all_dummy$weekdays)

# Re - split
set.seed(6)
train_idx = sample.split(all_dummy$a_aud, SplitRatio = 0.8)
train = all_dummy[train_idx,]
test = all_dummy[train_idx==F,]

# check
dim(all_dummy) #1065
nrow(train) #852
nrow(test) #213

# x,y split
y_train = train[,c('a_aud')]
x_train = train %>% select(-c(a_aud))
y_test = test[,c('a_aud')]
x_test = test %>% select(-c(a_aud))
str(all_dummy)
str(x_train)

######## Ridge ########
# ridge 1 : 모든 변수를 이용해서 ridge cv
fit1 <- glmnet(as.matrix(x_train), as.matrix(y_train), family= 'gaussian', alpha = 0)
cv1 <- cv.glmnet(as.matrix(x_train), as.matrix(y_train), family = 'gaussian', alpha = 0)
cv1$lambda.min # 0.6503258
coef(cv1, s='lambda.min')
par(mfrow = c(2,2))
plot(fit1, xvar='lambda')
plot(cv1)
log(0.067) # -2.7 
           # log(lambda)가 -2.7일때 최저의 mse

# 최적의 람다로 재피팅 후 predict 구하기
ridge1.fit <- glmnet(as.matrix(x_train), as.matrix(y_train), family = 'gaussian',alpha=0, lambda = cv1$lambda.min)
ridge1_coef = predict(ridge.fit, s=cv1$lambda.min, type = 'coefficients',newdata=test)
ridge1_coef = round(ridge1_coef,4)
ridge1_coef

# newdata를 쓰나 안쓰나 coefficient 값은 동일
ridge1.yhat <- predict(ridge1.fit, s=cv1$lambda.min, newx = as.matrix(test.x))
te.mse(yhat, test_y) # OLS :0.3470985
te.mse(ridge1.yhat, test.y) # ridge : 0.2934386

## ridge 2 : 8개 변수만을 이용해서 ridge cv
summary(d2)
# weekday_kindon / covid후 / a_released / a_screen / careCnt / mean_high_wind / mean_av_wind / count
# -> weekdays_kind / covid / a_released / a_screen / careCnt / mean_high_wind / mean_av_wind / count
train.xx = train.x %>% select(weekdays_kind, covid, a_released, a_screen, careCnt, mean_high_wind, mean_av_wind, count)
test.xx = test.x %>% select(weekdays_kind, covid, a_released, a_screen, careCnt, mean_high_wind, mean_av_wind, count)
test_new = test %>% select(a_aud, weekdays_kind, covid, a_released, a_screen, careCnt, mean_high_wind, mean_av_wind, count)

# train.y는 동일
fit2 <- glmnet(as.matrix(train.xx), as.matrix(train.y), family= 'gaussian', alpha = 0)
cv2 <- cv.glmnet(as.matrix(train.xx), as.matrix(train.y), family = 'gaussian', alpha = 0)
cv2$lambda.min # 0.06703521 -> 0.0670562
ridge2_coef = round(coef(cv2, s='lambda.min'),4)
ridge2_coef # 차이가 있음
par(mfrow = c(1,1))
plot(fit2, xvar='lambda')
plot(cv2)
log(0.067) # -2.7 
# log(lambda)가 -2.7일때 최저의 mse

# 8개 변수만으로 fitting
# 최적의 람다로 재피팅 후 predict 구하기
ridge2.fit <- glmnet(as.matrix(train.xx), as.matrix(train.y), family = 'gaussian',alpha=0, lambda = cv2$lambda.min)
ridge2_coef = predict(ridge2.fit, s=cv2$lambda.min, type = 'coefficients',newdata=test_new)
ridge2_coef = round(ridge2_coef,4)
ridge2_coef
ridge1_coef # 비교
# newdata를 쓰나 안쓰나 coefficient 값은 동일
ridge2.yhat <- predict(ridge2.fit, s=cv2$lambda.min, newx = as.matrix(test.xx))
te.mse(yhat, test_y) # OLS :0.3470985
te.mse(ridge1.yhat, test.y) # ridge : 0.2934386
te.mse(ridge2.yhat, test.y) # ridge : 0.2866408

###### Lasso #######
# lasso 1 : 모든 변수를 이용해서 lasso cv
fit3 <- glmnet(as.matrix(train.x), as.matrix(train.y), family= 'gaussian', alpha = 1)
cv3 <- cv.glmnet(as.matrix(train.x), as.matrix(train.y), family = 'gaussian', alpha = 1)
cv3$lambda.min # 0.001444682
coef(cv3, s='lambda.min')
par(mfrow = c(1,1))
plot(fit3, xvar='lambda')
plot(cv3)
log(0.001444682) # -6.539866

# 최적의 람다로 재피팅 후 predict 구하기
lasso1.fit <- glmnet(as.matrix(train.x), as.matrix(train.y), family = 'gaussian',alpha=1, lambda = cv3$lambda.min)
lasso1_coef = predict(lasso1.fit, s=cv3$lambda.min, type = 'coefficients',newdata=test)
lasso1_coef = round(lasso1_coef,4)
lasso1_coef  # coef(cv3, s='lambda.min')와 같음
lasso1.yhat <- predict(lasso1.fit, s=cv3$lambda.min, newx = as.matrix(test.x))
te.mse(lasso1.yhat, test.y) # lasso : 0.3013272

## lasso 2 : 8개 변수를 이용해서 lasso_cv
train.xx = train.x %>% select(weekdays_kind, covid, a_released, a_screen, careCnt, mean_high_wind, mean_av_wind, count)
test.xx = test.x %>% select(weekdays_kind, covid, a_released, a_screen, careCnt, mean_high_wind, mean_av_wind, count)

# train.y는 동일
fit4 <- glmnet(as.matrix(train.xx), as.matrix(train.y), family= 'gaussian', alpha = 1)
cv4 <- cv.glmnet(as.matrix(train.xx), as.matrix(train.y), family = 'gaussian', alpha = 1)
cv4$lambda.min # 0.001444682 -> 0.0008267002 (더 낮아짐)
lasso2_coef = round(coef(cv4, s='lambda.min'),4)
lasso2_coef # 차이가 있음
par(mfrow = c(2,2))
plot(fit4, xvar='lambda')
plot(cv4)
log(0.0008267002) # -7.098068

# 8개 변수만으로 fitting
# 최적의 람다로 재피팅 후 predict 구하기
lasso2.fit <- glmnet(as.matrix(train.xx), as.matrix(train.y), family = 'gaussian',alpha=1, lambda = cv4$lambda.min)
lasso2_coef = predict(lasso2.fit, s=cv4$lambda.min, type = 'coefficients',newdata=test_new)
lasso2_coef = round(lasso2_coef,4)
lasso2_coef

# newdata를 쓰나 안쓰나 coefficient 값은 동일
lasso2.yhat <- predict(lasso2.fit, s=cv4$lambda.min, newx = as.matrix(test.xx))
te.mse(lasso2.yhat, test.y) # lasso : 0.292503

# 랜덤포레스트와 부스팅은 나중에
# 
#####################
# all_box = 더미 변수 만들기 전
# all_new2 = 더미변수 만든 후
# random forest 적용하려고 만든 코드
all_dummy = all_1_noout %>% mutate(weekdays_kind = ifelse(weekdays_kind == 'on',1,0),
                                     d_index = ifelse(d_index == 'comfy',0,1),
                                     covid = ifelse(covid == '후',1,0)) 
str(all_dummy)

## 더미변수 만든 후 모델 돌려보기
# 더미변수 만들기 전과 비교해보았을 때 summary 결과가 같음
d_dummy = lm(a_aud~ weekdays_kind+covid+a_released+a_screen+careCnt+mean_high_wind+mean_av_wind+count, data=all_dummy)
summary(d_dummy)
par(mfrow = c(3,2))
plot(d_new, which=1:6)
head(all_1_noout)
str(all_1_noout)
set.seed(10)
idx <- createDataPartition(all_dummy$weekdays_kind, p=0.8, list=F)
length(idx)
train <- all_1_noout[idx,]
test <- all_1_noout[-idx,]
nrow(all_dummy) # 1051
nrow(train) # 842
nrow(test) # 209
test_y = test$a_aud
str(all_dummy)
# 에러 : variables ‘weekdays_kind’, ‘covid’ were specified with different types from the fit
# predict(d_dummy, newdata = test, interval = 'confidence')
# yhat = predict(d_dummy, newdata = test)
# yhat
# 
# yhat - test_y
# 
# mean(abs(yhat- test_y)) # MAE : 0.4575668
# 
# mean((yhat- test_y)^2) #MSE : 0.3470985
# 
# sqrt(mean((yhat- test_y)^2)) # rMSE : 0.5891507
# 
# summary(d_new) # 67퍼의 설명력

# tidymodels 이용하기 
### 랜덤포레스트 ###
# RF
library(tidyverse)
library(tidymodels)
# install.packages('tidymodels')
str(all)
split <- all_box %>% select(-c(Date,covid,weekdays,weekdays_kind,d_index))%>% initial_split(prop=0.8)
split
split %>% training()
split %>% testing()
str(split)

# 모든 변수 다 사용해보기 # x 변수 : 16개
recipe <- split %>% training() %>% 
  recipe(a_aud ~. ) %>% 
  step_corr(all_predictors()) %>% 
  step_center(all_predictors(), -all_outcomes()) %>% 
  step_scale(all_predictors(), -all_outcomes()) %>% 
  prep()
recipe # 다중공산성 문제는 없음
testing <- recipe %>%
  bake(split %>% testing()) 

# juice
training <- recipe %>% 
  juice()
training
# modeling
rf <- rand_forest(trees=100, mode='regression') %>% 
  set_engine('randomForest') %>% 
  fit(a_aud~., data= training)
rf
rf %>% 
  predict(testing)
rf %>% 
  predict(testing) %>% 
  bind_cols(testing)

# 성능 검증
rf %>% 
  predict(testing) %>% 
  bind_cols(testing) %>% 
  metrics(truth = a_aud, estimate= .pred)

# 랜덤포레스트 - adjR^2: 0.812
pred = rf %>% predict(testing)
pred = pred$.pred
real_y = testing$a_aud
error = (pred - real_y)/real_y
mean(error)

### 0601 shiny 활용
library(shiny)

getwd()
setwd("C:/Users/eric3/Documents/AI 프로젝트/0524/final_cor/shiny")
getwd()
runApp()

[KFQ] KFQ 1차 프로젝트 / 국내 영화관람객수 예측

1. 프로젝트 개요

2. 프로젝트 요약 및 환경

3. 진행과정 및 산출물

4. 모델 검증 및 분석

5. 결론

느낀점

관련된 글

[R] 데이터 분석 환경 만들기

[ML] 파이썬 머신러닝 환경 구성

[ML] Jupyter Notebook 초기 폴더 설정