--- title: "Introduction to autoTS" author: "Vivien Roussez" package: "autoTS" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to autoTS} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r} knitr::opts_chunk$set(warning = F,message = F,fig.width = 8,fig.height = 5) suppressPackageStartupMessages(library(dplyr)) suppressPackageStartupMessages(library(ggplot2)) suppressPackageStartupMessages(library(lubridate)) library(autoTS) ``` # Introduction ## What does this package do ? The `autoTS` package provides a high level interface for **univariate time series** predictions. It implements many algorithms, most of them provided by the `forecast` package. The main goals of the package are : - Simplify the preparation of the time series ; - Train the algorithms and compare their results, to chose the best one ; - Gather the results in a final **tidy dataframe** ## What are the inputs ? The package is designed to work on one time series at a time. Parallel calculations can be put on top of it (see example below). The user has to provide 2 simple vectors : - One with the dates (s.t. the `lubridate` package can parse them) - The second with the corresponding values ## Warnings This package implements each algorithm with a unique parametrization, meaning that the user cannot tweak the algorithms (eg modify SARIMA specfic parameters). # Exemple on real-world data For this example, we will use the GDP quarterly data of the european countries provided by eurostat. The database can be downloaded from [this page](https://ec.europa.eu/eurostat/web/national-accounts/data/database) and then chose "GDP and main components (output, expenditure and income) (namq_10_gdp)" and then adjust the time dimension to select all available data and download as a csv file with the correct formatting (1 234.56). The csv is in the "Data" folder of this notebook. ```{r} tmp_dir <- tempdir() %>% normalizePath() unzip(zipfile = "../inst/extdata/namq_10_gdp.zip",exdir = tmp_dir) dat <- read.csv(paste0(tmp_dir,"/namq_10_gdp_1_Data.csv")) file.remove(paste0(tmp_dir,"/namq_10_gdp_1_Data.csv"),paste0(tmp_dir,"/namq_10_gdp_Label.csv")) str(dat) head(dat) ``` ## Data preparation First, we have to clean the data (not too ugly though). First thing is to convert the TIME column into a well known date format that lubridate can handle. In this example, the `yq` function can parse the date without modification of the column. Then, we have to remove the blank in the values that separates thousands... Finally, we only keep data since 2000 and the unadjusted series in current prices. After that, we should get one time series per country ```{r} dat <- mutate(dat,dates=yq(as.character(TIME)), values = as.numeric(stringr::str_remove(Value," "))) %>% filter(year(dates)>=2000 & S_ADJ=="Unadjusted data (i.e. neither seasonally adjusted nor calendar adjusted data)" & UNIT == "Current prices, million euro") filter(dat,GEO %in% c("France","Austria")) %>% ggplot(aes(dates,values,color=GEO)) + geom_line() + theme_minimal() + labs(title="GDP of (completely) random countries") ``` Now we're good to go ! 
## Prediction on a random country

Let's see how to use the package on one time series:

- Extract the dates and values of the time series you want to work on;
- Create the object containing everything you need afterwards;
- Train the algorithms and determine which one performs best (over the last known year);
- Fit the best algorithm on the full data.

```{r}
ex1 <- filter(dat, GEO == "France")
preparedTS <- prepare.ts(ex1$dates, ex1$values, "quarter")

## What is in this new object?
str(preparedTS)
plot.ts(preparedTS$obj.ts)
ggplot(preparedTS$obj.df, aes(dates, val)) + geom_line() + theme_minimal()
```

Get the best algorithm for this time series:

```{r}
## What is the best model for prediction?
best.algo <- getBestModel(ex1$dates, ex1$values, "quarter", graph = FALSE)
names(best.algo)
print(paste("The best algorithm is", best.algo$best))
best.algo$graph.train
```

The result of this function contains:

- The name of the best model;
- The errors of each algorithm on the test set;
- The graphic of the training step;
- The prepared time series;
- The list of algorithms used (which you can customize).

The result of this function can be used as direct input of the `my.predictions` function:

```{r}
## Build the predictions
final.pred <- my.predictions(bestmod = best.algo)
tail(final.pred, 24)

ggplot(final.pred) + geom_line(aes(dates, actual.value), color = "black") +
  geom_line(aes_string("dates", stringr::str_remove(best.algo$best, "my."),
                       linetype = "type"), color = "red") +
  theme_minimal()
```

Not too bad, right?

# Scaling predictions

Let's say we want to make a prediction for every country at the same time, as fast as possible $\rightarrow$ let's combine the package's functions with parallel computing. We have to reshape the data to get one column per country and then iterate over the columns of the data frame.

## Prepare data

```{r}
suppressPackageStartupMessages(library(tidyr))
dat.wide <- select(dat, GEO, dates, values) %>%
  group_by(dates) %>%
  spread(key = "GEO", value = "values")
head(dat.wide)
```

## Compute bulk predictions

**Note:** The following code is not executed in this vignette, but it does work (you can try it at home).

```{r, eval=FALSE}
library(doParallel)

pipeline <- function(dates, values) {
  pred <- getBestModel(dates, values, "quarter", graph = FALSE) %>%
    my.predictions()
  return(pred)
}
doMC::registerDoMC(parallel::detectCores() - 1) # parallel backend (for UNIX)

system.time({
  res <- foreach(ii = 2:ncol(dat.wide), .packages = c("dplyr", "autoTS")) %dopar%
    pipeline(dat.wide$dates, pull(dat.wide, ii))
})
names(res) <- colnames(dat.wide)[-1]
str(res)
```

## There is no free lunch...

There is no universally best algorithm $\Rightarrow$ it depends on the data! Likewise, the following code is not executed in this vignette, but it works if you want to replicate it.

```{r, eval=FALSE}
## Which algorithm wins, and how often overall?
sapply(res, function(xx) colnames(select(xx, -dates, -type, -actual.value))) %>%
  table()
sapply(res, function(xx) colnames(select(xx, -dates, -type, -actual.value)))
```
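
To close the loop, here is a sketch of how the bulk results could be inspected for a single country, assuming the parallel run above has completed (not executed in this vignette):

```{r, eval=FALSE}
## Not run: assumes `res` from the bulk computation above.
## Each element of res is the tidy prediction data frame for one country,
## keyed by country name:
fr <- res[["France"]]
tail(fr, 8)

## The winning algorithm for a country is the one column that is neither
## dates, type, nor actual.value:
setdiff(colnames(fr), c("dates", "type", "actual.value"))
```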