--- title: "Quick Start" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Quick Start} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # The qeML Package: "Quick and Easy" Machine Learning ## "Easy for learners, powerful for advanced users"


# What this package is about * "Quick and Easy" ML * **MUCH SIMPLER USER INTERFACE** than caret, mlr3, tidymodels, etc. * easy for learners, powerful/convenient for experts * Ideal for teaching! * numerous built-in real datasets. * includes **tutorials** on major ML methods * Special features for those experienced in ML * advanced functions for feature selection and model development * advanced ML algorithms, including some novel/unusual ones * advanced plotting utilities ## Easy model fit--first examples The letters 'qe' in the package title stand for "quick and easy," alluding to the convenience goal of the package. We bring together a variety of machine learning (ML) tools from standard R packages, providing wrappers with a uniform, **extremely simple** interface. Hence the term "quick and easy." For instance, consider the **mlb1** data included in the package, consisting of data on professional baseball players. As usual in R, we load the data: ``` r > data(mlb1) ``` Here is what the data looks like: ``` r > head(mlb1) Position Height Weight Age 1 Catcher 74 180 22.99 2 Catcher 74 215 34.69 3 Catcher 72 210 30.78 4 First_Baseman 72 210 35.43 5 First_Baseman 73 188 35.71 6 Second_Baseman 69 176 29.39 ``` The qe-series function calls are of the very simple form ``` r qe_function_name(dataset,variable_to_predict) ``` For instance, say we wish to predict player weights. For the random forests ML algorithm, we would make the simple call ``` r qeRF(mlb1,'Weight') ``` For gradient boosting, the call would be similar, ``` r qeGBoost(mlb1,'Weight') ``` and so on. **IT COULDN'T BE EASIER**! No setup, predefinitions etc.; just make a simple call. Default values are used on the above calls, but nondefaults can be specified, e.g. ``` r qeRF(mlb1,'Weight',nTree=200) ``` ## Prediction Each qe-series function is paired with a **predict** method, e.g. to predict player weight: ::: {.fullwidth} ``` r > z <- qeGBoost(mlb1,'Weight') > x <- data.frame(Position='Catcher',Height=73,Age=28) > predict(z,x) [1] 204.2406 ``` ::: A catcher of height 73 and age 28 would be predicted to have weight about 204. Categorical variables can be predicted too. Where possible, class probabilities are computed in addition to class. Let's predict player position from the physical characteristics: ::: {.fullwidth} ``` r > w <- qeGBoost(mlb1,'Position') > predict(w,data.frame(Height=73,Weight=185,Age=28)) $predClasses [1] "Relief_Pitcher" $probs Catcher First_Baseman Outfielder Relief_Pitcher Second_Baseman [1,] 0.02396515 0.03167778 0.2369061 0.2830575 0.1421796 Shortstop Starting_Pitcher Third_Baseman [1,] 0.0592867 0.1824601 0.04046717 ``` ::: A player of height 73, weight 185 and age 28 would be predicted to be a relief pitcher, with probability 0.28. The second most-likely position would be outfielder, and so on. ## Holdout sets By default, the qe functions reserve a *holdout set* on which to assess model accuracy. The remaining data form the *training set*. After a model is fit to the training set, we use it to predict the holdout data, so as to assess the predictive power of our model. (To specify no holdout, set holdout=NULL in the call.) ``` r > z <- qeRF(mlb1,'Weight') holdout set has 101 rows > z$testAcc [1] 14.45285 > z$baseAcc [1] 17.22356 ``` The mean absolute prediction error (MAPE) on the holdout data was about 14.5 pounds. On the other hand, if we had simply predicted every player using the overall mean weight, the MAPE would be about 17.2. So, using height, age and player position for our prediction did improve things. Of course, since the holdout set is random, the same is true for the above accuracy numbers. To gauge the predictive power of a model over many holdout sets, one can use **replicMeans()**, which is available in qeML via automatic loading of the **regtools** package. Say for 100 holdout sets: ``` r > replicMeans(100,"qeRF(mlb1,'Weight')$testAcc") [1] 13.6354 attr(,"stderr") [1] 0.1147791 ``` So the true MAPE for this model on new data is estimated to be 13.6. The standard error is also output, to gauge whether 100 replicates is enough. # Tutorials The package includes tutorials for those with no background in machine learning, as well as tutorials on advanced topics. A few examples (showing how they are invoked): * **vignette('ML_Overview')**; for those with no prior ML background * **vignette('Overfitting')**; plugging "overfitting" into Google yielded 49,400,000 results!--but what is overfitting REALLY about?; read here! * **vignette('Feature_Selection')**; we often need to pare down our set of predictor variables, both to save computation and prevent overfitting; how can this be done, especially in **qeML**? * **vignette('PCA_and_UMAP')**; this vignette first takes a closer, more practical look at Principal Components Analysis, then gives an overview of UMAP, a relatively new nonlinear alternative to PCA # Full function list, by category Type **vignette('Function_list')**. # Package author: Norm Matloff, UC Davis Professor of computer science, and former professor of statistics; 2017 Ziegal Award for book, *Statistical Regression and Classification: from Linear Models to Machine Learning*; Distinguished Teaching Award and Outstanding Public Service Award, UC Davis; [bio here](https://heather.cs.ucdavis.edu/matloff.html).