Title: | Quick and Easy Machine Learning Tools |
---|---|
Description: | The letters 'qe' in the package title stand for "quick and easy," alluding to the convenience goal of the package. We bring together a variety of machine learning (ML) tools from standard R packages, providing wrappers with a simple, convenient, and uniform interface. |
Authors: | Norm Matloff [aut, cre] |
Maintainer: | Norm Matloff <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.2.1 |
Built: | 2024-11-14 23:28:35 UTC |
Source: | https://github.com/matloff/qeml |
Miscellaneous specialized plots.
plotPairedResids(data,qeOut)
plotClassesUMAP(data,classVar)
qeFreqParcoord(dataName,k=25,opts=NULL)
qePlotCurves(curveData,xCol=1,yCol=2,grpCol=3,
   xlab=names(curveData)[xCol],ylab=names(curveData)[yCol],
   loess=TRUE,legendTitle,legendSpace=1.1,legendPos='topright')
qePlotCurves(curveData,xCol=1,yCol=2,grpCol=3,
   xlab=names(curveData)[xCol],ylab=names(curveData)[yCol],loess=TRUE,
   legendTitle=names(curveData)[grpCol],legendSpace=1.1,legendPos="topright",
   wide=FALSE,wideTimeColName=NULL,wideTimeColPresent=NULL,
   wideTimeColBase=1:nrow(curveData),wideGrpColName=NULL,
   wideValueColName=NULL)
qeMittalGraph(dataMitt,xlab="x",ylab="y",legendTitle="curve",loess=TRUE)
data |
A data frame or equivalent. |
dataMitt |
A data frame or equivalent. "X" and "Y" columns, followed by a group column, an R factor. |
qeOut |
An object returned from one of the qe-series predictive functions. |
classVar |
Name of the column containing class information. |
dataName |
Quoted name of a data frame. |
k |
Number of nearest neighbors. |
opts |
Options to be passed to |
curveData |
Data to be plotted. |
xCol |
Column name or number containing "X". |
yCol |
Column name or number containing "Y". |
grpCol |
Column name or number containing group name, a character vector or factor. |
xlab |
X-axis label. |
ylab |
Y-axis label. |
loess |
If TRUE, do loess smoothing within each group. |
legendTitle |
Legend title. |
legendSpace |
Factor by which to expand vertical space, to accommodate a top-situated legend. |
legendPos |
Position of legend within plot. |
curveData |
A data frame, "X" values in column 1. |
xlab |
Label for X-axis. |
ylab |
Label for Y-axis. |
wide |
TRUE if |
wideTimeColName |
Name to be used for "X"-axis. |
wideTimeColPresent |
If TRUE, a time column already exists. |
wideTimeColBase |
"Time" values for each group. |
wideGrpColName |
Group name. |
wideValueColName |
"Y" value name. |
The plotPairedResids function plots model residuals against pairs of features, for example for model validation. Pairs are chosen randomly.
The function qeFreqParcoord is a qeML interface to the cdparcoord package.
The function qePlotCurves plots X-Y curves for one or more groups. Within each group, the (X,Y) pairs are plotted, possibly with loess smoothing. The input data format is long by default, but can be wide.
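The difference between the long and wide layouts can be illustrated with base R alone. The following is only a sketch on made-up data (the data frame wideDat and its column names are hypothetical); it is not the conversion code used inside qePlotCurves.

```r
# hypothetical wide data: one column per group, one row per time point
wideDat <- data.frame(Canada = c(1.00, 1.02, 1.05),
                      Japan  = c(300, 295, 290))

# long format: one row per (time, group) pair, i.e. "X", "Y", group columns
longDat <- data.frame(
   weeknum = rep(seq_len(nrow(wideDat)), times = ncol(wideDat)),
   rate    = unlist(wideDat, use.names = FALSE),
   country = factor(rep(names(wideDat), each = nrow(wideDat)))
)
print(longDat)
```

In the long layout, each group contributes its own block of rows, which is the shape expected when xCol, yCol and grpCol are given.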
The function qeMittalGraph is similar to qePlotCurves, except that it displays multiplicative change over the X-axis. All curves start at height 1.0. (There may be some exceptions to this if loess is TRUE.) The X-axis could be time or some model parameter, say in graphing prediction accuracy against number of nearest neighbors for different datasets.
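The idea of curves that all start at height 1.0 can be sketched in base R. This is an illustration under a stated assumption, namely that "multiplicative change" means each group's Y values divided by that group's first Y value; the data and variable names are made up, and qeMittalGraph itself may compute things differently.

```r
# hypothetical long-format data: X, Y, then a group factor
d <- data.frame(x = rep(1:4, 2),
                y = c(10, 12, 15, 20, 200, 180, 220, 260),
                grp = factor(rep(c("a", "b"), each = 4)))

# ASSUMPTION: change is measured relative to each group's first Y value,
# so that every curve starts at height 1.0
d$rel <- ave(d$y, d$grp, FUN = function(v) v / v[1])

# one line per group, on a common relative scale
plot(d$x, d$rel, type = "n", xlab = "x", ylab = "y relative to start")
for (g in levels(d$grp)) {
   gd <- d[d$grp == g, ]
   lines(gd$x, gd$rel, col = which(levels(d$grp) == g))
}
legend("topleft", legend = levels(d$grp),
       col = seq_along(levels(d$grp)), lty = 1)
```

Rescaling in this way lets groups with very different magnitudes (here roughly 10 versus 200) share one plot.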
Norm Matloff
## Not run:
data(pef)
linout <- qeLin(pef,'wageinc')
plotPairedResids(pef,linout)
data(lsa)
# plot LSAT score against undergraduate GPA, for each law school cluster
# (reputed quality of the law school)
qePlotCurves(lsa,6,5,9,legendSpace=1.35)
data(currency)
curr <- cbind(1:nrow(currency),currency)
names(curr)[1] <- 'weeknum'
qePlotCurves(currency,wide=TRUE,wideTimeColName='weeknum',
   wideTimeColPresent=FALSE,wideGrpColName='country',wideValueColName='rate')
qeMittalGraph(curr,'weeknum','rate','country')
# Canadian dollar and pound in one cluster, and franc, mark and
# yen in another
## End(Not run)
Data on incidence of breast cancer among women in Sweden. Goal of the study was to investigate whether the incidence increases with the onset of menopause.
Included here with the permission of Prof. Yudi Pawitan, Karolinska Institutet, Stockholm.
The data are in the form of an R list. Each element of the list corresponds to one offering of the course. Fields are: Class level; major (two different computer science majors, LCSI in Letters and Science and ECSE in engineering); quiz grade average (scale of 4.0, A+ counting as 4.3); homework grade average (same scale); and course letter grade.
From Wai Mun Fong and Sam Ouliaris, "Spectral Tests of the Martingale Hypothesis for Exchange Rates", Journal of Applied Econometrics, Vol. 10, No. 3, 1995, pp. 255-271. Weekly exchange rates against US dollar, over the period 7 August 1974 to 29 March 1989.
This is the Bike Sharing dataset (day records only) from the UC Irvine Machine Learning Dataset Repository. Included here with permission of Dr. Hadi Fanaee.
The day data is as on UCI; day1 is modified so that the numeric weather variables are on their original scale. The day2 data is the same as day1, except that dteday has been removed, and season, mnth, weekday and weathersit have been converted to R factors.
See https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset for details.
Belkin and others have shown that some machine learning algorithms exhibit surprising behavior when in overfitting settings. The classic U-shape of mean loss plotted against model complexity may be followed by a surprise second "mini-U."
Alternatively, one might keep the model complexity fixed while varying the number of data points n, including over a region in which n is smaller than the complexity value of the model. The surprise here is that mean loss may actually increase with n in the overfitting region.
The function doubleD facilitates easy exploration of this phenomenon.
doubleD(qeFtnCall,xPts,nReps,makeDummies=NULL,classif=FALSE)
qeFtnCall |
Quoted string; somewhere should include 'xPts[i]'. |
xPts |
Range of values to be used in the experiments, e.g. a vector of degrees for polynomial models. |
nReps |
Number of repetitions for each experiment, typically the number in the holdout set. |
makeDummies |
If non-NULL, call |
classif |
Set TRUE if this is a classification problem. |
The function will run the code in qeFtnCall nReps times for each level specified in xPts, recording the test and training error in each case. So, for each level, we will have a mean test and training error.
Each level in xPts results in one row in the return value of doubleD. The return matrix can then be plotted, using the generic plot.doubleD. Mean test (red) and training (blue) accuracy will be plotted against xPts.
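The kind of experiment described above can be sketched in base R, without qeML. This is only an illustration of the loop structure (levels, repetitions, mean test and training error); the simulated data, the holdout size nHold and the use of lm with poly are assumptions made for the sketch, not the internals of doubleD.

```r
set.seed(1)
n <- 60
x <- runif(n, -1, 1)
dat <- data.frame(x = x, y = sin(3 * x) + rnorm(n, sd = 0.3))

degs  <- 1:8    # plays the role of xPts: polynomial degrees
nReps <- 25     # repetitions per level
nHold <- 15     # holdout size (a hypothetical choice)

res <- t(sapply(degs, function(deg) {
   errs <- replicate(nReps, {
      idx <- sample(n, nHold)
      trn <- dat[-idx, ]
      tst <- dat[idx, ]
      fit <- lm(y ~ poly(x, deg, raw = TRUE), data = trn)
      c(test  = mean(abs(tst$y - predict(fit, tst))),
        train = mean(abs(trn$y - predict(fit, trn))))
   })
   rowMeans(errs)   # mean test and training error for this level
}))
res <- cbind(deg = degs, res)   # one row per level
res
```

Plotting the test and train columns of res against deg gives the kind of curve pair in which a double-descent pattern would be looked for.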
Norm Matloff
## Not run:
data(mlb1)
hw <- mlb1[,2:3]
doubleD('qePolyLin(hw,"Weight",deg=xPts[i])',1:20,250)
## End(Not run)
IBM data from Kaggle, https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset.
data(empAttrition)
The Stanford WordBank data on vocabulary acquisition in young children. The file consists of about 5500 rows. (There are many NA values, though, and only about 2800 complete cases.) Variables are age, birth order, sex, mother's education and vocabulary size.
US economic growth measures.
Courtesy of the Economic Policy Institute.
data(EPIWgProduct)
Utilities to help build models, both in specific applications such as time series and text analysis, and as general tools.
qeCompare(data,yName,qeFtnList,nReps,opts=NULL,seed=9999)
qeFT(data,yName,qeftn,pars,nCombs,nTst,nXval,showProgress=TRUE)
qeText(data,yName,kTop=50,stopWords=tm::stopwords("english"),
   qeName,opts=NULL,holdout=floor(min(1000,0.1*nrow(data))))
qeTS(lag,data,qeName,opts=NULL,holdout=floor(min(1000,0.1*length(data))))
## S3 method for class 'qeText'
predict(object,newDocs,...)
## S3 method for class 'qeTS'
predict(object,newx,...)
... |
Further arguments. |
object |
Object returned by a qe-series function. |
newx |
New data to be predicted. |
newDocs |
Vector of new documents to be predicted. |
lag |
Number of recent values to use in predicting the next. |
qeName |
Name of qe-series predictive function, e.g. 'qeRF'. |
stopWords |
Stop lists to use. |
nTst |
Number of parameter combinations. |
kTop |
Number of most-frequent words to use. |
data |
Dataframe, training set. Classification case is signaled via labels column being an R factor. |
yName |
Name of the class labels column. |
holdout |
If not NULL, form a holdout set of the specified size. After fitting to the remaining data, evaluate accuracy on the test set. |
qeFtnList |
Character vector of |
nReps |
Number of holdout sets to generate. |
opts |
R list of optional arguments for none, some or all of the
functions in |
seed |
Seed for random number generation. |
qeftn |
Quoted string, specifying the name of a qe-series machine learning method. |
pars |
R list of hyperparameter ranges. See
|
nCombs |
Number of hyperparameter combinations to run.
See |
nXval |
Number of cross-validations to run.
See |
showProgress |
If TRUE, show results as they arise.
See |
Overviews of the functions:
qeTS is a tool for time series modeling
qeText is a tool for textual modeling
qeCompare facilitates comparison among models
qeFT does a random grid search for optimal hyperparameter values
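The lag argument of qeTS refers to predicting each value from the most recent ones. A minimal base R sketch of that idea follows; the series, the column names prev1, prev2 and the use of lm are all hypothetical choices for illustration, and qeTS may build its training data differently.

```r
x <- c(5, 7, 6, 9, 11, 10, 12, 14, 13, 15, 17, 16)   # hypothetical series
lag <- 2

# embed() puts the most recent value first in each row, so reverse the
# columns to get (oldest, ..., newest, value to be predicted)
m <- embed(x, lag + 1)[, (lag + 1):1]
colnames(m) <- c(paste0("prev", lag:1), "y")
tsDat <- as.data.frame(m)

# any regression method could be used here; lm is just the simplest
fit <- lm(y ~ ., data = tsDat)

# predict the value following the last 'lag' observations
newx <- as.data.frame(t(tail(x, lag)))
names(newx) <- paste0("prev", lag:1)
predict(fit, newx)
```

Each row of tsDat holds lag consecutive values plus the one that followed them, so a series of length n yields n - lag training rows.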
Norm Matloff
data(mlb1)
# predict Weight in the mlb1 dataset, using qeKNN, with k = 5 and 25,
# with 10 cross-validations
qeFT(mlb1,'Weight','qeKNN',list(k=c(5,25)),nTst=100,nXval=10)
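For readers unfamiliar with random grid search, the general technique can be sketched in base R as follows. This is not the qeFT implementation: the model (polynomial regression on the built-in mtcars data), the hyperparameter names degWt and degHp, and the holdout size holdSize are assumptions made only for the sketch.

```r
set.seed(2)
pars <- list(degWt = 1:3, degHp = 1:3)   # hyperparameter ranges
nCombs   <- 4     # number of combinations to try
nXval    <- 10    # random holdout splits per combination
holdSize <- 8     # holdout size (hypothetical)

# full grid, then a random subset of its rows
grid <- expand.grid(pars)
grid <- grid[sample(nrow(grid), nCombs), , drop = FALSE]

grid$meanAcc <- apply(grid, 1, function(g) {
   dw <- g[["degWt"]]
   dh <- g[["degHp"]]
   mean(replicate(nXval, {
      idx <- sample(nrow(mtcars), holdSize)
      fit <- lm(mpg ~ poly(wt, dw, raw = TRUE) + poly(hp, dh, raw = TRUE),
                data = mtcars[-idx, ])
      # mean absolute prediction error on the holdout rows
      mean(abs(mtcars$mpg[idx] - predict(fit, mtcars[idx, ])))
   }))
})
grid[order(grid$meanAcc), ]   # best combinations first
```

Sampling rows of the grid, rather than running all of them, keeps the cost fixed at nCombs fits per split even when the ranges are large.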
Random subset of 500 records.
https://archive.ics.uci.edu/ml/datasets/covertype
From https://github.com/sharmaroshan/Churn-Modelling-Dataset.
Character variables and Bernoulli variables have been converted to factors. The first three columns, e.g. customer ID, have been deleted.
The tenure col is apparently length of time with the firm.
Law School Admissions dataset from the Law School Admissions Council (LSAC). The dataset was originally collected for a study called 'LSAC National Longitudinal Bar Passage Study' by Linda Wightman in 1998.
Most of the names are self-explanatory, but note that: The two decile scores are class standing in the first and third years of law school, and 'cluster' refers to the reputed quality of the law school. Two variables of particular interest might be the student's score on the Law School Admission Test (LSAT) and a logical variable indicating whether the person passed the bar examination.
Note that the 'age' variable is apparently birth year, e.g. 69 meaning 1969.
This data consists of capital letter frequencies obtained at https://www.math.cornell.edu/~mec/2003-2004/cryptography/subs/frequencies.html
Heights, weights, ages etc. of major league baseball players. A new variable has been added, consolidating positions into Infielders, Outfielders, Catchers and Pitchers.
The mlb1 version has only Position, Height, Weight and Age.
Included here with the permission of the UCLA Statistics Department.
The MovieLens dataset, https://grouplens.org/, is a standard example in the recommender systems literature. Here we give demographic data for each user, plus the mean rating and number of ratings. One may explore, for instance, the relation between ratings and age.
This data set is adapted from the Adult data from the UCI Machine Learning Repository, which was in turn adapted from Census data on adult incomes and other demographic variables. The UCI data is used here with permission from Ronny Kohavi.
The variables are:
gt50, which converts the original >50K variable to an indicator variable; 1 for income greater than $50,000, else 0
edu, which converts a set of education levels to approximate number of years of schooling
age
gender, 1 for male, 0 for female
mar, 1 for married, 0 for single
Note that the education variable is now numeric.
10,000 records on five variables, extracted from https://data.cityofnewyork.us/Transportation/2018-Yellow-Taxi-Trip-Data/t29m-gskq.
data(nyctaxi)
Italian olive oils data set, as used in Graphics of Large Datasets: Visualizing a Million, by Antony Unwin, Martin Theus and Heike Hofmann, Springer, 2006. Included here with permission of Dr. Martin Theus.
ML methods for prediction in which features are subject to missing values.
qeLinMV(data,yName)
qeLogitMV(data,yName,yesYVal)
qeKNNMV(data,yName,kmax)
## S3 method for class 'qeLinMV'
predict(object,newx,...)
## S3 method for class 'qeLogitMV'
predict(object,newx,...)
## S3 method for class 'qeKNNMV'
predict(object,newx,...)
... |
Further arguments. |
object |
An object returned by one of the |
data |
Dataframe, training set. Classification case is signaled via labels column being an R factor. |
yName |
Name of the class labels column. |
newx |
New data to be predicted. |
kmax |
Number of nearest neighbors in training set. |
yesYVal |
Y value to be considered "yes," to be coded 1 rather than 0. |
These are wrappers to the toweranNA package. Linear, logistic and kNN interfaces are available.
Norm Matloff
sum(is.na(airquality))  # 44 NAs, good test example
z <- qeKNNMV(airquality,'Ozone',10)
# example of new case, insert an NA in 1st row
aq2 <- airquality[2,-1]
aq2$Wind <- NA
predict(z,aq2)  # 28.1
This data set is adapted from the 2000 Census (5% sample, person records). It is mainly restricted to programmers and engineers in the Silicon Valley area. (Apparently due to errors, there are some from other ZIP codes.)
There are three versions:
prgeng, the original data, with categorical variables, e.g. Occupation, in their original codes
pef, same as peFactors, but having only columns for age, education, occupation, gender, wage income and weeks worked. The education column has been collapsed to Master's degree, PhD and other, coded 'z14', 'z16' and 'zzzOther'. Most cases are in the latter category.
svcensus, same as pef, but with the column name 'sex' replaced by 'gender'.
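The collapsing of the education codes described for pef can be pictured with a short base R sketch. The input vector here is made up, and this is not necessarily how the dataset was actually prepared.

```r
educ <- c(9, 13, 14, 16, 12, 14)   # hypothetical education codes

# keep Master's (14) and PhD (16) as their own levels, lump the rest
educ3 <- factor(ifelse(educ == 14, "z14",
                ifelse(educ == 16, "z16", "zzzOther")))
table(educ3)
```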
The variable codes, e.g. occupational codes, are available from https://usa.ipums.org/usa/volii/occ2000.shtml. (Short code lists are given in the record layout, but longer ones are in the appendix Code Lists.)
The variables are:
age, with a U(0,1) variate added for jitter
cit, citizenship; 1-4 code various categories of citizens; 5 means noncitizen (including permanent residents)
educ: 01-09 code no college; 10-12 means some college; 13 is a bachelor's degree, 14 a master's, 15 a professional degree and 16 is a doctorate
occ, occupation
birth, place of birth
wageinc, wage income
wkswrkd, number of weeks worked
yrentry, year of entry to the U.S. (0 for natives)
powpuma, location of work
gender, 1 for male, 2 for female
Quick access to machine learning methods, with a very simple interface. "Works right out of the box!": Just one call needed to fit, no preliminary setup of model etc. The simplicity also makes the series useful for teaching.
qeLogit(data,yName,holdout=floor(min(1000,0.1*nrow(data))),yesYVal=NULL) qeLin(data,yName,noBeta0=FALSE,holdout=floor(min(1000,0.1*nrow(data)))) qeKNN(data,yName,k=25,scaleX=TRUE,smoothingFtn=mean,yesYVal=NULL, expandVars=NULL,expandVals =NULL,holdout=floor(min(1000,0.1*nrow(data)))) qeRF(data,yName,nTree=500,minNodeSize=10,mtry=floor(sqrt(ncol(data)))+1, holdout=floor(min(1000,0.1*nrow(data)))) qeRFranger(data,yName,nTree=500,minNodeSize=10, mtry=floor(sqrt(ncol(data)))+1,deweightPars=NULL, holdout=floor(min(1000,0.1*nrow(data))),yesYVal="") qeRFgrf(data,yName,nTree=2000,minNodeSize=5,mtry=floor(sqrt(ncol(data)))+1, ll=FALSE,lambda=0.1,splitCutoff=sqrt(nrow(data)),quantls=NULL, holdout=floor(min(1000,0.1*nrow(data)))) qeSVM(data,yName,gamma=1.0,cost=1.0,kernel='radial',degree=2, allDefaults=FALSE,holdout=floor(min(1000,0.1*nrow(data)))) qeGBoost(data,yName,nTree=100,minNodeSize=10,learnRate=0.1, holdout=floor(min(1000,0.1*nrow(data)))) qeAdaBoost(data, yName, treeDepth = 3, nRounds = 100, rpartControl = NULL, holdout = floor(min(1000, 0.1 * nrow(data)))) qeLightGBoost(data,yName,nTree=100,minNodeSize=10,learnRate=0.1, holdout=floor(min(1000,0.1*nrow(data)))) qeNeural(data,yName,hidden=c(100,100),nEpoch=30, acts=rep("relu",length(hidden)),learnRate=0.001, conv=NULL,xShape=NULL, holdout=floor(min(1000,0.1*nrow(data)))) qeLASSO(data,yName,alpha=1,holdout=floor(min(1000,0.1*nrow(data)))) qePolyLin(data,yName,deg=2,maxInteractDeg = deg, holdout=floor(min(1000,0.1*nrow(data)))) qePolyLog(data,yName,deg=2,maxInteractDeg = deg, holdout=floor(min(1000,0.1*nrow(data)))) qePCA(data,yName,qeName,opts=NULL,pcaProp, holdout=floor(min(1000,0.1*nrow(data)))) qeUMAP(data,yName,qeName,opts=NULL, holdout=floor(min(1000,0.1*nrow(data))),scaleX=FALSE, nComps=NULL,nNeighbors=NULL) qeDT(data,yName,alpha=0.05,minsplit=20,minbucket=7,maxdepth=0,mtry=0, holdout=floor(min(1000,0.1*nrow(data)))) qeFOCI(data,yName,numCores=1,parPlat="none", yesYLevel=NULL) 
qeFOCIrand(data,yName,xSetSize,nXSets) qeFOCImult(data,yName,numCores=1, parPlat="none",coalesce='union') qeLinKNN(data,yName,k=25,scaleX=TRUE,smoothingFtn=mean, expandVars=NULL,expandVals=NULL, holdout=floor(min(1000,0.1*nrow(data)))) qePolyLASSO(data,yName,deg=2,maxInteractDeg=deg,alpha=0, holdout=floor(min(1000,0.1*nrow(data)))) qeROC(dataIn,qeOut,yLevelName) qeXGBoost(data,yName,nRounds=250, params=list(eta=0.3,max_depth=6,alpha=0), holdout=floor(min(1000,0.1*nrow(data)))) qeDeepnet(data,yName,hidden=c(10),activationfun="sigm", learningrate=0.8,momentum=0.5,learningrate_scale=1, numepochs=3,batchsize=100,hidden_dropout=0,yesYVal=NULL, holdout=floor(min(1000,0.1*nrow(data)))) qeRpart(data,yName,minBucket=10,holdout=floor(min(1000, 0.1*nrow(data)))) qeParallel(data,yName,qeFtnName,dataName,opts=NULL,cls=1, libs=NULL,holdout=NULL) checkPkgLoaded(pkgName,whereObtain='CRAN') ## S3 method for class 'qeParallel' predict(object,newx,...) ## S3 method for class 'qeLogit' predict(object,newx,...) ## S3 method for class 'qeLin' predict(object,newx,useTrainRow1=TRUE,...) ## S3 method for class 'qeKNN' predict(object,newx,newxK=1,...) ## S3 method for class 'qeRF' predict(object,newx,...) ## S3 method for class 'qeRFranger' predict(object,newx,...) ## S3 method for class 'qeRFgrf' predict(object,newx,...) ## S3 method for class 'qeSVM' predict(object,newx,...) ## S3 method for class 'qeGBoost' predict(object,newx,newNTree=NULL,...) ## S3 method for class 'qeLightGBoost' predict(object,newx,...) ## S3 method for class 'qeNeural' predict(object,newx,k=NULL,...) ## S3 method for class 'qeLASSO' predict(object,newx,...) ## S3 method for class 'qePoly' predict(object,newx) ## S3 method for class 'qePCA' predict(object,newx,...) ## S3 method for class 'qeUMAP' predict(object,newx,...) ## S3 method for class 'qeDeepnet' predict(object,newx,...) ## S3 method for class 'qeRpart' predict(object,newx,...) ## S3 method for class 'qeLASSO' plot(x,...) 
## S3 method for class 'qeRF' plot(x,...) ## S3 method for class 'qeRpart' plot(x,boxPalette=c("red","yellow","green","blue"),...)
... |
Further arguments. |
cls |
Cluster in the sense of parallel package. If not of
class |
libs |
Character vector listing libraries needed to be loaded for
|
hidden_dropout |
Dropout fraction for hidden layer. |
batchsize |
Batch size. |
numepochs |
Number of iterations to conduct. |
learningrate |
Learning rate. |
momentum |
Momentum. |
learningrate_scale |
Learning rate will be multiplied by this at each iteration, allowing for decay. |
activationfun |
Can be 'sigm', 'tanh' or 'linear'. |
newNTree |
Number of trees to use in prediction. |
newxK |
If predicting new cases, number of nearest neighbors to
smooth in the object returned by |
useTrainRow1 |
If TRUE, take names in |
newx |
New data to be predicted. |
object |
An object returned by a qe-series function. |
minsplit |
Minimum number of data points in a node. |
minbucket |
Minimum number of data points in a terminal node. |
minBucket |
Minimum number of data points in a terminal node. |
maxdepth |
Maximum number of levels in a tree. |
qeName |
Name of qe-series predictive function. |
qeFtnName |
Name of qe-series predictive function. |
conv |
R list specifying the convolutional layers, if any. |
deweightPars |
Values for de-emphasizing variables in a tree node split, e.g. 'list(age=0.2,gender=0.5)'. |
allDefaults |
Use all default values of the wrapped function. |
expandVars |
Columns to be emphasized. |
expandVals |
Emphasis values; a value less than 1 means de-emphasis. |
mtry |
Number of variables randomly tried at each split. |
yesYVal |
Y value to be considered "yes," to be coded 1 rather than 0. |
yesYLevel |
Y value to be considered "yes," to be coded 1 rather than 0. |
noBeta0 |
No intercept term. |
pcaProp |
Desired proportion of overall variance for the PCs. |
data |
Dataframe, training set. Classification case is signaled via labels column being an R factor. |
dataIn |
See |
qeOut |
Output from a qe-series function. |
yName |
Name of the class labels column. |
holdout |
If not NULL, form a holdout set of the specified size. After fitting to the remaining data, evaluate accuracy on the test set. |
k |
Number of nearest neighbors. In functions other than
|
smoothingFtn |
As in |
scaleX |
Scale the features. |
nTree |
Number of trees. |
minNodeSize |
Minimum number of data points in a tree node. |
learnRate |
Learning rate. |
hidden |
Vector of units per hidden layer. Fractional values
indicate dropout proportions. Can be specified as a string, e.g.
'100,50', for use with
|
nEpoch |
Number of iterations in neural net. |
acts |
Vector of names of the activation functions, one per hidden layer. Choices include 'relu', 'sigmoid', 'tanh', 'softmax', 'elu', 'selu'. |
alpha |
In the case of |
gamma |
Scale parameter in |
cost |
Cost parameter in |
kernel |
In the case of |
degree |
Degree of SVM polynomial kernel, if any. |
opts |
R list of optional arguments for none, some or all of the
functions in |
nComps |
Number of UMAP components to extract. |
nNeighbors |
Number of nearest neighbors to use in UMAP. |
ll |
If TRUE, use local linear forest. |
lambda |
Ridge lambda for local linear forest. |
splitCutoff |
For leaves smaller than this value, do not fit linear model. Just use the linear model fit to the entire dataset. |
xShape |
Input X data shape, e.g. c(28,28) for 28x28 grayscale
images. Must be non-NULL if |
treeDepth |
Number of levels in each tree. |
nRounds |
Number of boosting rounds. |
rpartControl |
An R list specifying properties of fitted trees. |
numCores |
Number of cores to use in parallel computation. |
parPlat |
Parallel platform. Valid values are
'none', 'cluster' (output of |
xSetSize |
Size of subsets of the predictor variables. |
nXSets |
Number of subsets of the predictor variables. |
coalesce |
Method for combining variable sets. |
deg |
Degree of a polynomial. |
maxInteractDeg |
Maximum degree of interaction terms in a polynomial. |
yLevelName |
Name of the class to be considered a positive response in a classification problem. |
params |
Tuning parameters for |
boxPalette |
Color palette. |
pkgName |
Name of wrapped package. |
whereObtain |
Location. |
x |
A qe-series function return object. |
As noted, these functions are intended for quick, first-level analysis of regression/machine learning problems. Emphasis here is on convenience and simplicity.
The idea is that, given a new dataset, the analyst can quickly and easily try fitting a number of models in succession, say first k-NN, then random forests:
# built-in data on major league baseball players
> data(mlb)
> mlb <- mlb[,3:6]  # position, height, weight, age
# fit models
> knnout <- qeKNN(mlb,'Weight',k=25)
> rfout <- qeRF(mlb,'Weight')
# mean abs. pred. error on holdout set, in pounds
> knnout$testAcc
[1] 11.75644
> rfout$testAcc
[1] 12.6787
# predict a new case
> newx <- data.frame(Position='Catcher',Height=73.5,Age=26)
> predict(knnout,newx)
       [,1]
[1,] 204.04
> predict(rfout,newx)
      11
199.1714
# many of the functions include algorithm-specific output
> lassout <- qeLASSO(mlb,'Weight')
holdout set has 101 rows
> lassout$testAcc
[1] 14.27337
> lassout$coefs  # sparse result?
10 x 1 sparse Matrix of class "dgCMatrix"
                                    s1
(Intercept)               -109.2909416
Position.Catcher             0.4408752
Position.First_Baseman       4.8308437
Position.Outfielder          .
Position.Relief_Pitcher      .
Position.Second_Baseman     -0.7846501
Position.Shortstop          -4.2291338
Position.Starting_Pitcher    .
Height                       4.0039114
Age                          0.5352793
The holdout argument triggers formation of a holdout set and the corresponding cross-validation evaluation of predictive power. Note that if a holdout is formed, the return value will consist of the fit on the training set, not on the full original dataset.
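What forming a holdout set amounts to can be sketched in base R. The default size expression is taken from the function signatures above; everything else (the mtcars data, lm as the model, the variable names) is an assumption made for illustration rather than the package's internal code.

```r
set.seed(3)
dat <- mtcars

# default holdout size as written in the qe* signatures
holdout <- floor(min(1000, 0.1 * nrow(dat)))

idx <- sample(nrow(dat), holdout)   # rows set aside for testing
trn <- dat[-idx, ]
tst <- dat[idx, ]

# the fit uses the training part only, not the full dataset
fit <- lm(mpg ~ wt + hp, data = trn)

# mean absolute prediction error on the holdout rows
testAcc <- mean(abs(tst$mpg - predict(fit, tst)))
testAcc
```

With small datasets the default holdout is small as well (3 rows out of 32 here), so the resulting accuracy figure is quite variable from one random split to the next.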
The qe* functions do model fit. Each of them has a predict method, and some also have a plot method.
Arguments for qe* are at least:
data
yName
holdout
Typically there are also algorithm-specific hyperparameter arguments.
Arguments for predict are at least:
object, the return value from qe*
newx, a data frame of points to be predicted
For both the fitting function and the prediction function, there may be additional algorithm-specific parameters; default values are provided.
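The fit-then-predict pattern described above relies on R's S3 dispatch. A toy version, using a hypothetical wrapper named toyQeLin that is not part of qeML, might look like this:

```r
# hypothetical toy wrapper around lm(), mimicking the (data, yName) interface
toyQeLin <- function(data, yName) {
   frml <- as.formula(paste(yName, "~ ."))
   out <- list(fit = lm(frml, data = data), yName = yName)
   class(out) <- "toyQeLin"
   out
}

# S3 method: called when predict() is applied to a 'toyQeLin' object
predict.toyQeLin <- function(object, newx, ...) {
   predict(object$fit, newx)
}

z <- toyQeLin(mtcars[, c("mpg", "wt", "hp")], "mpg")
p <- predict(z, data.frame(wt = 3, hp = 120))
p
```

Because the class of the returned object selects the predict method, the user calls predict(object, newx) in the same way regardless of which fitting function produced the object.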
Some notes on specific functions:
The function qeLin handles not only the usual OLS models but also classification problems as multivariate-outcome linear models. If one's goal is prediction, it can be much faster than qeLogit, often with comparable accuracy.
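The phrase "classification problems as multivariate-outcome linear models" can be made concrete with a base R sketch on the built-in iris data. This shows the general technique (regress class indicator columns on the features, predict the class with the largest fitted value); whether qeLin does exactly this internally is an assumption.

```r
# indicator (dummy) columns, one per class of Y
ymat <- model.matrix(~ Species - 1, data = iris)
x <- as.matrix(iris[, 1:4])

# one linear model per class, fitted jointly as a matrix-response lm
fit <- lm(ymat ~ x)

# fitted scores; predicted class = column with the largest score
scores <- cbind(1, x) %*% coef(fit)
pred <- levels(iris$Species)[apply(scores, 1, which.max)]

mean(pred == iris$Species)   # in-sample proportion correct
```

Only a single least-squares solve is needed, with no iterative fitting, which is where the speed advantage over logistic regression comes from.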
Regularization in linear/generalized linear models is implemented in qeLASSO and other functions with names containing 'LASSO', as well as qeNCVregCV. The latter, covering MCP and other regularization methods, wraps the package of the same name.
Several functions fit polynomial models. The qePolyLin
function does polynomial regression of the indicated degree. In the
above example degree 3 means all terms through degree 3, e.g.
Height * Age^2
. Dummy variables are handled properly, e.g.
no powers of a dummy are generated. The logistic polynomial
regression version is qePolyLog
, and there is a LASSO version,
qePolyLASSO
.
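To make "all terms through degree 3" concrete, here is a small sketch using base R's polym() rather than the qeML functions; the toy Height and Age values are made up for illustration.

```r
# Sketch: all monomials in Height and Age of total degree 1 through 3
d <- data.frame(Height = c(70, 72, 74, 75, 76, 73),
                Age    = c(23, 31, 27, 35, 29, 25))
P <- polym(d$Height, d$Age, degree = 3, raw = TRUE)
colnames(P)   # each name gives the exponent of Height, then the exponent of Age
ncol(P)       # number of generated terms
```

The column named "1.2" holds Height * Age^2, the cross term cited above.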
Several random forests implementations are offered:
qeRF
wraps randomForest
in the package of the same name;
qeRFranger
wraps ranger
in the package of the same name;
qeRFgrf
wraps regression_forest
and
ll_regression_forest
in grf (the latter does local
linear smoothing). There is also qeDT
, using
the party package.
Several implementations of gradient boosting are offered,
including qeGBoost
using the gbm package,
qeLightGBoost
using lightgbm, and qeXGBoost
wrapping xgboost.
Several functions involve dimension reduction/feature
selection. Pre-mapping to lower-dimensional manifolds can be done via
qePCA
and qeUMAP
. For instance, the former will first
extract the specified number of principal components, then fit the
user's desired ML model, say k-NN (qeKNN
) or gradient boosting
(qeGBoost
).
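The "reduce first, then fit" pattern might look like this in base R. This is a sketch under assumptions: prcomp() and lm() stand in for whatever qePCA does internally, and keeping 2 components is an arbitrary choice.

```r
# Sketch: extract principal components, then fit a model on them
x <- mtcars[, c("disp", "hp", "wt", "qsec")]
pcs <- prcomp(x, scale. = TRUE)            # PCA on standardized features
nComps <- 2                                # number of components kept (arbitrary)
z <- as.data.frame(pcs$x[, 1:nComps])
z$mpg <- mtcars$mpg
fit <- lm(mpg ~ ., data = z)               # stand-in for the user's chosen model

# new cases must go through the same rotation before prediction
newz <- as.data.frame(predict(pcs, x[1:3, ])[, 1:nComps])
newPreds <- predict(fit, newz)
newPreds
```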
The qeFOCI
function does feature selection
in a basically assumption-free manner. It handles numeric and binary
Y (the latter coded 1,0). For categorical Y, use qeFOCImult
.
The function qeFOCIrand
applies FOCI to many subsets of the
input dataset, eventually returning the union of the outputs; this is
useful if the dataset has many NA values.
Neural network models are implemented by qeNeural
and qeDeepnet
, based on keras and deepnet.
The qeLinKNN
function offers a hybrid approach. It
first fits a linear model, then applies k-Nearest Neighbors to the
residuals. The qePolyLinKNN
function does the same with a
polynomial fit.
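The hybrid idea can be sketched in a few lines of base R. This is an illustration of the general approach, not the qeLinKNN implementation; the helper knnResid, the choice k = 5, and the mtcars features are assumptions.

```r
# Sketch: linear fit first, then k-NN smoothing of the residuals
x <- scale(as.matrix(mtcars[, c("wt", "hp")]))   # standardized features
y <- mtcars$mpg
linFit <- lm(y ~ x)
res <- resid(linFit)

knnResid <- function(newx, k = 5) {      # hypothetical helper
  d <- sqrt(colSums((t(x) - newx)^2))    # Euclidean distances to training rows
  mean(res[order(d)[1:k]])               # average residual among the k nearest
}

newx <- x[1, ]
linPred <- sum(coef(linFit) * c(1, newx))   # linear part of the prediction
linPred + knnResid(newx)                    # linear prediction plus k-NN correction
```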
The qeIso
function is intended mainly for use as a
smoothing method in calibration actions.
In most cases, the full basket of options in the wrapped function is not reflected. Use of arguments not presented in the qe* function requires direct use of the relevant packages.
The value returned by qe*
functions depends on the algorithm, but
with some commonality, e.g. classif
, a logical value indicating
whether the problem was of classification type.
If a holdout set was requested, an additional returned component will be
testAcc
, the accuracy on the holdout set. This will be Mean
Absolute Prediction Error in the regression case, and proportion of
misclassified cases in the classification case.
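The two accuracy measures just described are simple to compute by hand. The toy vectors below are made up purely to show the arithmetic.

```r
# Regression case: Mean Absolute Prediction Error
y    <- c(200, 185, 210, 190)
pred <- c(195, 190, 214, 190)
mape <- mean(abs(pred - y))            # (5 + 5 + 4 + 0) / 4 = 3.5

# Classification case: proportion of misclassified cases
yClass    <- factor(c("a", "b", "a", "c"))
predClass <- factor(c("a", "b", "c", "c"), levels = levels(yClass))
misclass <- mean(predClass != yClass)  # 1 of 4 wrong = 0.25

c(mape = mape, misclass = misclass)
```

In both cases smaller values are better, so testAcc is an error measure despite its name.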
The value returned by the predict
functions is an
R list with components as follows:
Classification case:
predClasses
: R factor instance of predicted class labels
probs
: vector/matrix of class probabilities; in the 2-class
case, a vector, the probabilities of Y = 1
Regression case: vector of predicted values
Norm Matloff
# see also 'details' above

## Not run:

data(peFactors)
pef <- peFactors[,c(1,3,5,7:9)]
# most people in the dataset have at least a Bachelor's degree; so let's
# just consider Master's (code 14) and PhD (code 16) as special
pef$educ <- toSubFactor(pef$educ,c('14','16'))

# predict occupation; 6 classes, 100, 101, 102, 106, 140, 141, using SVM
svmout <- qeSVM(pef,'occ',holdout=NULL)
# as example of prediction, take the 8th case, but change the gender and
# age to female and 25; note that by setting k to non-null, we are
# requesting that conditional probabilities be calculated, via
# knnCalib(), here using 25 nearest neighbors
newx <- pef[8,-3]
newx$sex <- '2'
newx$age <- 25
predict(svmout,newx,k=25)
# $predClasses
#   8
# 100
# Levels: 100 101 102 106 140 141
# $dvals
#      102/101    102/100   102/141  102/140  102/106    101/100  101/141
# 8 -0.7774038 -0.5132022 0.9997894 1.003251 0.999688 -0.4023077 1.000419
#    101/140   101/106  100/141  100/140  100/106   141/140    141/106   140/106
# 8 1.000474 0.9997371 1.000088 1.000026 1.000126 0.9460703 -0.4974625 -1.035721
#
# $probs
#       100  101  102  106 140  141
# [1,] 0.24 0.52 0.12 0.08   0 0.04
#
# so, occupation code 100 is predicted, with a 0.36 conditional
# probability

# if holdout evaluation is desired as well, say 1000 cases, seed 9999:
> svmout <- qeSVM(pef,'occ',holdout=c(1000,9999))
> svmout$testAcc
[1] 0.622  # 62

# linear
# lm() doesn't like numeric factor levels, so prepend an 'a'
pef$occ <- prepend('a',pef$occ)
lmout <- qeLin(pef,'occ')
predict(lmout,pef[1,-3])  # occ 100, prob 0.3316
lmout <- qeLin(pef,'wageinc')
predict(lmout,pef[1,-5])  # 70857.79

## End(Not run)
This data is suitable for NLP analysis. It consists of all the quizzes I've given in undergraduate courses, 143 quizzes in all.
It is available in two forms. First, quizzes
is a data.frame,
143 rows and 2 columns. Row i consists of a single character vector
comprising the entire quiz i, followed by the course name (as an R
factor). The second form is an R list, 143 elements. Each list element
is a character vector, one vector element per line of the quiz.
The original documents were LaTeX files. They have been run through the
detex
utility to remove most LaTeX commands, as well as removing
the LaTeX preambles separately.
The names of the list elements are the course names, as follows:
ECS 50: a course in machine organization
ECS 132: an undergraduate course in probabilistic modeling
ECS 145: a course in scripting languages (Python, R)
ECS 158: an undergraduate course in parallel computation
ECS 256: a graduate course in probabilistic modeling
Utilities to manipulate R factors, extending the ones in regtools.
levelCounts(data)
dataToTopLevels(data,lowCountThresholds)
factorToTopLevels(f,lowCountThresh=0)
cartesianFactor(dataName,factorNames,fNameSep = ".")
qeRareLevels(x, yName, yesYVal = NULL)
data |
A data frame or equivalent. |
f |
An R factor. |
lowCountThresh |
Factor levels with counts below this value will not be used for this factor. |
lowCountThresholds |
An R list of column names and their
corresponding values of lowCountThresh. |
dataName |
A quoted name of a data frame or equivalent. |
factorNames |
A vector of R factor names. |
fNameSep |
A character to be used as a delimiter in the names of the levels of the output factor. |
x |
A data frame. |
yName |
Quoted name of the response variable. |
yesYVal |
In the case of binary Y, the factor level to be considered positive. |
Often one has an R factor in which one or more levels are rare in the
data. This could cause problems, say in performing cross-validation; a
level in the test set might be "new," not having appeared in the
training set. Toward this end, factorToTopLevels
will remove
rare levels from a factor; dataToTopLevels
applies this to an
entire data frame.
Also toward this end, the function levelCounts
simply applies
table()
to each column of data
, returning the result as an
R list. (If more than 10 levels, it returns NA.)
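A base-R sketch of this kind of per-column tabulation is shown below. The helper name countLevels and its maxLevels argument are hypothetical, and the handling of the 10-level cutoff is one reading of the description above, not the levelCounts source.

```r
# Sketch: tabulate each column; give NA for columns with too many distinct values
countLevels <- function(d, maxLevels = 10) {   # hypothetical helper
  lapply(d, function(col) {
    if (length(unique(col)) > maxLevels) NA else table(col)
  })
}

counts <- countLevels(mtcars[, c("cyl", "mpg")])
counts$cyl    # counts for the 3 distinct cylinder values
counts$mpg    # NA, since mpg has more than 10 distinct values
```

Small counts in such a table are the rare levels that can cause trouble in cross-validation.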
The function cartesianFactor
generates a "superfactor" from
individual ones; e.g. if factors f1 and f2 have n1 and n2 levels, the
output is a new factor with n1 * n2 levels.
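The "superfactor" notion corresponds closely to base R's interaction(), which the following sketch uses as a stand-in (it is not claimed to be how cartesianFactor is written). The toy factors are made up.

```r
# Sketch: combine two factors into one whose levels are all pairs
f1 <- factor(c("female", "male", "female", "male"))   # 2 levels
f2 <- factor(c("100", "101", "102", "100"))           # 3 levels
z <- interaction(f1, f2, sep = ".")
z
nlevels(z) == nlevels(f1) * nlevels(f2)   # all n1 * n2 combinations are levels
```

Levels are created for every combination, including pairs that never occur in the data.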
The function qeRareLevels
checks all columns in a data frame in
terms of being an R factor with rare levels.
Norm Matloff
data(svcensus)
levelCounts(svcensus)  # e.g. finds there are 15182 men, 4908 women
f1 <- svcensus$gender  # 2 levels
f2 <- svcensus$occ  # 6 levels
z <- cartesianFactor('svcensus',c('gender','occ'))
head(z)
# [1] female.102 male.101 female.102 male.100 female.100 male.100
# 12 Levels: female.100 female.101 female.102 female.106 ... male.141
See OpenML repository, https://www.openml.org/search?type=data&sort=runs&id=38&status=active.
"Thyroid disease records supplied by the Garavan Institute and J. Ross Quinlan, New South Wales Institute, Syndney, Australia. 1987."
data(ThyroidDisease)
Miscellaneous functions, used mainly internally in the package, but of possible use externally.
buildQEcall(qeFtnName,dataName,yName=NULL,opts=NULL,holdout=NULL,
   holdoutArg=TRUE)
evalr(toexec)
newDFRow(dta,yName,x,dtaRowNum=1)
qeFtnName |
Quoted name of a qeML predictive function. |
dataName |
Quoted name of a data frame. |
yName |
Quoted name of a column to be predicted. |
opts |
Non-default arguments for the function specified
in qeFtnName. |
holdout |
Size of holdout set, if any. |
holdoutArg |
A value TRUE means the function specified in
|
toexec |
Quoted string containing an R function call. |
dta |
A data frame. |
x |
An R list specifying fields to be set. |
dtaRowNum |
Row number in 'dta' to be used as a basis. |
The function buildQEcall
does what its name implies: It assembles
a string consisting of a qeML function call. Typically the latter
is then executed via evalr. See for instance the source code of
qeLeaveOut1Var
.
R's generic predict
function generally requires that the input
rows match the original training data in name and class. The
newDFRow
function can be used to construct such a row.
Norm Matloff
# function to list all the objects loaded by the specified package
lsp <- function(pkg) {
   cmd <- paste('ls(package:',pkg,')')
   evalr(cmd)
}
lsp('regtools')
# outputs
# [1] "clusterApply" "clusterApplyLB" "clusterCall"
# [4] "clusterEvalQ" "clusterExport" "clusterMap"
# ...
Miscellaneous functions, some used mainly internally in the package, but of possible use externally.
buildQEcall(qeFtnName,dataName,yName=NULL,opts=NULL,holdout=NULL,
   holdoutArg=TRUE)
evalr(toexec)
newDFRow(dta,yName,x,dtaRowNum=1)
replicMeansMatrix(nReps,cmd,nCols=NULL)
Data(datasetName)
wideToLongWithTime(data,timeColName,timeColPresent=TRUE,
   timeColSeq=c(1,1),grpColName=NULL,valueColName=NULL)
data |
Data frame or equivalent. Other than a "time" column, all columns must be numeric, one column per group (see below). |
qeFtnName |
Quoted name of a qeML predictive function. |
dataName |
Quoted name of a data frame. |
yName |
Quoted name of a column to be predicted. |
opts |
Non-default arguments for the function specified
in qeFtnName. |
holdout |
Size of holdout set, if any. |
holdoutArg |
A value TRUE means the function specified in
|
toexec |
Quoted string containing an R function call. |
nReps |
Number of replications. |
cmd |
Quoted string containing an R function call. If multiple statements, enclose with braces. |
nCols |
Number of columns for output. |
dta |
A data frame. |
x |
An R list specifying fields to be set. |
dtaRowNum |
Row number in 'dta' to be used as a basis. |
datasetName |
Quoted string of dataset to be loaded. |
The function buildQEcall
does what its name implies: It assembles
a string consisting of a qeML function call. Typically the latter
is then executed via evalr. See for instance the source code of
qeLeaveOut1Var
.
R's generic predict
function generally requires that the input
rows match the original training data in name and class. The
newDFRow
function can be used to construct such a row.
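To see why matching names and classes matters, consider this base-R sketch of building a one-row data frame from an existing row. The helper makeRow is hypothetical and only mimics the role described for newDFRow; it is not that function's source.

```r
# Sketch: start from an existing row so column names and classes match training data
makeRow <- function(dta, yName, x, rowNum = 1) {   # hypothetical helper
  newRow <- dta[rowNum, setdiff(names(dta), yName), drop = FALSE]
  for (nm in names(x)) {
    val <- x[[nm]]
    # keep factor columns as factors with the original level set
    if (is.factor(dta[[nm]])) val <- factor(val, levels = levels(dta[[nm]]))
    newRow[[nm]] <- val
  }
  newRow
}

fit <- lm(Sepal.Length ~ ., data = iris)
newCase <- makeRow(iris, "Sepal.Length",
                   list(Species = "virginica", Petal.Width = 2.0))
predict(fit, newCase)
```

Fields not listed in x simply keep the values from the base row.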
The function replicMeansMatrix
will eventually replace
regtools::replicMeans
. It runs the specified code many times,
with the code assumed to have some random component such as in
simulation or in investigation of a random algorithm.
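The underlying computation, averaging a randomly generated quantity over many runs, can be sketched with base R's replicate(). This shows the idea only and does not reproduce the interface or output format of replicMeansMatrix.

```r
# Sketch: average a random vector quantity over many replications
set.seed(1)
reps <- replicate(1000, {z <- rnorm(1); c(z, z^2)})   # 2 x 1000 matrix
repMeans <- rowMeans(reps)   # estimates of E(Z) = 0 and E(Z^2) = 1
repMeans
```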
The function Data
is a convenience function that combines calls
to data
and str
.
The function wideToLongWithTime
converts wide-format data to long format, suitable for plotting "Y" versus "X" for different
groups, one per column in data
other than "X", if present. "X" is
typically time but can be any ordered numeric quantity. If "X" is not
present, the arguments specify how the user wants one created.
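A wide-to-long conversion of this kind can be sketched in base R as follows. The data values are made up, the column names weeknum, country and rate are chosen to echo the example below, and the code is not the wideToLongWithTime source.

```r
# Sketch: one column per group in wide form, one row per (time, group) in long form
wide <- data.frame(Canada = c(1.2, 1.3, 1.1),
                   Japan  = c(110, 112, 109))     # made-up rates, no time column
long <- data.frame(
  weeknum = rep(seq_len(nrow(wide)), times = ncol(wide)),  # created "X" column
  country = factor(rep(names(wide), each = nrow(wide))),   # group column
  rate    = unlist(wide, use.names = FALSE)                # "Y" values
)
long
```

Here the time column is generated as 1, 2, 3, ... because the wide data has none.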
Norm Matloff
# function to list all the objects loaded by the specified package
lsp <- function(pkg) {
   cmd <- paste('ls(package:',pkg,')')
   evalr(cmd)
}
lsp('regtools')
# outputs
# [1] "clusterApply" "clusterApplyLB" "clusterCall"
# [4] "clusterEvalQ" "clusterExport" "clusterMap"
# ...

# mean of scalar quantity
replicMeansMatrix(1000,'rnorm(1)^2')
# mean of vector quantity
replicMeansMatrix(1000,'c(rnorm(1),rnorm(1)^2)',2)
# mean of matrix quantity
replicMeansMatrix(1000,
   '{z1=rnorm(1); z2=rnorm(1); x=z1; y=z1+z2; rbind(c(x,x^2),c(y,y^2))}',2)

data(currency)
zLong <- wideToLongWithTime(currency,'weeknum',timeColPresent=FALSE,
   grpColName='country',valueColName='rate')
Various approaches to assessing relative importance of one's features.
qeLeaveOut1Var(data,yName,qeFtnName,nReps,opts=list())
data |
Dataframe, training set. Classification case is signaled via labels column being an R factor. |
yName |
Name of the class labels column. |
qeFtnName |
Quoted |
nReps |
Number of holdout sets to generate. |
opts |
R list of optional arguments for none, some or all of the
functions in |
Many methods have been developed for assessing the relative importance of one's features. A few that we consider most useful are accessible here.
As a quick assessment, the qeLeaveOut1Var
function, with call
form as above, simply compares predictive ability with and without
the given feature.
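The leave-one-variable-out comparison can be sketched in base R as below. This is a simplified illustration: it uses a single holdout split and lm() on mtcars, whereas the function above takes nReps holdout sets and any qeML fitting function; the helper mape is hypothetical.

```r
# Sketch: compare holdout error with all features vs. with each feature removed
set.seed(9999)
vars <- c("mpg", "wt", "hp", "qsec")
holdIdxs <- sample(nrow(mtcars), 8)
trn <- mtcars[-holdIdxs, vars]
tst <- mtcars[holdIdxs, vars]

mape <- function(dTrn, dTst) {          # hypothetical helper
  fit <- lm(mpg ~ ., data = dTrn)
  mean(abs(predict(fit, dTst) - dTst$mpg))
}

full <- mape(trn, tst)
feats <- setdiff(vars, "mpg")
withoutEach <- sapply(feats, function(f) mape(trn[, names(trn) != f], tst))
c(full = full, withoutEach)   # a large increase over 'full' suggests an important feature
```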
Some methods rely on reweighting:
qeKNN
qeRFranger
Others make use of order of entry of a variable into the prediction model:
qeFOCI
qeLASSO
Norm Matloff
data(pef)
qeLeaveOut1Var(pef,'wageinc','qeLin',5)
# in order of impact, wkswrkd largest, then education etc.
Various measurements on weather variables collected by NASA. Downloaded
via nasapower
; see that package for documentation.