Title: | Regression and Classification Tools |
---|---|
Description: | Tools for linear, nonlinear and nonparametric regression and classification. Novel graphical methods for assessment of parametric models using nonparametric methods. One vs. All and All vs. All multiclass classification, with optional adjustment of class probabilities. Nonparametric regression (k-NN) for general dimension, with a local-linear option. Nonlinear regression with the Eicker-White method for dealing with heteroscedasticity. Utilities for converting time series to rectangular form. Utilities for conversion between factors and indicator variables. Some code related to "Statistical Regression and Classification: from Linear Models to Machine Learning", N. Matloff, 2017, CRC, ISBN 9781498710916. |
Authors: | Norm Matloff [aut, cre] |
Maintainer: | Norm Matloff <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.7.3 |
Built: | 2025-01-21 21:30:14 UTC |
Source: | https://github.com/matloff/regtools |
This package provides a broad collection of functions useful for regression and classification analysis, and machine learning.
Parametric modeling:
nonlinear regression: nlshc
ridge regression: ridgelm, plot.rlm
missing values (also see our toweranNA package): lmac, makeNA, coef.lmac, vcov.lmac, pcac
Diagnostic plots:
regression diagnostics: parvsnonparplot, nonparvsxplot, nonparvarplot
other: boundaryplot, nonparvsxplot
Classification:
unbalanced data: classadjust (see UnbalancedClasses.md)
All vs. All: avalogtrn, avalogpred
k-NN reweighting: exploreExpVars, plotExpVars, knnFineTune
Machine learning (also see qeML package):
k-NN: kNN, kmin, knnest, knntrn, preprocessx, meany, vary, loclin, predict.knn, pwplot, bestKperPoint, knnFineTune
neural networks: krsFit, multCol
advanced grid search: fineTuning, fineTuningPar, plot.tuner, knnFineTune
loss: l1, l2, MAPE, ROC
Utilities for dummies and R factors:
conversion between factors and dummies: dummiesToFactor, dummiesToInt, factorsToDummies, factorToDummies, factorTo012etc, hasFactors, charsToFactors, makeAllNumeric
dealing with superset and subsets of factors: toSuperFactor, toSubFactor
Statistics:
mm
Matrix:
multCols, constCols
Time series:
convert time series to rectangular form: TStoX
Text processing:
textToXY
Misc.:
scaling: mmscale, unscale
data frames: catDFRow, tbltofakedf
R: getNamedArgs, ulist
discretize
The data are in the form of an R list. Each element of the list corresponds to one offering of the course. Fields are: Class level; major (two different computer science majors, LCSI in Letters and Science and ECSE in engineering); quiz grade average (scale of 4.0, A+ counting as 4.3); homework grade average (same scale); and course letter grade.
From Wai Mun Fong and Sam Ouliaris, "Spectral Tests of the Martingale Hypothesis for Exchange Rates", Journal of Applied Econometrics, Vol. 10, No. 3, 1995, pp. 255-271. Weekly exchange rates against US dollar, over the period 7 August 1974 to 29 March 1989.
This is the Bike Sharing dataset (day records only) from the UC Irvine Machine Learning Dataset Repository. Included here with permission of Dr. Hadi Fanaee.
The day data is as on UCI; day1 is modified so that the numeric weather variables are on their original scale. The day2 data is the same as day1, except that dteday has been removed, and season, mnth, weekday and weathersit have been converted to R factors.
See https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset for details.
The Stanford WordBank data on vocabulary acquisition in young children. The file consists of about 5500 rows. (There are many NA values, though, and only about 2800 complete cases.) Variables are age, birth order, sex, mother's education and vocabulary size.
Utilities for converting back and forth between factors and dummy variables.
xyDataframeToMatrix(xy)
dummiesToInt(dms,inclLast=FALSE)
factorToDummies(f,fname,omitLast=FALSE,factorInfo=NULL)
factorsToDummies(dfr,omitLast=FALSE,factorsInfo=NULL,dfOut=FALSE)
dummiesToFactor(dms,inclLast=FALSE)
charsToFactors(dtaf)
factorTo012etc(f,earlierLevels = NULL)
discretize(x,endpts)
getDFclasses(dframe)
hasCharacters(dfr)
hasFactors(x)
toAllNumeric(w,factorsInfo=NULL)
toSubFactor(f,saveLevels,lumpedLevel="zzzOther")
toSuperFactor(inFactor,superLevels)
dfOut: If TRUE, return a data frame, otherwise a matrix.
dms: Matrix or data frame of dummy columns.
inclLast: When forming a factor from dummies, include the last dummy as a level if this is TRUE.
xy: A data frame intended for prediction, "Y" in the last column.
saveLevels: In collapsing a factor, which levels to retain.
lumpedLevel: Name of the new level to be created from the levels not retained.
x: A numeric vector, except in hasFactors.
endpts: Vector of endpoints of the intervals used by discretize.
f: A factor.
inFactor: Original factor, to be extended.
superLevels: New levels to be added to the original factor.
earlierLevels: Previous levels found for this factor.
fname: A factor name.
dfr: A data frame.
w: A data frame.
dframe: A data frame, for which we wish to find the column classes.
omitLast: If TRUE, then generate only k-1 dummies from k factor levels.
factorsInfo: Attribute from the output of factorsToDummies.
factorInfo: Attribute from the output of factorToDummies.
dtaf: A data frame.
Many R users prefer to express categorical data as R factors, or often work with data that is of this type to begin with. On the other hand, many regression packages, e.g. lars, disallow factors. These utilities facilitate conversion from one form to another.
Here is an overview of the roles of the various functions:
factorToDummies: Convert one factor to dummies, yielding a matrix of dummies corresponding to that factor.
factorsToDummies: Convert all factors to dummies, yielding a matrix of dummies corresponding to all factors in the input data frame.
dummiesToFactor: Convert a set of related dummies to a factor.
factorTo012etc: Convert a factor to a numeric code, starting at 0.
dummiesToInt: Convert a related set of dummies to a numeric code, starting at 0.
charsToFactors: Convert all character columns in a data frame to factors.
toAllNumeric: Convert all factors in a data frame to dummies, yielding a new version of the data frame, including its original nonfactor columns.
toSubFactor: Coalesce some levels of a factor, yielding a new factor.
toSuperFactor: Add levels to a factor. Typically used in prediction contexts, in which a factor in a data point to be predicted does not have all the levels of the same factor in the training set.
xyDataframeToMatrix: Given a data frame to be used in a training set, with "Y" a factor in the last column, change to all numeric, with dummies in place of all "X" factors and in place of the "Y" factor.
The optional argument factorsInfo
is intended for use in prediction
contexts. Typically a set of new cases will not have all levels of the
factor in the training set. Without this argument, only an incomplete
set of dummies would be generated for the set of new cases.
A key point about changing factors to dummies is that, for later
prediction after fitting a model in our training set, one needs to use
the same transformations. Say a factor has levels 'abc', 'de' and 'f'
(and omitLast = FALSE
). If we later have a set of say two new
cases to predict, and their values for this factor are 'de' and 'f', we
would generate dummies for them but not for 'abc', incompatible with the
three dummies used in the training set.
Thus the factor names and levels are saved in attributes, and can be used as input. The relations are as follows:
factorsToDummies calls factorToDummies on each factor it finds in its input data frame.
factorToDummies outputs, and later inputs, factorInfo.
factorsToDummies outputs, and later inputs, factorsInfo.
Other functions:
getDFclasses: Return a vector of the classes of the columns of a data frame.
discretize: Partition the range of a vector into (not necessarily equal-length) intervals, and construct a factor from the labels of the intervals that the input elements fall into. (A small sketch appears below.)
hasCharacters, hasFactors: Logical scalars, TRUE if the input data frame has any character or factor columns.
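A minimal sketch of discretize, assuming that endpts gives the interval endpoints and that the return value is a factor of interval labels (both assumptions here, not taken from the formal documentation):

x <- c(0.3, 1.7, 3.2, 0.9)
xf <- discretize(x, endpts=c(0,1,2,4))  # roughly, the intervals (0,1], (1,2], (2,4]
xf         # a factor, one level per interval
table(xf)  # how many elements fell into each interval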
The function factorToDummies returns a matrix of dummy variables, while factorsToDummies returns a new version of the input data frame, in which each factor is replaced by columns of dummies. The function charsToFactors is similar, but changes character vectors to factors.
Norm Matloff
x <- factor(c('abc','de','f','de'))
xd <- factorToDummies(x,'x')
xd
#      x.abc x.de
# [1,]     1    0
# [2,]     0    1
# [3,]     0    0
# [4,]     0    1
# attr(,"factorInfo")
# attr(,"factorInfo")$fname
# [1] "x"
#
# attr(,"factorInfo")$omitLast
# [1] TRUE
#
# attr(,"factorInfo")$fullLvls
# [1] "abc" "de"  "f"
w <- factor(c('de','abc','abc'))
wd <- factorToDummies(w,'x',factorInfo=attr(xd,'factorInfo'))
wd
#      x.abc x.de
# [1,]     0    1
# [2,]     1    0
# [3,]     1    0
# attr(,"factorInfo")
# attr(,"factorInfo")$fname
# [1] "x"
#
# attr(,"factorInfo")$omitLast
# [1] TRUE
#
# attr(,"factorInfo")$fullLvls
# [1] "abc" "de"  "f"
Detection of falls in the elderly via physiological measurements. Obtained from Kaggle, but no longer available there.
Adds various extra features to grid search for specified tuning parameter/hyperparameter combinations: There is a plot() function, using parallel coordinates graphs to show trends among the different combinations; and Bonferroni confidence intervals are computed to avoid p-hacking. An experimental smoothing facility is also included.
fineTuning(dataset,pars,regCall,nCombs=NULL,specCombs=NULL,nTst=500,
   nXval=1,up=TRUE,k=NULL,dispOrderSmoothed=FALSE,
   showProgress=TRUE,...)
fineTuningMult(dataset,pars,regCall,nCombs=NULL,
   nTst=500,nXval=1,up=TRUE,k=NULL,dispOrderSmoothed=FALSE,
   showProgress=TRUE,outDim=1,...)
## S3 method for class 'tuner'
plot(x,...)
knnFineTune(data,yName,k,expandVars,ws,classif=FALSE,seed=9999)
fineTuningPar(cls,dataset,pars,regCall,nCombs=NULL,specCombs=NULL,
   nTst=500,nXval=1,up=TRUE,k=NULL,dispOrderSmoothed=FALSE)
...: Arguments to be passed on by the above functions.
x: Output object from fineTuning.
cls: A cluster, in the sense of the parallel package.
dataset: Data frame etc. containing the data to be analyzed.
data: The data to be analyzed.
yName: Quoted name of "Y" in the column names of data.
expandVars: Indices of columns in data to be weighted up or down.
ws: Weights to be used for expandVars.
classif: Set to TRUE for classification problems.
seed: Seed for random number generation.
pars: R list, showing the desired tuning parameter values.
regCall: Function to be called at each parameter combination, performing the model fit etc.
nCombs: Number of parameter combinations to run. If NULL, all will be run.
nTst: Number of data points to be in each holdout set.
nXval: Number of holdout sets/folds to be run for a given data partition and parameter combination.
k: Nearest-neighbor smoothing parameter.
up: If TRUE, display results in ascending order of performance value.
dispOrderSmoothed: Display in order of smoothed results.
showProgress: If TRUE, print each output line as it becomes ready.
specCombs: A data frame in which the user specifies the hyperparameter combinations to evaluate.
outDim: Number of components in the value returned by regCall.
The user specifies the values for each tuning parameter in
pars
. This leads to a number of possible combinations of the
parameters. In many cases, there are more combinations than the user
wishes to try, so nCombs
of them will be chosen at random.
For each combination, the function will run the analysis specified by
the user in regCall
. The latter must have the call form ftnName(dtrn,dtst,cmbi,...)
Again, note that it is fineTuning
that calls this function. It
will provide the training and test sets dtrn
and dtst
, as
well as cmbi
("combination i"), the particular parameter
combination to be run at this moment.
Each chosen combination is run in nXval
folds. All specified
combinations are run fully, as opposed to a directional "hill descent"
search that hopes it might eliminate poor combinations early in the process.
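For concreteness, here is a minimal sketch of a user-supplied regCall function; the data set and the column names wageinc and age, as well as the use of lm, are hypothetical and not part of the package:

tuneCall <- function(dtrn,dtst,cmbi)
{
   # cmbi holds one combination of the tuning parameters named in pars;
   # here we pretend there is a single parameter, a polynomial degree
   deg <- as.integer(cmbi[1])
   fit <- lm(wageinc ~ poly(age,deg), data=dtrn)
   preds <- predict(fit,dtst)
   mean(abs(preds - dtst$wageinc))  # return mean absolute prediction error
}
# fineTuning(myData,list(deg=1:3),tuneCall,nTst=200,nXval=5)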
The function knnFineTune
is a wrapper for fineTuning
for
k-NN problems.
The function plot.tuner
draws a parallel coordinates plot to
visualize the grid. The argument x
is the output of
fineTuning
. Arguments to specify in the ellipsis are:
col
is the column to be plotted;
disp
is the number to display, with 0
, -m
and
+m
meaning cases with the m
smallest 'smoothed' values, all
cases and the m
largest values of 'smoothed', respectively;
jit
avoids plotting coincident lines by adding jitter in the
amount jit * range(x) * runif(n,-0.5,0.5)
.
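A sketch of such calls, assuming ftout holds the output of a fineTuning run (the argument values are illustrative only):

plot(ftout)                      # parallel coordinates plot of all combinations
plot(ftout, disp=10)             # restrict the display, per the disp coding above
plot(ftout, disp=-10, jit=0.02)  # the opposite tail, with a little jitter added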
Object of class 'tuner'. Contains the grid results, including upper bounds of approximate one-sided 95% confidence intervals, both univariate and Bonferroni-Dunn (adjusted for the number of parameter combinations).
Norm Matloff
# mlb data set, predict weight using k-NN, try various values of k
tc <- function(dtrn,dtst,cmbi,...)
{
   knnout <- kNN(dtrn[,-10],dtrn[,10],dtst[,-10],as.integer(cmbi[1]))
   preds <- knnout$regests
   mean(abs(preds - dtst[,10]))
}
data(mlb)
mlb <- mlb[,3:6]
mlb.d <- factorsToDummies(mlb)
fineTuning(mlb.d,list(k=c(5,25)),tc,nTst=100,nXval=2)
tc <- function(dtrn,dtst,cmbi,...)
{
   knnout <- kNN(dtrn[,-10],dtrn[,10],dtst[,-10],as.integer(cmbi[1]))
   preds <- knnout$regests
   mean(abs(preds - dtst[,10]))
}
fineTuningMult(mlb.d,list(k=c(5,25)),tc,nTst=100,nXval=2)
## Not run:
library(qeML)
data(svcensus)
tc1 <- function(dtrn,dtst,cmbi,...)
{
   knnout <- qeKNN(dtrn,'wageinc',as.integer(cmbi[1]),holdout=NULL)
   preds <- predict(knnout,dtst[,-4])
   mape <- mean(abs(preds - dtst[,4]))
   bigprobs75 <- mean(preds > 75000)
   c(mape,bigprobs75)
}
fineTuningMult(svcensus,list(k = c(10,25)),tc1,outDim=2)
## End(Not run)
Full set of tools for k-NN regression and classification, including both for direct usage and as tools for assessing the fit of parametric models.
kNN(x,y,newx=x,kmax,scaleX=TRUE,PCAcomps=0,expandVars=NULL,expandVals=NULL,
   smoothingFtn=mean,allK=FALSE,leave1out=FALSE,classif=FALSE,
   startAt1=TRUE,saveNhbrs=FALSE,savedNhbrs=NULL)
knnest(y,xdata,k,nearf=meany)
preprocessx(x,kmax,xval=FALSE)
meany(nearIdxs,x,y,predpt)
mediany(nearIdxs,x,y,predpt)
vary(nearIdxs,x,y,predpt)
loclin(nearIdxs,x,y,predpt)
## S3 method for class 'knn'
predict(object,...)
kmin(y,xdata,lossftn=l2,nk=5,nearf=meany)
parvsnonparplot(lmout,knnout,cex=1.0)
nonparvsxplot(knnout,lmout=NULL)
nonparvarplot(knnout,returnPts=FALSE)
l2(y,muhat)
l1(y,muhat)
MAPE(yhat,y)
bestKperPoint(x,y,maxK,lossFtn="MAPE",classif=FALSE)
kNNallK(x,y,newx=x,kmax,scaleX=TRUE,PCAcomps=0,
   expandVars=NULL,expandVals=NULL,smoothingFtn=mean,
   allK=FALSE,leave1out=FALSE,classif=FALSE,startAt1=TRUE)
kNNxv(x,y,k,scaleX=TRUE,PCAcomps=0,smoothingFtn=mean,
   nSubSam=500)
knnest(y,xdata,k,nearf=meany)
loclogit(nearIdxs,x,y,predpt)
mediany(nearIdxs,x,y,predpt)
exploreExpVars(xtrn, ytrn, xtst, ytst, k, eVar, maxEVal, lossFtn,
   eValIncr = 0.05, classif = FALSE, leave1out = FALSE)
plotExpVars(xtrn,ytrn,xtst,ytst,k,eVars,maxEVal,lossFtn,
   ylim,eValIncr=0.05,classif=FALSE,leave1out=FALSE)
nearf: Function to be applied to a neighborhood.
ylim: Range of Y values for the plot.
lossFtn: Loss function for the plot.
eVar: Variable to be expanded.
eVars: Variables to be expanded.
maxEVal: Maximum expansion value.
eValIncr: Increment in the range of expansion values.
xtrn: Training set for X.
ytrn: Training set for Y.
xtst: Test set for X.
ytst: Test set for Y.
nearIdxs: Indices of the neighbors.
nSubSam: Number of folds.
x: "X" data, predictors, one row per data point, in the training set.
y: Response variable data in the training set. Vector or matrix, the latter case for vector-valued responses, e.g. multiclass classification. In that case, can be a vector, either (0,1,2,...) or (1,2,3,...), which is automatically converted into a matrix of dummies.
newx: New data points to be predicted. If NULL in kNN, no predictions are made; the fit is saved and new cases can be predicted later via predict.
scaleX: If TRUE, call scale() on the X data.
PCAcomps: If positive, transform the X data to its first PCAcomps principal components.
expandVars: Indices of columns in x to be weighted up or down.
expandVals: The corresponding expansion values.
smoothingFtn: Function to apply to the "Y" values in the set of nearest neighbors. Built-in choices are meany, mediany, vary, loclin and loclogit.
allK: If TRUE, find regression estimates for all values of k up through kmax.
leave1out: If TRUE, omit the 1-nearest neighbor from the analysis.
classif: If TRUE, compute the predicted class labels, not just the regression function values.
startAt1: If TRUE, class labels start at 1, else 0.
k: Number of nearest neighbors.
saveNhbrs: If TRUE, save the nearest-neighbor information in the nhbrs component of the return value.
savedNhbrs: If non-NULL, the nhbrs component from a previous call, to be reused.
...: Needed for consistency with the generic. See Details below for the arguments.
xdata: X and associated neighbor indices. Output of preprocessx.
object: Output of kNN.
predpt: One point on which to predict, as a vector.
kmax: Maximal number of nearest neighbors to find.
maxK: Maximal number of nearest neighbors to find.
xval: Cross-validation flag. If TRUE, then the set of nearest neighbors of a point will not include the point itself.
lossftn: Loss function to be used in cross-validation determination of the "best" k.
nk: Number of values of k to be examined.
lmout: Output of lm.
knnout: Output of kNN or knnest.
cex: R parameter to control dot size in the plot.
muhat: Vector of estimated regression function values.
yhat: Vector of estimated regression function values.
returnPts: If TRUE, return the matrix of plotted points.
The kNN
function is the main tool here; knnest
is being
deprecated. (Note too qeKNN
, a wrapper for kNN
; more
on this below.) Here are the capabilities:
In its most basic form, the function will input training data and
output predictions for new cases newx
. By default this is
done for a single value of the number k
of nearest neighbors,
but by setting allK
to TRUE, the user can request that it be
done for all k
through the specified maximum.
In the second form, newx
is set to NULL in the call to
kNN
. No predictions are made; instead, the regression function
is estimated on all data points in x
, which are saved in the return
value. Future new cases can then be predicted from this saved object,
via predict.kNN
(called via the generic predict
).
The call form is predict(knnout,newx,newxK), with a default value of 1 for newxK.
In this second form, the closest k
points to the newx
in
x
are determined as usual, but instead of averaging their Y
values, the average is taken over the fitted regression estimates at
those points. In this manner, there is almost no computational cost
in the prediction stage.
The second form is intended more for production use, so that neighbor distances need not be repeatedly recomputed.
Nearest-neighbor computation can be time-consuming. If more than one
value of k
is anticipated, for the same x
, y
and
newx
, first run with the largest anticipated value of
k
, with saveNhbrs
set to TRUE. Then for other values
of k
, set savedNhbrs
to the nhbrs
component in
the return value of the first call.
In addition, a novel feature allows the user to weight some
predictors more than others. This is done by scaling the given
predictor up or down, according to a specified value. Normally, this
should be done with scaleX = TRUE
, which applies
scale()
to the data. In other words, first we create a "level
playing field" in which all predictors have standard deviation 1.0,
then scale some of them up or down.
Alternatives are provided to calculating the mean Y in the given neighborhood, such as the median and the variance, the latter of possible use in dealing with heteroscedasticity in linear models.
Another choice of note is to allow local-linear smoothing, by
setting smoothingFtn
to loclin
. Here the value of the
regression function at a point is predicted from a linear fit to the
point's neighbors. This may be especially helpful to counteract bias
near the edges of the data. As in any regression fit, the number of
predictors should be considerably less than the number of neighbors.
Custom functions for smoothing can easily be written, say following
the pattern of loclin
.
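As an illustration, here is a sketch of a custom smoothing function, following the signature of meany and loclin shown in Usage; the trimmed-mean idea itself is hypothetical, not a built-in option:

trimmedMeanY <- function(nearIdxs,x,y,predpt)
{
   # y assumed here to be a numeric vector (regression case)
   mean(y[nearIdxs], trim=0.1)  # trimmed mean of the neighbors' Y values
}
# kNN(x,y,newx,kmax=25,smoothingFtn=trimmedMeanY)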
The main alternative to kNN
is qeKNN
in the qe* ("quick
and easy") series. It is more convenient, e.g. allowing factor
inputs, but less flexible.
The functions ovaknntrn
and ovaknnpred
are multiclass
wrappers for knnest
and knnpred
, thus also deprecated.
Here y
is coded 0,1,...,m
-1 for the m
classes.
The tools here can be useful for fit assessment of parametric models. The parvsnonparplot function plots the fitted values of the parametric model against the k-NN fitted values; nonparvsxplot plots the k-NN fitted values against each predictor, one by one.
The functions l2
and l1
are used to define L2 and L1
loss.
Norm Matloff
x <- rbind(c(1,0),c(2,5),c(0,5),c(3,3),c(6,3))
y <- c(8,3,10,11,4)
newx <- c(0,0)
kNN(x,y,newx,2,scaleX=FALSE)
# $whichClosest
#      [,1] [,2]
# [1,]    1    4
# $regests
# [1] 9.5
kNN(x,y,newx,3,scaleX=FALSE,smoothingFtn=loclin)$regests
# 7.307692
knnout <- kNN(x,y,newx,2,scaleX=FALSE)
knnout
# $whichClosest
#      [,1] [,2]
# [1,]    1    4
# ...
## Not run:
data(mlb)
mlb <- mlb[,c(4,6,5)]  # height, age, weight
# fit, then predict 75", age 21, and 72", age 32
knnout <- kNN(mlb[,1:2],mlb[,3],rbind(c(75,21),c(72,32)),25)
knnout$regests
# [1] 202.72 195.72
# fit now, predict later
knnout <- kNN(mlb[,1:2],mlb[,3],NULL,25)
predict(knnout,c(70,28))
# [1] 186.48
data(peDumms)
names(peDumms)
ped <- peDumms[,c(1,20,22:27,29,31,32)]
names(ped)
# fit, and predict income of a 35-year-old man, MS degree, occupation 101,
# worked 50 weeks, using 25 nearest neighbors
kNN(ped[,-10],ped[,10],c(35,1,0,0,1,0,0,0,1,50),25)$regests
# [1] 67540
# fit, and predict occupation 101 for a 35-year-old man, MS degree,
# wage $55K, worked 50 weeks, using 25 nearest neighbors
z <- kNN(ped[,-c(4:8)],ped[,4],c(35,1,0,1,55,50),25,classif=TRUE)
z$regests
# [1] 0.16
z$ypreds
# [1] 0   class 0, i.e. not occupation 101; round(0.16) = 0,
#         computed by user request, classif = TRUE
# the y argument must be either a vector (2-class setting) or a matrix
# (multiclass setting)
occs <- as.matrix(ped[, 4:8])
z <- kNN(ped[,-c(4:8)],occs,c(35,1,0,1,72000,50),25,classif=TRUE)
z$ypreds
# [1] 3   occupation 3, i.e. 102, is predicted
# predict occupation in general; let's bring occ.141 back in (was
# excluded as a predictor due to redundancy)
names(peDumms)
#  [1] "age"     "cit.1"   "cit.2"   "cit.3"   "cit.4"   "cit.5"   "educ.1"
#  [8] "educ.2"  "educ.3"  "educ.4"  "educ.5"  "educ.6"  "educ.7"  "educ.8"
# [15] "educ.9"  "educ.10" "educ.11" "educ.12" "educ.13" "educ.14" "educ.15"
# [22] "educ.16" "occ.100" "occ.101" "occ.102" "occ.106" "occ.140" "occ.141"
# [29] "sex.1"   "sex.2"   "wageinc" "wkswrkd" "yrentry"
occs <- as.matrix(peDumms[,23:28])
z <- kNN(ped[,-c(4:8)],occs,c(35,1,0,1,72000,50),25,classif=TRUE)
z$ypreds
# [1] 3   prediction is occ.102
# use leave1out to avoid overfit
knnout <- kNN(ped[,-10],ped[,10],ped[,-10],25,leave1out=TRUE)
mean(abs(knnout$regests - ped[,10]))
# [1] 25341.6
# use of the weighted distance feature; deweight age by a factor of 0.5,
# put increased weight on weeks worked, factor of 1.5
knnout <- kNN(ped[,-10],ped[,10],ped[,-10],25,
   expandVars=c(1,10),expandVals=c(0.5,1.5),leave1out=TRUE)
mean(abs(knnout$regests - ped[,10]))
# [1] 25196.61
## End(Not run)
Tools to complement existing neural networks software, notably a more "R-like" wrapper to fitting data with R's keras package.
krsFit(x,y,hidden,acts=rep("relu",length(hidden)),learnRate=0.001,
   conv=NULL,xShape=NULL,classif=TRUE,nClass=NULL,nEpoch=30,
   scaleX=TRUE,scaleY=TRUE)
krsFitImg(x,y,hidden=c(100,100),acts=rep("relu",length(hidden)),
   nClass,nEpoch=30)
## S3 method for class 'krsFit'
predict(object,...)
diagNeural(krsFitOut)
object: An object of class 'krsFit'.
...: Data points to be predicted, 'newx'.
x: X data, predictors, one row per data point, in the training set. Must be a matrix.
y: Numeric vector of Y values. In the classification case, must be integers, not an R factor, taking on the values 0,1,2,...,nClass-1.
hidden: Vector of the number of units per hidden layer, or the rate for a dropout layer.
acts: Vector of names of the activation functions, one per hidden layer. Choices include 'relu', 'sigmoid', 'tanh', 'softmax', 'elu', 'selu'.
learnRate: Learning rate.
conv: R list specifying the convolutional layers, if any.
xShape: Vector giving the number of rows and columns, and in the convolutional case with multiple channels, the number of channels.
classif: If TRUE, indicates a classification problem.
nClass: Number of classes.
nEpoch: Number of epochs.
krsFitOut: An object returned by krsFit.
scaleX: If TRUE, scale the X columns.
scaleY: If TRUE, scale the Y columns.
The krsFit function is a wrapper for the entire pipeline in fitting the R keras package to a dataset: defining the model, compiling, stating the inputs and so on. As a result, the wrapper allows the user to skip those details (or even not need to know them), and define the model in a manner more familiar to R users.
The paired predict.krsFit takes as its first argument the output of krsFit, and newx, the points to be predicted.
Norm Matloff
## Not run:
library(keras)
data(peDumms)
ped <- peDumms[,c(1,20,22:27,29,32,31)]
# predict wage income
x <- ped[,-11]
y <- ped[,11]
z <- krsFit(x,y,c(50,50,50),classif=FALSE,nEpoch=25)
preds <- predict(z,x)
mean(abs(preds-y))  # something like 25000
x <- ped[,-(4:8)]
y <- ped[,4:8]
y <- dummiesToInt(y,FALSE) - 1
z <- krsFit(x,y,c(50,50,0.20,50),classif=TRUE,nEpoch=175,nClass=6)
preds <- predict(z,x)
mean(preds == y)  # something like 0.39
# obtain MNIST training and test sets; the following then uses the
# example network of
# https://databricks-prod-cloudfront.cloud.databricks.com/
# public/4027ec902e239c93eaaa8714f173bcfc/2961012104553482/
# 4462572393058129/1806228006848429/latest.html
# converted to use the krsFit wrapper
x <- mntrn[,-785] / 255
y <- mntrn[,785]
xShape <- c(28,28)
# define convolutional layers
conv1 <- list(type='conv2d',filters=32,kern=3)
conv2 <- list(type='pool',kern=2)
conv3 <- list(type='conv2d',filters=64,kern=3)
conv4 <- list(type='pool',kern=2)
conv5 <- list(type='drop',drop=0.5)
# call wrapper, 1 dense hidden layer of 128 units, then dropout layer
# with proportion 0.5
z <- krsFit(x,y,conv=list(conv1,conv2,conv3,conv4,conv5),c(128,0.5),
   classif=TRUE,nClass=10,nEpoch=10,xShape=c(28,28),scaleX=FALSE,scaleY=FALSE)
# try on test set
preds <- predict(z,mntst[,-785]/255)
mean(preds == mntst[,785])  # 0.98 in my sample run
## End(Not run)
Various estimators that handle missing data via the Available Cases Method
lmac(xy,nboot=0)
makeNA(m,probna)
NAsTo0s(x)
ZerosToNAs(x,replaceVal=0)
## S3 method for class 'lmac'
coef(object,...)
## S3 method for class 'lmac'
vcov(object,...)
pcac(indata,scale=FALSE)
loglinac(x,margin)
tbltofakedf(tbl)
replaceVal: Value to be replaced by NA.
xy: Matrix or data frame, X values in the first columns, Y in the last column.
indata: Matrix or data frame.
x: Matrix or data frame, one column per variable.
nboot: If positive, number of bootstrap samples to take.
probna: Probability that an element will be NA.
scale: If TRUE, scale the data before computing the principal components.
tbl: An R table.
m: Number of synthetic NAs to insert.
object: Output from lmac.
...: Needed for consistency with the generic function. Not used.
margin: A list of vectors specifying the model, as in loglin.
The Available Cases (AC) approach applies to statistical methods that depend only on products of k of the variables, so that cases having non-NA values for those k variables can be used, as opposed to using only cases that are fully intact in all variables, the Complete Cases (CC) approach. In the case of linear regression, for instance, the estimated coefficients depend only on covariances between the variables (both predictors and response). This approach assumes that the cases with missing values have the same distribution as the intact cases.
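To illustrate the principle (this is not the package's internal code), the slope and intercept of a simple linear regression can be computed from pairwise-complete, i.e. available-case, means and covariances:

set.seed(1)
n <- 1000
x <- rnorm(n)
y <- 2 + 3*x + rnorm(n)
x[sample(n,100)] <- NA
y[sample(n,100)] <- NA
covAC <- cov(cbind(x,y), use='pairwise.complete.obs')
bhat1 <- covAC['x','y'] / covAC['x','x']                  # slope from AC covariances
bhat0 <- mean(y,na.rm=TRUE) - bhat1 * mean(x,na.rm=TRUE)  # intercept
c(bhat0,bhat1)  # should be near the population values 2 and 3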
The lmac
function forms OLS estimates as with lm
, but
applying AC, in contrast to lm
, which uses the CC method.
The pcac
function is an AC substitute for prcomp
. The
data is centered, corresponding to a fixed value of center =
TRUE
in prcomp
. It is also scaled if scale
is TRUE,
corresponding to scale = TRUE
in prcomp
. Due to AC,
there is a small chance of negative eigenvalues, in which case
stop
will be called.
The loglinac
function is an AC substitute for loglin
.
The latter takes tables as input, but loglinac
takes the raw
data. If you have just the table, use tbltofakedf
to
regenerate a usable data frame.
The makeNA
function is used to insert random NA values into
data, for testing purposes.
For lmac, an object of class 'lmac', with components:
coefficients, as with lm; accessible directly or by calling coef, as with lm
fitted.values, as with lm
residuals, as with lm
r2, (unadjusted) R-squared
cov, for nboot > 0, the estimated covariance matrix of the vector of estimated regression coefficients; accessible directly or by calling vcov, as with lm
For pcac, an R list, with components:
sdev, as with prcomp
rotation, as with prcomp
For loglinac, an R list, with components:
param, estimated coefficients, as in loglin
fit, estimated expected cell counts, as in loglin
Norm Matloff
n <- 25000
w <- matrix(rnorm(2*n),ncol=2)  # x and epsilon
x <- w[,1]
y <- x + w[,2]
# insert some missing values
nmiss <- round(0.1*n)
x[sample(1:n,nmiss)] <- NA
nmiss <- round(0.2*n)
y[sample(1:n,nmiss)] <- NA
acout <- lmac(cbind(x,y))
coef(acout)  # should be near pop. values 0 and 1
This data consists of capital letter frequencies, obtained from http://www.math.cornell.edu/~mec/2003-2004/cryptography/subs/frequencies.html
Various helper functions.
replicMeans(nrep,toReplic,timing=FALSE)
stdErrPred(regObj,xnew)
pythonBlankSplit(s)
stopBrowser(msg = stop("msg not supplied"))
doPCA(x,pcaProp)
PCAwithFactors(x, nComps = ncol(x))
ulist(lst)
prToFile(filename)
partTrnTst(fullData,nTest=min(1000,round(0.2*nrow(fullData))))
findOverallLoss(regests,y,lossFtn = MAPE)
getNamedArgs(argVec)
multCols(x,cols,vals)
probIncorrectClass(yhat, y, startAt1 = TRUE)
propMisclass(y,yhat)
regests: Fitted regression estimates, training set.
y: Y values, training set.
yhat: Predicted Y values.
startAt1: TRUE if class labeling starts at 1, FALSE if it starts at 0.
lossFtn: Loss function.
fullData: A data frame or matrix.
nTest: Number of rows for the test set.
filename: Name of the output file.
lst: An R list.
x: Matrix or data frame.
pcaProp: Fraction in [0,1], specifying the number of PCA components to compute, in terms of fraction of total variance.
nComps: Number of PCA components.
regObj: An object of class 'lm' or similar regression output.
xnew: New X value to be predicted.
nrep: Number of replications.
s: A character string.
toReplic: Function call(s), as a quoted string, separated by semicolons if more than one call.
timing: If TRUE, find the average elapsed time over the replicates.
msg: Character string, error message for the existing debug browser.
argVec: R list or vector with named elements.
cols: A set of column numbers.
vals: A set of positive expansion numbers.
The function PCAwithFactors is a wrapper for stats::prcomp, to be used on data frames that contain at least one R factor.
The function PCAwithFactors returns an object of class 'PCAwithFactors', with components pcout, the object returned by the wrapped call to prcomp; factorsInfo, factor conversion information to be used with predict; and preds, the PCA version of x.
The function getNamedArgs
will assign in the caller's space
variables with the names and values in argVec
.
Norm Matloff
w <- list(a=3,b=8)
getNamedArgs(w)
a
b
u <- c(5,12,13)
names(u) <- c('x','y','z')
getNamedArgs(u)
x
y
z
Heights, weights, ages etc. of major league baseball players. A new variable has been added, consolidating positions into Infielders, Outfielders, Catchers and Pitchers.
Included here with the permission of the UCLA Statistics Department.
The MovieLens dataset, https://grouplens.org/, is a standard example in the recommender systems literature. Here we give demographic data for each user, plus the mean rating and number of ratings. One may explore, for instance, the relation between ratings and age.
Method of Moments computation for almost any statistical problem that has derivatives with respect to theta. Capable of handling models that include parametric regression terms, but the problem need not be a regression one. (This is not Generalized Method of Moments; see the package gmm for the latter.)
mm(m,g,x,init=rep(0.5,length(m)),eps=0.0001,maxiters=1000)
m: Vector of sample moments, the "left-hand sides" of the moment equations.
g: Function of the parameter estimates, forming the "right-hand sides." This is a multivariate-valued function, of dimensionality equal to that of m.
init: Vector of initial guesses for the parameter estimates. If components are named, these will be used as labels in the output.
eps: Convergence criterion.
maxiters: Maximum number of iterations.
x: Input data.
Standard Newton-Raphson methods are used to solve for the parameter
estimates, with numericDeriv
being used to find the
approximate derivatives.
R list consisting of components tht
, the vector of parameter
estimates, and numiters
, the number of iterations performed.
Norm Matloff
x <- rgamma(1000,2)
m <- c(mean(x),var(x))
g <- function(x,theta) {  # from theoretical properties of gamma distr.
   g1 <- theta[1] / theta[2]
   g2 <- theta[1] / theta[2]^2
   c(g1,g2)
}
# should output about 2 and 1
mm(m,g,x)
## Not run:
library(mfp)
data(bodyfat)
# model as a beta distribution
g <- function(x,theta) {
   t1 <- theta[1]
   t2 <- theta[2]
   t12 <- t1 + t2
   meanb <- t1 / t12
   m1 <- meanb
   m2 <- t1*t2 / (t12^2 * (t12+1))
   c(m1,m2)
}
x <- bodyfat$brozek/100
m <- c(mean(x),var(x))
# about 4.65 and 19.89
mm(m,g,x)
## End(Not run)
Tools for multiclass classification, parametric and nonparametric.
avalogtrn(trnxy,yname)
ovaknntrn(trnxy,yname,k,xval=FALSE)
avalogpred()
classadjust(econdprobs,wrongprob1,trueprob1)
boundaryplot(y01,x,regests,pairs=combn(ncol(x),2),pchvals=2+y01,cex=0.5,band=0.10)
pchvals: Plotting symbol values in base-R graphics.
trnxy: Data matrix, Y in the last column.
xval: If TRUE, use the leave-one-out method.
y01: Y vector (1s and 0s).
regests: Estimated regression function values.
x: X data frame or matrix.
pairs: Two-row matrix, column i of which is a pair of predictor variables to graph.
cex: Symbol size for plotting.
band: Only points whose estimated regression value is within band of 0.5 are plotted, to avoid clutter.
yname: Name of the Y column.
k: Number of nearest neighbors.
econdprobs: Estimated conditional class probabilities, given the predictors.
wrongprob1: Incorrect, data-provenanced, unconditional P(Y = 1).
trueprob1: Correct unconditional P(Y = 1).
These functions aid classification in the multiclass setting.
The function boundaryplot
serves as a visualization technique,
for the two-class setting. It draws the boundary between predicted Y =
1 and predicted Y = 0 data points in 2-dimensional feature space, as
determined by the argument regests
. Used to visually assess
goodness of fit, typically running this function twice, say once for glm and then for kNN
. If there is much discrepancy and the
analyst wishes to still use glm(), he/she may wish to add polynomial
terms.
The functions not listed above are largely deprecated, e.g. in favor of
qeLogit
and the other qe
-series functions.
Norm Matloff
## Not run:
data(oliveoils)
oo <- oliveoils[,-1]
# toy example
set.seed(9999)
x <- runif(25)
y <- sample(0:2,25,replace=TRUE)
xd <- preprocessx(x,2,xval=FALSE)
kout <- ovaknntrn(y,xd,m=3,k=2)
kout$regest  # row 2:  0.0,0.5,0.5
predict(kout,predpts=matrix(c(0.81,0.55,0.15),ncol=1))  # 0,2,0or2
yd <- factorToDummies(as.factor(y),'y',FALSE)
kNN(x,yd,c(0.81,0.55,0.15),2)  # predicts 0, 1or2, 2
data(peDumms)  # prog/engr data
ped <- peDumms[,-33]
ped <- as.matrix(ped)
x <- ped[,-(23:28)]
y <- ped[,23:28]
knnout <- kNN(x,y,x,25,leave1out=TRUE)
truey <- apply(y,1,which.max) - 1
mean(knnout$ypreds == truey)  # about 0.37
xd <- preprocessx(x,25,xval=TRUE)
kout <- knnest(y,xd,25)
preds <- predict(kout,predpts=x)
yhats <- apply(preds,1,which.max) - 1
mean(yhats == truey)  # about 0.37
data(peFactors)
# discard the lower educ-level cases, which are rare
edu <- peFactors$educ
numedu <- as.numeric(edu)
idxs <- numedu >= 12
pef <- peFactors[idxs,]
numedu <- numedu[idxs]
pef$educ <- as.factor(numedu)
pef1 <- pef[,c(1,3,5,7:9)]
# ovalog
ovaout <- ovalogtrn(pef1,"occ")
preds <- predict(ovaout,predpts=pef1[,-3])
mean(preds == factorTo012etc(pef1$occ))  # about 0.39
# avalog
avaout <- avalogtrn(pef1,"occ")
preds <- predict(avaout,predpts=pef1[,-3])
mean(preds == factorTo012etc(pef1$occ))  # about 0.39
# knn
knnout <- ovaknntrn(pef1,"occ",25)
preds <- predict(knnout,predpts=pef1[,-3])
mean(preds == factorTo012etc(pef1$occ))  # about 0.43
data(oliveoils)
oo <- oliveoils
oo <- oo[,-1]
knnout <- ovaknntrn(oo,'Region',10)
# predict a new case that is like oo1[1,] but with palmitic = 950
newx <- oo[1,2:9,drop=FALSE]
newx[,1] <- 950
predict(knnout,predpts=newx)  # predicts class 2, South
## End(Not run)
This data set is adapted from the Adult data from the UCI Machine Learning Repository, which was in turn adapted from Census data on adult incomes and other demographic variables. The UCI data is used here with permission from Ronny Kohavi.
The variables are:
gt50, which converts the original >50K variable to an indicator variable; 1 for income greater than $50,000, else 0
edu, which converts a set of education levels to approximate number of years of schooling
age
gender, 1 for male, 0 for female
mar, 1 for married, 0 for single
Note that the education variable is now numeric.
Extension of nls
to the heteroscedastic case.
nlshc(nlsout,type='HC')
nlsout: Object of type 'nls'.
type: Eicker-White algorithm to use. See the documentation for the sandwich package.
Calls nls
but then forms a different estimated covariance
matrix for the estimated regression coefficients, applying the
Eicker-White technique to handle heteroscedasticity. This then
gives valid statistical inference in that setting.
Some users may prefer to use nlsLM
of the package
minpack.lm instead of nls
. This is fine, as both
functions return objects of class 'nls'.
Estimated covariance matrix
Norm Matloff
Zeileis A (2006), Object-Oriented Computation of Sandwich Estimators. Journal of Statistical Software, 16(9), 1–16, https://www.jstatsoft.org/v16/i09/.
# simulate data from a setting in which mean Y is
# 1 / (b1 * X1 + b2 * X2)
n <- 250
b <- 1:2
x <- matrix(rexp(2*n),ncol=2)
meany <- 1 / (x %*% b)                  # reg ftn
y <- meany + (runif(n) - 0.5) * meany   # heterosced epsilon
xy <- cbind(x,y)
xy <- data.frame(xy)
# see nls() docs
nlout <- nls(X3 ~ 1 / (b1*X1+b2*X2), data=xy,start=list(b1 = 1,b2=1))
nlshc(nlout)
Italian olive oils data set, as used in Graphics of Large Datasets: Visualizing a Million, by Antony Unwin, Martin Theus and Heike Hofmann, Springer, 2006. Included here with permission of Dr. Martin Theus.
Provides minimum-norm solutions to linear models, identical to OLS in standard situations, but allowing exploration of overfitting in the overparameterized case. Also provides a wrapper for the polynomial case.
penroseLM(d,yName)
penrosePoly(d,yName,deg,maxInteractDeg=deg)
ridgePoly(d,yName,deg,maxInteractDeg=deg)
## S3 method for class 'penroseLM'
predict(object,...)
## S3 method for class 'penrosePoly'
predict(object,...)
...: Arguments for the predict methods.
d: Data frame, the training set.
yName: Name of the class labels column.
deg: Polynomial degree.
maxInteractDeg: Maximum degree of interaction terms.
object: A value returned by penroseLM or penrosePoly.
First, provides a convenient wrapper to the polyreg package for
polynomial regression. (See qePoly
here for an even higher-level
wrapper.) Note that this computes true polynomials, with
cross-product/interaction terms rather than just powers, and that dummy
variables are handled properly (to NOT compute powers).
Second, provides a tool for exploring the "double descent" phenomenon, in which prediction error may improve upon fitting past the interpolation point.
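A minimal sketch (not from the package's own examples); mtcars is used purely as a stand-in data set, and it is assumed here that the predict method takes the new X data as its second argument:

data(mtcars)
pfit <- penrosePoly(mtcars[,c('disp','hp','mpg')], 'mpg', deg=2)
preds <- predict(pfit, mtcars[,c('disp','hp')])
mean(abs(preds - mtcars$mpg))  # mean absolute error on the training data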
Norm Matloff
Phoneme detection, 2 types. Features are from harmonic analysis of the voice. From OpenML, https://www.openml.org/d/1489.
This data set is adapted from the 2000 Census (5% sample, person records). It is mainly restricted to programmers and engineers in the Silicon Valley area. (Apparently due to errors, there are some from other ZIP codes.)
There are several versions:
prgeng, the original data, with categorical variables, e.g. occupation, in their original codes
peDumms, same but with categorical variables converted to dummies; due to the large number of levels, the birth and PUMA data are not included
peFactors, same but with categorical variables converted to factors
pef, same as peFactors, but having only columns for age, education, occupation, gender, wage income and weeks worked; the education column has been collapsed to Master's degree, PhD and other
The variable codes, e.g. occupational codes, are available from https://usa.ipums.org/usa/volii/occ2000.shtml. (Short code lists are given in the record layout, but longer ones are in the appendix Code Lists.)
The variables are:
age, with a U(0,1) variate added for jitter
cit, citizenship; 1-4 code various categories of citizens; 5 means noncitizen (including permanent residents)
educ: 01-09 code no college; 10-12 means some college; 13 is a bachelor's degree, 14 a master's, 15 a professional degree and 16 is a doctorate
occ, occupation
birth, place of birth
wageinc, wage income
wkswrkd, number of weeks worked
yrentry, year of entry to the U.S. (0 for natives)
powpuma, location of work
gender, 1 for male, 2 for female
data(prgeng)
data(peDumms)
data(peFactors)
This data is suitable for NLP analysis. It consists of all the quizzes I've given in undergraduate courses, 143 quizzes in all.
It is available in two forms. First, quizzes
is a data.frame,
143 rows and 2 columns. Row i consists of a single character vector
comprising the entire quiz i, followed by the course name (as an R
factor). The second form is an R list, 143 elements. Each list element
is a character vector, one vector element per line of the quiz.
The original documents were LaTeX files. They have been run through the
detex
utility to remove most LaTeX commands, as well as removing
the LaTeX preambles separately.
The names of the list elements are the course names, as follows:
ECS 50: a course in machine organization
ECS 132: an undergraduate course in probabilistic modeling
ECS 145: a course in scripting languages (Python, R)
ECS 158: an undergraduate course in parallel computation
ECS 256: a graduate course in probabilistic modeling
Similar to lm.ridge in the MASS package included with R, but with a different kind of scaling and a little nicer plotting.
ridgelm(xy,lambda = seq(0.01,1,0.01),mapback=TRUE)
## S3 method for class 'rlm'
plot(x,y,...)
xy: Data, response variable in the last column.
lambda: Vector of desired values for the ridge parameter.
mapback: If TRUE, the scaling that had been applied to the original data will be mapped back to the original scale, so that the estimated regression coefficients are now on the scale of the original data.
x: Object of type 'rlm', output of ridgelm.
y: Needed for consistency with the generic. Not used.
...: Needed for consistency with the generic. Not used.
Centers and scales the predictors X, and centers the response variable Y. Computes X'X and then solves [(X'X)/n + lambda I]b = X'Y/n for b. The 1/n factors are important, making the diagonal elements of (X'X)/n all 1s and thus facilitating choices for the lambdas in a manner independent of the data.
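The computation just described can be illustrated in a few lines of base R (again, an illustration, not the package's internal code):

ridgeCoefs <- function(x,y,lambda)
{
   x <- scale(x)            # center and scale the predictors
   y <- y - mean(y)         # center the response
   n <- nrow(x)
   xtx <- crossprod(x) / n  # (X'X)/n; diagonal elements are essentially 1
   solve(xtx + lambda * diag(ncol(x)), crossprod(x,y) / n)
}
# ridgeCoefs(mtcars[,c('disp','hp')], mtcars$mpg, lambda=0.1)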
Calling plot on the output of ridgelm dispatches to plot.rlm, thus displaying the ridge traces.
The function ridgelm
returns an object of class 'rlm', with
components bhats
, the estimated beta vectors, one column per
lambda value, and lambda
, a copy of the input.
Norm Matloff
See http://people.cs.uchicago.edu/~dinoj/manifold/swissroll.html for this version of Swiss Roll.
Running data(SwissRoll)
produces an object sw
.
"R-style," classification-oriented wrappers for the text2vec package.
textToXY(docs,labels,kTop=50,stopWords='a')
textToXYpred(ttXYout,predDocs)
docs: Character vector, one element per document.
predDocs: Character vector, one element per document.
labels: Class labels, as numeric, character or factor. NULL is used at the prediction stage.
kTop: The number of most-frequent words to retain; 0 means retain all.
stopWords: Character vector of common words, e.g. prepositions, to delete.
ttXYout: Output object from textToXY.
A typical classification/machine learning package will have as arguments a feature matrix X and a labels vector/factor Y. For a "bag of words" analysis in the text case, each row of X would be a document and each column a word.
The functions here are basically wrappers for generating X. Wrappers are convenient in that:
The text2vec package is rather arcane, so an "R-style" wrapper would be useful.
The text2vec functions are not directly set up to do classification, so the functions here provide the "glue" to do that.
The typical usage pattern is thus (see the sketch below):
Run the documents vector and labels vector/factor through textToXY, generating X and Y.
Apply your favorite classification/machine learning package p to X and Y, returning o.
When predicting a new document d, run d and the output of textToXY through textToXYpred, producing x.
Run x through p's predict function.
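A sketch of that pattern, with tiny hypothetical documents and glm standing in for "your favorite" classifier; the exact shape of the returned X matrix is assumed here, so treat this as illustrative only:

docs <- c('the cat sat on the mat','dogs chase cats',
          'stocks fell sharply','the market rallied today')
labels <- c(0,0,1,1)  # 0 = pets, 1 = finance
xy <- textToXY(docs,labels,kTop=20)
trn <- data.frame(y=xy$y, xy$x)
fit <- glm(y ~ ., data=trn, family=binomial)   # the classifier "p"
newx <- textToXYpred(xy, 'cats and dogs')
predict(fit, data.frame(newx), type='response')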
The function textToXY returns an R list with components x and y for X and Y, and a copy of the input stopWords.
The function textToXYpred returns the X matrix for the documents to be predicted.
Norm Matloff
Input a time series and transform it to a form suitable for prediction
using lm
etc.
TStoX(x,lg)
TStoXmv(xmat,lg,y)
x: A vector.
lg: Lag, a positive integer.
xmat: A matrix, data frame etc., a multivariate time series. Each column is a time series, over a common time period.
y: A time series, again on that common time period.
Similar to stats::embed
, but in lagged form, with applications
such as lm
in mind.
TStoX
is for transforming vectors, while TStoXmv
handles the multivariate time series case. Intended for use with
lm
or other regression/machine learning model, predicting
y[i]
from observations i-lg, i-lg+1,...,i-1
.
As noted, the idea is to set up something like lm(Y ~ X)
.
Let m
denote length of x
, and in the matrix input
case, the number of rows in xmat
. Let p
be 1 in the
vector case, ncol(xmat)
in the matrix case. The return value
is a matrix with m-lg
rows. There will be p*lg+1
columns, with "Y," the numbers to be predicted in the last column.
In the output in the multivariate case, let k denote
ncol(xmat)
. Then the first k columns of the output will be
the k series at lag lg
, the second k columns will be the k
series at lag lg-1
, ..., and the lg
-th set of k
columns will be the k series at lag 1.
Norm Matloff
x1 <- c(5,12,13,8,88,6)
x2 <- c(5,4,3,18,168,0)
y <- 1:6
xmat <- cbind(x1,x2)
TStoX(x1,2)
#      [,1] [,2] [,3]
# [1,]    5   12   13
# [2,]   12   13    8
# [3,]   13    8   88
# [4,]    8   88    6
xy <- TStoXmv(xmat,2,y)
xy
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    5    5   12    4    3
# [2,]   12    4   13    3    4
# [3,]   13    3    8   18    5
# [4,]    8   18   88  168    6
lm(xy[,5] ~ xy[,-5])
# Coefficients:
# (Intercept)    xy[, -5]1    xy[, -5]2    xy[, -5]3    xy[, -5]4
#       -65.6          3.2         18.2         -3.2           NA
# need n > 7 here for useful lm() call, but this illustrates the idea
Utilities.
unscale(scaledx,ctrs=NULL,sds=NULL)
mmscale(m,scalePars=NULL,p=NULL)
catDFRow(dfRow)
constCols(d)
allNumeric(lst)
scaledx: A matrix.
m: A matrix.
ctrs: Take the original means to be these values.
lst: An R list.
sds: Take the original standard deviations to be these values.
dfRow: A row in a data frame.
d: A data frame or matrix.
scalePars: If not NULL, a 2-row matrix, with column i giving the minimum and maximum values to use in scaling column i of m.
p: If m is a vector, the number of columns it should have when treated as a matrix.
The function mmscale
is meant as a better-behaved alternative to
scale
. Using minimum and maximum values, it maps variables to
[0,1], thus avoiding the problems arising from very small standard
deviations in scale
.
The function catDFRow
nicely prints a row of a data frame.
The function constCols
determines which columns of a data frame
or matrix are constant, if any.
The function unscale
returns the original object to which
scale
had been applied. Or, the attributes ctrs
and
sds
can be specified by the user.
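For instance, unscale applied to the output of scale should essentially recover the original matrix:

m <- matrix(rnorm(12, mean=50, sd=10), ncol=3)
ms <- scale(m)              # centering/scaling info retained as attributes
max(abs(unscale(ms) - m))   # essentially 0
# mmscale(m), by contrast, maps each column to [0,1] via its min and max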
Norm Matloff
Various measurements on weather variables collected by NASA. Downloaded
via nasapower
; see that package for documentation.
Graphics utilities.
xyzPlot(xyz,clrs=NULL,cexText=1.0,xlim=NULL,ylim=NULL, xlab=NULL,ylab=NULL,legendPos=NULL,plotType='l')
xyz: A matrix or data frame of at least 3 columns, the first 3 serving as 'x', 'y' and 'z' coordinates of the points to be plotted. Grouping, if any, is specified in column 4, in which case the colors in clrs are used to distinguish the groups.
clrs: Colors to be used in the grouped case.
cexText: Text size, proportional to the standard size.
xlim: As in plot().
ylim: As in plot().
xlab: As in plot().
ylab: As in plot().
legendPos: As in legend().
plotType: Coded 'l' for lines, 'p' for points.
A way to display 3-dimensional data in 2 dimensions. For each plotted point (x,y), a z value is written in text over the point. A grouping variable is also allowed, with different colors used to plot different groups.
A group (including the entire data in the case of one group) can be displayed either as a polygonal line, or just as a point cloud. The user should experiment with different argument settings to get the most visually impactful plot.
Norm Matloff
## Not run:
xyzPlot(mtcars[,c(3,6,1)],plotType='l',cexText=0.75)
xyzPlot(mtcars[,c(3,6,1)],plotType='p',cexText=0.75)
xyzPlot(mtcars[,c(3,6,1)],plotType='l',cexText=0.75)
xyzPlot(mtcars[,c(3,6,1,2)],clrs=c('red','darkgreen','blue'),plotType='l',cexText=0.75)
## End(Not run)