Title: | Learn from Training Data then Quickly Fill in Missing Data |
---|---|
Description: | TrainFastImputation() uses training data to describe a multivariate normal distribution that the data approximate, or can be transformed to approximate, and stores this information as an object of class 'FastImputationPatterns'. The FastImputation() function uses this 'FastImputationPatterns' object to impute (make a good guess at) missing data in a single line or a whole data frame of data. This approximates the process used by 'Amelia' <https://gking.harvard.edu/amelia> but is much faster when filling in values for a single line of data. |
Authors: | Stephen R. Haptonstahl |
Maintainer: | Stephen R. Haptonstahl <[email protected]> |
License: | GPL (>= 2) |
Version: | 2.2.1 |
Built: | 2024-11-21 04:36:14 UTC |
Source: | https://github.com/cran/FastImputation |
This takes variables on the real line and constrains them to be on a half-line (constrained above or below) or a segment (constrained both above and below). It is approximately the inverse of NormalizeBoundedVariable; it does not completely reverse the effect of NormalizeBoundedVariable because NormalizeBoundedVariable first forces values away from the bounds, and that information is lost.
BoundNormalizedVariable(x, constraints)
x | A vector, matrix, array, or data frame with values to be coerced into a range or set. |
constraints | A list of constraints. See the examples below for formatting details. |
An object of the same class as x with the values transformed into the desired half-line or segment.
Stephen R. Haptonstahl [email protected]
constraints=list(lower=5)            # lower bound when constraining to an interval
constraints=list(upper=10)           # upper bound when constraining to an interval
constraints=list(lower=5, upper=10)  # both lower and upper bounds
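A minimal sketch of a call, with values chosen purely for illustration: normalized values on the real line are mapped back into the stated half-line or segment.

z <- c(-2, 0, 2)   # unconstrained (normalized) values
BoundNormalizedVariable(z, constraints=list(lower=5))            # half-line above 5
BoundNormalizedVariable(z, constraints=list(upper=10))           # half-line below 10
BoundNormalizedVariable(z, constraints=list(lower=5, upper=10))  # segment from 5 to 10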
Ignoring missing values can lead to biased estimates of the covariance. Lounici (2012) gives an unbiased estimator when the data has missing values.
CovarianceWithMissing(x)
x | Matrix or data frame; data with each row an observation and each column a variable. |
A matrix: an unbiased estimate of the covariance.
Stephen R. Haptonstahl [email protected]
Lounici, Karim (2012). High-dimensional covariance matrix estimation with missing observations.
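A minimal sketch of how the estimator might be used; the simulated data and the naive comparison are illustrative and not part of the package.

set.seed(42)
n <- 500
x <- cbind(rnorm(n), rnorm(n))
x[,2] <- 0.6*x[,1] + sqrt(1 - 0.6^2)*x[,2]   # induce correlation of about 0.6
x[sample(length(x), 0.1*length(x))] <- NA    # 10% of values missing at random
cov(x, use="pairwise.complete.obs")          # naive pairwise-complete estimate, for comparison
CovarianceWithMissing(x)                     # Lounici (2012) estimate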
Like Amelia, FastImputation assumes that the columns of the data are multivariate normal or can be transformed into approximately multivariate normal.
FastImputation(x, patterns, verbose = TRUE)
x | Data frame, possibly with some missing (NA) values. |
patterns | An object of class 'FastImputationPatterns' generated by TrainFastImputation. |
verbose | If TRUE, progress in imputing the data is shown. |
x, but with missing values filled in (imputed)
Stephen R. Haptonstahl [email protected]
https://gking.harvard.edu/amelia
data(FI_train)   # provides FI_train dataset
patterns <- TrainFastImputation(
  FI_train,
  constraints=list(list("bounded_below_2", list(lower=0)),
                   list("bounded_above_5", list(upper=0)),
                   list("bounded_above_and_below_6", list(lower=0, upper=1))),
  idvars="user_id_1",
  categorical="categorical_9")

data(FI_test)
FI_test          # note there is missing data
imputed_data <- FastImputation(FI_test, patterns)
imputed_data     # good guesses for missing values are filled in

data(FI_true)
continuous_cells_imputed <- is.na(FI_test[,2:8])
continuous_imputed_values <- imputed_data[,2:8][continuous_cells_imputed]
continuous_true_values <- FI_true[,2:8][continuous_cells_imputed]
rmse <- sqrt(median((continuous_imputed_values - continuous_true_values)^2))
rmse
median_relative_error <- median(
  abs((continuous_imputed_values - continuous_true_values) / continuous_true_values)
)
median_relative_error

# Compare to imputing each missing value with its column mean
imputed_data_column_means <- FI_test[,2:8]
for(j in 1:ncol(imputed_data_column_means)) {
  imputed_data_column_means[is.na(imputed_data_column_means[,j]), j] <-
    mean(imputed_data_column_means[,j], na.rm=TRUE)
}
cont_imputed_vals_col_means <- imputed_data_column_means[continuous_cells_imputed]
rmse_column_means <- sqrt(median((cont_imputed_vals_col_means - continuous_true_values)^2))
rmse_column_means   # much larger error than using FastImputation
median_relative_error_col_means <- median(
  abs((cont_imputed_vals_col_means - continuous_true_values) / continuous_true_values)
)
median_relative_error_col_means   # larger error than using FastImputation

# Let's look at the accuracy of the imputation of the categorical variable
library("caret")
categorical_rows_imputed <- which(is.na(FI_test$categorical_9))
confusionMatrix(data=imputed_data$categorical_9[categorical_rows_imputed],
                reference=FI_true$categorical_9[categorical_rows_imputed])

# Compare to imputing with the modal value
stat_mode <- function(x) {
  unique_values <- unique(x)
  unique_values <- unique_values[!is.na(unique_values)]
  unique_values[which.max(tabulate(match(x, unique_values)))]
}
categorical_rows_imputed_col_mode <- rep(stat_mode(FI_test$categorical_9),
                                         length(categorical_rows_imputed))
confusionMatrix(data=categorical_rows_imputed_col_mode,
                reference=FI_true$categorical_9[categorical_rows_imputed])
# less accurate than using FastImputation
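A brief follow-on sketch, assuming the patterns object trained above: FastImputation can also fill in a single incomplete row, which is the use case where it is much faster than re-running a full imputation model.

single_row <- FI_test[1, , drop=FALSE]               # one row; any NA values in it will be imputed
FastImputation(single_row, patterns, verbose=FALSE)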
Smaller simulated dataset drawn from the same distribution as FI_train and FI_true. It is identical to FI_true except that 5% of its values are missing. Used with FastImputation.
data(FI_test)
A data frame with 9 variables and 250 observations.
user_id_1
Sequential user ids
bounded_below_2
Multivariate normal, transformed using exp(x)
unbounded_3
Multivariate normal
unbounded_4
Multivariate normal
bounded_above_5
Multivariate normal, transformed using -exp(x)
bounded_above_and_below_6
Multivariate normal, transformed using pnorm(x)
unbounded_7
Multivariate normal
unbounded_8
Multivariate normal
categorical_9
"A" if the first of three multivariate normal draws is greatest; "B" if the second is greatest; "C" if the third is greatest
Stephen R. Haptonstahl [email protected]
All columns start as multivariate normal draws. Columns 2, 5, and 6 are transformed. Column 9 is the result of three multivariate normal columns being interpreted as one-hot encoding of a three-valued categorical variable.
Larger simulated dataset drawn from the same distribution as FI_test and FI_true and used to train the imputation algorithm. 5% of the values are missing. Used with TrainFastImputation.
data(FI_train)
A data frame with 9 variables and 10000 observations.
user_id_1
Sequential user ids
bounded_below_2
Multivariate normal, transformed using exp(x)
unbounded_3
Multivariate normal
unbounded_4
Multivariate normal
bounded_above_5
Multivariate normal, transformed using -exp(x)
bounded_above_and_below_6
Multivariate normal, transformed using pnorm(x)
unbounded_7
Multivariate normal
unbounded_8
Multivariate normal
categorical_9
"A" if the first of three multivariate normal draws is greatest; "B" if the second is greatest; "C" if the third is greatest
Stephen R. Haptonstahl [email protected]
All columns start as multivariate normal draws. Columns 2, 5, and 6 are transformed. Column 9 is the result of three multivariate normal columns being interpreted as one-hot encoding of a three-valued categorical variable.
Smaller simulated dataset drawn from the same distribution as FI_train and FI_test. It is identical to FI_test except that FI_test has 5% of its values missing. Used to evaluate the quality of the values imputed in FI_test.
data(FI_true)
A data frame with 9 variables and 250 observations.
user_id_1
Sequential user ids
bounded_below_2
Multivariate normal, transformed using exp(x)
unbounded_3
Multivariate normal
unbounded_4
Multivariate normal
bounded_above_5
Multivariate normal, transformed using -exp(x)
bounded_above_and_below_6
Multivariate normal, transformed using pnorm(x)
unbounded_7
Multivariate normal
unbounded_8
Multivariate normal
categorical_9
"A" if the first of three multivariate normal draws is greatest; "B" if the second is greatest; "C" if the third is greatest
Stephen R. Haptonstahl [email protected]
All columns start as multivariate normal draws. Columns 2, 5, and 6 are transformed. Column 9 is the result of three multivariate normal columns being interpreted as one-hot encoding of a three-valued categorical variable.
This transforms bounded variables so that they are not bounded. First, variables are coerced away from the boundaries by a distance of tol. The natural log is used for variables bounded either above or below but not both. The inverse of the standard normal cumulative distribution function (the quantile function) is used for variables bounded both above and below.
NormalizeBoundedVariable(x, constraints, tol = stats::pnorm(-5), trim = TRUE)
x | A vector, matrix, array, or data frame with values to be coerced into a range or set. |
constraints | A list of constraints. See the examples below for formatting details. |
tol | Variables are forced to be at least this far away from the boundaries. |
trim | If TRUE, values in x below lower and above upper are set to lower and upper, respectively, before normalizing. |
An object of the same class as x with the values transformed so that they spread out over any part of the real line.

A variable x bounded below by lower is transformed to log(x - lower). A variable x bounded above by upper is transformed to log(upper - x). A variable x bounded below by lower and above by upper is transformed to qnorm((x - lower)/(upper - lower)).
Stephen R. Haptonstahl [email protected]
constraints=list(lower=5)            # lower bound when constraining to an interval
constraints=list(upper=10)           # upper bound when constraining to an interval
constraints=list(lower=5, upper=10)  # both lower and upper bounds
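For intuition, the transformations described above can be reproduced directly in base R; this sketch ignores the tol adjustment that the function applies near the boundaries.

x <- c(5.2, 7.5, 9.9)
log(x - 5)                 # bounded below by 5
log(10 - x)                # bounded above by 10
qnorm((x - 5)/(10 - 5))    # bounded below by 5 and above by 10
NormalizeBoundedVariable(x, constraints=list(lower=5, upper=10))  # the equivalent package call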
Like Amelia, FastImputation assumes that the columns of the data are multivariate normal or can be transformed into approximately multivariate normal.
TrainFastImputation(x, constraints = list(), idvars, categorical)
x | Data frame containing training data. Can have incomplete rows. |
constraints | A list of constraints. See the examples below for formatting details. |
idvars | A vector of column numbers or column names to be ignored in the imputation process. |
categorical | A vector of column numbers or column names of variables with a (small) set of possible values. |
An object of class 'FastImputationPatterns' that contains information needed later to impute on a single row.
Stephen R. Haptonstahl [email protected]
https://gking.harvard.edu/amelia
data(FI_train)   # provides FI_train dataset
patterns_with_constraints <- TrainFastImputation(
  FI_train,
  constraints=list(list("bounded_below_2", list(lower=0)),
                   list("bounded_above_5", list(upper=0)),
                   list("bounded_above_and_below_6", list(lower=0, upper=1))),
  idvars="user_id_1",
  categorical="categorical_9")
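For reference, a sketch of the constraints structure used above: each element pairs a column name with a constraint list in the same format accepted by NormalizeBoundedVariable. The object name is illustrative only.

constraints_example <- list(
  list("bounded_below_2",           list(lower=0)),           # column name, then its constraint
  list("bounded_above_5",           list(upper=0)),
  list("bounded_above_and_below_6", list(lower=0, upper=1))
)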
Convert columns of a data frame from factors to character or numeric.
UnfactorColumns(x)
x | A data frame. |
A data frame containing the same data, but with any factor columns replaced by numeric or character columns.
Stephen R. Haptonstahl [email protected]
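A minimal sketch with invented data; exactly which factor columns come back as numeric rather than character depends on their contents.

df <- data.frame(group=factor(c("a", "b", "a")),
                 score=factor(c("1.5", "2.0", "2.5")))
str(df)                   # both columns are factors
str(UnfactorColumns(df))  # group should become character; score should become numeric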