Removes extreme outliers using a winsorization transformation

Winsorization is the transformation of a distribution by limiting extreme values to reduce the effect of spurious outliers. This is done by shrinking outlying observations to the border of the main part of the distribution.

winsorize(
  x,
  min.value = NULL,
  max.value = NULL,
  p = c(0.05, 0.95),
  na.rm = FALSE
)

Arguments

x

A numeric vector

min.value

A fixed lower bounds, all values lower than this will be replaced by this value. The default is set to the 5th-quantile of x.

max.value

A fixed upper bounds, all values higher than this will be replaced by this value. The default is set to the 95th-quantile of x.

p

A numeric vector of 2 representing the probabilities used in the quantile function.

na.rm

(FALSE/TRUE) should NAs be omitted?

Value

A transformed vector the same length as x, unless na.rm is TRUE, then x is length minus number of NA's

References

Dixon, W.J. (1960) Simplified Estimation from Censored Normal Samples. Annals of Mathematical Statistics. 31(2):385-391

Author

Jeffrey S. Evans <jeffrey_evans@tnc.org>

Examples

set.seed(1234) x <- rnorm(100) x[1] <- x[1] * 10 winsorize(x)
#> [1] -1.457249338 0.277429242 1.084441177 -1.457249338 0.429124689 #> [6] 0.506055892 -0.574739960 -0.546631856 -0.564451999 -0.890037829 #> [11] -0.477192700 -0.998386445 -0.776253895 0.064458817 0.959494059 #> [16] -0.110285494 -0.511009506 -0.911195417 -0.837171680 1.709520029 #> [21] 0.134088220 -0.490685897 -0.440547872 0.459589441 -0.693720247 #> [26] -1.448204910 0.574755721 -1.023655723 -0.015138300 -0.935948601 #> [31] 1.102297546 -0.475593079 -0.709440038 -0.501258061 -1.457249338 #> [36] -1.167619262 -1.457249338 -1.340993192 -0.294293859 -0.465897540 #> [41] 1.449496265 -1.068642724 -0.855364634 -0.280623002 -0.994340076 #> [46] -0.968514318 -1.107318193 -1.251985886 -0.523828119 -0.496849957 #> [51] -1.457249338 -0.582075925 -1.108889624 -1.014962009 -0.162309524 #> [56] 0.563055819 1.647817473 -0.773353424 1.605909629 -1.157808548 #> [61] 0.656588464 1.709520029 -0.034760390 -0.669633580 -0.007604756 #> [66] 1.709520029 -1.138607737 1.367827179 1.329564791 0.336472797 #> [71] 0.006892838 -0.455468738 -0.366523933 0.648286568 1.709520029 #> [76] -0.153398412 -1.390700947 -0.723581777 0.258261762 -0.317059115 #> [81] -0.177789958 -0.169994077 -1.372301886 -0.173787170 0.850232257 #> [86] 0.697608712 0.549997351 -0.402731975 -0.191593770 -1.194527880 #> [91] -0.053158819 0.255196001 1.705964007 1.001513252 -0.495583443 #> [96] 0.355550297 -1.134608044 0.878203627 0.972916753 1.709520029
plot(x, type="l", main="Winsorization transformation")
lines(winsorize(x), col="red", lwd=2)
legend("bottomright", legend=c("Original distribution","With outliers removed"), lty=c(1,1), col=c("black","red"))
# Behavior with NA value(s) x[4] <- NA winsorize(x) # returns x with original NA's
#> [1] -1.396451343 0.277429242 1.084441177 NA 0.506055892 #> [6] -0.574739960 -0.546631856 -0.564451999 -0.890037829 -0.477192700 #> [11] -0.998386445 -0.776253895 0.064458817 0.959494059 -0.110285494 #> [16] -0.511009506 -0.911195417 -0.837171680 1.713076051 0.134088220 #> [21] -0.490685897 -0.440547872 0.459589441 -0.693720247 -1.396451343 #> [26] 0.574755721 -1.023655723 -0.015138300 -0.935948601 1.102297546 #> [31] -0.475593079 -0.709440038 -0.501258061 -1.396451343 -1.167619262 #> [36] -1.396451343 -1.340993192 -0.294293859 -0.465897540 1.449496265 #> [41] -1.068642724 -0.855364634 -0.280623002 -0.994340076 -0.968514318 #> [46] -1.107318193 -1.251985886 -0.523828119 -0.496849957 -1.396451343 #> [51] -0.582075925 -1.108889624 -1.014962009 -0.162309524 0.563055819 #> [56] 1.647817473 -0.773353424 1.605909629 -1.157808548 0.656588464 #> [61] 1.713076051 -0.034760390 -0.669633580 -0.007604756 1.713076051 #> [66] -1.138607737 1.367827179 1.329564791 0.336472797 0.006892838 #> [71] -0.455468738 -0.366523933 0.648286568 1.713076051 -0.153398412 #> [76] -1.390700947 -0.723581777 0.258261762 -0.317059115 -0.177789958 #> [81] -0.169994077 -1.372301886 -0.173787170 0.850232257 0.697608712 #> [86] 0.549997351 -0.402731975 -0.191593770 -1.194527880 -0.053158819 #> [91] 0.255196001 1.705964007 1.001513252 -0.495583443 0.355550297 #> [96] -1.134608044 0.878203627 0.972916753 1.713076051
winsorize(x, na.rm=TRUE) # removes NA's
#> [1] -1.396451343 0.277429242 1.084441177 0.429124689 0.506055892 #> [6] -0.574739960 -0.546631856 -0.564451999 -0.890037829 -0.477192700 #> [11] -0.998386445 -0.776253895 0.064458817 0.959494059 -0.110285494 #> [16] -0.511009506 -0.911195417 -0.837171680 1.713076051 0.134088220 #> [21] -0.490685897 -0.440547872 0.459589441 -0.693720247 -1.396451343 #> [26] 0.574755721 -1.023655723 -0.015138300 -0.935948601 1.102297546 #> [31] -0.475593079 -0.709440038 -0.501258061 -1.396451343 -1.167619262 #> [36] -1.396451343 -1.340993192 -0.294293859 -0.465897540 1.449496265 #> [41] -1.068642724 -0.855364634 -0.280623002 -0.994340076 -0.968514318 #> [46] -1.107318193 -1.251985886 -0.523828119 -0.496849957 -1.396451343 #> [51] -0.582075925 -1.108889624 -1.014962009 -0.162309524 0.563055819 #> [56] 1.647817473 -0.773353424 1.605909629 -1.157808548 0.656588464 #> [61] 1.713076051 -0.034760390 -0.669633580 -0.007604756 1.713076051 #> [66] -1.138607737 1.367827179 1.329564791 0.336472797 0.006892838 #> [71] -0.455468738 -0.366523933 0.648286568 1.713076051 -0.153398412 #> [76] -1.390700947 -0.723581777 0.258261762 -0.317059115 -0.177789958 #> [81] -0.169994077 -1.372301886 -0.173787170 0.850232257 0.697608712 #> [86] 0.549997351 -0.402731975 -0.191593770 -1.194527880 -0.053158819 #> [91] 0.255196001 1.705964007 1.001513252 -0.495583443 0.355550297 #> [96] -1.134608044 0.878203627 0.972916753 1.713076051