Subsampling • dbsubsampling

You can draw a subsample using various design-based methods; here are some examples.

library(dbsubsampling)

Uniform Sampling

Draw a random subsample in which every observation has equal probability.

N <- 1000
n <- 10
Unif(N = N, n = n)
#>  [1] 173 407 460  63 485 870 879 559 927 452

You can set a random seed. The seed applies only to this sampling call and does not affect the random number state of the calling environment. The replace parameter defaults to TRUE, meaning sampling is with replacement; set it to FALSE to sample without replacement.

Unif(N = 1000, n = 10, seed = 123, replace = TRUE)
#>  [1] 415 463 179 526 195 938 818 118 299 229
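
A seed that stays local to one call can be sketched in base R as follows. This is only an illustration of the idea, not the package's implementation: the helper name unif_local_seed is hypothetical, and it assumes the usual approach of saving and restoring .Random.seed around the draw.

```r
# Hypothetical helper, not part of dbsubsampling: draw n indices from 1..N
# with an optional seed that leaves the caller's RNG state untouched.
unif_local_seed <- function(N, n, seed = NULL, replace = TRUE) {
  if (!is.null(seed)) {
    # Remember the caller's RNG state (it may not exist yet in a fresh session).
    had_seed <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
    if (had_seed) old_seed <- get(".Random.seed", envir = globalenv())
    # Restore (or remove) the global state when the function exits.
    on.exit({
      if (had_seed) {
        assign(".Random.seed", old_seed, envir = globalenv())
      } else {
        rm(".Random.seed", envir = globalenv())
      }
    }, add = TRUE)
    set.seed(seed)
  }
  sample.int(N, size = n, replace = replace)
}

set.seed(1)
before <- .Random.seed
idx <- unif_local_seed(N = 1000, n = 10, seed = 123)
identical(before, .Random.seed)  # TRUE: the global RNG state is unchanged
```

Restoring the state in on.exit() means the caller's random stream continues as if the seeded draw had never happened, even if the function stops with an error.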

OSMAC

A subsampling method for logistic regression based on the A- and L-optimality criteria, proposed by Wang et al. (2018)¹.

A-optimal

A-optimality minimises the trace of the covariance matrix of the parameter estimates.

data_binary <- data_binary_class
y <- data_binary[["y"]]
x <- data_binary[-which(names(data_binary) == "y")]

OSMAC(X = x, Y = y, r1 = 100, r2 = 10, method = "mmse", seed_1 = 123, seed_2 = 456)
#>  [1] 5684 1620 5372 8297 8863 9783 6483 6103 2702 5735

Alternatively, use the unified interface (recommended):

subsampling(y_name = "y", data = data_binary, n = 10, pilot_n = 100, method = "OSMAC_A", 
            seed_1 = 123, seed_2 = 456)
#>  [1] 5684 1620 5372 8297 8863 9783 6483 6103 2702 5735
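
For intuition, the two-step A-optimal scheme can be sketched in base R. This is an illustration under stated assumptions, not the package's code: the data are simulated, the pilot sample is drawn uniformly, and the second-stage probabilities follow the mMSE form from Wang et al. (2018), where each probability is proportional to |y_i - p_i| * ||M_X^{-1} x_i||.

```r
# Sketch under assumptions: simulated data, base R only; not the package's code.
set.seed(1)
N <- 5000
X <- cbind(1, matrix(rnorm(N * 2), N, 2))      # design matrix with intercept
y <- rbinom(N, 1, plogis(X %*% c(-0.5, 1, -1)))

# Step 1: pilot estimate from a uniform subsample of size r1.
r1 <- 200
pilot <- sample.int(N, r1, replace = TRUE)
beta0 <- glm.fit(X[pilot, ], y[pilot], family = binomial())$coefficients

# Step 2: A-optimal (mMSE) probabilities, proportional to
# |y_i - p_i| * ||M_X^{-1} x_i||, with M_X = mean of p_i (1 - p_i) x_i x_i'.
p <- as.vector(plogis(X %*% beta0))
M_X <- crossprod(X * sqrt(p * (1 - p))) / N
score <- abs(y - p) * sqrt(rowSums((X %*% solve(M_X))^2))
pi_mmse <- score / sum(score)

# Draw the second-stage subsample of size r2 with these probabilities.
r2 <- 10
idx <- sample.int(N, r2, replace = TRUE, prob = pi_mmse)
```

Points whose response is poorly predicted by the pilot fit receive larger probabilities, which is what makes the second-stage subsample more informative than a uniform one.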

L-optimal

L-optimality minimises the trace of the covariance matrix of a linear transformation of the parameter estimates.

OSMAC(X = x, Y = y, r1 = 100, r2 = 10, method = "mvc", seed_1 = 123, seed_2 = 456)
#>  [1] 5813 1681 5372 8313 8863 9780 1630 6103 2702 5888

Alternatively, use the unified interface (recommended):

subsampling(y_name = "y", data = data_binary, n = 10, pilot_n = 100, method = "OSMAC_L", 
            seed_1 = 123, seed_2 = 456)
#>  [1] 5813 1681 5372 8313 8863 9780 1630 6103 2702 5888
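
The L-optimal probabilities are cheaper to compute because they need no matrix inverse. A minimal sketch, assuming the mVc form from Wang et al. (2018) in which each probability is proportional to |y_i - p_i| * ||x_i||; the helper name mvc_probs and the toy data are hypothetical and not part of the package:

```r
# Hypothetical helper (not the package's API): L-optimal (mVc) probabilities
# for logistic regression given a coefficient estimate beta.
mvc_probs <- function(X, y, beta) {
  p <- as.vector(plogis(X %*% beta))        # fitted success probabilities
  score <- abs(y - p) * sqrt(rowSums(X^2))  # |y_i - p_i| * ||x_i||
  score / sum(score)                        # normalise to sum to one
}

# Tiny illustration with made-up numbers.
X_toy <- cbind(1, c(-2, -1, 0, 1, 2))
y_toy <- c(0, 0, 1, 0, 1)
probs <- mvc_probs(X_toy, y_toy, beta = c(0, 1))
```

In practice beta would be a pilot estimate from a first-stage subsample, as in the r1 / pilot_n arguments used above.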

Tip: uniform sampling is also available through the unified interface:

subsampling(y_name = "y", data = data_binary, n = 10, method = "Unif", seed_1 = 123)
#>  [1] 4761 6746 9819 2757 5107 9145 9209 2888 6170 2567

IBOSS

A subsampling method for linear regression based on the D-optimality criterion, proposed by Wang et al. (2019)².

data_numeric <- data_numeric_regression
X <- data_numeric[-which(names(data_numeric) == "y")]
IBOSS(n = 100, X = X)
#>  [1]  183  226  395  419  584  666  711  758 1027 1144 1324 1445 1940 1946 1978
#> [16] 2018 2673 2982 3190 3395 3484 3612 3632 3638 3696 3816 3835 3896 3921 4256
#> [31] 4312 4405 4523 4551 4729 4938 5121 5226 5342 5410 5679 5770 5995 6089 6163
#> [46] 6170 6203 6250 6525 6964 6979 7053 7198 7407 7564 7633 7915 7935 7967 7992
#> [61] 8026 8088 8106 8156 8161 8267 8306 8501 8503 8521 8534 8694 8805 8841 9117
#> [76] 9211 9302 9364 9398 9456 9676 9946 9971 9989 1173 2344 5394 8438 8567 9239
#> [91] 1787 2104 2215 3121 7159 9133

Alternatively, use the unified interface (recommended):

subsampling(y_name = "y", data = data_numeric, n = 100, method = "IBOSS")
#>  [1]  183  226  395  419  584  666  711  758 1027 1144 1324 1445 1940 1946 1978
#> [16] 2018 2673 2982 3190 3395 3484 3612 3632 3638 3696 3816 3835 3896 3921 4256
#> [31] 4312 4405 4523 4551 4729 4938 5121 5226 5342 5410 5679 5770 5995 6089 6163
#> [46] 6170 6203 6250 6525 6964 6979 7053 7198 7407 7564 7633 7915 7935 7967 7992
#> [61] 8026 8088 8106 8156 8161 8267 8306 8501 8503 8521 8534 8694 8805 8841 9117
#> [76] 9211 9302 9364 9398 9456 9676 9946 9971 9989 1173 2344 5394 8438 8567 9239
#> [91] 1787 2104 2215 3121 7159 9133
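
The idea behind IBOSS can be sketched in a few lines of base R: for each covariate in turn, keep the r = floor(n / (2p)) rows with the smallest values and the r rows with the largest values among those not yet selected. The function iboss_sketch below is a hypothetical illustration on simulated data, not the package's implementation, and it assumes the data set is much larger than n.

```r
# Hypothetical sketch of the IBOSS selection rule; not the package's code.
iboss_sketch <- function(X, n) {
  X <- as.matrix(X)
  p <- ncol(X)
  r <- floor(n / (2 * p))          # rows taken from each tail of each covariate
  selected <- integer(0)
  for (j in seq_len(p)) {
    # Only consider rows that earlier covariates have not already claimed.
    remaining <- setdiff(seq_len(nrow(X)), selected)
    ord <- remaining[order(X[remaining, j])]
    selected <- c(selected, head(ord, r), tail(ord, r))
  }
  selected
}

set.seed(1)
X_demo <- matrix(rnorm(1000 * 4), 1000, 4)
idx <- iboss_sketch(X_demo, n = 40)
```

Because r is rounded down, this rule returns 2 * p * r indices, which can be fewer than n when n is not a multiple of 2p. That may be why the calls above return 96 indices for n = 100, although this is an inference rather than something the source states.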

Leverage

A subsampling method for linear regression based on leverage scores, proposed by Ma et al. (2015)³.

Leverage(X, n = 10, shrinkage_alpha = 0.9, replace = TRUE, seed = 123)
#>  [1] 8534 2958 2986 9334 4761 9819 2757 9145 3194 2888

Alternatively, use the unified interface (recommended):

subsampling(y_name = "y", data = data_numeric, n = 10, method = "Leverage", 
            replace = TRUE, seed = 123, shrinkage = 0.9)
#>  [1] 8534 2958 2986 9334 4761 9819 2757 9145 3194 2888
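
To show what the shrinkage parameter does, here is a minimal base R sketch of shrinkage leveraging in the sense of Ma et al. (2015): the sampling probabilities are a mixture of normalised leverage scores and the uniform distribution. The helper shrinkage_leverage_probs is hypothetical, the data are simulated, and whether the package adds an intercept column is not something this sketch assumes.

```r
# Hypothetical helper, not the package's implementation.
shrinkage_leverage_probs <- function(X, alpha = 0.9) {
  X <- as.matrix(X)
  Q <- qr.Q(qr(X))                 # orthonormal basis of the column space
  h <- rowSums(Q^2)                # leverage scores (diagonal of the hat matrix)
  # Mix normalised leverage scores with uniform weights 1 / N.
  alpha * h / sum(h) + (1 - alpha) / nrow(X)
}

set.seed(1)
X_demo <- matrix(rnorm(200 * 3), 200, 3)
probs <- shrinkage_leverage_probs(X_demo, alpha = 0.9)
idx <- sample.int(nrow(X_demo), size = 10, replace = TRUE, prob = probs)
```

With alpha = 1 this reduces to plain leverage sampling and with alpha = 0 to uniform sampling; intermediate values keep every probability bounded away from zero.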

OSS

A subsampling method based on orthogonal arrays, proposed by Wang et al. (2021)⁴.

OSS(n = 10, X = X)
#>  [1] 8841 8961 1902 7512   48 9867 6547 9784 3392 3622

Alternatively, use the unified interface (recommended):

subsampling(y_name = "y", data = data_numeric, n = 10, method = "OSS")
#>  [1] 8841 8961 1902 7512   48 9867 6547 9784 3392 3622

LowCon

A subsampling method based on space-filling designs, proposed by Meng et al. (2021)⁵.

LowCon(X = X, n = 10, theta = 1, seed = 123)
#>  [1] 6032 6633 4180 5093 6005 7093 3621 4429 1715 7143

Alternatively, use the unified interface (recommended):

subsampling(y_name = "y", data = data_numeric, n = 10, method = "LowCon", seed = 123, theta = 1)
#>  [1] 6032 6633 4180 5093 6005 7093 3621 4429 1715 7143

IES

A subsampling method based on orthogonal arrays, proposed by Zhang et al. (2024)⁶.

IES(X = X, n = 10, q = 16, seed = 123)
#>  [1] 2876 7890 4440 9400 9813 2499 4939 8165 2224 4628

Alternatively, use the unified interface (recommended):

subsampling(y_name = "y", data = data_numeric, n = 10, method = "IES", seed = 123, q = 16)
#>  [1] 2876 7890 4440 9400 9813 2499 4939 8165 2224 4628

DDS

A model-free subsampling method based on uniform designs, proposed by Zhang et al. (2023)⁷.

DDS(X = X, n = 10, ratio = 0.85)
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 7010 3375 4172 9019 4983   85 9454 7745 9810  9737

Alternatively, use the unified interface (recommended):

subsampling(y_name = "y", data = data_numeric, n = 10, method = "DDS", ratio = 0.85)
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 7010 3375 4172 9019 4983   85 9454 7745 9810  9737

We’re working on more methods.