A unified interface for retrieving subsample indexes
Usage
subsampling(
y_name,
x_name = NULL,
data,
n,
pilot_n = NULL,
method = "Unif",
replace = TRUE,
seed = NULL,
seed_1 = NULL,
seed_2 = NULL,
na_method = NULL,
...
)
Arguments
- y_name
A character. The name of the response variable in the data frame.
- x_name
A character. The name(s) of the explanatory variable(s) in the data frame. Defaults to all variables except the response variable.
- data
A data frame containing the response variable and the explanatory variables.
- n
Subsample size.
- pilot_n
Pilot sample size (used only by some methods).
- method
Subsampling methods:
Unif: Random sampling.
OSMAC_A: A subsampling method based on A-optimality for logistic regression, proposed by Wang et al. (2018).
OSMAC_L: A subsampling method based on L-optimality for logistic regression, proposed by Wang et al. (2018).
IBOSS: A subsampling method based on D-optimality for linear regression, proposed by Wang et al. (2019).
OSS: A subsampling method based on orthogonal arrays, proposed by Wang et al. (2021).
LowCon: A subsampling method based on space-filling designs, proposed by Meng et al. (2021).
IES: A subsampling method based on orthogonal arrays, proposed by Zhang et al. (2024).
- replace
A boolean.
TRUE (the default): Sampling with replacement.
FALSE: Sampling without replacement.
- seed
Random seed. Defaults to NULL.
- seed_1
Random seed for the first-stage sampling (when two-phase subsampling is needed). Defaults to NULL.
- seed_2
Random seed for the second-stage sampling (when two-phase subsampling is needed). Defaults to NULL.
- na_method
Method for handling NA values. Defaults to NULL.
- ...
Additional parameter settings for some methods.
theta: Percentage of data shrinkage for LowCon. Defaults to 1.
q: Hyperparameter controlling how the axes are divided for IES. Defaults to 16.
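To make the roles of `n`, `replace`, and `seed` concrete, here is a minimal base R sketch of what uniform subsampling amounts to. The helper `unif_sketch` and the data frame `toy` are hypothetical names introduced for illustration only; this is an assumption about the general idea, not the package's implementation of method = "Unif".

```r
# Illustrative stand-in for uniform subsampling; not the package's own code.
unif_sketch <- function(data, n, replace = TRUE, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)   # fix the RNG state only when a seed is given
  sample.int(nrow(data), size = n, replace = replace)
}

# Hypothetical toy data with 50 rows.
toy <- data.frame(y = rnorm(50), x = rnorm(50))

idx_with    <- unif_sketch(toy, n = 20, replace = TRUE,  seed = 1)
idx_without <- unif_sketch(toy, n = 20, replace = FALSE, seed = 1)

length(idx_with)             # number of indices returned
anyDuplicated(idx_without)   # 0 means no row index appears twice
```

With replace = TRUE the same row index may appear more than once; with replace = FALSE every index is distinct, so `n` cannot exceed the number of rows.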
Examples
data_binary <- data_binary_class
subsampling(y_name = "y", data = data_binary, n = 30, method = "Unif", seed = 123)
#> [1] 2463 2511 8718 2986 1842 9334 3371 4761 6746 9819 2757 5107 9145 9209 2888
#> [16] 6170 2567 9642 9982 2980 1614 555 4469 9359 7789 9991 9097 1047 7067 3004
subsampling(y_name = "y", data = data_binary, n = 30, pilot_n = 100, method = "OSMAC_A",
seed_1 = 123, seed_2 = 456)
#> [1] 5684 1620 5372 8297 8863 9783 6483 6103 2702 5735 9382 40 9919 8623 2816
#> [16] 5035 6088 2006 4702 1993 4279 9827 8738 8892 7632 6836 6393 6405 99 3952
subsampling(y_name = "y", data = data_binary, n = 30, pilot_n = 100, method = "OSMAC_L",
seed_1 = 123, seed_2 = 456)
#> [1] 5813 1681 5372 8313 8863 9780 1630 6103 2702 5888 9382 9843 9913 8635 2816
#> [16] 5035 6211 2090 4702 2083 4385 9813 8776 8904 4425 6899 1615 6513 99 4076
data_numeric <- data_numeric_regression
subsampling(y_name = "y", data = data_numeric, n = 100, method = "IBOSS")
#> [1] 183 226 395 419 584 666 711 758 1027 1144 1324 1445 1940 1946 1978
#> [16] 2018 2673 2982 3190 3395 3484 3612 3632 3638 3696 3816 3835 3896 3921 4256
#> [31] 4312 4405 4523 4551 4729 4938 5121 5226 5342 5410 5679 5770 5995 6089 6163
#> [46] 6170 6203 6250 6525 6964 6979 7053 7198 7407 7564 7633 7915 7935 7967 7992
#> [61] 8026 8088 8106 8156 8161 8267 8306 8501 8503 8521 8534 8694 8805 8841 9117
#> [76] 9211 9302 9364 9398 9456 9676 9946 9971 9989 1173 2344 5394 8438 8567 9239
#> [91] 1787 2104 2215 3121 7159 9133
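The IBOSS output above lists 96 indices although n = 100. One possible explanation, assuming the package follows the usual IBOSS rule of Wang et al. (2019), is that r = floor(n / (2p)) rows are taken from each tail of each of the p covariates, so the result has 2pr rows and can fall short of n. The base R sketch below illustrates that rule on simulated data. The helper `iboss_sketch` and the matrix `X` are hypothetical, and the package may select or order indices differently.

```r
# Sketch of the IBOSS selection rule under the assumption stated above.
iboss_sketch <- function(X, n) {
  X <- as.matrix(X)
  p <- ncol(X)
  r <- floor(n / (2 * p))              # rows taken from each tail of each covariate
  selected <- integer(0)
  for (j in seq_len(p)) {
    # Only rows not yet selected are candidates for covariate j.
    remaining <- setdiff(seq_len(nrow(X)), selected)
    ord <- remaining[order(X[remaining, j])]
    # Keep the r smallest and r largest values of covariate j.
    selected <- c(selected, head(ord, r), tail(ord, r))
  }
  selected
}

set.seed(1)
X <- matrix(rnorm(1000 * 3), ncol = 3)   # hypothetical data: 1000 rows, p = 3
idx <- iboss_sketch(X, n = 100)
length(idx)                               # 2 * 3 * floor(100 / 6)
```

In this sketch p = 3 gives r = 16 and therefore 96 selected rows, which matches the count above only if the example data has a compatible number of covariates.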
subsampling(y_name = "y", data = data_numeric, n = 30, method = "OSS")
#> [1] 8841 8961 1902 7512 48 9867 6547 9784 3392 3622 5780 6594 1890 1850 8335
#> [16] 1254 6204 1257 4611 3831 4782 4919 1579 3404 718 7189 2060 4899 590 1800
subsampling(y_name = "y", data = data_numeric, n = 10, method = "LowCon", seed = 123, theta = 1)
#> [1] 6032 6633 4180 5093 6005 7093 3621 4429 1715 7143
subsampling(y_name = "y", data = data_numeric, n = 10, method = "IES", seed = 123, q = 16)
#> [1] 2876 7890 4440 9400 9813 2499 4939 8165 2224 4628
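Since the function returns row indexes rather than the rows themselves, a typical next step is to subset the data frame and fit a model on the subsample. The following base R sketch shows that pattern on simulated data. The data frame `toy` is hypothetical, and `idx` is a stand-in drawn with `sample.int()` in place of a real call to `subsampling()`.

```r
set.seed(1)

# Hypothetical binary-response data with two explanatory variables.
toy <- data.frame(x1 = rnorm(1000), x2 = rnorm(1000))
toy$y <- rbinom(1000, 1, plogis(0.5 * toy$x1 - 0.5 * toy$x2))

# Stand-in for the index vector that subsampling() would return.
idx <- sample.int(nrow(toy), size = 200, replace = TRUE)

# Row indexing keeps duplicates, so repeated indexes yield repeated rows.
sub <- toy[idx, ]

# Fit a logistic regression on the subsample only.
fit <- glm(y ~ x1 + x2, data = sub, family = binomial())
coef(fit)
```

This fit is unweighted. Some subsampling schemes call for weighted estimation on the subsample, which this sketch does not attempt.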