A unified interface for retrieving subsample indexes
Usage
subsampling(
  y_name,
  x_name = NULL,
  data,
  n,
  pilot_n = NULL,
  method = "Unif",
  replace = TRUE,
  seed = NULL,
  seed_1 = NULL,
  seed_2 = NULL,
  na_method = NULL,
  ...
)
Arguments
- y_name
 A character. The name of the response variable in the data frame.
- x_name
 A character. The names of the explanatory variables in the data frame. Defaults to all variables except the response variable.
- data
 A data frame containing the response variable and the explanatory variables.
- n
 Subsample size.
- pilot_n
 Pilot sample size (used by some methods, such as OSMAC_A and OSMAC_L). Defaults to NULL.
- method
 Subsampling method:
  - Unif: Random sampling.
  - OSMAC_A: A subsampling method based on A-optimality for logistic regression, proposed by Wang et al. (2018).
  - OSMAC_L: A subsampling method based on L-optimality for logistic regression, proposed by Wang et al. (2018).
  - IBOSS: A subsampling method based on D-optimality for linear regression, proposed by Wang et al. (2019).
  - OSS: A subsampling method based on orthogonal arrays, proposed by Wang et al. (2021).
  - LowCon: A subsampling method based on space-filling designs, proposed by Meng et al. (2021).
  - IES: A subsampling method based on orthogonal arrays, proposed by Zhang et al. (2024).
- replace
 A logical value.
  - TRUE (the default): Sampling with replacement.
  - FALSE: Sampling without replacement.
- seed
 Random seed. Defaults to NULL.
- seed_1
 Random seed for the first-stage sampling (when two-stage subsampling is needed). Defaults to NULL.
- seed_2
 Random seed for the second-stage sampling (when two-stage subsampling is needed). Defaults to NULL.
- na_method
 Method for handling missing values (NA). Defaults to NULL.
- ...
 Additional parameters for specific methods.
  - theta: Percentage of data shrinkage for LowCon. Defaults to 1.
  - q: Hyperparameter controlling how the axes are divided for IES. Defaults to 16.
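
The effect of `replace` and of a random seed can be illustrated with base R alone. The sketch below is not this package's implementation; it only assumes that uniform subsampling behaves like `sample()` applied to row indexes, and `toy` is hypothetical data made up for the illustration.

```r
# Hypothetical toy data; not one of the package's datasets.
toy <- data.frame(y = rnorm(1000), x = rnorm(1000))
n <- 30

# With replacement: the same row index may be drawn more than once.
set.seed(123)
idx_with <- sample(nrow(toy), size = n, replace = TRUE)

# Without replacement: every drawn index is distinct.
set.seed(123)
idx_without <- sample(nrow(toy), size = n, replace = FALSE)

# Resetting the same seed reproduces the same draw.
set.seed(123)
idx_again <- sample(nrow(toy), size = n, replace = TRUE)

length(idx_with)                 # 30
anyDuplicated(idx_without) == 0  # TRUE
identical(idx_with, idx_again)   # TRUE
```

Leaving the seed unset (the analogue of `seed = NULL`) would give a different draw on each call.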
Examples
data_binary <- data_binary_class
subsampling(y_name = "y", data = data_binary, n = 30, method = "Unif", seed = 123)
#>  [1] 2463 2511 8718 2986 1842 9334 3371 4761 6746 9819 2757 5107 9145 9209 2888
#> [16] 6170 2567 9642 9982 2980 1614  555 4469 9359 7789 9991 9097 1047 7067 3004
subsampling(y_name = "y", data = data_binary, n = 30, pilot_n = 100, method = "OSMAC_A",
  seed_1 = 123, seed_2 = 456)
#>  [1] 5684 1620 5372 8297 8863 9783 6483 6103 2702 5735 9382   40 9919 8623 2816
#> [16] 5035 6088 2006 4702 1993 4279 9827 8738 8892 7632 6836 6393 6405   99 3952
subsampling(y_name = "y", data = data_binary, n = 30, pilot_n = 100, method = "OSMAC_L",
  seed_1 = 123, seed_2 = 456)
#>  [1] 5813 1681 5372 8313 8863 9780 1630 6103 2702 5888 9382 9843 9913 8635 2816
#> [16] 5035 6211 2090 4702 2083 4385 9813 8776 8904 4425 6899 1615 6513   99 4076
data_numeric <- data_numeric_regression
subsampling(y_name = "y", data = data_numeric, n = 100, method = "IBOSS")
#>  [1]  183  226  395  419  584  666  711  758 1027 1144 1324 1445 1940 1946 1978
#> [16] 2018 2673 2982 3190 3395 3484 3612 3632 3638 3696 3816 3835 3896 3921 4256
#> [31] 4312 4405 4523 4551 4729 4938 5121 5226 5342 5410 5679 5770 5995 6089 6163
#> [46] 6170 6203 6250 6525 6964 6979 7053 7198 7407 7564 7633 7915 7935 7967 7992
#> [61] 8026 8088 8106 8156 8161 8267 8306 8501 8503 8521 8534 8694 8805 8841 9117
#> [76] 9211 9302 9364 9398 9456 9676 9946 9971 9989 1173 2344 5394 8438 8567 9239
#> [91] 1787 2104 2215 3121 7159 9133
subsampling(y_name = "y", data = data_numeric, n = 30, method = "OSS")
#>  [1] 8841 8961 1902 7512   48 9867 6547 9784 3392 3622 5780 6594 1890 1850 8335
#> [16] 1254 6204 1257 4611 3831 4782 4919 1579 3404  718 7189 2060 4899  590 1800
subsampling(y_name = "y", data = data_numeric, n = 10, method = "LowCon", seed = 123, theta = 1)
#>  [1] 6032 6633 4180 5093 6005 7093 3621 4429 1715 7143
subsampling(y_name = "y", data = data_numeric, n = 10, method = "IES", seed = 123, q = 16)
#>  [1] 2876 7890 4440 9400 9813 2499 4939 8165 2224 4628
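
Because the function returns row indexes rather than a data frame, a typical next step is to subset the data and fit a model on the subsample. The following is a minimal sketch under stated assumptions: `toy` is hypothetical simulated data, and `idx` is produced with base R `sample()` as a stand-in for the vector that `subsampling()` would return, so that the snippet runs without the package.

```r
# Hypothetical toy data standing in for a binary-response data frame.
set.seed(1)
toy <- data.frame(x1 = rnorm(5000), x2 = rnorm(5000))
toy$y <- rbinom(5000, size = 1, prob = plogis(0.5 * toy$x1 - 0.5 * toy$x2))

# Stand-in for the index vector returned by subsampling(); with the package
# loaded this could instead be, for example,
#   idx <- subsampling(y_name = "y", data = toy, n = 200, method = "Unif")
idx <- sample(nrow(toy), size = 200, replace = TRUE)

# Row-subset the data with the indexes, then fit on the subsample only.
sub <- toy[idx, , drop = FALSE]
fit <- glm(y ~ x1 + x2, data = sub, family = binomial())
coef(fit)
```

An unweighted fit like this one corresponds to the uniform case; whether the other methods call for a weighted estimator is not stated on this page.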
