Get subsample index. — subsampling • dbsubsampling

A unified interface for retrieving subsample indexes across several subsampling methods.

Usage

subsampling(
  y_name,
  x_name = NULL,
  data,
  n,
  pilot_n = NULL,
  method = "Unif",
  replace = TRUE,
  seed = NULL,
  seed_1 = NULL,
  seed_2 = NULL,
  na_method = NULL,
  ...
)

Arguments

y_name

A character. The name of the response variable in the data frame.

x_name

A character. The name(s) of the explanatory variable(s) in the data frame. Defaults to all variables except the response variable.

data

A data frame containing the response variable and the explanatory variables.

n

Subsample size.

pilot_n

Pilot sample size (used by some methods; the examples below pass it for OSMAC_A and OSMAC_L).

method

Subsampling methods:

Unif: Uniform random sampling.
OSMAC_A: A subsampling method for logistic regression based on A-optimality, proposed by Wang et al. (2018).
OSMAC_L: A subsampling method for logistic regression based on L-optimality, proposed by Wang et al. (2018).
IBOSS: A subsampling method for linear regression based on D-optimality, proposed by Wang et al. (2019).
OSS: A subsampling method based on orthogonal arrays, proposed by Wang et al. (2021).
LowCon: A subsampling method based on space-filling designs, proposed by Meng et al. (2021).
IES: A subsampling method based on orthogonal arrays, proposed by Zhang et al. (2024).
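
To make the IBOSS entry more concrete, the following is a simplified base-R sketch of the selection rule described by Wang et al. (2019): for each covariate in turn, take the r smallest and r largest values among the rows not yet selected. The function name `iboss_sketch` and the simulated matrix are assumptions made for illustration; this is not the package's implementation and may differ from it in details.

```r
# Simplified IBOSS-style selection; `iboss_sketch` is a hypothetical name.
iboss_sketch <- function(X, n) {
  X <- as.matrix(X)
  p <- ncol(X)
  r <- floor(n / (2 * p))          # rows taken from each tail of each covariate
  remaining <- seq_len(nrow(X))    # row indexes not yet selected
  selected <- integer(0)
  for (j in seq_len(p)) {
    # Remaining rows, sorted by the j-th covariate
    ord <- remaining[order(X[remaining, j])]
    # r smallest and r largest values of covariate j
    pick <- c(head(ord, r), tail(ord, r))
    selected <- c(selected, pick)
    remaining <- setdiff(remaining, pick)
  }
  selected
}

set.seed(1)
X <- matrix(rnorm(1000 * 6), ncol = 6)   # simulated covariates (assumption)
idx <- iboss_sketch(X, n = 100)
length(idx)   # 2 * 6 * floor(100 / 12) = 96
```

Under this rule the number of selected rows is 2 * p * floor(n / (2 * p)), which can be smaller than the requested n when n is not a multiple of 2 * p.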

replace

A logical value.

TRUE (the default): Sampling with replacement.
FALSE: Sampling without replacement.
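
The difference between the two settings can be illustrated with base R's `sample.int()`. This is a minimal sketch of uniform index sampling only; the sizes `N` and `n` are made up, and it is not claimed to be how `subsampling()` draws indexes internally.

```r
set.seed(1)   # fixed seed so the draws are repeatable
N <- 1000     # hypothetical full-data size
n <- 30       # subsample size

# With replacement: the same row index may appear more than once
idx_with <- sample.int(N, size = n, replace = TRUE)

# Without replacement: every index is distinct (requires n <= N)
idx_without <- sample.int(N, size = n, replace = FALSE)

anyDuplicated(idx_without)   # 0 means no repeated index
```

Sampling without replacement guarantees distinct rows, whereas sampling with replacement can return the same row several times.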

seed

Random seed. Defaults to NULL.

seed_1

Random seed for the first-stage sampling (used when two-stage subsampling is needed). Defaults to NULL.

seed_2

Random seed for the second-stage sampling (used when two-stage subsampling is needed). Defaults to NULL.
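
As a rough illustration of why two separate seeds are useful, the sketch below fixes each stage of a two-stage draw independently with `set.seed()`. The helper `two_stage_draw` is hypothetical, and both stages are drawn uniformly for simplicity; the package's two-stage methods may use the seeds differently.

```r
# Hypothetical helper: a pilot draw followed by a second-stage draw,
# each under its own seed. Both stages are uniform here for simplicity.
two_stage_draw <- function(N, pilot_n, n, seed_1 = NULL, seed_2 = NULL) {
  if (!is.null(seed_1)) set.seed(seed_1)   # fixes the first-stage draw
  pilot_idx <- sample.int(N, pilot_n)
  if (!is.null(seed_2)) set.seed(seed_2)   # fixes the second-stage draw
  final_idx <- sample.int(N, n, replace = TRUE)
  list(pilot = pilot_idx, final = final_idx)
}

a <- two_stage_draw(10000, 100, 30, seed_1 = 123, seed_2 = 456)
b <- two_stage_draw(10000, 100, 30, seed_1 = 123, seed_2 = 456)
identical(a, b)   # TRUE: the same seeds reproduce both stages
```

Keeping `seed_1` fixed while varying `seed_2` in such a scheme changes only the second-stage draw.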

na_method

Method for handling NA values.

...

Additional method-specific parameters:

theta: Percentage of data shrinkage for LowCon. Defaults to 1.
q: Hyperparameter controlling how the axes are divided for IES. Defaults to 16.

Value

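A numeric vector of subsample indexes, nominally of length n. In the IBOSS example below, n = 100 returns 96 indexes, so the length can be smaller than n for some methods.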
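
A typical next step is to use the returned indexes to extract the subsample and fit a model on it. The sketch below assumes the indexes are row positions in `data`; the data frame is simulated and `idx` is a plain uniform draw standing in for the output of `subsampling()`.

```r
set.seed(42)
# Simulated stand-in for the full data (assumption, not package data)
full_data <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
full_data$y <- 1 + 2 * full_data$x1 - full_data$x2 + rnorm(500)

# Stand-in for the vector returned by subsampling(): a uniform draw
idx <- sample.int(nrow(full_data), size = 50, replace = TRUE)

# Rows selected by the index vector; repeated indexes repeat the row
sub_data <- full_data[idx, ]

# Fit the model on the subsample only
fit <- lm(y ~ x1 + x2, data = sub_data)
coef(fit)
```

When sampling with replacement, repeated indexes produce repeated rows in `sub_data`, so the subsample keeps one row per returned index.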

Examples

library(dbsubsampling)

# Binary-response data for the Unif and OSMAC examples
data_binary <- data_binary_class
subsampling(y_name = "y", data = data_binary, n = 30, method = "Unif", seed = 123)
#>  [1] 2463 2511 8718 2986 1842 9334 3371 4761 6746 9819 2757 5107 9145 9209 2888
#> [16] 6170 2567 9642 9982 2980 1614  555 4469 9359 7789 9991 9097 1047 7067 3004
subsampling(y_name = "y", data = data_binary, n = 30, pilot_n = 100, method = "OSMAC_A",
  seed_1 = 123, seed_2 = 456)
#>  [1] 5684 1620 5372 8297 8863 9783 6483 6103 2702 5735 9382   40 9919 8623 2816
#> [16] 5035 6088 2006 4702 1993 4279 9827 8738 8892 7632 6836 6393 6405   99 3952
subsampling(y_name = "y", data = data_binary, n = 30, pilot_n = 100, method = "OSMAC_L",
  seed_1 = 123, seed_2 = 456)
#>  [1] 5813 1681 5372 8313 8863 9780 1630 6103 2702 5888 9382 9843 9913 8635 2816
#> [16] 5035 6211 2090 4702 2083 4385 9813 8776 8904 4425 6899 1615 6513   99 4076

# Numeric-response data for the IBOSS, OSS, LowCon and IES examples
data_numeric <- data_numeric_regression
subsampling(y_name = "y", data = data_numeric, n = 100, method = "IBOSS")
#>  [1]  183  226  395  419  584  666  711  758 1027 1144 1324 1445 1940 1946 1978
#> [16] 2018 2673 2982 3190 3395 3484 3612 3632 3638 3696 3816 3835 3896 3921 4256
#> [31] 4312 4405 4523 4551 4729 4938 5121 5226 5342 5410 5679 5770 5995 6089 6163
#> [46] 6170 6203 6250 6525 6964 6979 7053 7198 7407 7564 7633 7915 7935 7967 7992
#> [61] 8026 8088 8106 8156 8161 8267 8306 8501 8503 8521 8534 8694 8805 8841 9117
#> [76] 9211 9302 9364 9398 9456 9676 9946 9971 9989 1173 2344 5394 8438 8567 9239
#> [91] 1787 2104 2215 3121 7159 9133
subsampling(y_name = "y", data = data_numeric, n = 30, method = "OSS")
#>  [1] 8841 8961 1902 7512   48 9867 6547 9784 3392 3622 5780 6594 1890 1850 8335
#> [16] 1254 6204 1257 4611 3831 4782 4919 1579 3404  718 7189 2060 4899  590 1800
subsampling(y_name = "y", data = data_numeric, n = 10, method = "LowCon", seed = 123, theta = 1)
#>  [1] 6032 6633 4180 5093 6005 7093 3621 4429 1715 7143
subsampling(y_name = "y", data = data_numeric, n = 10, method = "IES", seed = 123, q = 16)
#>  [1] 2876 7890 4440 9400 9813 2499 4939 8165 2224 4628