Split dataset into training and test sets — factory_data_load_and_split_r

Splits the dataset into training and test sets with Scikit-learn and returns the train/test data together with the row indices that each record had in the original dataset.

factory_data_load_and_split_r(
  filename,
  target,
  predictor,
  test_size = 0.33,
  reduce_criticality = FALSE,
  theme = NULL
)

Arguments

filename	Either a data frame holding the data (class and text columns), or the name of a CSV file, including the full path to the data folder (if it is not in the project's working directory) and the file extension (".csv").
target	String. The name of the response variable.
predictor	String. The name of the predictor variable.
test_size	Numeric. Proportion of the data that will form the test dataset. Defaults to `0.33`.
reduce_criticality	Logical. For internal use by Nottinghamshire Healthcare NHS Foundation Trust or other trusts that hold data on criticality. If `TRUE`, then all records with a criticality of "-5" (respectively, "5") are assigned a criticality of "-4" (respectively, "4"). This is to avoid situations where the pipeline breaks due to a lack of sufficient data for "-5" and/or "5". Defaults to `FALSE`.
theme	String. For internal use by Nottinghamshire Healthcare NHS Foundation Trust or other trusts that use theme labels ("Access", "Environment/ facilities" etc.). The column name of the theme variable. Defaults to `NULL`. If supplied, the theme variable will be used as a predictor (along with the text predictor) in the model that is fitted with criticality as the response variable. The rationale is two-fold. First, to help the model improve predictions on criticality when the theme labels are readily available. Second, to force the criticality for "Couldn't be improved" to always be "3" in the training and test data, as well as in the predictions. "3" is the only criticality value that "Couldn't be improved" can take, so forcing it both improves model performance and corrects values other than "3" that were assigned through human error.
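
The two recoding rules described for reduce_criticality and theme can be pictured with a small base-R sketch. It is only an illustration on a made-up data frame, not the package's internal implementation; the column names `criticality` and `theme` and the toy values are assumptions.

```r
# Toy data; column names and values are hypothetical, for illustration only
toy <- data.frame(
  theme = c("Access", "Couldn't be improved", "Staff", "Environment/ facilities"),
  criticality = c("-5", "1", "5", "2"),
  stringsAsFactors = FALSE
)

# reduce_criticality = TRUE: fold the sparse extremes into their neighbours
toy$criticality[toy$criticality == "-5"] <- "-4"
toy$criticality[toy$criticality == "5"] <- "4"

# theme supplied: "Couldn't be improved" may only carry criticality "3"
toy$criticality[toy$theme == "Couldn't be improved"] <- "3"

toy$criticality
```

After the recoding, no record carries "-5" or "5", and the "Couldn't be improved" record carries "3" regardless of the value it started with.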

Value

A list of length 6: x_train (data frame), x_test (data frame), y_train (array), y_test (array), index_training_data (integer vector), and index_test_data (integer vector). The row names of x_train and x_test are index_training_data and index_test_data respectively; the same indices are the names of y_train and y_test.
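
As a rough check on how test_size translates into the sizes of the returned pieces, the arithmetic can be sketched as follows. The sketch assumes that the test count is rounded up and the training set receives the remainder, which is consistent with the 6923/3411 split shown in the Examples; the variable names are illustrative only.

```r
n_records <- 10334  # number of rows in the example dataset (see Examples)
test_size <- 0.33

# Assumption: the test count is rounded up, training gets the remainder
n_test <- ceiling(test_size * n_records)
n_train <- n_records - n_test

c(train = n_train, test = n_test)
```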

References

Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M. & Duchesnay E. (2011), Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12:2825–2830.

Examples

data_splits <- pxtextmineR::factory_data_load_and_split_r(
  filename = pxtextmineR::text_data,
  target = "label",
  predictor = "feedback",
  test_size = 0.33)

# Let's take a look at the returned list
str(data_splits)
#> List of 6
#>  $ x_train            :'data.frame':	6923 obs. of  1 variable:
#>   ..$ predictor: chr [1:6923] "XXXX came to see mum at home.  She was marvellous, treated mum with respect and informative." "FrIendly and helpful HV." "Good that we were able to have a face to face appointment.  \nEverything explained in a manner we could underst"| __truncated__ "Nothing" ...
#>   ..- attr(*, "pandas.index")=Int64Index([4565, 3245, 2370, 9224, 5480, 4230, 3001,   84, 6643, 2074,
#>             ...
#>              889, 3808, 7678, 2660, 6408, 9136, 3534, 6024, 4619, 5379],
#>            dtype='int64', length=6923)
#>  $ x_test             :'data.frame':	3411 obs. of  1 variable:
#>   ..$ predictor: chr [1:3411] "Just been happy on the ward" "Did well tracking improvements to anxiety and depression, discussing and challenging irrational thoughts was done well. " "As fIrst tIme mum, advIse and InformatIon provIded was extremely helpful -  would recomend to other parents too." "Nothing" ...
#>   ..- attr(*, "pandas.index")=Int64Index([ 6850,  2155,  2803,  9491,  9811,  6822,  1813, 10243,  3598,
#>              5264,
#>             ...
#>              3101,  2122,  1232,  4194,  7644,  4990,  6936,  2427,  4411,
#>              9425],
#>            dtype='int64', length=3411)
#>  $ y_train            : chr [1:6923(1d)] "Staff" "Staff" "Communication" "Couldn't be improved" ...
#>   ..- attr(*, "dimnames")=List of 1
#>   .. ..$ : chr [1:6923] "4565" "3245" "2370" "9224" ...
#>  $ y_test             : chr [1:3411(1d)] "Care received" "Care received" "Communication" "Couldn't be improved" ...
#>   ..- attr(*, "dimnames")=List of 1
#>   .. ..$ : chr [1:3411] "6850" "2155" "2803" "9491" ...
#>  $ index_training_data: int [1:6923] 4565 3245 2370 9224 5480 4230 3001 84 6643 2074 ...
#>  $ index_test_data    : int [1:3411] 6850 2155 2803 9491 9811 6822 1813 10243 3598 5264 ...

# Each record in the split data is tagged with the row index of the original dataset
head(rownames(data_splits$x_train))
#> [1] "4565" "3245" "2370" "9224" "5480" "4230"
head(names(data_splits$y_train))
#> [1] "4565" "3245" "2370" "9224" "5480" "4230"

# Note that, in Python, indices start from 0 and go up to number_of_records - 1
library(magrittr) # provides the %>% pipe used below
all_indices <- data_splits$y_train %>%
  names() %>%
  c(names(data_splits$y_test)) %>%
  as.numeric() %>%
  sort()
head(all_indices) # Starts from zero
#> [1] 0 1 2 3 4 5
tail(all_indices) # Ends in nrow(text_data) - 1
#> [1] 10328 10329 10330 10331 10332 10333
length(all_indices) == nrow(pxtextmineR::text_data)
#> [1] TRUE
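
Because the indices are zero-based, they need shifting by one before they can be used as R row positions. The sketch below shows this on a small stand-in data frame; `toy_data` and `zero_based_idx` are hypothetical names, and the same `+ 1` shift would presumably apply to index_training_data and index_test_data when the original data frame has default row numbering.

```r
# Stand-in for the original dataset (hypothetical values)
toy_data <- data.frame(
  feedback = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

# Stand-in for index_training_data: zero-based positions, as in Python
zero_based_idx <- c(3L, 0L, 4L)

# Shift by one to obtain R's one-based row positions
recovered <- toy_data[zero_based_idx + 1L, , drop = FALSE]
recovered$feedback
```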