factory_data_load_and_split_r.Rd
Splits the dataset into training and test sets with Scikit-learn
and returns the train/test data together with the row (position) indices of their records in the original dataset.
factory_data_load_and_split_r(filename, target, predictor, test_size = 0.33, reduce_criticality = FALSE, theme = NULL)
Argument | Description |
---|---|
filename | Either a data frame containing the data (class and text columns) or the name of a CSV dataset, including the full path to the data folder (if it is not in the project's working directory) and the file type suffix (".csv"). |
target | String. The name of the response variable. |
predictor | String. The name of the predictor variable. |
test_size | Numeric. Proportion of data that will form the test dataset. |
reduce_criticality | Logical. For internal use by Nottinghamshire Healthcare NHS Foundation Trust or other trusts that hold data on criticality. If |
theme | String. For internal use by Nottinghamshire Healthcare NHS Foundation Trust or other trusts that use theme labels ("Access", "Environment/ facilities" etc.). The column name of the theme variable. Defaults to NULL. |
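
As a rough illustration of what `test_size` means in practice, the sketch below reproduces the split sizes shown in the example output on this page (6923 training and 3411 test records). It assumes the split follows Scikit-learn's usual rounding for a fractional `test_size`, where the test set size is rounded up and the training set takes the remainder. The variable names are illustrative and not part of the package.

```r
# Total number of records, taken from the example output on this page
# (6923 training records + 3411 test records).
n_records <- 10334
test_size <- 0.33

# Assumed rounding rule: the test set is rounded up,
# and the training set receives whatever is left.
n_test <- ceiling(test_size * n_records)
n_train <- n_records - n_test

c(n_train = n_train, n_test = n_test)
```

Because of the rounding, the realised test proportion is only approximately equal to `test_size`.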
A list of length 6: x_train (data frame), x_test (data frame), y_train (array), y_test (array), index_training_data (integer vector), and index_test_data (integer vector). The row names (names) of x_train and x_test (y_train and y_test) are index_training_data and index_test_data respectively.
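
Since the indices come from Python, they are 0-based, whereas R rows are 1-based. The following is a minimal sketch of how the returned indices could be mapped back to rows of the original data frame. It uses a small made-up data frame and made-up index vectors as stand-ins for the real dataset and for the `index_training_data` and `index_test_data` elements, so the values are hypothetical.

```r
# Hypothetical stand-in for the original dataset (values are made up).
original <- data.frame(
  label = c("Staff", "Access", "Staff", "Communication", "Access"),
  feedback = c("comment a", "comment b", "comment c", "comment d", "comment e"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-ins for the returned 0-based index vectors.
index_training_data <- c(3L, 0L, 4L)
index_test_data <- c(1L, 2L)

# Add 1 to convert 0-based Python positions to 1-based R row numbers.
train_rows <- original[index_training_data + 1L, ]
test_rows <- original[index_test_data + 1L, ]

# Sanity checks: the two index sets do not overlap and together
# cover every record, from 0 to number_of_records - 1.
no_overlap <- length(intersect(index_training_data, index_test_data)) == 0
full_cover <- setequal(
  c(index_training_data, index_test_data),
  seq_len(nrow(original)) - 1L
)
```

Forgetting the `+ 1L` offset would silently select the wrong records (and drop index 0 altogether), which is why the explicit conversion is kept in one place here.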
Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M. & Duchesnay E. (2011), Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12:2825–2830.

    data_splits <- pxtextmineR::factory_data_load_and_split_r(
      filename = pxtextmineR::text_data,
      target = "label",
      predictor = "feedback",
      test_size = 0.33)

    # Let's take a look at the returned list
    str(data_splits)
    #> List of 6
    #>  $ x_train :'data.frame': 6923 obs. of 1 variable:
    #>   ..$ predictor: chr [1:6923] "XXXX came to see mum at home. She was marvellous, treated mum with respect and informative." "FrIendly and helpful HV." "Good that we were able to have a face to face appointment. \nEverything explained in a manner we could underst"| __truncated__ "Nothing" ...
    #>   ..- attr(*, "pandas.index")=Int64Index([4565, 3245, 2370, 9224, 5480, 4230, 3001, 84, 6643, 2074,
    #>             ...
    #>             889, 3808, 7678, 2660, 6408, 9136, 3534, 6024, 4619, 5379],
    #>            dtype='int64', length=6923)
    #>  $ x_test :'data.frame': 3411 obs. of 1 variable:
    #>   ..$ predictor: chr [1:3411] "Just been happy on the ward" "Did well tracking improvements to anxiety and depression, discussing and challenging irrational thoughts was done well. " "As fIrst tIme mum, advIse and InformatIon provIded was extremely helpful - would recomend to other parents too." "Nothing" ...
    #>   ..- attr(*, "pandas.index")=Int64Index([ 6850, 2155, 2803, 9491, 9811, 6822, 1813, 10243, 3598,
    #>             5264,
    #>             ...
    #>             3101, 2122, 1232, 4194, 7644, 4990, 6936, 2427, 4411,
    #>             9425],
    #>            dtype='int64', length=3411)
    #>  $ y_train : chr [1:6923(1d)] "Staff" "Staff" "Communication" "Couldn't be improved" ...
    #>   ..- attr(*, "dimnames")=List of 1
    #>   .. ..$ : chr [1:6923] "4565" "3245" "2370" "9224" ...
    #>  $ y_test : chr [1:3411(1d)] "Care received" "Care received" "Communication" "Couldn't be improved" ...
    #>   ..- attr(*, "dimnames")=List of 1
    #>   .. ..$ : chr [1:3411] "6850" "2155" "2803" "9491" ...
    #>  $ index_training_data: int [1:6923] 4565 3245 2370 9224 5480 4230 3001 84 6643 2074 ...
    #>  $ index_test_data : int [1:3411] 6850 2155 2803 9491 9811 6822 1813 10243 3598 5264 ...

    # Each record in the split data is tagged with the row index of the original dataset
    head(rownames(data_splits$x_train))
    #> [1] "4565" "3245" "2370" "9224" "5480" "4230"
    #> [1] "4565" "3245" "2370" "9224" "5480" "4230"

    # Note that, in Python, indices start from 0 and go up to number_of_records - 1
    all_indices <- data_splits$y_train %>%
      names() %>%
      c(names(data_splits$y_test)) %>%
      as.numeric() %>%
      sort()
    head(all_indices) # Starts from zero
    #> [1] 0 1 2 3 4 5
    #> [1] 10328 10329 10330 10331 10332 10333
    #> [1] TRUE