Predict unlabelled text using a fitted Scikit-learn (Python) pipeline

factory_predict_unlabelled_text_r(
  dataset,
  predictor,
  pipe_path_or_object,
  preds_column = NULL,
  column_names = "all_cols",
  theme = NULL
)

Arguments

dataset

Data frame. The text data to predict classes for.

predictor

String. The column name of the text variable.

pipe_path_or_object

String or sklearn.model_selection._search.RandomizedSearchCV (e.g. from factory_pipeline_r). If a string, it should be in the form "path_to_fitted_pipeline/pipeline.sav", where "pipeline" is the name of the SAV file with the fitted Scikit-learn pipeline.

preds_column

A string with the user-specified name of the column that will have the predictions. If NULL (default), then the name will be paste0(text_col_name, "_preds").

column_names

A vector of strings with the names of the columns of the supplied data frame (incl. text_col_name) to be added to the returned data frame. If "preds_only", then the only column in the returned data frame will be preds_column. Defaults to "all_cols".

theme

String. For internal use by Nottinghamshire Healthcare NHS Foundation Trust or other trusts that use theme labels ("Access", "Environment/ facilities" etc.). The column name of the theme variable. Defaults to NULL. If supplied, the theme variable will be used as a predictor (along with the text predictor) in the model that is fitted with criticality as the response variable. The rationale is two-fold. First, to help the model improve predictions on criticality when the theme labels are readily available. Second, to force the criticality for "Couldn't be improved" to always be "3" in the training and test data, as well as in the predictions. This is the only criticality value that "Couldn't be improved" can take, so by forcing it to always be "3", we are improving model performance, but are also correcting possible erroneous assignments of values other than "3" that are attributed to human error.

Value

Data frame. The predictions column with or without any other columns passed by the user (see column_names).

Examples

# Prepare training and test sets data_splits <- pxtextmineR::factory_data_load_and_split_r( filename = pxtextmineR::text_data, target = "label", predictor = "feedback", test_size = 0.90) # Fit the pipeline pipe <- pxtextmineR::factory_pipeline_r( x = data_splits$x_train, y = data_splits$y_train, tknz = "spacy", ordinal = FALSE, metric = "accuracy_score", cv = 2, n_iter = 1, n_jobs = 1, verbose = 3, learners = "SGDClassifier" ) # Make predictions # # Return data frame with predictions column and all original columns from # the supplied data frame preds_all_cols <- pxtextmineR::factory_predict_unlabelled_text_r( dataset = pxtextmineR::text_data, predictor = "feedback", pipe_path_or_object = pipe, column_names = "all_cols") str(preds_all_cols)
#> 'data.frame': 10334 obs. of 4 variables: #> $ feedback_preds: chr "Couldn't be improved" "Care received" "Staff" "Care received" ... #> $ label : chr "Couldn't be improved" "Environment/ facilities" "Access" "Communication" ... #> $ criticality : chr "3" "-1" "-2" "-1" ... #> $ feedback : chr "Nothing." "Temperature in theatre a little low." "Same service available at Bingham Health Centre." "Appointment details given over phone - no physical evidence/reminder which could cause problems. Other than tha"| __truncated__ ... #> - attr(*, "pandas.index")=RangeIndex(start=0, stop=10334, step=1)
# Return data frame with predictions column only preds_preds_only <- pxtextmineR::factory_predict_unlabelled_text_r( dataset = pxtextmineR::text_data, predictor = "feedback", pipe_path_or_object = pipe, column_names = "preds_only") head(preds_preds_only)
#> feedback_preds #> 1 Couldn't be improved #> 2 Care received #> 3 Staff #> 4 Care received #> 5 Miscellaneous #> 6 Care received
# Return data frame with predictions column and columns label and feedback from # the supplied data frame preds_label_text <- pxtextmineR::factory_predict_unlabelled_text_r( dataset = pxtextmineR::text_data, predictor = "feedback", pipe_path_or_object = pipe, column_names = c("label", "feedback")) str(preds_label_text)
#> 'data.frame': 10334 obs. of 3 variables: #> $ feedback_preds: chr "Couldn't be improved" "Care received" "Staff" "Care received" ... #> $ label : chr "Couldn't be improved" "Environment/ facilities" "Access" "Communication" ... #> $ feedback : chr "Nothing." "Temperature in theatre a little low." "Same service available at Bingham Health Centre." "Appointment details given over phone - no physical evidence/reminder which could cause problems. Other than tha"| __truncated__ ... #> - attr(*, "pandas.index")=RangeIndex(start=0, stop=10334, step=1)
# Return data frame with the predictions column name supplied by the user preds_custom_preds_name <- pxtextmineR::factory_predict_unlabelled_text_r( dataset = pxtextmineR::text_data, predictor = "feedback", pipe_path_or_object = pipe, column_names = "preds_only", preds_column = "predictions") head(preds_custom_preds_name)
#> predictions #> 1 Couldn't be improved #> 2 Care received #> 3 Staff #> 4 Care received #> 5 Miscellaneous #> 6 Care received