Split the data, build and fit the pipeline, produce performance metrics.

text_classification_pipeline_r(
  filename,
  target,
  predictor,
  test_size = 0.33,
  ordinal = FALSE,
  tknz = "spacy",
  metric = "class_balance_accuracy_score",
  cv = 2,
  n_iter = 10,
  n_jobs = 1,
  verbose = 3,
  learners = c("SGDClassifier"),
  reduce_criticality = FALSE,
  theme = NULL
)

Arguments

filename

A data frame containing the data (class and text columns), or the name of a CSV file with the data, including the full path to the data folder (if it is not in the project's working directory) and the data type suffix (".csv").

target

String. The name of the response variable.

predictor

String. The name of the predictor variable.

test_size

Numeric. Proportion of data that will form the test dataset.

ordinal

Logical. Whether to fit an ordinal classification model. The ordinal model is the implementation of Frank and Hall (2001), which can use any standard classification model that calculates probabilities.
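
The decomposition is easy to sketch. Here is a minimal illustration (not the package's internal code) for a response with three ordered levels 1 < 2 < 3: fit K - 1 = 2 binary classifiers for Pr(y > 1) and Pr(y > 2), then recover the class probabilities by differencing.

# Illustrative sketch of the Frank & Hall (2001) idea, with made-up
# probabilities from two hypothetical binary models
p_gt_1 <- 0.80              # Pr(y > 1) from the first binary model
p_gt_2 <- 0.30              # Pr(y > 2) from the second binary model

p_1 <- 1 - p_gt_1           # Pr(y = 1) = 0.20
p_2 <- p_gt_1 - p_gt_2      # Pr(y = 2) = 0.50
p_3 <- p_gt_2               # Pr(y = 3) = 0.30
which.max(c(p_1, p_2, p_3)) # predicted class: 2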

tknz

String. Tokenizer to use ("spacy" or "wordnet").

metric

String. Scorer to use during pipeline tuning ("accuracy_score", "balanced_accuracy_score", "matthews_corrcoef", "class_balance_accuracy_score").
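
For readers unfamiliar with the last option, here is a rough sketch of class balance accuracy as we understand it (the authoritative implementation lives in the underlying Python package, pxtextmining): each diagonal element of the confusion matrix is divided by the larger of its row and column totals, and the results are averaged over classes.

# Illustrative sketch only; not the package's exact implementation.
# C is a confusion matrix with rows = truth, columns = predictions.
class_balance_accuracy <- function(C) {
  mean(diag(C) / pmax(rowSums(C), colSums(C)))
}

C <- matrix(c(50, 10,  5,
               8, 30,  2,
               2,  5, 40), nrow = 3, byrow = TRUE)
class_balance_accuracy(C) # ~0.76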

cv

Number of cross-validation folds.

n_iter

Number of parameter settings that are sampled (see sklearn.model_selection.RandomizedSearchCV).

n_jobs

Number of jobs to run in parallel (see sklearn.model_selection.RandomizedSearchCV). NOTE: If your machine does not have the number of cores specified in n_jobs, then an error will be returned.
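
To avoid the error mentioned above, a defensive pattern (not something the function does for you) is to cap the requested number of jobs at the core count reported by the parallel package:

# Cap n_jobs at the number of physical cores actually available
n_jobs <- min(4, parallel::detectCores(logical = FALSE))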

verbose

Controls the verbosity (see sklearn.model_selection.RandomizedSearchCV).

learners

Vector. Scikit-learn names of the learners to tune. Must be one or more of "SGDClassifier", "RidgeClassifier", "Perceptron", "PassiveAggressiveClassifier", "BernoulliNB", "ComplementNB", "MultinomialNB", "KNeighborsClassifier", "NearestCentroid", "RandomForestClassifier". When a single model is used, it can be passed as a string.

reduce_criticality

Logical. For internal use by Nottinghamshire Healthcare NHS Foundation Trust or other trusts that hold data on criticality. If TRUE, then all records with a criticality of "-5" (respectively, "5") are assigned a criticality of "-4" (respectively, "4"). This is to avoid situations where the pipeline breaks due to a lack of sufficient data for "-5" and/or "5". Defaults to FALSE.

theme

String. For internal use by Nottinghamshire Healthcare NHS Foundation Trust or other trusts that use theme labels ("Access", "Environment/facilities" etc.). The column name of the theme variable. Defaults to NULL. If supplied, the theme variable will be used as a predictor (along with the text predictor) in the model that is fitted with criticality as the response variable. The rationale is two-fold: first, to help the model improve predictions on criticality when the theme labels are readily available; second, to force the criticality for "Couldn't be improved" to always be "3" in the training and test data, as well as in the predictions. This is the only criticality value that "Couldn't be improved" can take, so forcing it to always be "3" both improves model performance and corrects erroneous assignments of values other than "3" that are attributable to human error.
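
For illustration, the forced assignment described above amounts to something like the following sketch, with hypothetical column names (the real logic runs inside the pipeline):

# Hypothetical data frame with theme and criticality columns
df <- data.frame(
  theme       = c("Access", "Couldn't be improved", "Staff"),
  criticality = c("-2", "2", "1")
)
# "Couldn't be improved" can only ever have criticality "3"
df$criticality[df$theme == "Couldn't be improved"] <- "3"
df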

Value

A list of length 7:

  • pipe A fitted Scikit-learn pipeline containing a number of objects that can be accessed with the $ sign (see the sketch after this list and the Examples). For a partial list, see "Attributes" in sklearn.model_selection.RandomizedSearchCV. Do not be surprised if the pipeline contains more objects than those in the aforementioned "Attributes" list. Python objects can contain several objects, from numeric results (e.g. the pipeline's accuracy) to methods (i.e. functions, in R lingo) and classes. In Python, these are normally accessed with object.<whatever>, but in R the command is object$<whatever>. For instance, one can access the predict() method to make predictions on unseen data. See Examples.

  • tuning_results Data frame. All (hyper)parameter values and models tried during fitting.

  • pred Vector. The predictions on the test set.

  • accuracy_per_class Data frame. Accuracies per class.

  • p_compare_models_bar A bar plot comparing the mean scores (of the user-supplied metric parameter) from the cross-validation on the training set, for the best (hyper)parameter values for each learner.

  • index_training_data The row names/indices of the training data. Note that, in Python, indices start from 0 and go up to number_of_records - 1. See Examples.

  • index_test_data The row names/indices of the test data. Note that, in Python, indices start from 0 and go up to number_of_records - 1. See Examples.
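
As a quick illustration of $-access on the returned Python object, here is a minimal sketch assuming a fitted result text_pipe as in the Examples. best_score_ and best_params_ are standard sklearn.model_selection.RandomizedSearchCV attributes; the "feedback" column name for new data is an assumption based on the Examples.

text_pipe$pipe$best_score_   # mean CV score of the best (hyper)parameter values
text_pipe$pipe$best_params_  # the winning (hyper)parameter values

# predict() on unseen text (hypothetical new record; column name assumed)
new_text <- data.frame(feedback = "The nurses were wonderful")
text_pipe$pipe$predict(new_text)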

Details

This function brings together the three functions that run chunks of the process independently, namely splitting the data into training and test sets (factory_data_load_and_split_r), building and fitting the pipeline on the training set (factory_pipeline_r), and assessing pipeline performance (factory_model_performance_r).

For details on what the pipeline does/how it works, see factory_pipeline_r's Details section.

References

Frank E. & Hall M. (2001). A Simple Approach to Ordinal Classification. Machine Learning: ECML 2001, 145–156.

Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M. & Duchesnay E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12:2825–2830.

Examples

# We can prepare the data, build and fit the pipeline, and get performance
# metrics in two ways. One way is to run the factory_* functions independently.
# The commented-out script right below would do exactly that.

# Prepare training and test sets
# data_splits <- pxtextmineR::factory_data_load_and_split_r(
#   filename = pxtextmineR::text_data,
#   target = "label",
#   predictor = "feedback",
#   test_size = 0.90) # Make a small training set for a faster run in this example
#
# # Fit the pipeline
# pipe <- pxtextmineR::factory_pipeline_r(
#   x = data_splits$x_train,
#   y = data_splits$y_train,
#   tknz = "spacy",
#   ordinal = FALSE,
#   metric = "class_balance_accuracy_score",
#   cv = 2, n_iter = 10, n_jobs = 1, verbose = 3,
#   learners = c("SGDClassifier", "MultinomialNB")
# )
# (SGDClassifier represents both logistic regression and linear SVM. This
# depends on the value of the "loss" hyperparameter, which can be "log" or
# "hinge". This is set internally in factory_pipeline_r).
#
# # Assess model performance
# pipe_performance <- pxtextmineR::factory_model_performance_r(
#   pipe = pipe,
#   x_train = data_splits$x_train,
#   y_train = data_splits$y_train,
#   x_test = data_splits$x_test,
#   y_test = data_splits$y_test,
#   metric = "accuracy_score")

# Alternatively, we can use text_classification_pipeline_r() to do everything
# in one go.
text_pipe <- pxtextmineR::text_classification_pipeline_r(
  filename = pxtextmineR::text_data,
  target = "label",
  predictor = "feedback",
  test_size = 0.33,
  ordinal = FALSE,
  tknz = "spacy",
  metric = "class_balance_accuracy_score",
  cv = 2,
  n_iter = 10,
  n_jobs = 1,
  verbose = 3,
  learners = c("SGDClassifier", "MultinomialNB"),
  reduce_criticality = FALSE,
  theme = NULL
)

names(text_pipe)
#> [1] "pipe" "tuning_results" "pred" #> [4] "accuracy_per_class" "p_compare_models_bar" "index_training_data" #> [7] "index_test_data"
# Let's compare pipeline performance for different tunings with a range of
# metrics. The mean_test_* columns average each metric across the
# cross-validation folds.
text_pipe$tuning_results %>%
  dplyr::select(learner, dplyr::contains("mean_test"))
#>          learner mean_test_Accuracy mean_test_Balanced Accuracy
#> 2       Logistic          0.6692195                   0.4642124
#> 0     Linear SVM          0.5575606                   0.4020137
#> 6       Logistic          0.5090311                   0.3634837
#> 5     Linear SVM          0.5855842                   0.3474347
#> 8  MultinomialNB          0.4810045                   0.3219231
#> 9     Linear SVM          0.4170098                   0.3080414
#> 1  MultinomialNB          0.4805708                   0.2927007
#> 4  MultinomialNB          0.5646388                   0.2881532
#> 7       Logistic          0.4552871                   0.2632449
#> 3       Logistic          0.4135429                   0.2520484
#>   mean_test_Matthews Correlation Coefficient mean_test_Class Balance Accuracy
#> 2                                  0.5679886                        0.4312644
#> 0                                  0.4386832                        0.3549740
#> 6                                  0.3785785                        0.3066306
#> 5                                  0.4573056                        0.3014037
#> 8                                  0.3360435                        0.2718537
#> 9                                  0.2858717                        0.2535001
#> 1                                  0.3198254                        0.2445776
#> 4                                  0.4183560                        0.2346988
#> 7                                  0.2877204                        0.2207833
#> 3                                  0.2491383                        0.1815914
# A glance at the (hyper)parameters and their tuned values
text_pipe$tuning_results %>%
  dplyr::select(learner, dplyr::contains("param_")) %>%
  str()
#> 'data.frame': 10 obs. of 22 variables:
#>  $ learner                                                     : chr "Logistic" "Linear SVM" "Logistic" "Linear SVM" ...
#>  $ param_sampling__kw_args                                     : chr "{'up_balancing_counts': 800}" "{'up_balancing_counts': 300}" "{'up_balancing_counts': 800}" "{'up_balancing_counts': 800}" ...
#>  $ param_preprocessor__texttr__text__transformer__use_idf      :List of 10
#>   ..$ : logi FALSE
#>   ..$ : logi FALSE
#>   ..$ : logi FALSE
#>   ..$ : logi FALSE
#>   ..$ : logi TRUE
#>   ..$ : logi TRUE
#>   ..$ : logi TRUE
#>   ..$ : logi FALSE
#>   ..$ : logi FALSE
#>   ..$ : logi TRUE
#>  $ param_preprocessor__texttr__text__transformer__tokenizer    : chr "<pxtextmining.helpers.tokenization.LemmaTokenizer object at 0x0000000016D661C0>" "<pxtextmining.helpers.tokenization.LemmaTokenizer object at 0x0000000016D661C0>" "<pxtextmining.helpers.tokenization.LemmaTokenizer object at 0x0000000016D661C0>" "<pxtextmining.helpers.tokenization.LemmaTokenizer object at 0x0000000016D661C0>" ...
#>  $ param_preprocessor__texttr__text__transformer__preprocessor : chr "<function text_preprocessor at 0x00000000665D93A0>" "<function text_preprocessor at 0x00000000665D93A0>" "<function text_preprocessor at 0x00000000665D93A0>" "<function text_preprocessor at 0x00000000665D93A0>" ...
#>  $ param_preprocessor__texttr__text__transformer__norm         :List of 10
#>   ..$ : NULL
#>   ..$ : chr "l2"
#>   ..$ : chr "l2"
#>   ..$ : NULL
#>   ..$ : chr "l2"
#>   ..$ : chr "l2"
#>   ..$ : chr "l2"
#>   ..$ : NULL
#>   ..$ : NULL
#>   ..$ : chr "l2"
#>  $ param_preprocessor__texttr__text__transformer__ngram_range  : chr "(1, 3)" "(2, 3)" "(2, 3)" "(2, 3)" ...
#>  $ param_preprocessor__texttr__text__transformer__min_df       :List of 10
#>   ..$ : int 1
#>   ..$ : int 3
#>   ..$ : int 1
#>   ..$ : int 1
#>   ..$ : int 3
#>   ..$ : int 3
#>   ..$ : int 3
#>   ..$ : int 3
#>   ..$ : int 3
#>   ..$ : int 3
#>  $ param_preprocessor__texttr__text__transformer__max_df       :List of 10
#>   ..$ : num 0.95
#>   ..$ : num 0.7
#>   ..$ : num 0.95
#>   ..$ : num 0.95
#>   ..$ : num 0.95
#>   ..$ : num 0.7
#>   ..$ : num 0.7
#>   ..$ : num 0.7
#>   ..$ : num 0.7
#>   ..$ : num 0.7
#>  $ param_preprocessor__texttr__text__transformer                : chr "TfidfVectorizer(max_df=0.95, ngram_range=(1, 3), norm=None,\n preprocessor=<function text_prepro"| __truncated__ "TfidfVectorizer(max_df=0.95, ngram_range=(1, 3), norm=None,\n preprocessor=<function text_prepro"| __truncated__ "TfidfVectorizer(max_df=0.95, ngram_range=(1, 3), norm=None,\n preprocessor=<function text_prepro"| __truncated__ "TfidfVectorizer(max_df=0.95, ngram_range=(1, 3), norm=None,\n preprocessor=<function text_prepro"| __truncated__ ...
#>  $ param_preprocessor__sentimenttr__scaler__scaler__n_bins     :List of 10
#>   ..$ : int 4
#>   ..$ : int 4
#>   ..$ : int 4
#>   ..$ : int 4
#>   ..$ : int 8
#>   ..$ : int 8
#>   ..$ : int 4
#>   ..$ : int 4
#>   ..$ : int 8
#>   ..$ : int 4
#>  $ param_preprocessor__sentimenttr__scaler__scaler              : chr "KBinsDiscretizer(n_bins=4, strategy='kmeans')" "KBinsDiscretizer(n_bins=4, strategy='kmeans')" "KBinsDiscretizer(n_bins=4, strategy='kmeans')" "KBinsDiscretizer(n_bins=4, strategy='kmeans')" ...
#>  $ param_preprocessor__lengthtr__scaler__scaler                 : chr "KBinsDiscretizer(n_bins=3, strategy='kmeans')" "KBinsDiscretizer(n_bins=3, strategy='kmeans')" "KBinsDiscretizer(n_bins=3, strategy='kmeans')" "KBinsDiscretizer(n_bins=3, strategy='kmeans')" ...
#>  $ param_featsel__selector__score_func                          : chr "<function chi2 at 0x00000000665C14C0>" "<function chi2 at 0x00000000665C14C0>" "<function chi2 at 0x00000000665C14C0>" "<function chi2 at 0x00000000665C14C0>" ...
#>  $ param_featsel__selector__percentile                          :List of 10
#>   ..$ : int 70
#>   ..$ : int 70
#>   ..$ : int 85
#>   ..$ : int 70
#>   ..$ : int 85
#>   ..$ : int 85
#>   ..$ : int 100
#>   ..$ : int 100
#>   ..$ : int 85
#>   ..$ : int 85
#>  $ param_featsel__selector                                      : chr "SelectPercentile(percentile=70,\n score_func=<function chi2 at 0x00000000665C14C0>)" "SelectPercentile(percentile=70,\n score_func=<function chi2 at 0x00000000665C14C0>)" "SelectPercentile(percentile=70,\n score_func=<function chi2 at 0x00000000665C14C0>)" "SelectPercentile(percentile=70,\n score_func=<function chi2 at 0x00000000665C14C0>)" ...
#>  $ param_clf__estimator__penalty                                :List of 10
#>   ..$ : chr "elasticnet"
#>   ..$ : chr "l2"
#>   ..$ : chr "elasticnet"
#>   ..$ : chr "l2"
#>   ..$ : num NaN
#>   ..$ : chr "elasticnet"
#>   ..$ : num NaN
#>   ..$ : num NaN
#>   ..$ : chr "elasticnet"
#>   ..$ : chr "l2"
#>  $ param_clf__estimator__max_iter                               :List of 10
#>   ..$ : int 10000
#>   ..$ : int 10000
#>   ..$ : int 10000
#>   ..$ : int 10000
#>   ..$ : num NaN
#>   ..$ : int 10000
#>   ..$ : num NaN
#>   ..$ : num NaN
#>   ..$ : int 10000
#>   ..$ : int 10000
#>  $ param_clf__estimator__loss                                   :List of 10
#>   ..$ : chr "log"
#>   ..$ : chr "hinge"
#>   ..$ : chr "log"
#>   ..$ : chr "hinge"
#>   ..$ : num NaN
#>   ..$ : chr "hinge"
#>   ..$ : num NaN
#>   ..$ : num NaN
#>   ..$ : chr "log"
#>   ..$ : chr "log"
#>  $ param_clf__estimator__class_weight                           :List of 10
#>   ..$ : chr "balanced"
#>   ..$ : chr "balanced"
#>   ..$ : chr "balanced"
#>   ..$ : chr "balanced"
#>   ..$ : num NaN
#>   ..$ : chr "balanced"
#>   ..$ : num NaN
#>   ..$ : num NaN
#>   ..$ : NULL
#>   ..$ : NULL
#>  $ param_clf__estimator                                         : chr "SGDClassifier(class_weight='balanced', loss='log', max_iter=10000,\n penalty='elasticnet')" "SGDClassifier(class_weight='balanced', loss='log', max_iter=10000,\n penalty='elasticnet')" "SGDClassifier(class_weight='balanced', loss='log', max_iter=10000,\n penalty='elasticnet')" "SGDClassifier(class_weight='balanced', loss='log', max_iter=10000,\n penalty='elasticnet')" ...
#>  $ param_clf__estimator__alpha                                  :List of 10
#>   ..$ : num NaN
#>   ..$ : num NaN
#>   ..$ : num NaN
#>   ..$ : num NaN
#>   ..$ : num 0.1
#>   ..$ : num NaN
#>   ..$ : num 0.5
#>   ..$ : int 1
#>   ..$ : num NaN
#>   ..$ : num NaN
#>  - attr(*, "pandas.index")=Int64Index([2, 0, 6, 5, 8, 9, 1, 4, 7, 3], dtype='int64')
# Learner performance barplot
text_pipe$p_compare_models_bar
# Predictions on test set
preds <- text_pipe$pred
head(preds)
#> [1] "Care received" "Care received" "Couldn't be improved" #> [4] "Staff" "Care received" "Staff"
# We can also get the row indices of the train and test data. Note that, in
# Python, indices start from 0. For example, the row indices of a data frame
# with 5 rows would be 0, 1, 2, 3 & 4.
head(sort(text_pipe$index_training_data))
#> [1] 0 1 2 3 4 5
head(sort(text_pipe$index_test_data))
#> [1] 7 11 12 19 23 26
# Let's subset the original data set
text_dataset <- pxtextmineR::text_data
rownames(text_dataset) <- 0:(nrow(text_dataset) - 1)

data_train <- text_dataset[text_pipe$index_training_data, ]
data_test <- text_dataset[text_pipe$index_test_data, ]

str(data_train)
#> 'data.frame': 6922 obs. of 3 variables:
#>  $ label      : chr "Staff" "Care received" "Communication" "Couldn't be improved" ...
#>  $ criticality: chr "1" "0" "0" "3" ...
#>  $ feedback   : chr "It’s because that doctors, nurses help me" "What was very good was the fact the Physiotherapist called to see my mother when she was in the area just to se"| __truncated__ "XXXX was a good listener." "Nothing" ...
str(data_test)
#> 'data.frame': 3411 obs. of 3 variables:
#>  $ label      : chr "Communication" "Care received" "Couldn't be improved" "Staff" ...
#>  $ criticality: chr "0" "0" "3" "0" ...
#>  $ feedback   : chr "Patient, listened carefully and very helpful advice. Positive advice and exercises to do.\n" "I found the service very beneficial." "No, I was very with it all." "I find the staff welcoming and caring, making me feel more comfortable in discussing things." ...