Prepare and fit a text classification pipeline with Scikit-learn.

factory_pipeline_r(
  x,
  y,
  tknz = "spacy",
  ordinal = FALSE,
  metric = "class_balance_accuracy_score",
  cv = 5,
  n_iter = 2,
  n_jobs = 1,
  verbose = 3,
  learners = c("SGDClassifier", "RidgeClassifier", "Perceptron",
    "PassiveAggressiveClassifier", "BernoulliNB", "ComplementNB", "MultinomialNB",
    "RandomForestClassifier"),
  theme = NULL
)

Arguments

x

Data frame. The text feature.

y

Vector. The response variable.

tknz

Tokenizer to use ("spacy" or "wordnet").

ordinal

Logical. Whether to fit an ordinal classification model. The ordinal model implements the approach of Frank and Hall (2001) and can use any standard classification model that calculates probabilities.

metric

String. Scorer to use during pipeline tuning ("accuracy_score", "balanced_accuracy_score", "matthews_corrcoef", "class_balance_accuracy_score").

cv

Number of cross-validation folds.

n_iter

Number of parameter settings that are sampled (see sklearn.model_selection.RandomizedSearchCV).

n_jobs

Number of jobs to run in parallel (see sklearn.model_selection.RandomizedSearchCV). NOTE: an error is returned if n_jobs exceeds the number of cores available on your machine.

verbose

Controls the verbosity (see sklearn.model_selection.RandomizedSearchCV).

learners

Vector. Scikit-learn names of the learners to tune. Must be one or more of "SGDClassifier", "RidgeClassifier", "Perceptron", "PassiveAggressiveClassifier", "BernoulliNB", "ComplementNB", "MultinomialNB", "KNeighborsClassifier", "NearestCentroid", "RandomForestClassifier". When a single model is used, it can be passed as a string.

theme

String. For internal use by Nottinghamshire Healthcare NHS Foundation Trust or other trusts that use theme labels ("Access", "Environment/ facilities" etc.). The column name of the theme variable. Defaults to NULL. If supplied, the theme variable will be used as a predictor (along with the text predictor) in the model that is fitted with criticality as the response variable. The rationale is two-fold: first, to help the model improve its predictions on criticality when theme labels are readily available; second, to force the criticality for "Couldn't be improved" to always be "3" in the training and test data, as well as in the predictions. This is the only criticality value that "Couldn't be improved" can take, so forcing it to always be "3" both improves model performance and corrects erroneous assignments of values other than "3" that are attributable to human error.
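As an illustration of the class balance accuracy option of the metric argument above, here is a minimal base-R sketch assuming one common definition (each class's confusion-matrix diagonal divided by the larger of its row and column totals, averaged over classes). The package's class_balance_accuracy_score_r() is the authoritative implementation and may differ in detail:

```r
# Sketch of a class-balance-style accuracy on a confusion matrix.
# Assumption: CBA = mean over classes of c_ii / max(row total, column total).
class_balance_accuracy <- function(y_true, y_pred) {
  classes <- sort(unique(c(y_true, y_pred)))
  cm <- table(factor(y_true, levels = classes),
              factor(y_pred, levels = classes))
  per_class <- diag(cm) / pmax(rowSums(cm), colSums(cm))
  mean(per_class)
}

y_true <- c("a", "a", "a", "b", "b", "c")
y_pred <- c("a", "a", "b", "b", "b", "c")
class_balance_accuracy(y_true, y_pred)  # 7/9, vs. plain accuracy of 5/6
```

Unlike plain accuracy, this kind of metric penalises classifiers that do well only on the majority classes, which is why it is the default for the imbalanced feedback data.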

Value

A fitted Scikit-learn pipeline containing a number of objects that can be accessed with the $ sign (see Examples). For a partial list, see "Attributes" in sklearn.model_selection.RandomizedSearchCV. Do not be surprised if the pipeline contains more objects than those in the aforementioned "Attributes" list. Python objects can contain several kinds of members, from numeric results (e.g. the pipeline's accuracy) to methods (i.e. functions, in R lingo) and classes. In Python, these are normally accessed with object.<whatever>, but in R the command is object$<whatever>. For instance, one can access the predict() method to make predictions on unseen data. See Examples.

Details

The pipeline's parameter grid switches between two approaches to text classification: Bag-of-Words and Embeddings. For the former, both TF-IDF and raw counts are tried out.
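As a toy illustration of the two Bag-of-Words weightings, the sketch below builds raw counts and TF-IDF weights in base R for two made-up documents, using the smoothed IDF formula of Scikit-learn's TfidfVectorizer, idf = log((1 + n) / (1 + df)) + 1. The pipeline's vectorizer does all of this internally (and additionally L2-normalises rows by default, which is omitted here):

```r
# Toy document-term counts vs TF-IDF weights for two short documents.
docs <- c("good ward staff", "good food")
tokens <- strsplit(docs, " ")
vocab <- sort(unique(unlist(tokens)))

# Raw counts (Bag-of-Words)
counts <- t(vapply(tokens, function(d) {
  vapply(vocab, function(w) sum(d == w), numeric(1))
}, numeric(length(vocab))))

# Smoothed IDF as used by sklearn's TfidfVectorizer:
# idf = log((1 + n_docs) / (1 + df)) + 1
df  <- colSums(counts > 0)
idf <- log((1 + length(docs)) / (1 + df)) + 1
tfidf <- sweep(counts, 2, idf, `*`)
```

Terms appearing in every document ("good" here) get idf = 1, so TF-IDF down-weights them relative to rarer, more discriminative terms.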

The numeric values in the grid are currently lists/tuples (Python objects) of values that are defined either empirically or based on the published literature (e.g. for Random Forest, see Probst et al. 2019). Values may be replaced by appropriate distributions in a future release.
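To make the ordinal = TRUE option concrete: Frank and Hall (2001) decompose a k-class ordinal problem into k - 1 binary problems estimating P(y > i), then recombine those estimates into class probabilities. A minimal base-R sketch of the recombination step (the binary probabilities below are made up for illustration; the actual pipeline handles this internally):

```r
# Frank & Hall (2001): recombine k - 1 binary estimates P(y > i)
# into k ordinal class probabilities.
# p_gt[i] is P(y > i) for ordered classes 1..k.
frank_hall_probs <- function(p_gt) {
  k <- length(p_gt) + 1
  p <- numeric(k)
  p[1] <- 1 - p_gt[1]                 # P(y = 1) = 1 - P(y > 1)
  if (k > 2) {
    for (i in 2:(k - 1)) {
      p[i] <- p_gt[i - 1] - p_gt[i]   # P(y = i) = P(y > i-1) - P(y > i)
    }
  }
  p[k] <- p_gt[k - 1]                 # P(y = k) = P(y > k-1)
  p
}

# Hypothetical binary estimates for a 4-class ordinal response
p <- frank_hall_probs(c(0.9, 0.6, 0.2))
p        # 0.1 0.3 0.4 0.2
```

This is why any classifier that outputs probabilities can serve as the base learner for the ordinal model.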

Note

The pipeline uses the tokenizers of Python library pxtextmining. Any warnings from Scikit-learn like "UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None" can therefore be safely ignored.

Likewise, any warnings about over-sampling can be safely ignored. They are simply the result of an internal check in the over-sampler of imblearn.

References

Frank E. & Hall M. (2001). A Simple Approach to Ordinal Classification. Machine Learning: ECML 2001, 145–156.

Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M. & Duchesnay E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12:2825–2830.

Probst P., Bischl B. & Boulesteix A-L (2019). Tunability: Importance of Hyperparameters of Machine Learning Algorithms. Journal of Machine Learning Research 20(53):1–32. https://arxiv.org/abs/1802.09596

Examples

# Prepare training and test sets
data_splits <- pxtextmineR::factory_data_load_and_split_r(
  filename = pxtextmineR::text_data,
  target = "label",
  predictor = "feedback",
  test_size = 0.90) # Make a small training set for a faster run in this example

# Let's take a look at the returned list
str(data_splits)
#> List of 6
#>  $ x_train :'data.frame': 1033 obs. of 1 variable:
#>   ..$ predictor: chr [1:1033] "Nothing to add all fine." "The ward need to look at putting a room somewhere on the ward, for very noisy patients as I've had no sleep for"| __truncated__ "The taps in patient toilets (Burns unit) could be improved to MIXER taps. Hot water v hot n cold water v cold, "| __truncated__ "Fans to keep me cooler and more comfortable." ...
#>   ..- attr(*, "pandas.index")=Int64Index([ 9817, 614, 10189, 8331, 6249, 5726, 8261, 1387, 2850,
#>             2640,
#>             ...
#>             5018, 705, 8383, 6074, 8943, 2585, 1543, 9558, 2067,
#>             7593],
#>            dtype='int64', length=1033)
#>  $ x_test :'data.frame': 9301 obs. of 1 variable:
#>   ..$ predictor: chr [1:9301] "Treatment could have been ready, have had to wait each time and son in law is sat waiting outside." "You are already the best, everyone. Thank you. " "The noise and chatter was awful, staff stood leaning up against the reception desk talking and laughing and not"| __truncated__ "Nothing springs to mind as I have been treated well. " ...
#>   ..- attr(*, "pandas.index")=Int64Index([10235, 346, 7554, 775, 2435, 4563, 8110, 8056, 5850,
#>             2816,
#>             ...
#>             3958, 4858, 3207, 2398, 5502, 8079, 2797, 7283, 8837,
#>             1766],
#>            dtype='int64', length=9301)
#>  $ y_train : chr [1:1033(1d)] "Couldn't be improved" "Environment/ facilities" "Communication" "Environment/ facilities" ...
#>   ..- attr(*, "dimnames")=List of 1
#>   .. ..$ : chr [1:1033] "9817" "614" "10189" "8331" ...
#>  $ y_test : chr [1:9301(1d)] "Access" "Couldn't be improved" "Staff" "Couldn't be improved" ...
#>   ..- attr(*, "dimnames")=List of 1
#>   .. ..$ : chr [1:9301] "10235" "346" "7554" "775" ...
#>  $ index_training_data: int [1:1033] 9817 614 10189 8331 6249 5726 8261 1387 2850 2640 ...
#>  $ index_test_data : int [1:9301] 10235 346 7554 775 2435 4563 8110 8056 5850 2816 ...
# Fit the pipeline
pipe <- pxtextmineR::factory_pipeline_r(
  x = data_splits$x_train,
  y = data_splits$y_train,
  tknz = "spacy",
  ordinal = FALSE,
  metric = "accuracy_score",
  cv = 2,
  n_iter = 1,
  n_jobs = 1,
  verbose = 3,
  learners = "SGDClassifier"
)

# Mean cross-validated score of the best_estimator
pipe$best_score_
#> [1] 0.5314651
# Best parameters during tuning
pipe$best_params_
#> $sampling__kw_args
#> $sampling__kw_args$up_balancing_counts
#> [1] 300
#>
#>
#> $preprocessor__texttr__text__transformer__use_idf
#> [1] FALSE
#>
#> $preprocessor__texttr__text__transformer__tokenizer
#> <pxtextmining.helpers.tokenization.LemmaTokenizer>
#>
#> $preprocessor__texttr__text__transformer__preprocessor
#> <function text_preprocessor at 0x00000000645663A0>
#>
#> $preprocessor__texttr__text__transformer__norm
#> [1] "l2"
#>
#> $preprocessor__texttr__text__transformer__ngram_range
#> $preprocessor__texttr__text__transformer__ngram_range[[1]]
#> [1] 1
#>
#> $preprocessor__texttr__text__transformer__ngram_range[[2]]
#> [1] 3
#>
#>
#> $preprocessor__texttr__text__transformer__min_df
#> [1] 3
#>
#> $preprocessor__texttr__text__transformer__max_df
#> [1] 0.95
#>
#> $preprocessor__texttr__text__transformer
#> TfidfVectorizer(max_df=0.95, min_df=3, ngram_range=(1, 3),
#>                 preprocessor=<function text_preprocessor at 0x00000000645663A0>,
#>                 tokenizer=<pxtextmining.helpers.tokenization.LemmaTokenizer>,
#>                 use_idf=False)
#>
#> $preprocessor__sentimenttr__scaler__scaler__n_bins
#> [1] 8
#>
#> $preprocessor__sentimenttr__scaler__scaler
#> KBinsDiscretizer(n_bins=8, strategy='kmeans')
#>
#> $preprocessor__lengthtr__scaler__scaler
#> KBinsDiscretizer(n_bins=3, strategy='kmeans')
#>
#> $featsel__selector__score_func
#> <function chi2 at 0x000000006454E4C0>
#>
#> $featsel__selector__percentile
#> [1] 100
#>
#> $featsel__selector
#> SelectPercentile(percentile=100,
#>                  score_func=<function chi2 at 0x000000006454E4C0>)
#>
#> $clf__estimator__penalty
#> [1] "elasticnet"
#>
#> $clf__estimator__max_iter
#> [1] 10000
#>
#> $clf__estimator__loss
#> [1] "log"
#>
#> $clf__estimator__class_weight
#> [1] "balanced"
#>
#> $clf__estimator
#> SGDClassifier(class_weight='balanced', loss='log', max_iter=10000,
#>               penalty='elasticnet')
#>
# Is the best model a linear SVM (loss = "hinge") or logistic regression (loss = "log")?
pipe$best_params_$clf__estimator__loss
#> [1] "log"
# Make predictions
preds <- pipe$predict(data_splits$x_test)
head(preds)
#> [1] "Care received" "Care received" "Staff"         "Care received"
#> [5] "Care received" "Staff"
# Performance on test set
#
# Can be done using the pipe's score() method
pipe$score(data_splits$x_test, data_splits$y_test)
#> [1] 0.5902591
# Or using dplyr
data_splits$y_test %>%
  data.frame() %>%
  dplyr::rename(true = '.') %>%
  dplyr::mutate(
    pred = preds,
    check = true == preds,
    check = sum(check) / nrow(.)
  ) %>%
  dplyr::pull(check) %>%
  unique
#> [1] 0.5902591
# We can also use other metrics, such as the Class Balance Accuracy score
pxtextmineR::class_balance_accuracy_score_r(
  data_splits$y_test,
  preds
)
#> [1] 0.3913339