factory_pipeline_r.Rd
Prepare and fit a text classification pipeline with Scikit-learn.
factory_pipeline_r(
  x,
  y,
  tknz = "spacy",
  ordinal = FALSE,
  metric = "class_balance_accuracy_score",
  cv = 5,
  n_iter = 2,
  n_jobs = 1,
  verbose = 3,
  learners = c("SGDClassifier", "RidgeClassifier", "Perceptron",
               "PassiveAggressiveClassifier", "BernoulliNB", "ComplementNB",
               "MultinomialNB", "RandomForestClassifier"),
  theme = NULL
)
x: Data frame. The text feature.

y: Vector. The response variable.

tknz: Tokenizer to use ("spacy" or "wordnet").

ordinal: Whether to fit an ordinal classification model. The ordinal model is the implementation of Frank and Hall (2001), which can use any standard classification model that calculates probabilities.

metric: String. Scorer to use during pipeline tuning ("accuracy_score", "balanced_accuracy_score", "matthews_corrcoef", "class_balance_accuracy_score").

cv: Number of cross-validation folds.

n_iter: Number of parameter settings that are sampled (see sklearn.model_selection.RandomizedSearchCV).

n_jobs: Number of jobs to run in parallel (see sklearn.model_selection.RandomizedSearchCV).

verbose: Controls the verbosity (see sklearn.model_selection.RandomizedSearchCV).

learners: Vector. The Scikit-learn estimators to fit and benchmark.

theme: String. For internal use by Nottinghamshire Healthcare NHS Foundation Trust or other trusts that use theme labels ("Access", "Environment/ facilities" etc.). The column name of the theme variable. Defaults to NULL.
A fitted Scikit-learn pipeline containing a number of objects that can be accessed with the $ sign (see Examples). For a partial list, see "Attributes" in sklearn.model_selection.RandomizedSearchCV.
Do not be surprised if the pipeline contains more objects than those in the aforementioned "Attributes" list. Python objects can contain several objects, from numeric results (e.g. the pipeline's accuracy) to methods (i.e. functions, in R lingo) and classes. In Python, these are normally accessed with object.<whatever>, but in R the command is object$<whatever>. For instance, one can access the predict() method to make predictions on unseen data. See Examples.
The pipeline's parameter grid switches between two approaches to text classification: Bag-of-Words and Embeddings. For the former, both TF-IDF and raw counts are tried out.
The pipeline does the following:

Feature engineering:

- Converts text into TF-IDFs or GloVe word vectors with spaCy.
- Creates a new feature that is the length of the text in each record.
- Performs sentiment analysis on the text feature and creates new features that are all scores/indicators produced by TextBlob and vaderSentiment.
- Applies sklearn.preprocessing.KBinsDiscretizer to the text length and sentiment indicator features, and sklearn.preprocessing.StandardScaler to the embeddings (word vectors).
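The two simplest engineered features above (text length and its discretization) can be sketched in base R. This is illustrative only: the actual pipeline uses sklearn.preprocessing.KBinsDiscretizer with k-means binning and a tuned bin count, whereas `cut()` with 3 equal-width bins is just a stand-in.

```r
feedback <- c("Fine.",
              "The ward was very noisy at night and I could not sleep.",
              "Excellent staff, thank you.")

text_length <- nchar(feedback)  # length of the text in each record

# Stand-in for sklearn's KBinsDiscretizer: 3 equal-width bins
# (the real pipeline tunes the bin count and uses k-means binning)
length_bin <- cut(text_length, breaks = 3, labels = FALSE)

data.frame(text_length, length_bin)
```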
Up-sampling of rare classes: uses imblearn.over_sampling.RandomOverSampler to up-sample rare classes. Currently, the threshold for considering a class as rare and the up-balancing values are fixed and cannot be user-defined.
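A minimal base-R sketch of what random over-sampling does, assuming a hypothetical rarity threshold of 50 records and a target of 100 records per rare class (the pipeline's actual values are fixed internally and handled by imblearn's RandomOverSampler, so these numbers are illustrative only):

```r
# Returns row indices of an up-sampled data set: every original row, plus
# rows of rare classes re-sampled with replacement up to target_n.
# Assumes target_n is at least the size of each rare class.
oversample_rare <- function(y, rare_threshold = 50, target_n = 100) {
  counts <- table(y)
  idx <- seq_along(y)
  rare <- names(counts)[counts < rare_threshold]
  extra <- unlist(lapply(rare, function(cl) {
    members <- idx[y == cl]
    sample(members, target_n - length(members), replace = TRUE)
  }))
  c(idx, extra)
}

set.seed(1)
y <- c(rep("Staff", 200), rep("Transport", 10))
table(y[oversample_rare(y)])  # "Transport" is up-sampled to 100 records
```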
Tokenization and lemmatization of the text feature: uses spaCy (default) or NLTK. It also strips punctuation, excess spaces, and the metacharacters "\r" and "\n" from the text. It converts emojis into "__text__" (where "text" is the emoji name), and NA/NULL values into "__notext__" (the pipeline does get rid of records with no text, but this conversion at least deals with any that slip through).
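The cleaning steps (but not the spaCy/NLTK tokenization itself) can be approximated in base R. The actual implementation lives in the Python pxtextmining package; the function name below is purely illustrative.

```r
clean_text <- function(x) {
  x <- gsub("[\r\n]", " ", x)             # strip the "\r" and "\n" metacharacters
  x <- gsub("[[:punct:]]", " ", x)        # strip punctuation
  x <- gsub("\\s+", " ", trimws(x))       # collapse excess spaces
  x[is.na(x) | x == ""] <- "__notext__"   # NA/empty records become "__notext__"
  x
}

clean_text(c("Great  care,\r\nthank you!", NA))
```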
Feature selection: uses sklearn.feature_selection.SelectPercentile with sklearn.feature_selection.chi2 for TF-IDFs, or sklearn.feature_selection.f_classif for embeddings.
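The idea behind percentile-based chi-squared selection can be sketched in R. This is a rough analogue, not sklearn's exact computation: each feature's class-wise sums are tested against a uniform split (equivalent to sklearn.feature_selection.chi2 only when classes are balanced, as in this toy data), and the percentile value of 50 is illustrative.

```r
set.seed(42)
counts <- matrix(rpois(200, 2), nrow = 20)  # 20 documents x 10 "token" counts
y <- rep(c("Staff", "Access"), each = 10)

# Chi-squared score per feature: class-wise sums tested against a uniform split
chi2_score <- apply(counts, 2, function(f) {
  suppressWarnings(chisq.test(tapply(f, y, sum))$statistic)
})

# Keep the top 50% of features by score (percentile = 50, illustrative)
keep <- chi2_score >= quantile(chi2_score, 0.5)
which(keep)
```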
Fitting and benchmarking of user-supplied Scikit-learn estimators.
The numeric values in the grid are currently lists/tuples (Python objects) of values that are defined either empirically or based on the published literature (e.g. for Random Forest, see Probst et al. 2018). Values may be replaced by appropriate distributions in a future release.
The pipeline uses the tokenizers of the Python library pxtextmining. Any warnings from Scikit-learn like UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None can therefore be safely ignored.
Any warnings about over-sampling can also be safely ignored. These warnings are simply the result of an internal check in the over-sampler of imblearn.
Frank E. & Hall M. (2001). A Simple Approach to Ordinal Classification. Machine Learning: ECML 2001, 145–156.
Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M. & Duchesnay E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12:2825–2830.
Probst P., Bischl B. & Boulesteix A.-L. (2018). Tunability: Importance of Hyperparameters of Machine Learning Algorithms. https://arxiv.org/abs/1802.09596
# Prepare training and test sets
data_splits <- pxtextmineR::factory_data_load_and_split_r(
  filename = pxtextmineR::text_data,
  target = "label",
  predictor = "feedback",
  test_size = 0.90) # Make a small training set for a faster run in this example

# Let's take a look at the returned list
str(data_splits)
#> List of 6
#>  $ x_train            :'data.frame': 1033 obs. of 1 variable:
#>   ..$ predictor: chr [1:1033] "Nothing to add all fine." "The ward need to look at putting a room somewhere on the ward, for very noisy patients as I've had no sleep for"| __truncated__ "The taps in patient toilets (Burns unit) could be improved to MIXER taps. Hot water v hot n cold water v cold, "| __truncated__ "Fans to keep me cooler and more comfortable." ...
#>   ..- attr(*, "pandas.index")=Int64Index([ 9817,   614, 10189,  8331,  6249,  5726,  8261,  1387,  2850,
#>                2640,
#>             ...
#>              5018,   705,  8383,  6074,  8943,  2585,  1543,  9558,  2067,
#>              7593],
#>            dtype='int64', length=1033)
#>  $ x_test             :'data.frame': 9301 obs. of 1 variable:
#>   ..$ predictor: chr [1:9301] "Treatment could have been ready, have had to wait each time and son in law is sat waiting outside." "You are already the best, everyone. Thank you. " "The noise and chatter was awful, staff stood leaning up against the reception desk talking and laughing and not"| __truncated__ "Nothing springs to mind as I have been treated well. " ...
#>   ..- attr(*, "pandas.index")=Int64Index([10235,   346,  7554,   775,  2435,  4563,  8110,  8056,  5850,
#>                2816,
#>             ...
#>              3958,  4858,  3207,  2398,  5502,  8079,  2797,  7283,  8837,
#>              1766],
#>            dtype='int64', length=9301)
#>  $ y_train            : chr [1:1033(1d)] "Couldn't be improved" "Environment/ facilities" "Communication" "Environment/ facilities" ...
#>   ..- attr(*, "dimnames")=List of 1
#>   .. ..$ : chr [1:1033] "9817" "614" "10189" "8331" ...
#>  $ y_test             : chr [1:9301(1d)] "Access" "Couldn't be improved" "Staff" "Couldn't be improved" ...
#>   ..- attr(*, "dimnames")=List of 1
#>   .. ..$ : chr [1:9301] "10235" "346" "7554" "775" ...
#>  $ index_training_data: int [1:1033] 9817 614 10189 8331 6249 5726 8261 1387 2850 2640 ...
#>  $ index_test_data    : int [1:9301] 10235 346 7554 775 2435 4563 8110 8056 5850 2816 ...

# Fit the pipeline
pipe <- pxtextmineR::factory_pipeline_r(
  x = data_splits$x_train,
  y = data_splits$y_train,
  tknz = "spacy",
  ordinal = FALSE,
  metric = "accuracy_score",
  cv = 2, n_iter = 1, n_jobs = 1, verbose = 3,
  learners = "SGDClassifier"
)

# Mean cross-validated score of the best_estimator
pipe$best_score_
#> [1] 0.5314651

# Best parameters during tuning
pipe$best_params_
#> $sampling__kw_args
#> $sampling__kw_args$up_balancing_counts
#> [1] 300
#>
#>
#> $preprocessor__texttr__text__transformer__use_idf
#> [1] FALSE
#>
#> $preprocessor__texttr__text__transformer__tokenizer
#> <pxtextmining.helpers.tokenization.LemmaTokenizer>
#>
#> $preprocessor__texttr__text__transformer__preprocessor
#> <function text_preprocessor at 0x00000000645663A0>
#>
#> $preprocessor__texttr__text__transformer__norm
#> [1] "l2"
#>
#> $preprocessor__texttr__text__transformer__ngram_range
#> $preprocessor__texttr__text__transformer__ngram_range[[1]]
#> [1] 1
#>
#> $preprocessor__texttr__text__transformer__ngram_range[[2]]
#> [1] 3
#>
#>
#> $preprocessor__texttr__text__transformer__min_df
#> [1] 3
#>
#> $preprocessor__texttr__text__transformer__max_df
#> [1] 0.95
#>
#> $preprocessor__texttr__text__transformer
#> TfidfVectorizer(max_df=0.95, min_df=3, ngram_range=(1, 3),
#>                 preprocessor=<function text_preprocessor at 0x00000000645663A0>,
#>                 tokenizer=<pxtextmining.helpers.tokenization.LemmaTokenizer>,
#>                 use_idf=False)
#>
#> $preprocessor__sentimenttr__scaler__scaler__n_bins
#> [1] 8
#>
#> $preprocessor__sentimenttr__scaler__scaler
#> KBinsDiscretizer(n_bins=8, strategy='kmeans')
#>
#> $preprocessor__lengthtr__scaler__scaler
#> KBinsDiscretizer(n_bins=3, strategy='kmeans')
#>
#> $featsel__selector__score_func
#> <function chi2 at 0x000000006454E4C0>
#>
#> $featsel__selector__percentile
#> [1] 100
#>
#> $featsel__selector
#> SelectPercentile(percentile=100,
#>                  score_func=<function chi2 at 0x000000006454E4C0>)
#>
#> $clf__estimator__penalty
#> [1] "elasticnet"
#>
#> $clf__estimator__max_iter
#> [1] 10000
#>
#> $clf__estimator__loss
#> [1] "log"
#>
#> $clf__estimator__class_weight
#> [1] "balanced"
#>
#> $clf__estimator
#> SGDClassifier(class_weight='balanced', loss='log', max_iter=10000,
#>               penalty='elasticnet')

# Is the best model a linear SVM (loss = "hinge") or logistic regression (loss = "log")?
pipe$best_params_$clf__estimator__loss
#> [1] "log"

# Make predictions on unseen data with the pipe's predict() method
preds <- pipe$predict(data_splits$x_test)
head(preds)
#> [1] "Care received" "Care received" "Staff"         "Care received"
#> [5] "Care received" "Staff"

# Performance on test set
#
# Can be done using the pipe's score() method
pipe$score(data_splits$x_test, data_splits$y_test)
#> [1] 0.5902591

# Or using dplyr
data_splits$y_test %>%
  data.frame() %>%
  dplyr::rename(true = '.') %>%
  dplyr::mutate(
    pred = preds,
    check = true == preds,
    check = sum(check) / nrow(.)
  ) %>%
  dplyr::pull(check) %>%
  unique
#> [1] 0.5902591

# We can also use other metrics, such as the Class Balance Accuracy score
pxtextmineR::class_balance_accuracy_score_r(
  data_splits$y_test,
  preds
)
#> [1] 0.3913339