factory_pipeline_r.Rd
Prepare and fit a text classification pipeline with Scikit-learn.
factory_pipeline_r(
  x,
  y,
  tknz = "spacy",
  ordinal = FALSE,
  metric = "class_balance_accuracy_score",
  cv = 5,
  n_iter = 2,
  n_jobs = 1,
  verbose = 3,
  learners = c("SGDClassifier", "RidgeClassifier", "Perceptron",
               "PassiveAggressiveClassifier", "BernoulliNB", "ComplementNB",
               "MultinomialNB", "RandomForestClassifier"),
  theme = NULL
)
x: Data frame. The text feature.

y: Vector. The response variable.

tknz: Tokenizer to use ("spacy" or "wordnet").

ordinal: Whether to fit an ordinal classification model. The ordinal model is the implementation of Frank and Hall (2001), which can use any standard classification model that calculates probabilities.

metric: String. Scorer to use during pipeline tuning ("accuracy_score", "balanced_accuracy_score", "matthews_corrcoef", "class_balance_accuracy_score").

cv: Number of cross-validation folds.

n_iter: Number of parameter settings that are sampled (see sklearn.model_selection.RandomizedSearchCV).

n_jobs: Number of jobs to run in parallel (see sklearn.model_selection.RandomizedSearchCV).

verbose: Controls the verbosity (see sklearn.model_selection.RandomizedSearchCV).

learners: Vector. The Scikit-learn estimators to fit and benchmark.

theme: String. For internal use by Nottinghamshire Healthcare NHS Foundation Trust or other trusts that use theme labels ("Access", "Environment/ facilities" etc.). The column name of the theme variable. Defaults to NULL.
A fitted Scikit-learn pipeline containing a number of objects that can be accessed with the $ sign (see Examples). For a partial list, see "Attributes" in sklearn.model_selection.RandomizedSearchCV.
Do not be surprised if the pipeline contains more objects than those in the aforementioned "Attributes" list. Python objects can contain several objects, from numeric results (e.g. the pipeline's accuracy) to methods (i.e. functions, in R lingo) and classes. In Python, these are normally accessed with object.<whatever>, but in R the command is object$<whatever>. For instance, one can access the predict() method to make predictions on unseen data. See Examples.
The pipeline's parameter grid switches between two approaches to text classification: Bag-of-Words and Embeddings. For the former, both TF-IDF and raw counts are tried out.
The pipeline does the following:

Feature engineering:

- Converts text into TF-IDFs or GloVe word vectors with spaCy.
- Creates a new feature that is the length of the text in each record.
- Performs sentiment analysis on the text feature and creates new features that are all scores/indicators produced by TextBlob and vaderSentiment.
- Applies sklearn.preprocessing.KBinsDiscretizer to the text length and sentiment indicator features, and sklearn.preprocessing.StandardScaler to the embeddings (word vectors).
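The two simplest engineered features above (text length and its discretization) can be sketched in base R. This is illustrative only: the actual pipeline uses sklearn.preprocessing.KBinsDiscretizer with k-means binning and a tuned bin count, whereas `cut()` with 3 equal-width bins is just a stand-in.

```r
feedback <- c("Fine.",
              "The ward was very noisy at night and I could not sleep.",
              "Excellent staff, thank you.")

text_length <- nchar(feedback)  # length of the text in each record

# Stand-in for sklearn's KBinsDiscretizer: 3 equal-width bins
# (the real pipeline tunes the bin count and uses k-means binning)
length_bin <- cut(text_length, breaks = 3, labels = FALSE)

data.frame(text_length, length_bin)
```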
Up-sampling of rare classes: uses imblearn.over_sampling.RandomOverSampler to up-sample rare classes. Currently, the threshold for considering a class as rare and the up-balancing values are fixed and cannot be user-defined.
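A minimal base-R sketch of what random over-sampling does, assuming a hypothetical rarity threshold of 50 records and a target of 100 records per rare class (the pipeline's actual values are fixed internally and handled by imblearn's RandomOverSampler, so these numbers are illustrative only):

```r
# Returns row indices of an up-sampled data set: every original row, plus
# rows of rare classes re-sampled with replacement up to target_n.
# Assumes target_n is at least the size of each rare class.
oversample_rare <- function(y, rare_threshold = 50, target_n = 100) {
  counts <- table(y)
  idx <- seq_along(y)
  rare <- names(counts)[counts < rare_threshold]
  extra <- unlist(lapply(rare, function(cl) {
    members <- idx[y == cl]
    sample(members, target_n - length(members), replace = TRUE)
  }))
  c(idx, extra)
}

set.seed(1)
y <- c(rep("Staff", 200), rep("Transport", 10))
table(y[oversample_rare(y)])  # "Transport" is up-sampled to 100 records
```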
Tokenization and lemmatization of the text feature: uses spaCy (default) or NLTK. It also strips punctuation, excess spaces, and the metacharacters "\r" and "\n" from the text. It converts emojis into "__text__" (where "text" is the emoji name), and NA/NULL values into "__notext__" (the pipeline does get rid of records with no text, but this conversion at least deals with any that slip through).
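The cleaning steps (but not the spaCy/NLTK tokenization itself) can be approximated in base R. The actual implementation lives in the Python pxtextmining package; the function name below is purely illustrative.

```r
clean_text <- function(x) {
  x <- gsub("[\r\n]", " ", x)             # strip the "\r" and "\n" metacharacters
  x <- gsub("[[:punct:]]", " ", x)        # strip punctuation
  x <- gsub("\\s+", " ", trimws(x))       # collapse excess spaces
  x[is.na(x) | x == ""] <- "__notext__"   # NA/empty records become "__notext__"
  x
}

clean_text(c("Great  care,\r\nthank you!", NA))
```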
Feature selection: uses sklearn.feature_selection.SelectPercentile with sklearn.feature_selection.chi2 for TF-IDFs, or sklearn.feature_selection.f_classif for embeddings.
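The idea behind percentile-based chi-squared selection can be sketched in R. This is a rough analogue, not sklearn's exact computation: each feature's class-wise sums are tested against a uniform split (equivalent to sklearn.feature_selection.chi2 only when classes are balanced, as in this toy data), and the percentile value of 50 is illustrative.

```r
set.seed(42)
counts <- matrix(rpois(200, 2), nrow = 20)  # 20 documents x 10 "token" counts
y <- rep(c("Staff", "Access"), each = 10)

# Chi-squared score per feature: class-wise sums tested against a uniform split
chi2_score <- apply(counts, 2, function(f) {
  suppressWarnings(chisq.test(tapply(f, y, sum))$statistic)
})

# Keep the top 50% of features by score (percentile = 50, illustrative)
keep <- chi2_score >= quantile(chi2_score, 0.5)
which(keep)
```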
Fitting and benchmarking of user-supplied Scikit-learn estimators.
The numeric values in the grid are currently lists/tuples (Python objects) of values that are defined either empirically or based on the published literature (e.g. for Random Forest, see Probst et al. 2018). Values may be replaced by appropriate distributions in a future release.
The pipeline uses the tokenizers of the Python library pxtextmining. Any warnings from Scikit-learn like UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None can therefore be safely ignored.
Any warnings about over-sampling can also be safely ignored. These warnings are simply the result of an internal check in the over-sampler of imblearn.
Frank E. & Hall M. (2001). A Simple Approach to Ordinal Classification. Machine Learning: ECML 2001, 145–156.
Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M. & Duchesnay E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12:2825–2830.
Probst P., Bischl B. & Boulesteix A.-L. (2018). Tunability: Importance of Hyperparameters of Machine Learning Algorithms. https://arxiv.org/abs/1802.09596
# Prepare training and test sets
data_splits <- pxtextmineR::factory_data_load_and_split_r(
  filename = pxtextmineR::text_data,
  target = "label",
  predictor = "feedback",
  test_size = 0.90) # Make a small training set for a faster run in this example

# Let's take a look at the returned list
str(data_splits)
#> List of 6
#>  $ x_train            :'data.frame': 1033 obs. of 1 variable:
#>   ..$ predictor: chr [1:1033] "Nothing to add all fine." "The ward need to look at putting a room somewhere on the ward, for very noisy patients as I've had no sleep for"| __truncated__ "The taps in patient toilets (Burns unit) could be improved to MIXER taps. Hot water v hot n cold water v cold, "| __truncated__ "Fans to keep me cooler and more comfortable." ...
#>   ..- attr(*, "pandas.index")=Int64Index([ 9817,   614, 10189,  8331,  6249,  5726,  8261,  1387,  2850,
#>                2640,
#>             ...
#>              5018,   705,  8383,  6074,  8943,  2585,  1543,  9558,  2067,
#>              7593],
#>            dtype='int64', length=1033)
#>  $ x_test             :'data.frame': 9301 obs. of 1 variable:
#>   ..$ predictor: chr [1:9301] "Treatment could have been ready, have had to wait each time and son in law is sat waiting outside." "You are already the best, everyone. Thank you. " "The noise and chatter was awful, staff stood leaning up against the reception desk talking and laughing and not"| __truncated__ "Nothing springs to mind as I have been treated well. " ...
#>   ..- attr(*, "pandas.index")=Int64Index([10235,   346,  7554,   775,  2435,  4563,  8110,  8056,  5850,
#>                2816,
#>             ...
#>              3958,  4858,  3207,  2398,  5502,  8079,  2797,  7283,  8837,
#>              1766],
#>            dtype='int64', length=9301)
#>  $ y_train            : chr [1:1033(1d)] "Couldn't be improved" "Environment/ facilities" "Communication" "Environment/ facilities" ...
#>   ..- attr(*, "dimnames")=List of 1
#>   .. ..$ : chr [1:1033] "9817" "614" "10189" "8331" ...
#>  $ y_test             : chr [1:9301(1d)] "Access" "Couldn't be improved" "Staff" "Couldn't be improved" ...
#>   ..- attr(*, "dimnames")=List of 1
#>   .. ..$ : chr [1:9301] "10235" "346" "7554" "775" ...
#>  $ index_training_data: int [1:1033] 9817 614 10189 8331 6249 5726 8261 1387 2850 2640 ...
#>  $ index_test_data    : int [1:9301] 10235 346 7554 775 2435 4563 8110 8056 5850 2816 ...

# Fit the pipeline
pipe <- pxtextmineR::factory_pipeline_r(
  x = data_splits$x_train,
  y = data_splits$y_train,
  tknz = "spacy",
  ordinal = FALSE,
  metric = "accuracy_score",
  cv = 2, n_iter = 1, n_jobs = 1, verbose = 3,
  learners = "SGDClassifier"
)

# Mean cross-validated score of the best_estimator
pipe$best_score_
#> [1] 0.5314651

# Best parameters during tuning
pipe$best_params_
#> $sampling__kw_args
#> $sampling__kw_args$up_balancing_counts
#> [1] 300
#>
#>
#> $preprocessor__texttr__text__transformer__use_idf
#> [1] FALSE
#>
#> $preprocessor__texttr__text__transformer__tokenizer
#> <pxtextmining.helpers.tokenization.LemmaTokenizer>
#>
#> $preprocessor__texttr__text__transformer__preprocessor
#> <function text_preprocessor at 0x00000000645663A0>
#>
#> $preprocessor__texttr__text__transformer__norm
#> [1] "l2"
#>
#> $preprocessor__texttr__text__transformer__ngram_range
#> $preprocessor__texttr__text__transformer__ngram_range[[1]]
#> [1] 1
#>
#> $preprocessor__texttr__text__transformer__ngram_range[[2]]
#> [1] 3
#>
#>
#> $preprocessor__texttr__text__transformer__min_df
#> [1] 3
#>
#> $preprocessor__texttr__text__transformer__max_df
#> [1] 0.95
#>
#> $preprocessor__texttr__text__transformer
#> TfidfVectorizer(max_df=0.95, min_df=3, ngram_range=(1, 3),
#>                 preprocessor=<function text_preprocessor at 0x00000000645663A0>,
#>                 tokenizer=<pxtextmining.helpers.tokenization.LemmaTokenizer>,
#>                 use_idf=False)
#>
#> $preprocessor__sentimenttr__scaler__scaler__n_bins
#> [1] 8
#>
#> $preprocessor__sentimenttr__scaler__scaler
#> KBinsDiscretizer(n_bins=8, strategy='kmeans')
#>
#> $preprocessor__lengthtr__scaler__scaler
#> KBinsDiscretizer(n_bins=3, strategy='kmeans')
#>
#> $featsel__selector__score_func
#> <function chi2 at 0x000000006454E4C0>
#>
#> $featsel__selector__percentile
#> [1] 100
#>
#> $featsel__selector
#> SelectPercentile(percentile=100,
#>                  score_func=<function chi2 at 0x000000006454E4C0>)
#>
#> $clf__estimator__penalty
#> [1] "elasticnet"
#>
#> $clf__estimator__max_iter
#> [1] 10000
#>
#> $clf__estimator__loss
#> [1] "log"
#>
#> $clf__estimator__class_weight
#> [1] "balanced"
#>
#> $clf__estimator
#> SGDClassifier(class_weight='balanced', loss='log', max_iter=10000,
#>               penalty='elasticnet')

# Is the best model a linear SVM (loss = "hinge") or logistic regression (loss = "log")?
pipe$best_params_$clf__estimator__loss
#> [1] "log"

# Make predictions on unseen data with the pipe's predict() method
preds <- pipe$predict(data_splits$x_test)
head(preds)
#> [1] "Care received" "Care received" "Staff"         "Care received"
#> [5] "Care received" "Staff"

# Performance on test set
#
# Can be done using the pipe's score() method
pipe$score(data_splits$x_test, data_splits$y_test)
#> [1] 0.5902591

# Or using dplyr
data_splits$y_test %>%
  data.frame() %>%
  dplyr::rename(true = '.') %>%
  dplyr::mutate(
    pred = preds,
    check = true == preds,
    check = sum(check) / nrow(.)
  ) %>%
  dplyr::pull(check) %>%
  unique
#> [1] 0.5902591

# We can also use other metrics, such as the Class Balance Accuracy score
pxtextmineR::class_balance_accuracy_score_r(
  data_splits$y_test,
  preds
)
#> [1] 0.3913339