An R wrapper for Python’s pxtextmining library, a pipeline to classify text-based patient experience data.
Function documentation: https://nhs-r-community.github.io/pxtextmineR/.
Package pxtextmineR does not wrap everything in pxtextmining, only selected functions that offer R users new opportunities for modelling. For example, the whole Scikit-learn (Pedregosa et al., 2011) text classification pipeline is wrapped, as are helper functions for tasks such as sentiment analysis with Python’s textBlob and vaderSentiment.
How does the wrapper work? It uses R package reticulate, which provides tools for interoperability between Python and R.
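As a rough illustration of what reticulate makes possible, the sketch below calls Python’s textblob package from R. It is not code taken from pxtextmineR: the helper name text_polarity is hypothetical, and the sketch assumes that textblob is installed in the Python environment that reticulate is using.

```r
# Hypothetical helper for illustration only; it is not a pxtextmineR function.
# Assumes the Python package textblob is installed in the active environment.
text_polarity <- function(text) {
  # import() loads a Python module and returns it as an R object.
  textblob <- reticulate::import("textblob")
  # Python attributes and methods are reached with `$` instead of `.`.
  blob <- textblob$TextBlob(text)
  # Polarity is a number between -1 (negative) and 1 (positive).
  blob$polarity
}
```

Calling text_polarity("The staff were kind and helpful.") would hand the text to Python and return the score as an ordinary R numeric.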
There are a few things that need to be done to install and set up pxtextmineR.
Run devtools::install_github("nhs-r-community/pxtextmineR") in the R console.
Create a Python virtual environment. If you are not familiar with virtual environments, it is worth reading an introduction to them first. R package reticulate has functions to create a Python virtual environment via the R console: see reticulate::conda_create and reticulate::virtualenv_create. For example, if using Conda, run
reticulate::conda_create("r-reticulate")
where r-reticulate is the name of reticulate’s default virtual environment. Using this default virtual environment for pxtextmineR is strongly recommended because it makes the setup much easier. In the reticulate authors’ own words, “[i]t’s much more straightforward for users if there is a common environment used by R packages […]”.
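Tell reticulate to use the r-reticulate virtual environment (if not using Conda, use reticulate::use_virtualenv with the same arguments instead):
reticulate::use_condaenv("r-reticulate", required = TRUE)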
Install Python package pxtextmining in r-reticulate:
reticulate::py_install(envname = "r-reticulate", packages = "pxtextmining", pip = TRUE)
We also need to install a couple of spaCy models in r-reticulate. These are downloaded from URLs rather than from the package index, so they need to be installed separately. In the R console run:
system("pip install wheel")
system("pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz")
system("pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.3.1/en_core_web_lg-2.3.1.tar.gz")
All steps in one go:
devtools::install_github("nhs-r-community/pxtextmineR")
# If not using Conda, comment out the next two lines and uncomment the two lines
# following them.
reticulate::conda_create("r-reticulate")
reticulate::use_condaenv("r-reticulate", required = TRUE)
# reticulate::virtualenv_create("r-reticulate")
# reticulate::use_virtualenv("r-reticulate", required = TRUE)
reticulate::py_install(envname = "r-reticulate", packages = "pxtextmining", pip = TRUE)
system("pip install wheel")
system("pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz")
system("pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.3.1/en_core_web_lg-2.3.1.tar.gz")
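After running the steps above, a quick check along the following lines can confirm that the Python side is in place. This is only a sketch: check_python_modules is a hypothetical helper, not part of pxtextmineR or reticulate, and it assumes reticulate has already been pointed at the intended environment.

```r
# Hypothetical helper, not part of pxtextmineR or reticulate.
# Returns a named logical vector: TRUE where the Python module can be found
# in the environment that reticulate is currently using.
check_python_modules <- function(modules = c("pxtextmining",
                                             "en_core_web_sm",
                                             "en_core_web_lg")) {
  vapply(modules, reticulate::py_module_available, logical(1))
}
```

Call check_python_modules() after reticulate::use_condaenv (or reticulate::use_virtualenv). Any FALSE entry points to the install step that needs repeating.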
The installation instructions above did not work on all machines on which the installation process was tested. There were two problems:
reticulate would simply refuse to install in virtual environment r-reticulate the version of Scikit-learn that pxtextmining uses (v 0.23.2).
When a virtual environment other than r-reticulate was used (i.e. reticulate::use_condaenv("<some_other_virtual_environment>", required = TRUE)), the behaviour of reticulate was confusing. On the one hand, it would run pxtextmineR functions using the user-specified virtual environment. On the other hand, when running commands to build e.g. function documentation with R package pkgdown, reticulate would automatically set r-reticulate as the default environment, causing the code to break.
We have opted for a more “invasive” approach to fix this problem so that users can use any virtual environment with no issues. This requires the following steps:
Create a Python virtual environment using e.g. Anaconda, Miniconda or a Virtual Python Environment.
Activate it and install pxtextmining and the spaCy models:
pip install pxtextmining
pip install wheel
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.3.1/en_core_web_lg-2.3.1.tar.gz
Use a text editor to open your .Renviron file, normally located in ~/.Renviron, and add the following lines:
PXTEXTMINER_PYTHON_VENV_MANAGER=name_or_path_to_venv_manager
PXTEXTMINER_PYTHON_VENV=name_of_venv
where "name_of_venv" should be replaced by the name of the virtual environment (unquoted) and "name_or_path_to_venv_manager" should be replaced by the name of the virtual environment manager or the path to the virtual environment (unquoted). In more detail:
If using Conda or Miniconda, replace "name_or_path_to_venv_manager" with "conda" or "miniconda" (unquoted) respectively.
If using a Virtual Python Environment, replace "name_or_path_to_venv_manager" with the path to the virtual environment, e.g. home/user/venvs/myvenv.
It is a good idea to restart RStudio at this point so that the new .Renviron settings are picked up.
Run devtools::install_github("nhs-r-community/pxtextmineR") in the R console.
Again, it is a good idea to restart RStudio. If there are error messages saying that the user-specified Python environment cannot be set, close and re-open RStudio.
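If the environment still cannot be set after a restart, one thing worth checking is whether the two .Renviron variables are actually visible to the R session. The sketch below does that with base R only. The helper name check_pxtextminer_env is hypothetical and not part of pxtextmineR.

```r
# Hypothetical helper, not part of pxtextmineR.
# Reports the values of the two variables that the steps above add to
# .Renviron, and warns about any that the current R session cannot see.
check_pxtextminer_env <- function() {
  vars <- c("PXTEXTMINER_PYTHON_VENV_MANAGER", "PXTEXTMINER_PYTHON_VENV")
  # unset = NA makes a missing variable come back as NA rather than "".
  values <- Sys.getenv(vars, unset = NA)
  missing <- vars[is.na(values) | values == ""]
  if (length(missing) > 0) {
    warning("Not set in this session: ", paste(missing, collapse = ", "))
  }
  values
}

print(check_pxtextminer_env())
```

If both variables show the expected values, running reticulate::py_config() afterwards shows which Python installation reticulate has bound to, which helps narrow down where the mismatch is.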
Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M. & Duchesnay E. (2011), Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12:2825–2830.