Name Detection¶
There are several detectors that can be used to detect names:
- Stanford detector
Best accuracy, but requires Java to be installed
- Spacy v3 detector
Almost as good as Stanford NER, but easier to install
- TextBlob detector
Has a very high false positive rate, use with caution
All of these detectors are optional and so are not enabled by default.
To enable one, you must install its dependencies, import it and finally add it to your Scrubber.
The following sections give an example of this for each detector.
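Since each detector lives behind an optional install, it can help to check which ones are importable before adding them to a Scrubber. The following is a minimal sketch using only the standard library; the module names are the ones imported in the examples on this page, while the helper name `available_name_detectors` and the labels are hypothetical and not part of scrubadub.

```python
import importlib.util

# Top-level module names as imported in the examples on this page.
# The labels on the left are illustrative only.
OPTIONAL_MODULES = {
    "stanford": "scrubadub_stanford",
    "spacy": "scrubadub_spacy",
    "textblob": "scrubadub",
}


def available_name_detectors(modules=OPTIONAL_MODULES):
    """Map each label to True if its top-level module can be found."""
    return {
        label: importlib.util.find_spec(module) is not None
        for label, module in modules.items()
    }


if __name__ == "__main__":
    for label, found in available_name_detectors().items():
        print(f"{label}: {'installed' if found else 'missing'}")
```

This only confirms that the package can be found. It does not check extra requirements such as Java or downloaded models.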
Stanford NER detector¶
To run the Stanford NER detector you will need both Java and the nltk Python package. On Debian Linux, Java can be installed with:
$ apt-get install openjdk-14-jre
The Python dependencies can then be installed with:
$ pip install scrubadub_stanford
Once this has been done, the StanfordEntityDetector can be used with the following:
>>> import scrubadub, scrubadub_stanford
>>> scrubber = scrubadub.Scrubber()
>>> scrubber.add_detector(scrubadub_stanford.detectors.StanfordEntityDetector)
>>> scrubber.clean("My name is John")
'My name is {{NAME}}'
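Because the Stanford detector depends on Java, a quick pre-flight check can give a clearer error than a failure deep inside the tagger. The sketch below only looks for a `java` executable on the PATH using the standard library; the helper name `java_available` is hypothetical, and it is an assumption that the detector finds Java through the PATH.

```python
import shutil


def java_available(executable="java"):
    """Return True if the named executable is found on the PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if java_available():
        print("java found on PATH")
    else:
        print("java not found; install a JRE before using the Stanford detector")
```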
Spacy¶
This is the suggested name detector, since it is easy to install and works well. Spacy v3 requires a Python version >= 3.6 and < 3.9, as Python 3.9 is not yet supported by Spacy.
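The version range quoted above can be checked at runtime before attempting the install. This is a minimal sketch that encodes exactly that range (>= 3.6 and < 3.9); the function name is hypothetical, and the bounds should be adjusted if the supported range has changed since this page was written.

```python
import sys


def python_supported_for_spacy_detector(version=None):
    """Check a version tuple against the >= 3.6 and < 3.9 range quoted above.

    Defaults to the running interpreter when no version is given.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    return (3, 6) <= major_minor < (3, 9)


if __name__ == "__main__":
    print(python_supported_for_spacy_detector())
```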
To install all dependencies for the Spacy detector you can do:
$ pip install scrubadub_spacy
Then to run it you can add it to your Scrubber, like so:
>>> import scrubadub, scrubadub_spacy
>>> scrubber = scrubadub.Scrubber()
>>> scrubber.add_detector(scrubadub_spacy.detectors.SpacyEntityDetector)
>>> scrubber.clean("My name is John")
'My name is {{NAME}}'
It is also possible to enable other entity tags, such as Location and Organisation.
The example below shows this for the StanfordEntityDetector, using the enable_* parameters in its initialiser:
>>> import scrubadub, scrubadub_stanford
>>> scrubber = scrubadub.Scrubber()
>>> scrubber.add_detector(scrubadub_stanford.detectors.StanfordEntityDetector(
...     enable_person=True, enable_organization=True, enable_location=True
... ))
>>> scrubber.clean("My name is John and I work at the United Nations in Geneva")
'My name is {{NAME}} and I work at the {{ORGANIZATION}} in {{LOCATION}}'
TextBlob¶
This detector is not recommended because of its high false positive rate, although it is useful in some situations. Please test it on your data to ensure it works well. It is already included in the base scrubadub package, so you only need scrubadub installed to run it:
$ pip install scrubadub
Then to run it you can add it to your Scrubber, like so:
>>> import scrubadub
>>> scrubber = scrubadub.Scrubber()
>>> scrubber.add_detector(scrubadub.detectors.TextBlobNameDetector)
>>> scrubber.clean("My name is John")
'My name is {{NAME}}'
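One way to test a name detector on your own data is to run it over texts that are known to contain no names and count how often a name placeholder still appears. The sketch below is a rough illustration, not part of scrubadub: `false_positive_rate` is a hypothetical helper that accepts any cleaning callable (for example `scrubber.clean`), and matching on the prefix `{{NAME` is an assumption based on the output shown in the examples above.

```python
def false_positive_rate(clean, name_free_texts, placeholder="{{NAME"):
    """Fraction of name-free texts in which a name placeholder was inserted."""
    texts = list(name_free_texts)
    if not texts:
        raise ValueError("need at least one sample text")
    flagged = sum(1 for text in texts if placeholder in clean(text))
    return flagged / len(texts)


if __name__ == "__main__":
    # Stand-in cleaner for demonstration only; in practice pass scrubber.clean.
    def toy_clean(text):
        return text.replace("Monday", "{{NAME}}")

    samples = ["See you on Monday", "The report is due soon"]
    print(false_positive_rate(toy_clean, samples))  # 0.5
```

A high value on a representative sample suggests that one of the other detectors may be a better fit for that data.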