scrubadub.comparison

Filth objects are responsible for marking particular sections of text as containing that type of filth. They are also responsible for knowing how they should be cleaned. Every type of Filth inherits from scrubadub.filth.base.Filth.

scrubadub.comparison.get_filth_classification_report(filth_list: List[scrubadub.filth.base.Filth], combine_detectors: bool = False, groupby_documents: bool = False, output_dict: bool = False) → Optional[Union[str, Dict[str, float]]]

Evaluates the performance of detectors using KnownFilth.

An example of using this is shown below:

>>> import scrubadub, scrubadub.comparison, scrubadub.detectors.text_blob
>>> scrubber = scrubadub.Scrubber(detector_list=[
...     scrubadub.detectors.text_blob.TextBlobNameDetector(name='name_detector'),
...     scrubadub.detectors.TaggedEvaluationFilthDetector([
...         {'match': 'Tom', 'filth_type': 'name'},
...         {'match': 'tom@example.com', 'filth_type': 'email'},
...     ]),
... ])
>>> filth_list = list(scrubber.iter_filth("Hello I am Tom"))
>>> print(scrubadub.comparison.get_filth_classification_report(filth_list))
filth    detector         locale      precision    recall  f1-score   support

name     name_detector    en_US            1.00      1.00      1.00         1

                            accuracy                           1.00         1
                           macro avg       1.00      1.00      1.00         1
                        weighted avg       1.00      1.00      1.00         1
Parameters
  • filth_list (A list of Filth objects) – The list of detected filth

  • combine_detectors (bool, optional) – Combine performance of all detectors for the same filth/locale

  • groupby_documents (bool, optional) – Show performance for each file individually

  • output_dict (bool, optional) – Return the report as a JSON-style dict instead of plain text, defaults to False

Returns

The report in JSON (a dict) or in plain text

Return type

str or dict
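
The precision, recall, f1-score, and support columns in the report follow the usual classification-metric definitions. As a minimal sketch, assuming you already have true-positive, false-positive, and false-negative counts for one detector (the helper name `prf_from_counts` is hypothetical and not part of scrubadub), the numbers can be reproduced like this:

```python
from typing import Dict


def prf_from_counts(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Standard precision/recall/F1 from raw counts (hypothetical helper, not scrubadub API)."""
    # Guard against division by zero when nothing was detected or nothing was expected.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    # "support" is the number of known (expected) items: hits plus misses.
    return {"precision": precision, "recall": recall, "f1-score": f1, "support": tp + fn}


perfect = prf_from_counts(tp=1, fp=0, fn=0)  # one known item, found exactly once
noisy = prf_from_counts(tp=3, fp=1, fn=1)    # one spurious hit and one miss
print(perfect)
print(noisy)
```

The first call corresponds to a row in which a single known item was detected correctly, so every metric is 1.0 with a support of 1.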

scrubadub.comparison.get_filth_dataframe(filth_list: List[scrubadub.filth.base.Filth]) → pandas.core.frame.DataFrame

Produces a pandas DataFrame to allow debugging and improving detectors.

An example of using this is shown below:

>>> import pandas as pd
>>> import scrubadub, scrubadub.comparison, scrubadub.detectors.text_blob
>>> scrubber = scrubadub.Scrubber(detector_list=[
...     scrubadub.detectors.text_blob.TextBlobNameDetector(name='name_detector'),
...     scrubadub.detectors.TaggedEvaluationFilthDetector([
...         {'match': 'Tom', 'filth_type': 'name'},
...         {'match': 'tom@example.com', 'filth_type': 'email'},
...     ]),
... ])
>>> filth_list = list(scrubber.iter_filth("Hello I am Tom"))
>>> with pd.option_context("display.max_columns", 20):
...     print(scrubadub.comparison.get_filth_dataframe(filth_list))  
   group_id  filth_id filth_type  detector_name document_name text  beg  end  \
0         0         0       name  name_detector          None  Tom   11   14

  locale  known_filth comparison_type known_text  known_beg  known_end  \
0  en_US         True             NaN        Tom         11         14

  known_comparison_type  exact_match  partial_match  true_positive  \
0                  name         True           True           True

   false_positive  false_negative
0           False           False
Parameters

filth_list (A list of Filth objects) – The list of detected filth

Returns

A pd.DataFrame containing information about the detected Filth

Return type

pd.DataFrame
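
One way to use such a table for debugging is to tally the true_positive, false_positive, and false_negative flags per detector. The sketch below uses plain dicts as stand-ins for DataFrame rows (roughly what `df.to_dict("records")` would return), so it runs without scrubadub or pandas installed. The sample rows and the helper name `tally_by_detector` are illustrative assumptions, not library output.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical rows using a subset of the column names shown above.
rows = [
    {"filth_type": "name", "detector_name": "name_detector", "text": "Tom",
     "true_positive": True, "false_positive": False, "false_negative": False},
    {"filth_type": "name", "detector_name": "name_detector", "text": "Hello",
     "true_positive": False, "false_positive": True, "false_negative": False},
    {"filth_type": "email", "detector_name": "email", "text": None,
     "true_positive": False, "false_positive": False, "false_negative": True},
]


def tally_by_detector(records: List[dict]) -> Dict[str, Dict[str, int]]:
    """Count TP/FP/FN flags per detector_name (illustrative helper, not scrubadub API)."""
    counts: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"true_positive": 0, "false_positive": 0, "false_negative": 0}
    )
    for record in records:
        for flag in ("true_positive", "false_positive", "false_negative"):
            # Convert the flag to 0 or 1 before adding it to the running count.
            counts[record["detector_name"]][flag] += int(bool(record[flag]))
    return dict(counts)


tallies = tally_by_detector(rows)
print(tallies)
```

A detector with many false positives is matching too eagerly, while false negatives point to known filth that was missed.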

scrubadub.comparison.make_fake_document(paragraphs: int = 20, locale: str = 'en_US', seed: Optional[int] = None, faker: Optional[faker.proxy.Faker] = None, filth_types: Optional[List[str]] = None, fake_text_function: Optional[Callable[[...], str]] = None, additional_filth_types: Optional[Iterable[Type[scrubadub.filth.base.Filth]]] = None) → Tuple[str, List[scrubadub.detectors.tagged.KnownFilthItem]]

Creates a fake document containing Filth that needs to be removed. Also returns the list of known filth items that are needed by the TaggedEvaluationFilthDetector.

An example of using this is shown below:

>>> import scrubadub, scrubadub.comparison
>>> document, known_filth_items = scrubadub.comparison.make_fake_document(paragraphs=1, seed=1)
>>> scrubber = scrubadub.Scrubber()
>>> scrubber.add_detector(scrubadub.detectors.TaggedEvaluationFilthDetector(
...     known_filth_items=known_filth_items
... ))
>>> filth_list = list(scrubber.iter_filth(document))
>>> print(scrubadub.comparison.get_filth_classification_report(filth_list))
filth    detector    locale      precision    recall  f1-score   support

email    email       en_US            1.00      1.00      1.00         2
url      url         en_US            1.00      1.00      1.00         1

                      micro avg       1.00      1.00      1.00         3
                      macro avg       1.00      1.00      1.00         3
                   weighted avg       1.00      1.00      1.00         3
                    samples avg       1.00      1.00      1.00         3
Parameters
  • paragraphs (int) – The number of paragraphs to generate in the document

  • locale (str) – The locale of the documents in the format: a two-letter lower-case language code followed by an underscore and a two-letter upper-case country code, e.g. “en_GB” or “de_CH”

  • seed (int, optional) – The random seed used to generate the document

  • faker (Faker, optional) – A Faker object that is used to generate the text

  • filth_types (List[str], optional) – A list of the Filth.type values to generate

  • fake_text_function (Callable, optional) – A function that will generate 1-3 sentences of text

Returns

The document and a list of KnownFilthItems

Return type

Tuple[str, List[KnownFilthItem]]
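
To illustrate the shape of the return value, the following toy sketch builds a one-sentence document together with known-filth items in the same `{'match': ..., 'filth_type': ...}` form used in the examples above. It is not scrubadub's implementation: the function `toy_fake_document`, its template sentence, and the value pools are all made up for illustration, and it uses the standard library's `random` module where the real function relies on Faker.

```python
import random
from typing import Dict, List, Optional, Tuple

# Made-up value pools; a real generator such as Faker would supply these.
NAMES = ["Tom", "Alice", "Priya"]
EMAILS = ["tom@example.com", "alice@example.org", "priya@example.net"]


def toy_fake_document(seed: Optional[int] = None) -> Tuple[str, List[Dict[str, str]]]:
    """Return (document, known_filth_items) for a single templated sentence."""
    rng = random.Random(seed)  # a local RNG makes the output reproducible for a given seed
    name = rng.choice(NAMES)
    email = rng.choice(EMAILS)
    document = f"Hello, I am {name} and you can reach me at {email}."
    # Record every piece of filth that was planted, so detectors can be scored against it.
    known_filth_items = [
        {"match": name, "filth_type": "name"},
        {"match": email, "filth_type": "email"},
    ]
    return document, known_filth_items


document, known_filth_items = toy_fake_document(seed=1)
print(document)
print(known_filth_items)
```

Because the generator records each planted value as it writes the text, the resulting list can serve as ground truth when evaluating detectors, and fixing the seed keeps the document stable between runs.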