Welcome to RecList’s documentation!

Overview

RecList is an open source library providing behavioral, “black-box” testing for recommender systems. Inspired by the pioneering work of Ribeiro et al. 2020 in NLP, we introduce a general plug-and-play procedure to scale up behavioral testing, with an easy-to-extend interface for custom use cases.

While quantitative metrics over held-out data points are important, a lot more tests are needed for recommenders to properly function in the wild and not erode our confidence in them: for example, a model may boast an accuracy improvement over the entire dataset, but actually be significantly worse than another on rare items or new users; or again, a model that correctly recommends HDMI cables as add-on for shoppers buying a TV, may also wrongly recommend TVs to shoppers just buying a cable.

RecList's goal is to operationalize these important intuitions into a practical package for testing research and production models in a more nuanced way, without requiring unnecessary custom code and ad hoc procedures. To streamline comparisons among existing models, RecList ships with popular datasets and ready-made behavioral tests: read the TDS blog post for a gentle introduction to the main use cases, and check the paper for more details on the relevant literature.

If you are not familiar with the library, we suggest first taking our small tour to get acquainted with the main abstractions through ready-made models and public datasets.

Supporters

RecList is a community project made possible by the generous support of these awesome folks. Make sure to check them out!

Comet
https://github.com/jacopotagliabue/reclist/raw/main/images/comet.png
Neptune
https://github.com/jacopotagliabue/reclist/raw/main/images/neptune.png
Gantry
https://github.com/jacopotagliabue/reclist/raw/main/images/gantry.png

Project updates

Community Support: RecList is an open source community project made possible by the support of the awesome folks at Comet, Neptune and Gantry. Soon RecList tests will be natively integrated with the MLOps tools you already know and love!

June 2022: We launched a website to collect RecList materials, such as talks and presentations. RecList is powering the Data Challenge at CIKM 2022: fill in the form for updates.

In the last few months, we presented this library to practitioners at Tubi, eBay, NVIDIA, BBC and other RecSys companies: we are in the process of collecting our thoughts after all the feedback we received, as we plan a beta release for this package in the next few months - come back often for updates, as we will also open a call for collaboration!

Please remember that the library is in alpha (i.e. enough working code to finish the paper and tinker with it). We welcome early feedback, but please be advised that the package may change substantially in the near future (“If you’re not embarrassed by the first version, you’ve launched too late”).

Summary

This doc is structured as follows: a Quick Start, a Guided Tour of the main abstractions, an overview of the out-of-the-box Capabilities, the Core Components and the Modules (API reference), followed by contribution guidelines.

Quick Start

If you want to see RecList in action, clone the repository, create and activate a virtual env, and install the required packages from pip (you can also install from the repository root, of course). If you prefer to experiment in an interactive, no-installation-required fashion, try out our Colab notebook.

Sample scripts are divided by use case: similar items, complementary items or session-based recommendations. When executing one, a suitable public dataset will be downloaded and a baseline model trained; finally, the script runs a pre-made suite of behavioral tests to show typical results.

git clone https://github.com/jacopotagliabue/reclist
cd reclist
python3 -m venv venv
source venv/bin/activate
pip install reclist
python examples/coveo_complementary_rec.py

Running your model on one of the supported datasets, leveraging the pre-made tests, is as easy as implementing a simple interface, RecModel.

Once you’ve successfully run the sample script, take the guided tour below to learn more about the abstractions and the out-of-the-box capabilities of RecList.

A Guided Tour

An instance of RecList represents a suite of tests for recommender systems: given a dataset (more appropriately, an instance of RecDataset) and a model (an instance of RecModel), it will run the specified tests on the target dataset, using the supplied model.

For example, the following code instantiates a pre-made suite of tests that contains sensible defaults for a cart recommendation use case:

rec_list = CoveoCartRecList(
    model=model,
    dataset=coveo_dataset
)
# invoke rec_list to run tests
rec_list(verbose=True)

Our library pre-packages standard RecSys KPIs and important behavioral tests, divided by use cases, but it is built with extensibility in mind: you can re-use tests in new suites, or you can write new domain-specific suites and tests.

Any suite must inherit from the RecList interface and declare its tests with Pythonic decorators. In this case, the test re-uses a standard function:

from reclist.abstractions import RecList, rec_test


class MyRecList(RecList):

    @rec_test(test_type='stats')
    def basic_stats(self):
        """
        Basic statistics on training, test and prediction data
        """
        from reclist.metrics.standard_metrics import statistics
        return statistics(self._x_train,
                          self._y_train,
                          self._x_test,
                          self._y_test,
                          self._y_preds)

Any model can be tested, as long as its predictions are wrapped in a RecModel. This allows for pure “black-box” testing: for example, a SaaS provider can be tested just by wrapping the proper API call in the predict method:

from reclist.abstractions import RecModel


class MyCartModel(RecModel):

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def predict(self, prediction_input: list, *args, **kwargs):
        """
        Implement the abstract method, accepting a list of lists, each list being
        the content of a cart: the predictions returned by the model are the top K
        items suggested to complete the cart.
        """
        # call your model (or a third-party API) here and return
        # one list of top-K suggested items per input cart
        return
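For instance, a minimal sketch of that black-box idea, assuming a hypothetical REST endpoint that returns top-K suggestions for a cart (the URL, payload and response format are illustrative, not a real provider API):

import requests

from reclist.abstractions import RecModel


class SaaSCartModel(RecModel):
    """Hypothetical wrapper around a third-party recommendation API."""

    def __init__(self, endpoint: str, api_key: str, **kwargs):
        super().__init__(**kwargs)
        self._endpoint = endpoint  # e.g. "https://api.example.com/recommend" (illustrative)
        self._api_key = api_key

    def predict(self, prediction_input: list, *args, **kwargs):
        predictions = []
        for cart in prediction_input:
            # one call per cart: the provider returns the top-K complementary items
            response = requests.post(
                self._endpoint,
                json={"items": cart},
                headers={"Authorization": f"Bearer {self._api_key}"},
                timeout=10,
            )
            response.raise_for_status()
            predictions.append(response.json().get("recommendations", []))
        return predictions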

More generally, the logical workflow of a typical RecList implementation is as follows (see our blog post for a longer explanation):

https://github.com/jacopotagliabue/reclist/blob/main/images/workflow.gif

While many standard KPIs are available in the package, the philosophy behind RecList is that metrics like Hit Rate provide only a partial picture of the expected behavior of recommenders in the wild: two models with very similar accuracy can have very different behavior on, say, the long-tail, or model A can be better than model B overall, but at the expense of providing disastrous performance on a set of inputs that are particularly important in production.

RecList recognizes that outside of academic benchmarks, some mistakes are worse than others, and not all inputs are created equal: when possible, it operationalizes behavioral insights for debugging and error analysis through scalable code; it also provides extensible abstractions when domain knowledge and custom logic are needed.

Once you run a suite of tests, results are dumped automatically and versioned in a local folder, structured as follows (name of the suite, name of the model, run timestamp):

.reclist/
  myList/
    myModel/
      1637357392/
      1637357404/

If you start using RecList as part of your standard testing - either for research or production purposes - you can use the JSON report for machine-to-machine communication with downstream systems (e.g. you may want to automatically fail the model pipeline if certain behavioral tests are not passed).
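As a minimal sketch of such a gate, assuming the run folder contains a JSON report listing test results (the exact file name and schema may vary across versions, so treat the paths and keys below as illustrative):

import json
from pathlib import Path

# locate the most recent run of a given suite / model (folder layout as shown above)
runs = sorted(Path(".reclist/myList/myModel").iterdir(), key=lambda p: p.name)
latest_run = runs[-1]

# hypothetical report file and schema: adapt to the actual artifacts in your run folder
report = json.loads((latest_run / "report.json").read_text())

# fail the pipeline if a behavioral test we care about falls below a threshold
MIN_HIT_RATE = 0.1  # illustrative threshold
for test in report.get("data", []):
    if test.get("test_name") == "HR@10" and test.get("test_result", 0) < MIN_HIT_RATE:
        raise SystemExit(f"Behavioral test failed: {test['test_name']} = {test['test_result']}")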

Note: our app is deprecated, as RecList Beta will have connectors with existing apps (experiment trackers, model cards, etc.).

Capabilities

RecList provides a dataset and model agnostic framework to scale up behavioral tests. As long as the proper abstractions are implemented, all the out-of-the-box components can be re-used. For example:

  • you can use a public dataset provided by RecList to train your new cart recommender model, and then use the RecTests we provide for that use case;

  • you can use one of the provided baseline models on your custom dataset, to establish a reference point for your project;

  • you can use a custom model on a private dataset and define a new suite of tests from scratch, mixing existing methods and domain-specific tests (see the sketch after this list).
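As a sketch of the last scenario, a custom suite can mix the built-in statistics test with a domain-specific check (the brand_coverage test below is purely illustrative and assumes each predicted item is a dict carrying a 'brand' field):

from reclist.abstractions import RecList, rec_test
from reclist.metrics.standard_metrics import statistics


class MyDomainRecList(RecList):

    @rec_test(test_type='stats')
    def basic_stats(self):
        """
        Re-use the standard statistics test shipped with RecList
        """
        return statistics(self._x_train,
                          self._y_train,
                          self._x_test,
                          self._y_test,
                          self._y_preds)

    @rec_test(test_type='brand_coverage')
    def brand_coverage(self):
        """
        Domain-specific test (illustrative): how many distinct brands
        appear in the model's predictions?
        """
        predicted_brands = {item.get('brand') for preds in self._y_preds for item in preds}
        predicted_brands.discard(None)
        return len(predicted_brands)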

We list below what we currently support out-of-the-box, with particular focus on datasets and tests, as the models we provide are convenient baselines, but they are not meant to be SOTA research models.

Datasets

RecList features convenient wrappers around popular datasets, to help test models over known benchmarks in a standardized way.
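As a minimal sketch, the wrappers documented in the Modules section can be used as follows (the attribute names mirror the RecDataset abstraction shown later in this doc):

from reclist.datasets import CoveoDataset

# downloads and caches the dataset on first use (pass force_download=True to refresh)
coveo_dataset = CoveoDataset()

# the RecDataset interface exposes the standard splits and the item catalog
x_train = coveo_dataset.x_train
y_test = coveo_dataset.y_test
catalog = coveo_dataset.catalog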

Behavioral Tests

RecList helps report standard quantitative metrics over popular (or custom) datasets, such as the ones collected in standard_metrics.py: hit rate, MRR, coverage, popularity bias, etc. However, RecList's raison d'être is providing plug-and-play behavioral tests, as agnostic as possible to the underlying models and datasets, while leaving open the possibility of writing personalized tests when domain knowledge and custom logic are necessary.

Test descriptions are available in our (WIP) docs, but we share here some examples from our paper.

First, RecList makes it possible to compare models that may have similar aggregate KPIs (e.g. hit rate on the entire test set) across different slices. When plotting HR by product popularity, it is easy to spot that prod2vec works much better on rarer items than the alternatives:

https://github.com/jacopotagliabue/reclist/blob/main/images/hit_rate_dist.png

When slicing by important meta-data (in this simulated example, brands), RecList uncovers significant differences in performance for different groups; since the features we care about vary across datasets, the package allows for a generic way to partition the test set and compute per-slice metrics:

https://github.com/jacopotagliabue/reclist/blob/main/images/slice_dist.png
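A minimal sketch of the per-slice idea, independent of the specific metrics shipped with RecList (the hit-rate helper, the toy data and the 'brand' field are illustrative):

from collections import defaultdict


def hit_rate(y_true: list, y_pred: list) -> float:
    """Fraction of cases where the ground-truth item appears among the predictions."""
    hits = sum(1 for truth, preds in zip(y_true, y_pred) if truth in preds)
    return hits / max(len(y_true), 1)


def hit_rate_by_slice(y_true: list, y_pred: list, slice_fn) -> dict:
    """Partition the test set with slice_fn (e.g. item -> brand) and compute HR per slice."""
    slices = defaultdict(lambda: ([], []))
    for truth, preds in zip(y_true, y_pred):
        key = slice_fn(truth)
        slices[key][0].append(truth)
        slices[key][1].append(preds)
    return {key: hit_rate(t, p) for key, (t, p) in slices.items()}


# illustrative usage: ground truths carry a 'brand' field we slice on
y_true = [{"sku": "tv-1", "brand": "acme"}, {"sku": "cable-9", "brand": "globex"}]
y_pred = [[{"sku": "tv-1", "brand": "acme"}], [{"sku": "tv-2", "brand": "acme"}]]
print(hit_rate_by_slice(y_true, y_pred, slice_fn=lambda item: item["brand"]))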

Finally, RecList can take advantage of the latent item space to compute the cosine distances <query item, ground truth> and <query item, prediction> for missed predictions in the test set. In a cart recommender use case, we expect recommended items to reflect the complementary nature of the task: if a TV is in the cart, a model should recommend an HDMI cable, not another TV. As the comparison below shows, Google’s predictions better match the label distribution, suggesting that their model better captures the nature of the task:

https://github.com/jacopotagliabue/reclist/blob/main/images/distance_to_query.png
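A sketch of that check, assuming the model exposes a get_vector method mapping an item to its latent representation (as in the RecModel example later in this doc); the aggregation over missed predictions is illustrative:

import math


def cosine_distance(u: list, v: list) -> float:
    """1 - cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def distances_for_misses(model, x_test, y_test, y_preds):
    """
    For each missed prediction, compare distance(query, ground truth)
    with distance(query, predicted item): complementary recommendations
    should sit at a 'ground-truth-like' distance from the query.
    """
    to_truth, to_pred = [], []
    for query, truth, preds in zip(x_test, y_test, y_preds):
        if not preds or truth == preds[0]:
            continue  # skip empty predictions and hits
        q, t, p = model.get_vector(query), model.get_vector(truth), model.get_vector(preds[0])
        if q and t and p:
            to_truth.append(cosine_distance(q, t))
            to_pred.append(cosine_distance(q, p))
    return to_truth, to_pred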

Roadmap

We have exciting news about our Beta, including the use of RecList as the main library for the CIKM Data Challenge!

Contributing

We will update this repo with some guidelines for contributions as soon as the codebase becomes more stable. Check back often for updates!

Acknowledgments

The original authors are:

If you have questions or feedback, please reach out to: jacopo dot tagliabue at tooso dot ai.

Talks and Presentations

Past and upcoming talks and presentations can be found at our new website.

License and Citation

All the code is released under an open MIT license. If you found RecList useful, please cite our WWW paper:

@inproceedings{10.1145/3487553.3524215,
    author = {Chia, Patrick John and Tagliabue, Jacopo and Bianchi, Federico and He, Chloe and Ko, Brian},
    title = {Beyond NDCG: Behavioral Testing of Recommender Systems with RecList},
    year = {2022},
    isbn = {9781450391306},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3487553.3524215},
    doi = {10.1145/3487553.3524215},
    pages = {99–104},
    numpages = {6},
    keywords = {recommender systems, open source, behavioral testing},
    location = {Virtual Event, Lyon, France},
    series = {WWW '22 Companion}
}

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

Core Components

To understand how to use RecList you need to know the core components of the package. After reading this page you should be comfortable enough to play with and modify the tutorial, and also to create your own RecLists for evaluation.

Two Paths

Say you have a new recommender and you want to validate it with behavioural tests. We provide different datasets you can play with. If you want to follow this path, you just need to get the training data from the datasets, train your model, and then run a RecList to evaluate it.

Another possible use case is that you might want to create a RecList for a new dataset you have.
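As a sketch of the first path (the second is covered by the RecDataset abstraction below), assuming the pre-made Coveo suite and a toy stand-in model; the import path of the suite is an assumption, so check the examples folder for the exact location:

from reclist.abstractions import RecModel
from reclist.datasets import CoveoDataset
from reclist.reclists import CoveoCartRecList  # suite location is an assumption: see the examples folder


class EchoModel(RecModel):
    """Toy stand-in for your recommender: 'predict' the first item of each query."""

    def train(self, products):
        pass  # nothing to learn for this stand-in

    def predict(self, prediction_input: list, *args, **kwargs):
        # one top-1 prediction per input, just to exercise the pipeline
        return [[x[0]] if x else [] for x in prediction_input]


coveo_dataset = CoveoDataset()            # 1. get the data
model = EchoModel()
model.train(coveo_dataset.x_train)        # 2. train (or load) your model

rec_list = CoveoCartRecList(model=model, dataset=coveo_dataset)
rec_list(verbose=True)                    # 3. run the pre-made behavioral suite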

RecTest

The rec_test decorator is probably the most fundamental abstraction in RecList: it marks a method of a RecList suite as a behavioral test, so that the suite picks it up and runs it when evaluated on a dataset.

from reclist.abstractions import RecList, rec_test


class MyRecList(RecList):

    @rec_test(test_type='stats')
    def basic_stats(self):
        """
        Basic statistics on training, test and prediction data
        """
        from reclist.metrics.standard_metrics import statistics
        return statistics(self._x_train,
                          self._y_train,
                          self._x_test,
                          self._y_test,
                          self._y_preds)

RecDataset

The dataset is a simple abstraction you might want to implement if you are playing with other datasets: you just need to populate the main attributes (the splits and the catalog) in load(). A minimal concrete subclass is sketched after the abstract class below.

from abc import ABC, abstractmethod


class RecDataset(ABC):
    """
    Implements an abstract class for the dataset
    """
    def __init__(self, force_download=False):
        """
        :param force_download: allows to force the download of the dataset in case it is needed.
        :type force_download: bool, optional
        """
        self._x_train = None
        self._y_train = None
        self._x_test = None
        self._y_test = None
        self._catalog = None
        self.force_download = force_download
        self.load()

    @abstractmethod
    def load(self):
        """
        Abstract method that should implement dataset loading
        """
        return
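For example, a minimal concrete subclass, assuming your data already lives in a local JSON file with pre-computed splits (the file name and keys below are illustrative):

import json

from reclist.abstractions import RecDataset


class MyLocalDataset(RecDataset):
    """Wrap a pre-split local dataset in the RecDataset interface."""

    def load(self):
        # illustrative file and schema: adapt to however your data is stored
        with open("my_dataset.json") as f:
            data = json.load(f)
        self._x_train = data["x_train"]
        self._y_train = data["y_train"]
        self._x_test = data["x_test"]
        self._y_test = data["y_test"]
        self._catalog = data.get("catalog", {})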

RecModel

The model abstraction wraps your recommender: implement the predict method (and, where available, helpers such as train and get_vector, which tests over the latent item space can exploit), as in the example below.

from reclist.abstractions import RecModel


class MyRecModel(RecModel):
    """
    My Recommender Model
    """
    model_name = "mymodel"

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def train(self, products):
        # my_training_function is a placeholder for your own training logic
        self._model = my_training_function(products)

    def predict(self, prediction_input: list, *args, **kwargs):
        predictions = self._model.predict(prediction_input)
        return predictions

    def get_vector(self, product_sku):
        # return the latent representation of an item, or an empty list if unknown
        try:
            return list(self._model.get_vector(product_sku))
        except Exception:
            return []

Modules

Abstractions

class reclist.abstractions.RecDataset(force_download=False)

Bases: ABC

Implements an abstract class for the dataset

property catalog
abstract load()

Abstract method that should implement dataset loading

property x_test
property x_train
property y_test
property y_train

class reclist.abstractions.RecList(model: RecModel, dataset: RecDataset, y_preds: Optional[list] = None)

Bases: ABC

META_DATA_FOLDER = '.reclist'
dump_results_to_json(test_results: list, report_path: str, epoch_time_ms: int)
generate_report(epoch_time_ms: int)
get_tests()

Helper to extract methods decorated with rec_test

property rec_tests
store_artifacts(report_path: str)
property test_data
property test_results
train_dense_repr(type_name: str, type_fn)

Train a dense representation over a type of meta-data & store it into the object

class reclist.abstractions.RecModel(model=None)

Bases: ABC

Abstract class for recommendation model

property model
abstract predict(prediction_input: list, *args, **kwargs)

The predict function should implement the behaviour of the model at inference time.

Parameters
  • prediction_input – the input that is used to do the prediction

  • args

  • kwargs

Returns

reclist.abstractions.rec_test(test_type: str)

Rec test decorator

Datasets

Data sets currently available

class reclist.datasets.CoveoDataset(**kwargs)

Bases: RecDataset

Coveo SIGIR data challenge dataset

load()

Abstract method that should implement dataset loading

class reclist.datasets.MovieLensDataset(**kwargs)

Bases: RecDataset

MovieLens 25M Dataset

Reference: https://files.grouplens.org/datasets/movielens/ml-25m-README.html

load()

Abstract method that should implement dataset loading

class reclist.datasets.SpotifyDataset(**kwargs)

Bases: RecDataset

load()

Abstract method that should implement dataset loading

load_spotify_playlist_dataset()

Metrics

Contributing

Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given.

You can contribute in many ways:

Types of Contributions

Report Bugs

Report bugs at https://github.com/vinid/reclist/issues.

If you are reporting a bug, please include:

  • Your operating system name and version.

  • Any details about your local setup that might be helpful in troubleshooting.

  • Detailed steps to reproduce the bug.

Fix Bugs

Look through the GitHub issues for bugs. Anything tagged with “bug” and “help wanted” is open to whoever wants to implement it.

Implement Features

Look through the GitHub issues for features. Anything tagged with “enhancement” and “help wanted” is open to whoever wants to implement it.

Write Documentation

RecList could always use more documentation, whether as part of the official RecList docs, in docstrings, or even on the web in blog posts, articles, and such.

Submit Feedback

The best way to send feedback is to file an issue at https://github.com/vinid/reclist/issues.

If you are proposing a feature:

  • Explain in detail how it would work.

  • Keep the scope as narrow as possible, to make it easier to implement.

  • Remember that this is a volunteer-driven project, and that contributions are welcome :)

Get Started!

Ready to contribute? Here’s how to set up reclist for local development.

  1. Fork the reclist repo on GitHub.

  2. Clone your fork locally:

    $ git clone git@github.com:your_name_here/reclist.git
    
  3. Install your local copy into a virtualenv. Assuming you have virtualenvwrapper installed, this is how you set up your fork for local development:

    $ mkvirtualenv reclist
    $ cd reclist/
    $ python setup.py develop
    
  4. Create a branch for local development:

    $ git checkout -b name-of-your-bugfix-or-feature
    

    Now you can make your changes locally.

  5. When you’re done making changes, check that your changes pass flake8 and the tests, including testing other Python versions with tox:

    $ flake8 reclist tests
    $ pytest    # or: python setup.py test
    $ tox
    

    To get flake8 and tox, just pip install them into your virtualenv.

  6. Commit your changes and push your branch to GitHub:

    $ git add .
    $ git commit -m "Your detailed description of your changes."
    $ git push origin name-of-your-bugfix-or-feature
    
  7. Submit a pull request through the GitHub website.

Pull Request Guidelines

Before you submit a pull request, check that it meets these guidelines:

  1. The pull request should include tests.

  2. If the pull request adds functionality, the docs should be updated. Put your new functionality into a function with a docstring, and add the feature to the list in README.rst.

  3. The pull request should work for Python 3.5, 3.6, 3.7 and 3.8, and for PyPy. Check https://travis-ci.com/vinid/reclist/pull_requests and make sure that the tests pass for all supported Python versions.

Tips

To run a subset of tests:

$ pytest tests/test_reclist.py

Deploying

A reminder for the maintainers on how to deploy. Make sure all your changes are committed (including an entry in HISTORY.rst). Then run:

$ bump2version patch # possible: major / minor / patch
$ git push
$ git push --tags

Travis will then deploy to PyPI if tests pass.

Credits

Development Lead

Contributors

None yet. Why not be the first?

History

0.1.0 (2021-11-16)

  • First release on PyPI.
