Following the discussion at the technical committee meeting, the scikit-learn Consortium at Inria defined the following list of priorities for the coming year:

  • Improve documentation with extra examples and topic-based discussions:
  • Operationalization of models / MLOps
    • Other packages to use
    • Good practices (we are an entry point of the community)
      • Model serialization options (pickle vs skops vs ONNX)
      • Version control for reproducible retraining
      • Automate (CI / CD)
      • Long blocks of code should be in importable Python modules with tests, not in notebooks
    • Add side boxes in our doc / examples on the production logic (it may differ from exploration mode)
    • [Idea to be explored] Declarative construction of pipelines (pros / cons)
  • Annotation regarding models’ hyperparameters: https://github.com/scikit-learn/scikit-learn/pull/17929
    • First step implemented via programmatic hyperparameter declaration.
  • Improving performance and scalability
  • DOC and tools: safer recommendation for the right metrics for a given y_test.
  • Improve support for quantification of uncertainties in predictions and calibration measures
  • Improve the default solver in linear models
  • More flexible support for alternative input data container types
  • Quantification of fairness issues and potential mitigation
    • Document fairness assessment metrics
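
To make the "calibration measures" item in the list above more concrete, here is a minimal standard-library sketch of two common ones: the Brier score and a binned reliability table. The function names `brier_score` and `calibration_bins` are hypothetical and chosen for illustration only; scikit-learn's own versions are `sklearn.metrics.brier_score_loss` and `sklearn.calibration.calibration_curve`, whose binning details may differ from this sketch.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 outcomes.

    Hypothetical helper for illustration; lower is better, 0.0 is perfect.
    """
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def calibration_bins(y_true, y_prob, n_bins=5):
    """Group predictions into equal-width probability bins.

    Returns one (mean predicted probability, observed positive frequency,
    sample count) tuple per non-empty bin. For a well-calibrated model the
    first two values are close to each other in every bin.
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        # A probability of exactly 1.0 goes into the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((y, p))

    table = []
    for members in bins:
        if members:
            mean_prob = sum(p for _, p in members) / len(members)
            frequency = sum(y for y, _ in members) / len(members)
            table.append((mean_prob, frequency, len(members)))
    return table
```

A constant prediction of 0.5 on a balanced binary problem gives a Brier score of 0.25, which is a useful baseline when reading the score of a real model.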

 

Longer term: Big-picture tasks that require more thinking

  • MLOps: Model auditing and data auditing
  • Survival analysis tools need to go beyond pointwise predictions; this might be more generally useful in scikit-learn, possibly as uncertainty quantification in predictions.

  • Explore an API to simplify data wrangling (outside of scikit-learn)

 

Community: On the community side

  • Continue regular technical sprints and topic-focused workshops (possibly inviting past sprint contributors, to foster a long-term relationship and hopefully recruit new maintainers).
    • Better preparation for issues
    • Plan further in advance
    • Fewer people in the sprints (to be able to provide better mentoring)
  • Make the consortium meetings more transparent and inclusive:
    • Invite Adrin and other advisory board members to the meetings
    • Make the weekly tasking more visible
  • Renew the organization of beginners’ workshops for prospective contributors, probably before a sprint.
  • Organize a workshop on statistical topics (causal inference and calibration), possibly followed by a 2-day sprint
  • Organize a workshop on our software-engineering practices, with possible topics including:
    • CI and CD practices, e.g.:
      • optional testing on float32 and round-robin seed setting
      • nightly build and version pinning rationales
    • local development practices, e.g.:
      • pre-commit config
    • code review guidelines
    • performance troubleshooting and improvements
      • profiling, benchmarking
  • Conduct a new edition of the 2013 survey among all scikit-learn users.
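
As an illustration of the "round-robin seed setting" topic listed above, the sketch below shows one way a scheduled CI job could rotate through a fixed pool of random seeds, one per day, so that seed-sensitive tests surface over time while each run stays reproducible. This is an assumption-laden sketch, not scikit-learn's actual CI configuration: `N_SEEDS`, `seed_for_day`, and `make_rng` are hypothetical names.

```python
import datetime
import random

N_SEEDS = 100  # hypothetical size of the seed pool


def seed_for_day(day, n_seeds=N_SEEDS):
    """Map a calendar date to a seed, cycling through the pool day by day."""
    return day.toordinal() % n_seeds


def make_rng(day=None, n_seeds=N_SEEDS):
    """Return (seed, generator) for the given day, defaulting to today.

    Reporting the seed alongside a test failure lets a developer rerun
    the same configuration locally.
    """
    if day is None:
        day = datetime.date.today()
    seed = seed_for_day(day, n_seeds)
    return seed, random.Random(seed)


if __name__ == "__main__":
    seed, rng = make_rng()
    print(f"Running with seed {seed}; first draw: {rng.random():.6f}")
```

Deriving the seed from the date, rather than drawing it at random, means a failing nightly run can be reproduced from the build date alone.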