OpenML.org is an open science platform for machine learning that has recently been awarded two Open Science NL grants. On OpenML, we gather datasets, tasks, machine learning algorithms, and the results of these algorithms on tasks. These can be used to analyse the behaviour of machine learning algorithms and to understand when they work well. For more information on OpenML, see the following paper: https://www.sciencedirect.com/science/article/pii/S2666389925001655
Uploading datasets to OpenML can be complicated and requires significant experience on the part of the user. This project aims to build a two-stage pipeline that keeps a human in the loop. The process involves two actors: the dataset owner (who performs step 1 and can be consulted in step 2) and the OpenML community member (who performs step 2). The dataset owner first uploads their dataset in free format to an intermediate system, together with all relevant information (in Croissant format), where it can be further processed by OpenML community members.
The project should deliver a web application for this two-step approach. The emphasis should lie on the first step, because the second step can be performed by an expert community member. A user should be able to drop a dataset on a server, along with relevant information and contact information, which is then stored in Croissant format. The application could, for example, use a GitHub backend: after the dataset is uploaded, an issue is opened in which the dataset owner and an OpenML core member can hold a discussion.
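To make step 1 concrete, the following is a minimal sketch of how the information supplied by the dataset owner could be turned into a Croissant-style JSON-LD record. It is an illustration under stated assumptions, not the project's design: the function name build_croissant_record, its parameters, and the example values are hypothetical, the @context is abridged, and nothing is validated. A real implementation would follow the full Croissant specification and validate the result (for instance with the mlcroissant library).

```python
import hashlib
import json


def build_croissant_record(name, description, license_url, contact_email,
                           file_name, file_bytes, encoding_format="text/csv"):
    """Build a minimal Croissant-style JSON-LD record as a plain dict.

    Hypothetical helper: the @context is abridged and no validation is done.
    """
    return {
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
            "dct": "http://purl.org/dc/terms/",
            "conformsTo": "dct:conformsTo",
        },
        "@type": "sc:Dataset",
        "conformsTo": "http://mlcommons.org/croissant/1.0",
        "name": name,
        "description": description,
        "license": license_url,
        # Contact information of the dataset owner, used in step 2.
        "creator": {"@type": "sc:Person", "email": contact_email},
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": file_name,
                "name": file_name,
                "contentUrl": file_name,
                "encodingFormat": encoding_format,
                # Checksum of the uploaded bytes, so the file can be verified later.
                "sha256": hashlib.sha256(file_bytes).hexdigest(),
            }
        ],
    }


# Example values are placeholders, not real data.
record = build_croissant_record(
    name="example-dataset",
    description="A dataset uploaded in free format by its owner.",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    contact_email="owner@example.org",
    file_name="data.csv",
    file_bytes=b"a,b\n1,2\n",
)
print(json.dumps(record, indent=2))
```

Keeping the record as a plain dictionary means the web application can store it next to the uploaded file (for example as a JSON file in a repository) without committing to a particular backend.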
The flexibility then lies with the OpenML community members, who (potentially after contacting the dataset owner) should be able to process the dataset into OpenML format. For simple cases, this can be facilitated by the web application itself. For complicated cases, this is not possible (due to the diversity of datasets).
A minimum viable project would entail a flexible pipeline focusing on step 1, plus communication interfaces (e.g., GitHub issues) through which the two actors can interact for step 2.
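As an illustration of the communication interface, the sketch below prepares a request to the GitHub REST API endpoint for creating an issue (POST /repos/{owner}/{repo}/issues). The repository example-org/dataset-intake, the label dataset-submission, the metadata path, and the function name build_issue_request are all hypothetical; the request is only constructed, not sent, and a real deployment would need an access token with permission to create issues.

```python
import json
import urllib.request


def build_issue_request(owner, repo, token, dataset_name, metadata_path):
    """Prepare (but do not send) a request that opens a GitHub issue.

    Hypothetical helper: the issue is where the dataset owner and the
    OpenML community member can discuss the submission in step 2.
    """
    url = f"https://api.github.com/repos/{owner}/{repo}/issues"
    payload = {
        "title": f"New dataset submission: {dataset_name}",
        "body": (
            f"A dataset was uploaded in step 1.\n\n"
            f"Croissant metadata: {metadata_path}\n\n"
            f"Please use this issue to discuss conversion to OpenML format."
        ),
        "labels": ["dataset-submission"],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Placeholder values; "<token>" stands in for a real access token.
request = build_issue_request(
    owner="example-org",
    repo="dataset-intake",
    token="<token>",
    dataset_name="example-dataset",
    metadata_path="submissions/example-dataset/croissant.json",
)
# Sending is left out on purpose; urllib.request.urlopen(request) would perform the call.
print(request.get_method(), request.full_url)
```

Separating request construction from sending keeps the sketch testable without network access and leaves the choice of HTTP client open.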