Over the last few years, technological advances have greatly increased the capability to analyse data for scientific research and to reuse it in Open Science initiatives for building models, including those generated with Machine Learning techniques. The attempt to build data 'spaces' or 'ecosystems' that support the publication and reuse of data for feeding pipelines (i.e. the processes that data scientists specify and execute to prepare, transform, enrich and analyse data) has inspired several initiatives in Europe and worldwide. However, assessing and controlling the quality of data and results can be very expensive in terms of both computational resources and human effort. Completely automated pipelines can reduce these costs, but they present significant weaknesses in monitoring the data life cycle and often make it very difficult to control the results in terms of quality, uncertainty and explainability.
In this scenario, the project ‘Discount Quality for Responsible Data Science: Human-in-the-Loop for Quality Data’ intends to exploit a Human-in-the-Loop (HITL) approach (i.e. an approach involving human intervention in the most delicate phases of the data transformation process) to increase the overall sustainability of the pipeline, both from a computational point of view and in terms of human effort. In particular, the project focuses on data preparation, which normally takes up to 80% of the overall time needed to complete the process, and seeks to balance the need for high-quality data against the need to reduce the work involved in preparing it. To make this process more sustainable, two main goals will be pursued: 1) reducing the computational effort needed to analyse data; 2) introducing HITL to make human intervention more effective, and thus to limit it.
The project leverages the complementary expertise of the partners involved. The research unit at the Department of Electronics, Information and Bioengineering of Politecnico di Milano, coordinated by Prof. Barbara Pernici, who is also Principal Investigator of the project, has strong expertise in data and information quality, in particular in quality assessment and in developing pipelines for scientific data and for social media analysis. The University of Modena and Reggio Emilia contributes expertise in data preparation based on semantic approaches. The University of Milano-Bicocca provides competence in data sharing and annotation. Sapienza University of Rome has significant expertise in data visualization and exploration.