In this lab, students, PhD candidates and researchers develop prototypes for the management, integration, sharing and analysis of data through information systems, with specific application to the Bioinformatics and Web domains. The main topics covered are:
- Data representation
- Languages for data update, query, search and extraction
- Data and software architectures specialized for data management, query and analysis.
The research activity concerns the definition of models, methods and tools for the efficient and effective management of big data collections, and for querying, mining and analyzing them to extract new knowledge. Beyond these traditional topics, research focuses on the development of information systems for the Web and on the construction of tools for composing search and data extraction systems on the Web (Search Computing) and in the Cloud. A specific application is the management, querying and analysis of big biomedical and bioinformatics data, including data produced by recent DNA and RNA sequencing technologies (Genomic Computing). In most cases, research results are incorporated into applications and demonstration prototypes that are made publicly available on the Web. In recent years, research has mainly followed these lines:
- Genomic Computing: focused on the management and analysis of big genomic data generated by next-generation sequencing of DNA and RNA. In collaboration with IEO and IIT (www.ifom-ieo-campus.it), it aims to build a powerful computational infrastructure that can process the data generated by DNA and RNA sequencing machines and that makes it easy to create visualizations, queries, analyses, mining and searches over genomic data collections distributed and available worldwide. The goal is a standard computational infrastructure that is highly efficient, extensible and easy to use, the “Internet of genomes”, to support scientists in genomic research.
- Search Computing: a project funded by the ERC that aims to develop languages and methods for the integration of data guided by their ranking. Information search is managed through services that make distributed and heterogeneous data sources accessible (http://www.search-computing.it/).
- Data management in pervasive systems: centered on the management and analysis of the large quantities of data produced by new-generation pervasive technologies such as sensor networks, RFID, smart devices and social networks. It proposes methods and models for the design of context-aware adaptive systems, and analysis methods aimed at simplifying user interaction with large quantities of data. The Green Move project (http://gm.polimi.it/) used these technologies.
These research lines have produced the following results:
- GMQL (GenoMetric Query Language): provides a next-generation query language for querying big genomic data, such as those from Next Generation Sequencing technologies. It operates on aligned genomic data in a variety of data formats and provides parallel computation in the cloud, thereby supporting queries over thousands of samples, such as the ones provided by the ENCODE consortia. The language's name indicates its ability to compute massive operations on genomic regions, which take into account the relative positions and distances of regions. GMQL is supported by the PRIN project GenData 2020.
- GPKB (Genomic and Proteomic Knowledge Base): enables users to visually perform complex queries on heterogeneous integrated biomedical data, even when the data are very numerous. It is based on the Genomic and Proteomic Data Warehouse (GPDW), which we created by integrating information provided by several well-known biomedical-molecular databanks, including Entrez Gene, Uniprot, IntAct, Expasy Enzyme, GO, GOA, BioCyc, KEGG, Reactome and OMIM.
- Bio-SeCo (Bio Search Computing): offers an integrated environment that simplifies the search, exploration and combination, based on partial ordering, of heterogeneous data provided by registered search services. It provides global results that can help answer complex multi-subject biomedical questions.
- GFINDer (Genome Function INtegrated Discoverer, http://www.bioinformatics.polimi.it/GFINDer/): a system to discover, use and extract a great quantity of genomic information from heterogeneous and distributed online databases, in order to support the biomedical interpretation of high-throughput biomolecular experiments.
- PerLa (PERvasive Language): proposes a language and a middleware for the management of data in pervasive systems.