This project aims to design and develop a web-based platform that enables researchers and clinicians to discover and access cancer-related datasets efficiently. The tool will allow users to perform free-text queries to find all relevant datasets for a specific cancer type (e.g., pancreatic, lung) with selected characteristics (e.g., patient demographics, treatment modalities). The platform will also support dataset enrichment by retrieving relevant contextual information from scientific repositories such as Europe PMC. This information will help users understand how datasets have been used in prior research, including experimental context, cohort details, and observed outcomes.
In its initial stage, the project will deliver a functioning web application that lets users run free-text queries specifying cancer types and characteristics, and extend an existing dataset with related information retrieved from Europe PMC. At that stage, the tool will rely primarily on metadata and publication abstracts rather than full-text analysis. The completed version will expand these capabilities by mining information from the full text of scientific publications. It will also offer additional filtering options, such as publication year, journal, or other metadata, so that users can refine results. Using natural language processing and information extraction techniques, the tool will automatically identify dataset mentions, link them across studies, and generate enriched metadata describing how datasets have been used in previous research.
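As a rough illustration of the free-text querying and metadata filtering described above, the following Python sketch builds a search request for the public Europe PMC REST search endpoint and reduces a decoded JSON response to a few fields useful for enrichment. It is a minimal sketch under stated assumptions, not the project's implementation: the function names (`build_query`, `build_search_url`, `summarize`), the choice to quote each term as a phrase, and the selection of returned fields are all hypothetical.

```python
from urllib.parse import urlencode

# Public Europe PMC REST search endpoint.
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_query(cancer_type, characteristics=(), year_from=None, year_to=None, journal=None):
    """Combine free-text terms and optional filters into one query string.

    Assumption: each term is quoted as a phrase and all terms are joined
    with AND; a real tool might expand synonyms or use ontology terms.
    """
    terms = [f'"{cancer_type} cancer"']
    terms += [f'"{c}"' for c in characteristics]
    if year_from is not None and year_to is not None:
        # Publication-year range filter.
        terms.append(f"PUB_YEAR:[{year_from} TO {year_to}]")
    if journal:
        # Journal-title filter.
        terms.append(f'JOURNAL:"{journal}"')
    return " AND ".join(terms)


def build_search_url(query, page_size=25):
    """Return a search URL requesting JSON 'core' results, which include abstracts."""
    params = {
        "query": query,
        "format": "json",
        "resultType": "core",
        "pageSize": page_size,
    }
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


def summarize(response):
    """Reduce a decoded JSON response to the fields kept for enrichment.

    Missing keys are tolerated, since not every record carries an abstract.
    """
    results = response.get("resultList", {}).get("result", [])
    return [
        {
            "id": r.get("id"),
            "title": r.get("title"),
            "year": r.get("pubYear"),
            "abstract": r.get("abstractText", ""),
        }
        for r in results
    ]


if __name__ == "__main__":
    query = build_query("pancreatic", ["chemotherapy"], 2015, 2020)
    print(build_search_url(query, page_size=5))
```

The printed URL can be fetched with any HTTP client and the decoded JSON body passed to `summarize`. Keeping query construction separate from the network call makes the filtering logic testable without contacting the service.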