Public doctoral thesis defense in computer science: Rabeb ABIDA
Semantic Enrichment and Data Exploitation in an Open Data Context
Date: 08/12/2023 15:00 - 08/12/2023 17:00
Location: PA01
Speaker(s): Rabeb ABIDA
Organizer(s): Isabelle Daelman
Open Data (OD) is becoming a significant trend, especially in the government context, where it is known as Open Government Data (OGD). These data span many kinds of datasets, including budget and spending, health, population, census, geographical data, and parliament minutes. They also include data that is indirectly "owned" by public administrations (e.g., through subsidiaries or agencies), such as data on climate and pollution, public transportation, congestion and traffic, and child care and education. OD is a powerful tool with enormous potential to empower researchers, citizens, businesses, and other stakeholders in various ways.
Making these data more discoverable, available, accessible, understandable, and usable by all fosters data analysis, exploitation, and publication. It also promotes transparency and better decision-making by giving users and citizens access to information about government activities and services, and it opens opportunities for public participation and collaboration in government policies and decision-making processes. Linked Open Data (LOD) is open data published by following a set of best practices for publishing and connecting structured data on the Web. Combining OGD with LOD yields Linked Open Government Data (LOGD). Such data not only improves the transparency of a government; it can also lead to innovative solutions for community advancement and support public administration functions. While OD brings many benefits, it also poses many challenges, notably the barriers faced by most data consumers and publishers, which this thesis addresses. Heterogeneity is the most significant issue with open data: the data may come in many different formats and structures, making it challenging to integrate, publish, and analyze effectively.
Heterogeneity of this kind can limit the usefulness and re-use of the data, and makes it harder to extract meaningful information and to determine the quality of the data to be published. Yet little work exists to support the process of producing and publishing LOD in an e-government context. There is a clear lack of automatically supported, integrated solutions that enable (i) combining appropriate techniques and tools to help users, especially non-expert users, efficiently manipulate datasets from extraction to publication as LOGD, (ii) following up on progress during this process, together with a semantic visualization of the data, and (iii) involving stakeholders alongside governments in reusing the data to provide public services and applications.
Data quality, ambiguity, and completeness are another important challenge. For example, open data initiatives release raw data via various data portals, which are disparate and difficult to use, especially for the general public. The same entity can appear in different data sources, which makes it difficult to add correct information and to merge and analyze data from those sources. Additionally, data may be incomplete, inconsistent, or contain errors, for instance because of data collection errors or incomplete reporting; this can lead to inaccurate or misleading results, with significant implications for decision-making and policy development. To address this challenge, it is essential to establish clear data quality standards and guidelines for data collection, annotation, and validation. These standards should include effective guidelines for data extraction, data cleaning, data interpretation, data analysis, and quality control, as well as procedures for handling missing data or incomplete metadata and for ensuring data accuracy.
Discoverability and understanding are important because they help ensure that data can be found, accessed, and interpreted accurately and effectively.
When the data files linked on a portal do not provide semantic links between the data sources, it is difficult to understand and establish the links between data entities. Open Data publishers should promote data fluency among stakeholders so that they can analyze and interpret data effectively. Most existing tools cannot automate the process or interoperate with Open Data portals. Hence, there is still no all-in-one approach that handles Open Data quality problems, and semantic annotation tools that integrate well with Open Data portals are lacking.
This thesis focuses on overcoming these barriers to Open Data while providing solutions that improve practical interventions in this area. More particularly, we aim to support semantics-aware analysis in LOD-enabled government systems: we propose a useful, integrated, semi-automatic framework that interactively assists Open Data publishers in integrating and publishing LOGD. Moreover, we address OD quality issues such as noise, misspellings, incomplete metadata, missing cell values, and missing significant information. We intend to improve the usability, discoverability, and understanding of OD by providing an all-in-one automatic approach that handles data quality problems and supports annotation systems that can be easily integrated into any OD portal. Our purpose is to help data analysts quickly spot signs of Open Data problems, reducing the time they spend on analysis, and to implement a simple application that helps users merge related annotated datasets, improving government transparency, which can lead to innovative solutions for community progress. Finally, we evaluate the capabilities, usability, and usefulness of the solutions we have developed.
Keywords: open data, linked open data, semantic enrichment, data exploitation
Contact: Isabelle Daelman, isabelle.daelman@unamur.be