
NGUYEN Cong-Danh


Workload- and Data-based Automated Design for a Hybrid Row-column Storage Model and Bloom Filter-based Query Processing for Large-scale DICOM Data Management

May 4, 2018, 4:00 p.m., Salle du Conseil

In the health care industry, the DICOM (Digital Imaging and Communications in Medicine) standard has become very popular for storing and transferring digital medical images and reports. However, the ever-increasing size, high velocity and variety of DICOM data collections have led to challenges in data management. In addition, the various types of queries, including OLAP, OLTP and mixed workloads, have a negative impact on query performance. Existing systems have limitations in dealing with these characteristics of data and workloads. In this thesis, we propose new efficient methods for storing and querying data that can be applied to enhance the performance, scalability, availability and elasticity of a large-scale DICOM data management system.
We propose a hybrid storage model of row and column stores, called HYTORMO, together with data storage and query processing strategies. First, HYTORMO is designed and implemented to be deployed in large-scale environments, making it possible to manage big medical data. Second, the data storage strategy combines vertical partitioning with hybrid storage in order to create data storage configurations that reduce storage space demand and increase the performance of queries in given workloads. To achieve such a data storage configuration, we propose two database design approaches: (1) expert-based design and (2) automated design. In the former approach, experts (e.g., database designers) manually create data storage configurations by grouping the attributes of DICOM data and selecting a suitable data storage layout for each column group. In the latter approach, we propose a hybrid automated design framework, called HADF. HADF relies on similarity measures between attributes that take into account both workload-specific and data-specific information in order to automatically group the attributes into column groups and to select suitable data storage layouts for them. Finally, we propose a suitable and efficient query processing strategy built on top of HYTORMO. It uses both inner joins and left-outer joins for join operations between vertically partitioned tables, which prevents the data loss that would occur if only inner joins were used. In addition, an Intersection Bloom filter is applied to remove irrelevant data from the input tables of join operations; this helps to reduce disk I/O, network communication and CPU costs, and thus improves query performance.
We also provide experimental evaluations to validate the benefits of the proposed methods over real DICOM datasets.
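
As an illustration of the Bloom filter step mentioned in the abstract, the following is a minimal sketch of pre-filtering the two inputs of a join with the intersection of their Bloom filters. It is not the thesis implementation: the class `BloomFilter`, the helpers `build_filter` and `prefilter_for_join`, the filter size, the number of hash functions and the use of SHA-256 are all assumptions chosen for readability. The sketch relies on a standard property: when two Bloom filters share the same size and hash functions, the bitwise AND of their bit arrays still reports every key present in both inputs, with possible false positives.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter; sizes and hashing are assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # True for every added key; may also be True for others.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

    def intersect(self, other):
        # Bitwise AND is only meaningful for identically configured filters.
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("filters must share size and hash count")
        result = BloomFilter(self.num_bits, self.num_hashes)
        result.bits = self.bits & other.bits
        return result


def build_filter(rows, key):
    """Build a filter over the join-key values of a list of dict rows."""
    bf = BloomFilter()
    for row in rows:
        bf.add(row[key])
    return bf


def prefilter_for_join(left, right, key):
    """Drop rows whose join key cannot appear in both inputs."""
    common = build_filter(left, key).intersect(build_filter(right, key))
    keep = lambda rows: [r for r in rows if common.might_contain(r[key])]
    return keep(left), keep(right)
```

For example, calling `prefilter_for_join(patients, studies, "uid")` on two hypothetical vertically partitioned tables returns reduced inputs that still contain every row whose `uid` occurs in both tables. Because of false positives, the exact join must still be run on the reduced inputs.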

Jury:
Reviewers:
Prof. Christine COLLET, Institut polytechnique de Grenoble (Grenoble INP), France
Prof. Abdelkader HAMEURLAIN, Université de Toulouse, IRIT, France
Examiner:
Prof. Farouk TOUMANI, Université Clermont Auvergne, LIMOS, France
Supervisors:
Prof. Laurent D’ORAZIO, Université Rennes 1, IRISA, France
Prof. Mohand-Said HACID, Université Claude Bernard Lyon 1, LIRIS, France
MSc Nga TRAN, Micro Focus - Vertica, Cambridge, Massachusetts, USA