Scalable diversification for data exploration platforms

Khan, Hina Anwar (2016). Scalable diversification for data exploration platforms PhD Thesis, School of Information Technology and Electrical Engineering, The University of Queensland. doi:10.14264/uql.2016.1073

Attached Files (Some files may be inaccessible until you login with your UQ eSpace credentials)
Name Description MIMEType Size Downloads
s4305793_final_thesis.pdf Thesis (open access) application/pdf 4.73MB 0
Author Khan, Hina Anwar
Thesis Title Scalable diversification for data exploration platforms
School, Centre or Institute School of Information Technology and Electrical Engineering
Institution The University of Queensland
DOI 10.14264/uql.2016.1073
Publication date 2016-11-18
Thesis type PhD Thesis
Supervisor Mohamed A. Sharaf
Shazia Sadiq
Total pages 150
Language eng
Subjects 0806 Information Systems
Formatted abstract
Today, data exploration platforms are widely used to assist users in locating interesting objects within large volumes of scientific and business data. In those platforms, users try to make sense of the underlying data space by iteratively posing numerous queries over large databases. Hence, data exploration platforms rely on methods for the extraction of representative data to provide users with a concise and meaningful representation of query results. That is, extracting a few tuples from a query result to provide quick insights in the potentially huge answer space. In the past few years, importance of diversification while extracting representative subsets of data has been greatly emphasized. It has been shown that diverse subsets provide more effective representation of the underlying data by minimizing redundancy and increasing coverage. Meanwhile, search results diversification adds additional cost to an already computationally expensive exploration process.

In this PhD thesis, we have focused on the design, implementation and evaluation of scalable diversification algorithms and schemes for the data exploration platforms. Particularly, this research stipulates that extracting diverse representative results during data exploration requires addressing several challenges including: 1) scaling to big volumes of high dimensional data, 2) large number of users, and 3) enabling real time continuous exploration. To address those challenges we focus on two broad aspects: 1) Diversification of high dimensional large data sets and 2) Diversification of multiple user queries.

The existing work conducts diversification in two steps: first compute all relevant query results, and then diversify the query results to select a small diverse subset. Similarly, all the dimensional attributes of all the data points in a query result are considered for diversification. Such a generic approach would be a performance bottleneck in high-dimensional large databases. To efficiently compute diverse subsets of query results exhibiting both high dimensionality and high-cardinality, we have proposed the Progressive Diversification Scheme. Our proposed scheme, utilizes the partial distance computations to reduce the amount of CPU and I/O incurred during query diversification. Moreover, to avoid the overhead of computing all relevant results first, we propose embedding diversification in query evaluation step by utilizing column-based data storage systems. In addition to computational cost of diversification methods, we have also considered the complexity of diversity objective function in high dimensional databases. Often, computing diverse solutions along all the dimensions is not a realistic approach. In fact, users may have some pre-specified preferences over some dimensions of the data, while expecting good coverage over the other dimensions. Motivated by that need, we propose a novel scheme, which aims to generate representative data that balance the tradeoff between regret minimization and diversity maximization. Our scheme is based on a hybrid objective function that combines both regret and diversity. Additionally, it employs several algorithms that are designed to maximize that objective function.

We also address the diversification of multiple queries across and within user sessions. While, most of the current work is focused on the diversification of a single query result, we have proposed two scalable schemes for the diversification of both concurrent and sequential multiple queries. The results of the concurrent queries are available simultaneously and hence, provide an opportunity to exploit the overlap in those results. Our proposed algorithms leverage the natural overlap in search results in conjunction with the concurrent diversification of those overlapping results. In order to further reduce the processing costs of diversification, we have employed various approximation techniques that provide orders of magnitude reductions in cost, while maintaining a quality of diversification comparable to that of near optimal schemes. In contrast to the concurrent queries across user sessions, the queries within a single session are submitted sequentially at different intervals of time. Therefore, we have proposed a sequential diversification scheme that exploits the properties of data diversification functions while leveraging the natural overlap occurring between the results of different queries. Our proposed scheme relies on a regression model-based diversification method and an order based cache. In particular, we employ an adaptive regression model to estimate the diversity of a diverse subset. Such estimation of diversity value allows us to select diverse results without scanning all the query results. In order to further expedite the diversification process, we propose an order-based caching scheme to leverage the overlap between sequence of data exploration queries.

Our extensive experimental evaluation on both synthetic and real data sets shows the significant benefits provided by our proposed schemes as compared to existing diversification methods.
Keyword Data Exploration
Results Diversification
Query Optimisation
Representative Data

Document type: Thesis
Collections: UQ Theses (RHD) - Official
UQ Theses (RHD) - Open Access
Version Filter Type
Citation counts: Google Scholar Search Google Scholar
Created: Wed, 16 Nov 2016, 23:38:44 EST by Hina Khan on behalf of Learning and Research Services (UQ Library)