IEEE XPLORE PROJECT ABSTRACTS

Florida, Orlando, FL

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: May 2009, Volume: 21, Issue: 5, On page(s): 729-743, ISSN: 1041-4347,

INSPEC Accession Number: 10520358, Digital Object Identifier: 10.1109/TKDE.2008.188,

First Published: 2008-09-19, Current Version Published: 2009-03-24

ABSTRACT

Target search in content-based image retrieval (CBIR) systems refers to finding a specific (target) image, such as a particular registered logo or a specific historical photograph. Existing techniques, designed around query refinement based on relevance feedback, suffer from slow convergence and do not guarantee that the intended targets will be found. To address these limitations, we propose several efficient query point movement methods. We prove that our approach is able to reach any given target image with fewer iterations in the worst and average cases. We propose a new index structure and query processing technique to improve retrieval effectiveness and efficiency. We also consider strategies to minimize the effects of users’ inaccurate relevance feedback. Extensive experiments in simulated and realistic environments show that our approach significantly reduces the number of required iterations and improves overall retrieval performance. The experimental results also confirm that our approach can always retrieve intended targets, even with poor selection of initial query points.
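The abstract does not spell out the proposed query point movement methods; as a generic illustration of relevance-feedback-driven query refinement, the Python sketch below applies a classical Rocchio-style update, moving the query point toward images marked relevant and away from those marked non-relevant. The feature vectors, the weights alpha/beta/gamma, and the update rule itself are illustrative assumptions, not the authors' algorithm.

```python
import numpy as np

def rocchio_update(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query point toward relevant feedback images and away from
    non-relevant ones (classical Rocchio update, not the paper's method)."""
    q = alpha * query
    if len(relevant):
        q += beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q -= gamma * np.mean(nonrelevant, axis=0)
    return q

# Illustrative usage with random 8-dimensional feature vectors.
rng = np.random.default_rng(0)
query = rng.random(8)
relevant = rng.random((3, 8))      # features of images the user marked relevant
nonrelevant = rng.random((2, 8))   # features of images the user marked non-relevant
print(rocchio_update(query, relevant, nonrelevant))
```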


SCHEMA VACUUMING IN TEMPORAL DATABASES

Roddick, J.F.,  Sch. of Comput. Sci., Eng. & Math., Flinders Univ., Adelaide, SA

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: May 2009, Volume: 21, Issue: 5, On page(s): 744-747, ISSN: 1041-4347,

INSPEC Accession Number: 10520359, Digital Object Identifier: 10.1109/TKDE.2008.201,

First Published: 2008-09-26, Current Version Published: 2009-03-24

ABSTRACT

Temporal databases facilitate the support of historical information by providing functions for indicating the intervals during which a tuple was applicable (along one or more temporal dimensions). Because data are never deleted, only superseded, temporal databases are inherently append-only, resulting, over time, in a large historical sequence of database states. Data vacuuming in temporal databases allows for this sequence to be shortened by strategically, and irrevocably, deleting obsolete data. Schema versioning allows users to maintain a history of database schemata without compromising the semantics of the data or the ability to view data through historical schemata. While the techniques required for data vacuuming in temporal databases have been relatively well covered, the associated area of vacuuming schemata has received less attention. This paper discusses this issue and proposes a mechanism that fits well with existing methods for data vacuuming and schema versioning.


THE SUBGRAPH SIMILARITY PROBLEM

De Nardo, L.   Ranzato, F.   Tapparo, F.,  Dipt. di Mat. Pura ed Applicata, Univ. of Padova, Padova

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: May 2009, Volume: 21, Issue: 5, On page(s): 748-749, ISSN: 1041-4347,

INSPEC Accession Number: 10520360, Digital Object Identifier: 10.1109/TKDE.2008.205,

First Published: 2008-10-10, Current Version Published: 2009-03-24

ABSTRACT

Similarity is a well-known weakening of bisimilarity where one system is required to simulate the other and vice versa. It has been shown that the subgraph bisimilarity problem, a variation of the subgraph isomorphism problem where isomorphism is weakened to bisimilarity, is NP-complete. We show that the subgraph similarity problem and some related variations thereof remain NP-complete.


HISTOGRAM-BASED GLOBAL LOAD BALANCING IN STRUCTURED PEER-TO-PEER SYSTEMS

Quang Hieu Vu   Beng Chin Ooi   Rinard, M.   Kian-Lee Tan, Sch. of Comput., Nat. Univ. of Singapore, Singapore

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: April 2009, Volume: 21, Issue: 4, On page(s): 595-608, ISSN: 1041-4347,

INSPEC Accession Number: 10477132, Digital Object Identifier: 10.1109/TKDE.2008.182,

First Published: 2008-09-05, Current Version Published: 2009-02-24

ABSTRACT

Over the past few years, peer-to-peer (P2P) systems have rapidly grown in popularity and have become a dominant means for sharing resources. In these systems, load balancing is a key challenge because nodes are often heterogeneous. While several load-balancing schemes have been proposed in the literature, these solutions are typically ad hoc, heuristic based, and localized. In this paper, we present a general framework, HiGLOB, for global load balancing in structured P2P systems. Each node in HiGLOB has two key components:

(1) a histogram manager that maintains a histogram reflecting a global view of the distribution of the load in the system, and

(2) a load-balancing manager that redistributes the load whenever the node becomes overloaded or underloaded.

We exploit the routing metadata to partition the P2P network into nonoverlapping regions corresponding to the histogram buckets.

We propose mechanisms to keep the cost of constructing and maintaining the histograms low. We further show that our scheme can control and bound the amount of load imbalance across the system. Finally, we demonstrate the effectiveness of HiGLOB by instantiating it over three existing structured P2P systems: Skip Graph, BATON, and Chord. Our experimental results indicate that our approach works well in practice.
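The abstract leaves the concrete load-balancing rule to the paper; the sketch below (a minimal, assumed design, not HiGLOB's actual protocol) only illustrates what a histogram manager and an overload test based on an approximate global load view might look like. The bucket loads and the 1.5x imbalance threshold are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class HistogramManager:
    """Keeps an approximate global view of load, one value per network region
    (histogram bucket). A sketch of the idea only, not HiGLOB's protocol."""
    bucket_loads: list = field(default_factory=list)

    def average_load(self):
        return sum(self.bucket_loads) / len(self.bucket_loads)

    def is_overloaded(self, own_load, threshold=1.5):
        # A node considers itself overloaded if its load exceeds the
        # (approximate) global average by more than `threshold` times.
        return own_load > threshold * self.average_load()

hm = HistogramManager(bucket_loads=[10.0, 12.0, 8.0, 30.0])
print(hm.average_load())          # 15.0
print(hm.is_overloaded(40.0))     # True -> trigger load redistribution
print(hm.is_overloaded(12.0))     # False
```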

PROGRESSIVE PARAMETRIC QUERY OPTIMIZATION

Bizarro, P.   Bruno, N.   DeWitt, D.J.,  CISUC/DEI, Univ. of Coimbra, Coimbra

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: April 2009, Volume: 21, Issue: 4, On page(s): 582-594, ISSN: 1041-4347,

INSPEC Accession Number: 10477131, Digital Object Identifier: 10.1109/TKDE.2008.160,

First Published: 2008-08-01, Current Version Published: 2009-02-24

ABSTRACT

Commercial applications usually rely on pre-compiled parameterized procedures to interact with a database. Unfortunately, executing a procedure with a set of parameters different from those used at compilation time may be arbitrarily sub-optimal. Parametric query optimization (PQO) attempts to solve this problem by exhaustively determining the optimal plans at each point of the parameter space at compile time. However, PQO is likely not cost-effective if the query is executed infrequently or if it is executed with values only within a subset of the parameter space. In this paper we propose instead to progressively explore the parameter space and build a parametric plan during several executions of the same query. We introduce algorithms that, as parametric plans are populated, are able to frequently bypass the optimizer but still execute optimal or near-optimal plans.
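As a rough illustration of the idea of progressively populating a parametric plan and bypassing the optimizer on later executions, here is a minimal Python sketch; the discretization of the parameter space into fixed cells, the toy optimizer, and the chosen plans are illustrative assumptions rather than the algorithms introduced in the paper.

```python
def make_progressive_optimizer(optimize, cell_size=10):
    """Return a function that reuses previously optimized plans for nearby
    parameter values, calling `optimize` only on a cache miss (a sketch of
    progressive parametric query optimization, not the paper's algorithm)."""
    plan_cache = {}

    def get_plan(params):
        # Discretize the parameter space into cells; one cached plan per cell.
        cell = tuple(int(p // cell_size) for p in params)
        if cell not in plan_cache:
            plan_cache[cell] = optimize(params)   # expensive optimizer call
        return plan_cache[cell]

    return get_plan

# Illustrative "optimizer": choose an index scan for selective predicates.
def toy_optimize(params):
    selectivity = params[0]
    return "index_scan" if selectivity < 20 else "seq_scan"

get_plan = make_progressive_optimizer(toy_optimize)
print(get_plan((5,)))    # optimizer called, plan cached
print(get_plan((7,)))    # same cell -> optimizer bypassed
print(get_plan((55,)))   # new cell -> optimizer called again
```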


MULTISCALE REPRESENTATIONS FOR FAST PATTERN MATCHING IN STREAM TIME SERIES

Xiang Lian   Lei Chen   Yu, J.X.   Jinsong Han   Jian Ma, Dept. of Comput. Sci. & Eng., Hong Kong Univ. of Sci. & Technol., Kowloon

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: April 2009, Volume: 21, Issue: 4, On page(s): 568-581, ISSN: 1041-4347,

INSPEC Accession Number: 10479185, Digital Object Identifier: 10.1109/TKDE.2008.184,

First Published: 2008-09-12, Current Version Published: 2009-02-24

ABSTRACT

Similarity-based time-series retrieval has been a subject of long-term study due to its wide usage in many applications, such as financial data analysis, weather data forecasting, and multimedia data retrieval. Its original task was to find those time series similar to a pattern (query) time series, where both the pattern and data time series are static. Recently, with an increasing demand on stream data management, similarity-based stream time-series retrieval has raised new research issues due to its unique requirements during stream processing, such as one-pass search and fast response. In this paper, we address the problem of matching both static and dynamic patterns over stream time-series data. We develop a novel multiscale representation, called the multiscale segment mean, for stream time-series data, which can be incrementally computed and is thus well adapted to the stream characteristics. Most importantly, we propose a novel multistep filtering mechanism over the multiscale representation. Analysis indicates that the mechanism can greatly prune the search space and thus offer fast response. Furthermore, batch processing optimization and the dynamic case where patterns also come from stream time series are discussed. Extensive experiments show that the multiscale representation, together with the multistep filtering scheme, can efficiently filter out false candidates and detect patterns, compared to the multiscale wavelet representation.
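A minimal sketch of a multiscale segment-mean style representation is shown below: segment means are computed at successively coarser scales by averaging adjacent segments. The pairwise averaging and the number of levels are illustrative choices and do not reproduce the paper's exact construction or its incremental update.

```python
import numpy as np

def multiscale_segment_means(series, levels=3):
    """Compute segment means at successively coarser scales by averaging
    adjacent pairs, similar in spirit to a multiscale segment-mean
    representation (details differ from the paper's construction)."""
    reps = []
    current = np.asarray(series, dtype=float)
    for _ in range(levels):
        n = len(current) - len(current) % 2      # drop a trailing odd element
        current = current[:n].reshape(-1, 2).mean(axis=1)
        reps.append(current)
    return reps

series = [1, 2, 3, 4, 5, 6, 7, 8]
for level, rep in enumerate(multiscale_segment_means(series), start=1):
    print(f"level {level}: {rep}")
# level 1: [1.5 3.5 5.5 7.5]
# level 2: [2.5 6.5]
# level 3: [4.5]
```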


OPTIMAL LOT SIZING POLICIES FOR SEQUENTIAL ONLINE AUCTIONS

Tripathi, A.K.   Nair, S.K.   Karuga, G.G., Dept. of Inf. Syst. & Oper. Manage., Univ. of Washington, Seattle, WA

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: April 2009, Volume: 21, Issue: 4, On page(s): 554-567, ISSN: 1041-4347,

INSPEC Accession Number: 10477130, Digital Object Identifier: 10.1109/TKDE.2008.145

First Published: 2008-07-18, Current Version Published: 2009-02-24

ABSTRACT

This study proposes methods for determining the optimal lot sizes for sequential auctions that are conducted to sell sizable quantities of an item. Such auctions are fairly common in business-to-consumer (B2C) auctions. In these auctions, the tradeoff for the auctioneer is between the alacrity with which funds are received and the amount of funds collected by the faster clearing of inventory using larger lot sizes. Observed bids in these auctions impact the auctioneer’s decision on lot sizes in future auctions. We first present a goal programming approach for estimating the bid distribution for the bidder population from the observed bids, which are readily available in these auctions. We then develop models to compute optimal lot sizes for both stationary and non-stationary bid distributions. For stationary bid distributions, we present closed-form solutions and structural results. Our findings show that the optimal lot size increases with inventory holding costs and the number of bidders. Our model for non-stationary bid distributions captures inter-auction dynamics such as the number of bidders, their bids, past winning bids, and lot size. We use simulated data to test the robustness of our model.

A PURE NASH EQUILIBRIUM-BASED GAME THEORETICAL METHOD FOR DATA REPLICATION ACROSS MULTIPLE SERVERS

Khan, S.U.   Ahmad, I.,  Dept. of Electr. & Comput. Eng., North Dakota State Univ., Fargo, ND

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: April 2009, Volume: 21, Issue: 4, On page(s): 537-553, ISSN: 1041-4347,

INSPEC Accession Number: 10477129, Digital Object Identifier: 10.1109/TKDE.2008.171,

First Published: 2008-08-22, Current Version Published: 2009-02-24

ABSTRACT

This paper proposes a non-cooperative game-based technique to replicate data objects across a distributed system of multiple servers in order to reduce user-perceived Web access delays. In the proposed technique, computational agents represent servers and compete with each other to optimize the performance of their servers. The optimality of a non-cooperative game is typically described by Nash equilibrium, which is based on spontaneous and non-deterministic strategies. However, Nash equilibrium may or may not guarantee system-wide performance. Furthermore, there can be multiple Nash equilibria, making it difficult to decide which one is the best. In contrast, the proposed technique uses the notion of pure Nash equilibrium, which, if achieved, guarantees stable optimal performance. In the proposed technique, agents use deterministic strategies that work in conjunction with their self-interested nature but ensure system-wide performance enhancement. In general, the existence of a pure Nash equilibrium is hard to achieve, but we prove the existence of such an equilibrium in the proposed technique. The proposed technique is also experimentally compared against some well-known conventional replica allocation methods, such as branch and bound, greedy, and genetic algorithms.


ON THE DESIGN AND APPLICABILITY OF DISTANCE FUNCTIONS IN HIGH-DIMENSIONAL DATA SPACE

Chih-Ming Hsu   Ming-Syan Chen.,  Dept. of Electr. Eng., Nat. Taiwan Univ., Taipei

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: April 2009, Volume: 21, Issue: 4, On page(s): 523-536, ISSN: 1041-4347,

INSPEC Accession Number: 10477128, Digital Object Identifier: 10.1109/TKDE.2008.178,

First Published: 2008-08-29, Current Version Published: 2009-02-24

ABSTRACT

Effective distance functions in high-dimensional data space are very important in solutions for many data mining problems. Recent research has shown that if the Pearson variation of the distance distribution converges to zero with increasing dimensionality, the distance function becomes unstable (or meaningless) in high-dimensional space, even with the commonly used Lp metric in the Euclidean space. This result has spawned many studies along the same lines. However, the necessary condition for instability of a distance function, which is required for function design, remains unknown. In this paper, we prove that several important conditions are in fact equivalent to instability. Based on these theoretical results, we employ some effective and valid indices for testing the stability of a distance function. In addition, this theoretical analysis suggests that unstable phenomena are rooted in the variation of the distance distribution. To demonstrate the theoretical results, we design a meaningful distance function, called the shrinkage-divergence proximity (SDP), based on a given distance function. It is shown empirically that the SDP significantly outperforms other measures in terms of stability in high-dimensional data space, and is thus more suitable for distance-based clustering applications.
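The instability phenomenon itself is easy to observe empirically: as dimensionality grows, the relative spread of distances under the L2 metric shrinks. The short experiment below illustrates this concentration effect on uniform random data; it is only an illustration of the phenomenon, not the paper's SDP measure or its stability indices.

```python
import numpy as np

def relative_contrast(dim, n_points=500, seed=0):
    """Ratio (max - min) / min of L2 distances from the origin to uniform
    random points; it shrinks as dimensionality grows, illustrating the
    instability ("meaninglessness") of distances in high dimensions."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))
    dists = np.linalg.norm(points, axis=1)
    return (dists.max() - dists.min()) / dists.min()

for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(d), 3))
# The printed ratio decreases markedly as the dimensionality d grows.
```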

MINING PROJECTED CLUSTERS IN HIGH-DIMENSIONAL SPACES

Bouguessa, M.   Shengrui Wang.,  Dept. of Comput. Sci., Univ. of Sherbrooke, Sherbrooke, QC;

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: April 2009, Volume: 21, Issue: 4, On page(s): 507-522, ISSN: 1041-4347,

INSPEC Accession Number: 10477127, Digital Object Identifier: 10.1109/TKDE.2008.162,

First Published: 2008-08-01, Current Version Published: 2009-02-24

ABSTRACT

Clustering high-dimensional data has been a major challenge due to the inherent sparsity of the points. Most existing clustering algorithms become substantially inefficient if the required similarity measure is computed between data points in the full-dimensional space. To address this problem, a number of projected clustering algorithms have been proposed. However, most of them encounter difficulties when clusters hide in subspaces with very low dimensionality. These challenges motivate our effort to propose a robust partitional distance-based projected clustering algorithm. The algorithm consists of three phases. The first phase performs attribute relevance analysis by detecting dense and sparse regions and their location in each attribute. Starting from the results of the first phase, the goal of the second phase is to eliminate outliers, while the third phase aims to discover clusters in different subspaces. The clustering process is based on the k-means algorithm, with the computation of distance restricted to subsets of attributes where object values are dense. Our algorithm is capable of detecting projected clusters of low dimensionality embedded in a high-dimensional space and avoids the computation of the distance in the full-dimensional space. The suitability of our proposal has been demonstrated through an empirical study using synthetic and real datasets.
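The central ingredient described above, a k-means-style assignment in which distances are restricted to each cluster's relevant attributes, can be sketched as follows; the relevance masks, centroids, and sample points are illustrative assumptions, and the attribute-relevance and outlier-elimination phases of the algorithm are not shown.

```python
import numpy as np

def projected_distance(point, centroid, relevant_dims):
    """Euclidean distance restricted to the cluster's relevant attributes,
    as used in projected (subspace) clustering."""
    d = np.asarray(relevant_dims, dtype=bool)
    return np.linalg.norm(point[d] - centroid[d])

def assign(points, centroids, masks):
    """Assign each point to the cluster with the smallest projected distance."""
    labels = []
    for p in points:
        dists = [projected_distance(p, c, m) for c, m in zip(centroids, masks)]
        labels.append(int(np.argmin(dists)))
    return labels

points = np.array([[0.1, 5.0, 0.2], [0.2, -3.0, 0.1], [9.0, 0.0, 9.1]])
centroids = np.array([[0.15, 0.0, 0.15], [9.0, 0.0, 9.0]])
masks = [[True, False, True],   # cluster 0 lives in attributes 0 and 2
         [True, False, True]]   # cluster 1 lives in attributes 0 and 2
print(assign(points, centroids, masks))   # [0, 0, 1]
```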


IMINE: INDEX SUPPORT FOR ITEM SET MINING

Baralis, E.   Cerquitelli, T.   Chiusano, S.,  Dipt. di Autom. e Inf., Politec. di Torino, Torino

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: April 2009, Volume: 21, Issue: 4, On page(s): 493-506, ISSN: 1041-4347,

INSPEC Accession Number: 10477126, Digital Object Identifier: 10.1109/TKDE.2008.180,

First Published: 2008-08-29, Current Version Published: 2009-02-24

ABSTRACT

This paper presents the IMine index, a general and compact structure which provides tight integration of item set extraction in a relational DBMS. Since no constraint is enforced during the index creation phase, IMine provides a complete representation of the original database. To reduce the I/O cost, data accessed together during the same extraction phase are clustered on the same disk block. The IMine index structure can be efficiently exploited by different item set extraction algorithms. In particular, IMine data access methods currently support the FP-growth and LCM v.2 algorithms, but they can straightforwardly support the enforcement of various constraint categories. The IMine index has been integrated into the PostgreSQL DBMS and exploits its physical-level access methods. Experiments, run for both sparse and dense data distributions, show the efficiency of the proposed index and its linear scalability also for large datasets. Item set mining supported by the IMine index shows performance always comparable with, and sometimes better than, state-of-the-art algorithms accessing data on flat files.

COMPRESSION AND AGGREGATION FOR LOGISTIC REGRESSION ANALYSIS IN DATA CUBES

Ruibin Xi   Nan Lin   Yixin Chen.,  Dept. of Math., Washington Univ., St. Louis, MO

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: April 2009, Volume: 21, Issue: 4, On page(s): 479-492, ISSN: 1041-4347,

INSPEC Accession Number: 10477125, Digital Object Identifier: 10.1109/TKDE.2008.186,

First Published: 2008-09-12, Current Version Published: 2009-02-24

ABSTRACT

Logistic regression is an important technique for analyzing and predicting data with categorical attributes. In this paper, we consider supporting online analytical processing (OLAP) of logistic regression analysis for multi-dimensional data in a data cube, where it is expensive in time and space to build logistic regression models for each cell from the raw data. We propose a novel scheme to compress the data in such a way that we can reconstruct logistic regression models to answer any OLAP query without accessing the raw data. Based on a first-order approximation to the maximum likelihood estimating equations, we develop a compression scheme that compresses each base cell into a small compressed data block with essential information to support the aggregation of logistic regression models. Aggregation formulae for deriving high-level logistic regression models from lower-level component cells are given. We prove that the compression is nearly lossless in the sense that the aggregated estimator deviates from the true model by an error that is bounded and approaches zero as the data size increases. The results show that the proposed compression and aggregation scheme can make OLAP of logistic regression in a data cube feasible. Further, it supports real-time logistic regression analysis of stream data, which can only be scanned once and cannot be permanently retained. Experimental results validate our theoretical analysis and demonstrate that our method can dramatically save time and space costs with almost no degradation of the modeling accuracy.
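The abstract does not give the aggregation formulae; one standard way to combine per-cell logistic regression estimates (shown below purely as an assumed illustration, not necessarily the paper's scheme) is to weight each cell's coefficient vector by its observed information matrix and solve the combined system.

```python
import numpy as np

def aggregate_logistic(betas, infos):
    """Combine per-cell logistic regression coefficient vectors `betas`
    using their information matrices `infos` (X^T W X at the estimate):
    beta_agg = (sum_i H_i)^{-1} sum_i H_i beta_i.
    An information-weighted average; a sketch, not the paper's exact scheme."""
    H_total = sum(infos)
    rhs = sum(H @ b for H, b in zip(infos, betas))
    return np.linalg.solve(H_total, rhs)

# Illustrative per-cell summaries (two cells, two coefficients each).
betas = [np.array([0.8, -1.2]), np.array([1.0, -0.9])]
infos = [np.array([[50.0, 5.0], [5.0, 40.0]]),
         np.array([[20.0, 2.0], [2.0, 15.0]])]
print(aggregate_logistic(betas, infos))
```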

A GENERIC LOCAL ALGORITHM FOR MINING DATA STREAMS IN LARGE DISTRIBUTED SYSTEMS

Wolff, R.   Bhaduri, K.   Kargupta, H., Dept. of Manage. Inf. Syst., Haifa Univ., Haifa

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: April 2009, Volume: 21, Issue: 4, On page(s): 465-478, ISSN: 1041-4347,

INSPEC Accession Number: 10477124, Digital Object Identifier: 10.1109/TKDE.2008.169,

First Published: 2008-08-22, Current Version Published: 2009-02-24

ABSTRACT

In a large network of computers or wireless sensors, each of the components (henceforth, peers) has some data about the global state of the system. Much of the system’s functionality, such as message routing, information retrieval and load sharing, relies on modeling the global state. We refer to the outcome of the function (e.g., the load experienced by each peer) as the model of the system. Since the state of the system is constantly changing, it is necessary to keep the models up-to-date. Computing global data mining models (e.g., decision trees, k-means clustering) in large distributed systems may be very costly due to the scale of the system and due to communication cost, which may be high. The cost further increases in a dynamic scenario when the data changes rapidly. In this paper we describe a two-step approach for dealing with these costs. First, we describe a highly efficient local algorithm which can be used to monitor a wide class of data mining models. Then, we use this algorithm as a feedback loop for the monitoring of complex functions of the data such as its k-means clustering. The theoretical claims are corroborated with a thorough experimental analysis.

IMPROVING PERSONALIZATION SOLUTIONS THROUGH OPTIMAL SEGMENTATION OF CUSTOMER BASES

Tianyi Jiang   Tuzhilin, A.,  AvePoint Inc., Jersey, NJ

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on publication date: March 2009, Volume: 21, Issue: 3, On page(s): 305-320, Location: Los Angeles, CA, USA, ISSN: 1041-4347,

INSPEC Accession Number: 10416748, Digital Object Identifier: 10.1109/TKDE.2008.163

First Published: 2008-08-08, Current Version Published: 2009-01-23

ABSTRACT

On the Web, where the search costs are low and the competition is just a mouse click away, it is crucial to segment customers intelligently in order to offer more targeted and personalized products and services to them. Traditionally, customer segmentation is achieved using statistics-based methods that compute a set of statistics from the customer data and group customers into segments by applying distance-based clustering algorithms in the space of these statistics. In this paper, we present a direct grouping-based approach to computing customer segments that groups customers not based on computed statistics, but in terms of optimally combining transactional data of several customers to build a data mining model of customer behavior for each group. Then, building customer segments becomes a combinatorial optimization problem of finding the best partitioning of the customer base into disjoint groups. This paper shows that finding an optimal customer partition is NP-hard, proposes several suboptimal direct grouping segmentation methods, and empirically compares them with each other, with traditional statistics-based hierarchical and affinity-propagation-based segmentation, and with one-to-one methods across multiple experimental conditions. It is shown that the best direct grouping method significantly dominates the statistics-based and one-to-one approaches across most of the experimental conditions, while still being computationally tractable. It is also shown that the distribution of the sizes of customer segments generated by the best direct grouping method follows a power law distribution and that microsegmentation provides the best approach to personalization.

EFFECTIVE AND EFFICIENT QUERY PROCESSING FOR VIDEO SUBSEQUENCE IDENTIFICATION

Heng Tao Shen   Jie Shao   Zi Huang   Xiaofang Zhou., Sch. of Inf. Technol. & Electr. Eng., Univ. of Queensland, Brisbane, QLD

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: March 2009, Volume: 21, Issue: 3, On page(s): 321-334,Location: Los Angeles, CA, USA,

ISSN: 1041-4347, INSPEC Accession Number: 10416749, Digital Object Identifier: 10.1109/TKDE.2008.168

First Published: 2008-08-15, Current Version Published: 2009-01-23

ABSTRACT

With the growing demand for visual information of rich content, effective and efficient manipulations of large video databases are increasingly desired. Many investigations have been made on content-based video retrieval. However, despite its importance, video subsequence identification, which is to find content similar to a short query clip within a long video sequence, has not been well addressed. This paper presents a graph transformation and matching approach to this problem, with an extension to identify occurrences of potentially different ordering or length due to content editing. With a novel batch query algorithm to retrieve similar frames, the mapping relationship between the query and database video is first represented by a bipartite graph. The densely matched parts along the long sequence are then extracted, followed by a filter-and-refine search strategy to prune some irrelevant subsequences. During the filtering stage, maximum size matching is deployed for each subgraph constructed by the query and candidate subsequence to obtain a smaller set of candidates. During the refinement stage, sub-maximum similarity matching is devised to identify the subsequence with the highest aggregate score from all candidates, according to a robust video similarity model that incorporates visual content, temporal order, and frame alignment information. The performance studies conducted on a long video recording of 50 hours validate that our approach is promising in terms of both search accuracy and speed.
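The filtering stage relies on maximum size matching over a bipartite graph of similar frames; a compact augmenting-path (Kuhn's algorithm) implementation of maximum bipartite matching is sketched below on a toy frame-similarity graph. The graph and the notion of "similar" edges are illustrative assumptions; the paper's batch query and refinement steps are not shown.

```python
def max_bipartite_matching(adj, n_right):
    """Maximum-size matching in a bipartite graph via augmenting paths
    (Kuhn's algorithm). `adj[u]` lists right-side vertices similar to the
    left-side (query) frame u. Returns the number of matched pairs."""
    match_right = [-1] * n_right          # right vertex -> matched left vertex

    def try_augment(u, seen):
        for v in adj[u]:
            if not seen[v]:
                seen[v] = True
                if match_right[v] == -1 or try_augment(match_right[v], seen):
                    match_right[v] = u
                    return True
        return False

    matched = 0
    for u in range(len(adj)):
        if try_augment(u, [False] * n_right):
            matched += 1
    return matched

# Toy example: 3 query frames, 3 candidate frames; edges mean "visually similar".
adj = [[0, 1], [0], [2]]
print(max_bipartite_matching(adj, n_right=3))   # 3
```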

AUTOMATICALLY DETERMINING THE NUMBER OF CLUSTERS IN UNLABELED DATA SETS

Liang Wang   Leckie, C.   Ramamohanarao, K.   Bezdek, J., Dept. of Comput. Sci. & Software Eng., Univ. of Melbourne, Melbourne, VIC

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: March 2009, Volume: 21, Issue: 3, On page(s): 335-350, Location: Los Angeles, CA, USA,

ISSN: 1041-4347, INSPEC Accession Number: 10416751

Digital Object Identifier: 10.1109/TKDE.2008.158, Current Version Published: 2009-01-23

ABSTRACT

Clustering is a popular tool for exploratory data analysis. One of the major problems in cluster analysis is the determination of the number of clusters in unlabeled data, which is a basic input for most clustering algorithms. In this paper we investigate a new method called DBE (dark block extraction) for automatically estimating the number of clusters in unlabeled data sets. DBE builds on an existing algorithm for visual assessment of cluster tendency (VAT) of a data set and uses several common image and signal processing techniques. Its basic steps are:

(1) generating a VAT image of an input dissimilarity matrix;

(2) performing image segmentation on the VAT image to obtain a binary image, followed by directional morphological filtering;

(3) applying a distance transform to the filtered binary image and projecting the pixel values onto the main diagonal axis of the image to form a projection signal;

(4) smoothing the projection signal, computing its first-order derivative, and then detecting major peaks and valleys in the resulting signal to decide the number of clusters.

Our new DBE method is nearly “automatic”, depending on just one easy-to-set parameter. Several numerical and real-world examples are presented to illustrate the effectiveness of DBE.
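Step (4) above, smoothing the projection signal and counting its major peaks, can be illustrated with a small numpy sketch on a synthetic signal; the moving-average smoothing, the 0.5 height threshold, and the synthetic three-bump signal are illustrative assumptions and stand in for DBE's actual derivative-based peak/valley detection.

```python
import numpy as np

def count_major_peaks(signal, smooth_width=5, min_height=0.5):
    """Smooth a 1D projection signal with a moving average, then count local
    maxima whose height exceeds `min_height` times the global maximum.
    A simplified stand-in for DBE's smoothing/derivative/peak-detection step."""
    kernel = np.ones(smooth_width) / smooth_width
    s = np.convolve(signal, kernel, mode="same")
    peaks = 0
    for i in range(1, len(s) - 1):
        if s[i] > s[i - 1] and s[i] >= s[i + 1] and s[i] >= min_height * s.max():
            peaks += 1
    return peaks

# Synthetic projection signal with three bumps (i.e., three "dark blocks").
x = np.linspace(0, 1, 300)
signal = (np.exp(-((x - 0.2) ** 2) / 0.002)
          + np.exp(-((x - 0.5) ** 2) / 0.002)
          + np.exp(-((x - 0.8) ** 2) / 0.002))
print(count_major_peaks(signal))   # 3
```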

EFFICIENT PROCESSING OF METRIC SKYLINE QUERIES

Lei Chen   Xiang Lian , Dept. of Comput. Sci. & Eng., Hong Kong Univ. of Sci. & Technol., Hong Kong

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: March 2009, Volume: 21, Issue: 3, On page(s): 351-365, Location: Los Angeles, CA, USA, ISSN: 1041-4347, INSPEC Accession Number: 10416744, Digital Object Identifier: 10.1109/TKDE.2008.146,First Published: 2008-07-18, Current Version Published: 2009-01-23

ABSTRACT

Skyline query is of great importance in many applications, such as multi-criteria decision making and business planning. In particular, a skyline point is a data object in the database whose attribute vector is not dominated by that of any other object. Previous methods to retrieve skyline points usually assume static data objects in the database (i.e. their attribute vectors are fixed), whereas several recent works focus on skyline queries with dynamic attributes. In this paper, we propose a novel variant of skyline queries, namely the metric skyline, whose dynamic attributes are defined in the metric space (i.e. not limited to the Euclidean space). We illustrate an efficient and effective pruning mechanism to answer metric skyline queries through a metric index. Most importantly, we formalize the query performance of the metric skyline query in terms of the pruning power, by a cost model, in light of which we construct an optimized metric index aiming to maximize the pruning power of metric skyline queries. Extensive experiments have demonstrated the efficiency and effectiveness of our proposed pruning techniques as well as the constructed index in answering metric skyline queries.
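For concreteness, a brute-force version of the metric skyline definition given above might look as follows: each object's dynamic attributes are its distances to the query objects, and an object is reported if no other object dominates it. This naive baseline ignores the paper's metric index and pruning; the sample objects and query points are illustrative.

```python
import numpy as np

def dominates(a, b):
    """a dominates b if a is no worse in every attribute and strictly better
    in at least one (smaller is better here)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def metric_skyline(objects, queries, dist=lambda u, v: np.linalg.norm(u - v)):
    """Brute-force metric skyline: each object's dynamic attributes are its
    distances to the query objects; return objects not dominated by any other.
    A naive baseline, not the paper's index-based pruning algorithm."""
    attrs = [np.array([dist(o, q) for q in queries]) for o in objects]
    return [o for i, o in enumerate(objects)
            if not any(dominates(attrs[j], attrs[i])
                       for j in range(len(objects)) if j != i)]

objects = [np.array(p) for p in [(0.0, 0.0), (1.0, 1.0), (3.0, 3.0)]]
queries = [np.array(p) for p in [(0.0, 1.0), (2.0, 0.0)]]
print(metric_skyline(objects, queries))   # only (1.0, 1.0) survives
```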


ON THE EFFECT OF LOCATION UNCERTAINTY IN SPATIAL QUERYING

Frentzos, E.   Gratsias, K.   Theodoridis, Y., Univ. of Piraeus, Piraeus

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on publication date: March 2009, Volume: 21, Issue: 3, On page(s): 366-383, Location: Los Angeles, CA, USA, ISSN: 1041-4347

INSPEC Accession Number: 10416746, Digital Object Identifier: 10.1109/TKDE.2008.164,

First Published: 2008-08-08, Current Version Published: 2009-01-23

ABSTRACT

An emerging topic in the field of spatial data management is the handling of location uncertainty of spatial objects, mainly due to inaccurate measurements. The literature on location uncertainty so far has focused on modifying traditional spatial search algorithms in order to handle the impact of objects’ location uncertainty in query results. In this paper, we present the first, to the best of our knowledge, theoretical analysis that estimates the average number of false hits introduced in the results of rectangular range queries in the case of data points uniformly distributed in 2D space. Then, we relax the original distribution assumptions showing how to deal with arbitrarily distributed data points and more realistic location uncertainty distributions. The accuracy of the results of our analytical approach is demonstrated through an extensive experimental study using various synthetic and real datasets. Our proposal can be directly employed in spatial database systems in order to provide users with the accuracy of spatial query results based only on known dataset and query parameters.


DISTRIBUTED SKYLINE RETRIEVAL WITH LOW BANDWIDTH CONSUMPTION

Lin Zhu   Yufei Tao   Shuigeng Zhou., Fudan Univ., Shanghai

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: March 2009, Volume: 21, Issue: 3, on page(s): 384-400, Location: Los Angeles, CA, USA, ISSN: 1041-4347

INSPEC Accession Number: 10416743, Digital Object Identifier: 10.1109/TKDE.2008.142,

First Published: 2008-07-15, Current Version Published: 2009-01-23

ABSTRACT

We consider skyline computation when the underlying data set is horizontally partitioned onto geographically distant servers that are connected to the Internet. The existing solutions are not suitable for our problem, because they have at least one of the following drawbacks: (1) applicable only to distributed systems adopting vertical partitioning or restricted horizontal partitioning, (2) effective only when each server has limited computing and communication abilities, and (3) optimized only for skyline search in subspaces but inefficient in the full space. This paper proposes an algorithm, called feedback-based distributed skyline (FDS), to support arbitrary horizontal partitioning. FDS aims at minimizing the network bandwidth, measured in the number of tuples transmitted over the network. The core of FDS is a novel feedback-driven mechanism, where the coordinator iteratively transmits certain feedback to each participant. Participants can leverage such information to prune a large amount of local data, which otherwise would need to be sent to the coordinator. Extensive experimentation confirms that FDS significantly outperforms alternative approaches in both effectiveness and progressiveness.


SEMQA: SPARQL WITH IDEMPOTENT DISJUNCTION

Shironoshita, E.P.   Jean-Mary, Y.R.   Bradley, R.M.   Kabuka, M.R., INFO-TECH Soft Inc., Miami, FL

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: March 2009, Volume: 21, Issue: 3, On page(s): 401-414, Location: Los Angeles, CA, USA, ISSN: 1041-4347

INSPEC Accession Number: 10416750, Digital Object Identifier: 10.1109/TKDE.2008.91,

First Published: 2008-05-12, Current Version Published: 2009-01-23

ABSTRACT

The SPARQL LeftJoin abstract operator is not distributive over Union; this limits the algebraic manipulation of graph patterns, which in turn restricts the ability to create query plans for distributed processing or query optimization. In this paper, we present semQA, an algebraic extension for the SPARQL query language for RDF, which overcomes this issue by transforming graph patterns through the use of an idempotent disjunction operator, "or", as a substitute for Union. This permits the application of a set of equivalences that transform a query into distinct forms. We further present an algorithm to derive the solution set of the original query from the solution set of a query where Union has been substituted by "or". We also analyze the combined complexity of SPARQL, proving it to be NP-complete. It is also shown that the SPARQL query language is not, in the general case, fixed-parameter tractable. Experimental results are presented to validate the query evaluation methodology presented in this paper against the SPARQL standard, to corroborate the complexity analysis, and to illustrate the gains in processing cost reduction that can be obtained through the application of semQA.


DETECTING, ASSESSING AND MONITORING RELEVANT TOPICS IN VIRTUAL INFORMATION ENVIRONMENTS

Ontrup, J.   Ritter, H.   Scholz, S.W.   Wagner, R.,  Fac. of Technol., Bielefeld Univ., Bielefeld;

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: March 2009, Volume: 21, Issue: 3, On page(s): 415-427, ISSN: 1041-4347,

INSPEC Accession Number: 10458278, Digital Object Identifier: 10.1109/TKDE.2008.149

First Published: 2008-07-18, Current Version Published: 2009-01-23

ABSTRACT

The ability to assess the relevance of topics and related sources in information-rich environments is a key to success when scanning business environments. This paper introduces a hybrid system to support managerial information gathering. The system is made up of three components: 1) a hierarchical hyperbolic SOM for structuring the information environment and visualizing the intensity of news activity with respect to identified topics, 2) a spreading activation network for the selection of the most relevant information sources with respect to an already existing knowledge infrastructure, and 3) measures of interestingness for association rules, together with statistical testing, which facilitate the monitoring of already identified topics. Embedding the system in a framework describing three modes of human information-seeking behavior endorses an active organization, exploration, and selection of information that matches the needs of decision makers in all stages of the information gathering process. By applying our system in the domain of the hotel industry we demonstrate how typical information gathering tasks are supported. Moreover, we present an empirical study investigating the effectiveness and efficiency of the visualization framework of our system.


DISTRIBUTIONAL FEATURES FOR TEXT CATEGORIZATION

Xiao-Bing Xue   Zhi-Hua Zhou.,  Nanjing Univ., Nanjing

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: March 2009, Volume: 21, Issue: 3, On page(s): 428-442, Location: Los Angeles, CA, USA, ISSN: 1041-4347, INSPEC Accession Number: 10416747, Digital Object Identifier: 10.1109/TKDE.2008.166,

First Published: 2008-08-08, Current Version Published: 2009-01-23

ABSTRACT

Text categorization is the task of assigning predefined categories to natural language text. With the widely used ‘bag of words’ representation, previous research usually assigns a word values indicating whether the word appears in the document concerned or how frequently it appears. Although these values are useful for text categorization, they do not fully express the abundant information contained in the document. This paper explores the effect of other types of values, which express the distribution of a word in the document. These novel values assigned to a word are called distributional features, which include the compactness of the appearances of the word and the position of the first appearance of the word. The proposed distributional features are exploited by a tf-idf style equation, and different features are combined using ensemble learning techniques. Experiments show that the distributional features are useful for text categorization. In contrast to using the traditional term frequency values alone, including the distributional features requires only a little additional cost, while the categorization performance can be significantly improved. Further analysis shows that the distributional features are especially useful when documents are long and the writing style is casual.
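The two distributional features named above can be computed directly from token positions. The sketch below uses simple stand-in definitions (normalized position of the first appearance, and the standard deviation of appearance positions as a proxy for compactness); these are illustrative choices, not the paper's exact formulas.

```python
import statistics

def distributional_features(tokens, word):
    """Return (first_position, compactness) of `word` in the token list.
    first_position: index of the first appearance, normalized by document length.
    compactness: standard deviation of the (normalized) appearance positions
    (0 when the word occurs once). Illustrative definitions only."""
    positions = [i / len(tokens) for i, t in enumerate(tokens) if t == word]
    if not positions:
        return None
    first_pos = positions[0]
    compactness = statistics.pstdev(positions) if len(positions) > 1 else 0.0
    return first_pos, compactness

doc = "the cat sat on the mat and the cat slept".split()
print(distributional_features(doc, "cat"))   # appears early and is spread out
print(doc.count("cat") / len(doc))           # the traditional frequency value
```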


THE DEVELOPMENT OF FUZZY ROUGH SETS WITH THE USE OF STRUCTURES AND ALGEBRAS OF AXIOMATIC FUZZY SETS

Xiaodong Liu   Pedrycz, W.   Tianyou Chai   Mingli Song, Res. Center of Inf. & Control, Dalian Univ. of Technol., Dalian

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: March 2009, Volume: 21, Issue: 3, On page(s): 443-462, Location: Los Angeles, CA, USA, ISSN: 1041-4347, INSPEC Accession Number: 10416745, Digital Object Identifier: 10.1109/TKDE.2008.147

First Published: 2008-07-18, Current Version Published: 2009-01-23

ABSTRACT

The notion of a rough set, originally proposed by Pawlak, has undergone a number of extensions and generalizations. Dubois and Prade (1990) introduced fuzzy rough sets, which combine rough sets and fuzzy sets within a single framework. Radzikowska and Kerre (2002) proposed a broad family of fuzzy rough sets, referred to as (phi, t)-fuzzy rough sets, which are determined by some implication operator (implicator) phi and a certain t-norm. In order to describe the linguistically represented concepts coming from data available in some information system, the concept of fuzzy rough sets is redefined and further studied in the setting of the axiomatic fuzzy set (AFS) theory. Compared with the (phi, t)-fuzzy rough sets, the advantages of AFS fuzzy rough sets are twofold. They can be directly applied to data analysis present in any information system without resorting to the details concerning the choice of the implication phi, the t-norm and a similarity relation S. Furthermore, such rough approximations of fuzzy concepts come with a well-defined semantics and therefore offer a sound interpretation. Some examples are included to illustrate the effectiveness of the proposed construct. It is shown that the AFS fuzzy rough sets provide far higher flexibility and effectiveness in comparison with rough sets and some of their generalizations.

LEARNING IMAGE-TEXT ASSOCIATIONS

Tao Jiang   Ah-Hwee Tan, ecPresence Technol. Pte Ltd., Singapore;

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Feb. 2009, Volume: 21, Issue: 2, On page(s): 161-177, Location: Los Angeles, CA, USA,ISSN: 1041-4347

INSPEC Accession Number: 10370659, Digital Object Identifier: 10.1109/TKDE.2008.150

First Published: 2008-08-01, Current Version Published: 2008-12-30

ABSTRACT

Web information fusion can be defined as the problem of collating and tracking information related to specific topics on the World Wide Web. Whereas most existing work on Web information fusion has focused on text-based multidocument summarization, this paper concerns the topic of image and text association, a cornerstone of cross-media Web information fusion. Specifically, we present two learning methods for discovering the underlying associations between images and texts based on small training data sets. The first method based on vague transformation measures the information similarity between the visual features and the textual features through a set of predefined domain-specific information categories. Another method uses a neural network to learn direct mapping between the visual and textual features by automatically and incrementally summarizing the associated features into a set of information templates. Despite their distinct approaches, our experimental results on a terrorist domain document set show that both methods are capable of learning associations between images and texts from a small training data set.


DECOMPOSITIONAL RULE EXTRACTION FROM SUPPORT VECTOR MACHINES BY ACTIVE LEARNING

Martens, D.   Baesens, B.B.   Van Gestel, T., Dept. of Bus. Adm. & Public Manage., Hogeschool Gent, Ghent

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Feb. 2009, Volume: 21, Issue: 2, On page(s): 178-191, Location: Los Angeles, CA, USA, ISSN: 1041-4347

INSPEC Accession Number: 10370654, Digital Object Identifier: 10.1109/TKDE.2008.131,

First Published: 2008-07-15, Current Version Published: 2008-12-30

ABSTRACT

Support vector machines (SVMs) are currently state-of-the-art for the classification task and, generally speaking, exhibit good predictive performance due to their ability to model nonlinearities. However, their strength is also their main weakness, as the generated nonlinear models are typically regarded as incomprehensible black-box models. In this paper, we propose a new active learning-based approach (ALBA) to extract comprehensible rules from opaque SVM models. Through rule extraction, some insight is provided into the logic of the SVM model. ALBA extracts rules from the trained SVM model by explicitly making use of key concepts of the SVM: the support vectors, and the observation that these are typically close to the decision boundary. Active learning implies a focus on apparent problem areas, which for rule induction techniques are the regions close to the SVM decision boundary where most of the noise is found. By generating extra data close to these support vectors that are provided with a class label by the trained SVM model, rule induction techniques are better able to discover suitable discrimination rules. This performance increase, in terms of both predictive accuracy and comprehensibility, is confirmed in our experiments where we apply ALBA to several publicly available data sets.
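The ALBA idea, generating extra examples near the support vectors, labeling them with the trained SVM, and handing the enlarged set to a rule learner, can be approximated in a few lines with scikit-learn; here a shallow decision tree stands in for the rule induction technique, and the data set and perturbation scale are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Train an opaque (RBF-kernel) SVM on a toy data set.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
svm = SVC(kernel="rbf", gamma="scale").fit(X, y)

# Generate extra points close to the support vectors (near the decision
# boundary) and label them with the trained SVM, in the spirit of
# active-learning-based rule extraction; the 0.1 noise scale is illustrative.
rng = np.random.default_rng(0)
extra = svm.support_vectors_ + 0.1 * rng.standard_normal(svm.support_vectors_.shape)
extra_labels = svm.predict(extra)

# A shallow decision tree plays the role of the rule induction technique here.
X_aug = np.vstack([X, extra])
y_aug = np.concatenate([y, extra_labels])
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_aug, y_aug)
print("fidelity to the SVM:", tree.score(X, svm.predict(X)))
```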

MULTICLASS MTS FOR SIMULTANEOUS FEATURE SELECTION AND CLASSIFICATION

Chao-Ton Su   Yu-Hsiang Hsiao, Nat. Tsing Hua Univ., Hsinchu

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Feb. 2009, Volume: 21, Issue: 2, On page(s): 192-205, Location: Los Angeles, CA, USA, ISSN: 1041-4347,

INSPEC Accession Number: 10370662, Digital Object Identifier: 10.1109/TKDE.2008.128,

First Published: 2008-07-15, Current Version Published: 2008-12-30

ABSTRACT

Multiclass Mahalanobis-Taguchi system (MMTS), the extension of MTS, is developed for simultaneous multiclass classification and feature selection. In MMTS, the multiclass measurement scale is constructed by establishing an individual Mahalanobis space for each class. To increase the validity of the measurement scale, the Gram-Schmidt process is performed to mutually orthogonalize the features and eliminate the multicollinearity. The important features are identified using the orthogonal arrays and the signal-to-noise ratio, and are then used to construct a reduced model measurement scale. The contribution of each important feature to classification is also derived according to the effect gain to develop a weighted Mahalanobis distance which is finally used as the distance metric for the classification of MMTS. Using the reduced model measurement scale, an unknown example will be classified into the class with minimum weighted Mahalanobis distance considering only the important features. For evaluating the effectiveness of MMTS, a numerical experiment is implemented, and the results show that MMTS outperforms other well-known algorithms not only on classification accuracy but also on feature selection efficiency. Finally, a real case about gestational diabetes mellitus is studied, and the results indicate the practicality of MMTS in real-world applications.
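The final classification rule described above assigns an example to the class with the smallest weighted Mahalanobis distance over the selected features. A minimal version of such a rule is sketched below; the feature weights are treated as given, and the Gram-Schmidt orthogonalization and orthogonal-array feature selection of MMTS are not reproduced.

```python
import numpy as np

def weighted_mahalanobis(x, mean, cov_inv, weights):
    """Weighted Mahalanobis distance of x from a class, using per-feature
    weights (e.g., importances obtained from feature selection).
    A simplified sketch, not MMTS's exact construction."""
    diff = (x - mean) * np.sqrt(weights)
    return float(diff @ cov_inv @ diff)

def classify(x, class_stats, weights):
    """Assign x to the class with the minimum weighted Mahalanobis distance.
    `class_stats` maps class label -> (mean, inverse covariance)."""
    return min(class_stats,
               key=lambda c: weighted_mahalanobis(x, *class_stats[c], weights))

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, size=(100, 2))       # samples of class "A"
b = rng.normal(3.0, 1.0, size=(100, 2))       # samples of class "B"
class_stats = {lab: (s.mean(axis=0), np.linalg.inv(np.cov(s.T)))
               for lab, s in (("A", a), ("B", b))}
weights = np.array([1.0, 0.5])                 # illustrative feature importances
print(classify(np.array([0.2, -0.1]), class_stats, weights))   # "A"
print(classify(np.array([2.8, 3.1]), class_stats, weights))    # "B"
```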


K-ANONYMIZATION WITH MINIMAL LOSS OF INFORMATION

Gionis, A.   Tassa, T., Yahoo! Res., Barcelona

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on publication date: Feb. 2009, Volume: 21, Issue: 2, On page(s): 206-219, Location: Los Angeles, CA, USA, ISSN: 1041-4347,

INSPEC Accession Number: 10399653,Digital Object Identifier: 10.1109/TKDE.2008.129,

First Published: 2008-07-15, Current Version Published: 2008-12-30

ABSTRACT

The technique of k-anonymization allows the releasing of databases that contain personal information while ensuring some degree of individual privacy. Anonymization is usually performed by generalizing database entries. We formally study the concept of generalization, and propose three information-theoretic measures for capturing the amount of information that is lost during the anonymization process. The proposed measures are more general and more accurate than those that were proposed by Meyerson and Williams and Aggarwal et al. We study the problem of achieving k-anonymity with minimal loss of information. We prove that it is NP-hard and study polynomial approximations for the optimal solution. Our first algorithm gives an approximation guarantee of O(ln k) for two of our measures as well as for the previously studied measures. This improves upon the best-known O(k)-approximation. While the previous approximation algorithms relied on the graph representation framework, our algorithm relies on a novel hypergraph representation that enables the improvement in the approximation ratio from O(k) to O(ln k). As the running time of the algorithm is O(n^{2k}), we also show how to adapt the earlier algorithm in order to obtain an O(k)-approximation algorithm that is polynomial in both n and k.
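Operationally, k-anonymity requires that every combination of quasi-identifier values (after generalization) occur at least k times. The small checker below illustrates just that condition; the generalization step and the information-loss measures the paper optimizes are not reproduced, and the sample records are illustrative.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values appears in at
    least k records; generalization/suppression (the hard, lossy part the
    paper optimizes) is assumed to have been applied already."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())

# Records already generalized: exact ages replaced by ranges, ZIP codes truncated.
records = [
    {"age": "20-30", "zip": "537**", "disease": "flu"},
    {"age": "20-30", "zip": "537**", "disease": "cold"},
    {"age": "30-40", "zip": "537**", "disease": "flu"},
    {"age": "30-40", "zip": "537**", "disease": "asthma"},
]
print(is_k_anonymous(records, ["age", "zip"], k=2))   # True
print(is_k_anonymous(records, ["age", "zip"], k=3))   # False
```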

COST-BASED PREDICTIVE SPATIOTEMPORAL JOIN

Wook-Shin Han   Jaehwa Kim   Byung Suk Lee   Yufei Tao   Rantzau, R.   Markl, V.,

Dept. of Comput. Eng., Kyungpook Nat. Univ., Daegu

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Feb. 2009, Volume: 21, Issue: 2, On page(s): 220-233, Location: Los Angeles, CA, USA, ISSN: 1041-4347, INSPEC Accession Number: 10370660, Digital Object Identifier: 10.1109/TKDE.2008.159,

First Published: 2008-08-01, Current Version Published: 2008-12-30

ABSTRACT

A predictive spatiotemporal join finds all pairs of moving objects satisfying a join condition on future time and space. In this paper, we present CoPST, the first and foremost algorithm for such a join using two spatiotemporal indexes. In a predictive spatiotemporal join, the bounding boxes of the outer index are used to perform window searches on the inner index, and these bounding boxes enclose objects with increasing laxity over time. CoPST constructs globally tightened bounding boxes “on the fly” to perform window searches during join processing, thus significantly minimizing overlap and improving the join performance. CoPST adapts gracefully to large-scale databases, by dynamically switching between main-memory buffering and disk-based buffering, through a novel probabilistic cost model. Our extensive experiments validate the cost model and show its accuracy for realistic data sets. We also showcase the superiority of CoPST over algorithms adapted from state-of-the-art spatial join algorithms, by a speedup of up to an order of magnitude.

BMQ-PROCESSOR: A HIGH-PERFORMANCE BORDER-CROSSING EVENT DETECTION FRAMEWORK FOR LARGE-SCALE MONITORING APPLICATIONS

Jinwon Lee   Seungwoo Kang   Youngki Lee   Sang Jeong Lee   Junehwa Song

Div. of Comput. Sci., Korea Adv. Inst. of Sci. & Technol. (KAIST), Daejeon;

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Feb. 2009, Volume: 21, Issue: 2, On page(s): 234-252, Location: Los Angeles, CA, USA, ISSN: 1041-4347,

INSPEC Accession Number: 10370656, Digital Object Identifier: 10.1109/TKDE.2008.140,

First Published: 2008-07-15, Current Version Published: 2008-12-30

ABSTRACT

In this paper, we present BMQ-Processor, a high-performance border-crossing event (BCE) detection framework for large-scale monitoring applications. We first characterize a new query semantics, namely, border monitoring query (BMQ), which is useful for BCE detection in many monitoring applications. It monitors the values of data streams and reports them only when data streams cross the borders of its range. We then propose BMQ-Processor to efficiently handle a large number of BMQs over a high volume of data streams. BMQ-Processor efficiently processes BMQs in a shared and incremental manner. It develops and operates over a novel stateful query index, achieving a high level of scalability over continuous data updates. Also, it utilizes the locality embedded in data streams and greatly accelerates successive BMQ evaluations. We present data structures and algorithms to support 1D as well as multidimensional BMQs. We show that the semantics of border monitoring can be extended toward more advanced ones and build region transition monitoring as a sample case. Lastly, we demonstrate excellent processing performance and low storage cost of BMQ-Processor through extensive analysis and experiments.
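The border monitoring semantics described above, reporting a value only when it crosses a boundary of the monitored range, can be illustrated with a tiny stateful detector; BMQ-Processor's shared, index-based evaluation over many concurrent BMQs is of course far more involved, and the sample stream and range are illustrative.

```python
def border_crossings(stream, low, high):
    """Yield (index, value, event) whenever consecutive stream values cross
    the borders of the range [low, high]; values staying on the same side
    produce no event (the essence of a border monitoring query)."""
    prev_inside = None
    for i, v in enumerate(stream):
        inside = low <= v <= high
        if prev_inside is not None and inside != prev_inside:
            yield i, v, "enter" if inside else "exit"
        prev_inside = inside

stream = [3, 4, 6, 7, 9, 12, 11, 8, 5, 2]
print(list(border_crossings(stream, low=5, high=10)))
# [(2, 6, 'enter'), (5, 12, 'exit'), (7, 8, 'enter'), (9, 2, 'exit')]
```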


PRIVACY-PRESERVING KTH ELEMENT SCORE OVER VERTICALLY PARTITIONED DATA

Vaidya, J.   Clifton, C.W.,  Manage. Sci. & Inf. Syst. Dept., Rutgers Univ., Newark, NJ

This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Feb. 2009, Volume: 21, Issue: 2, On page(s): 253-258, Location: Los Angeles, CA, USA, ISSN: 1041-4347,

INSPEC Accession Number: 10370661, Digital Object Identifier: 10.1109/TKDE.2008.167,

First Published: 2008-08-15, Current Version Published: 2008-12-30

ABSTRACT

Given a large integer data set shared vertically by two parties, we consider the problem of securely computing a score separating the kth and the (k + 1)st elements, and we propose a protocol to compute such a score while revealing little additional information. The proposed protocol is implemented using the Fairplay system, and experimental results are reported. We show a real application of this protocol as a component used in the secure processing of top-k queries over vertically partitioned data.
