In data integration, entity resolution (ER) is an important technique for improving data quality. Existing research typically assumes that the target dataset contains only string-type data and uses a single similarity metric. For larger, high-dimensional datasets, redundant information needs to be verified using traditional blocking or windowing techniques. In this work, we propose a novel ER method that uses a hybrid approach combining type-based multiblocks, a varying window size, and more flexible similarity metrics. In our new ER workflow, we reduce the search space for entity pairs by constraining on redundant attributes and matching likelihood. We develop a reference implementation of the proposed approach and validate its performance using a real-life dataset from an Internet of Things project. We evaluate the data processing system using five standard metrics: effectiveness, efficiency, accuracy, recall, and precision. Experimental results indicate that the proposed approach could be a promising alternative for entity resolution and could feasibly be applied to real-world data cleaning for large datasets.
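To make the general idea concrete, here is a minimal Python sketch of how blocking, windowing, and type-aware similarity can be combined to shrink the set of candidate pairs. It is not the authors' reference implementation: the function names, the sample records, the fixed window size (the abstract describes a varying one), and the 0.85 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Type-aware similarity in [0, 1] (illustrative choice of metrics)."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        # Numeric fields: relative closeness, clamped at 0.
        scale = max(abs(a), abs(b))
        return 1.0 if scale == 0 else max(0.0, 1.0 - abs(a - b) / scale)
    # Everything else: case-insensitive string ratio.
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def record_similarity(r1, r2, fields):
    """Average the per-field similarities over the compared fields."""
    return sum(similarity(r1[f], r2[f]) for f in fields) / len(fields)


def candidate_pairs(records, block_key, sort_key, window):
    """Group records by block_key, sort each block by sort_key, and pair
    each record only with the next (window - 1) records in its block."""
    blocks = {}
    for idx, rec in enumerate(records):
        blocks.setdefault(rec[block_key], []).append(idx)
    pairs = set()
    for members in blocks.values():
        members.sort(key=lambda i: str(records[i][sort_key]).lower())
        for pos, i in enumerate(members):
            for j in members[pos + 1 : pos + window]:
                pairs.add((min(i, j), max(i, j)))
    return pairs


def resolve(records, block_key, sort_key, fields, window=3, threshold=0.85):
    """Return index pairs whose average similarity reaches the threshold."""
    return sorted(
        (i, j)
        for i, j in candidate_pairs(records, block_key, sort_key, window)
        if record_similarity(records[i], records[j], fields) >= threshold
    )


# Hypothetical IoT-style records, invented for this example.
records = [
    {"type": "sensor", "name": "Temp Sensor A1", "reading": 21.5},
    {"type": "sensor", "name": "temp sensor a1", "reading": 21.4},
    {"type": "sensor", "name": "Humidity Sensor B7", "reading": 40.0},
    {"type": "gateway", "name": "Temp Sensor A1", "reading": 21.5},
]
matches = resolve(records, "type", "name", ["name", "reading"])
```

In this sketch the gateway record is never compared with the sensor records because it falls into a different block, which is the sense in which blocking reduces the search space before any similarity is computed.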