useR! 2024, Salzburg, Austria
July 9, 2024
Data Quality Problems
Class overlap occurs when instances of more than one class share a common region in the data space and are not clearly separable
This overlap can happen due to:
Inherent Similarity: Natural similarity between classes
Noise: Variability or errors in data collection
Feature Representation: Insufficient or inadequate features to separate classes
Classifiers struggle to correctly classify instances due to overlapping regions
Higher error rates occur in areas where classes overlap, leading to more instances being misclassified
If the problem of class overlap is not addressed, models may become overly complex, leading to overfitting issues where the model performs well on training data but poorly on unseen data
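The effect of overlap on error rates can be illustrated with a small simulation. This is only a sketch on made-up data: the two Gaussian classes, the threshold classifier, and the choice of the interval between the class means as the "overlap region" are all assumptions for illustration, not part of the talk.

```r
# Two simulated 1-D classes whose distributions overlap (illustrative data only).
set.seed(123)
n <- 1000
x <- c(rnorm(n, mean = 0), rnorm(n, mean = 2))
y <- rep(c("A", "B"), each = n)

# Simple threshold classifier at the midpoint between the two class means.
pred <- ifelse(x > 1, "B", "A")
wrong <- pred != y

# Assumption: treat the interval between the class means as the overlap region.
in_overlap <- x >= 0 & x <= 2
err_overlap <- mean(wrong[in_overlap])
err_outside <- mean(wrong[!in_overlap])

# Misclassification rate inside vs. outside the overlap region.
c(overlap = err_overlap, outside = err_outside)
```

Under these assumptions, most misclassified points fall inside the region where the two classes share data space.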
clap
R Package
install.packages('clap')
devtools::install_github("pridiltal/clap")
clap
Framework
Normalize the columns of the data using the median and IQR.
This prevents variables with large variances from having a disproportionate influence on Euclidean distances.
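A minimal sketch of this normalization step, assuming each column is centered by its median and divided by its IQR. The helper `robust_scale` is hypothetical and is not a function from the clap package; the example data are simulated.

```r
# Robust column-wise scaling: (x - median) / IQR.
# `robust_scale` is a hypothetical helper, not part of the clap package.
robust_scale <- function(X) {
  X <- as.matrix(X)
  med <- apply(X, 2, median, na.rm = TRUE)
  iqr <- apply(X, 2, IQR, na.rm = TRUE)
  if (any(iqr == 0)) stop("a column has zero IQR and cannot be scaled")
  # Subtract the column medians, then divide by the column IQRs.
  sweep(sweep(X, 2, med, "-"), 2, iqr, "/")
}

# Two simulated columns on very different scales.
set.seed(1)
X <- cbind(small = rnorm(100, sd = 1), large = rnorm(100, sd = 1000))
Z <- robust_scale(X)

apply(Z, 2, median)  # column medians after scaling
apply(Z, 2, IQR)     # column IQRs after scaling
```

After scaling, both columns have median 0 and IQR 1 (up to floating-point error), so neither dominates a Euclidean distance.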
Leader Algorithm (Hartigan, 1975)
Calculate the nearest neighbor distances
Perform clustering using a radius based on the maximum nearest neighbor distance
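The two steps above can be sketched as follows. This is a plain one-pass version of the leader algorithm written for illustration, not the implementation inside the clap package; `nn_distances` and `leader_cluster` are hypothetical helper names, and the data are simulated.

```r
# Nearest neighbor distance of every row (Euclidean), excluding the point itself.
nn_distances <- function(X) {
  D <- as.matrix(dist(X))
  diag(D) <- Inf
  apply(D, 1, min)
}

# One-pass leader clustering: each point joins the first leader within `radius`;
# otherwise it becomes a new leader.
leader_cluster <- function(X, radius) {
  X <- as.matrix(X)
  leaders <- integer(0)        # row indices of the leaders found so far
  cluster <- integer(nrow(X))  # cluster id assigned to each row
  for (i in seq_len(nrow(X))) {
    assigned <- FALSE
    for (k in seq_along(leaders)) {
      if (sqrt(sum((X[i, ] - X[leaders[k], ])^2)) <= radius) {
        cluster[i] <- k
        assigned <- TRUE
        break
      }
    }
    if (!assigned) {
      leaders <- c(leaders, i)
      cluster[i] <- length(leaders)
    }
  }
  list(cluster = cluster, leaders = leaders)
}

# Simulated two-dimensional data with two groups.
set.seed(42)
X <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 6), ncol = 2))

# Radius taken as the maximum nearest neighbor distance.
radius <- max(nn_distances(X))
fit <- leader_cluster(X, radius)
table(fit$cluster)
```

Note that the result of a one-pass leader algorithm depends on the order of the rows, and the full distance matrix used here for the nearest neighbor step grows quadratically with the number of observations.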
Case Study 1: Biopsy Data of Breast Cancer Patients
clump thickness
uniformity of cell size
uniformity of cell shape
marginal adhesion
single epithelial cell size
bare nuclei (16 values are missing)
bland chromatin
normal nucleoli
mitoses
Classification Task
Benign tumors: generally do not invade or spread
Malignant tumors: more likely to spread to other areas of the body
Aim: detect early stages of pneumonia through chest x-ray images
Inaccurate analysis of x-ray results can lead to improper diagnosis and decision making, a costly mistake when a correct reading could save lives
Linearized chest x-ray images
Images of Class “Normal”
Images of Class “Pneumonia”
The Leader Algorithm determined cluster size using the maximum nearest neighbor distance. Further analysis is needed to identify the optimal cluster size.
Euclidean distance was used in the Leader Algorithm to determine cluster size. Additional experiments are required to find faster distance calculation methods.
Further analysis is needed to identify classification methods that are robust to class overlapping instances.
Further analysis is required to assess the overlapping status of unseen test data in the high-dimensional feature space.
This work was supported in part by the RETINA research lab, funded by the OWSD, a program unit of the United Nations Educational, Scientific, and Cultural Organization (UNESCO).
Slides created with Quarto, available at prital.netlify.app.