Exploring Class Overlap in Classification Challenges: Introducing the clap R Package

useR! 2024, Salzburg, Austria

Priyanga Dilini Talagala

July 9, 2024

Data Quality Problems

Class Overlapping Problem

  • Class overlap occurs when instances of more than one class share a common region in the data space and are not clearly separable

  • This overlap can happen due to:

    • Inherent Similarity: Natural similarity between classes

    • Noise: Variability or errors in data collection

    • Feature Representation: Insufficient or inadequate features to separate classes

Implications of Class Overlap

  • Classifiers struggle to correctly classify instances due to overlapping regions

  • Higher error rates occur in areas where classes overlap, leading to more instances being misclassified

  • If class overlap is not addressed, models may become overly complex and overfit, performing well on training data but poorly on unseen data

Types of Class Overlapping Problems

clap R Package

Detecting Class OverLAPping Regions in Multidimensional Data

# Release version from CRAN
install.packages("clap")

# Development version from GitHub
devtools::install_github("pridiltal/clap")

clap Framework


  • Normalize each column of the data using its median and IQR.

  • This prevents variables with large variances from having a disproportionate influence on Euclidean distances.
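
The normalization step above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not clap's internal code: `robust_scale` and the `toy` data frame are hypothetical names introduced here.

```r
# Robust column-wise scaling: centre each column by its median and
# divide by its interquartile range (IQR).
# `robust_scale` is a hypothetical helper, not a function exported by clap.
robust_scale <- function(x) {
  x <- as.matrix(x)
  med <- apply(x, 2, median, na.rm = TRUE)
  iqr <- apply(x, 2, IQR, na.rm = TRUE)
  iqr[iqr == 0] <- 1  # guard against division by zero for constant columns
  sweep(sweep(x, 2, med, "-"), 2, iqr, "/")
}

# Two columns on very different scales (illustrative data only).
toy <- data.frame(small = c(1, 2, 3, 4, 5),
                  large = c(100, 200, 300, 400, 500))
scaled <- robust_scale(toy)
print(scaled)
```

After scaling, both toy columns sit on the same scale, so neither dominates a Euclidean distance computed across them.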

clap Framework

  • Leader Algorithm (Hartigan, 1975)

  • Calculate the nearest neighbor distances

  • Perform clustering using a radius based on the maximum nearest neighbor distance
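
The steps above can be sketched in base R. This is one plausible reading of the leader algorithm with a radius equal to the maximum nearest neighbor distance. It is not the package's implementation, and `leader_cluster` and `points` are hypothetical names.

```r
# Hypothetical helper illustrating a leader-style clustering; not clap's code.
leader_cluster <- function(x) {
  x <- as.matrix(x)
  d <- as.matrix(dist(x))        # pairwise Euclidean distances
  diag(d) <- Inf                 # ignore self-distances
  nn_dist <- apply(d, 1, min)    # nearest neighbor distance of each point
  radius <- max(nn_dist)         # radius = maximum nearest neighbor distance

  leaders <- integer(0)          # row indices of the cluster leaders so far
  cluster <- integer(nrow(x))
  for (i in seq_len(nrow(x))) {
    # Leaders (in order of creation) that lie within the radius of point i.
    hit <- which(d[i, leaders] <= radius)
    if (length(hit) > 0) {
      cluster[i] <- hit[1]       # join the first leader within the radius
    } else {
      leaders <- c(leaders, i)   # otherwise point i becomes a new leader
      cluster[i] <- length(leaders)
    }
  }
  list(cluster = cluster, leaders = leaders, radius = radius)
}

# Two well-separated groups of points (illustrative data only).
points <- rbind(c(0, 0), c(0, 1), c(1, 0), c(10, 10), c(10, 11))
res <- leader_cluster(points)
print(res$cluster)
```

The algorithm makes a single pass over the data, so the result depends on row order. That order dependence is a known property of leader-style clustering.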

clap Framework

  • Perform clustering

cluster_result <-
  clap::perform_clustering(
    data,
    class_column = class)

  • Compute cluster composition

composition <-
  clap::compute_cluster_composition(
    cluster_result)

  • Extract IDs

ids_vector <-
  clap::extract_ids_vector(
    composition)
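
To make the idea of cluster composition concrete, here is a base-R sketch on toy labels. It assumes that "composition" means the share of each class within a cluster and that the IDs of interest belong to observations in mixed-class clusters. That reading is an assumption, and the code does not reproduce what the clap functions return.

```r
# Hypothetical stand-ins for a clustering result: one cluster label and
# one class label per observation (row IDs 1 to 8).
cluster <- c(1, 1, 1, 2, 2, 3, 3, 3)
class   <- c("a", "a", "a", "a", "b", "b", "b", "a")

# Cluster composition: proportion of each class within each cluster.
composition <- prop.table(table(cluster, class), margin = 1)

# Treat a cluster as "mixed" when more than one class has a non-zero share.
mixed_clusters <- rownames(composition)[rowSums(composition > 0) > 1]

# IDs (row indices) of the observations that fall in mixed clusters.
overlap_ids <- which(as.character(cluster) %in% mixed_clusters)

print(composition)
print(overlap_ids)
```

A stricter rule, such as requiring the minority class to exceed some share, could replace the "more than one class" test. The threshold choice is left open here.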

Case Study 1: Biopsy Data of Breast Cancer Patients

  • clump thickness

  • uniformity of cell size

  • uniformity of cell shape

  • marginal adhesion

  • single epithelial cell size

  • bare nuclei (16 values are missing)

  • bland chromatin

  • normal nucleoli

  • mitoses

Classification Task

  • Benign tumors: generally do not invade or spread

  • Malignant tumors: more likely to spread to other areas of the body
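
A possible data-preparation sketch for this case study follows. It assumes the data correspond to the `biopsy` dataset shipped with the MASS package, whose columns V1 to V9 match the nine attributes listed above. The slides do not name the source, so treat this as an assumption.

```r
# Assumption: the case-study data match MASS::biopsy
# (columns: ID, V1..V9, class).
data(biopsy, package = "MASS")

# Count missing values per column; V6 (bare nuclei) holds the missing values.
n_missing <- colSums(is.na(biopsy))
print(n_missing)

# Drop the sample ID and keep only complete rows before clustering.
biopsy_complete <- na.omit(biopsy[, -1])

# Class balance (benign vs. malignant) among the complete cases.
print(table(biopsy_complete$class))
```

Dropping incomplete rows is only one option. Imputing the missing bare nuclei values would keep all observations.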

Case Study 2: Detecting Class Overlapping Instances in Image Databases

  • Aim: detect early stages of pneumonia from chest x-ray images

  • Inaccurate analysis of x-ray results can lead to improper diagnosis and decision making, a costly mistake when a correct reading could save lives

  • Linearized chest x-ray images
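
"Linearized" is taken here to mean flattening each image's pixel matrix into a single feature vector, so that every image becomes one row of a data matrix. The sketch below uses tiny made-up matrices in place of real x-rays, and `linearize` is a hypothetical helper.

```r
# Two hypothetical 3 x 2 grayscale "images" (pixel intensities).
img_a <- matrix(1:6, nrow = 3, ncol = 2)
img_b <- matrix(6:1, nrow = 3, ncol = 2)

# Flatten an image matrix into a vector (R stores matrices column by column).
linearize <- function(img) as.vector(img)

# One row per image, one column per pixel.
features <- rbind(linearize(img_a), linearize(img_b))
print(dim(features))
```

All images need the same dimensions before flattening, so real x-rays would typically be resized to a common resolution first.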

Images of Class “Normal”

Images of Class “Pneumonia”

What Next?

  • The Leader Algorithm determined cluster size using the maximum nearest neighbor distance. Further analysis is needed to identify the optimal cluster size.

  • Euclidean distance was used in the Leader Algorithm to determine cluster size. Additional experiments are required to find faster distance calculation methods.

  • Further analysis is needed to identify classification methods that are robust to class overlapping instances.

  • Further analysis is required to assess the overlapping status of unseen test data in the high-dimensional feature space.

Thank you

This work was supported in part by the RETINA research lab, funded by the OWSD, a program unit of the United Nations Educational, Scientific, and Cultural Organization (UNESCO).

Slides available at: prital.netlify.app