Whats this rock!
Preprocess Data

Move images from both datasets to respective class labels, remove duplicates, bad and corrupted images.

Steps

This page describes the steps of the process_data function, which combines the following functions:

  1. Rename and move files to data/2_processed.
  2. List files other than jpg and png, to remove unsupported files.
  3. List files by type before cleaning.
  4. Remove:
    • Bad Images
    • Duplicate Images
    • Misclassified Images
    • Unsupported Images
    • Corrupted Images
  5. List files by type after cleaning.
  6. Get count of files by class type.
  7. Handle imbalance using undersampling, oversampling, or no sampling.

1. Rename and move files to data/2_processed.


move_to_processed

 move_to_processed ()

Combines files of the same subclass from both datasets and moves them into that subclass's folder under data/2_processed.

Uses get_new_name to generate a new name for each file, then renames the files and copies them to data/2_processed.

Moving files from dataset1/Basalt and dataset2/Basalt to data/2_processed/Basalt ...
Moving files from dataset1/Coal and dataset2/Coal to data/2_processed/Coal ...
Moving files from dataset1/Granite and dataset2/Granite to data/2_processed/Granite ...
Moving files from dataset1/Limestone and dataset2/Limestone to data/2_processed/Limestone ...
Moving files from dataset1/Marble and dataset2/Marble to data/2_processed/Marble ...
Moving files from dataset1/Quartzite and dataset2/Quartzite to data/2_processed/Quartzite ...
Moving files from dataset1/Sandstone and dataset2/Sandstone to data/2_processed/Sandstone ...
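
The merge step can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the helper names (get_new_name_sketch, move_to_processed_sketch), the raw_dir/<dataset>/<class> layout, and the naming scheme <dataset>_<class>_<index>_<original name> are assumptions inferred from the file names shown in the listings on this page. The sketch copies files rather than moving them.

```python
import shutil
from pathlib import Path


def get_new_name_sketch(dataset: str, class_name: str, index: int, path: Path) -> str:
    """Hypothetical naming scheme: <dataset>_<class>_<index>_<original name>."""
    return f"{dataset}_{class_name}_{index:03d}_{path.name}"


def move_to_processed_sketch(raw_dir: Path, out_dir: Path) -> int:
    """Copy raw_dir/<dataset>/<class>/* into out_dir/<class>/ under new names.

    Returns the number of files copied. The directory layout is an assumption.
    """
    copied = 0
    for dataset_dir in sorted(p for p in raw_dir.iterdir() if p.is_dir()):
        for class_dir in sorted(p for p in dataset_dir.iterdir() if p.is_dir()):
            # All datasets share one target folder per class.
            target = out_dir / class_dir.name
            target.mkdir(parents=True, exist_ok=True)
            files = sorted(p for p in class_dir.iterdir() if p.is_file())
            for index, src in enumerate(files, start=1):
                new_name = get_new_name_sketch(
                    dataset_dir.name, class_dir.name, index, src
                )
                # The dataset prefix keeps same-named files from colliding.
                shutil.copy2(src, target / new_name)
                copied += 1
    return copied
```

Prefixing each name with its dataset means that two files called 12.jpeg in dataset1/Coal and dataset2/Coal end up as distinct files in the shared Coal folder.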

2. List files other than jpg and png, to remove unsupported files.

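# Print every path under data/2_processed/ that does not end in jpg or png
# (the check is case-sensitive, so .jpeg and .JPEG files are both listed).
print("\nFiles other than jpg and png.\n")
files, _ = find_filepaths("data/2_processed/")
print("\n".join(f for f in files if not f.endswith(("jpg", "png"))))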

Files other than jpg and png.

data/2_processed/Coal/dataset1_Coal_025_12.jpeg
data/2_processed/Coal/dataset1_Coal_070_162.jpeg
data/2_processed/Coal/dataset1_Coal_071_163.jpeg
data/2_processed/Coal/dataset1_Coal_072_164.jpeg
data/2_processed/Coal/dataset1_Coal_073_165.jpeg
data/2_processed/Coal/dataset1_Coal_074_166.jpeg
data/2_processed/Coal/dataset1_Coal_075_167.jpeg
data/2_processed/Coal/dataset1_Coal_076_168.jpeg
data/2_processed/Coal/dataset1_Coal_077_169.jpeg
data/2_processed/Coal/dataset1_Coal_079_170.jpeg
data/2_processed/Coal/dataset1_Coal_080_171.jpeg
data/2_processed/Coal/dataset1_Coal_081_172.jpeg
data/2_processed/Coal/dataset1_Coal_082_173.jpeg
data/2_processed/Coal/dataset1_Coal_083_174.jpeg
data/2_processed/Coal/dataset1_Coal_084_175.jpeg
data/2_processed/Coal/dataset1_Coal_085_176.jpeg
data/2_processed/Coal/dataset1_Coal_086_177.jpeg
data/2_processed/Coal/dataset1_Coal_087_178.jpeg
data/2_processed/Coal/dataset1_Coal_088_179.jpeg
data/2_processed/Coal/dataset1_Coal_090_180.jpeg
data/2_processed/Coal/dataset1_Coal_091_181.jpeg
data/2_processed/Granite/dataset1_Granite_017_23.jpeg
data/2_processed/Granite/dataset1_Granite_021_27.jpeg
data/2_processed/Granite/dataset1_Granite_029_34.JPEG
data/2_processed/Granite/dataset1_Granite_031_36.JPEG
data/2_processed/Granite/dataset1_Granite_036_40.JPEG
data/2_processed/Granite/dataset1_Granite_062_64.JPEG
data/2_processed/Granite/dataset1_Granite_072_73.JPEG
data/2_processed/Granite/dataset1_Granite_073_74.JPEG
data/2_processed/Granite/dataset1_Granite_074_75.JPEG
data/2_processed/Granite/dataset1_Granite_075_76.JPEG
data/2_processed/Granite/dataset1_Granite_076_77.JPEG
data/2_processed/Granite/dataset1_Granite_077_78.JPEG
data/2_processed/Granite/dataset1_Granite_078_79.JPEG
data/2_processed/Granite/dataset1_Granite_080_80.JPEG
data/2_processed/Granite/dataset1_Granite_081_81.JPEG
data/2_processed/Granite/dataset1_Granite_082_82.JPEG
data/2_processed/Granite/dataset1_Granite_083_83.JPEG
data/2_processed/Granite/dataset1_Granite_084_84.JPEG
data/2_processed/Granite/dataset1_Granite_085_85.JPEG
data/2_processed/Granite/dataset1_Granite_086_86.JPEG
data/2_processed/Granite/dataset1_Granite_092_91.JPEG
data/2_processed/Granite/dataset1_Granite_099_98.JPEG
data/2_processed/Limestone/dataset1_Limestone_004_100.jpeg
data/2_processed/Limestone/dataset1_Limestone_005_101.webp
data/2_processed/Limestone/dataset1_Limestone_006_102.jfif
data/2_processed/Limestone/dataset1_Limestone_007_103.jfif
data/2_processed/Limestone/dataset1_Limestone_008_104.jfif
data/2_processed/Limestone/dataset1_Limestone_125_21.jpeg
data/2_processed/Limestone/dataset1_Limestone_156_238.jpeg
data/2_processed/Limestone/dataset1_Limestone_203_280.jpeg
data/2_processed/Limestone/dataset1_Limestone_215_291.jfif
data/2_processed/Limestone/dataset1_Limestone_224_3.webp
data/2_processed/Limestone/dataset1_Limestone_271_38.jfif
data/2_processed/Limestone/dataset1_Limestone_306_7.jpeg
data/2_processed/Marble/dataset1_Marble_277_Marmo_z17.jfif
data/2_processed/Marble/dataset1_Marble_282_images.jfif
data/2_processed/Marble/dataset1_Marble_380_mineral-stone-marble-nonfoliated-metamorphic-260nw-349915676.webp
data/2_processed/Marble/dataset1_Marble_387_u3tqorgn31mp65iypkwy_7e2e86ad-6a3f-410e-9584-caa5b075ccd7_825x700.webp
data/2_processed/Quartzite/dataset1_Quartzite_471_quartzite-crystal-mineral-sample-studio-shot-with-black-background-972333846-5c7e6525c9e77c0001d19dda.webp
data/2_processed/Sandstone/dataset1_Sandstone_058_15.webp
data/2_processed/Sandstone/dataset1_Sandstone_248_320.webp

3. List files by type before cleaning.


File types before cleaning:
.jpg     2550
.jpeg      28
.JPEG      20
.png       17
.jfif       7
.webp       7
Name: file_type, dtype: int64
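
A count like the one above can be reproduced with the standard library alone. The sketch below is an illustration under assumptions: the function name count_file_types and its normalize flag are hypothetical, and the page's own output appears to come from a pandas Series rather than from this code.

```python
from collections import Counter
from pathlib import Path


def count_file_types(paths, normalize=False):
    """Count files by extension.

    With normalize=False the original letter case is kept, so ".jpeg" and
    ".JPEG" are counted separately (as in the listing above). With
    normalize=True extensions are lower-cased before counting.
    """
    suffixes = (Path(p).suffix for p in paths)
    if normalize:
        suffixes = (s.lower() for s in suffixes)
    return Counter(suffixes)
```

For example, count_file_types(["a/x.jpg", "a/y.JPEG", "b/z.jpeg"]) keeps three separate keys, while passing normalize=True merges the two JPEG spellings into a single ".jpeg" entry.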

4. Remove

- Bad Images
- Duplicate Images
- Misclassified Images
- Unsupported Images
- Corrupted Images

clean_images

 clean_images (cfg)

Remove bad, misclassified, duplicate, corrupted and unsupported images.

Parameter  Type                  Details
cfg        omegaconf.DictConfig  Hydra configuration
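
The page does not show how clean_images detects duplicates. One common approach, sketched here as an assumption rather than as the project's method, is to hash each file's bytes and flag any file whose digest has already been seen. The names file_digest and find_exact_duplicates are hypothetical, and the sketch only finds byte-identical copies, not resized or re-encoded versions of the same picture.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root: Path) -> list[Path]:
    """Return every file under root whose bytes match an earlier file.

    Files are visited in sorted path order, so the first path with a given
    content is kept and later ones are reported as duplicates.
    """
    seen: dict[str, Path] = {}
    duplicates: list[Path] = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        key = file_digest(path)
        if key in seen:
            duplicates.append(path)
        else:
            seen[key] = path
    return duplicates
```

The function only reports duplicates and leaves deletion to the caller, which makes it easy to review the list before removing anything.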

5. List files by type after cleaning.


File types after cleaning:
.jpg     2550
.jpeg      28
.JPEG      20
.png       17
.jfif       7
.webp       7
Name: file_type, dtype: int64

6. Get count of files by class types.


Counts of classes:

Quartzite    517
Coal         469
Limestone    452
Marble       427
Sandstone    370
Granite      214
Basalt       180
Name: class, dtype: int64

7. Handle Imbalance

The data can be undersampled, oversampled, or left unsampled.


sampling

 sampling (cfg)

Oversamples, undersamples, or leaves the data unsampled, and splits it into train, val and test.

Parameter  Type                  Details
cfg        omegaconf.DictConfig  Hydra configuration
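
The three sampling modes can be illustrated with simple random resampling. This is a sketch under assumptions, not the sampling function itself: the name resample, the strategy strings, and the dict-of-lists input are hypothetical, and the train/val/test split is left out.

```python
import random


def resample(files_by_class, strategy="none", seed=0):
    """Balance classes by random under- or oversampling.

    files_by_class maps a class name to a non-empty list of file paths.
    "undersample" trims every class to the size of the smallest class,
    "oversample" pads every class (drawing with replacement) up to the size
    of the largest class, and "none" returns the data unchanged.
    """
    rng = random.Random(seed)
    sizes = [len(files) for files in files_by_class.values()]
    if strategy == "undersample":
        target = min(sizes)
        return {c: rng.sample(f, target) for c, f in files_by_class.items()}
    if strategy == "oversample":
        target = max(sizes)
        return {
            c: list(f) + rng.choices(f, k=target - len(f))
            for c, f in files_by_class.items()
        }
    if strategy == "none":
        return {c: list(f) for c, f in files_by_class.items()}
    raise ValueError(f"unknown strategy: {strategy!r}")
```

With the class counts listed above, undersampling this way would keep 180 images per class (the Basalt count), while oversampling would bring every class up to 517 (the Quartzite count). Oversampled duplicates are usually confined to the training split so that copies of one image do not appear in both training and evaluation data.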

Putting it all together

process_data wraps all the above functions.


process_data

 process_data (cfg)

Removes unsupported and corrupted images, and splits the data into train, val and test.

Steps: download_and_move_datasets -> move_to_processed -> find_filepaths -> clean_images -> sampling

Parameter  Type                  Details
cfg        omegaconf.DictConfig  Hydra configuration