Moving files from dataset1/Basalt and dataset2/Basalt to data/2_processed/Basalt ...
Moving files from dataset1/Coal and dataset2/Coal to data/2_processed/Coal ...
Moving files from dataset1/Granite and dataset2/Granite to data/2_processed/Granite ...
Moving files from dataset1/Limestone and dataset2/Limestone to data/2_processed/Limestone ...
Moving files from dataset1/Marble and dataset2/Marble to data/2_processed/Marble ...
Moving files from dataset1/Quartzite and dataset2/Quartzite to data/2_processed/Quartzite ...
Moving files from dataset1/Sandstone and dataset2/Sandstone to data/2_processed/Sandstone ...
Preprocess Data
Move images from both datasets into their respective class folders, then remove duplicate, bad, and corrupted images.
Steps
This page describes the steps of the process_data function, which combines the following functions:
- Rename and move files to data/2_processed.
- List files other than jpg and png, to remove unsupported files.
- List files by type before cleaning.
- Remove:
  - Bad Images
  - Duplicate Images
  - Misclassified Images
  - Unsupported Images
  - Corrupted Images
- List files by type after cleaning.
- Get count of files by class types.
- Handle Imbalance using Undersampling, Oversampling.
1. Rename and move files to data/2_processed.
move_to_processed
move_to_processed ()
Combines files that share a subclass and moves them to that subclass folder under data/2_processed.
Uses get_new_name to create new file names, then renames the files and copies them to data/2_processed.
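The bodies of move_to_processed and get_new_name are not shown on this page. The following is a minimal sketch of the idea, assuming the naming pattern `<dataset>_<class>_<index>_<original name>` that appears in the file listing in step 2 (for example dataset1_Coal_025_12.jpeg). The names `sketch_new_name` and `sketch_move_to_processed`, the directory layout, and the per-class index are assumptions for illustration, not the project's actual implementation.

```python
import shutil
from pathlib import Path


def sketch_new_name(dataset: str, class_name: str, index: int, original: Path) -> str:
    """Hypothetical naming helper: <dataset>_<class>_<index>_<original file name>."""
    return f"{dataset}_{class_name}_{index:03d}_{original.name}"


def sketch_move_to_processed(raw_root: Path, processed_root: Path, datasets: list[str]) -> int:
    """Copy every file from <raw_root>/<dataset>/<class>/ into <processed_root>/<class>/.

    Returns the number of files copied. Assumes one sub-folder per class.
    """
    copied = 0
    for dataset in datasets:
        dataset_dir = raw_root / dataset
        if not dataset_dir.is_dir():
            continue  # skip datasets that were not downloaded
        for class_dir in sorted(p for p in dataset_dir.iterdir() if p.is_dir()):
            target_dir = processed_root / class_dir.name
            target_dir.mkdir(parents=True, exist_ok=True)
            files = sorted(p for p in class_dir.iterdir() if p.is_file())
            for index, src in enumerate(files, start=1):
                new_name = sketch_new_name(dataset, class_dir.name, index, src)
                shutil.copy2(src, target_dir / new_name)  # copy, keeping the raw file
                copied += 1
    return copied
```

Prefixing each name with its dataset keeps files from different datasets distinct when they share an original name.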
2. List files other than jpg and png, to remove unsupported files.
Code
print("\nFiles other than jpg and png.\n")
files, _ = find_filepaths("data/2_processed/")
print(
    "\n".join(
        list(filter(lambda x: not x.endswith("jpg") and not x.endswith("png"), files))
    )
)
Files other than jpg and png.
data/2_processed/Coal/dataset1_Coal_025_12.jpeg
data/2_processed/Coal/dataset1_Coal_070_162.jpeg
data/2_processed/Coal/dataset1_Coal_071_163.jpeg
data/2_processed/Coal/dataset1_Coal_072_164.jpeg
data/2_processed/Coal/dataset1_Coal_073_165.jpeg
data/2_processed/Coal/dataset1_Coal_074_166.jpeg
data/2_processed/Coal/dataset1_Coal_075_167.jpeg
data/2_processed/Coal/dataset1_Coal_076_168.jpeg
data/2_processed/Coal/dataset1_Coal_077_169.jpeg
data/2_processed/Coal/dataset1_Coal_079_170.jpeg
data/2_processed/Coal/dataset1_Coal_080_171.jpeg
data/2_processed/Coal/dataset1_Coal_081_172.jpeg
data/2_processed/Coal/dataset1_Coal_082_173.jpeg
data/2_processed/Coal/dataset1_Coal_083_174.jpeg
data/2_processed/Coal/dataset1_Coal_084_175.jpeg
data/2_processed/Coal/dataset1_Coal_085_176.jpeg
data/2_processed/Coal/dataset1_Coal_086_177.jpeg
data/2_processed/Coal/dataset1_Coal_087_178.jpeg
data/2_processed/Coal/dataset1_Coal_088_179.jpeg
data/2_processed/Coal/dataset1_Coal_090_180.jpeg
data/2_processed/Coal/dataset1_Coal_091_181.jpeg
data/2_processed/Granite/dataset1_Granite_017_23.jpeg
data/2_processed/Granite/dataset1_Granite_021_27.jpeg
data/2_processed/Granite/dataset1_Granite_029_34.JPEG
data/2_processed/Granite/dataset1_Granite_031_36.JPEG
data/2_processed/Granite/dataset1_Granite_036_40.JPEG
data/2_processed/Granite/dataset1_Granite_062_64.JPEG
data/2_processed/Granite/dataset1_Granite_072_73.JPEG
data/2_processed/Granite/dataset1_Granite_073_74.JPEG
data/2_processed/Granite/dataset1_Granite_074_75.JPEG
data/2_processed/Granite/dataset1_Granite_075_76.JPEG
data/2_processed/Granite/dataset1_Granite_076_77.JPEG
data/2_processed/Granite/dataset1_Granite_077_78.JPEG
data/2_processed/Granite/dataset1_Granite_078_79.JPEG
data/2_processed/Granite/dataset1_Granite_080_80.JPEG
data/2_processed/Granite/dataset1_Granite_081_81.JPEG
data/2_processed/Granite/dataset1_Granite_082_82.JPEG
data/2_processed/Granite/dataset1_Granite_083_83.JPEG
data/2_processed/Granite/dataset1_Granite_084_84.JPEG
data/2_processed/Granite/dataset1_Granite_085_85.JPEG
data/2_processed/Granite/dataset1_Granite_086_86.JPEG
data/2_processed/Granite/dataset1_Granite_092_91.JPEG
data/2_processed/Granite/dataset1_Granite_099_98.JPEG
data/2_processed/Limestone/dataset1_Limestone_004_100.jpeg
data/2_processed/Limestone/dataset1_Limestone_005_101.webp
data/2_processed/Limestone/dataset1_Limestone_006_102.jfif
data/2_processed/Limestone/dataset1_Limestone_007_103.jfif
data/2_processed/Limestone/dataset1_Limestone_008_104.jfif
data/2_processed/Limestone/dataset1_Limestone_125_21.jpeg
data/2_processed/Limestone/dataset1_Limestone_156_238.jpeg
data/2_processed/Limestone/dataset1_Limestone_203_280.jpeg
data/2_processed/Limestone/dataset1_Limestone_215_291.jfif
data/2_processed/Limestone/dataset1_Limestone_224_3.webp
data/2_processed/Limestone/dataset1_Limestone_271_38.jfif
data/2_processed/Limestone/dataset1_Limestone_306_7.jpeg
data/2_processed/Marble/dataset1_Marble_277_Marmo_z17.jfif
data/2_processed/Marble/dataset1_Marble_282_images.jfif
data/2_processed/Marble/dataset1_Marble_380_mineral-stone-marble-nonfoliated-metamorphic-260nw-349915676.webp
data/2_processed/Marble/dataset1_Marble_387_u3tqorgn31mp65iypkwy_7e2e86ad-6a3f-410e-9584-caa5b075ccd7_825x700.webp
data/2_processed/Quartzite/dataset1_Quartzite_471_quartzite-crystal-mineral-sample-studio-shot-with-black-background-972333846-5c7e6525c9e77c0001d19dda.webp
data/2_processed/Sandstone/dataset1_Sandstone_058_15.webp
data/2_processed/Sandstone/dataset1_Sandstone_248_320.webp
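The listing mixes lower-case and upper-case extensions (.jpeg and .JPEG). The endswith filter above is case-sensitive, so a file ending in .JPG would also be reported. As a hedged alternative, the sketch below compares the lower-cased suffix instead. The name `unsupported_files` and the `SUPPORTED_SUFFIXES` set are assumptions, and this variant deliberately treats .JPG and .PNG as supported, which differs from the original filter.

```python
from pathlib import Path

# Assumption: the same two supported types as the filter above.
SUPPORTED_SUFFIXES = {".jpg", ".png"}


def unsupported_files(paths):
    """Return the paths whose lower-cased suffix is not in SUPPORTED_SUFFIXES."""
    return [p for p in paths if Path(p).suffix.lower() not in SUPPORTED_SUFFIXES]
```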
3. List files by type before cleaning.
File types before cleaning:
.jpg 2550
.jpeg 28
.JPEG 20
.png 17
.jfif 7
.webp 7
Name: file_type, dtype: int64
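The `Name: file_type, dtype: int64` footer suggests the counts come from a pandas value_counts() call, but the code is not shown here. A dependency-free sketch of the same tally, using only the standard library, could look like this. The function name `count_file_types` is an assumption.

```python
from collections import Counter
from pathlib import Path


def count_file_types(paths):
    """Count files per suffix, most common first (suffix case is preserved)."""
    return Counter(Path(p).suffix for p in paths).most_common()
```

Because the suffix case is preserved, .jpeg and .JPEG are counted separately, as in the output above.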
4. Remove
- Bad Images
- Duplicate Images
- Misclassified Images
- Unsupported Images
- Corrupted Images
clean_images
clean_images (cfg)
Remove bad, misclassified, duplicate, corrupted and unsupported images.
Parameter | Type | Details |
---|---|---|
cfg | omegaconf.DictConfig | Hydra Configuration |
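How clean_images detects each category is not described on this page. As one illustration only, the sketch below covers the duplicate case by grouping files with identical bytes. It assumes that "duplicate" means byte-for-byte identical, so resized or re-encoded copies would not be caught. The name `find_exact_duplicates` is hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_exact_duplicates(root):
    """Group files under root by the SHA-256 of their bytes.

    Returns a list of groups, each holding two or more paths with identical content.
    """
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

A caller could keep the first path of each group and remove the rest.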
5. List files by type after cleaning.
File types after cleaning:
.jpg 2550
.jpeg 28
.JPEG 20
.png 17
.jfif 7
.webp 7
Name: file_type, dtype: int64
6. Get count of files by class types.
Counts of classes:
Quartzite 517
Coal 469
Limestone 452
Marble 427
Sandstone 370
Granite 214
Basalt 180
Name: class, dtype: int64
7. Handle Imbalance
Using Undersampling, Oversampling and No Sampling.
sampling
sampling (cfg)
Applies oversampling, undersampling, or no sampling to the data and splits it into train, val, and test sets.
Parameter | Type | Details |
---|---|---|
cfg | omegaconf.DictConfig | Hydra Configuration |
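The class counts in step 6 range from 180 (Basalt) to 517 (Quartzite), which is the imbalance that sampling addresses. The internals of sampling are not shown, so the following is only a sketch of random undersampling and oversampling over per-class file lists. The function name `resample`, its `mode` values, and the `seed` parameter are assumptions, and the train/val/test split is left out.

```python
import random


def resample(files_by_class, mode="none", seed=0):
    """Balance class sizes.

    mode="under": randomly keep as many files per class as the smallest class has.
    mode="over": randomly repeat files until each class matches the largest class.
    mode="none": return copies of the input lists unchanged.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    sizes = [len(files) for files in files_by_class.values()]
    balanced = {}
    for name, files in files_by_class.items():
        if mode == "under":
            balanced[name] = rng.sample(files, min(sizes))
        elif mode == "over":
            extra = rng.choices(files, k=max(sizes) - len(files))
            balanced[name] = list(files) + extra
        elif mode == "none":
            balanced[name] = list(files)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return balanced
```

If oversampling is applied before splitting, copies of one image can land in both the training and test sets, so in practice it is usually applied to the training split only.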
Putting it all together
process_data wraps all the above functions.
process_data
process_data (cfg)
Removes unsupported and corrupted images, and splits the data into train, val, and test sets.
Steps: download_and_move_datasets -> move_to_processed -> find_filepaths -> clean_images -> sampling
Parameter | Type | Details |
---|---|---|
cfg | omegaconf.DictConfig | Hydra Configuration |