Whats this rock!

Exploratory Analysis

Let’s explore the data.

Exploring data

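# get_df, get_dims and check_corrupted are project helpers (see "Download utilities")
df = get_df("data/2_processed")
df['dimensions'] = df['file_path'].apply(get_dims)
# get_dims may return None for unreadable files, so guard before indexing.
# Assumes get_dims returns (width, height); swap the indices if it returns (height, width).
df['image_width'] = df['dimensions'].apply(lambda x: x[0] if x is not None else None)
df['image_height'] = df['dimensions'].apply(lambda x: x[1] if x is not None else None)
df['pixels'] = df['image_width'] * df['image_height']
df['corrupt_status'] = df['file_path'].apply(check_corrupted)
df.head(5)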
   file_name                                class      file_path                                          file_type  dimensions  image_width  image_height  pixels    corrupt_status
0  dataset1_Limestone_147_23.jpg            Limestone  data/2_processed/Limestone/dataset1_Limestone_...  .jpg       (285, 380)  285.0        380.0         108300.0  False
1  dataset2_Limestone_418_Limestone521.jpg  Limestone  data/2_processed/Limestone/dataset2_Limestone_...  .jpg       (225, 225)  225.0        225.0         50625.0   False
2  dataset1_Limestone_315_78.jpg            Limestone  data/2_processed/Limestone/dataset1_Limestone_...  .jpg       (408, 612)  408.0        612.0         249696.0  False
3  dataset1_Limestone_078_168.jpg           Limestone  data/2_processed/Limestone/dataset1_Limestone_...  .jpg       (408, 612)  408.0        612.0         249696.0  False
4  dataset1_Limestone_305_69.jpg            Limestone  data/2_processed/Limestone/dataset1_Limestone_...  .jpg       (408, 612)  408.0        612.0         249696.0  False
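
`get_df` is a project helper whose source is not shown on this page. Judging only by the first four columns of the table above, it might be approximated as follows. The function name `list_image_records` and the one-folder-per-class layout are assumptions made for illustration, not the project's actual code.

```python
import tempfile
from pathlib import Path


def list_image_records(root):
    """Hypothetical stand-in for get_df: one record per file under root/<class>/."""
    records = []
    for path in sorted(Path(root).glob("*/*")):
        if path.is_file():
            records.append({
                "file_name": path.name,
                "class": path.parent.name,  # folder name is taken as the label
                "file_path": str(path),
                "file_type": path.suffix,
            })
    return records


# Tiny throwaway directory tree, for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    for cls, name in [("Coal", "a.jpg"), ("Basalt", "b.png")]:
        (Path(tmp) / cls).mkdir()
        (Path(tmp) / cls / name).touch()
    records = list_image_records(tmp)

print([(r["class"], r["file_name"], r["file_type"]) for r in records])
```

Passing `records` to `pd.DataFrame` would give a frame with the same four leading columns.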

File types

df['file_name'].apply(lambda x: x.split('.')[-1]).value_counts()
jpg     1797
jpeg      23
png        8
JPEG       2
Name: file_name, dtype: int64
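
The split above is case-sensitive, so `jpeg` and `JPEG` are reported as separate types. A minimal sketch of a case-insensitive tally, using hypothetical file names rather than the real dataset (`extension_counts` is an illustrative name):

```python
from collections import Counter
from pathlib import PurePath


def extension_counts(file_names):
    """Count file extensions, ignoring case and the leading dot."""
    return Counter(PurePath(name).suffix.lower().lstrip(".") for name in file_names)


# Hypothetical file names, for illustration only.
names = ["a.jpg", "b.JPEG", "c.jpeg", "d.png", "e.JPG"]
counts = extension_counts(names)
print(counts)
```

Applied to the counts above, this would fold the 2 `JPEG` files into `jpeg`, giving 25.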

Corrupt file counts

df.corrupt_status.value_counts()
False    1753
True       77
Name: corrupt_status, dtype: int64

Corrupted file list

df[df['corrupt_status']].head(5)
    file_name                       class      file_path                                          file_type  dimensions    image_width  image_height  pixels     corrupt_status
8   dataset1_Limestone_225_30.jpg   Limestone  data/2_processed/Limestone/dataset1_Limestone_...  .jpg       (450, 900)    450.0        900.0         405000.0   True
10  dataset1_Limestone_265_336.jpg  Limestone  data/2_processed/Limestone/dataset1_Limestone_...  .jpg       (600, 600)    600.0        600.0         360000.0   True
21  dataset1_Limestone_257_329.jpg  Limestone  data/2_processed/Limestone/dataset1_Limestone_...  .jpg       (1182, 1587)  1182.0       1587.0        1875834.0  True
28  dataset1_Limestone_258_33.jpg   Limestone  data/2_processed/Limestone/dataset1_Limestone_...  .jpg       (533, 800)    533.0        800.0         426400.0   True
31  dataset1_Limestone_198_276.jpg  Limestone  data/2_processed/Limestone/dataset1_Limestone_...  .jpg       (533, 800)    533.0        800.0         426400.0   True
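
The flagged files still report valid dimensions, so `check_corrupted` appears to test something stricter than whether the file can be opened. Its source is not shown here. One check commonly used in TensorFlow/Keras image pipelines is to look for the `JFIF` marker in the file header; the sketch below assumes that approach, and `looks_corrupted` is a hypothetical name, not the project's helper.

```python
import tempfile
from pathlib import Path


def has_jfif_marker(path):
    """Return True if the first 10 bytes of the file contain the JFIF marker."""
    with open(path, "rb") as fobj:
        return b"JFIF" in fobj.read(10)


def looks_corrupted(path):
    """Hypothetical stand-in for check_corrupted: flag files without a JFIF header."""
    return not has_jfif_marker(path)


# Two throwaway files: a JFIF-style header and a PNG-style header.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.jpg"
    bad = Path(tmp) / "bad.jpg"
    good.write_bytes(b"\xff\xd8\xff\xe0\x00\x10JFIF\x00" + b"\x00" * 16)
    bad.write_bytes(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16)
    results = {p.name: looks_corrupted(p) for p in (good, bad)}

print(results)
```

A check like this would also flag valid PNG files and Exif-only JPEGs, so it is best read as "not a JFIF JPEG" rather than "unreadable".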

Class Counts

import pandas as pd
import seaborn as sns

# One row per class with its image count, most frequent first
count_df = df['class'].value_counts().rename_axis('class').reset_index(name='count')

sns.set_theme(style="darkgrid")
# Horizontal bars: class on the y-axis, count on the x-axis
ax = sns.barplot(y='class', x='count', data=count_df)

Count of rock types.

Image size analysis

import matplotlib.pyplot as plt

width_list = df.image_width
height_list = df.image_height
# Series.mean() skips NaN rows (unreadable files); sum()/len() would return NaN
average_width = width_list.mean()
average_height = height_list.mean()

# print('average width: {} and height: {}'.format(average_width, average_height))

# Side-by-side histograms of image width and height
fig, ax = plt.subplots(1, 2, figsize=(15, 8))

sns.histplot(width_list, ax=ax[0])
ax[0].set_title('Image width');
sns.histplot(height_list, ax=ax[1])
ax[1].set_title('Image height');

# Plot histograms to show the distribution of width and height values
fig, axs = plt.subplots(1, 2, figsize=(15, 7))
# dropna() keeps rows with unreadable dimensions out of the histograms
axs[0].hist(df.image_width.dropna().values, bins=20, color='#91bd3a')
axs[0].set_title('Width distribution')
# axs[0].set_xlim(1000, 3000)

axs[1].hist(df.image_height.dropna().values, bins=20, color='#91bd3a')
axs[1].set_title('Height distribution')
# axs[1].set_xlim(1000, 3000)

plt.suptitle('Image Dimensions')
plt.show()

Width and Height distribution.

Sample counts

Sampling type: None.

Training set counts

get_df("data/3_tfds_dataset/train")['class'].value_counts()
Coal         289
Quartzite    261
Limestone    228
Sandstone    209
Marble       142
Granite       84
Basalt        66
Name: class, dtype: int64
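
The training split is imbalanced: Coal has roughly 4.4 times as many images as Basalt. One common response, which this page does not apply, is to weight each class inversely to its frequency. The sketch below uses the same heuristic as scikit-learn's "balanced" class weights, `n_samples / (n_classes * class_count)`. The function name is illustrative.

```python
# Training-set counts copied from the output above.
train_counts = {
    "Coal": 289, "Quartzite": 261, "Limestone": 228, "Sandstone": 209,
    "Marble": 142, "Granite": 84, "Basalt": 66,
}


def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * n) for name, n in counts.items()}


weights = balanced_class_weights(train_counts)
imbalance_ratio = max(train_counts.values()) / min(train_counts.values())

for name, weight in weights.items():
    print(f"{name:<10} {weight:.2f}")
print(f"largest/smallest class ratio: {imbalance_ratio:.2f}")
```

Rare classes receive weights above 1 and frequent classes below 1, so each class contributes the same total weight to the loss.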

Validation set counts

get_df("data/3_tfds_dataset/val")['class'].value_counts()
Coal         62
Quartzite    55
Limestone    48
Sandstone    44
Marble       30
Granite      18
Basalt       14
Name: class, dtype: int64

Test set counts

get_df("data/3_tfds_dataset/test")['class'].value_counts()
Coal         63
Quartzite    57
Limestone    50
Sandstone    46
Marble       32
Granite      19
Basalt       15
Name: class, dtype: int64
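
The three outputs are consistent with a stratified split of roughly 70/15/15. A quick arithmetic check on the counts printed above (the variable names are illustrative):

```python
# Per-class counts copied from the three outputs above.
splits = {
    "train": {"Coal": 289, "Quartzite": 261, "Limestone": 228, "Sandstone": 209,
              "Marble": 142, "Granite": 84, "Basalt": 66},
    "val": {"Coal": 62, "Quartzite": 55, "Limestone": 48, "Sandstone": 44,
            "Marble": 30, "Granite": 18, "Basalt": 14},
    "test": {"Coal": 63, "Quartzite": 57, "Limestone": 50, "Sandstone": 46,
             "Marble": 32, "Granite": 19, "Basalt": 15},
}

# Overall share of images in each split.
totals = {name: sum(counts.values()) for name, counts in splits.items()}
grand_total = sum(totals.values())
split_fractions = {name: n / grand_total for name, n in totals.items()}

# Share of each class that ended up in the training split.
train_share = {
    cls: splits["train"][cls] / sum(splits[s][cls] for s in splits)
    for cls in splits["train"]
}

print({name: round(frac, 3) for name, frac in split_fractions.items()})
print({cls: round(share, 3) for cls, share in train_share.items()})
```

Every class keeps between 69% and 70% of its images in the training split, which is what a stratified split would produce.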

Data Augmentation

Sample Images

ds = builder.as_dataset(split='train', shuffle_files=True)
# tfds.show_examples(ds, builder.info)

Samples before Augmentation

train_dataset = load_dataset()
visualize_dataset(train_dataset, title="Before Augmentation");

Samples after RandAugment

# AUTOTUNE is assumed to be tf.data.AUTOTUNE, defined earlier in the notebook
train_dataset = load_dataset().map(apply_rand_augment, num_parallel_calls=AUTOTUNE)
visualize_dataset(train_dataset, title="After RandAugment")

Samples after cutmix and mixup augmentation

train_dataset = load_dataset().map(cut_mix_and_mix_up, num_parallel_calls=AUTOTUNE)
visualize_dataset(train_dataset, title="After cut_mix and mix_up")
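
The body of `cut_mix_and_mix_up` is not shown on this page. Its name suggests the standard MixUp and CutMix augmentations, which are usually defined as below for two samples $(x_a, y_a)$ and $(x_b, y_b)$ with one-hot labels. The hyperparameter $\alpha$ is not given here, and whether the helper follows these definitions exactly is an assumption.

```latex
% MixUp: blend two images and their labels with a random weight
\tilde{x} = \lambda x_a + (1 - \lambda) x_b, \qquad
\tilde{y} = \lambda y_a + (1 - \lambda) y_b, \qquad
\lambda \sim \mathrm{Beta}(\alpha, \alpha)

% CutMix: paste a rectangular patch of x_b into x_a using a binary mask M
\tilde{x} = M \odot x_a + (1 - M) \odot x_b, \qquad
\tilde{y} = \lambda y_a + (1 - \lambda) y_b, \qquad
\lambda = \frac{\text{number of pixels where } M = 1}{\text{total number of pixels}}
```

In both cases the label weights match the share of each source image in the result.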

EDA using FastAI

Validation Samples

[Path('data/2_processed/Quartzite/dataset1_Quartzite_134_images(202).jpg'),
 Path('data/2_processed/Coal/dataset2_Coal_401_coal rock204.jpg'),
 Path('data/2_processed/Coal/dataset1_Coal_105_194.jpg')]

Image sample size

Training set samples: 1373 images.
Validation set samples: 457 images.

Sample Images

Data Augmentation using FastAI

MixUp

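
MixUp blends pairs of images and their labels with a random weight. A minimal sketch in plain Python of what happens to one pair of samples; this is not the FastAI implementation, and `mix_up`, the flat pixel lists, and `alpha=0.4` are illustrative assumptions.

```python
import random


def mix_up(x_a, y_a, x_b, y_b, alpha=0.4, rng=random):
    """Blend two samples (flat pixel lists) and their one-hot labels."""
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]
    x = [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


rng = random.Random(0)  # seeded so the example is repeatable
# Two hypothetical 4-pixel "images" with one-hot labels for two classes.
x, y, lam = mix_up([0.0] * 4, [1.0, 0.0], [1.0] * 4, [0.0, 1.0], rng=rng)
print(lam, x, y)
```

The mixed label is soft: its entries sum to 1 and equal the share of each source image in the blend.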

CutMix
