# Read the label map from the JSON file
with open(work_dir + 'label_num_to_disease_map.json') as f:
    labels = json.load(f)  # JSON object keys are always loaded as strings

# Convert keys from strings to ints so they match the integer `label` column
labels = {int(k): v for k, v in labels.items()}

# Add a human-readable class name to the working dataset
df['class_name'] = df['label'].map(labels)
```
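%% Cell type:markdown id: tags:
The key conversion matters because `Series.map` looks labels up by exact key, so string keys never match an integer column. The cell below is a minimal sketch of that behaviour on toy data. The in-memory JSON string, the short class names, and the names `toy_df`, `toy_labels`, and `unmapped` are assumptions made for illustration, not the contents of the real file.
%% Cell type:code id: tags:
```python
import io
import json

import pandas as pd

# Toy stand-in for label_num_to_disease_map.json (illustrative values only)
fake_file = io.StringIO('{"0": "CBB", "1": "CBSD", "4": "Healthy"}')
raw = json.load(fake_file)  # keys come back as strings: "0", "1", "4"

# Convert the keys to ints so they match an integer label column
toy_labels = {int(k): v for k, v in raw.items()}

toy_df = pd.DataFrame({"label": [0, 4, 1, 0]})
toy_df["class_name"] = toy_df["label"].map(toy_labels)

# Mapping with the unconverted dict finds no matches, so every value is NaN
unmapped = toy_df["label"].map(raw)

print(toy_df)
print(unmapped.isna().all())
```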
%% Cell type:code id: tags:
``` python
# Split df by label and bind each class subset to its own variable
CBB = df[df['label'] == 0]
CBSD = df[df['label'] == 1]
CGM = df[df['label'] == 2]
CMD = df[df['label'] == 3]
Healthy = df[df['label'] == 4]
```
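%% Cell type:markdown id: tags:
As a cross-check on the per-class split, the same subsets can be built with `groupby`, and `value_counts` shows how many rows each class holds. This is only a sketch on toy data: the `image_id` values and the names `toy_df`, `subsets`, and `counts` are hypothetical and not part of the original notebook.
%% Cell type:code id: tags:
```python
import pandas as pd

# Toy frame standing in for df (hypothetical image ids and labels)
toy_df = pd.DataFrame({
    "image_id": ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"],
    "label": [3, 3, 0, 4, 3],
})

# One sub-DataFrame per label value, keyed by the label number
subsets = {label: group for label, group in toy_df.groupby("label")}

# Rows per class, ordered by label, to make any imbalance visible
counts = toy_df["label"].value_counts().sort_index()

print(counts)
print(subsets[3])
```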
%% Cell type:code id: tags:
``` python
# Sampling images from each class
# The goal of these sample sizes is to help reduce class imbalances

# Since this is a real dataset, we can assume that the class imbalances reflect the
# real-world frequencies of each disease, so I am choosing to maintain a (more muted)
# class imbalance in the final dataset

# frac=1 keeps every row and only shuffles the order; CMD is downsampled to 90% of its rows
CBB = CBB.sample(frac=1)
CBSD = CBSD.sample(frac=1)
CGM = CGM.sample(frac=1)
CMD = CMD.sample(frac=0.9)
Healthy = Healthy.sample(frac=1)