# Splice_site_detection

This project classifies splice sites in DNA sequences using neural networks implemented in Python and Keras.

The dataset can be found here: https://archive.ics.uci.edu/ml/datasets/Molecular+Biology+(Splice-junction+Gene+Sequences)

The main goal of this study is to identify sequences that contain splice junction sites and to classify each site as 'Exon/Intron', 'Intron/Exon', or 'Neither'. Two approaches were used. In the first approach, DNA sequences were converted into k-mers of three different lengths. Three types of neural network models were trained: an artificial neural network (ANN), a convolutional neural network (CNN), and a recurrent neural network (RNN). In the second approach, sequences were converted into 60x60 grayscale images, generating 3190 images in total. The scripts used to convert the sequences into images are included (preprocessing.m, Preprocessing.java). Preprocessed files are included in the Preprocessed_datafiles folder.
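
The k-mer conversion used in the first approach can be illustrated with a minimal sketch. It assumes overlapping k-mers taken with a stride of 1, which may differ from what the preprocessing scripts in this repository do. The function name `to_kmers` and the example sequence are hypothetical.

```python
def to_kmers(sequence, k):
    """Return the overlapping k-mers of `sequence`, in order (stride 1 is an assumption)."""
    sequence = sequence.strip().upper()
    # A sequence of length n yields n - k + 1 overlapping k-mers.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Hypothetical 10-nucleotide example, split into 3-mers.
print(to_kmers("CCAGCTGCAT", 3))
```

The resulting list of k-mers can then be treated like the words of a sentence and encoded numerically before being passed to a model.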

The generated images were split into train, validation, and test sets in the folder input_dataset_image.
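
A split of this kind can be sketched as follows. The 70/15/15 proportions, the fixed seed, and the function name `split_indices` are assumptions made for illustration. The proportions actually used for input_dataset_image are not stated here.

```python
import random


def split_indices(n_items, train_pct=70, val_pct=15, seed=0):
    """Shuffle item indices and split them into train, validation and test lists.

    The percentages are hypothetical defaults; the test set takes the remainder.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_train = n_items * train_pct // 100
    n_val = n_items * val_pct // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


# 3190 is the total number of images reported above.
train_idx, val_idx, test_idx = split_indices(3190)
print(len(train_idx), len(val_idx), len(test_idx))
```

Each index would then be mapped to its image file and copied into the matching subfolder.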

The General_models folder includes scripts for the models trained before hyperparameter tuning.

Models were trained while varying important hyperparameters, including the number of layers in the model, learning rate, number of epochs, batch size, and optimization method. Hyperparameter tuning was conducted for a k-mer size of 3, and the optimal hyperparameters identified were then applied to the remaining k-mer sizes. The scripts used for hyperparameter tuning are included in the folder Hyperparameter_tuning.
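
One way to organize such a search is an exhaustive grid over the listed hyperparameters, sketched below. The search strategy is not stated here, so the grid itself is an assumption. The candidate values and the names `search_space`, `iter_configs`, and `best_config` are hypothetical. The `evaluate` callable stands in for training a model and returning a validation score.

```python
from itertools import product

# Hypothetical candidate values; the ranges actually explored are not given here.
search_space = {
    "num_layers": [1, 2, 3],
    "learning_rate": [0.01, 0.001],
    "epochs": [20, 50],
    "batch_size": [32, 64],
    "optimizer": ["adam", "sgd"],
}


def iter_configs(space):
    """Yield one dict per combination of hyperparameter values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def best_config(space, evaluate):
    """Return the (config, score) pair with the highest score.

    `evaluate` is a user-supplied function that trains a model with the
    given config and returns a score such as validation accuracy.
    """
    scored = ((cfg, evaluate(cfg)) for cfg in iter_configs(space))
    return max(scored, key=lambda pair: pair[1])


configs = list(iter_configs(search_space))
print(len(configs))  # number of combinations in the grid
```

Because the grid grows multiplicatively with each added hyperparameter, tuning on a single k-mer size and reusing the result for the others keeps the number of training runs manageable.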

Models trained with the optimal hyperparameters identified by hyperparameter tuning are included in the folder Models_trained_after_hyperparamter_tuning.