November 22, 2019 feature
A deep learning-based model, DeepSpCas9, to predict SpCas9 activity

Thamarasee Jeewandara
contributing writer

In a new report in Science Advances, Hui Kwon Kim and interdisciplinary researchers at the departments of Pharmacology, Electrical and Computer Engineering, Medical Sciences, Nanomedicine and Bioinformatics in the Republic of Korea evaluated the activities of SpCas9, an RNA-guided Cas9 variant (a bacterial enzyme that cuts DNA for genome editing) from Streptococcus pyogenes. They used a high-throughput approach with 12,832 target sequences based on a human cell library to build a deep learning model that predicts the activity of SpCas9.
The data consisted of oligonucleotides (short chains of nucleotides, or building blocks), each containing a target sequence paired with a corresponding guide sequence that encodes a single-guide RNA (sgRNA), which can direct the Cas9 protein to bind and cleave a specific DNA sequence for genome editing. The team implemented deep learning-based training on the large dataset of SpCas9-induced indel frequencies to develop an SpCas9 activity-predicting model named DeepSpCas9, now available as a web tool. When the team tested the software against independently generated datasets, the results showed high generalization performance, i.e., the model could properly adapt to new, previously unseen data.
The CRISPR-Cas9 system functions as a genome editing tool with potential in a variety of species and cell types, including human cells, where the capacity to accurately predict SpCas9 enzyme activity is important. Researchers had previously developed computational models to predict SpCas9 activity based on datasets derived from gene-edited cells or on medium-sized datasets built with gene-transfer vectors (vehicles that transfer genes between bacteria and other cells). However, the generalization performance of these models was limited, since the quality and size of the datasets were not ideal. One example is the prediction of gene insertions and deletions (indels) that create functional knockouts (a method to inactivate genes in an experimental animal model in the lab). Additionally, the existing SpCas9-induced indel frequency datasets were also only medium-sized.

Kim et al. had previously reported on a deep learning-based computational model named DeepCpf1 to predict the activity of a different endonuclease (AsCpf1 from Acidaminococcus species) with high generalization performance. For this, they used a library of guide RNA-encoding and target sequence pairs to generate the large dataset used to train DeepCpf1. While similar library-based methods were used to characterize the edits generated by the Cas9 enzyme, a large dataset of Cas9-induced indel frequencies remained to be formed.
Scientists must therefore develop Cas9 activity-predicting computational models with high generalization performance. In this work, Kim et al. developed a high-throughput method to measure SpCas9-induced indel frequencies at tens of thousands of target sequences by modifying their earlier method, and used the resulting data to form DeepSpCas9. The DeepSpCas9 web tool is a deep learning-based model that can accurately predict the activities of SpCas9 with high generalization performance.

Kim et al. first prepared a lentiviral (a complex retrovirus subfamily that can incorporate foreign DNA) library of 15,656 guide RNA (gRNA)-encoding and target sequence pairs for high-throughput assessment of SpCas9 activities. The research team amplified the pool of oligonucleotides containing pairs of guide and target sequences using the polymerase chain reaction (PCR) and cloned them into a plasmid (transgene delivery system to transfer genetic material between cells).
In a two-step approach, the researchers cut the resulting plasmids and inserted the sgRNA scaffold sequence at the cut site to generate plasmid libraries. To subsequently form a cell library, the scientists treated human embryonic kidney 293T (HEK 293T) cells with lentivirus generated from the plasmid library. Each cell now contained a synthetic target sequence in its genome and expressed the corresponding sgRNA. The scientists then treated the cell library with the SpCas9-encoding lentivirus to cause sgRNA-directed cleavage and indel formation at the target sequences, with frequencies that depended on the sgRNA activity. To measure the indel frequencies, the scientists PCR-amplified the target sequences and subjected them to deep sequencing. Based on the high-throughput experiments, Kim et al. generated two datasets for training and testing the DeepSpCas9 model.
The scientists measured SpCas9 activities at 124 endogenous target sites with different properties of chromatin accessibility (effect of chromatin structure modifications on gene transcription) to test whether the indel frequencies at the integrated synthetic target sequences correlated with those at the corresponding endogenous sites. They observed a strong correlation between indel frequencies at the integrated target sites and at the endogenous locations within the HEK cells.
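The article does not spell out how an indel frequency is calculated, so the following is only a minimal sketch under a common assumption: the frequency at one target is the percentage of sequencing reads that carry an insertion or deletion. The function name, target names and read counts are hypothetical, and the study may apply additional filtering or background correction that is not described here.

```python
def indel_frequency(indel_reads: int, total_reads: int) -> float:
    """Return the percentage of reads at one target that carry an indel.

    Assumption: frequency = 100 * indel reads / total reads.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= indel_reads <= total_reads:
        raise ValueError("indel_reads must lie between 0 and total_reads")
    return 100.0 * indel_reads / total_reads


# Hypothetical read counts per target: (reads with indels, total reads).
read_counts = {
    "target_A": (450, 1000),
    "target_B": (30, 1200),
}

frequencies = {
    name: indel_frequency(indel, total)
    for name, (indel, total) in read_counts.items()
}

for name, freq in frequencies.items():
    print(f"{name}: {freq:.1f}% indels")
```

In this illustration, a highly active sgRNA shows up as a high percentage of edited reads at its paired target, which is the quantity a model such as DeepSpCas9 is trained to predict.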

The research team next used an end-to-end deep learning framework to develop DeepSpCas9, a computational model that predicts SpCas9 activity, from the large dataset. For the base model architecture, they used a convolutional neural network (CNN, similar to ordinary neural networks), and for the input they used a 30-nucleotide sequence, which they converted into a four-dimensional binary matrix using one-hot encoding (splitting columns containing numerical categorical data to many columns). To understand the generalization performance during model selection and training, the team conducted 10-fold cross-validation using Spearman correlation coefficients between experimental measurements and predicted Cas9 activity levels.
When they increased the size of the training dataset for cross-validation, the average Spearman correlation coefficients between the experimental indel frequencies and predicted scores from the DeepSpCas9 model steadily increased up to 0.77. Compared to conventional machine learning algorithms such as support vector machine (SVM), AdaBoost (adaptive boosting), random forest and gradient-boosted regression trees, the Spearman correlations of the DeepSpCas9 model were significantly higher. In total, DeepSpCas9 exhibited the best performance among all models.
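One-hot encoding of a nucleotide sequence can be sketched as follows. This is an illustration only, not the authors' code: the channel order A, C, G, T, the function name and the example 30-nucleotide sequence are all assumptions.

```python
from typing import List

NUCLEOTIDES = "ACGT"  # assumed channel order; the article does not specify one


def one_hot_encode(sequence: str) -> List[List[int]]:
    """Convert a DNA sequence into a 4 x len(sequence) binary matrix.

    Row i marks the positions at which nucleotide NUCLEOTIDES[i] occurs.
    """
    sequence = sequence.upper()
    for base in sequence:
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected character in sequence: {base!r}")
    return [[1 if base == nt else 0 for base in sequence] for nt in NUCLEOTIDES]


# Hypothetical 30-nucleotide input sequence, for illustration only.
example_sequence = "ACGT" * 7 + "AC"
matrix = one_hot_encode(example_sequence)

print(len(matrix), "rows x", len(matrix[0]), "columns")
for nt, row in zip(NUCLEOTIDES, matrix):
    print(nt, "".join(str(bit) for bit in row))
```

Each column of the matrix contains exactly one 1, so the network receives the sequence without any artificial numeric ordering among the four nucleotides.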
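The Spearman correlation coefficient used as the evaluation metric compares the rank order of two lists rather than their raw values. A minimal, dependency-free sketch is shown here; the helper names and the measured and predicted values are made up for illustration, and ties are handled by assigning average ranks, which is a standard convention rather than a detail taken from the study.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical measured indel frequencies (%) and model scores.
measured = [5.0, 20.0, 35.0, 60.0, 80.0]
predicted = [0.1, 0.3, 0.2, 0.7, 0.9]
rho = spearman(measured, predicted)
print(f"Spearman correlation: {rho:.2f}")
```

Because only the ordering matters, a model can score well on this metric even if its predicted scores are on a different scale from the measured indel frequencies.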

In previous work, Kim et al. considered chromatin accessibility information to improve the prediction of AsCpf1 enzyme activities at endogenous target sites. They sought to determine whether such considerations would also improve SpCas9 activity predictions. The results implied that fine-tuning with chromatin accessibility information barely improved the accuracy with which DeepSpCas9 predicted indel frequencies at endogenous sites, unlike in their previous efforts with AsCpf1. SpCas9 activity was therefore only slightly affected by chromatin accessibility, in strong contrast to the findings from the previously developed DeepCpf1 model.
To understand the generalization performance of DeepSpCas9, the research team tested the model using sufficiently large, independently generated datasets as test data. They compared the results with those of other programs such as DeepCRISPR. The results suggested that DeepSpCas9 maintained the highest generalization performance among nine published models used to predict SpCas9 activity. In this way, Hui Kwon Kim and the research team extensively validated the potential to accurately predict SpCas9 activity using the DeepSpCas9 web tool, alongside resources provided for research scientists to incorporate DeepSpCas9 into existing models. Based on the high generalization performance of DeepSpCas9, the research team expects the tool to facilitate higher accuracy in SpCas9-based genome editing.
More information: Hui Kwon Kim et al. SpCas9 activity prediction by DeepSpCas9, a deep learning–based model with high generalization performance, Science Advances (2019).
Hui Kwon Kim et al. Deep learning improves prediction of CRISPR–Cpf1 guide RNA activity, Nature Biotechnology (2018).
Hui K Kim et al. In vivo high-throughput profiling of CRISPR–Cpf1 activity, Nature Methods (2016).
Journal information: Science Advances, Nature Biotechnology, Nature Methods
© 2019 Science X Network