Machine learning models for predicting protein condensate formation from sequence determinants and embeddings PaperPlayer biorxiv biophysics

    • Biowissenschaften

Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2020.10.26.354753v1?rss=1

Authors: Saar, K. L., Morgunov, A. S., Qi, R., Arter, W. E., Krainer, G., Lee, A. A., Knowles, T.

Abstract:
Intracellular phase separation of proteins into biomolecular condensates is increasingly recognised as an important phenomenon for cellular compartmentalisation and regulation of biological function. Different hypotheses about the parameters that determine the tendency of proteins to form condensates have been proposed with some of them probed experimentally through the use of constructs generated by sequence alterations. To broaden the scope of these observations, here, we established an in silico strategy for understanding on a global level the associations between protein sequence and condensate formation, and used this information to construct machine learning classifiers for predicting liquid-liquid phase separation (LLPS) from protein sequence. Our analysis highlighted that LLPS-prone sequences are more disordered, hydrophobic and of lower Shannon entropy than sequences in the Protein Data Bank or the Swiss-Prot database, and have their disordered regions enriched in polar, aromatic and charged residues. Using these determining features together with neural network based word2vec sequence embeddings, we developed machine learning classifiers for predicting protein condensate formation. Our model, trained to distinguish LLPS-prone sequences from structured proteins, achieved high accuracy (93%; 25-fold cross-validation) and identified condensate forming sequences from external independent test data at 97% sensitivity. Moreover, in combination with a classifier that had developed a nuanced insight into the features governing protein phase behaviour by learning to distinguish between sequences of varying LLPS propensity, the sensitivity was supplemented with high specificity (approximated ROC-AUC of 0.85). These results provide a platform rooted in molecular principles for understanding protein phase behaviour. The predictor is accessible from https://deephase.ch.cam.ac.uk/ .

Copy rights belong to original authors. Visit the link for more info

Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2020.10.26.354753v1?rss=1

Authors: Saar, K. L., Morgunov, A. S., Qi, R., Arter, W. E., Krainer, G., Lee, A. A., Knowles, T.

Abstract:
Intracellular phase separation of proteins into biomolecular condensates is increasingly recognised as an important phenomenon for cellular compartmentalisation and regulation of biological function. Different hypotheses about the parameters that determine the tendency of proteins to form condensates have been proposed with some of them probed experimentally through the use of constructs generated by sequence alterations. To broaden the scope of these observations, here, we established an in silico strategy for understanding on a global level the associations between protein sequence and condensate formation, and used this information to construct machine learning classifiers for predicting liquid-liquid phase separation (LLPS) from protein sequence. Our analysis highlighted that LLPS-prone sequences are more disordered, hydrophobic and of lower Shannon entropy than sequences in the Protein Data Bank or the Swiss-Prot database, and have their disordered regions enriched in polar, aromatic and charged residues. Using these determining features together with neural network based word2vec sequence embeddings, we developed machine learning classifiers for predicting protein condensate formation. Our model, trained to distinguish LLPS-prone sequences from structured proteins, achieved high accuracy (93%; 25-fold cross-validation) and identified condensate forming sequences from external independent test data at 97% sensitivity. Moreover, in combination with a classifier that had developed a nuanced insight into the features governing protein phase behaviour by learning to distinguish between sequences of varying LLPS propensity, the sensitivity was supplemented with high specificity (approximated ROC-AUC of 0.85). These results provide a platform rooted in molecular principles for understanding protein phase behaviour. The predictor is accessible from https://deephase.ch.cam.ac.uk/ .

Copy rights belong to original authors. Visit the link for more info

Top‑Podcasts in Biowissenschaften