Machine learning-based prediction of DNA G-quadruplex folding topology with G4ShapePredictor


Scientific Reports volume 14, Article number: 24238 (2024)

Deoxyribonucleic acid (DNA) is able to form non-canonical four-stranded helical structures with diverse folding patterns known as G-quadruplexes (G4s). G4 topologies are classified based on their relative strand orientation following the 5’ to 3’ phosphate backbone polarity. Broadly, G4 topologies are either parallel (4+0), antiparallel (2+2), or hybrid (3+1). G4s play crucial roles in biological processes such as DNA repair, DNA replication, transcription and have thus emerged as biological targets in drug design. While computational models have been developed to predict G4 formation, there is currently no existing model capable of predicting G4 folding topology based on its nucleic acid sequence. Therefore, we introduce G4ShapePredictor (G4SP), an application featuring a collection of multi-classification machine learning models that are trained on a custom G4 dataset combining entries from existing literature and in-house circular dichroism experiments. G4ShapePredictor is designed to accurately predict G4 folding topologies in potassium (\(\hbox {K}^+\) ) buffer based on its primary sequence and is able to incorporate a threshold optimization strategy allowing users to maximise precision. Furthermore, we have identified three topological sequence motifs that suggest specific G4 folding topologies of (4+0), (2+2) or (3+1) when utilising the decision-making mechanisms of G4ShapePredictor.

G-quadruplexes (G4s) are non-canonical four-stranded helical structures formed by guanine-rich nucleic acid sequences. A G4 structure consists of multiple square planar arrangements of four guanines stacked on top of each other. A single square planar is bound together by eight Hoogsteen hydrogen bonds, known collectively as a guanine-tetrad (G-tetrad)1,2. Multiple biophysical studies have demonstrated that G4s exhibit a diverse range of structures. G4s are conventionally classified based on their phosphate backbone orientation. There are three main classifications3,4, namely parallel (4+0), antiparallel (2+2) and hybrid (3+1). A G4 is known as parallel (4+0), when all four strands are oriented in the same direction, i.e., \((4\uparrow , 0\downarrow )\) ; antiparallel (2+2), when two strands are oriented in the opposite direction relative to the other two strands, i.e., \((2\uparrow , 2\downarrow )\) ; hybrid (3+1) when three strands are oriented in one direction and the remaining strand oriented in the other direction, i.e., either \((3\uparrow , 1\downarrow )\) or \((1\uparrow , 3\downarrow )\) . G4s are further stabilised by a cation located at its central core. The selection of cation is not restrictive—multiple studies have shown that G4s form in the presence of different cations, such as \(\hbox {K}^+\) 5,6,7, \(\hbox {Na}^+\) 8,9,10, \(\hbox {Li}^+\) 10 and many others11. Additionally, the same G4 may adopt different topologies in different ionic environments12,13.

G4s play key roles in various biological processes14,15, including gene regulation16,17, replication18,19, and telomere maintenance20. As G4 structures have been extensively studied in vitro, researchers have proposed G4s as novel pharmaceutical targets for small molecules or ligands21,22,23,24, which upon binding have shown great potential to inhibit enzymes25, stabilise G4 structures26,27, modify genetic material28,29 and allow for controlled toxicity activation30. Thus, understanding the topology of G4s is essential, as different topologies have different affinities for small-molecule drugs15,31 and may have distinct biological functions. Past studies of G4 topologies have shown that the ability of a G4 to inhibit DNA replication depends on its topology32 and that the binding force of hemin to G4 is topology-dependent33.

While experimental methods provide insights into G4 structures, they can be time-consuming and resource-intensive. Computational methods offer a faster and more scalable approach to predict and analyze G4s. Several algorithms and machine learning models have been introduced to detect potential G4-forming DNA and RNA sequences34,35. Models include simple expression matching algorithms (e.g. G\(_x\) N\(_y\) G\(_x\) N\(_y\) G\(_x\) N\(_y\) G\(_x\) where N \(=\{\text {A}, \text {T}, \text {C}, \text {G}\}\) , \(x\ge 3\) and \(1\le y\le 7\) )36,37,38, rule-based scoring approaches39,40, traditional machine learning41,42,43,44 and deep learning41,45,46,47,48.
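The expression-matching approach mentioned above can be illustrated with a short regular expression. This is a minimal sketch and not the implementation of any cited tool: the names `PQS_PATTERN` and `find_pqs` are hypothetical, and each G-tract is allowed to have its own length of at least 3, which is a simplifying assumption.

```python
import re

# Four G-tracts (x >= 3) separated by three loops of 1-7 nucleotides.
# Assumption: tracts may differ in length; loops are matched lazily so
# each loop is as short as the pattern allows.
PQS_PATTERN = re.compile(
    r"G{3,}[ATCG]{1,7}?G{3,}[ATCG]{1,7}?G{3,}[ATCG]{1,7}?G{3,}"
)


def find_pqs(sequence):
    """Return (start, end, match) for each non-overlapping hit."""
    sequence = sequence.upper()
    return [(m.start(), m.end(), m.group())
            for m in PQS_PATTERN.finditer(sequence)]


if __name__ == "__main__":
    # c-MYC sequence as written later in the text (TGAG3TG3TAG3TG3TA2).
    print(find_pqs("TGAGGGTGGGTAGGGTGGGTAA"))
```

Such a pattern only flags candidate G4-forming sequences; it carries no information about folding topology.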

However, there is currently no model to identify the topological classes (i.e., parallel (4+0), antiparallel (2+2), and hybrid (3+1)) formed by putative quadruplex sequences (PQS). We introduce G4ShapePredictor (G4SP), an application that would allow users to input PQS and predict their potential topological class. G4SP specialises in predicting formation of unimolecular G4 in the presence of \(\hbox {K}^+\) ions, with the PQS composed of adenine (A), thymine (T), cytosine (C) and guanine (G). Predicting the potential topology of PQS offers researchers a more specific understanding of the structure characteristics of these sequences.

A total of 1005 sequence-topology pairs were compiled (Fig. 1E), of which 70 (40 parallel, 10 antiparallel, 20 hybrid) were from PDB, 344 (160 parallel, 102 antiparallel, 82 hybrid) were from in-house CD experiments and 591 (287 parallel, 104 antiparallel, 200 hybrid) were compiled from literature. The sequence length distribution (Fig. 1G) of the compiled dataset is as follows: a mean of 24.2 nucleotides (nt), median of 22 nt, mode of 22 nt, standard deviation of 6.3 nt and interquartile range of 3 nt. G4 folding topologies were recorded in the presence of a potassium buffer with pH 6.8 to 7.4. 344 sequence-topology pairs out of the dataset of 1005 were procured from in-house CD experiments and are previously unreported findings. The data collection workflow can be found in Fig. 1. More information about the sequence-topology pairs and their corresponding sources can be found in Supplementary Table 1.

Six models were trained for multi-class classification to distinguish between parallel (4+0), antiparallel (2+2) and hybrid (3+1) topologies on the G-quadruplex dataset. The six models are XGBoost classifier (XGB), CatBoost Classifier (CBC), Light Gradient Boosting Machine (LGBM), RandomForest classifier (RF), ExtraTrees classifier (EXT), and Gradient-Boosted Classifier (GBC). In the multi-class classification context, accuracy measures the proportion of correct predictions across all classes, distinguishing it from binary classification, where only the positive and the negative classes are considered. In this paper, \(\pm \, x\) refers to the standard deviation of the mean as accuracy calculations are averaged over 100 iterations. The average multi-class classification accuracy on the test set was 0.75 ± 0.02 (Fig. 2A). Notably, the EXT model achieved the highest accuracy of 0.76 ± 0.01.

However, it is essential to note that multi-class accuracy can sometimes be misleading49,50,51, as it may mask performance disparities across individual classes52,53,54,55. To provide a clearer insight into class-specific performance, we adopted the one-vs-rest (OVR) strategy, where each class was fitted with a binary classifier against all other classes (Fig. 2B). Importantly, accuracy for the antiparallel class is generally higher compared to other classes across all models. The CBC model boasts the highest accuracy for the antiparallel class at 0.893 ± 0.003, closely followed by the XGB model, with accuracy of 0.892 ± 0.003 for the antiparallel class. The area under the receiver operating characteristics (AUROC) is 0.927 ± 0.004, 0.95 ± 0.03, 0.86 ± 0.02 with area under precision-recall (AUPR) of 0.92 ± 0.01, 0.89 ± 0.02, 0.77 ± 0.05 when averaged across all six models for parallel (4+0), antiparallel (2+2) and hybrid (3+1) classes respectively. The individual AUROC and AUPR values for each model can be found in Fig. 2C-H.

To further enhance the predictive ability of our models, we implemented a threshold optimization approach56,57,58,59. We begin by plotting precision and recall as a function of threshold (Fig. 3E1-E6). The precision-recall-threshold plot facilitated the identification of optimal thresholds to achieve desired precision values. Figure 3A-B depicts the multi-class validation accuracy for thresholds that yield a precision of 0.9 and 0.8 respectively. The average accuracy is \(0.90\pm 0.01\) , \(0.80\pm 0.01\) for optimal thresholds yielding 0.9, 0.8 precision values respectively. Through this custom threshold optimization approach, there are noticeable improvements across all models. For a more detailed analysis, we adopted a one-vs-rest strategy, where each class was fitted with a binary classifier (Fig. 3C-D). This strategy was applied while utilizing the identified thresholds that yield precision values of 0.8 and 0.9. Across all six models, the average validation accuracies at thresholds optimized to achieve 0.9 and 0.8 precision are \(0.90 \pm 0.01\) , \(0.92 \pm 0.01\) , \(0.90 \pm 0.01\) and \(0.80 \pm 0.01\) , \(0.81 \pm 0.01\) , \(0.83 \pm 0.01\) respectively for the parallel (4+0), antiparallel (2+2), and hybrid (3+1) classes.

We wish to understand how specific nucleotides at a given sequence position influence the classification decisions of our models. To this end, we used feature importance scores to determine the relative importance of each feature (i.e., nucleotide n at sequence position x) in a given topology classifier. We implemented a feature importance permutation strategy60,61,62 using binary classifiers (Fig. 4A-C), where nucleotide(s) n centered on a position x is permuted while other positional features are kept unchanged. We used each model to determine the accuracy of the new permuted sequence and compared it with the original accuracies as shown in Fig. 2B. The largest decrease in accuracy would indicate a feature or features having the greatest importance. We implemented a minimum separation distance r to identify the most prominent importance peaks. A larger r would allow us to analyse longer DNA motifs centered on the more prominent feature(s). At \(r=10\) , we have 2 peaks at positions \(x_{r=10}^{(4+0)} = (40\text{-}44, 59)\) for the parallel classifier, 2 peaks at positions \(x_{r=10}^{(2+2)} = (40\text{-}44, 59)\) for the antiparallel classifier and 2 peaks at positions \(x_{r=10}^{(3+1)} = (41, 62)\) for the hybrid classifier. A range over multiple positions was used when the peaks were ambiguous. Peaks with lower minimum separation distance r such as \(x_{r=3}\) and \(x_{r=5}\) for (4+0), (2+2), (3+1) classifiers can be found in Figure S4.1−S4.6.

Past studies have shown that PQS commonly exhibits the sequence pattern G\(_{3+}\) N\(_{1-7}\) G\(_{3+}\) N\(_{1-7}\) G\(_{3+}\) N\(_{1-7}\) G\(_{3+}\) 36. However, the motif pertains to whether a single-stranded DNA will form a G4. We will use the models of G4SP to propose G4 topological motifs that will predict folding of a specific G4 topology. Under this proposal, we adopt the IUPAC nucleotide code where R \(=\) \(\{\) A, G\(\}\) , W \(=\) \(\{\) A, T\(\}\) , Y \(=\) \(\{\) C, T\(\}\) , V \(=\) \(\{\) A, C, G\(\}\) , M \(=\) \(\{\) A, C\(\}\) , N \(=\) \(\{\) A, T, C, G\(\}\) , and limit our prediction to unimolecular G4 formed in 100 mM KCl or K\(_2\) HPO\(_4\) buffer. We permute sequences of different lengths on positions (see Heatmap visualisation of G4 topological motifs under Methods) that have the highest feature importance and plot a heatmap (Fig. 4D-F) for better visualisation. Based on the results visualised in the heatmap, we propose the following PQS motifs: RGNG\(_3\) N\(_{1-9}\) G\(_3\) NR is a parallel (4+0) topological motif, WG\(_2\) YNWN\(_{3-10}\) TWNG is an anti-parallel topological motif and VG\(_2\) WYN\(_{2-11}\) MWG\(_3\) is a hybrid G4 motif. Performing a regex pattern search on the G4 dataset used in this study, we found that the proposed parallel motif is present in 44% of all parallel sequences, 3% of antiparallel sequences and 4% of hybrid sequences. The proposed antiparallel motif is present in 2% of all parallel sequences, 30% of antiparallel sequences and 0.3% of all hybrid sequences. The proposed hybrid motif is present in 5% of all parallel sequences, 4% of all antiparallel sequences and 19% of all hybrid sequences. Identification of the proposed motif in a PQS suggests that the PQS is likely to fold into a G4 with the corresponding topology. However, the absence of a specific topological motif does not imply that the PQS is restricted from folding into that specific topology.
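The regex pattern search described above can be sketched as follows. The three motif strings are transcribed from the proposed motifs, with subscripts written as regex quantifiers; the helper names `motif_to_regex` and `matching_topologies` are hypothetical, and the code is an illustration rather than the search script used in the study.

```python
import re

# IUPAC codes used in the text, expanded to regex character classes.
IUPAC = {
    "A": "A", "T": "T", "C": "C", "G": "G",
    "R": "[AG]", "W": "[AT]", "Y": "[CT]",
    "V": "[ACG]", "M": "[AC]", "N": "[ATCG]",
}

# Proposed motifs, e.g. RGNG3N1-9G3NR written as RGNGGGN{1,9}GGGNR.
MOTIFS = {
    "parallel (4+0)": "RGNGGGN{1,9}GGGNR",
    "antiparallel (2+2)": "WGGYNWN{3,10}TWNG",
    "hybrid (3+1)": "VGGWYN{2,11}MWGGG",
}


def motif_to_regex(motif):
    """Translate IUPAC letters; quantifiers such as {1,9} pass through."""
    return re.compile("".join(IUPAC.get(ch, ch) for ch in motif))


def matching_topologies(sequence):
    """List the topologies whose motif occurs anywhere in the sequence."""
    sequence = sequence.upper()
    return [name for name, motif in MOTIFS.items()
            if motif_to_regex(motif).search(sequence)]


if __name__ == "__main__":
    print(matching_topologies("TGAGGGTGGGTAGGGTGGGTAA"))
```

A sequence may match more than one motif or none; as noted above, the absence of a match does not exclude any topology.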

G4SP (Fig. 5) is a GUI application developed to facilitate the prediction of G4 structural topologies based on DNA input sequences. The application features a user-friendly interface where users either input DNA sequences directly or upload them in a .fasta file format. The backend integrates several machine learning models described in the section on Model architecture under Methods. After choosing a model, the user may adjust the precision parameter, which in turn updates the recall values displayed in real-time. Upon submission, the application predicts the potential topology that may be adopted by the G4 structure based on the input sequence, selected model and parameters. The results are displayed in a dedicated output section. Additionally, potential warnings may be displayed if sequences submitted have low G4 propensity (G4Hunter score less than 1.2) or have lengths less than 12 nt. Users have the option to export the results as a CSV file for further analysis.
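The two input warnings can be illustrated with a small check. This sketch assumes the commonly described G4Hunter scoring rule (each G in a run of n guanines scores min(n, 4), each C scores the negative equivalent, other bases score 0, and the sequence score is the mean); whether G4SP computes the score over the whole sequence in exactly this way is an assumption, and the function names are hypothetical.

```python
from itertools import groupby


def g4hunter_score(sequence):
    """Mean per-nucleotide G4Hunter-style score over the whole sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    total = 0
    for base, run in groupby(sequence):
        n = len(list(run))
        if base == "G":
            total += n * min(n, 4)   # each G in the run scores min(n, 4)
        elif base == "C":
            total -= n * min(n, 4)   # C runs score negatively
    return total / len(sequence)


def input_warnings(sequence, min_length=12, min_score=1.2):
    """Return warnings mirroring the two thresholds quoted in the text."""
    warnings = []
    if len(sequence) < min_length:
        warnings.append(f"sequence shorter than {min_length} nt")
    if g4hunter_score(sequence) < min_score:
        warnings.append(f"G4Hunter score below {min_score}")
    return warnings
```

For the c-MYC sequence TGAGGGTGGGTAGGGTGGGTAA this rule gives 37/22, roughly 1.68, so neither warning would be raised under these assumptions.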

G4SP is trained only on topologies of G4s formed in \(\hbox {K}^+\) buffer. We chose to focus on topologies of G4s formed in \(\hbox {K}^+\) solution as there are more data in the public domain on G4 formation in \(\hbox {K}^+\) solution compared to \(\hbox {Na}^+\) solution. This is likely because \(\hbox {K}^+\) has more physiological relevance than \(\hbox {Na}^+\) , owing to the abundance of \(\hbox {K}^+\) over \(\hbox {Na}^+\) in cellular environments2. However, we found that the available topological data on G4 formation in \(\hbox {K}^+\) were not sufficient for training machine learning models. Training datasets in machine learning are typically large, ranging from \(10^2\) samples for simpler models63 to \(10^6\) for complex neural network models46,64. A popular rule-of-thumb is that the training data should be at least 10 times the number of features used in the model65,66. As G4SP is trained on features of length 100, this rule suggests roughly 1000 samples, so in-house CD experiments were conducted to bolster the total sample size from 661 to 1005. The experimental data augmentation also focused on PQS thought to form antiparallel (2+2) topology to mitigate the negative effects of disproportionate class representation. A class-imbalanced training dataset has been shown to result in predictions skewed towards the majority class52,53,54,55, poor generalisation of the minority class67 and misleading performance metrics49,50,51. The adverse effects of training on a class-imbalanced dataset may be mitigated by increasing the quantity of the dataset64,68. Popular machine learning techniques to increase data quantity include artificial data augmentation, synthetic data generation and transfer learning.

In machine learning, thresholding optimization refers to the process of setting a specific threshold value to classify the outcomes of a predictive model56,57,58,59, rather than using a default threshold. Usually, the default threshold is 0.5 for binary classification problems. If the predicted probability of the positive class is greater than 0.5, the instance is classified as positive, otherwise it is classified as negative. Custom thresholding optimization is useful in scenarios where the classes are imbalanced, have different levels of importance or costs associated with misclassification. As G4SP aims to solve a multi-class classification problem, threshold optimization is applied to fine-tune the model’s sensitivity to each individual class.

G4SP encompasses six different models of varying architecture trained on the same G4 dataset. Based on multi-class accuracy, AUROC and AUPR, EXT is the best performing model across all three metrics, with an accuracy of 0.76 ± 0.01, average AUROC of 0.93 ± 0.02 and average AUPR of 0.88 ± 0.04. However, the optimal model depends on the specific task. For example, XGB is the best model to identify antiparallel G4s despite having a lower accuracy score compared to EXT as the XGB model exhibits the highest AUROC/AUPR score for the antiparallel class. The threshold optimization technique utilised by G4SP adds further ambiguity to the identification of the best performing model. Nonetheless, we suggest that users consider the output of all six models, and if necessary, assign weights to the outputs based on the specific task at hand.
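One way to act on the suggestion of weighting the outputs of all six models is a weighted vote. The sketch below is only an illustration of that idea: G4SP itself is not stated to provide such a function, and the name `weighted_vote`, the weight values and the tie-breaking rule are all assumptions.

```python
from collections import defaultdict


def weighted_vote(predictions, weights=None):
    """Combine per-model topology calls into one call.

    predictions: {model_name: topology}
    weights:     {model_name: float}; models not listed get weight 1.0.
    """
    weights = weights or {}
    totals = defaultdict(float)
    for model, topology in predictions.items():
        totals[topology] += weights.get(model, 1.0)
    # Highest total wins; ties are broken alphabetically for determinism.
    return max(sorted(totals), key=totals.get)


if __name__ == "__main__":
    calls = {"XGB": "antiparallel", "CBC": "antiparallel", "LGBM": "parallel",
             "RF": "parallel", "EXT": "parallel", "GBC": "hybrid"}
    print(weighted_vote(calls))
    # Hypothetical weights favouring models that do well on antiparallel G4s.
    print(weighted_vote(calls, {"XGB": 2.0, "CBC": 2.0}))
```

With equal weights the majority call is returned; up-weighting XGB and CBC in this toy example switches the result to antiparallel.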

We sought to decipher the underlying decision-making mechanisms of G4SP. However, decision-making mechanisms of machine learning models are generally complex and challenging to distill into simple rules. In computational biology, a strategy to showcase the model mechanisms in a simple manner is to extract sequence motifs used by the model for G4 classification41,46,48. We identified the subsequences aligned on important positional features, extracted subsequences with the highest accuracy increase and labelled them as G4 topological motifs. In the process, we assume that subsequence permutation would not change the folding topology. As we increase the length of subsequence permuted, we increase the risk of an unintended change in folding topology. In other words, greater emphasis should be placed on motifs with shorter lengths. In order to mitigate the limitations of subsequence permutation, we performed a regex pattern search on the G4 dataset and found that our proposed topological motifs are present in a significant proportion of our G4 dataset. The results demonstrate the validity of our proposed G4 topological motifs.

While G4SP is a powerful tool for predicting G4 topologies, it has certain limitations.
(1) Training data bias: G4SP is trained on a dataset with a mean and median length of 24.2 nt and 22 nt, respectively. Predictions might be more accurate for sequences around these lengths, and G4SP may not perform as well on sequences with lengths that significantly deviate from 20-24 nt. Additionally, G4SP is trained on in-house CD experiments, with sequences primarily taken from the human genome. This could introduce a bias towards human genomic sequences and may not be as effective for predicting G4 topologies in non-human genomes.
(2) Simple categorisation of G4 topologies: G4SP is trained on G4 topologies that are broadly classified as parallel (4+0), antiparallel (2+2) and hybrid (3+1). However, G4 structural information includes loops, bulges, grooves, discontinuities, among others, and thus is much more varied and complicated70. The output of G4SP is strictly based on the three categories of G4 topologies defined by strand orientations. Our topology classification rationale is further explored in Section 2 of Supplementary Information.
(3) Class imbalance: The dataset used for training G4SP exhibits a notable class imbalance, with 487 instances for parallel (4+0) class, 216 for antiparallel (2+2) class, and 302 for hybrid (3+1) class. Such an imbalance may skew the model’s predictive capabilities as there is an inherent bias towards the overrepresented class, potentially compromising the accuracy of predictions for the underrepresented classes (antiparallel (2+2) and hybrid (3+1)). This imbalance might also hinder the model’s ability to capture the intricate variability of the minority classes, elevating the risk of misclassifications. Furthermore, there is an increased susceptibility to overfitting the majority class, which could diminish the model’s generalizability on unseen data, especially from the minority classes.
(4) Assumption of input data: G4SP preprocesses input sequences using G4Hunter to identify potential G-quadruplex sequences (PQS). However, as G4Hunter may produce false positives/negatives, G4SP’s assumption that all filtered sequences are PQS may introduce potential inaccuracies in its predictions.
(5) Buffer specificity: G4SP is specifically trained on G4 formation in potassium (\(\hbox {K}^+\) ) buffer conditions. Therefore, prediction of G4 topologies formed in buffers other than \(\hbox {K}^+\) might be inaccurate.

In this study, we introduced G4SP, a computational tool designed to predict the topological classes of PQS. Leveraging a comprehensive dataset of 1005 sequence-topology pairs, G4SP offers insight into the structural topologies of G4s, bridging a gap in the current landscape of G-quadruplex research.

Our approach emphasized the importance of high-quality data, as evidenced by our decision to conduct in-house CD experiments to augment our dataset instead of using synthetic data augmentation techniques. This decision was rooted in the understanding that while larger datasets can enhance model performance, the quality and specificity of the data are paramount.

The performance metrics of our models, particularly when enhanced with thresholding optimization, underscore the efficacy of G4SP. Notably, the application’s ability to achieve high accuracy rates across different topological classes, even with the inherent challenges of class imbalance, speaks to its robustness and potential utility in real-world applications.

Using G4SP, we proposed G4 topological motifs that when present in a PQS, allow one to predict PQS folding into parallel, antiparallel or hybrid topology in the presence of \(\hbox {K}^+\) ions. Despite the limitations of the feature importance permutation strategy, we have confirmed that our proposed topological motifs are present in a significant proportion of our G4 dataset.

Like all models, G4SP has its limitations. Its performance may vary based on sequence lengths that deviate significantly from the mean and median lengths in our dataset. Furthermore, its training on primarily human genomic sequences may introduce biases when predicting G4 topologies in non-human genomes. Additionally, the method of threshold optimization, while enhancing precision, will lead to a decrease in recall. The trade-off will result in the model output of “mixed topology” when faced with ambiguous or borderline input PQS. Despite these challenges, the tool has potential to advance our understanding of complex biological structures.

G4SP represents a significant stride forward in the realm of G-quadruplex research. By offering researchers a rapid, efficient, and accurate means to predict G4 topologies, it paves the way for more targeted and informed experimental studies. As the field continues to evolve, tools like G4SP will undoubtedly play a pivotal role in shaping our understanding of G-quadruplexes and their myriad biological implications.

DNA G4s and their corresponding topologies were collated from three main sources. 1: Protein Data Bank71 (PDB). Topologies of G4s deposited in PDB were determined by inspection and confirmation by an independent database72. 2: Circular dichroism (CD) studies reported in literature. A list of procured data used in this study can be found in Supplementary Table 1 with CD spectra plotted in Figure S2.1−S2.68. G4 topologies were determined by CD spectral features73. CD spectra with a 264 nm maximum and 245 nm minimum are considered parallel (4+0); a 295 nm maximum and 260 nm minimum, antiparallel (2+2); a 295 nm maximum (with a shoulder at 260 nm) and 245 nm minimum, hybrid (3+1). 3: In-house CD experiments. G4 topologies were determined by inspection of CD spectral features. PQS DNA sequences thought to form a G4 with a specific topology were randomly chosen from the human genome to ensure class parity in the training dataset. After compilation from the three sources, the final G4 dataset consists of 487 parallel G4s, 216 antiparallel G4s and 302 hybrid G4s. All G4s chosen are unimolecular, formed in the presence of potassium (\(\hbox {K}^+\) ) ions and at pH levels of 6.8 to 7.4. We also encountered CD spectra with mixed topologies, characterized by very sharp peaks at 264 nm and less pronounced maxima at 295 nm. To maintain clarity and focus, these mixed topology spectra were omitted from our analysis.

25 nM DNA oligonucleotides were purchased from Integrated DNA Technologies (IDT) with standard desalting. The oligonucleotides were resuspended in deionized water to a final concentration of 200 \(\mu\) M. The stock DNA solutions were mixed to achieve a final concentration of 5 \(\mu\) M DNA and 100 mM potassium phosphate buffer K\(_2\) HPO\(_4\) . The DNA solution was then annealed at \(95^\circ\) C for 5 minutes and left overnight to cool down to room temperature.

Circular dichroism (CD) spectra were recorded on a JASCO-815 spectropolarimeter spanning a wavelength range of 220-320 nm using a 1-cm path length quartz cuvette and a 500 \(\mu\) L sample volume. CD spectra were averaged over ten scans with the buffer spectrum subtracted and zero-corrected at 320 nm. CD spectra are categorised into 3 different topologies based on CD features as described in Compilation of G-quadruplex structures under Methods.

PQS were represented as integers for training as follows: N\(=\) 0, A\(=\) 1, T\(=\) 2, C\(=\) 3, G\(=\) 4, where A, T, C, G are adenine, thymine, cytosine, guanine respectively and N refers to an absent nucleotide. PQS were padded to a nucleotide length of 100 using N. The padding length was chosen to capture all essential features while keeping computational efficiency in consideration. The topology for each PQS was represented as follows: ‘Parallel (4+0)’=0, ‘Antiparallel (2+2)’=1, ‘Hybrid (3+1)’=2. Nucleotides and topologies are represented as integer values to improve computational memory efficiency.
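The encoding above can be written compactly. The integer codes and the padded length of 100 are taken from the text; splitting the padding evenly on both sides (centering the sequence) is an assumption made here for illustration, and `encode_pqs` is a hypothetical helper name.

```python
NUCLEOTIDE_CODES = {"N": 0, "A": 1, "T": 2, "C": 3, "G": 4}
TOPOLOGY_CODES = {
    "Parallel (4+0)": 0,
    "Antiparallel (2+2)": 1,
    "Hybrid (3+1)": 2,
}
PADDED_LENGTH = 100


def encode_pqs(sequence, padded_length=PADDED_LENGTH):
    """Pad a PQS with N to padded_length and map it to integer codes."""
    sequence = sequence.upper()
    if len(sequence) > padded_length:
        raise ValueError("sequence longer than padded length")
    # Assumption: padding is centered, with any odd nucleotide on the right.
    left = (padded_length - len(sequence)) // 2
    right = padded_length - len(sequence) - left
    padded = "N" * left + sequence + "N" * right
    # Characters outside N/A/T/C/G raise KeyError.
    return [NUCLEOTIDE_CODES[base] for base in padded]


if __name__ == "__main__":
    features = encode_pqs("TGAGGGTGGGTAGGGTGGGTAA")
    label = TOPOLOGY_CODES["Parallel (4+0)"]
    print(len(features), label)
```

Each PQS thus becomes a fixed-length vector of 100 integer features paired with a single integer class label.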

Six different machine learning models (XGBoost, CatBoostClassifier, LightGBM, Random Forest, Extra Trees Classifier, Gradient-Boosted Classifier) were used for topology-sequence classification. All models were written in the Python programming language. Best hyperparameters were selected using an exhaustive search over all parameter values and are detailed in Section 6 of Supplementary Information.

The prediction performance of all six models was evaluated in terms of accuracy, Area Under the Receiver Operating Characteristic curve (AUROC) and Area Under the Precision-Recall curve (AUPR). The test data was set at 10% of the training data. All metrics were evaluated using k-fold cross validation. The final metrics were averaged across 1000 instances (100 repetitions of ten train-test folds), with class proportion preserved in every fold. The one-vs-rest (OvR) strategy was used to generate both AUROC and AUPR. OvR was used to generalise multi-class classification for improved interpretability.
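The idea of preserving class proportion in every fold can be shown without any library. This is a standard-library sketch with a hypothetical function name; in practice one would more likely use scikit-learn's `StratifiedKFold`, and nothing here is claimed to reproduce the exact splitting used in the study.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Return n_folds lists of sample indices with class proportions
    approximately preserved in each fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets about 1/n_folds of the class.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


if __name__ == "__main__":
    # Class counts from the G4 dataset: 487 parallel, 216 antiparallel, 302 hybrid.
    labels = [0] * 487 + [1] * 216 + [2] * 302
    for fold in stratified_folds(labels):
        print(len(fold), [sum(labels[i] == c for i in fold) for c in (0, 1, 2)])
```

Each fold serves once as the 10% test set while the remaining nine folds are used for training; repeating the procedure with different seeds gives the repeated estimates that are averaged.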

We extract the threshold that relates precision to recall in the precision-recall curve. We calculate the precision and recall values of a given class by treating that class as the positive class and the rest as the negative class. We begin by obtaining the probability \(\mathbb {P}\) that each instance belongs to the positive class. We iterate threshold t from 0 to 1 with a step size of 0.01. For a given t, we compare the result of \(\mathbb {P} > t\) to its actual class. Thus, we are able to calculate the precision and recall values for each t. By adjusting the threshold (decision boundary), a desired trade-off between precision and recall can be achieved, thus improving predictive ability for a given model. The exact precision-threshold relation for each model is shown in Fig. 3E1-E6. Additionally, the precision-threshold relations for models fitted with a one-vs-rest classifier can be found in Figure S5.
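The threshold sweep described above can be sketched in a few lines. The step size of 0.01 and the strict comparison \(\mathbb {P} > t\) follow the text; the function names, the toy probabilities and the rule of picking the lowest threshold that reaches a target precision are assumptions for illustration.

```python
def precision_recall_by_threshold(probabilities, is_positive, step=0.01):
    """Return [(t, precision, recall)] for t = 0, step, ..., 1.

    probabilities: predicted probability of the positive class per instance.
    is_positive:   True where the instance really belongs to that class.
    """
    n_steps = round(1 / step)
    pairs = list(zip(probabilities, is_positive))
    rows = []
    for k in range(n_steps + 1):
        t = k * step
        tp = sum(1 for p, y in pairs if p > t and y)
        fp = sum(1 for p, y in pairs if p > t and not y)
        fn = sum(1 for p, y in pairs if p <= t and y)
        # Precision is undefined (None) when nothing is predicted positive.
        precision = tp / (tp + fp) if tp + fp else None
        recall = tp / (tp + fn) if tp + fn else None
        rows.append((t, precision, recall))
    return rows


def lowest_threshold_for_precision(rows, target):
    """First threshold whose precision reaches the target, or None."""
    for t, precision, _ in rows:
        if precision is not None and precision >= target:
            return t
    return None


if __name__ == "__main__":
    probs = [0.913, 0.807, 0.604, 0.402, 0.198]
    truth = [True, True, False, True, False]
    rows = precision_recall_by_threshold(probs, truth)
    print(lowest_threshold_for_precision(rows, 0.75))
```

Raising the threshold generally trades recall for precision, which is why the recall values shown to the user change whenever the desired precision is adjusted.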

We use logo plots to illustrate the nucleic acid sequence alignment of different G4 topologies. The logo plots in Fig. 1F are generated using Logomaker74, a Python package that allows for simple and flexible logo generation. We first categorised the PQS from our G4 dataset into their respective topologies. Padding was added uniformly to a length of 100 and centered at position 50. As sequences are centered at position 50 and have different lengths, alpha (transparency) values are used to illustrate the percentage of nucleotide data unavailable at the flanking regions. As G4 PQS used in this study have a median length of 22 nt, the x-axis of the logo plot is rescaled to position 35 to 65, where meaningful information on nucleotide frequency is found. At each distinct position, the heights of nucleotide representations A, T, C, G are proportional to the frequency of nucleotides. Nucleotide representations stacked on top of each other are arranged (from top) in descending order of nucleotide frequency.

Using the 6 machine learning models, we transformed the multi-class classifiers into 3 binary classifiers, with each classifier representing a G4 topology. Each binary classifier was trained and evaluated with a k-fold cross-validation scheme of 10 folds. The accuracy of each binary classifier is similar to the previously trained binary classifiers (Fig. 2B). The permutation importance method was used to determine feature importance. We iterated through each feature (100 in total) in our dataset and permuted it while keeping all other features unchanged. The largest accuracy decrease during permutation would signal a feature having the greatest importance. The feature importances for all 6 models for different topologies are plotted in Fig. 4A-C. Significant peaks are identified by imposing a minimum separation distance, r, between peaks. If 2 identified peaks have a separation distance less than r, smaller peaks are removed until all remaining peaks are separated by at least r. Fig. 4A-C depicts peaks with \(r=10\) , with smaller r values \((r=\{3, 5\})\) found in Figure S4.1−S4.6.
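The two steps above, permuting one feature at a time and then filtering importance peaks by a minimum separation r, can be sketched as follows. This is a generic standard-library illustration with hypothetical function names, where `predict` stands for any trained binary classifier and each row of `X` is a list of features. A similar peak filter is available as the `distance` argument of `scipy.signal.find_peaks`, but whether the study used it is not stated.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which predict(row) equals the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, seed=0):
    """Accuracy drop when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    drops = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        rng.shuffle(column)
        # Rebuild the dataset with only column j permuted.
        permuted = [row[:j] + [value] + row[j + 1:]
                    for row, value in zip(X, column)]
        drops.append(baseline - accuracy(predict, permuted, y))
    return drops


def select_peaks(importances, r):
    """Keep local maxima, discarding smaller ones closer than r
    to an already kept (larger) peak."""
    candidates = [i for i in range(1, len(importances) - 1)
                  if importances[i - 1] < importances[i] >= importances[i + 1]]
    kept = []
    for i in sorted(candidates, key=lambda i: importances[i], reverse=True):
        if all(abs(i - j) >= r for j in kept):
            kept.append(i)
    return sorted(kept)
```

A feature the classifier ignores yields a drop of exactly zero, while larger values of r leave only the most prominent, well-separated positions.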

After identifying positional feature importance, we attempt to narrow down the more prominent sequence motifs. At positions identified with significant importance, sequence motifs of varying lengths are permuted. A significant decrease in accuracy indicates relative importance. Motif importance is visualised as a heatmap in Fig. 4D-F. D1-D4 (blue), E1-E4 (green), F1-F4 (red) represent motif importance for parallel, antiparallel and hybrid respectively. The motifs represented in the heatmap can be read by combining the labels on the \(y\)-axis with the labels on the \(x\)-axis, where the underlined nucleotide is centered on a position with significant importance. The lower-case letters a-f denote the positions on which motifs are centered: a: 44, b: 57, c: 43, d: 59, e: 41, f: 62.

G4SP is fully written in the Python programming language and is developed with the aid of PySimpleGUI. The user interface (UI) (see Figure 5) is designed with ease of access and user-friendliness in mind. Upon launch, G4SP prompts the user to enter PQS by typing in the multi-line text input field provided. The user may choose to provide a sequence identifier by providing a header line that begins with the “>” character. An example is shown in Figure 5(1), where c-MYC refers to TGAG\(_3\) TG\(_3\) TAG\(_3\) TG\(_3\) TA\(_2\) and c-kit refers to AG\(_3\) AG\(_3\) CGCTG\(_3\) AG\(_2\) AG\(_3\) . The user may also provide input PQS by clicking on the “Browse” button and uploading a .fasta file locally stored in the user’s computer. The “Show Parameters” button allows the user to choose a specific machine learning model and the desired precision. All six models described in G4 machine learning models under Results are available. The user may use the default precision value or choose a desired value ranging from 70 to 95. Choosing a desired precision value updates the corresponding recall values (for all 3 G4 topologies) in real-time. A help button resembling a question mark provides additional information to the user. Upon submission of the input PQS via the “Submit” button, the results will be displayed at the bottom of the UI, as shown in Figure 5(4). Additionally, warnings may be displayed (Figure 5(2)) if any of the PQS submitted either have lengths shorter than 12 nt or yield a G4Hunter score lower than 1.2. The user may export the G4 prediction results as a CSV file for further analysis. The exported CSV file will have 9 columns: (1) Name - a conditional column of sequence names created if sequence identifiers are provided, (2) Sequence, (3) Topology, (4) Comments - conditional warnings if the sequence is too short or has a G4Hunter score that is too low, (5) Model, (6) Precision, (7) Recall (4+0), (8) Recall (2+2), (9) Recall (3+1).
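The input format described above, optional header lines beginning with “>” followed by sequence lines, can be parsed with a few lines of code. This is a minimal sketch of standard FASTA handling and not G4SP's own parser; `parse_fasta` is a hypothetical name, and joining consecutive sequence lines into one record follows FASTA convention, which may differ from how the application treats header-less input.

```python
def parse_fasta(text):
    """Return [(name, sequence)] from FASTA-style text.

    name is None when a sequence is not preceded by a header line.
    """
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if chunks:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line.upper())
    if chunks:
        records.append((name, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = """>c-MYC
TGAGGGTGGGTAGGGTGGGTAA
>c-kit
AGGGAGGGCGCTGGGAGGAGGG
"""
    for name, sequence in parse_fasta(example):
        print(name, len(sequence), sequence)
```

The two example records correspond to the c-MYC and c-kit sequences given above, each 22 nt long.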

Models trained on G4 sequence-topology pairs were evaluated using accuracy, AUROC and AUPR under stratified 10-fold cross-validation. Accuracy is calculated by comparing the true topology \(y_{\text {true}}\) to the predicted topology \(y_{\text {pred}}\) , averaged over N predictions:

\(\text {Accuracy} = \frac{1}{N}\sum _{i=1}^{N} \mathbb {1}\left( y_{\text {pred},i} = y_{\text {true},i}\right)\)

The final accuracy is calculated by repeating the stratified k-fold process 100 times. The uncertainty is given by the standard error of the mean \((\pm x)\) :

\(x = \frac{s_D}{\sqrt{N}}\)

where \(s_D\) refers to the standard deviation of the sampled accuracies and N refers to the total number of samples. The roc_curve and precision_recall_curve functions from the scikit-learn metrics module were used to compute the ROC and precision-recall curves from which AUROC and AUPR are obtained. Together with stratified 10-fold cross-validation, the process was repeated 100 times to generate a distribution of AUROC/AUPR values. This distribution serves as the basis for determining optimal thresholds for improved model accuracy.
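As a rough illustration of the two calculations above, the sketch below computes the mean and standard error of a set of repeated cross-validation accuracies, and picks a decision threshold that reaches a target precision for one binary (one-vs-rest) classifier. The helper names, the use of the sample standard deviation and the toy scores are assumptions; the published thresholds were derived from distributions over repeated cross-validation, which this simplified version does not reproduce.

```python
import math
import statistics

def mean_and_sem(accuracies):
    """Mean accuracy and standard error of the mean, s_D / sqrt(N)."""
    n = len(accuracies)
    s_d = statistics.stdev(accuracies)  # sample standard deviation (assumption)
    return statistics.fmean(accuracies), s_d / math.sqrt(n)

def threshold_for_precision(scores, is_positive, target):
    """Lowest score threshold whose precision reaches `target`.
    Returns (threshold, precision, recall), or None if no threshold qualifies.
    Assumes at least one positive example."""
    n_pos = sum(is_positive)
    for t in sorted(set(scores)):
        predicted = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(predicted, is_positive))
        precision = tp / sum(predicted)  # at least one score is >= t
        if precision >= target:
            return t, precision, tp / n_pos
    return None

# Toy data (assumption): accuracies from four repeats, and predicted
# probabilities with ground truth for a single topology class.
mean_acc, sem = mean_and_sem([0.80, 0.82, 0.78, 0.80])

scores = [0.2, 0.4, 0.55, 0.6, 0.8, 0.9]
truth = [False, True, False, True, True, True]
at_80 = threshold_for_precision(scores, truth, target=0.8)
at_90 = threshold_for_precision(scores, truth, target=0.9)
```

Because recall cannot increase as the threshold rises, taking the lowest threshold that meets the precision target keeps recall as high as possible for that target. This mirrors the precision/recall trade-off exposed by the desired-precision setting in the UI.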

Figure 1. The step-by-step data collection workflow for the G4 dataset used in our work. The three primary sources of data are (A) results of circular dichroism reported in scientific literature, (B) the Protein Data Bank, and (C) in-house circular dichroism experiments. (D) G4 sequence-topology pairs are extracted and collated. The G4 dataset consists of 1005 sequence-topology pairs, with (E) 487 parallel (4+0), 216 antiparallel (2+2) and 302 hybrid (3+1). (F) Logo plot representation of the G4 dataset used in this study, depicting nucleotide frequency in parallel, antiparallel and hybrid G4s respectively. Nucleic acid sequences are centered at position 50, with alpha values (representing transparency) denoting the lack of nucleotide representation within the flanking regions. Letter height is proportional to the frequency of the nucleotide at that position. (G) The distribution of oligonucleotide lengths. In-house CD experiments were conducted on oligonucleotides of length 22, resulting in the sharp peak.

Figure 2. Performance comparison of classifiers with ten-fold cross-validation and standard deviation as error bars. (A) Cross-validation accuracies of multi-class classifiers used to predict G4 folding topologies. (B) Binary classifiers fitted for each class against all other classes (one-vs-rest strategy). Receiver Operating Characteristic (ROC) (top) and Precision/Recall (PR) (bottom) for (C) XGBoost (XGB), (D) CatBoostClassifier (CBC), (E) Light Gradient Boosting Machine (LGBM), (F) RandomForest (RF), (G) ExtraTrees (EXT), (H) Gradient-Boosted Classifier (GBC).

Figure 3. Custom thresholding performance comparison of classifiers with ten-fold cross-validation. (A) Validation accuracies at thresholds optimized for a precision of 0.9. (B) Validation accuracies of binary classifiers (one-vs-rest strategy) at a custom threshold for a precision of 0.9. (C) Validation accuracies at thresholds optimized for a precision of 0.8. (D) Validation accuracies of binary classifiers (one-vs-rest strategy) at a custom threshold for a precision of 0.8. Precision/recall as a function of threshold for (E1) XGB, (E2) CBC, (E3) LGBM, (E4) RF, (E5) EXT, (E6) GBC. Blue: parallel (4+0), green: antiparallel (2+2), red: hybrid (3+1), \({\small \star }\) : precision, \(\square\) : recall.

Figure 4. (A-C) Feature importance of nucleotide positions based on parallel, antiparallel and hybrid binary classifiers respectively. Feature-importance peaks are separated by a minimum distance of \(r=10\) . (D-F) Heatmap visualisation of DNA sequence motifs for parallel (blue), antiparallel (green), and hybrid (red) binary classifiers. The lower-case letters a-f denote the positions on which the represented motifs are centered, namely a: 44, b: 57, c: 43, d: 59, e: 41, f: 62 respectively. The motif represented by any given cell can be read sequentially by combining the \(y-\) axis label (vertical) with the \(x-\) axis label (horizontal). The centered nucleotide is highlighted in bold.

Figure 5. Graphical user interface (GUI) of the G4ShapePredictor (G4SP) application. (1) Text input for DNA PQS. (2) Warns the user if input sequences may not form G4 or have lengths shorter than 13 nucleotides. (3) Offers a choice of six machine learning models (Fig. 2A-B) and allows the user to apply custom thresholding to achieve a desired precision, with class-specific recall values updated in real time (Fig. 2C-H). (4) The selected model’s prediction based on the input sequences.

Data is provided within the manuscript or supplementary information files. The Python source code for training and evaluating the models and for predicting G4 topology is available at https://github.com/donn-liew/G4ShapePredictor.

D.L. and E.H.Y. acknowledge support from the Singapore Ministry of Education through the Academic Research Fund Tier 1 (RG140/22) and Academic Research Fund Tier 2 (MOE-T2EP50223-0014). The computational work for this article was partially performed on resources of the National Supercomputing Centre, Singapore (https://www.nscc.sg). We would like to thank Phan Anh Tuân for sharing his expertise on G-quadruplexes in a series of insightful discussions. We would also like to thank Tran Van Nhan for procurement of DNA oligonucleotides, Natalie Lim and Seah Hui Ling for guidance on the conduct of circular dichroism experiments.

Division of Physics and Applied Physics, School of Physical and Mathematical Sciences, Nanyang Technological University, 637371, Singapore, Singapore

Donn Liew, Zi Way Lim & Ee Hou Yong

E.H.Y. conceived and supervised the project; E.H.Y. and D.L. analysed the results; D.L. and Z.W.L. conducted the experiments; E.H.Y. and D.L. wrote the manuscript.

Correspondence to Ee Hou Yong.

The authors declare no competing interests.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Liew, D., Lim, Z.W. & Yong, E.H. Machine learning-based prediction of DNA G-quadruplex folding topology with G4ShapePredictor. Sci Rep 14, 24238 (2024). https://doi.org/10.1038/s41598-024-74826-2

DOI: https://doi.org/10.1038/s41598-024-74826-2

Scientific Reports (Sci Rep) ISSN 2045-2322 (online)
