Network Oriented multi-Omics Data Analysis and Integration (NOODAI)

Purpose:

The primary purpose of the NOODAI software is to offer users a method for collectively analyzing multiple omics datasets by integrating them into a unified, network-based framework. It addresses the broad challenge of integrative multi-omics analysis. The elements of interest (e.g. upregulated/downregulated genes, proteins or small molecules) from each of the omics datasets provided for a specific comparison between two conditions are merged into a joint protein-protein interaction network (enriched with protein-small molecule interaction knowledge). Afterwards, multiple centrality scores are computed for each node in the network, highlighting the most important proteins/small molecules. Subsequently, the network is decomposed into subnetworks (modules), each subnetwork being composed of upregulated or downregulated omics elements that belong to a specific signaling pathway. Each subnetwork characterizes a specific biological function, and together the subnetworks define the analyzed conditions.

Important: To use this platform, you need at least one list of upregulated or downregulated omics elements obtained by comparing one condition to another (or data that is conceptually similar).

 

General analysis pipeline and expected results:

The NOODAI algorithm starts from lists of upregulated (or downregulated) proteins/small molecules extracted from different omics analyses, organized in separate Excel tables that have multiple sheets corresponding to different comparisons of interest (contrasts). The algorithm is organized into 4 analysis segments as follows:

·       Segment I: For a specific contrast, all the proteins/small molecules from different omics analyses are merged into a unified interaction network using filtered knowledge available in STRING, BioGrid, and IntAct databases. After the network is built, for each protein/small molecule (node) a number of centrality metrics are computed, the one of interest being the current-flow betweenness centrality. These centrality scores provide a metric of the importance of each node in the context of the overall network.

·       Segment II: After constructing the network and computing the centrality scores, the full-size network is decomposed into subnetworks (modules) using the MONET decomposition tool. Theoretically, each module should be related to a specific biological function, and this can be checked by evaluating how well the members of the subnetwork associate with a particular signaling pathway.

·       Segment III: After decomposing the network, the associated signaling pathways are extracted for each module containing more than 10 members. Subsequently, their over-representation against a specified pathway database is evaluated relative to a selected background.

·       Segment IV: Finally, three types of outputs are created. First, circular diagrams highlight the most important transcription factors and their connecting proteins in the full-size network, with each protein marked according to the module in which it is found. “Most important” is defined by default as being in the top 30% of most central nodes sorted by the current-flow betweenness centrality. Secondly, the top 3 pathways (based on FDR levels) for the first 5 modules (based on their size) are represented. Notably, these pathways may overlap, so for publication purposes it is recommended to recreate the plot so that it emphasizes unique pathways supported by a strong biological rationale. Lastly, a summary report is created, highlighting the most important and robust nodes, pathways, transcription factors and kinases.

The entire analysis is performed for each contrast individually.
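
The over-representation step of Segment III can be illustrated with a small example. This is only a sketch: the text above does not state which statistical test NOODAI applies, so the one-sided hypergeometric test, the function name ora_p_value and the toy identifiers are assumptions made for illustration.

```python
from math import comb


def ora_p_value(module, pathway, background):
    """Return P(X >= k) for the overlap between a module and a pathway.

    Assumption: a one-sided hypergeometric test relative to `background`.
    """
    background = set(background)
    module = set(module) & background      # only background members count
    pathway = set(pathway) & background
    n_background = len(background)
    n_pathway = len(pathway)
    n_module = len(module)
    overlap = len(module & pathway)
    # Sum the hypergeometric probabilities of seeing `overlap` or more hits.
    favourable = sum(
        comb(n_pathway, i) * comb(n_background - n_pathway, n_module - i)
        for i in range(overlap, min(n_pathway, n_module) + 1)
    )
    return favourable / comb(n_background, n_module)


background = [f"G{i}" for i in range(20)]  # toy background, not real IDs
pathway = background[:5]                   # hypothetical pathway members
module = background[:5]                    # hypothetical module members
p = ora_p_value(module, pathway, background)
```

In a real analysis the resulting p-values would still need a multiple-testing correction to obtain the FDR levels mentioned above; that step is left out of the sketch.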

All the definitions and the entire folder structure are presented in the summary report.
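
The default selections used for the Segment IV outputs (top 30% most central nodes, top 3 pathways for the 5 largest modules) can be sketched as follows. The function names, the input dictionaries and the rounding rule (rounding the 30% cut-off up) are assumptions for illustration and may differ from what NOODAI does internally.

```python
def top_central_nodes(centrality, percent=30):
    """Return the `percent` most central nodes, highest score first.

    `centrality` maps node name -> current-flow betweenness score.
    Assumption: the cut-off is rounded up to a whole number of nodes.
    """
    ranked = sorted(centrality, key=centrality.get, reverse=True)
    keep = (len(ranked) * percent + 99) // 100  # integer ceiling
    return ranked[:keep]


def top_pathways(modules, enrichment, n_modules=5, n_pathways=3):
    """Pick the lowest-FDR pathways for the largest modules.

    `modules` maps module id -> list of members;
    `enrichment` maps module id -> list of (pathway name, FDR) tuples.
    """
    largest = sorted(modules, key=lambda m: len(modules[m]), reverse=True)
    return {
        m: sorted(enrichment.get(m, []), key=lambda hit: hit[1])[:n_pathways]
        for m in largest[:n_modules]
    }
```

Sorting by FDR within each module does not remove pathways shared between modules, which is why the overlap mentioned for Segment IV can appear in the default plot.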

NOODAI webtool interface

The webtool contains 3 principal tab panels:

·       “Run the full pipeline” – The main tab, where you can run the entire analysis pipeline with only a few input parameters.

·       “Custom algorithms” – In this tab, each segment of the analysis pipeline can be customized as needed. Requires a thorough understanding of the analysis pipeline!

·       “Results download” – The only place from which you can download your analysis results, using the unique identifier that was given to you after you submitted an analysis request.

Set-up considering the available Demo using the “Run the full pipeline” tab

As a tutorial dataset, we analyzed proteomics, phosphoproteomics and transcriptomics measurements of 3 macrophage phenotypes derived from primary cells: M1, M2a and M2c (Original publication). For these conditions, differential expression analysis (DEA) between them was performed. The upregulated elements of M1 compared to M2a from the DEA are denoted as M1vsM2a. The Demo dataset can be downloaded from the NOODAI platform using the top-right button. It includes an archive containing formatted Excel table examples (“Uploaded_Data_Archive.zip”) and a txt file specifying the setup values for the fields in the NOODAI platform.

Mandatory input fields description

Conditions names: Provide the name of each analyzed condition (phenotype), including the “control” (so all of them!). Enter the names separated by commas (e.g., M1, M2a, M2c). Do not use quotes, and avoid using '_' in any of the names! Ensure that the sheet names of all input Excel tables are composed of these names, and that all conditions contrasts are constructed using these names.

Conditions contrasts: Because the analysis is based on the elements that are upregulated/downregulated when comparing one condition to another, each such directed comparison is termed a “contrast” here. For example, comparing M1 to M2a yields two contrasts: M1vsM2a, corresponding to the upregulated elements of M1 compared to M2a, and M2avsM1, corresponding to the upregulated elements of M2a compared to M1. Each contrast should match an Excel sheet with an identical name in the tables from the uploaded ZIP archive. Please enter the contrasts separated by commas (e.g., M1vsM2a,M2avsM1,M1vsM2c).
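
Before submitting, the two fields above can be checked against each other with a few lines of code. This is a minimal sketch, not part of NOODAI: the function names are hypothetical, and the rule that a contrast is spelled as two condition names joined by "vs" is inferred from the examples (M1vsM2a).

```python
def split_field(text):
    """Split a comma-separated form field into trimmed, non-empty items."""
    return [item.strip() for item in text.split(",") if item.strip()]


def check_inputs(conditions_text, contrasts_text):
    """Return a list of human-readable problems (empty if none were found)."""
    conditions = split_field(conditions_text)
    # Names must not contain underscores or quotes.
    problems = [
        f"condition name not allowed: {name}"
        for name in conditions
        if "_" in name or '"' in name or "'" in name
    ]
    # Every ordered pair of distinct conditions gives one possible contrast
    # (assumption: contrasts are written as <first>vs<second>).
    allowed = {f"{a}vs{b}" for a in conditions for b in conditions if a != b}
    problems += [
        f"contrast not built from the condition names: {contrast}"
        for contrast in split_field(contrasts_text)
        if contrast not in allowed
    ]
    return problems
```

For the Demo values, check_inputs("M1, M2a, M2c", "M1vsM2a,M2avsM1,M1vsM2c") returns an empty list.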

Omics files archive: The web server accepts as input an archive that contains one Excel table for each of the omics profiles of interest. Each table should have at least one sheet, named after a unique analyzed contrast (e.g., M1vsM2a), that contains the upregulated elements of the first condition compared to the second one (e.g., the M1vsM2a sheet in the Proteomics.xlsx file contains the upregulated proteins found in M1 when compared to M2a, considering a log2FC threshold of 1 and an FDR threshold of 0.05). Every sheet that is to be analyzed MUST list the upregulated proteins/small molecules in a column named UniProt_ChEBI, and the entries of this column MUST be Uniprot or ChEBI IDs. All tables must contain the same number of sheets with the same names, even if some sheets are empty!

Example of table format:

UniProt_ChEBI Optional Optional
P42224 STAT1 0.05
P40763 STAT3 0.05
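
The entries of the UniProt_ChEBI column can be screened before upload with a short check like the one below. It is a sketch under stated assumptions: the UniProt pattern follows the accession format published by UniProt, while the spelling expected for ChEBI entries is not stated here, so the "CHEBI:<digits>" pattern and the function name invalid_ids are assumptions.

```python
import re

# UniProt accession format as documented by UniProt (6 or 10 characters).
UNIPROT_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Assumption: ChEBI entries are written as "CHEBI:" followed by digits.
CHEBI_RE = re.compile(r"CHEBI:\d+")


def invalid_ids(column_values):
    """Return the entries that look like neither a UniProt nor a ChEBI ID."""
    return [
        value
        for value in column_values
        if not (UNIPROT_RE.fullmatch(value) or CHEBI_RE.fullmatch(value))
    ]


# A gene symbol placed in the ID column by mistake is reported.
flagged = invalid_ids(["P42224", "P40763", "CHEBI:15377", "STAT1"])
```

A pattern match only shows that an entry is well formed; it does not confirm that the accession exists in the database.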

The demo can be run by loading the Conditions names, Conditions contrasts and Uploaded_Data_Archive.zip using the Demo button followed by pressing the Submit button. All other fields have default values and are optional, depending on your needs.

 

Optional inputs (all have default values):

MONET method: The MONET method parameters must be specified following the MONET tool instructions (MONET). Please consult the respective instructions for further details. The format must remain the same as the pre-loaded one. The available MONET methods are M1, K1 and R1. By default, the networks are undirected, with a desired average node degree of 10 in the output.

Pathway databases: To extract the over-represented pathways associated with each identified MONET module, the pathway databases against which this enrichment is conducted must be specified. Possible databases are Reactome, Wikipathways, BioCarta, PID, NetPath, HumanCyc, INOH and SMPDB. Note that licensing restrictions may apply to some databases, and you are responsible for compliance. The developers assume no liability in the event of a legal dispute. Currently, Reactome and Wikipathways are released under public domain (CC0) licenses and can be used freely.

BioMart dataset: The conversion of input Uniprot IDs to other identifiers is facilitated through the BioMart database service. The pre-selected dataset is the one corresponding to humans. If you are using data from other organisms, please provide the correct identifier (e.g., mmusculus_gene_ensembl (mouse), btaurus_gene_ensembl (Bos taurus), celegans_gene_ensembl (Caenorhabditis elegans), clfamiliaris_gene_ensembl (Canis familiaris), drerio_gene_ensembl (Danio rerio), dmelanogaster_gene_ensembl (Drosophila melanogaster), ggallus_gene_ensembl (Gallus gallus), rnorvegicus_gene_ensembl (Rattus norvegicus), scerevisiae_gene_ensembl (Saccharomyces cerevisiae), xtropicalis_gene_ensembl (Xenopus tropicalis), sscrofa_gene_ensembl (Sus scrofa), spombe_gene_ensembl (Schizosaccharomyces pombe)). If your data come from an organism other than the pre-loaded ones listed above, then besides selecting the BioMart dataset YOU MUST provide a pre-formatted interaction table and pathway, TF and Kinome databases (depending on the analysis segment that you are interested in)!

Interaction table file: Activating the sign highlights the interaction database fields, which depend on the option selected for the “Use Pre-compiled Interaction file” field. Default protein-protein interaction databases are already loaded on the server; however, due to technical limitations, ALL databases are already filtered to include only the protein-protein and protein-small_molecule interactions of the selected species! The pre-formatted interaction file that you can upload has the following format: 2 columns named Interactor1 and Interactor2, followed by lines with 2 proteins/small molecules per line, given as NCBI or ChEBI IDs! It must be a tab-separated text file. YOU MUST provide this table if your data come from an organism other than the selected ones! Formatting example:

Interactor1 Interactor2
10421 23020
10755 4646
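
A file in this layout can be produced with the Python standard library. The sketch below is illustrative only: the function name write_interaction_table is hypothetical, and the two pairs are the ones from the formatting example above.

```python
import csv
import io


def write_interaction_table(pairs, handle):
    """Write interactor pairs as a tab-separated table with the two
    mandatory column names, Interactor1 and Interactor2."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Interactor1", "Interactor2"])
    for first, second in pairs:
        writer.writerow([first, second])


# Demonstration on an in-memory buffer; for a real upload, pass a file
# opened with open("my_interactions.txt", "w", newline="") instead.
buffer = io.StringIO()
write_interaction_table([("10421", "23020"), ("10755", "4646")], buffer)
table_text = buffer.getvalue()
```

Keeping the identifiers as strings avoids accidental reformatting of the IDs when the table is written.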

BioGrid database file: One of the protein-protein interaction sources is the BioGrid database. The available version is 4.4.218 in mitab format filtered to include only entries with a confidence score. If you would like to upload another version you can upload it here. The database will be filtered automatically for humans!

STRING database file: Another protein-protein interaction database is STRING. The version available on the server is 11.5, containing the complete interaction data from all sources, filtered to include only entries with a combined score above 0.7 and only the interactions of the selected organisms. A file that you upload here will not be filtered, so if you choose to provide this database, filter it a priori for your organism of interest! Do not upload the entire database as it is too big!
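
If you prepare your own STRING file, the filtering can be done along the following lines. This is a sketch under assumptions: it expects the three-column, space-separated links layout distributed by STRING (protein1, protein2, combined_score), in which scores are stored on a 0 to 1000 scale so that 0.7 corresponds to 700. Whether the cut-off is inclusive is not stated here; the sketch keeps scores of 700 or more. The function name is hypothetical.

```python
def filter_string_links(lines, taxon_id="9606", min_score=700):
    """Keep interactions of one organism with a high combined score.

    `lines` is any iterable of text lines, e.g. an open file handle.
    STRING protein identifiers are prefixed with the NCBI taxon id.
    """
    prefix = taxon_id + "."
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3 or fields[0] == "protein1":
            continue  # skip the header and malformed lines
        first, second, score = fields
        if (first.startswith(prefix) and second.startswith(prefix)
                and int(score) >= min_score):
            kept.append((first, second, int(score)))
    return kept


# Toy lines with placeholder identifiers, not real STRING entries.
sample = [
    "protein1 protein2 combined_score",
    "9606.ENSP0000A 9606.ENSP0000B 912",
    "9606.ENSP0000A 9606.ENSP0000C 450",
    "10090.ENSMUSP0000A 10090.ENSMUSP0000B 990",
]
human_links = filter_string_links(sample)
```

Reading the file line by line keeps memory use low, which matters because the complete database is very large.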

IntAct database file: IntAct database is used as well for the analysis. The psimitab from 13/07/2022 is already loaded and filtered to keep only interactions with a confidence score above 0.7 for the selected organisms. If you choose to upload another database, please filter it a priori.

Interaction table file: The protein-protein interactions from BioGrid, STRING and IntAct databases filtered as above and the small molecule interactions from IntAct, are merged into a final pre-compiled interaction table.

 

Analysis submission and "Results download" tab

After providing the input data, you can submit an analysis request using the Submit button. If there are no evident errors related to the data formatting, the output panel containing your analysis ID will appear. Save this 'results folder ID' as it cannot be retrieved later. If you provide an email address in the 'Email address (Optional)' field, you will receive an email notification once the analysis is finished.

To download your results, navigate to the “Results download” tab and enter your “results folder ID” in the “Results directory index” field. Finally, use the button on the right-hand side to initiate the download of your results. From this tab, you can also download a pre-compiled version of the Demo dataset.

“Custom algorithms” tab

The options in this tab are tailored for advanced users who possess a thorough understanding of all analysis steps. Users have the possibility to submit an initial analysis or, for an existing analysis, redo specific segments of the analysis pipeline, thereby changing the original results. To ensure full flexibility, error monitoring is kept to a minimum.

To run any code segment, please provide inputs for all non-transparent fields!

Below is an explanation of the fields that were not covered under the “Run the full pipeline” section.

Results directory index: Provide the folder associated with your results in which you would like to redo some analysis. The original analysis for the respective segment will be discarded. If no folder is provided, a new one will be created.

DTU file id: In this field, you can specify the name of the omics table that contains the alternative splicing DTU results. The NOODAI algorithm was designed to include splicing data as an input omics type. Compared to proteomics or transcriptomics data, differentially used transcripts (DTU) extracted from analyzing spliced data have some particular characteristics. Firstly, DTU hits often form large, well-connected networks that are disconnected from any proteomics or transcriptomics hits. To address this, NOODAI restricts the analysis of networks composed of more than 75% DTU hits, ensuring the meaningfulness of the resulting networks. Moreover, DTU hits are the same for symmetric contrasts (M1vsM2a is equivalent to M2avsM1). However, to keep the networks meaningful, common elements between DTU and other analyses are merged only for that specific contrast. Example: If MAPK1 is identified as alternatively spliced between M1 and M2a and is also found to be upregulated in proteomics for M1, it will be retained only for M1. To achieve this, the uploaded data must have values for all the contrasts symmetrically (for both M1vsM2a and M2avsM1). You can upload any omics data meeting these characteristics (the entries are the same for both M1vsM2a and M2avsM1) as splicing data.
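
Two of the DTU requirements above can be checked before upload: the table must contain both directions of every contrast with identical entries, and a network made up of more than 75% DTU hits is restricted. The sketch below only illustrates these two checks. The function names and the input structures are hypothetical, contrasts are assumed to be spelled as <first>vs<second> with condition names that do not themselves contain "vs", and the sketch does not reproduce NOODAI's internal merging logic.

```python
def asymmetric_dtu_contrasts(dtu_sheets):
    """Return contrasts whose mirrored sheet is missing or differs.

    `dtu_sheets` maps a contrast name (e.g. "M1vsM2a") to its list of IDs.
    """
    problems = []
    for contrast, ids in dtu_sheets.items():
        first, _, second = contrast.partition("vs")
        mirror = f"{second}vs{first}"
        if mirror not in dtu_sheets or set(dtu_sheets[mirror]) != set(ids):
            problems.append(contrast)
    return sorted(problems)


def exceeds_dtu_limit(network_nodes, dtu_hits, limit=0.75):
    """True if more than `limit` of the network nodes are DTU hits."""
    nodes = set(network_nodes)
    if not nodes:
        return False
    return len(nodes & set(dtu_hits)) / len(nodes) > limit


# P28482 is the UniProt accession of MAPK1, used in the example above.
symmetric = {"M1vsM2a": ["P28482"], "M2avsM1": ["P28482"]}
unbalanced = {"M1vsM2a": ["P28482"], "M2avsM1": []}
```

Here asymmetric_dtu_contrasts(symmetric) returns an empty list, while both contrasts of the second table are reported.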

Edge file path: By default, the names used for the nodes during the MONET decomposition and pathway analysis are sourced from the 'Symbol' folder as Gene Names. If you prefer to use Uniprot IDs instead of Gene names, you can specify the folder here.

Temporary folder: This is the temporary folder for the MONET analysis. By default, it is deleted after MONET results are successfully copied to their default location. If you run the interface locally, you can specify your desired temporary folder location.

MONET path: This is the path to the MONET executable. Change only if you run the analysis locally!

MONET background file: By default, the over-representation analysis of members associated with each MONET cluster uses a background consisting of all proteins/small molecules from the full-size network, stored in 'Background_total.xlsx'. If you prefer to use a different background, you can load an Excel table formatted similarly to 'Background_total.xlsx' (no header, only NCBI IDs, and each sheet representing a contrast).

CPDB database file: The signaling pathway knowledge is sourced mainly from CPDB-provided files and Reactome. If you prefer to utilize different databases or newer versions, please supply an updated CPDB pathway database or a file with a similar format. You must provide this table if you have species other than the pre-loaded ones and wish to perform this analysis segment.

Edge files directory: This must be consistent with the 'Edge file path'. By default, the circular plots and the final summary report use gene names. If you prefer to use Uniprot IDs, you can switch to the Uniprot folder (formatted identically to the 'Edge file path').

TF database: If you wish to utilize your custom transcription factor dataset, you can upload it here. Ensure it follows the same format as the dataset available on GitHub (Databases/_TF with the mandatory column 'Symbol'). You must provide this table if you have species other than the pre-loaded ones and wish to perform this analysis segment.

Species Symbol Family
9606 PAX8 PAX
9606 STAT3 STAT

Centralities file: By default, circular diagrams are generated for the combination of all omics datasets. However, because every omics dataset is also analyzed on its own, you have the flexibility to create circular plots and pathway plots for individual omics datasets. By uploading your own file, you can customize the input, which influences the final plots. This is particularly useful if the default representation of the top 30% most central nodes does not suit your data.

Kinome Database: In the summary report, kinases are identified using a pre-compiled database. If you wish to highlight elements other than kinases, you have the option to change this database to one of your preferences. Ensure that the format of the uploaded text file matches that of the one provided on GitHub (Databases/kinome.txt with mandatory columns 'Uniprot' and 'Gene_name'). You must provide this table if you have species other than the pre-loaded ones and wish to perform this analysis segment.
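
A custom database file can be checked for the mandatory columns before it is uploaded. This is a minimal sketch: the function name missing_columns is hypothetical, and the tab delimiter is an assumption, since the exact delimiter of the GitHub file is not stated here.

```python
import csv
import io


def missing_columns(handle, required=("Uniprot", "Gene_name"), delimiter="\t"):
    """Return the required column names that are absent from the header."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader, [])  # an empty file yields an empty header
    return [name for name in required if name not in header]


# In-memory examples; for a real file, pass open(path, newline="").
good = io.StringIO("Uniprot\tGene_name\nP28482\tMAPK1\n")
bad = io.StringIO("Uniprot\tName\nP28482\tMAPK1\n")
good_result = missing_columns(good)
bad_result = missing_columns(bad)
```

The same check can be applied to a custom TF database by passing required=("Symbol",).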

File_ending: By default, circular diagrams, pathway plots, and summary reports are generated using the network composed of all omics data (files ending with 'Total'). If you prefer to generate plots only for a specific omics dataset, you can change the file ending here to match the one corresponding to your input omics file name.

 

When changing the parameters in the “Custom algorithms” tab, ensure consistency by updating all relevant fields! For example, do not change the File_ending without uploading another Centralities file!

Purpose of the Custom algorithms:

·       Integrate splicing data or data with similar properties with other omics measurements.

·       Run MONET using another algorithm, such as R1 or K1, without waiting for the computation of all centralities;

·       Provide a different background for the pathway over-representation analysis;

·       Provide a different signaling pathways knowledge database for the pathway over-representation analysis;

·       Instead of highlighting the kinases in the summary report, you can leverage the algorithm and replace the Kinome database with one containing elements you wish to search and highlight. For instance, you can search for important phosphatases by providing a phosphatase database structured similarly to the Kinome database.

·       Modify the centrality file used for generating the circular diagram plot to emphasize different hits.

Miscellaneous

·       NOODAI relies mainly on protein-protein and small molecule interaction networks. Certain omics data, such as transcriptomics, can be utilized to construct such a network. However, some omics measurements (such as miRNAs) are not inherently linked to proteins/small molecules, which renders a protein-protein interaction network purposeless for them. To be included, these should be mapped to either a ChEBI ID or a UniProt ID.

·       Small molecule interactions are taken from IntAct. If you want to use another interaction source, you should provide it as a custom-made "Interaction File".

·       If you analyze species other than the pre-loaded ones, you must provide all database files, and you can use the platform only for protein-protein interaction networks (due to ID conversion restrictions from biomaRt).