PIKE - Quick Tutorial

PIKE is a bioinformatics web tool that queries a set of protein databases and systematically extracts and reports, in real time, non-redundant functional and biological information about an initial set of proteins.

The relevant information is reported to the user in a wide range of standard output formats that can be viewed, exported or downloaded.


Introduction

One of the main goals in proteomics is to solve biological and molecular questions regarding a set of identified proteins. In order to achieve this goal, one has to extract and collect the existing biological data from public repositories for every protein and afterwards, analyze and organize the collected data. Due to the complexity of this task and the huge amount of data available, it is not possible to gather this information by hand, making it necessary to find automatic methods of data collection.

Within a proteomics context, we have developed PIKE (Protein Information and Knowledge Extractor), which solves this problem by automatically accessing several public information systems and databases across the Internet. The PIKE bioinformatics tool starts with a set of identified proteins, listed by their accession codes from the most common protein databases, and retrieves all relevant, up-to-date information from the most relevant databases.

Once the search is complete, PIKE summarizes the information for every single protein in several file formats that allow it to be shared and exchanged with other software tools. It is our opinion that PIKE represents a great step forward for information procurement and drastically reduces manual database validation for large proteomic studies.


How to use PIKE?

PIKE starts from a file containing a list of protein accession numbers (UniprotKB, EBI IPI, NCBI nr, GenBank, EMBL, PDB, DDBJ, or Entrez Gene ID). This file can be a text file (one row for each protein accession code, see figure below) or an EBI PRIDE file.




Figure 1: PIKE input file: one protein entry per row. Optionally, the protein name can be attached to each protein code.
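
As an illustration of the text input format, the following minimal Python sketch reads such a file. It assumes that the accession code is the first whitespace-delimited token on each row and that anything after it is the optional protein name; the delimiter PIKE actually expects is not specified in this tutorial. `parse_protein_list` is a hypothetical helper and not part of PIKE, and the accession codes are only illustrative.

```python
from typing import List, Optional, Tuple


def parse_protein_list(text: str) -> List[Tuple[str, Optional[str]]]:
    """Parse one protein entry per row: an accession code, optionally followed by a name.

    Assumption: the code and the name are separated by whitespace (space or tab).
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank rows
        parts = line.split(None, 1)  # split on the first run of whitespace only
        accession = parts[0]
        name = parts[1].strip() if len(parts) > 1 else None
        entries.append((accession, name))
    return entries


# Illustrative input: two rows with a name attached, one row with the code only.
example = (
    "P04637 Cellular tumor antigen p53\n"
    "P38398\n"
    "\n"
    "Q9Y6K9\tNF-kappa-B essential modulator\n"
)
entries = parse_protein_list(example)
for accession, name in entries:
    print(accession, "-", name or "(no name given)")
```

Keeping the name optional mirrors the figure caption: a row with only an accession code is still a valid entry.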


Query submission to PIKE is as follows:

1.- Enter a user name and e-mail address. This information is required so that the results of a search can be returned by e-mail.
2.- Specify the database used to obtain the protein list: UniprotKB, EBI IPI, NCBI nr, GenBank, EMBL, PDB, DDBJ, Entrez Gene ID.
3.- Select one, several or all of the following fields of interest, according to the objectives of the experiment (table 1).
4.- Select the specific level of representation for Gene Ontology clustering.
5.- Upload the input protein list. Accepted formats are flat text files or PRIDE XML.
6.- Click on the Run button.
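
The six steps above can be pictured as a single set of query parameters that is checked before submission. The sketch below is only a hypothetical model of the web form: `PikeQuery`, its attribute names and the checks in `problems` are assumptions made for illustration, and representing the Gene Ontology level as an integer is a guess. It does not call PIKE or any real API.

```python
from dataclasses import dataclass
from typing import List

# Source databases listed in step 2 of the tutorial.
SOURCE_DATABASES = {
    "UniprotKB", "EBI IPI", "NCBI nr", "GenBank",
    "EMBL", "PDB", "DDBJ", "Entrez Gene ID",
}


@dataclass
class PikeQuery:
    """Hypothetical container for the values entered in the submission form."""
    user_name: str          # step 1
    email: str              # step 1
    source_database: str    # step 2
    fields: List[str]       # step 3, e.g. "Keywords", "Pathways"
    go_level: int           # step 4 (assumed to be an integer level)
    input_file: str         # step 5, path to a text or PRIDE XML file

    def problems(self) -> List[str]:
        """Return a list of human-readable problems; empty if the query looks complete."""
        issues = []
        if not self.user_name.strip():
            issues.append("user name is required")
        if "@" not in self.email:
            issues.append("e-mail address looks invalid")
        if self.source_database not in SOURCE_DATABASES:
            issues.append("unknown source database")
        if not self.fields:
            issues.append("select at least one field of interest")
        if not self.input_file:
            issues.append("no input protein list given")
        return issues


query = PikeQuery("Ada", "ada@example.org", "UniprotKB", ["Keywords"], 3, "proteins.txt")
print(query.problems() or "ready to run")
```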

Resources | URL | Comment
Protein Features | http://www.uniprot.org/ | Protein Name, Gene Names, Protein Sequence, Alternative IDs.
Protein Annotations | http://www.uniprot.org/ | Function, Subcellular location, Tissue Specificity, Diseases.
Keywords | http://www.uniprot.org/ | Comprehensive list of terms that summarises the content of an entry
Taxonomy | http://www.ncbi.nlm.nih.gov/Taxonomy/ | NCBI unique identifier for the source organism
Gene Ontology: AmiGO | http://amigo.geneontology.org/ | Search engine of the GO database
Pathways: KEGG | http://www.genome.jp/kegg/pathway.html | Collection of manually drawn pathway maps
Protein Interactions: STRING | http://string-db.org/ | Database of known and predicted protein interactions
Protein Interactions: IntAct | http://www.ebi.ac.uk/intact/main.xhtml | Database system and analysis tools for protein interaction data
Repositories: HPRD | http://www.hprd.org/ | Centralized platform to visually depict and integrate information regarding the human proteome.
Repositories: PRIDE | http://www.ebi.ac.uk/pride/ | Largest standards-compliant, public data repository for proteomics experimental data.
PTMs: Phosphosite | http://www.phosphosite.org | Phosphorylation site database.
Families and Domains: InterPro | http://www.ebi.ac.uk/interpro/ | Database of predictive protein "signatures" used for the classification and automatic annotation of proteins and genomes
Other databases: PharmGKB | http://www.pharmgkb.org | Pharmacogenetics and Pharmacogenomics Knowledge Base


Table 1: List of URLs of well-known proteomics resources and tools linked by PIKE.

Given the input (protein list and parameters), the system first determines where the required information can be obtained. Once the system knows the appropriate repository (or repositories), PIKE resolves the cross-references between protein identifiers across the different databases.
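
The first of these two steps, deciding which repositories are needed for the selected fields, can be sketched as a simple lookup built from Table 1. This is an illustration only: `FIELD_SOURCES` and `repositories_for` are hypothetical names, only part of the table is reproduced, and how PIKE actually performs this selection is not described in this tutorial.

```python
from typing import Dict, List

# Field of interest -> resource URLs, copied from Table 1 (subset).
FIELD_SOURCES: Dict[str, List[str]] = {
    "Protein Features": ["http://www.uniprot.org/"],
    "Protein Annotations": ["http://www.uniprot.org/"],
    "Keywords": ["http://www.uniprot.org/"],
    "Taxonomy": ["http://www.ncbi.nlm.nih.gov/Taxonomy/"],
    "Gene Ontology": ["http://amigo.geneontology.org/"],
    "Pathways": ["http://www.genome.jp/kegg/pathway.html"],
    "Protein Interactions": [
        "http://string-db.org/",
        "http://www.ebi.ac.uk/intact/main.xhtml",
    ],
}


def repositories_for(selected_fields: List[str]) -> List[str]:
    """Return the resource URLs needed for the selected fields.

    Duplicates are removed and the first-seen order is kept, so a resource
    shared by several fields is listed only once.
    """
    needed: List[str] = []
    for field in selected_fields:
        for url in FIELD_SOURCES[field]:
            if url not in needed:
                needed.append(url)
    return needed


print(repositories_for(["Protein Features", "Keywords", "Pathways"]))
```

Because several fields share the same resource, selecting both Protein Features and Keywords yields a single UniProt entry in the result.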



PIKE output

Once the analysis has been initiated, the input file is checked for correctness before it is uploaded to the server. Next, PIKE selects the execution mode based on the size of the protein list: in real-time mode the user monitors the retrieved information on the fly, whereas in background mode the results are sent by e-mail. Regardless of the execution mode, a default output (an XHTML web page) is reported (figure 2), displaying the following elements:

Results visualization, export and filtering form: Several output formats are available (figure 2A):

1) XHTML or screen display: The information is shown as a valid XHTML page. The first option, Table View, lists the proteins in a single table where each protein is a row and each field of interest is a column. The second option, List View, displays each protein as a table of its own, with each field of interest as an individual row.

2) Files: The results containing the required information are compressed and downloaded as a .zip file. The results are exported according to the formats defined in the FMM module: PRIDE XML (figure 2B), CSV or plain text.

3) Images: Both JPEG and SVG formats are available for each category according to Gene Ontology classification (figure 2C).

4) Clusters: A single CSV file is created for each option: 1) SwissProt Keywords, 2) OMIM terms, 3) KEGG pathways and 4) Gene Ontology classification (BP, MF and CC).

5) Protein Enrichment: Links to the DAVID web API to obtain a complete statistical analysis of the input protein set.
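
The difference between the Table View and the List View of the screen display can be illustrated with a small Python sketch. The record layout (a dictionary per protein with an `Accession` key), the function names and the sample values are assumptions made for this example; they do not describe how PIKE builds its XHTML pages.

```python
from typing import Dict, List

Record = Dict[str, str]


def table_view(records: List[Record], fields: List[str]) -> List[List[str]]:
    """One table for all proteins: one row per protein, one column per field."""
    header = ["Accession"] + fields
    rows = [[r["Accession"]] + [r.get(f, "") for f in fields] for r in records]
    return [header] + rows


def list_view(records: List[Record], fields: List[str]) -> List[List[List[str]]]:
    """One small table per protein: one (field, value) row per field."""
    return [
        [["Accession", r["Accession"]]] + [[f, r.get(f, "")] for f in fields]
        for r in records
    ]


# Illustrative records; the second one lacks a value for "Gene Names".
records = [
    {"Accession": "P04637", "Gene Names": "TP53", "Taxonomy": "9606"},
    {"Accession": "P38398", "Taxonomy": "9606"},
]
fields = ["Gene Names", "Taxonomy"]

for row in table_view(records, fields):
    print(" | ".join(row))
for protein_table in list_view(records, fields):
    print(protein_table)
```

Both views carry the same information; the Table View is easier to scan across proteins, while the List View keeps everything about one protein together.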




Figure 2: PIKE output formats. A) XHTML or screen display: The information is shown as a valid XHTML page. It contains the following elements: 1) Output selection control, 2) Most frequent terms regarding the PIKE search, 3) Filtering options and 4) Protein detailed information table. B) PRIDE-XML output: The information retrieved by PIKE is annotated using either controlled vocabulary (CV) terms, which are semantically validated, or user parameters (UP). C) Graphical Gene Ontology classification. Both JPG and SVG formats are available.



Exporting results

Regarding export formats, one of the advantages of PIKE is the possibility to integrate with existing laboratory workflows implementing international standards for data representation such as PRIDE XML or mzData. EMBL-EBI's PRIDE (PRoteomics IDEntifications) database is a valuable public repository for storing protein and peptide identifications, and the evidence supporting these identifications. It is therefore easy to extend existing laboratory workflows by connecting the experimental information stored in PRIDE files with the information available in public databases gathered by PIKE.

Moreover, the inclusion of controlled vocabularies from PRIDE XML ensures compatibility with other tools and databases. Analogous to the PRIDE schema, PIKE divides the retrieved information into two groups: the first can be validated using specific controlled vocabularies (CV) or ontologies such as Gene Ontology, IntAct, KEGG or OMIM terms, while the second is free information with no control over the terms, as in the case of the SwissProt Comment field. This feature makes it possible to validate PIKE output semantically and so assure information accuracy. Once PIKE has added the biological information, the generated EBI PRIDE XML file can be submitted again to the PRIDE central repository.
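
The two-group split described above can be sketched in a few lines of Python. This is a simplified illustration, not the PRIDE XML schema or PIKE's own code: `CV_SOURCES` lists only some of the vocabularies named in the text, annotations are modelled as hypothetical `(source, value)` pairs, and the sample identifiers are illustrative.

```python
from typing import List, Tuple

Annotation = Tuple[str, str]  # (source, value)

# Subset of the controlled vocabularies / ontologies named in the text.
CV_SOURCES = {"Gene Ontology", "IntAct", "KEGG"}


def split_annotations(
    annotations: List[Annotation],
) -> Tuple[List[Annotation], List[Annotation]]:
    """Separate annotations backed by a controlled vocabulary from free information."""
    cv_terms: List[Annotation] = []
    free_info: List[Annotation] = []
    for source, value in annotations:
        if source in CV_SOURCES:
            cv_terms.append((source, value))   # can be validated against the CV
        else:
            free_info.append((source, value))  # free text, no term control
    return cv_terms, free_info


annotations = [
    ("Gene Ontology", "GO:0006915"),
    ("SwissProt Comment", "Free-text description of the protein function."),
    ("KEGG", "hsa04115"),
]
cv_terms, free_info = split_annotations(annotations)
print("CV terms:", cv_terms)
print("Free information:", free_info)
```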




Figure 3: Integration with third-party tools such as ProteinScape has been demonstrated. PIKE may be linked from any bioinformatics platform in order to enrich the information derived from an experiment.




Conclusions

PIKE is a suitable and useful bioinformatics tool for small- or large-scale proteomics projects. Its main characteristic is the ability to systematically access and automatically retrieve comprehensive biological information contained in common databases. PIKE takes the protein lists obtained in protein identification experiments and correlates them with user-defined information from databases such as NCBI nr, IPI or Uniprot/SwissProt. The resulting information is output in a wide range of standard formats that can be directly viewed, exported, or downloaded for additional analysis.

