What is NCBI Remap?
NCBI Remap is a tool that allows users to project annotation data from one coordinate system to another. This remapping (sometimes called 'liftover') uses genomic alignments to project features from one sequence to the other. For each feature on the source sequence, we perform a base-by-base analysis in order to project the feature through the alignment to the new sequence.
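To make the idea of projecting a feature through an alignment concrete, here is a minimal sketch. It is an illustration only, not the Remap implementation: the `Block` structure, the function names, and the restriction to same-strand, ungapped alignment blocks with 1-based coordinates are all assumptions made for this example.

```python
from typing import List, NamedTuple, Optional, Tuple


class Block(NamedTuple):
    """One ungapped, same-strand alignment block (hypothetical structure, 1-based)."""
    src_start: int
    tgt_start: int
    length: int


def project_base(pos: int, blocks: List[Block]) -> Optional[int]:
    """Return the target position for one source base, or None if it is unaligned."""
    for b in blocks:
        if b.src_start <= pos < b.src_start + b.length:
            return b.tgt_start + (pos - b.src_start)
    return None


def project_feature(
    start: int, stop: int, blocks: List[Block]
) -> Tuple[Optional[Tuple[int, int]], float]:
    """Project a source interval base by base.

    Returns the (start, stop) span covered by the projected bases on the
    target (None if nothing projects) and the fraction of source bases
    that could be projected.
    """
    projected = (project_base(i, blocks) for i in range(start, stop + 1))
    mapped = [p for p in projected if p is not None]
    if not mapped:
        return None, 0.0
    fraction = len(mapped) / (stop - start + 1)
    return (min(mapped), max(mapped)), fraction
```

For example, with blocks `[Block(100, 200, 50), Block(150, 260, 50)]` (a 10-base insertion in the target between the two blocks), `project_feature(140, 159, blocks)` returns `((240, 269), 1.0)`: every source base projects, but the target span is longer than the source interval.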
We support three variations of Remap. Assembly-Assembly allows the remapping of features from one assembly to another. RefSeqGene allows for the remapping of features from assembly sequences to RefSeqGene sequences (including transcript and protein sequences annotated on the RefSeqGene) or from RefSeqGene sequences to an assembly. Alt loci remap allows for the mapping of features between the Primary assembly and the alternate loci and patches available for GRC assemblies.
You can view a short video describing how to use remap here: http://www.youtube.com/watch?v=0lhcMGGReVQ
With the November 2012 update, we added the following features:
- Alt locus remap: remap features between the primary assembly and the alternate loci/patches in GRC assemblies.
- Clinical Remap: When you run Clinical Remap, we now make a call to the Variation Reporter and insert the results into the Clinical Remap output.
- Added support for upload of compressed files. Currently GZip (.gz) and BZip2 (.bz) files are supported.
- Improved HGVS nomenclature.
Specifying the data
In order to use the NCBI Remap service, you must select the organism of interest, the assembly your features are on (Source Assembly) and the assembly onto which you wish to project these features (Target Assembly). If you would like to request that additional organisms or assemblies be added to the list, please use the 'Write to the Help Desk' link to make this request.
List of supported assembly-assembly alignments in remap:
|Organism||Source Assembly||Target Assembly||Software version||Last Updated|
|Mus musculus||MGSCv36||MGSCv37||1.4||01/11/2013 19:56:52|
|Mus musculus||MGSCv35||MGSCv37||1.4||01/11/2013 20:32:08|
|Mus musculus||MGSCv34||MGSCv37||1.4||01/11/2013 21:21:15|
|Mus musculus||MGSCv3||MGSCv37||1.4||01/12/2013 00:03:06|
|Mus musculus||Mm_Celera||GRCm38.p1||1.4||01/12/2013 00:56:39|
|Mus musculus||Mm_Celera||GRCm38||1.4||01/12/2013 01:16:43|
|Mus musculus||MGSCv37||GRCm38.p1||1.4||01/12/2013 02:07:12|
|Mus musculus||MGSCv37||GRCm38||1.4||01/12/2013 02:13:58|
|Mus musculus||MGSCv36||GRCm38.p1||1.4||01/12/2013 02:39:37|
|Mus musculus||MGSCv36||GRCm38||1.4||01/12/2013 02:43:22|
|Mus musculus||MGSCv35||GRCm38.p1||1.4||01/12/2013 03:16:50|
|Mus musculus||MGSCv35||GRCm38||1.4||01/12/2013 03:20:07|
|Mus musculus||MGSCv34||GRCm38.p1||1.4||01/12/2013 04:10:15|
|Mus musculus||MGSCv34||GRCm38||1.4||01/12/2013 04:11:25|
|Mus musculus||Mm_Celera||MGSCv37||1.4||01/12/2013 06:58:09|
|Mus musculus||MGSCv3||GRCm38.p1||1.4||01/12/2013 06:58:33|
|Mus musculus||MGSCv3||GRCm38||1.4||01/12/2013 07:03:51|
|Rattus norvegicus||Rn_Celera||Rnor_5.0||1.4||01/12/2013 00:10:32|
|Rattus norvegicus||RGSC_v3.4||Rnor_5.0||1.4||01/12/2013 03:41:51|
|Rattus norvegicus||Rn_Celera||RGSC_v3.4||1.4||03/06/2013 20:29:03|
|Heterocephalus glaber||HetGla_1.0||HetGla_female_1.0||1.4||01/11/2013 20:15:52|
|Vitis vinifera||8x_WGS||12X||1.2||01/25/2012 21:13:29|
|Cucumis sativus||CSB10A_v1||CucSat_1.0||1.4||01/11/2013 18:13:35|
|Oryza sativa Japonica Group||IRGSP_3.0||Build 4.0||1.4||02/14/2013 11:08:40|
|Hydra magnipapillata||h7||Hydra_RP_1.0||1.4||01/25/2013 14:53:32|
|Nomascus leucogenys||Nleu1.0||Nleu_3.0||1.4||02/15/2013 22:33:26|
|Caenorhabditis elegans||WS195||WBcel215||1.4||02/14/2013 10:58:30|
|Nasonia vitripennis||Nvit_1.0||Nvit_2.0||1.4||02/15/2013 22:31:37|
|Apis mellifera||Amel_4.0||Amel_4.5||1.4||01/11/2013 19:40:09|
|Apis mellifera||Amel_2.0||Amel_4.5||1.4||01/11/2013 19:52:31|
|Strongylocentrotus purpuratus||Spur_v2.1||Spur_3.1||1.4||01/11/2013 21:17:15|
|Strongylocentrotus purpuratus||Spur_0.5||Spur_3.1||1.4||01/11/2013 22:48:01|
|Ciona intestinalis||v1.0||KH||1.4||02/15/2013 18:54:35|
|Danio rerio||Zv8||Zv9||1.4||01/11/2013 20:08:07|
|Danio rerio||Zv7||Zv9||1.4||01/11/2013 20:40:45|
|Gallus gallus||Gallus_gallus-2.1||Gallus_gallus-4.0||1.4||01/11/2013 20:21:55|
|Pan troglodytes||Pan_troglodytes-2.1.3||Pan_troglodytes-2.1.4||1.4||01/12/2013 01:12:48|
|Pan troglodytes||Pan_troglodytes-2.1||Pan_troglodytes-2.1.4||1.4||03/06/2013 20:53:01|
|Homo sapiens||NCBI34||NCBI35||1.2||01/25/2012 16:47:09|
|Homo sapiens||CHM1_1.0||GRCh37.p10||1.4||01/11/2013 14:02:15|
|Homo sapiens||NCBI36||GRCh37||1.4||01/11/2013 16:32:47|
|Homo sapiens||NCBI35||GRCh37||1.4||01/11/2013 17:10:16|
|Homo sapiens||NCBI34||GRCh37||1.4||01/11/2013 17:46:16|
|Homo sapiens||HuRef||GRCh37.p10||1.4||01/11/2013 18:55:13|
|Homo sapiens||NCBI33||GRCh37||1.4||01/11/2013 19:06:50|
|Homo sapiens||NCBI36||GRCh37.p10||1.4||01/11/2013 19:47:26|
|Homo sapiens||NCBI35||GRCh37.p10||1.4||01/11/2013 20:24:47|
|Homo sapiens||NCBI34||GRCh37.p10||1.4||01/11/2013 21:01:21|
|Homo sapiens||NCBI33||GRCh37.p10||1.4||01/11/2013 21:40:03|
|Homo sapiens||HuRef||GRCh37||1.4||01/11/2013 21:52:42|
|Homo sapiens||CHM1_1.0||GRCh37||1.4||01/11/2013 23:17:40|
|Homo sapiens||NCBI36||GRCh37.p11||1.4||01/15/2013 18:14:37|
|Homo sapiens||NCBI35||GRCh37.p11||1.4||01/15/2013 18:53:38|
|Homo sapiens||NCBI34||GRCh37.p11||1.4||01/15/2013 19:32:59|
|Homo sapiens||NCBI33||GRCh37.p11||1.4||01/15/2013 20:54:06|
|Homo sapiens||HuRef||GRCh37.p11||1.4||01/15/2013 23:28:03|
|Homo sapiens||CHM1_1.0||GRCh37.p11||1.4||01/16/2013 01:10:08|
|Homo sapiens||CHM1_1.0||HuRef||1.4||01/23/2013 22:04:20|
|Homo sapiens||HuRef||GRCh37.p9||1.2||07/10/2012 17:18:53|
|Homo sapiens||NCBI36||GRCh37.p5||1.2||09/03/2012 15:56:32|
|Homo sapiens||NCBI35||GRCh37.p5||1.2||09/03/2012 16:34:05|
|Homo sapiens||NCBI34||GRCh37.p5||1.2||09/03/2012 17:11:24|
|Homo sapiens||HuRef||GRCh37.p5||1.3||09/12/2012 00:24:09|
|Homo sapiens||Hs_Celera||GRCh37||1.3||09/12/2012 03:53:52|
|Homo sapiens||NCBI36||GRCh37.p9||1.3||09/12/2012 14:11:06|
|Homo sapiens||NCBI35||GRCh37.p9||1.3||09/12/2012 14:47:29|
|Homo sapiens||NCBI34||GRCh37.p9||1.3||09/12/2012 15:23:35|
|Homo sapiens||NCBI35||NCBI36||1.4||02/15/2013 22:08:20|
|Homo sapiens||NCBI34||NCBI36||1.4||02/15/2013 22:37:41|
|Homo sapiens||NCBI33||NCBI36||1.4||02/15/2013 23:09:43|
|Canis lupus familiaris||CanFam2.0||CanFam3.1||1.4||03/06/2013 17:43:30|
|Sus scrofa||Sscrofa10||Sscrofa10.2||1.4||01/11/2013 20:56:22|
|Sus scrofa||Sscrofa9.2||Sscrofa10.2||1.4||01/11/2013 23:10:45|
|Sus scrofa||Sscrofa5||Sscrofa10.2||1.4||01/12/2013 00:41:16|
|Bos taurus||Btau_4.0||Btau_4.6.1||1.4||01/11/2013 17:38:38|
|Bos taurus||Btau_3.1||Btau_4.6.1||1.4||01/11/2013 18:32:47|
|Bos taurus||Btau_4.0||Bos_taurus_UMD_3.1||1.4||02/15/2013 22:24:01|
|Bos taurus||Bos_taurus_UMD_3.1||Btau_4.2||1.4||02/15/2013 22:29:14|
|Bos taurus||Btau_4.0||Btau_4.2||1.4||02/15/2013 23:15:28|
|Bos taurus||Btau_3.1||Btau_4.2||1.4||02/16/2013 00:01:16|
|Bos taurus||UMD Bos_taurus 2.0||Bos_taurus_UMD_3.1||1.4||02/16/2013 00:04:34|
Only human is supported for the RefSeqGene tab, so you only need to select the sequence on which your features are annotated (either an assembly or RefSeqGenes) and the sequences to which you want the features mapped (either RefSeqGenes or an assembly).
Alt loci remap
Alt loci remap allows you to map data between the Primary Assembly and the Alternate Loci/Patches that may be available for an assembly. Only assemblies produced by the Genome Reference Consortium are supported on this page. All you need to select on this page is the organism and the assembly; the software will determine the direction in which you want to map.
NOTE: For both Clinical Remap and Alt loci remap, if you map FROM an assembly to either the RefSeqGenes or the Alternate Loci/Patches, you may have a lot of failed features, as both of these sequence sets cover only a fraction of the genome. To see genome coverage for Alternate Loci/Patches, see the GRC pages for human and mouse.
Some configuration options are available that allow you to adjust the stringency of remapping. These options are only configurable in the Assembly-Assembly tab.
- Minimum ratio of bases that must be remapped (default: 0.5): This option specifies the fraction of the interval that must be able to be remapped. Raising this value increases the stringency of the remapping process.
- Maximum ratio for difference between the source length and the target length (default 2.0): This feature allows the remapping algorithm to tolerate insertions and deletions in the alignment. This is calculated by taking the interval length on the target assembly (stop-start+1) and dividing it by the interval length on the source assembly (stop-start+1). An insertion or deletion in the target assembly will affect this ratio. Lowering this value will increase the stringency of the remapping process.
- Allow multiple locations to be returned (default: on): We perform alignments in two phases (see 'About our alignments'). Selecting this option will allow the 'Second Pass' alignments to be used and improve coordinate projection in regions of duplication. This can also lead to multiple features being remapped to the same location.
- Merge Fragments (default: on): An insertion in the target assembly will split a feature on the source assembly into two locations. Selecting this option will merge these two locations into a single location in the annotation file. Turning this feature off will increase the stringency of the remapping process, specifically in cases where there is an insertion in the target sequence, as each remapped interval will be compared to the original interval.
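The two ratio options above can be sketched as a small check. This is an illustration under stated assumptions, not the Remap code: the function names are hypothetical, the count of remapped bases is taken as an input, and whether Remap treats a value exactly at the threshold as passing is assumed here. The returned labels reuse the failure codes that appear in the mapping report.

```python
from typing import Optional


def interval_length(start: int, stop: int) -> int:
    """Length of a closed interval, as described in the text: stop - start + 1."""
    return stop - start + 1


def check_remap(
    src_start: int,
    src_stop: int,
    tgt_start: int,
    tgt_stop: int,
    bases_remapped: int,
    min_ratio: float = 0.5,         # "Minimum ratio of bases that must be remapped"
    max_length_ratio: float = 2.0,  # "Maximum ratio for difference between ... lengths"
) -> Optional[str]:
    """Return None if the interval passes both thresholds, else a failure label."""
    src_len = interval_length(src_start, src_stop)
    tgt_len = interval_length(tgt_start, tgt_stop)
    # Assumption: values exactly at a threshold pass.
    if bases_remapped / src_len < min_ratio:
        return "LOWCOV"
    if tgt_len / src_len > max_length_ratio:
        return "EXPANDED"
    return None
```

With the defaults, a 100-base source interval that projects to a 250-base target interval has a length ratio of 2.5 and would be rejected; lowering `max_length_ratio` or raising `min_ratio` rejects more intervals, which is what increasing the stringency means here.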
The merge function can help you remap features that cross an assembly gap, or have a large insertion that causes a gap in the alignment.
Example of a feature crossing a gap
Figure 1: A region with a feature that crosses an assembly gap. This feature was successfully remapped because the merge function was on.
However, in regions with messy alignments, the merge function can cause a feature to be remapped to the same, or overlapping, positions. This only happens when using the Second Pass alignments for remapping, as these alignments are not guaranteed to be unique.
Region with complicated alignments in the second pass.
Figure 2: A region with nice First Pass alignments and many Second Pass alignments.
Using the merge function, this feature remaps to six locations in GRCh37, one using the First Pass alignments and five using the Second Pass. These are easily distinguished using the remap report as the 'recip' column specifies whether the first pass or second pass alignments were used.
remap report for feature in region with complicated second pass alignments.
Figure 3: Remap report for feature with multiple locations returned due to complicated second pass alignments.
These features are relatively easy to identify in a post processing step, or you can turn the merge function off. This will, however, negatively affect features that cross a gap. You may need to review the alignments (which you can do using the Genome Workbench project files) to determine the best course of action.
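One possible post-processing step is sketched below. It assumes the downloaded tab-separated mapping report has a header row containing the '#feat_name' and 'recip' columns, and that 'recip' holds the literal strings 'First Pass' and 'Second Pass'; the function names are hypothetical.

```python
import csv
import io
from collections import Counter, defaultdict
from typing import Dict, List


def placements_by_pass(report_text: str) -> Dict[str, Counter]:
    """Count report rows per feature, split by the value of the 'recip' column."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for row in csv.DictReader(io.StringIO(report_text), delimiter="\t"):
        counts[row["#feat_name"]][row["recip"]] += 1
    return counts


def features_needing_review(report_text: str) -> List[str]:
    """Names of features with at least one placement from Second Pass alignments."""
    counts = placements_by_pass(report_text)
    return sorted(name for name, c in counts.items() if c["Second Pass"] > 0)
```

Features returned by `features_needing_review` are the ones whose placements may overlap or duplicate each other, so they are reasonable candidates for manual review of the alignments.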
Note: Alignments are processed in a strand specific manner. If a feature aligns to a region for which there are alignments on both strands, you may get a placement returned for the plus and the minus strand. Using the merge feature may increase the chances of this as merge helps to span alignment gaps. Turning merge off will cause a decrease in remapped features as gaps will not be crossed on either strand.
We accept file formats that are commonly used in the bioinformatics community. We currently accept:
Because the GTF/GFF/GFF3 formats are so similar, we provide a single menu item for these formats.
The default behavior is to provide the remapped annotation file in the same format as the input file, but you can specify a different format for the output.
If you have a small amount of data, you can just copy and paste the data in the large text box labeled 'Paste data here'. Otherwise, you can just upload the data file.
Please note: the larger your file is, the longer it will take to perform the remapping process. If you find that the process is taking a very long time, or failing, you may want to split your files into smaller ones, perhaps based on chromosome assignment. There is also an absolute limit on the amount of RAM available to the system. If this is exceeded, Remap will fail. If this happens try again with a smaller file.
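If you need to split a large file by chromosome, a sketch along the following lines may help. It assumes a line-oriented format such as BED or GFF in which the sequence ID is the first whitespace-delimited column; the function name and the output file naming are hypothetical. Comment, 'track' and 'browser' lines that appear before the first record are copied into every output file, and any that appear later are dropped.

```python
import re
from contextlib import ExitStack
from pathlib import Path
from typing import Dict


def split_by_sequence(in_path, out_dir) -> Dict[str, Path]:
    """Write one file per sequence ID and return a mapping of ID to output path."""
    in_path = Path(in_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    header = []   # comment/track lines seen before the first record
    outputs = {}  # sequence ID -> (path, open file handle)
    with ExitStack() as stack, in_path.open() as fh:
        for line in fh:
            if not line.strip():
                continue  # skip blank lines
            if line.startswith(("#", "track ", "browser ")):
                if not outputs:
                    header.append(line)
                continue
            seq_id = line.split(None, 1)[0]
            if seq_id not in outputs:
                # Replace characters that are awkward in file names.
                safe = re.sub(r"[^A-Za-z0-9._-]", "_", seq_id)
                path = out_dir / f"{safe}{in_path.suffix}"
                handle = stack.enter_context(path.open("w"))
                handle.writelines(header)
                outputs[seq_id] = (path, handle)
            outputs[seq_id][1].write(line)
    return {seq_id: path for seq_id, (path, _) in outputs.items()}
```

The sketch keeps one output file open per sequence ID, which is fine for chromosome-level assemblies but may hit the operating system's open-file limit for inputs with many thousands of scaffolds.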
In addition to the formats described above, you can enter a region in the text box. For example:
Data options (Clinical Remap tab only)
Mapping from a RefSeqGene(s) to an assembly: In this case, an additional option is provided (checked by default). This will allow the service to return features on both the genomic sequences as well as any transcripts (NMs) or proteins (NPs) available at that locus.
Mapping from an assembly to RefSeqGenes: In this case, you have the ability to map to any available RefSeqGene (default) or you can specify a list of RefSeqGenes as targets. If you select to map to any available RefSeqGene there are two additional options for providing locations on transcripts (NMs) or proteins. One is to provide the transcript (NM) and protein (NP) locations for features that map to RefSeqGenes and the other is to provide transcript (NM) and protein (NP) locations even if there isn't a RefSeqGene where your feature maps. Not all genes in the genome have a RefSeqGene. There is a link on the page that allows you to request the construction of a RefSeqGene if one is not available for your gene of interest.
Summary Data: This is a global report to provide an overview of remapping results. The format of the report is (by column):
- ID: The sequence ID in the source assembly (often something like 'chr1' or NC_000001.9).
- Source Features: The number of features on the ID in the source file.
- Remapped Features: The number of features that could be projected onto the Target assembly.
- Source Intervals: The number of intervals on the ID in the source file. This can be larger than the number of features because some features have more than one sequence interval; for example, mRNA features will often have multiple intervals (corresponding to exons).
- Remapped Intervals: The number of intervals that could be projected onto the Target assembly.
The summary data appears on the web page and is available for download.
Mapping Report: This is a report that provides a feature by feature breakdown of the remapping status. The format of this report on the web page is (by column):
- Feature: The name or ID of the feature (the source of this will depend on the format submitted, but it should be possible to robustly associate the information in this column with the data in the input file).
- Src. Intervals: Number of intervals the feature has in the source file.
- Remap Intervals: Number of intervals that were projected to the target assembly.
- Src location: The feature location in the input file.
- Src length: The length of the feature in the input file.
- Map Location: Projected location (or reason that the remap failed) on the target assembly.
- Map length: Length of the feature on the target assembly.
- Coverage: Coverage of feature on the target assembly.
Only a few lines of this report are displayed on the web page, but the entire report is available for download as a tab-separated file (tsv) that can be easily parsed or loaded into a spreadsheet program. The downloaded report has 18 columns as follows:
- #feat_name: user supplied feature name. If no feature name is supplied, a name is calculated using the line number in the file or the location.
- source_int: The number of intervals in the source file (useful for tracking features with multiple intervals, like genes).
- mapped_int: the number of intervals in the remapped file.
- source_id: sequence identifier the feature maps to in the source file.
- mapped_id: sequence identifier the features maps to on the target assembly.
- source_length: length of the feature on the source assembly.
- mapped_length: length of the feature on the target assembly.
- source_start: first base of the feature on the source assembly.
- source_stop: last base of the feature on the source assembly.
- source_strand: strand the feature is annotated on in the source assembly.
- source_sub_start: first base of sub interval on the source assembly (e.g. an exon feature).
- source_sub_stop: last base of sub interval on the source assembly (e.g. an exon feature).
- mapped_start: first base of remapped interval.
- mapped_stop: last base of remapped interval.
- mapped_strand: strand of remapped base.
- coverage: This is calculated by taking the ratio of the mapped_length to the source_length. If coverage = 1, the remapped and source intervals have the same length. A coverage score of less than 1 indicates a deletion in the target assembly and a score of greater than 1 indicates an insertion in the target assembly.
- recip: Two possible values are in this column. First Pass means the remapping is based on the 'First Pass' or reciprocal best hit alignments. 'Second Pass' means the remapping is based on the non-reciprocal best hit alignments.
- asm_unit: The assembly unit to which the mapped_id belongs. For more information on assembly units, see: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/model.shtml
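As a sketch of how the downloaded report might be read, the following recomputes coverage from the 'source_length' and 'mapped_length' columns and labels each row according to the interpretation given above. It assumes the file has a header row with the column names listed here; the function names and label strings are hypothetical, and rows without numeric lengths (for example, unmapped features) are skipped.

```python
import csv
import io
from typing import List, Tuple


def classify_coverage(coverage: float, tolerance: float = 1e-9) -> str:
    """Interpret a coverage value (mapped_length / source_length)."""
    if abs(coverage - 1.0) <= tolerance:
        return "same length"
    return "deletion in target" if coverage < 1.0 else "insertion in target"


def coverage_table(report_text: str) -> List[Tuple[str, float, str]]:
    """Return (feature name, coverage, label) for rows with numeric lengths."""
    results = []
    for row in csv.DictReader(io.StringIO(report_text), delimiter="\t"):
        try:
            src_len = int(row["source_length"])
            tgt_len = int(row["mapped_length"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or non-numeric length
        if src_len <= 0:
            continue
        coverage = tgt_len / src_len
        results.append((row["#feat_name"], coverage, classify_coverage(coverage)))
    return results
```

Because `csv.DictReader` looks columns up by name, the sketch does not depend on the order of the columns in the file.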
Features that don't remap will have the word 'NOMAP' in column 15 and the reason for not mapping in column 16. The reasons are:
- NOALIGN: There was no alignment for this region.
- LOWCOV: The percent of the interval covered in the alignment was below the coverage threshold specified in the 'Remapping Options' (Minimum ratio of bases that must be remapped).
- EXPANDED: The ratio of the length on the target sequence versus the length on the source sequence is greater than specified in the remap options (default is 2).
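A quick way to see why features failed is to tally the reason codes in the downloaded report. The sketch below deliberately does not rely on fixed column positions: it treats any row containing a 'NOMAP' field as a failure and looks for one of the three reason codes anywhere in that row. The function name and the 'UNKNOWN' fallback label are assumptions made for this example.

```python
import csv
import io
from collections import Counter

# Reason codes listed in the report documentation.
FAILURE_REASONS = ("NOALIGN", "LOWCOV", "EXPANDED")


def tally_failures(report_text: str) -> Counter:
    """Count unmapped rows in a tab-separated mapping report by failure reason."""
    tally: Counter = Counter()
    for fields in csv.reader(io.StringIO(report_text), delimiter="\t"):
        if "NOMAP" not in fields:
            continue  # header row or successfully remapped row
        reason = next((f for f in fields if f in FAILURE_REASONS), "UNKNOWN")
        tally[reason] += 1
    return tally
```

A high LOWCOV or EXPANDED count suggests relaxing the corresponding remapping option, whereas NOALIGN failures cannot be recovered by changing the options because there is no alignment for the region.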
Clinical Remap Only Output:
When you run Clinical Remap, we will make a call to the Variation Reporter to provide an analysis of your variant data. We then insert the report produced by the Variation Reporter into the Remap output. For more information on the Variation Reporter report, see the help page.
Annotation Data: This file contains only the remapped features, in the format specified on the input page. No sample data is shown on the web page, but the file is available for download and display in your favorite viewer.
Genome Workbench Files: These are files that can be loaded directly into our client-side viewer, Genome Workbench. They contain the sequence information for both the source and target assemblies, the assembly-assembly alignments used in the remapping, and feature annotations (both the source features and the remapped features). These files are available for download and are very useful for understanding how the alignments influenced the feature remapping (see Figure 4).
Example of a GBench file produced by Remap
Figure 4: View of remapping in Genome Workbench. The sequence being shown in this view is the Target assembly. The tracks are (in order from the top):
- Ruler: showing basepair coordinates.
- Sequence: for some organisms this will be colored and for others it will be grey. This track will show you the actual base pairs if you zoom in enough.
- Tiling Path: Shows the INSDC sequences used to construct the sequence.
- Genes Track: Gene annotation from NCBI annotation process.
- Alignments: Alignment to the Source assembly. This will have the 'First Pass' alignments and the 'Second Pass' alignments if the 'Allow duplications' option was checked. The alignments are zoomed to the base pair level. Mismatches are colored in red. Insertions are shown using a blue triangle (none in this view).
- SNP features: Variation features defined by dbSNP.
- Remapped features: Only the remapped features are shown on the target assembly. In this example, features from dbVar were mapped from NCBI36 to GRCh37.p9. If you open a sequence that is part of the Source assembly, you can see the original features.