HGNC BioMart help

The HGNC BioMart application allows users to create customised data tables without the need for any programming knowledge by interacting with a form to filter the data and select the columns/attributes they want within the table. This page details how to interact with the BioMart Mart form and provides definitions of the filters and attributes.

Contents

BioMart overview & project
HGNC Marts
Example of how to use BioMart
BioMart RESTful service

BioMart overview & project

BioMart is a generic data management system which offers a range of advanced query interfaces and administration tools.

The system comes with built-in support for query optimisation and database federation. BioMart lets users run fast, powerful queries through web, graphical, or text-based applications, or programmatically through web services or software libraries written in Perl and Java. For data providers, the system simplifies the task of integrating their own data with other datasets hosted on the network.

All the software, including an easy-to-install BioMart website, is available for local installation. BioMart software is completely open source, licensed under the LGPL, and freely available to anyone without restrictions.

For more information about the BioMart project and to download the code visit the BioMart site.

HGNC Marts

The HGNC BioMart homepage provides a list of the HGNC Marts that are available to use. Clicking on a Mart name takes the user to the mart form for that dataset. So far there are two marts to choose from: a gene mart for gene symbol centric data and a family mart for gene family centric data.

All the mart forms share the same template, which is split into three parts: Datasets, Filters and Attributes.

Datasets

The Datasets part of the mart form is where the user selects the database and the dataset they would like to query and download. The HGNC has only one database, so the database dropdown can be ignored. A user who entered the site via the HGNC BioMart homepage will not have to change the dataset. However, a user who wants to download data from another dataset can select it using the "Datasets" dropdown box, which will change the form. As already mentioned, there are two datasets to choose from so far: the gene dataset and the family dataset.

Filters

The filters section is an area for the user to filter the data by the provided fields. There are several types of filter for the user to interact with, the most common type being the text input filter. The filters are split into subsections, according to the type of field/data they filter. Filters are not required for a BioMart search. If a user wants to select attributes for all the data in the dataset they should ignore this section of the form.

Text input filters: Text input filters usually accept the wildcard "%" symbol, which makes BioMart search the field for data that is like the filter query.
Select box filters: Select box filters are easy to use: the user clicks on the filter and selects the value to filter by. By default the filter will say "-- Select --", and if it is left like this BioMart will ignore the filter.
Multiple select filters: Multiple select filters are scroll boxes that list many values. To filter by a particular value the user can click on that value. To filter on several values, Windows users should hold down the Control (Ctrl) key while clicking on each further value. Mac users need to hold down the Command (Cmd) key instead.
Bulk upload filters: Our Mart forms also have bulk upload filters. The user first selects the field they would like to query multiple times by choosing a value within the drop down select box. The user can then place their values within the text area box or click the "upload file" link to select a file which contains the query values. All of the values have to be of the type selected within the drop down (i.e. a user cannot provide a file or type in values that contain mixed ID/symbol/accession types).
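As a rough illustration of how a "%" wildcard behaves in a text input filter, the sketch below mimics "like" matching in a small shell function. It is not BioMart code: the like_match function is hypothetical, it assumes that "%" stands for any run of characters (including none), and it is case-sensitive, which may differ from how BioMart treats case.

```shell
# like_match VALUE PATTERN
# Succeeds when VALUE matches PATTERN, where "%" stands for any run of
# characters (including none).
# Assumption: patterns hold only letters, digits, "-" and "%", so no other
# shell glob characters need escaping.
like_match() {
  # Translate the "%" wildcard into the shell's "*" wildcard.
  glob=$(printf '%s' "$2" | sed 's/%/*/g')
  case "$1" in
    $glob) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the symbols from a small example list that are like "BRCA%".
for symbol in BRCA1 BRCA2 BRAF TP53; do
  if like_match "$symbol" 'BRCA%'; then
    echo "$symbol"
  fi
done
```

With the example list above, only BRCA1 and BRCA2 are printed. A pattern without "%" only matches a value that is identical to it.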

Gene filters

HGNC data filter

Approved symbol: The official gene symbol that has been approved by the HGNC and is publicly available. Symbols are approved based on specific HGNC nomenclature guidelines. In the HTML results page this ID links to the HGNC Symbol Report for that gene.

Approved name: The official gene name that has been approved by the HGNC and is publicly available. Names are approved based on specific HGNC nomenclature guidelines.

Alias gene symbol: Other symbols used to refer to this gene.

Alias name: Other names used to refer to this gene.

Previous HGNC symbol: Symbols previously approved by the HGNC for this gene.

Previous HGNC name: Gene names previously approved by the HGNC for this gene.

Filter by genes...: This filter allows the user to remove rows from the results table for genes that do not have a value within a selected field.

Status

Indicates whether the gene is classified as:

Approved - these genes have HGNC-approved gene symbols
Entry withdrawn - these previously approved genes are no longer thought to exist
Symbol withdrawn - a previously approved record that has since been merged into another record

Locus group

Groups locus types together into related sets. Below is a list of groups and the locus types within the group:

protein-coding gene - contains the "gene with protein product" locus type
non-coding RNA - contains the following locus types:
- RNA, Y
- RNA, cluster
- RNA, long non-coding
- RNA, micro
- RNA, misc
- RNA, ribosomal
- RNA, small cytoplasmic
- RNA, small nuclear
- RNA, small nucleolar
- RNA, transfer
- RNA, vault
pseudogene - contains the following types:
- immunoglobulin pseudogene
- pseudogene
- T cell receptor pseudogene
phenotype - contains the "phenotype only" locus type
other - contains the following types:
- endogenous retrovirus
- fragile site
- immunoglobulin gene
- protocadherin
- readthrough
- region
- T cell receptor gene
- transposable element
- unknown
- virus integration site
withdrawn - contains the "withdrawn" locus type only
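For users who post-process a downloaded table, the grouping above can be written as a simple lookup. The sketch below is only an illustration: the locus_group function name is hypothetical and the mapping is copied from the list above, so any locus type that the list does not place in a group is reported as unrecognised.

```shell
# locus_group LOCUS_TYPE
# Prints the locus group for a locus type, following the list above.
# Returns 1 for any value that the list does not assign to a group.
locus_group() {
  case "$1" in
    "gene with protein product")
      echo "protein-coding gene" ;;
    "RNA, Y"|"RNA, cluster"|"RNA, long non-coding"|"RNA, micro"|"RNA, misc"|\
    "RNA, ribosomal"|"RNA, small cytoplasmic"|"RNA, small nuclear"|\
    "RNA, small nucleolar"|"RNA, transfer"|"RNA, vault")
      echo "non-coding RNA" ;;
    "immunoglobulin pseudogene"|"pseudogene"|"T cell receptor pseudogene")
      echo "pseudogene" ;;
    "phenotype only")
      echo "phenotype" ;;
    "endogenous retrovirus"|"fragile site"|"immunoglobulin gene"|\
    "protocadherin"|"readthrough"|"region"|"T cell receptor gene"|\
    "transposable element"|"unknown"|"virus integration site")
      echo "other" ;;
    "withdrawn")
      echo "withdrawn" ;;
    *)
      echo "unrecognised locus type: $1" >&2
      return 1 ;;
  esac
}

locus_group "RNA, micro"      # prints: non-coding RNA
locus_group "fragile site"    # prints: other
```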

Locus type

Specifies the type of locus described by the given entry:

gene with protein product - protein-coding genes (the protein may be predicted and of unknown function) (SO:0001217)
RNA, Y - non-protein coding genes that encode Y RNAs (SO:0000405)
RNA, cluster - region containing a cluster of small non-coding RNA genes
RNA, long non-coding - non-protein coding genes that encode long non-coding RNAs (lncRNAs) (SO:0001877); these are at least 200 nt in length. Subtypes include intergenic (SO:0001463), intronic (SO:0001903) and antisense (SO:0001904).
RNA, micro - non-protein coding genes that encode microRNAs (miRNAs) (SO:0001265)
RNA, misc - non-protein coding genes that encode miscellaneous types of small ncRNAs
RNA, ribosomal - non-protein coding genes that encode ribosomal RNAs (rRNAs) (SO:0001637)
RNA, small cytoplasmic - non-protein coding genes that encode small cytoplasmic RNAs (scRNAs) (SO:0001266)
RNA, small nuclear - non-protein coding genes that encode small nuclear RNAs (snRNAs) (SO:0001268)
RNA, small nucleolar - non-protein coding genes that encode small nucleolar RNAs (snoRNAs) containing C/D or H/ACA box domains (SO:0001267)
RNA, transfer - non-protein coding genes that encode transfer RNAs (tRNAs) (SO:0001272)
RNA, vault - non-protein coding genes that encode vault RNAs (SO:0000404)
phenotype only - mapped phenotypes where the causative gene has not been identified (SO:0001500)
T cell receptor pseudogene - T cell receptor gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame
immunoglobulin pseudogene - immunoglobulin gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame
pseudogene - genomic DNA sequences that are similar to protein-coding genes but do not encode a functional protein (SO:0000336)
T cell receptor gene - gene segments that undergo somatic recombination to form either alpha, beta, gamma or delta chain T cell receptor genes (SO:0000460). Also includes T cell receptor gene segments with open reading frames that either cannot undergo somatic recombination, or encode a peptide that is not predicted to fold correctly; these are identified by inclusion of the term “non-functional” in the gene name.
complex locus constituent - transcriptional unit that is part of a named complex locus
endogenous retrovirus - integrated retroviral elements that are transmitted through the germline (SO:0000100)
fragile site - a heritable locus on a chromosome that is prone to DNA breakage
immunoglobulin gene - gene segments that undergo somatic recombination to form heavy or light chain immunoglobulin genes (SO:0000460). Also includes immunoglobulin gene segments with open reading frames that either cannot undergo somatic recombination, or encode a peptide that is not predicted to fold correctly; these are identified by inclusion of the term “non-functional” in the gene name.
protocadherin - gene segments that constitute the three clustered protocadherins (alpha, beta and gamma)
readthrough - a naturally occurring transcript containing coding sequence from two or more genes that can also be transcribed individually
region - extents of genomic sequence that contain one or more genes, also applied to non-gene areas that do not fall into other types
transposable element - a segment of repetitive DNA that can move, or retrotranspose, to new sites within the genome (SO:0000101)
unknown - entries where the locus type is currently unknown
virus integration site - target sequence for the integration of viral DNA into the genome

Chromosome: The chromosome where the gene can be found.

Bulk upload filter

Filter by ID, accession or symbol

This field allows the user to provide multiple query values to bulk search BioMart. The list of values must all be of the type selected using the drop down box. Values can be typed/pasted into the text area or uploaded within a file by clicking on the "upload file" link. The types accepted in this filter are as follows:

HGNC ID(s) - A unique ID provided by the HGNC for each gene with an approved symbol. IDs are of the format HGNC:n, where n is a unique number.
Approved symbols - The official gene symbol that has been approved by the HGNC.
Alias gene symbols - Other symbols used to refer to the gene.
Previous HGNC symbols - Symbols previously approved by the HGNC for the gene.
CCDS accessions - The Consensus CDS (CCDS) accession.
INSDC (ENA/GenBank/DDBJ) accessions - INSDC nucleotide sequence accession numbers.
Ensembl gene ID(s) - The ID for an Ensembl gene entry.
Mouse genome informatics (MGI) ID(s) - Mouse Genome Informatics ID for a mouse homolog of human genes.
NCBI Gene ID(s) - The ID for a gene record within NCBI Gene.
OMIM ID(s) - Identifier from the Online Mendelian Inheritance in Man (OMIM).
Orphanet ID(s) - The Orphanet ID identifies a gene within Orphanet and the rare diseases that are associated with the gene.
Pseudogene.org ID(s) - An ID for a pseudogene entry/sequence within the Pseudogene.org database.
RefSeq accessions - The Reference Sequence (RefSeq) identifier.
Rat Genome Database (RGD) ID(s) - Rat Genome Database ID for a rat homolog of human genes.
UniProt accessions - The UniProt identifier for a protein product of a gene.
Vega gene ID(s) - The Vega gene ID.
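Because every value in a bulk upload must be of the selected type, it can help to check a file before uploading it. The sketch below does this for the HGNC ID type, using the HGNC:n format described above. It is an illustration only: the invalid_hgnc_ids function is hypothetical, the check is strict about the upper-case "HGNC:" prefix, and it assumes the file holds one value per line, which this page does not specify.

```shell
# invalid_hgnc_ids FILE
# Prints every line of FILE that is not of the form HGNC:n, where n is a number.
# Assumption: the file holds one value per line.
# Exit status is 0 if at least one invalid line was found, 1 if none were.
invalid_hgnc_ids() {
  grep -v -E '^HGNC:[0-9]+$' "$1"
}

# Build a small example file that mixes HGNC IDs with a gene symbol.
ids_file=$(mktemp)
printf 'HGNC:5\nHGNC:1100\nBRCA2\n' > "$ids_file"

# Report the values that would need fixing before upload.
if invalid_hgnc_ids "$ids_file"; then
  echo "The values listed above are not HGNC IDs" >&2
fi
rm -f "$ids_file"
```

For the example file the function prints BRCA2, the only line that is not an HGNC ID.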

Family filters

HGNC data filter

Family name: The name given/chosen by the HGNC for the family.

Family alias: Alternative names that are also used to describe the gene family.

Root gene symbol: The root/stem symbol that is common to most of the genes belonging to the gene family.

Bulk upload filter

Filter by IDs or symbols: This field allows the user to provide multiple family IDs, HGNC (gene) IDs and approved gene symbols to BioMart to search. The list of values must all be of the type selected using the drop down box. Values can be typed/pasted into the text area or uploaded within a file by clicking on the "upload file" link.

Attributes

The Attributes section of the form is where the user selects what they want displayed within their table for download. BioMart requires at least one attribute to be selected. On both the gene and family marts some of the key attributes are selected by default; however, the user can deselect these defaults. The section is divided into subsections that group similar attribute fields together. To select or deselect an attribute the user should click on the check box next to the attribute's label. Alternatively, the user can select or deselect all the attributes within a subsection by clicking on the links labelled "select all" and "select none".

Gene attributes

HGNC data

HGNC ID: A unique ID provided by the HGNC for each gene with an approved symbol. IDs are of the format HGNC:n, where n is a unique number.

Status

Indicates whether the gene is classified as:

Approved - these genes have HGNC-approved gene symbols
Entry withdrawn - these previously approved genes are no longer thought to exist
Symbol withdrawn - a previously approved record that has since been merged into another record

Approved symbol: The official gene symbol that has been approved by the HGNC and is publicly available. Symbols are approved based on specific HGNC nomenclature guidelines.

Approved name: The official gene name that has been approved by the HGNC and is publicly available. Names are approved based on specific HGNC nomenclature guidelines.

Alias symbol: Other symbols used to refer to the gene.

Alias name: Other names used to refer to the gene.

Previous symbol: Symbols previously approved by the HGNC for the gene.

Previous name: Gene names previously approved by the HGNC for the gene.

Chromosome: The chromosome where the gene can be found.

Chromosome location: Indicates the location of the gene or region on the chromosome.

Locus group

Groups locus types together into related sets. Below is a list of groups and the locus types within the group:

protein-coding gene - contains the "gene with protein product" locus type
non-coding RNA - contains the following locus types:
- RNA, cluster
- RNA, long non-coding
- RNA, micro
- RNA, ribosomal
- RNA, small cytoplasmic
- RNA, small misc
- RNA, small nuclear
- RNA, small nucleolar
- RNA, transfer
pseudogene - contains the following types:
- immunoglobulin pseudogene
- pseudogene
- T cell receptor pseudogene
phenotype - contains the "phenotype only" locus type
other - contains the following types:
- endogenous retrovirus
- fragile site
- immunoglobulin gene
- protocadherin
- readthrough
- region
- T cell receptor gene
- transposable element
- unknown
- virus integration site
withdrawn - contains the "withdrawn" locus type only

Locus type

Specifies the type of locus described by the given entry:

complex locus constituent - transcriptional unit that is part of a named complex locus
endogenous retrovirus - integrated retroviral elements that are transmitted through the germline (SO:0000100)
fragile site - a heritable locus on a chromosome that is prone to DNA breakage
gene with protein product - protein-coding genes (the protein may be predicted and of unknown function) (SO:0001217)
immunoglobulin gene - gene segments that undergo somatic recombination to form heavy or light chain immunoglobulin genes (SO:0000460)
immunoglobulin pseudogene - immunoglobulin gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame
phenotype only - mapped phenotypes (SO:0001500)
protocadherin - gene segments that constitute the three clustered protocadherins (alpha, beta and gamma)
pseudogene - genomic DNA sequences that are similar to protein-coding genes but do not encode a functional protein (SO:0000336)
readthrough - a naturally occurring transcript containing coding sequence from two or more genes that can also be transcribed individually
region - extents of genomic sequence that contain one or more genes, also applied to non-gene areas that do not fall into other types
RNA, cluster - region containing a cluster of small non-coding RNA genes
RNA, long non-coding - non-protein coding genes that encode long non-coding RNAs (lncRNAs); these are at least 200 nt and are represented by a processed transcript and/or at least 3 ESTs
RNA, micro - non-protein coding genes that encode microRNAs (miRNAs) (SO:0001265)
RNA, ribosomal - non-protein coding genes that encode ribosomal RNAs (rRNAs) (SO:0001637)
RNA, small nuclear - non-protein coding genes that encode small nuclear RNAs (snRNAs) (SO:0001268)
RNA, small nucleolar - non-protein coding genes that encode small nucleolar RNAs (snoRNAs) containing C/D or H/ACA box domains (SO:0001267)
RNA, small cytoplasmic - non-protein coding genes that encode small cytoplasmic RNAs (scRNAs) (SO:0001266)
RNA, transfer - non-protein coding genes that encode transfer RNAs (tRNAs) (SO:0001272)
RNA, small misc - non-protein coding genes that encode miscellaneous types of small ncRNAs
T cell receptor gene - gene segments that undergo somatic recombination to form either alpha, beta, gamma or delta chain T cell receptor genes (SO:0000460)
T cell receptor pseudogene - T cell receptor gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame
transposable element - a segment of repetitive DNA that can move, or retrotranspose, to new sites within the genome (SO:0000101)
unknown - entries where the locus type is currently unknown
virus integration site - target sequence for the integration of viral DNA into the genome

HGNC family ID: Each gene family has a unique numerical ID that forms the last part of the gene family page URL to aid linking and downloading.

HGNC family name: The name given/chosen by the HGNC for the family.

Date approved: Date the gene symbol and name were approved by the HGNC.

Date modified: If applicable, the date the entry was modified by the HGNC.

Date symbol changed: If applicable, the date the approved gene symbol was last changed by the HGNC.

Date name changed: If applicable, the date the approved gene name was last changed by the HGNC.

Model organism databases

Mouse genome informatics (MGI) ID: Mouse Genome Informatics ID for the mouse homologs of the human gene.

Rat genome database (RGD) ID: Rat Genome Database ID for the rat homologs of the human gene.

Gene resources

Ensembl gene ID: The Ensembl gene ID associated with the HGNC gene symbol. The Ensembl project produces genome databases for vertebrates and other eukaryotic species.

NCBI gene ID: The NCBI gene ID associated with the HGNC gene symbol. NCBI Gene provides curated sequence and descriptive information about genetic loci, including official nomenclature, synonyms, sequence accessions, phenotypes, EC numbers, MIM numbers, UniGene clusters, homology, map locations, and related web sites.

UCSC gene ID: The UCSC gene ID associated with the HGNC gene symbol. The ID is used within the UCSC genome browser to identify an annotated human gene record.

Vega gene ID: The Vega gene ID associated with the HGNC gene symbol. The VEGA database is a repository for high-quality gene models produced by the manual annotation of vertebrate genomes.

Nucleotide resources

CCDS accession: The Consensus CDS (CCDS) project is a collaborative effort to identify a core set of human and mouse protein coding regions that are consistently annotated and of high quality. The long term goal is to support convergence towards a standard set of gene annotations.

INSDC (ENA/GenBank/DDBJ) accession: INSDC nucleotide sequence accession numbers selected by the HGNC for a gene.

RefSeq accession: The Reference Sequence (RefSeq) identifier displayed within the HGNC gene symbol report. RefSeq aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products. RefSeq identifiers are designed to provide a stable reference for gene identification and characterization, mutation analysis, expression studies, polymorphism discovery, and comparative analyses.

RNAcentral ID: RNAcentral is a public resource that offers integrated access to a comprehensive and up-to-date set of non-coding RNA sequences provided by a collaborating group of Expert Databases.

Protein resources

UniProt accession: The UniProt identifier for a protein product of the gene. The UniProt Protein Knowledgebase is described as a curated protein sequence database that provides a high level of annotation, a minimal level of redundancy and a high level of integration with other databases.

Enzyme (EC) ID: Enzyme entries have Enzyme Commission (EC) numbers associated with them that indicate the hierarchical functional classes to which they belong.

Clinical resources

Cosmic symbol: The gene symbol displayed within the Catalogue Of Somatic Mutations In Cancer (Cosmic). Most of the gene symbols will be the same as the HGNC approved gene symbol, but for some genes in Cosmic this may not be the case.

OMIM ID: Identifier provided by Online Mendelian Inheritance in Man (OMIM). This database is described as a catalog of human genes and genetic disorders containing textual information and links to additional related resources.

Orphanet ID: Orphanet is the reference portal for information on rare diseases and orphan drugs, for all audiences. Orphanet’s aim is to help improve the diagnosis, care and treatment of patients with rare diseases. The Orphanet ID identifies a gene within Orphanet and the rare diseases that are associated with the gene.

Locus reference genomic (LRG) ID: LRG sequences provide a stable genomic DNA framework for reporting mutations with a permanent ID and a core content that never changes.

Locus specific database (LSDB) name: This contains LSDB database names pertinent to the gene.

Locus specific database (LSDB) URL: This contains LSDB database URL pertinent to the gene.

References

PubMed ID: Identifier that links to published articles relevant to the gene in the NCBI's PubMed database.

Other external resources

HCDM CD name: The CD name for a cellular differentiation molecule found within the HCDM database.

HomeoDB ID: ID for a homeobox gene within the Homeobox database (HomeoDB2).

HORDE symbol: The ID for an olfactory receptor gene entry within the Human Olfactory Receptor Data Exploratorium (HORDE) database.

IMGT gene symbol: The IMGT/GENE-DB gene symbol for immunoglobulin and T-cell receptor genes associated with the HGNC gene. The gene symbols are either the same as, or equivalent to, HGNC approved gene symbols. Equivalent IMGT symbols include the character "/" which is not present in HGNC approved symbols. The presence of an IMGT gene symbol indicates that the gene can be found within the IMGT/GENE-DB.

Intermediate filament database HGNC ID: The HGNC ID stored within the Human Intermediate Filament Database for an intermediate filament gene.

IUPHAR/BPS guide to pharmacology ID: IUPHAR/BPS Guide to PHARMACOLOGY is an expert-driven guide to pharmacological targets and the substances that act on them. The ID is their object ID that is used as an identifier for a gene record within their database.

KZNF gene catalog ID: The KZNF catalog is a comprehensive collection of Krüppel-type zinc finger genes (KZNFs) in primates with finished or high quality draft genomes. The ID refers to a gene report within the KZNF catalog.

mamit-tRNADB ID: Mamit-tRNAdb is a compilation of mammalian mitochondrial tRNA genes. The ID refers to a tRNA gene within the mamit-tRNAdb database.

Merops ID: The MEROPS database is an information resource for peptidases (also termed proteases, proteinases and proteolytic enzymes) and the proteins that inhibit them.

mirBase accession: An accession number for a microRNA sequence within the miRBase database for the HGNC gene.

Pseudogene.org ID: An ID for a pseudogene entry/sequence within the Pseudogene.org database for the HGNC gene.

SLC bioparadigms symbol: The gene symbol for a solute carrier gene as found in the Bioparadigms SLC tables database.

snoRNABase (snoid) ID: snoRNABase is a comprehensive database of human H/ACA and C/D box snoRNAs. The ID itself refers to a snoRNA page within the database resource.

LncRNADB symbol: lncRNAdb is a database providing comprehensive annotations of eukaryotic long non-coding RNAs (lncRNAs). Most of the gene symbols will be the same as HGNC approved gene symbols; however, for some genes this may not be the case.

LNCipedia symbol: LNCipedia is a comprehensive compendium of human long non-coding RNAs (lncRNAs). Most of the gene symbols will be the same as HGNC approved gene symbols; however, for some genes this may not be the case.

Family attributes

HGNC family attributes

Family ID: Each gene family has a unique numerical ID that forms the last part of the gene family page URL to aid linking and downloading.

Family name: The name given/chosen by the HGNC for the family.

Family alias: Other commonly-used gene family names and abbreviations.

Root gene symbol: The root/stem symbol that is common to most of the genes belonging to the gene family.

Description: A brief description about the gene family in question.

Description source: The source of the text for the description. The source is usually Wikipedia, UniProt or our own HGNC description, although other sources may be used.

External family resources

Resource name: Gene family specific database resource name.

Resource description: Gene family specific database resource description.

Resource URL: Gene family specific database resource URL.

PubMed ID: PubMed ID for a reference pertinent to the gene family. We do not aim to list all possible published papers on the family but we provide PubMed IDs to papers that first described the gene family in question or papers that are particularly relevant to the nomenclature of the genes.

HGNC Gene attributes

HGNC ID (gene): A unique ID provided by the HGNC for each gene with an approved symbol. IDs are of the format HGNC:n, where n is a unique number.

Approved symbol: The official gene symbol that has been approved by the HGNC and is publicly available. Symbols are approved based on specific HGNC nomenclature guidelines. In the HTML results page this ID links to the HGNC Symbol Report for that gene.

Approved name: The official gene name that has been approved by the HGNC and is publicly available. Names are approved based on specific HGNC nomenclature guidelines.

Status

Indicates whether the gene is classified as:

Approved - these genes have HGNC-approved gene symbols
Entry withdrawn - these previously approved genes are no longer thought to exist
Symbol withdrawn - a previously approved record that has since been merged into another record

Locus group

Groups locus types together into related sets. Below is a list of groups and the locus types within the group:

protein-coding gene - contains the "gene with protein product" locus type
non-coding RNA - contains the following locus types:
- RNA, cluster
- RNA, long non-coding
- RNA, micro
- RNA, ribosomal
- RNA, small cytoplasmic
- RNA, small misc
- RNA, small nuclear
- RNA, small nucleolar
- RNA, transfer
pseudogene - contains the following types:
- immunoglobulin pseudogene
- pseudogene
- T cell receptor pseudogene
phenotype - contains the "phenotype only" locus type
other - contains the following types:
- endogenous retrovirus
- fragile site
- immunoglobulin gene
- protocadherin
- readthrough
- region
- T cell receptor gene
- transposable element
- unknown
- virus integration site
withdrawn - contains the "withdrawn" locus type only

Locus type

Specifies the type of locus described by the given entry:

complex locus constituent - transcriptional unit that is part of a named complex locus
endogenous retrovirus - integrated retroviral elements that are transmitted through the germline (SO:0000100)
fragile site - a heritable locus on a chromosome that is prone to DNA breakage
gene with protein product - protein-coding genes (the protein may be predicted and of unknown function) (SO:0001217)
immunoglobulin gene - gene segments that undergo somatic recombination to form heavy or light chain immunoglobulin genes (SO:0000460)
immunoglobulin pseudogene - immunoglobulin gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame
phenotype only - mapped phenotypes (SO:0001500)
protocadherin - gene segments that constitute the three clustered protocadherins (alpha, beta and gamma)
pseudogene - genomic DNA sequences that are similar to protein-coding genes but do not encode a functional protein (SO:0000336)
readthrough - a naturally occurring transcript containing coding sequence from two or more genes that can also be transcribed individually
region - extents of genomic sequence that contain one or more genes, also applied to non-gene areas that do not fall into other types
RNA, cluster - region containing a cluster of small non-coding RNA genes
RNA, long non-coding - non-protein coding genes that encode long non-coding RNAs (lncRNAs); these are at least 200 nt and are represented by a processed transcript and/or at least 3 ESTs
RNA, micro - non-protein coding genes that encode microRNAs (miRNAs) (SO:0001265)
RNA, ribosomal - non-protein coding genes that encode ribosomal RNAs (rRNAs) (SO:0001637)
RNA, small nuclear - non-protein coding genes that encode small nuclear RNAs (snRNAs) (SO:0001268)
RNA, small nucleolar - non-protein coding genes that encode small nucleolar RNAs (snoRNAs) containing C/D or H/ACA box domains (SO:0001267)
RNA, small cytoplasmic - non-protein coding genes that encode small cytoplasmic RNAs (scRNAs) (SO:0001266)
RNA, transfer - non-protein coding genes that encode transfer RNAs (tRNAs) (SO:0001272)
RNA, small misc - non-protein coding genes that encode miscellaneous types of small ncRNAs
T cell receptor gene - gene segments that undergo somatic recombination to form either alpha, beta, gamma or delta chain T cell receptor genes (SO:0000460)
T cell receptor pseudogene - T cell receptor gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame
transposable element - a segment of repetitive DNA that can move, or retrotranspose, to new sites within the genome (SO:0000101)
unknown - entries where the locus type is currently unknown
virus integration site - target sequence for the integration of viral DNA into the genome

Chromosome: The chromosome where the gene can be found.

Date approved: Date the gene symbol and name were approved by the HGNC.

Date modified: If applicable, the date the gene entry was modified by the HGNC.

Date name changed: If applicable, the date the approved gene name was last changed by the HGNC.

Date symbol changed: If applicable, the date the approved gene symbol was last changed by the HGNC.

Other external resources

Ensembl gene ID: The Ensembl gene ID associated with the HGNC gene symbol. The Ensembl project produces genome databases for vertebrates and other eukaryotic species.

NCBI gene ID: The NCBI gene ID associated with the HGNC gene symbol. NCBI Gene provides curated sequence and descriptive information about genetic loci, including official nomenclature, synonyms, sequence accessions, phenotypes, EC numbers, MIM numbers, UniGene clusters, homology, map locations, and related web sites.

UCSC gene ID: The UCSC gene ID associated with the HGNC gene symbol. The ID is used within the UCSC genome browser to identify an annotated human gene record.

UniProt accession: The UniProt identifier for a protein product of the gene. The UniProt Protein Knowledgebase is described as a curated protein sequence database that provides a high level of annotation, a minimal level of redundancy and high level of integration with other databases.

Vega gene ID: The Vega gene ID associated with the HGNC gene symbol. The VEGA database is a repository for high-quality gene models produced by the manual annotation of vertebrate genomes.

Example of how to use BioMart

BioMart RESTful service

Create a query

On the preview page of your BioMart search you will see near the top a tab labelled "REST/SOAP". Clicking on this tab will produce a query as XML to retrieve the same data that you are previewing.

Using the XML query

To use the XML query you need the URL http://biomart.genenames.org/martservice/results. You can either POST the query or send it in a GET request.

POSTing the query

In my opinion, POSTing the query is the better solution. Save the XML snippet in a file called query.xml (or any name you like). You can then send the file in a POST request to the mart server using a tool like curl.

curl --data-urlencode query@query.xml http://biomart.genenames.org/martservice/results
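For repeated use, the POST request above can be wrapped in a small shell function that also saves the results to a file. This is a minimal sketch rather than an official client: the fetch_mart function name and the results.txt file name are hypothetical, and the format of the saved file depends on the query copied from the "REST/SOAP" tab.

```shell
# Service URL taken from the text above.
MART_URL="http://biomart.genenames.org/martservice/results"

# fetch_mart QUERY_FILE OUTPUT_FILE
# POSTs the XML query stored in QUERY_FILE and writes the response to
# OUTPUT_FILE. Returns non-zero if the query file cannot be read or if
# curl reports a failure.
fetch_mart() {
  if [ ! -r "$1" ]; then
    echo "cannot read query file: $1" >&2
    return 1
  fi
  # --data-urlencode "query@FILE" sends the URL-encoded file content as the
  # "query" form field; --fail makes curl exit with an error status when the
  # server answers with an HTTP error code.
  curl --silent --show-error --fail \
       --data-urlencode "query@$1" \
       --output "$2" \
       "$MART_URL"
}

# Example call (left commented out because it contacts the live service):
# fetch_mart query.xml results.txt
```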

GET request method

Copy the XML query snippet and use the martservice URL as shown below:


http://biomart.genenames.org/martservice/results?query=<PASTE QUERY HERE>

You may have to URL-encode the XML query for the GET to work, for example with an online URL encode/decode tool. GET request URLs are limited to 2,048 characters, so depending on the size of the query you may have to use the POST method instead.
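The encoding step and the length check can also be done locally. The sketch below is one possible approach, not part of BioMart: the urlencode, build_get_url and pick_method functions are hypothetical helpers, and the 2,048 figure is simply the limit quoted above.

```shell
# Service URL taken from the text above.
MART_URL="http://biomart.genenames.org/martservice/results"

# urlencode STRING
# Percent-encodes every byte except the RFC 3986 unreserved characters
# (A-Z a-z 0-9 - . _ ~). od prints each byte as a decimal number and awk
# turns each number back into a character or a %XX escape.
urlencode() {
  printf '%s' "$1" | od -An -v -tu1 | awk '
    {
      for (i = 1; i <= NF; i++) {
        b = $i + 0
        if ((b >= 48 && b <= 57) || (b >= 65 && b <= 90) ||
            (b >= 97 && b <= 122) || b == 45 || b == 46 ||
            b == 95 || b == 126) {
          printf "%c", b
        } else {
          printf "%%%02X", b
        }
      }
    }'
}

# build_get_url QUERY_FILE
# Prints the GET URL for the XML query stored in QUERY_FILE.
build_get_url() {
  printf '%s?query=%s' "$MART_URL" "$(urlencode "$(cat "$1")")"
}

# pick_method QUERY_FILE
# Prints GET if the encoded URL fits within 2,048 characters, otherwise POST.
pick_method() {
  url=$(build_get_url "$1")
  if [ "${#url}" -le 2048 ]; then
    echo GET
  else
    echo POST
  fi
}

# Example calls (the curl line is commented out because it contacts the
# live service):
# pick_method query.xml
# curl "$(build_get_url query.xml)"
```

Encoding the query usually makes it noticeably longer, because each encoded character becomes three, so the length is checked after encoding rather than before.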

Further help on the BioMart RESTful service, and on BioMart in general, can be found in the BioMart 0.9 documentation PDF.