The
HGNC BioMart application
allows users to create customised data tables without the need for
any programming knowledge by interacting with a form to filter the
data and select the columns/attributes they want within the table.
This page details how to interact with the BioMart Mart form and
provides definitions of the filters and attributes.
BioMart overview & project
BioMart is a generic data management system which offers a range of
advanced query interfaces and administration tools.
The system comes with built-in support for query-optimisation and
database federation. BioMart provides users with the ability to
conduct fast, powerful queries using web, graphical, or text-based
applications, or programmatically using web services or software
libraries written in Perl and Java. For data providers, the system
simplifies the task of integrating their own data with other
datasets hosted on the network.
All the software, including an easy-to-install BioMart website, is
available for local installation. BioMart software is completely
Open Source, licensed under the LGPL, and freely available to anyone
without restrictions.
For more information about the BioMart project and to download the
code visit the BioMart site.
HGNC Marts
The
HGNC BioMart homepage
provides a list of the HGNC Marts that are available to use. Clicking
on a Mart name takes the user to a mart form for the dataset of
choice. So far we have two marts to choose from: a gene mart for
gene symbol-centric data and a family mart for gene family-centric
data.
All the mart forms share the same template, in which the form is split
into three parts: Datasets, Filters and Attributes.
Datasets
The Datasets part of the mart form is where the user selects the
database and the dataset they would like to query and download. The
HGNC has only one database, so the database dropdown can be
ignored. If the user has entered the site via the HGNC BioMart
homepage, the user will not have to change the dataset. However, if
the user changes their mind and wants to download data from
another dataset, they can select a different dataset using the
"Datasets" dropdown box, which will change the form. As
already mentioned, we have two datasets to choose from so far: the
gene dataset and the family dataset.
Filters
The filters section is an area for the user to filter the data by
the provided fields. There are several types of filter for the user
to interact with, the most common type being the text input filter.
The filters are split into subsections, according to the type of
field/data they filter. Filters are not required for a BioMart
search. If a user wants to select attributes for all the data in the
dataset they should ignore this section of the form.
-
Text input filters
-
Text input filters usually accept a wildcard "%" symbol, which
lets BioMart search the field for values that are like (rather
than identical to) the filter query.
-
Select box filters
-
Select box filters are easy to use in that all the user has to do
is click on the filter and select the value to filter by. By
default the filter will say "-- Select --" and by leaving it like
this BioMart will ignore the filter.
-
Multiple select filters
-
Multiple select filters are scroll boxes that contain many values
per line. To filter by a particular value the user can click on
that value. To filter on several values, Windows users should
hold down the control (Ctrl) key while clicking on each additional
value. Mac users need to hold down the command (Cmd) key instead.
-
Bulk upload filters
-
Our Mart forms also have bulk upload filters. The user first
selects the field they would like to query multiple times
by choosing a value within the drop down select box. The user can
then place their values within the text area box or click the
"upload file" link to select a file which contains the query
values. All of the values have to be of the type selected within
the drop down (i.e. a user cannot provide a file or type in values
that contain mixed ID/symbol/accession types).
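The effect of the "%" wildcard in a text input filter can be pictured with a small Python sketch. It assumes that "%" stands for any run of characters (including none), as in an SQL LIKE pattern, and that matching is case-sensitive; the live service may behave differently. The function names are hypothetical and the gene symbols are only sample values.

```python
import re


def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a filter value containing "%" wildcards into a regex.

    Assumption: "%" matches any run of characters, including an empty one.
    """
    # Escape every literal piece so that only "%" acts as a wildcard.
    parts = (re.escape(part) for part in pattern.split("%"))
    return re.compile("^" + ".*".join(parts) + "$")


def matches(pattern: str, value: str) -> bool:
    """Return True when the whole value fits the wildcard pattern."""
    return like_to_regex(pattern).match(value) is not None


symbols = ["BRCA1", "BRCA2", "BRAF", "TP53"]
print([s for s in symbols if matches("BRCA%", s)])  # ['BRCA1', 'BRCA2']
```

Without a wildcard the sketch requires an exact match, so "BRCA" on its own would not select "BRCA1".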
Gene filters
-
HGNC data filter
-
-
Approved symbol
-
The official gene symbol that has been approved by the
HGNC and is publicly available. Symbols are approved based
on specific
HGNC nomenclature guidelines
. In the HTML results page this symbol links to the HGNC
Symbol Report for that gene.
-
Approved name
-
The official gene name that has been approved by the HGNC
and is publicly available. Names are approved based on
specific
HGNC nomenclature guidelines
.
-
Alias gene symbol
-
Other symbols used to refer to this gene.
-
Alias name
-
Other names used to refer to this gene.
-
Previous HGNC symbol
-
Symbols previously approved by the HGNC for this
gene.
-
Previous HGNC name
-
Gene names previously approved by the HGNC for this
gene.
-
Filter by genes...
-
This filter allows the user to remove rows from the results
table for genes that do not have a value within a selected
field.
-
Status
-
Indicates whether the gene is classified as:
-
Approved - these genes have HGNC-approved
gene symbols
-
Entry withdrawn - these previously
approved genes are no longer thought to exist
-
Symbol withdrawn - a previously approved
record that has since been merged into another record
-
Locus
group
-
Groups
locus types together into related sets. Below is a list of groups
and the locus types within the group:
-
protein-coding gene - contains the "gene
with protein product" locus type
-
non-coding RNA - contains the following
locus types:
- RNA, Y
- RNA, cluster
- RNA, long non-coding
- RNA, micro
- RNA, misc
- RNA, ribosomal
- RNA, small cytoplasmic
- RNA, small nuclear
- RNA, small nucleolar
- RNA, transfer
- RNA, vault
-
pseudogene - contains the following types:
- immunoglobulin pseudogene
- pseudogene
- T cell receptor pseudogene
-
phenotype - contains the "phenotype only"
locus type
-
other - contains the following types:
- endogenous retrovirus
- fragile site
- immunoglobulin gene
- protocadherin
- readthrough
- region
- T cell receptor gene
- transposable element
- unknown
- virus integration site
-
withdrawn - contains the "withdrawn" locus
type only
-
Locus
type
-
Specifies the type of locus described by the given entry:
-
gene with protein product - protein-coding
genes (the protein may be predicted and of unknown
function) (
SO:0001217)
-
RNA, Y - non-protein coding genes that encode Y
RNAs (
SO:0000405)
-
RNA, cluster - region containing a cluster of
small non-coding RNA genes
-
RNA, long non-coding - non-protein coding genes
that encode long non-coding RNAs (lncRNAs) (
SO:0001877); these are at least 200 nt in length. Subtypes
include intergenic (
SO:0001463), intronic (
SO:0001903) and antisense (
SO:0001904).
-
RNA, micro - non-protein coding genes that
encode microRNAs (miRNAs) (
SO:0001265)
-
RNA, misc - non-protein coding genes that
encode miscellaneous types of small ncRNAs
-
RNA, ribosomal - non-protein coding genes that
encode ribosomal RNAs (rRNAs) (
SO:0001637)
-
RNA, small cytoplasmic - non-protein coding
genes that encode small cytoplasmic RNAs (scRNAs) (
SO:0001266)
-
RNA, small nuclear - non-protein coding genes
that encode small nuclear RNAs (snRNAs) (
SO:0001268)
-
RNA, small nucleolar - non-protein coding genes
that encode small nucleolar RNAs (snoRNAs) containing
C/D or H/ACA box domains (
SO:0001267)
-
RNA, transfer - non-protein coding genes that
encode transfer RNAs (tRNAs) (
SO:0001272)
-
RNA, vault - non-protein coding genes that
encode vault RNAs (
SO:0000404)
-
phenotype only - mapped phenotypes where the
causative gene has not been identified (
SO:0001500)
-
T cell receptor pseudogene - T cell receptor
gene segments that are inactivated due to frameshift
mutations and/or stop codons in the open reading frame
-
immunoglobulin pseudogene - immunoglobulin gene
segments that are inactivated due to frameshift
mutations and/or stop codons in the open reading frame
-
pseudogene - genomic DNA sequences that are
similar to protein-coding genes but do not encode a
functional protein (
SO:0000336)
-
T cell receptor gene - gene segments that
undergo somatic recombination to form either alpha,
beta, gamma or delta chain T cell receptor genes (
SO:0000460). Also includes T cell receptor gene segments with
open reading frames that either cannot undergo somatic
recombination, or encode a peptide that is not predicted
to fold correctly; these are identified by inclusion of
the term “non-functional” in the gene name.
-
complex locus constituent - transcriptional
unit that is part of a named complex locus
-
endogenous retrovirus - integrated retroviral
elements that are transmitted through the germline (
SO:0000100)
-
fragile site - a heritable locus on a
chromosome that is prone to DNA breakage
-
immunoglobulin gene - gene segments that
undergo somatic recombination to form heavy or light
chain immunoglobulin genes (
SO:0000460). Also includes immunoglobulin gene segments with open
reading frames that either cannot undergo somatic
recombination, or encode a peptide that is not predicted
to fold correctly; these are identified by inclusion of
the term “non-functional” in the gene name.
-
protocadherin - gene segments that constitute
the three clustered protocadherins (alpha, beta and
gamma)
-
readthrough - a naturally occurring transcript
containing coding sequence from two or more genes that
can also be transcribed individually
-
region - extents of genomic sequence that
contain one or more genes, also applied to non-gene
areas that do not fall into other types
-
transposable element - a segment of repetitive
DNA that can move, or retrotranspose, to new sites
within the genome (
SO:0000101)
-
unknown - entries where the locus type is
currently unknown
-
virus integration site - target sequence for
the integration of viral DNA into the genome
-
Chromosome
- The chromosome where the gene can be found.
-
Bulk upload filter
-
-
Filter by ID, accession
or symbol
-
This field allows the user to provide multiple query values
to bulk search BioMart. The list of values must all be of
the type selected using the drop down box. Values can be
typed/pasted into the text area or uploaded within a file by
clicking on the "upload file" link. The types accepted in
this filter are as follows:
-
HGNC ID(s) - A unique ID provided by the HGNC
for each gene with an approved symbol. IDs are of the
format HGNC:n, where n is a unique number.
-
Approved symbols - The official gene symbol
that has been approved by the HGNC.
-
Alias gene symbols - Other symbols used to
refer to the gene.
-
Previous HGNC symbols - Symbols previously
approved by the HGNC for the gene.
-
CCDS accessions - The Consensus CDS (CCDS)
accession.
-
INSDC (ENA/GenBank/DDBJ) accessions - INSDC
nucleotide sequence accession numbers.
-
Ensembl gene ID(s) - The ID for an Ensembl gene
entry.
-
Mouse genome informatics (MGI) ID(s) - Mouse
Genome Informatics ID for a mouse homolog of human
genes.
-
NCBI Gene ID(s) - IDs that are associated with
a gene within NCBI Gene.
-
OMIM ID(s) - Identifier from the Online
Mendelian Inheritance in Man (OMIM).
-
Orphanet ID(s) - The Orphanet ID identifies a
gene within Orphanet and the rare diseases that are
associated with the gene.
-
Pseudogene.org ID(s) - An ID for a pseudogene
entry/sequence within the Pseudogene.org database.
-
RefSeq accessions - The Reference Sequence
(RefSeq) identifier.
-
Rat Genome Database (RGD) ID(s) - Rat Genome
Database ID for a rat homolog of human genes.
-
UniProt accessions - The UniProt identifier for
a protein product of a gene.
- Vega gene ID(s) - The Vega gene ID.
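Because every value in a bulk upload must be of the selected type, it can help to check a list before uploading it. The Python sketch below does this for HGNC IDs, reading the "HGNC:n" format described above as "HGNC:" followed by digits. The one-value-per-line file layout and the function names are assumptions made for illustration, and the IDs shown are only sample values in that format.

```python
import re
from pathlib import Path

# Assumed reading of the "HGNC:n" format: the prefix followed by digits only.
HGNC_ID = re.compile(r"^HGNC:\d+$")


def split_by_format(values):
    """Separate values that look like HGNC IDs from those that do not."""
    cleaned = [v.strip() for v in values if v.strip()]
    good = [v for v in cleaned if HGNC_ID.match(v)]
    bad = [v for v in cleaned if not HGNC_ID.match(v)]
    return good, bad


def write_upload_file(values, path):
    """Write one value per line (assumed layout), refusing mixed types."""
    good, bad = split_by_format(values)
    if bad:
        raise ValueError(f"values that are not HGNC IDs: {bad}")
    Path(path).write_text("\n".join(good) + "\n", encoding="utf-8")
    return len(good)


good, bad = split_by_format(["HGNC:5", " HGNC:1100 ", "BRCA1", ""])
print(good)  # ['HGNC:5', 'HGNC:1100']
print(bad)   # ['BRCA1']
```

Calling write_upload_file on the same list would raise an error, because "BRCA1" is a symbol rather than an HGNC ID.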
Family filters
-
HGNC data filter
-
-
Family name
- The name given/chosen by the HGNC for the family.
-
Family
alias
-
Alternative names that are also used to describe the gene
family.
-
Root gene
symbol
-
The root/stem symbol that is common to most of the genes
belonging to the gene family.
-
Bulk upload filter
-
-
Filter by IDs or
symbols
-
This field allows the user to provide multiple family IDs,
HGNC (gene) IDs and approved gene symbols to BioMart to
search. The list of values must all be of the type selected
using the drop down box. Values can be
typed/pasted into the text area or uploaded within a file by
clicking on the "upload file" link.
Attributes
The Attributes section of the form is where the user selects what
they want displayed within their table for download. BioMart
requires at least one attribute to be selected. On both the
gene and family marts some of the key attributes are selected by
default; however, the user can deselect these defaults. The attribute
section is divided into subsections that group similar attribute
fields together. To select or deselect an attribute the user should
click on the check box next to the attribute's label. Alternatively,
the user can select or deselect all the attributes within a subsection
by clicking on the links labelled "select all" and "select none".
Gene attributes
-
HGNC data
-
-
HGNC
ID
-
A unique ID provided by the HGNC for each gene with an
approved symbol. IDs are of the format HGNC:n, where n is
a unique number.
-
Status
-
Indicates whether the gene is classified as:
-
Approved - these genes have HGNC-approved
gene symbols
-
Entry withdrawn - these previously
approved genes are no longer thought to exist
-
Symbol withdrawn - a previously approved
record that has since been merged into another record
-
Approved symbol
-
The official gene symbol that has been approved by the HGNC
and is publicly available. Symbols are approved based on
specific
HGNC nomenclature guidelines.
-
Approved name
-
The official gene name that has been approved by the HGNC
and is publicly available. Names are approved based on
specific
HGNC nomenclature guidelines
.
-
Alias symbol
-
Other symbols used to refer to the gene.
-
Alias name
-
Other names used to refer to the gene.
-
Previous symbol
-
Symbols previously approved by the HGNC for the
gene.
-
Previous name
-
Gene names previously approved by the HGNC for the
gene.
-
Chromosome
- The chromosome where the gene can be found.
-
Chromosome location
-
Indicates the location of the gene or region on the
chromosome
-
Locus
group
-
Groups
locus types together into related sets. Below is a list of groups
and the locus types within the group:
-
protein-coding gene - contains the "gene
with protein product" locus type
-
non-coding RNA - contains the following
locus types:
- RNA, cluster
- RNA, long non-coding
- RNA, micro
- RNA, ribosomal
- RNA, small cytoplasmic
- RNA, small misc
- RNA, small nuclear
- RNA, small nucleolar
- RNA, transfer
-
pseudogene - contains the following types:
- immunoglobulin pseudogene
- pseudogene
- T cell receptor pseudogene
-
phenotype - contains the "phenotype only"
locus type
-
other - contains the following types:
- endogenous retrovirus
- fragile site
- immunoglobulin gene
- protocadherin
- readthrough
- region
- T cell receptor gene
- transposable element
- unknown
- virus integration site
-
withdrawn - contains the "withdrawn" locus
type only
-
Locus
type
-
Specifies the type of locus described by the given entry:
-
complex locus constituent -
transcriptional unit that is part of a named complex
locus
-
endogenous retrovirus - integrated
retroviral elements that are transmitted through the
germline (
SO:0000100)
-
fragile site - a heritable locus on a
chromosome that is prone to DNA breakage
-
gene with protein product - protein-coding
genes (the protein may be predicted and of unknown
function) (
SO:0001217)
-
immunoglobulin gene - gene segments that
undergo somatic recombination to form heavy or light
chain immunoglobulin genes (
SO:0000460)
-
immunoglobulin pseudogene - immunoglobulin
gene segments that are inactivated due to frameshift
mutations and/or stop codons in the open reading frame
-
phenotype only - mapped phenotypes (
SO:0001500)
-
protocadherin - gene segments that
constitute the three clustered protocadherins (alpha,
beta and gamma)
-
pseudogene - genomic DNA sequences that
are similar to protein-coding genes but do not encode a
functional protein (
SO:0000336)
-
readthrough - a naturally occurring
transcript containing coding sequence from two or more
genes that can also be transcribed individually
-
region - extents of genomic sequence that
contain one or more genes, also applied to non-gene
areas that do not fall into other types
-
RNA, cluster - region containing a cluster
of small non-coding RNA genes
-
RNA, long non-coding - non-protein coding
genes that encode long non-coding RNAs (lncRNAs); these
are at least 200 nt and are represented by a processed
transcript and/or at least 3 ESTs
-
RNA, micro - non-protein coding genes that
encode microRNAs (miRNAs) (
SO:0001265)
-
RNA, ribosomal - non-protein coding genes
that encode ribosomal RNAs (rRNAs) (
SO:0001637)
-
RNA, small nuclear - non-protein coding
genes that encode small nuclear RNAs (snRNAs) (
SO:0001268)
-
RNA, small nucleolar - non-protein coding
genes that encode small nucleolar RNAs (snoRNAs)
containing C/D or H/ACA box domains (
SO:0001267)
-
RNA, small cytoplasmic - non-protein
coding genes that encode small cytoplasmic RNAs (scRNAs)
(
SO:0001266)
-
RNA, transfer - non-protein coding genes
that encode transfer RNAs (tRNAs) (
SO:0001272)
-
RNA, small misc - non-protein coding genes
that encode miscellaneous types of small ncRNAs
-
T cell receptor gene - gene segments that
undergo somatic recombination to form either alpha,
beta, gamma or delta chain T cell receptor genes (
SO:0000460)
-
T cell receptor pseudogene - T cell
receptor gene segments that are inactivated due to
frameshift mutations and/or stop codons in the open
reading frame
-
transposable element - a segment of
repetitive DNA that can move, or retrotranspose, to new
sites within the genome (
SO:0000101)
-
unknown - entries where the locus type is
currently unknown
-
virus integration site - target sequence
for the integration of viral DNA into the genome
-
HGNC family ID
-
Each gene family has a unique numerical ID that forms the
last part of the gene family page URL to aid linking and
downloading.
-
HGNC family name
- The name given/chosen by the HGNC for the family.
-
Date
approved
-
Date the gene symbol and name were approved by the
HGNC.
-
Date
modified
-
If applicable, the date the entry was modified by the
HGNC.
-
Date symbol changed
-
If applicable, the date the approved gene symbol was last
changed by the HGNC.
-
Date name changed
-
If applicable, the date the approved gene name was last
changed by the HGNC.
-
Model organism databases
-
-
Gene resources
-
-
Ensembl gene ID
-
The Ensembl gene ID associated with the HGNC gene
symbol.
The Ensembl project produces genome databases for
vertebrates and other eukaryotic species.
-
NCBI gene ID
-
The NCBI gene ID associated with the HGNC gene symbol.
NCBI gene
at the NCBI provides curated sequence and
descriptive information about genetic loci including
official nomenclature, synonyms, sequence accessions,
phenotypes, EC numbers, MIM numbers, UniGene clusters,
homology, map locations, and related web sites.
-
UCSC gene ID
-
The UCSC gene ID associated with the HGNC gene symbol. The
ID identifies an annotated human gene record within the
UCSC genome browser.
-
Vega gene ID
-
The Vega gene ID associated with the HGNC gene
symbol. The
VEGA database is a
repository for high-quality gene models produced by the
manual annotation of vertebrate genomes.
-
Nucleotide resources
-
-
CCDS accession
-
The
Consensus CDS (CCDS) project
is a collaborative effort to identify a core set of
human and mouse protein coding regions that are
consistently annotated and of high quality. The long term
goal is to support convergence towards a standard set of
gene annotations.
-
INSDC (ENA/GenBank/DDBJ) accession
-
INSDC nucleotide sequence accession numbers selected by
the HGNC for a gene.
-
RefSeq accession
-
The Reference Sequence (
RefSeq)
identifier displayed within the HGNC gene symbol report.
RefSeq aims to provide a comprehensive, integrated,
non-redundant set of sequences, including genomic DNA,
transcript (RNA), and protein products. RefSeq identifiers
are designed to provide a stable reference for gene
identification and characterization, mutation analysis,
expression studies, polymorphism discovery, and comparative
analyses.
-
RNAcentral ID
-
RNAcentral
is a public resource that offers integrated access to
a comprehensive and up-to-date
set of non-coding RNA sequences provided by a
collaborating group of
Expert Databases.
-
Protein resources
-
-
UniProt accession
-
The
UniProt identifier for a protein product of the gene
. The UniProt Protein Knowledgebase is described as a
curated protein sequence database that provides a high
level of annotation, a minimal level of redundancy and
high level of integration with other databases.
-
Enzyme (EC) ID
-
Enzyme entries have
Enzyme Commission (EC)
numbers associated with them that indicate the
hierarchical functional classes to which they
belong.
-
Clinical resources
-
-
Cosmic symbol
-
The gene symbol displayed within the Catalogue Of Somatic
Mutations In Cancer (
Cosmic).
Most of the gene symbols will be the same as the HGNC approved
gene symbol, but for some genes in Cosmic this may not be the
case.
-
OMIM
ID
-
Identifier provided by
Online Mendelian Inheritance in Man (OMIM)
. This database is described as a catalog of human genes
and genetic disorders containing textual information and
links to additional related resources.
-
Orphanet ID
-
Orphanet is the reference portal for information on rare
diseases and orphan drugs, for all audiences. Orphanet’s
aim is to help improve the diagnosis, care and treatment
of patients with rare diseases. The Orphanet ID identifies
a gene within Orphanet and the rare diseases that are
associated with the gene.
-
Locus
reference genomic (LRG) ID
-
LRG sequences provide a stable genomic DNA framework for
reporting mutations with a permanent ID and a core content
that never changes.
-
Locus
specific database (LSDB) name
-
The names of the LSDBs pertinent to the
gene.
-
Locus
specific database (LSDB) URL
-
The URLs of the LSDBs pertinent to the gene.
-
References
-
-
PubMed ID
-
Identifier that links to published articles relevant to
the gene in the NCBI's
PubMed database
.
-
Other external resources
-
-
HCDM CD
name
-
The CD name for a cellular differentiation
molecule found within the
HCDM database.
-
HomeoDB ID
-
ID for a homeobox gene within the Homeobox database (
HomeoDB2).
-
HORDE symbol
-
The ID for an olfactory receptor gene entry within the
Human Olfactory Receptor Data Exploratorium (
HORDE) database.
-
IMGT gene symbol
-
The IMGT/GENE-DB gene symbol for immunoglobulin and T-cell
receptor genes associated with the HGNC gene. The gene symbols
are either the same as, or equivalent to, HGNC approved gene
symbols. Equivalent IMGT symbols include the character "/"
which is not present in HGNC approved symbols. The presence
of an IMGT gene symbol indicates that the gene can be found
within the IMGT/GENE-DB.
-
IUPHAR/BPS guide to pharmacology ID
-
IUPHAR/BPS Guide to PHARMACOLOGY
is an
expert-driven guide to pharmacological targets and the
substances that act on them. The ID is their object ID
that is used as an identifier for a gene record within
their database.
-
KZNF gene catalog ID
-
The KZNF catalog is a comprehensive collection of
Krüppel-type zinc finger genes (KZNFs)
in primates with finished or high quality draft
genomes. The ID refers to a gene report within
the KZNF catalog.
-
mamit-tRNADB ID
-
Mamit-tRNAdb
is a compilation of mammalian mitochondrial tRNA genes. The
ID refers to a tRNA gene within the mamit-tRNAdb database.
-
Merops ID
-
The
MEROPS
database is an information resource for peptidases
(also termed proteases, proteinases and proteolytic
enzymes) and the proteins that inhibit them.
-
mirBase accession
-
An accession number for a microRNA sequence within the
miRBase database for the HGNC gene.
-
Pseudogene.org ID
-
An ID for a pseudogene entry/sequence within the
Pseudogene.org database for the HGNC gene.
-
SLC bioparadigms symbol
-
The gene symbol for a solute carrier gene as found in the
Bioparadigms SLC tables
database.
-
snoRNABase (snoid) ID
-
snoRNABase is a
comprehensive database of human H/ACA and C/D box snoRNAs.
The ID itself refers to a snoRNA page within the database
resource.
-
LncRNADB symbol
-
lncRNAdb is a
database providing comprehensive annotations of
eukaryotic long non-coding RNAs (lncRNAs). Most of
the gene symbols will be the same as HGNC approved gene
symbols; however, for some genes this may not be the
case.
-
LNCipedia symbol
-
LNCipedia is
a
comprehensive compendium of human long non-coding RNAs
(lncRNAs). Most of the gene symbols will be the same
as HGNC approved gene symbols; however, for some genes this
may not be the case.
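The grouping of locus types into locus groups, as listed under the Locus group attribute above, can be written as a lookup table when post-processing a downloaded table. This is a minimal Python sketch: the names LOCUS_GROUPS, TYPE_TO_GROUP and locus_group are hypothetical, and the table simply copies the gene attribute list on this page, so a locus type that is not listed in any group there (for example "complex locus constituent") returns None.

```python
# Locus group -> locus types, copied from the Locus group attribute list.
LOCUS_GROUPS = {
    "protein-coding gene": ["gene with protein product"],
    "non-coding RNA": [
        "RNA, cluster", "RNA, long non-coding", "RNA, micro",
        "RNA, ribosomal", "RNA, small cytoplasmic", "RNA, small misc",
        "RNA, small nuclear", "RNA, small nucleolar", "RNA, transfer",
    ],
    "pseudogene": [
        "immunoglobulin pseudogene", "pseudogene",
        "T cell receptor pseudogene",
    ],
    "phenotype": ["phenotype only"],
    "other": [
        "endogenous retrovirus", "fragile site", "immunoglobulin gene",
        "protocadherin", "readthrough", "region", "T cell receptor gene",
        "transposable element", "unknown", "virus integration site",
    ],
    "withdrawn": ["withdrawn"],
}

# Inverted index: locus type -> locus group.
TYPE_TO_GROUP = {
    locus_type: group
    for group, locus_types in LOCUS_GROUPS.items()
    for locus_type in locus_types
}


def locus_group(locus_type):
    """Return the locus group for a locus type, or None if it is not listed."""
    return TYPE_TO_GROUP.get(locus_type)


print(locus_group("RNA, micro"))  # non-coding RNA
```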
Family attributes
-
HGNC family attributes
-
-
Family
ID
-
Each gene family has a unique numerical ID that forms the
last part of the gene family page URL to aid linking and
downloading.
-
Family name
- The name given/chosen by the HGNC for the family.
-
Family
alias
-
Other commonly-used gene family names and
abbreviations.
-
Root gene
symbol
-
The root/stem symbol that is common to most of the genes
belonging to the gene family.
-
Description
-
A brief description of the gene family in question.
-
Description source
-
The source of the text for the description. Sources are
usually Wikipedia, UniProt or our own HGNC description.
Other sources may be used.
-
External family resources
-
-
Resource name
- Gene family specific database resource name.
-
Resource description
- Gene family specific database resource description.
-
Resource URL
- Gene family specific database resource URL.
-
PubMed
ID
-
PubMed ID for a reference pertinent to the gene family.
We do
not aim to list all possible published papers on
the family but we provide PubMed IDs to papers that first
described the gene family in question or papers that are
particularly relevant to the nomenclature of the
genes.
-
HGNC Gene attributes
-
-
HGNC
ID (gene)
-
A unique ID provided by the HGNC for each gene with an
approved symbol. IDs are of the format HGNC:n, where n is
a unique number.
-
Approved
symbol
-
The official gene symbol that has been approved by the
HGNC and is publicly available. Symbols are approved based
on specific
HGNC nomenclature guidelines
. In the HTML results page this symbol links to the HGNC
Symbol Report for that gene.
-
Approved
name
-
The official gene name that has been approved by the HGNC
and is publicly available. Names are approved based on
specific
HGNC nomenclature guidelines
.
-
Status
-
Indicates whether the gene is classified as:
-
Approved - these genes have HGNC-approved
gene symbols
-
Entry withdrawn - these previously
approved genes are no longer thought to exist
-
Symbol withdrawn - a previously approved
record that has since been merged into another record
-
Locus
group
-
Groups
locus types together into related sets. Below is a list of groups
and the locus types within the group:
-
protein-coding gene - contains the "gene
with protein product" locus type
-
non-coding RNA - contains the following
locus types:
- RNA, cluster
- RNA, long non-coding
- RNA, micro
- RNA, ribosomal
- RNA, small cytoplasmic
- RNA, small misc
- RNA, small nuclear
- RNA, small nucleolar
- RNA, transfer
-
pseudogene - contains the following types:
- immunoglobulin pseudogene
- pseudogene
- T cell receptor pseudogene
-
phenotype - contains the "phenotype only"
locus type
-
other - contains the following types:
- endogenous retrovirus
- fragile site
- immunoglobulin gene
- protocadherin
- readthrough
- region
- T cell receptor gene
- transposable element
- unknown
- virus integration site
-
withdrawn - contains the "withdrawn" locus
type only
-
Locus
type
-
Specifies the type of locus described by the given entry:
-
complex locus constituent -
transcriptional unit that is part of a named complex
locus
-
endogenous retrovirus - integrated
retroviral elements that are transmitted through the
germline (
SO:0000100)
-
fragile site - a heritable locus on a
chromosome that is prone to DNA breakage
-
gene with protein product - protein-coding
genes (the protein may be predicted and of unknown
function) (
SO:0001217)
-
immunoglobulin gene - gene segments that
undergo somatic recombination to form heavy or light
chain immunoglobulin genes (
SO:0000460)
-
immunoglobulin pseudogene - immunoglobulin
gene segments that are inactivated due to frameshift
mutations and/or stop codons in the open reading frame
-
phenotype only - mapped phenotypes (
SO:0001500)
-
protocadherin - gene segments that
constitute the three clustered protocadherins (alpha,
beta and gamma)
-
pseudogene - genomic DNA sequences that
are similar to protein-coding genes but do not encode a
functional protein (
SO:0000336)
-
readthrough - a naturally occurring
transcript containing coding sequence from two or more
genes that can also be transcribed individually
-
region - extents of genomic sequence that
contain one or more genes, also applied to non-gene
areas that do not fall into other types
-
RNA, cluster - region containing a cluster
of small non-coding RNA genes
-
RNA, long non-coding - non-protein coding
genes that encode long non-coding RNAs (lncRNAs); these
are at least 200 nt and are represented by a processed
transcript and/or at least 3 ESTs
-
RNA, micro - non-protein coding genes that
encode microRNAs (miRNAs) (
SO:0001265)
-
RNA, ribosomal - non-protein coding genes
that encode ribosomal RNAs (rRNAs) (
SO:0001637)
-
RNA, small nuclear - non-protein coding
genes that encode small nuclear RNAs (snRNAs) (
SO:0001268)
-
RNA, small nucleolar - non-protein coding
genes that encode small nucleolar RNAs (snoRNAs)
containing C/D or H/ACA box domains (
SO:0001267)
-
RNA, small cytoplasmic - non-protein
coding genes that encode small cytoplasmic RNAs (scRNAs)
(
SO:0001266)
-
RNA, transfer - non-protein coding genes
that encode transfer RNAs (tRNAs) (
SO:0001272)
-
RNA, small misc - non-protein coding genes
that encode miscellaneous types of small ncRNAs
-
T cell receptor gene - gene segments that
undergo somatic recombination to form either alpha,
beta, gamma or delta chain T cell receptor genes (
SO:0000460)
-
T cell receptor pseudogene - T cell
receptor gene segments that are inactivated due to
frameshift mutations and/or stop codons in the open
reading frame
-
transposable element - a segment of
repetitive DNA that can move, or retrotranspose, to new
sites within the genome (
SO:0000101)
-
unknown - entries where the locus type is
currently unknown
-
virus integration site - target sequence
for the integration of viral DNA into the genome
-
Chromosome
- The chromosome where the gene can be found.
-
Date
approved
-
Date the gene symbol and name were approved by the HGNC.
-
Date
modified
-
If applicable, the date the gene entry was modified by the
HGNC.
-
Date name changed
-
If applicable, the date the approved gene name was last
changed by the HGNC.
-
Date symbol changed
-
If applicable, the date the approved gene symbol was last
changed by the HGNC.
-
Other external resources
-
-
Ensembl gene ID
-
The Ensembl gene ID associated with the HGNC gene
symbol. The Ensembl project produces genome databases
for vertebrates and other eukaryotic species.
-
NCBI gene ID
-
The NCBI gene ID associated with the HGNC gene symbol.
NCBI gene at the NCBI provides curated sequence and descriptive
information about genetic loci including official
nomenclature, synonyms, sequence accessions, phenotypes, EC
numbers, MIM numbers, UniGene clusters, homology, map
locations, and related web sites.
-
UCSC gene ID
-
The UCSC gene ID associated with the HGNC gene symbol. The
ID identifies an annotated human gene record within the
UCSC genome browser.
-
UniProt accession
-
The
UniProt identifier for a protein product of the gene. The
UniProt Protein Knowledgebase is described as a curated
protein sequence database that provides a high level of
annotation, a minimal level of redundancy and high level of
integration with other databases.
-
Vega gene ID
-
The Vega gene ID associated with the HGNC gene
symbol. The
VEGA database is a repository for high-quality gene
models produced by the manual annotation of vertebrate
genomes.
Example of how to use BioMart
BioMart RESTful service
Create a query
On the preview page of your BioMart search you will see near the
top a tab labelled "REST/SOAP". Clicking on this tab will produce
a query as XML to retrieve the same data that you are previewing.
Using the XML query
To use the XML query you need the URL
http://biomart.genenames.org/martservice/results. You can either
POST the query or use the query in a GET request.
POSTing the query
POSTing the query is, in my opinion, the better solution. Save the
XML snippet in a file called query.xml (or any name you
like). You can then send the file in a POST request to the
martservice using a tool such as curl:
curl --data-urlencode query@query.xml http://biomart.genenames.org/martservice/results
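The same POST can be made from a script. The following Python sketch mirrors the curl command by sending the XML as a URL-encoded form field named "query", using only the standard library. The function names and the timeout are assumptions for illustration, and decoding the response as UTF-8 is also an assumption about the service.

```python
import urllib.parse
import urllib.request

MARTSERVICE_URL = "http://biomart.genenames.org/martservice/results"


def build_post_request(query_xml: str) -> urllib.request.Request:
    """Build (but do not send) a form-encoded POST with a `query` field."""
    body = urllib.parse.urlencode({"query": query_xml}).encode("ascii")
    return urllib.request.Request(MARTSERVICE_URL, data=body, method="POST")


def run_query(query_xml: str, timeout: float = 60.0) -> str:
    """Send the request and return the response body as text.

    Assumption: the service answers in UTF-8.
    """
    request = build_post_request(query_xml)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage, with query.xml saved from the "REST/SOAP" tab:
#   from pathlib import Path
#   print(run_query(Path("query.xml").read_text(encoding="utf-8")))
```

Keeping the request builder separate from the network call makes it possible to inspect the encoded body before anything is sent.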
GET request method
Copy the XML query snippet and use the martservice URL as shown
below:
http://biomart.genenames.org/martservice/results?query=<PASTE QUERY HERE>
You may have to URL encode the XML query for the GET to work. You
can do this by using a tool such as the online
URL encode/decode
tool. GET requests are subject to URL length limits (2,048
characters is a commonly cited figure), so depending on
the size of the query you may have to use the POST method.
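URL encoding and the length check can also be done in a few lines of Python rather than with an online tool. This sketch percent-encodes the XML, builds the GET URL, and suggests POST when the result is longer than 2,048 characters. Applying the limit to the whole URL is an assumption, and the function names are hypothetical.

```python
import urllib.parse

MARTSERVICE_URL = "http://biomart.genenames.org/martservice/results"
MAX_URL_LENGTH = 2048  # limit quoted above; assumed to apply to the full URL


def build_get_url(query_xml: str) -> str:
    """Percent-encode the XML and append it as the `query` parameter."""
    return MARTSERVICE_URL + "?query=" + urllib.parse.quote(query_xml, safe="")


def choose_method(query_xml: str) -> str:
    """Suggest GET when the encoded URL fits the limit, otherwise POST."""
    if len(build_get_url(query_xml)) <= MAX_URL_LENGTH:
        return "GET"
    return "POST"


# Placeholder XML, not a valid BioMart query; paste the real snippet instead.
print(build_get_url('<Query a="1"/>'))
```

Passing safe="" makes quote encode "/" as well, so every reserved character in the XML is escaped.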
More help with the BioMart RESTful service can be found
within the
BioMart 0.9 documentation
PDF.