Frequently Asked Questions: Blat
Blat vs. Blast
What are the differences between Blat and Blast?
Blat is an alignment tool like BLAST, but it is structured differently. On DNA, Blat works by
keeping an index of an entire genome in memory. Thus, the target database of BLAT is not a set of
GenBank sequences, but instead an index derived from the assembly of the entire genome. By default,
the index consists of all non-overlapping 11-mers except for those heavily involved in repeats, and
it uses less than a gigabyte of RAM. This smaller size means that Blat is far more easily
mirrored than BLAST. Blat of DNA is designed to quickly find
sequences of 95% and greater similarity of length 40 bases or more. It may miss more divergent or
shorter sequence alignments. (The default settings and expected behavior of standalone Blat are
slightly different from those on the graphical version of Blat.)
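To make the indexing idea concrete, here is a minimal Python sketch of a non-overlapping k-mer index and a seed lookup. It is an illustration only, not Blat's actual implementation: the function names build_index and find_seeds, the toy sequence, and the choice to look up every overlapping k-mer of the query are assumptions made for this example, and repeat filtering is left out.

```python
from collections import defaultdict


def build_index(target, k=11):
    """Map each non-overlapping k-mer of the target to its start positions."""
    index = defaultdict(list)
    # Step by k so that tiles do not overlap, as in the default DNA index.
    for start in range(0, len(target) - k + 1, k):
        index[target[start:start + k]].append(start)
    return index


def find_seeds(query, index, k=11):
    """Return (query_pos, target_pos) pairs where a query k-mer hits a tile."""
    seeds = []
    # Assumption: every overlapping k-mer of the query is looked up.
    for q_pos in range(len(query) - k + 1):
        for t_pos in index.get(query[q_pos:q_pos + k], ()):
            seeds.append((q_pos, t_pos))
    return seeds
```

With the toy target "ACGTTGCAAGGCTTAACCGGATCGATCGTTAGC" and k=4, the query "GGCTTAAC" yields the single seed (3, 12): the tile TTAA starts at position 12 of the target and at position 3 of the query. A real aligner would then extend such seeds into full alignments.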
On proteins, Blat uses 4-mers rather than 11-mers, finding protein sequences of 80% and greater
similarity to the query of length 20+ amino acids. The protein index requires slightly more than
2 gigabytes of RAM. In practice -- due to sequence divergence rates over evolutionary time -- DNA
Blat works well within humans and primates, while protein Blat continues to find good matches within
terrestrial vertebrates and even earlier organisms for conserved proteins. Within humans, protein
Blat gives a much better picture of gene families (paralogs) than DNA Blat. However, BLAST and
psi-BLAST at NCBI can find much more remote matches.
From a practical standpoint, Blat has several advantages over BLAST:
- speed (no queues, response in seconds) at the price of lesser homology depth
- the ability to submit a long list of simultaneous queries in FASTA format
- five convenient output sort options
- a direct link into the UCSC browser
- alignment block details in natural genomic order
- an option to launch the alignment later as part of a custom track
Blat is commonly used to look up the location of a sequence in the genome or determine the exon
structure of an mRNA, but expert users can run large batch jobs and make internal parameter
sensitivity changes by installing command line Blat on their own Linux server.
Blat can't find a sequence
I can't find a sequence with Blat although I'm sure it is in the genome. Am I doing
something wrong?
First, check if you are using the correct version of the genome. For example, two versions of the
human genome are currently in use (called hg19 and hg38) and your sequence may be only in one of
them. Many published articles do not specify the version so trying a few may be necessary.
Very short sequences that span a splice site in a cDNA sequence can't be found, because they do not
occur as contiguous sequence in the genome. qPCR primers are a typical example. For these cases, try
using In-Silico PCR and selecting a gene set as the target. In general,
the In-Silico PCR tool is more sensitive and should be preferred for pairs of primers.
If you have verified that you are using the correct genome and that the sequence is indeed there,
for example by using the "Short match" track, the problem may be a result of Blat's query-masking.
The online version of Blat masks 11-mers from the query that occur more than 1024 times in the
genome. This is done to improve speed, but may result in missed hits when you are searching for
sequences in repeats.
To find these matches using the online version of Blat, you can add more flanking sequence to your
query. If this is not possible, the only alternative is to download the executables of Blat and the
.2bit file of a genome to your own machine and use blat on the command line. See
Downloading Blat source and documentation for more information.
Blat usage restrictions
I received a warning from your Blat server informing me that I had exceeded the server use
limitations. Can you give me information on the UCSC Blat server use parameters?
Due to the high demand on our Blat servers, we restrict service for users who programmatically query
Blat or do large batch queries. Program-driven use of Blat is limited to a maximum of one hit every
15 seconds and no more than 5,000 hits per day. Please limit batch queries to 25 sequences
or fewer.
For users with high-volume Blat demands, we recommend downloading Blat for local use. For more
information, see Downloading Blat source and documentation.
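For scripted use that stays within the limits above, a client can split its sequences into batches and space out its requests. The following Python sketch shows one way to do that. It does not contact any server: send is a hypothetical caller-supplied function that submits one batch, and treating each batch submission as one "hit" is an assumption of this example.

```python
import time

MIN_INTERVAL_SECONDS = 15   # at most one hit every 15 seconds
MAX_HITS_PER_DAY = 5000     # daily cap on program-driven hits
MAX_BATCH_SIZE = 25         # sequences per batch query


def batches(sequences, size=MAX_BATCH_SIZE):
    """Split a list of sequences into batches of at most `size`."""
    return [sequences[i:i + size] for i in range(0, len(sequences), size)]


def submit_all(sequences, send, sleep=time.sleep):
    """Send each batch through `send`, pausing between hits.

    `send` is a hypothetical function that submits one batch and returns
    its result; `sleep` is injectable so the pacing can be tested without
    actually waiting.
    """
    groups = batches(sequences)
    if len(groups) > MAX_HITS_PER_DAY:
        raise ValueError("too many batches for one day; run Blat locally")
    results = []
    for n, group in enumerate(groups):
        if n > 0:
            sleep(MIN_INTERVAL_SECONDS)  # wait before every hit but the first
        results.append(send(group))
    return results
```

For example, 60 sequences are sent as three batches (25, 25 and 10 sequences) with two 15-second pauses between them.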
Downloading Blat source and documentation
Is the Blat source available for download? Is documentation available?
Blat source and executables are freely available for academic, nonprofit and personal use.
Commercial licensing information is available on the Kent Informatics website.
Blat source may be downloaded from
http://www.soe.ucsc.edu/~kent (look for the
blatSrc* zip file with the most recent date). For Blat executables, go to
http://hgdownload.soe.ucsc.edu/admin/exe/
and choose your machine type.
Documentation on Blat program specifications is available in the blat specifications.
Replicating web-based Blat parameters in command-line version
I'm setting up my own Blat server and would like to use the same parameter values that the
UCSC web-based Blat server uses.
We almost always expect there to be some small differences between the hgBlat/gfServer and the
stand-alone command-line blat. The best matches can be found using pslReps and pslCDnaFilter
utilities. The web-based blat is tuned permissively with a minimum cut-off score of 20, which will
display most of the alignments. Other than to confirm that your command-line blat is working, there
is little use in perfectly replicating the web-based blat results. We advise deciding which
filtering parameters make the most sense for the experiment or analysis. Often these settings will
be different and more stringent than those of the web-based blat. With that in mind, use the
following settings to replicate the search results of the web-based blat:
faToTwoBit:
gfServer (this is how the UCSC web-based blat servers are configured):
- blat server (capable of PCR):
  gfServer start blatMachine portX -stepSize=5 -log=untrans.log database.2bit
- translated blat server:
  gfServer start blatMachine portY -trans -mask -log=trans.log database.2bit
For enabling DNA/DNA and DNA/RNA matches, only the host, port and twoBit files are needed. The
same port is used for both untranslated blat (gfClient) and PCR (webPcr). You'll need a separate
blat server on a separate port to enable translated blat (protein searches or translated searches
in protein-space).
gfClient:
- Set -minScore=0 and -minIdentity=0. This will result in some low-scoring,
generally spurious hits, but for interactive use it's sufficiently easy to ignore them (because
results are sorted by score) and sometimes the low-scoring hits come in handy.
standalone blat:
- blat search:
  blat -stepSize=5 -repMatch=2253 -minScore=0 -minIdentity=0 database.2bit query.fa output.psl
Notes on repMatch:
- The default setting for gfServer DNA matches is: repMatch = 1024 * (tileSize/stepSize).
- The default setting for blat DNA matches is: repMatch = 1024 (if tileSize=11).
- To get command-line results that are equivalent to web-based results, repMatch must be
  specified when using blat.
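The gfServer formula in the notes above explains where the -repMatch=2253 value in the standalone blat command comes from. A short Python sketch of the arithmetic follows; the function name is made up for this example, and rounding to the nearest integer is an assumption chosen because it reproduces 2253.

```python
def gfserver_default_rep_match(tile_size=11, step_size=5):
    """repMatch = 1024 * (tileSize / stepSize), rounded to a whole number."""
    # Assumption: round to the nearest integer; 1024 * 11 / 5 = 2252.8,
    # which rounds to the 2253 used in the standalone blat command.
    return round(1024 * tile_size / step_size)
```

With tileSize=11 and stepSize=5 this gives 2253. With stepSize equal to tileSize (non-overlapping tiles) it gives 1024, which matches the standalone blat default quoted above.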
For more information about how to replicate the score and percent identity matches displayed by our
web-based blat, please see this blat FAQ.
For more information on the parameters available for blat, gfServer, and gfClient, see the
blat specifications.
Using the -ooc flag
What does the -ooc flag do?
Using any -ooc option in blat, such as -ooc=11.ooc, speeds up searches in a way similar to
repeat-masking the sequence. The 11.ooc file contains sequences determined to be
over-represented in the genome sequence. To improve search speed, these sequences are not used when
seeding an alignment against the genome. For reasonably sized sequences, this will not create a
problem and will significantly reduce processing time.
By not using the 11.ooc file, you will increase alignment time, but will also slightly
increase sensitivity. This may be important if you are aligning shorter sequences or sequences of
poor quality. For example, if a particular sequence consists primarily of sequences in the
11.ooc file, it will never be seeded correctly for an alignment if the -ooc flag
is used.
In summary, if you are not finding certain sequences and can afford the extra processing time, you
may want to run blat without the 11.ooc file.
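The effect of skipping over-represented tiles can be illustrated with a small Python sketch. This is a toy model, not the .ooc file format or Blat's code: the function names, the tile size of 4 and the threshold are all assumptions chosen to keep the example short.

```python
from collections import Counter


def overused_tiles(target, tile_size, max_count):
    """Return the tiles that occur more than max_count times in the target."""
    counts = Counter(
        target[i:i + tile_size]
        for i in range(0, len(target) - tile_size + 1, tile_size)
    )
    return {tile for tile, n in counts.items() if n > max_count}


def seedable_tiles(query, tile_size, overused):
    """List the query tiles that may still be used to seed an alignment."""
    tiles = (query[i:i + tile_size] for i in range(len(query) - tile_size + 1))
    return [t for t in tiles if t not in overused]
```

For a toy target of 80 A bases followed by GATTACAGGCTA, with tile_size=4 and max_count=10, the only over-represented tile is AAAA. The query AAAAAAAA then has no seedable tiles at all, while AAAAGATTACAG keeps eight of its nine tiles. This is why a query made up mostly of over-represented sequence cannot be seeded, and why extra flanking sequence can rescue it.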
Replicating web-based Blat percent identity and score calculations
Using my own command-line Blat server, how can I replicate the percent identity and score
calculations produced by web-based Blat?
There is no option to command-line Blat that gives you the percent ID and the score. However, we
have created scripts that include the calculations:
- View the Perl script from the source tree: pslScore.pl
- View the corresponding C program: pslScore.c and associated library functions pslScore and
  pslCalcMilliBad in psl.c
See our FAQ on source code licensing and downloads for information on
obtaining the source.
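For readers who prefer Python, the sketch below transcribes the calculations as they are commonly described for pslScore and pslCalcMilliBad. It is based on one reading of those functions and should be checked against psl.c before being relied on. Two simplifications are assumptions of this example: the is_protein flag is passed in by the caller rather than detected from the alignment blocks, and is_mrna defaults to True.

```python
import math


def psl_score(match, mis_match, rep_match, q_num_insert, t_num_insert,
              is_protein=False):
    """Score: matches (repeats count half) minus mismatches and inserts."""
    size_mul = 3 if is_protein else 1
    return (size_mul * (match + (rep_match >> 1))
            - size_mul * mis_match - q_num_insert - t_num_insert)


def psl_calc_milli_bad(match, mis_match, rep_match, q_num_insert,
                       t_num_insert, q_start, q_end, t_start, t_end,
                       is_mrna=True, is_protein=False):
    """Badness of the alignment in parts per thousand."""
    size_mul = 3 if is_protein else 1
    q_ali_size = size_mul * (q_end - q_start)
    t_ali_size = t_end - t_start
    if min(q_ali_size, t_ali_size) <= 0:
        return 0
    size_dif = q_ali_size - t_ali_size
    if size_dif < 0:
        # For mRNA, a longer target span (introns) is not penalized.
        size_dif = 0 if is_mrna else -size_dif
    insert_factor = q_num_insert
    if not is_mrna:
        insert_factor += t_num_insert
    total = size_mul * (match + rep_match + mis_match)
    if total == 0:
        return 0
    penalty = (mis_match * size_mul + insert_factor
               + round(3 * math.log(1 + size_dif)))
    return (1000 * penalty) // total


def percent_identity(milli_bad):
    """Convert parts-per-thousand badness into percent identity."""
    return 100.0 - milli_bad * 0.1
```

As a worked case, a gapless DNA alignment with 95 matches and 5 mismatches over 100 bases of both query and target scores 95 - 5 = 90, has a badness of 50 per thousand, and therefore a percent identity of 95.0.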
Replicating web-based Blat "I'm feeling lucky" search results
How do I generate the same search results as web-based Blat's "I'm feeling lucky"
option using command-line blat?
The code for the "I'm feeling lucky" Blat search orders the results based on the sort
output option that you selected on the query page. It then returns the highest-scoring alignment of
the first query sequence.
If you are sorting results by "query, start" or "chrom, start", generating the
"I'm feeling lucky" result is straightforward: sort the output file by these columns, then
select the top result.
To replicate any of the sort options involving score, you first must calculate the score for each
result in your PSL output file, then sort the results by score or other combination (e.g.
"query, score" and "chrom, score"). See the section on
Replicating web-based Blat percent identity and score calculations for
information on calculating the score.
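As a rough Python sketch of the score-based case, the code below reads PSL rows, computes a DNA score for each, and returns the best-scoring row for the first query in the file. The function names are made up for this example. It assumes DNA alignments, assumes that data rows are the lines starting with a digit, and takes "first query" to mean the first query name that appears in the output.

```python
def psl_rows(lines):
    """Yield the tab-separated fields of each PSL data row, skipping headers."""
    for line in lines:
        # Assumption: data rows start with the numeric match count.
        if line[:1].isdigit():
            yield line.rstrip("\n").split("\t")


def dna_score(row):
    """DNA score from PSL columns: matches, misMatches, repMatches, inserts."""
    match, mis_match, rep_match = int(row[0]), int(row[1]), int(row[2])
    q_num_insert, t_num_insert = int(row[4]), int(row[6])
    return match + rep_match // 2 - mis_match - q_num_insert - t_num_insert


def lucky_hit(lines):
    """Return the highest-scoring row for the first query in the file."""
    rows = list(psl_rows(lines))
    if not rows:
        return None
    first_query = rows[0][9]  # column 10 of PSL is the query name
    return max((r for r in rows if r[9] == first_query), key=dna_score)
```

Typical use would be to pass an open PSL file to lucky_hit and read the target name, start and end from columns 14, 16 and 17 of the returned row.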
Alternatively, you can try filtering your Blat PSL output using either the pslReps or pslCDnaFilter
program available in the Genome Browser source code. For information on obtaining the source code,
see our FAQ on source code licensing and downloads.
Using Blat for short sequences with maximum sensitivity
How do I configure blat for short sequences with maximum sensitivity?
Here are some guidelines for configuring standalone blat and gfServer/gfClient for these
conditions:
- The formula to find the shortest query size that will guarantee a match (if matching tiles are not
  marked as overused) is: 2 * stepSize + tileSize - 1
  For example, with stepSize set to 5 and tileSize set to 11, matches of query size
  2 * 5 + 11 - 1 = 20 bp will be found if the query matches the target exactly.
  The stepSize parameter can range from 1 to tileSize.
  The tileSize parameter can range from 6 to 15. For protein, the range starts lower.
  For minMatch=1 (e.g., protein), the minimum guaranteed match length is:
  1 * stepSize + tileSize - 1
  Note: There is also a "minimum lucky size" for hits. This is the smallest possible hit that BLAT
  can find. This minimum lucky size can be calculated using the formula: stepSize + tileSize.
  For example, if we use a tileSize of 11 and stepSize of 5, hits smaller than 16 bases won't be
  reported.
- Try using -fine.
- Use a large value for repMatch (e.g. -repMatch=1000000) to reduce the chance of a tile being
  marked as over-used.
- Do not use an .ooc file.
- Do not use -fastMap.
- Do not use masking command-line options.
The above changes will make BLAT more sensitive, but will also slow the speed and increase the
memory usage. It may be necessary to process one chromosome at a time to reduce the memory
requirements.
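The size formulas from the first guideline above are easy to tabulate when choosing stepSize and tileSize. The Python sketch below does so. The function names are made up for this example, and writing the guaranteed size as minMatch * stepSize + tileSize - 1 is a generalization assumed from the two cases given (a factor of 2 for the DNA case and 1 for minMatch=1).

```python
def min_guaranteed_query_size(step_size, tile_size, min_match=2):
    """Shortest exact-match query that is guaranteed to be found."""
    if not 1 <= step_size <= tile_size:
        raise ValueError("stepSize must be between 1 and tileSize")
    # Assumption: the leading factor in the formula is minMatch.
    return min_match * step_size + tile_size - 1


def min_lucky_size(step_size, tile_size):
    """Smallest hit that can be found at all: stepSize + tileSize."""
    if not 1 <= step_size <= tile_size:
        raise ValueError("stepSize must be between 1 and tileSize")
    return step_size + tile_size
```

With stepSize=5 and tileSize=11 these give 20 bp for the guaranteed size, 15 for minMatch=1, and 16 for the minimum lucky size, in line with the worked examples above. Lowering stepSize to 1 brings the guaranteed size for the same tileSize down to 12.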
A note on filtering output: increasing the -minScore parameter value beyond one-half of the query
size has no further effect. Therefore, use either the pslReps or pslCDnaFilter program available in
the Genome Browser source code to filter for the size, score, coverage, or quality desired. For
information on obtaining the source code, see our FAQ on source code licensing and downloads.