Polishing Long Reads With Illumina Short Reads: A Comprehensive Guide
Assembling genomes from long reads has revolutionized genomics, allowing us to piece together complex regions that were previously inaccessible with short-read sequencing. However, long-read assemblies often contain errors, primarily insertions and deletions (indels), which necessitate polishing with higher-accuracy data. Polishing long-read assemblies with short paired-end (PE) reads is a common strategy to improve the base-level accuracy of the final assembly; it corrects the consensus sequence but does not join contigs, so contiguity is largely unchanged. This article provides a detailed guide on how to polish a long-read assembly using Illumina short PE reads, covering the key steps and considerations involved.
Understanding the Polishing Process
Genome polishing is a crucial step in the assembly pipeline, aiming to correct errors present in the initial assembly. Long-read sequencing technologies, such as those from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), generate reads that span tens of thousands of base pairs, enabling the resolution of repetitive regions and complex genomic structures. However, these long reads have a higher error rate compared to short reads generated by Illumina sequencing. Therefore, hybrid assembly approaches, combining the advantages of both long and short reads, are often employed.
The polishing process typically involves aligning short reads to the long-read assembly and then using the alignment information to correct errors in the consensus sequence. Tools such as Racon and Pilon can do this with short reads. Medaka is often mentioned alongside them, but it polishes with the ONT long reads themselves rather than with Illumina data. Each tool employs different algorithms and strategies for error correction, but the underlying principle of short-read polishing remains the same: leveraging the high accuracy of short reads to improve the quality of the long-read assembly.
Addressing the Key Questions
Let's address the specific questions raised about using Racon for polishing, as it exemplifies the general workflow applicable to most polishing tools.
1. Contigs = Long Read Assembly File?
Yes, the contigs input for Racon (and similar polishing tools) refers to your long-read assembly file. This file typically contains the assembled contigs or scaffolds in FASTA or FASTQ format. These contigs represent the initial assembly that you want to improve by polishing with short reads. It's crucial that this file is in the correct format and contains the complete assembly you intend to polish.
2. Reads = Merged PE Short Reads?
Yes, with one caveat. The reads input is your short paired-end reads, but Racon takes a single reads file, so the two mate files have to be combined into one file rather than supplied separately. Read names must match those in the alignment file, so mates that share a name should be given distinct names before aligning. Note that "merging" in the sense of tools like FLASH (joining overlapping mates into one longer read) is a different operation and is not required here. The reads should be in FASTA or FASTQ format. The high accuracy of these short reads is what allows for effective error correction in the long-read assembly. Before using them for polishing, it's often beneficial to perform quality control steps, such as adapter trimming and quality filtering, to ensure the reads are as clean as possible.
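One way to build such a single reads file is sketched below. It is a minimal example, not part of Racon: the function names `tag_mate` and `combine_pairs` are hypothetical, and it assumes uncompressed FASTQ with exactly four lines per record. An underscore suffix is used because some aligners are known to strip a trailing /1 or /2 from read names, which would leave the names in the alignment out of step with the reads file.

```bash
# Hypothetical helper: print one FASTQ file with a mate suffix ("_1" or "_2")
# appended to every read name.
# Assumes uncompressed FASTQ with exactly four lines per record.
tag_mate() {
    local fastq=$1 mate=$2
    awk -v mate="$mate" '
        FNR % 4 == 1 {
            # Keep only the first word of the header, drop an existing
            # /1 or /2, then append the requested mate suffix.
            name = $1
            sub(/\/[12]$/, "", name)
            print name "_" mate
            next
        }
        { print }
    ' "$fastq"
}

# Hypothetical helper: combine both mate files into one file whose read
# names are unique.
combine_pairs() {
    local r1=$1 r2=$2 out=$3
    { tag_mate "$r1" 1; tag_mate "$r2" 2; } > "$out"
}
```

A call such as `combine_pairs read1.fastq read2.fastq reads.fastq` would then produce the file to align and to pass to Racon.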
3. Overlaps = Alignment of Short PE Reads to the Long Read Assembly File?
Correct. The overlaps input represents the alignment of your short PE reads to the long-read assembly. This alignment information is crucial for the polishing tool to identify discrepancies between the short reads and the long-read contigs. Racon reads overlaps in SAM (Sequence Alignment/Map), PAF (Pairwise mApping Format), or MHAP format; it does not read BAM (Binary Alignment/Map), which is what tools such as Pilon expect instead. Popular aligners for this purpose include Bowtie 2, BWA-MEM, and Minimap2. The choice of aligner can influence the accuracy and speed of the polishing process, so selecting an appropriate aligner is essential.
Step-by-Step Guide to Polishing with Illumina Short Reads
To effectively polish your long-read assembly, follow these steps:
1. Data Preparation
- Long-Read Assembly: Ensure your long-read assembly is in FASTA or FASTQ format. Verify the assembly statistics (N50, contig lengths, etc.) to understand the initial assembly quality.
- Short-Read Data: Obtain your Illumina paired-end reads in FASTQ format. Perform quality control steps, including:
  - Adapter Trimming: Remove adapter sequences using tools like Trimmomatic or Cutadapt.
  - Quality Filtering: Filter out low-quality reads using tools like Trimmomatic or Sickle. This step ensures that only high-quality reads are used for polishing, improving the accuracy of the final assembly.
  - Optional: Merge Paired-End Reads: Tools like FLASH or PEAR merge overlapping mates into single, longer reads. This is distinct from simply combining the two mate files into one, and whether it helps depends on the polishing tool.
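As a quick way to record the starting point before polishing, the sketch below computes contig count, total length, and N50 from a FASTA file using only awk and sort. The function name `assembly_stats` is hypothetical, and the script assumes an uncompressed FASTA; a dedicated tool such as QUAST reports far more.

```bash
# Hypothetical helper: print contig count, total length, and N50 of a FASTA file.
# Assumes an uncompressed FASTA; sequence lines may be wrapped.
assembly_stats() {
    local fasta=$1
    awk '
        /^>/ { if (len > 0) print len; len = 0; next }   # new contig: emit previous length
        { len += length($0) }                            # accumulate wrapped sequence lines
        END { if (len > 0) print len }
    ' "$fasta" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            # N50: length of the contig at which the running sum of the
            # length-sorted contigs first reaches half of the total.
            half = total / 2
            for (i = 1; i <= NR; i++) {
                running += lens[i]
                if (running >= half) { n50 = lens[i]; break }
            }
            printf "contigs=%d total_bp=%.0f N50=%d\n", NR, total, n50
        }
    '
}
```

Running `assembly_stats long_read_assembly.fasta` before and after polishing gives a simple check that contig structure has not changed unexpectedly.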
2. Alignment of Short Reads to the Long-Read Assembly
- Choose an Aligner: Select an appropriate aligner for mapping short reads to the assembly. BWA-MEM and Bowtie 2 are commonly used for this purpose. Minimap2 is best known for long-read alignment, but it also maps short reads to an assembly when run with its short-read preset (-ax sr).
- Align Reads: Align the quality-controlled short reads to the long-read assembly using the chosen aligner. For example, using BWA-MEM, which needs an index of the assembly first:
bwa index <long_read_assembly.fasta>
bwa mem -t <threads> <long_read_assembly.fasta> <read1.fastq> <read2.fastq> > alignment.sam
These commands index the assembly, align the short reads to it, and write the alignment in SAM format.
- Convert to BAM (Optional): Convert the SAM file to BAM format for efficient storage and indexing:
samtools view -bS alignment.sam > alignment.bam
samtools sort alignment.bam -o alignment.sorted.bam
samtools index alignment.sorted.bam
These commands convert the SAM file to BAM, sort the BAM file, and create an index file. Pilon needs the sorted, indexed BAM; Racon works from the SAM file directly.
3. Polishing with Racon
- Install Racon: If you haven't already, install Racon. You can typically install it via conda or by compiling from source.
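- Run Racon: Use Racon to polish the assembly. The basic command structure is:
racon -t <threads> <reads.fastq> <alignment.sam> <long_read_assembly.fasta> > polished_assembly.fasta
This command takes the short reads, the alignment file, and the long-read assembly as input and outputs the polished assembly in FASTA format. The reads file must contain every read named in the alignment file under the same name, so for paired-end data pass Racon the same combined reads file that was aligned.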
- Iterative Polishing: Polishing is often an iterative process. Running Racon multiple times (e.g., 2-4 iterations) can further improve the assembly quality. Each iteration takes the previous round's output as its assembly, so the short reads must be re-aligned to that output before Racon is run again.
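The iterative workflow can be sketched as a small loop. This is an illustration under stated assumptions rather than a recommended pipeline: the function name `polish_rounds` and the `roundN` file names are hypothetical, `bwa` and `racon` are assumed to be on the PATH, and the reads are assumed to be a single combined file with unique read names.

```bash
# Hypothetical helper: run several Racon rounds, re-aligning the reads to the
# latest assembly before each round. Prints the path of the final assembly.
# Usage: polish_rounds <assembly.fasta> <reads.fastq> <rounds> [threads]
polish_rounds() {
    local assembly=$1 reads=$2 rounds=$3 threads=${4:-4}
    local current=$assembly
    local i=1
    while [ "$i" -le "$rounds" ]; do
        # Index and align against the assembly produced by the previous round.
        bwa index "$current" || return 1
        bwa mem -t "$threads" "$current" "$reads" > "round${i}.sam" || return 1
        # Racon argument order: reads, overlaps (SAM), target assembly.
        racon -t "$threads" "$reads" "round${i}.sam" "$current" \
            > "round${i}.fasta" || return 1
        current="round${i}.fasta"
        i=$((i + 1))
    done
    printf '%s\n' "$current"
}
```

For example, `polish_rounds long_read_assembly.fasta reads.fastq 3 8` would leave `round1.fasta` to `round3.fasta` in the working directory, which makes it easy to compare rounds and stop once the changes become marginal.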
4. Alternative Polishing Tools
- Pilon: Pilon is another popular polishing tool that uses short reads to correct errors in assemblies. It corrects single-base errors and small indels, and can also fill gaps and flag local misassemblies. Pilon requires a sorted, indexed BAM file as input.
- Medaka: Medaka is specifically designed for polishing long-read assemblies generated by Oxford Nanopore Technologies. It uses a neural-network model applied to the ONT reads themselves, not to Illumina data, so it is typically run before short-read polishing rather than in place of it.
5. Assembly Evaluation
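- Assess Assembly Quality: After polishing, evaluate the assembly quality using metrics such as N50, contig lengths, and the number of misassemblies. Tools like QUAST can be used for comprehensive assembly evaluation. Because polishing edits bases rather than joining contigs, expect N50 to stay roughly constant; the gains show up in base-level measures such as mismatch and indel rates.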
- Compare to a Reference Genome (If Available): If a reference genome is available, compare your polished assembly to the reference to assess its accuracy and completeness. This comparison can reveal misassemblies, structural variations, and other differences between the assembly and the reference.
Optimizing the Polishing Process
To achieve the best polishing results, consider the following optimizations:
1. Read Depth
- Ensure sufficient short-read coverage. A higher coverage (e.g., 50x or more) generally leads to better polishing results. Insufficient coverage may result in incomplete error correction, while excessive coverage can increase computational demands without significant improvements in accuracy.
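A rough depth estimate is simply total read bases divided by assembly length. The sketch below computes it with awk; the function name `estimate_coverage` is hypothetical, and it assumes uncompressed four-line FASTQ records and an uncompressed FASTA. It ignores unmapped reads and uneven coverage, so treat the result as an upper bound on usable depth.

```bash
# Hypothetical helper: estimate mean short-read depth as
# (total read bases) / (total assembly bases).
# Usage: estimate_coverage <assembly.fasta> <reads1.fastq> [reads2.fastq ...]
estimate_coverage() {
    local assembly=$1; shift
    [ "$#" -ge 1 ] || { echo "need at least one FASTQ file" >&2; return 1; }
    local asm_bp read_bp
    # Sum the lengths of all non-header FASTA lines.
    asm_bp=$(awk '!/^>/ { n += length($0) } END { printf "%.0f\n", n }' "$assembly")
    # The sequence is the second line of every four-line FASTQ record.
    read_bp=$(awk 'FNR % 4 == 2 { n += length($0) } END { printf "%.0f\n", n }' "$@")
    awk -v r="$read_bp" -v a="$asm_bp" 'BEGIN {
        if (a == 0) { print "assembly is empty" > "/dev/stderr"; exit 1 }
        printf "%.1f\n", r / a
    }'
}
```

For instance, `estimate_coverage long_read_assembly.fasta read1.fastq read2.fastq` prints a single number that can be compared against the 50x guideline above.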
2. Alignment Parameters
- Experiment with different alignment parameters to optimize the mapping of short reads to the assembly. For example, adjusting the minimum mapping quality or the gap penalties can affect the alignment accuracy and the subsequent polishing results.
3. Iterative Polishing
- Perform multiple rounds of polishing. Each iteration can correct additional errors, leading to a more accurate final assembly. However, be mindful of diminishing returns; after a few iterations, the improvements may become marginal.
4. Polishing Tool Selection
- Consider using a combination of polishing tools. Different tools employ different algorithms and may be better suited for correcting specific types of errors. For example, using Racon for initial polishing followed by Pilon for fine-tuning can be an effective strategy.
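A Pilon round on top of a Racon-polished assembly might look like the following sketch. The function name `pilon_round` and the file naming are hypothetical, and it assumes `bwa`, `samtools`, and a `pilon` launcher (as installed by conda; otherwise substitute `java -jar pilon.jar`) are on the PATH. With `--changes`, Pilon writes a list of the edits it made, and the number of lines in that file is a convenient signal for when further rounds stop paying off.

```bash
# Hypothetical helper: run one Pilon round and print how many changes it made.
# Usage: pilon_round <assembly.fasta> <read1.fastq> <read2.fastq> <prefix> [threads]
pilon_round() {
    local assembly=$1 r1=$2 r2=$3 prefix=$4 threads=${5:-4}
    # Pilon needs a sorted, indexed BAM of the reads aligned to this assembly.
    bwa index "$assembly" || return 1
    bwa mem -t "$threads" "$assembly" "$r1" "$r2" > "${prefix}.sam" || return 1
    samtools sort -@ "$threads" -o "${prefix}.sorted.bam" "${prefix}.sam" || return 1
    samtools index "${prefix}.sorted.bam" || return 1
    # Pilon writes <prefix>.fasta and, with --changes, <prefix>.changes.
    pilon --genome "$assembly" --frags "${prefix}.sorted.bam" \
        --output "$prefix" --changes || return 1
    wc -l < "${prefix}.changes"
}
```

A call such as `pilon_round round3.fasta read1.fastq read2.fastq pilon1 8` would leave the polished sequence in `pilon1.fasta`.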
5. Base Quality Recalibration
- If using Pilon, consider performing base quality recalibration on the short reads before polishing. This step can improve the accuracy of the short-read alignments and the subsequent polishing results.
Troubleshooting Common Issues
During the polishing process, you may encounter some common issues. Here are some tips for troubleshooting:
1. Low Polishing Improvement
- Check Read Coverage: Ensure sufficient short-read coverage. Low coverage may lead to incomplete error correction.
- Evaluate Alignment Quality: Assess the alignment quality. Poor alignment can result in inaccurate polishing.
- Try Different Polishing Tools: Experiment with different polishing tools or combinations of tools.
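To evaluate alignment quality without extra tools, the mapped fraction and mean mapping quality can be read straight from the SAM file. The sketch below is one way to do that with awk; the function name `sam_summary` is hypothetical, and `samtools flagstat` gives a fuller report when samtools is available.

```bash
# Hypothetical helper: summarise a SAM file as the fraction of reads mapped
# and the mean MAPQ of the mapped reads.
sam_summary() {
    local sam=$1
    awk -F '\t' '
        /^@/ { next }                          # skip header lines
        {
            flag = $2
            # Skip secondary (0x100) and supplementary (0x800) records so
            # each read is counted once.
            if (int(flag / 256) % 2 == 1 || int(flag / 2048) % 2 == 1) next
            total++
            if (int(flag / 4) % 2 == 1) next   # 0x4 set: read is unmapped
            mapped++
            mapq_sum += $5
        }
        END {
            if (total == 0) { print "no alignments found"; exit 1 }
            printf "reads=%d mapped=%d mapped_fraction=%.3f mean_mapq=%.1f\n",
                total, mapped, mapped / total, (mapped ? mapq_sum / mapped : 0)
        }
    ' "$sam"
}
```

A low mapped fraction or a low mean MAPQ on `alignment.sam` points at contamination, the wrong read set, or a highly repetitive assembly, any of which will limit what polishing can fix.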
2. Introduction of New Errors
- Use High-Quality Reads: Ensure that you are using high-quality short reads for polishing. Low-quality reads can introduce new errors into the assembly.
- Adjust Polishing Parameters: Experiment with different polishing parameters. Overly aggressive polishing settings can sometimes introduce errors.
3. Computational Resources
- Allocate Sufficient Resources: Polishing can be computationally intensive, especially for large genomes. Ensure that you have sufficient memory and processing power.
- Use Parallel Processing: Many polishing tools support parallel processing, which can significantly reduce the runtime. Utilize multi-threading options to speed up the process.
Conclusion
Polishing long-read assemblies with Illumina short PE reads is an essential step in generating high-quality genomes. By carefully preparing your data, aligning short reads to the long-read assembly, and using appropriate polishing tools, you can significantly improve the base-level accuracy of your assembly. Understanding the nuances of the polishing process and optimizing your workflow will help you get the most out of your data. Remember to evaluate your assembly after polishing to confirm the improvements and identify any remaining issues.