CloneQC: Lightweight Sequence Verification For Synthetic Biology
Title: CloneQC: lightweight sequence verification for synthetic biology
Published in: Nucleic Acids Research
Volume: 38, Issue: 8, Pages: 2617-2623
Sequence verification is an important step in the gene synthesis process. Creating a plasmid or chromosome from a series of smaller building blocks introduces many opportunities for error, so sequence verification as a quality control is relevant throughout the process, including after the building blocks have been made and after they have been combined through PCR. Although a number of software tools provide the algorithms needed to verify sequences, most do not combine the steps into a turn-key solution, so users frequently have to perform some level of manual analysis, a tedious procedure that itself introduces the opportunity for error.
CloneQC was one of the first applications to provide a one-stop solution for verification of sequencing results against target reference sequences. Users could submit two trace files (one forward, one reverse) per physical sample to be verified against a series of one or more reference sequences. The algorithm behind the tool was a novel approach:
1. The user uploads a zipped file of the trace files (or reads with quality scores) that they want to evaluate, along with a FASTA file containing all of the target sequences they want to compare against. The trace files are expected to follow the naming convention <clone_id>_F.ab1 and <clone_id>_R.ab1, where <clone_id> uniquely relates the pair to the physical sample from which the trace files were derived.
2. The FASTA file of target sequences is used to generate a database for BLAST searches.
3. The reads from the trace files are queried against the database built in the previous step using BLAST. A quality control step confirms that if one trace file is a BLAST match with a target sequence, the related trace file is also a BLAST match against the same target sequence.
4. The two traces are then aligned using the ClustalW algorithm. Hanging ends are trimmed from the final alignment.
5. The alignment is parsed to look for discrepancies. For each “column”, the following is assessed:
   - if both reads agree with the target sequence at a particular column, then the column is considered a match;
   - if only one read is available and it agrees with the target with a high quality score (PHRED quality score >= 25), then it is still considered a match;
   - if both reads disagree with the target sequence at a given column, then the column is assumed to be a synthesis error and is marked as a mismatch;
   - if the two reads disagree with each other, and the one that agrees with the target sequence has a high quality score at that point while the one that disagrees has a low quality score, then it is considered a sequencing error and is treated as a match;
   - if the read with the high quality score is the one that disagrees with the target, then the column is marked as a mismatch;
   - if neither (or both) of the reads have high quality scores, then the column is marked as a check.
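The per-column rules above can be sketched in a few lines of Python. This is only my reading of the rules as described, not CloneQC's actual code: the names `Call`, `classify_column` and `HIGH_QUALITY` are hypothetical, and the handling of cases the description leaves open (a single read that disagrees with the target or has a low quality score, or no read at all) is my own assumption that they fall back to a check.

```python
from dataclasses import dataclass
from typing import Optional

HIGH_QUALITY = 25  # PHRED threshold quoted in the description above


@dataclass
class Call:
    """One base call from one read at one alignment column (hypothetical type)."""
    base: str
    quality: int


def classify_column(target: str, fwd: Optional[Call], rev: Optional[Call]) -> str:
    """Return 'match', 'mismatch' or 'check' for a single alignment column."""
    reads = [r for r in (fwd, rev) if r is not None]

    if len(reads) == 2:
        agrees = [r.base == target for r in reads]
        if all(agrees):
            return "match"      # both reads agree with the target
        if not any(agrees):
            return "mismatch"   # both disagree: assumed synthesis error
        # Exactly one read agrees with the target; quality decides.
        good, bad = (reads[0], reads[1]) if agrees[0] else (reads[1], reads[0])
        good_hq = good.quality >= HIGH_QUALITY
        bad_hq = bad.quality >= HIGH_QUALITY
        if good_hq and not bad_hq:
            return "match"      # treated as a sequencing error
        if bad_hq and not good_hq:
            return "mismatch"   # the trusted read disagrees with the target
        return "check"          # neither or both reads are high quality

    if len(reads) == 1:
        read = reads[0]
        if read.base == target and read.quality >= HIGH_QUALITY:
            return "match"
        return "check"          # ASSUMPTION: not specified in the description

    return "check"              # ASSUMPTION: no coverage at this column
```

For example, `classify_column("A", Call("A", 30), Call("C", 10))` is treated as a match, while swapping the two quality scores turns the same column into a mismatch.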
Using the results from Step 5 above, a summary assessment is calculated as follows:
- Pass: All of the columns in the alignment were a match.
- Check: All of the columns in the alignment were either a match or a check.
- Fail: At least one column was a mismatch.
- Fixable: If the alignment was a fail, but there are no more than 6 mismatches and they all occur within 20 bp of an end, then reamplification could fix the error, so it is considered fixable.
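The summary assessment could be sketched roughly as follows. Again, this is an illustration under stated assumptions rather than the tool's real implementation: the function name `summarize` and the string labels are hypothetical, and I am reading "within 20 bp of an end" as "within the first or last 20 columns of the trimmed alignment".

```python
from typing import List

MAX_FIXABLE_MISMATCHES = 6  # "no more than 6 mismatches"
END_WINDOW = 20             # "within 20 bp of an end"


def summarize(columns: List[str]) -> str:
    """Reduce per-column labels ('match', 'check', 'mismatch') to one verdict."""
    n = len(columns)
    mismatches = [i for i, label in enumerate(columns) if label == "mismatch"]

    if mismatches:
        # ASSUMPTION: "near an end" means the first or last END_WINDOW columns.
        near_end = all(i < END_WINDOW or i >= n - END_WINDOW for i in mismatches)
        if len(mismatches) <= MAX_FIXABLE_MISMATCHES and near_end:
            return "fixable"
        return "fail"
    if "check" in columns:
        return "check"
    return "pass"
```

Note that a mismatch always takes priority over a check here, so an alignment with both can only come out as fail or fixable.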
CloneQC’s results were compared to human expert assessments for 133 clones. For the 125 clones with definitive assessments, CloneQC was found to have a high success rate of 98%.
The Peccoud Lab has its own sequence verification pipeline, GenoREAD, so we would agree that the problem of sequence verification should not be understated. Overall, CloneQC seemed to have a pretty good algorithm, and a lot of consideration went into its component parts (e.g., ClustalW was evaluated against MUSCLE for the alignment, and was approved only after the pros and cons of each had been considered). However, I did have a few comments and observations.
- The article states that it would be easy to extend the algorithm so that more than two trace files could be compared to the target sequences. If so, why didn’t they go ahead and build that in? After all, an algorithm that supports only up to two trace files per target sequence will not be able to handle sequences longer than two sequencing reads (and since it can’t make definitive assessments in some cases where only one read covers a column, even that may be too long). Within the constraints of the algorithm as written, I can see how extending it to support a variable number of reads might actually be pretty complicated.
- Using BLAST to match trace files to target sequences is an interesting approach; it could work to the users’ advantage if there is a chance of pipetting error (i.e., wrong sample in the wrong well) or sample cross-contamination. If the samples come from different plates, though, it may not be all that helpful. There is also the risk that a trace could match multiple target sequences according to BLAST; how does CloneQC handle that situation?
Finally, I was disappointed to see that CloneQC does not seem to be available from the indicated website, nor was I able to find a site where I could download the software. There seem to be a number of positive comments about CloneQC out on the web, so it would be nice to be able to try it out.