The ipython notebook below contains everything (including the download and indexing of data BAM files and the reference fasta file) to run IsoMut on some of the samples described in the article. After slight modifications to input and output directories, it can be run on any data, whenever multiple isogenic samples are available.
The dataset used below is a reduced version of the one used in the article and is available online at the website genomics.hu. The BAM files of 10 samples contain only a 10 MB length portion of the first chromosome of the genome, allowing for a rapid analysis for testing purposes. Also, the Gallus gallus reference genome in fasta format can be downloaded from ensembl.org.
Besides identifying unique mutations with predefined filters, this notebook also demonstrates the tuning of the mutation quality score (score), when control samples are available. In these samples no unique, treatment-induced mutations should be present, thus the score can be tuned by minimizing the number of detected mutations in these samples, while maintaining satisfyingly high numbers in samples that underwent strong mutagenic treatments. The tuning procedure can be individually carried out for SNVs, insertions and deletions, achieving optimized results for all types of mutations.
The value of the score is related to the probability of correctly categorizing a candidate mutation as a true positive. More precisely, the probability p of incorrectly detecting a difference in the genotypes of the two noisiest samples in the given position can be calculated by Fisher’s exact test on the two samples. This gives the probability of being wrong to state that the reads found in the two noisiest samples are from different nucleotide distributions. The score is calculated as the negative logarithm of p. Thus a higher score value gives a ‘surer’ mutation.
The notebook contains shell and python commands combined.
Note The score optimisation for insertions and deletions are skipped in the notebook below, due to the lack of sufficiently high numbers of indels in the dataset. However, the commands are included in the notebook and can be run readily on any data very similarly to the score optimisation process for SNVs.
Warning We strongly advise against the biological interpretation of the data below. As IsoMut was run on 10 samples only and on a very short portion of the whole genome, any conclusions drawn from the results are unreliable. The notebook below serves demonstration purposes only.
import pandas as pd
import numpy as np
import os
import subprocess
import matplotlib.pyplot as plt
%matplotlib inline
%%bash
wget -q http://www.genomics.hu/tools/isomut/IsoMut_example_dataset.tar.gz
tar -zxf IsoMut_example_dataset.tar.gz
for file in IsoMut_example_dataset/*
do
gunzip $file
done
for file in IsoMut_example_dataset/*
do
samtools index $file
done
%%bash
mkdir chrom1_ref_fasta
cd chrom1_ref_fasta
wget -q ftp://ftp.ensembl.org/pub/release-84/fasta/gallus_gallus/dna/Gallus_gallus.Galgal4.dna.chromosome.1.fa.gz
gunzip Gallus_gallus.Galgal4.dna.chromosome.1.fa.gz
samtools faidx Gallus_gallus.Galgal4.dna.chromosome.1.fa
cd ..
%%bash
########################################
# Download from git
git clone https://github.com/genomicshu/isomut.git
cd isomut/src
# compile
gcc -c -O3 isomut_lib.c fisher.c -W -Wall
gcc -O3 -o isomut isomut.c isomut_lib.o fisher.o -lm -W -Wall
cd ../..
import os
os.chdir('isomut/')
os.mkdir('output')
%%writefile isomut_example_script.py
#!/usr/bin/env python
#################################################
# importing the wrapper
#################################################
import sys,os
#add path for isomut_wrappers.py
# if not running it from the isomut directory
# change os.getcwd for the path to it
sys.path.append(os.getcwd()+'/src')
#load the parallel wrapper function
from isomut_wrappers import run_isomut
#add path for isomut, if its in the path comment/delete this line
# if not running it from the isomut directory
# change os.getcwd for the path to it
os.environ["PATH"] += os.pathsep + os.getcwd() +'/src'
#################################################
# defining administrative parameters
#################################################
#using parameter dictionary, beacause there are awful lot of parameters
params=dict()
#minimum number of blocks to run
# usually there will be 10-20 more blocks
params['n_min_block']=100
#number of concurrent processes to run
params['n_conc_blocks']=4
#genome
params['ref_fasta']="/".join(os.getcwd().split("/")[:-1])+'/chrom1_ref_fasta/Gallus_gallus.Galgal4.dna.chromosome.1.fa'
#input dir output dir
params['input_dir']="/".join(os.getcwd().split("/")[:-1])+'/IsoMut_example_dataset/'
params['output_dir']='output/'
#the bam files used
samples=['S07','S09','S12','S13','S15','S22','S24','S27','S28','S30']
params['bam_filenames']=[sample+'_reduced.bam' for sample in samples ]
#limit chromosomes (for references with many scaffolds)
# just comment/delete this line if you want to analyze all contigs in the ref genome
#params['chromosomes']=map(str,range(1,29))+ ['32','W','Z','MT']
params['chromosomes']=['1']
#################################################
# defining mutation calling parameters
# default values here ...
#################################################
params['min_sample_freq']=0.21
params['min_other_ref_freq']=0.93
params['cov_limit']=5
params['base_quality_limit']=30
params['min_gap_dist_snv']=0
params['min_gap_dist_indel']=20
#################################################
# and finally run it
#################################################
run_isomut(params)
%%bash
chmod +x isomut_example_script.py
./isomut_example_script.py
%%bash
head -n1 output/all_SNVs.isomut > all_output.csv
cat output/all_SNVs.isomut | grep -v sample >> all_output.csv
cat output/all_indels.isomut | grep -v sample >> all_output.csv
output=pd.read_csv('all_output.csv',sep='\t',header=0)
##define sample groups
samples=['S07','S09','S12','S13','S15','S22','S24','S27','S28','S30']
control_idx=[2,4]+ [7,9]
weak_idx=[0,5]
strong_idx=[1,3,6,8]
def plot_tuning_curve(output,ymax):
#set cols
cols=['lightgreen' for i in xrange(10)]
for i in control_idx:
cols[i]='dodgerblue'
for i in weak_idx:
cols[i]='salmon'
fig,ax=plt.subplots()
fig.set_size_inches(12,9)
for i,col in zip(range(10),cols):
score=output[(output['#sample_name']==samples[i]+'_reduced.bam')].sort(['score'])['score']
ax.plot(score,len(score)-np.arange(len(score)),c=col,lw=4,label='')
ax.set_xlabel(r'score threshold',fontsize=16)
ax.set_ylabel(r'Mutations found',fontsize=16)
ax.set_ylim(1,ymax)
ax.set_xlim(0,4)
ax.grid()
dump=ax.legend(loc='upper right',fancybox='true',fontsize=16)
fig,ax=plt.subplots()
fig.set_size_inches(12,9)
for i,col in zip(range(30),cols):
score=output[(output['#sample_name']==samples[i]+'_reduced.bam')].sort(['score'])['score']
ax.plot(score,len(score)-np.arange(len(score)),c=col,lw=4,label='')
ax.set_xlabel(r'score threshold',fontsize=16)
ax.set_ylabel(r'Mutations found',fontsize=16)
ax.set_ylim(1,ymax)
ax.set_xlim(0,4)
ax.set_yscale('log')
ax.grid()
#dump=ax.legend(loc='upper right',fancybox='true',fontsize=16)
plot_tuning_curve(output[output['type']=='SNV'],50)
plot_tuning_curve(output[output['type']=='INS'],30)
plot_tuning_curve(output[output['type']=='DEL'],40)
def plot_roc(output,score0,score_sel):
fig,ax=plt.subplots()
fig.set_size_inches(12,9)
fp, tp = [0 for i in xrange(100) ],[0 for i in xrange(100) ]
for score_lim,j in zip(np.linspace(score0,10,100),range(100)):
muts=[]
for i in xrange(10):
filt_idx = output['#sample_name'] == samples[i]+'_reduced.bam'
filt_idx = filt_idx & ((output['score']>score_lim))
muts.append(len(output[filt_idx]))
muts=np.array(muts)
fp[j] ,tp[j]=1e-3*np.mean(muts[control_idx]),1e-3*np.mean(muts[weak_idx+strong_idx])
ax.step(fp,tp,c='magenta',lw=4,label='quasy ROC, scannning the tuning parameter')
muts=[]
for i in xrange(10):
filt_idx = output['#sample_name'] == samples[i]+'_reduced.bam'
filt_idx = filt_idx & ((output['score']>score_sel))
muts.append(len(output[filt_idx]))
muts=np.array(muts)
ax.plot(1e-3*np.mean(muts[control_idx]),1e-3*np.mean(muts[weak_idx+strong_idx]),
'o',mec='dodgerblue',mfc='dodgerblue',ms=15,mew=3,label='selected parameter')
ax.legend(fancybox=True,loc='lower right',fontsize=16)
#ax.set_xlim(0,30)
ax.set_ylim(ymin=0)
ax.set_xlim(xmin=-0.0001)
ax.set_xlabel('false positive rate 1/Mbp ',fontsize=16)
dump=ax.set_ylabel('mutation rate 1/Mbp ',fontsize=16)
plot_roc(output[output['type']=='SNV'],score0=1.5,score_sel=2)
plot_roc(output[output['type']=='INS'],score0=0.5,score_sel=1)
plot_roc(output[output['type']=='DEL'],score0=0.5,score_sel=1.35)
#define plotting func
def plot_mutres(table):
#group mutations per samples
sample_counts=table.groupby(['#sample_name']).count().reset_index()[['#sample_name','mut']]
sample_counts.columns=['sample','count']
#add zeroes if a sample is missing
for i in samples:
if (not i+'_reduced.bam' in set(sample_counts['sample'])):
sample_counts=pd.concat([sample_counts,pd.DataFrame({'sample':[i+'_reduced.bam'], 'count' : [0]})])
sample_counts=sample_counts.sort('sample').reset_index()[['sample','count']]
#plot
fig,ax=plt.subplots()
fig.set_size_inches(12,9)
#starting clones and controls
ax.bar(control_idx,[sample_counts.loc[i,'count'] for i in control_idx],
facecolor='dodgerblue',edgecolor='none',label='starting clone and controls')
#weak treatment
ax.bar(weak_idx,[sample_counts.loc[i,'count'] for i in weak_idx],
facecolor='salmon',edgecolor='none',label='weak mutagenic treatment')
#strong treatment
ax.bar(strong_idx,[sample_counts.loc[i,'count'] for i in strong_idx],
facecolor='lightgreen',edgecolor='none',label='strong mutagenic treatment')
#samples labels
names=samples
ax.set_xticks(0.4+np.arange(len(names)))
ax.set_xticklabels(names,rotation='vertical',fontsize=14)
#axis, and legend
ax.set_xlabel(r'samples',fontsize=18)
ax.set_ylabel(r'Mutations detected',fontsize=18)
dump=ax.legend(loc='best',fancybox='true',fontsize=16)
#print the table
sample_counts['code']=names
return sample_counts[['code','count']]
plot_mutres(output[(output['type']=='SNV' ) & (output['score']> 2)] )
plot_mutres(output[(output['type']=='INS' ) & (output['score']> 1)] )
plot_mutres(output[(output['type']=='DEL' ) & (output['score']> 1.35)] )