04叶绿体流程2-注释

0.一些个程序 路径

命令行注释工具 https://github.com/daisieh/plann 还没测试
/share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/
/share/nas1/xul/program/chloroplast/cds_analysis/src/

1.查找单个基因/share/nas6/xul/program/chloroplast/bin/cp_annotation_one_gene_by_ref_gbk2.pl
cp_annotation_one_gene_by_ref_gbk2.pl -i1 fasta/MW148820.1.fasta  -i2 ref/gbk/MH042531.1.gbk -g ndhf

2.蕨类生成gbk
perl /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/annotation/src/to_genbank_for_juelei.pl

3.显示gbk有多少基因
st ref.gbk

4.查找位置
/share/nas6/xul/program/chloroplast/bin/cp_find_seq_in_fa2.pl  -i /share/nas1/yuj/project/chloroplast/GP-20230717-6432_20230807/ref_ann/fasta/NC_008588.1.fasta -s  TGATTATTGGTGAAAGTCAAAAAATTGGTTTCCAACCAATTAAATATCAACAAAATGTTGATGGTTCGAATACATTAATATTAG -c
79688-79771:+;

一.Reference

参考挑选标准

cp_Find_fewer_genes.pl -i xxxx.ann # 查看关键信息
rps12是3个外显子
ycf1是2个基因
rpl2可能有3段,也就是有内含子,但是有时候两段中间没有碱基,这个时候三段其实就是两段
注意参考cds trna个数 做过的陆生植物一般37 38 39trna 86 87 88cds

0.ncbi在线blast比对

1.下载参考
perl /share/nas6/xul/program/chloroplast/bin/cp_get_genbank_form_ncbi_with_ACCESSION.pl -i   -o

2.挑选注释
cd ref_ann/gbk
2ann- # for i in *.gbk ;do echo $i; cp_gbk2ann.pl -i $i -o $i.ann ;done 的别名
grep rps12 *.ann # 筛选rps12有3个外显子

grep LOCUS * | grep 2023 # 简单根据年份筛选一下
## for i in *.gbk;do st $i;done # 查看数量
for i in *.gbk;do /share/nas1/yuj/program/chloroplast/annotation/statistic.sh $i;done
优先87   8  37,标注了pseudo更好
csplit -z -f part_ -b "%02d.gbk" sequence.gb '/^\/\/$/' '{*}'  && for file in part_*.gbk; do   sed -i '/^\/\/$/d;N;/^\n$/d' "$file";      locus=$(grep -m 1 "^LOCUS" "$file" | awk '{print $2}');      mv "$file" "${locus}.gbk"; done
/share/nas6/xul/program/chloroplast/bin/cp_gbk2fasta.pl -d gbk/ -o fasta

# 读取原始文件并按“//”分隔
csplit -z -f part_ -b "%02d.gbk" sequence.gb '/^\/\/$/' '{*}'  

# 删除每个新文件中的“//”分隔符行及其下面的一行空行
for file in part_*.gbk; do
  # 删除“//”分隔符行及其下面的一行空行
  sed -i '/^\/\/$/d;N;/^\n$/d' "$file"
  
  # 提取登录号
  locus=$(grep -m 1 "^LOCUS" "$file" | awk '{print $2}')
  
  # 重命名文件
  mv "$file" "${locus}.gbk"
done

二.注释

2.1 第一遍注释

1.配置文件,打开修改
python3 /share/nas1/yuj/script/chloroplast/get_ann_cfg.py # 在项目编号目录下直接运行
cp /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/cp.anno.config.yaml ann.cfg # 备用

2.注释
perl /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/annotation/bin/chloroplast_annotaion.pl.v2.1.pl -d analysis/assembly/ -i ann.cfg
藻类
## 藻类植物
perl /share/nas6/xul/program/chloroplast/annotation/Annotation.pl -i1 Phaeodactylum_tricornutum_FULLCP.fsa -i2 ref_ann/gbk/MN937452.1.gbk -i3 gene_anno/gene_annotaion.info.bak -o gene_anno/ -no -orf 1>gene_anno/anno.log 2>gene_anno/anno.log2

-no 不删除中间文件
-orf 使用最全的基因列表 /share/nas6/xul/program/chloroplast/annotation/ref_normal_gene_all_other.txt

2.2 (可选)根据参考修改注释

2.2.1 cds

批量blast
## 每组cds单独blast

mkdir ann_tmp 
1.提取出cds及对应蛋白序列
python3 /share/nas1/yuj/script/chloroplast/annotation/cp_from_gbk_get_cds.py -i ref_ann/gbk/ -o ref_ann/out -d
cp ref_ann/out/cds/*.fasta ann_tmp/ && python3 /share/nas1/yuj/script/mitochondrion/phytree/mt_extract2mafft_V1.5.py -i ann_tmp/ -o1 gene/extract -o2 gene/mafft -c2
停顿几秒等比对完
cp *.fsa ann_tmp && cd ann_tmp
for i in cds*.fasta;do ir.py -s7 -n 11 -i $i -o ${i%.fasta}.pep;done

2.利用核酸和蛋白序列blast比对
for i in cds*.fasta;do blastn -query $i  -subject *.fsa -outfmt 6 -max_hsps 1 > $i.log;done
for i in cds*.fasta;do tblastn -query ${i%.fasta}.pep -subject *.fsa -outfmt 6 -max_hsps 1 > ${i%.fasta}.pep.log;done
for i in cds_*.log;do echo $i;cut -f 1,7,8,9,10 $i > ${i%.log}.txt;done

3.进一步处理这些比对出的位置
## 比如这个是查找nd4基因
for i in cds*txt;do grep -i -v nd4l $i| grep -i nd4;done | cut -f 2,3,4,5 | sort -k 3

2.2.2 trna

1.
cp ref_ann/out/trna/*.fasta ann_tmp/ && cd ann_tmp

2.
for i in trna*.fasta;do blastn -task blastn-short -query $i  -subject *.fsa -outfmt 6 -max_hsps 1 > $i.log;done
for i in trna_*.log;do echo $i;cut -f 1,7,8,9,10 $i > ${i%.log}.txt;done

3.
## 单独找难找的trna
name="tRNA-Ala"
for i in trna*txt;do grep -i $name $i;done | cut -f 2,3,4,5 | sort -k 3
for i in trna*fasta;do grep -A 1 -i  $name  $i;done > $name.fa
fl  $name.fa
for i in trna*fasta;do grep -A 1 -i   $name  $i;done| grep -v ">"

2.2.3 rrna

1.
cp ref_ann/out/rrna/*.fasta ann_tmp && cd ann_tmp

2.
for i in rrna*.fasta;do blastn -query $i  -subject *.fsa -outfmt 6 -max_hsps 1 > $i.log;done
for i in rrna_*.log;do echo $i;cut -f 1,7,8,9,10 $i > ${i%.log}.txt;done

3.
for i in rrna*txt;do grep -i 16s $i;done | cut -f 2,3,4,5 | sort -k 3
## for循环直接全部处理了
for j in {rrsA,rrlA,rrfA,rrsB,rrlB,rrfB}; do echo $j;for i in rrna*txt; do grep -i  $j $i ; done | cut -f 2,3,4,5 | sort -k 3 ;done > ../allrrna.blast.txt && cd ../
1	1481	6276	7756
1	1481	6276	7756
从结果里挑选最符合的

2.2.4 单个基因

单个基因blast
## cds
gene=psbL && rm $gene.*
for i in out/cds/*;do grep -A 1 -i "\b$gene\b" $i >> $gene.all.fa ;done && ir.py -s7 -n 11 -i $gene.all.fa && blastn -query $gene.all.fa -subject *_FULLCP.fsa -outfmt 6 -max_hsps 1 >$gene.all.blast && tblastn -query $gene.pep -subject *_FULLCP.fsa -outfmt 6 -max_hsps 1 >>$gene.all.blast && cat $gene.all.blast  | cut -f 7,8,9,10 | sort  -k 3

## rrna
gene=rrlA && rm $gene.*
for i in out/rrna/*;do grep -A 1 -i "\b$gene\b" $i >> $gene.all.fa ;done && blastn -query $gene.all.fa -subject *_FULLCP.fsa -outfmt 6 -max_hsps 1 >$gene.all.blast  && cat $gene.all.blast  | cut -f 7,8,9,10 | sort  -k 3

## 下面这个没必要
for i in {1..20};do if [ $(($i%2)) -eq 0 ];then sed -n ""$i"p" ndhK_all.fa > $i"_ndhK";blastn -query $i"_ndhK" -subject *.fsa -outfmt 6;fi;done # 把每个物种的基因写入单个文件,然后blast

2.3 修改注释

new gene name no in ref_all:	chlB   1
new gene name no in ref_all:	chlL   1
new gene name no in ref_all:	chlN   1
cysA	Sulfate/thiosulfate import ATP-binding protein CysA
cysT	sulfate/thiosulfate import ATP-binding protein CysT
new gene name no in ref_all:	g7   1
new gene name no in ref_all:	tic214   1
new gene name no in ref_all:	ycf12   1
注释检查

  • trnH-GUG 没有内含子,去掉汇总后表格里的星号,出现该情况是因为起点原因导致此基因跨首尾多了;分号
  • 注意ir区基因是否对称,如3431里面trna-ile在ira区没了,只在irb有
  • ndhD、rpl2、psbL 有时候起始子不是常规起始子,转录时是把中间的c变成u/t,因此在注释结果里加标识符1以便通过流程
  • ycf15应该是2个或没有
  • rps19跨反向重复区很多的话,末尾有对应的假基因,这种情况下共2个,否则1个;也存在俩rps19基因的情况(如3937 exl龙血树 俩rps19完整位于反向重复区)
  • rps15应该有1个,也存在俩rps15(如4086 小黑麦 俩rps15完整位于反向重复区)
  • ycf1两个,一真5000一假1000
  • pafII改成ycf4
  • pbf1改成psbN
  • pafI改成ycf3
  • clpP1改成clpP
  • lhbA改成psbZ
  1. 如果没有注释结果,看看是不是给的gbk有问题,给成了ann文件
  2. correct_ann.log 对ir区的矫正不一定对,通过看注释结果
  3. anno.log2标准含错误输出,看看有没有没找到的基因,一般没有

通用步骤
0.xul版
# 取序列看蛋白
cp_add_gene_sequence.pl  -i *.fsa -p 74-1039:+  
cp_add_gene_sequence.pl  -s ATCGTTGGAACC            
# 查找单个基因
cp_annotation_one_gene_by_ref_gbk2.pl -i1 fasta/MW148820.1.fasta  -i2 ref/gbk/MH042531.1.gbk -g ndhf
blastn -query t  -subject *.fsa -outfmt 6 
tblastn -query t  -subject *.fsa -outfmt 6 

1.根据ann.log先去查看final,取序列看蛋白
cp_add.py  -i *.fsa -p 134809-135306:+
cp_add_gene_sequence.pl  -i *.fsa -p 

2.不对的话再查看tmp查找  查看基因排列
$ grep -i 'psbI' final_gene_annotaion.info2
$ for i in `grep -i 'psbI' final_gene_annotaion.info2 | awk '{print $2}'` ;do cp_add.py  -i *.fsa -p $i;done   > psbI_ann.log
# $ grep 'CDS' ref_ann/gbk/登录号.gbk.ann
# $ for i in `grep 'CDS' ref_ann/gbk/登录号.gbk.ann | awk '{print $2}'` ;do mt_add.py -n 5 -i ref_ann/fasta/登录号.fasta -p $i;done   > ref_ann.log
$ vim cox1
$ tblastn -query cox1 -subject *.fsa -outfmt 6

3.根据记录的笔记检查数量
提取假基因脚本或者gbk搜索pseudo
cp_get_some_gene_seq.pl  -i gbk/xxx.gbk -p  > pseudo.fa

4.检查结果
cp_change_annotation.v2.pl -i final_gene_annotaion.info2 -o final_gene_annotaion.info && cp_Find_fewer_genes.pl -i final_gene_annotaion.info

sed -i 's/rpl2\tribosomal protein L2/rpl2\tribosomal protein L2\t1/g' final_gene_annotaion.info
sed -i 's/psbL\tphotosystem II protein L/psbL\tphotosystem II protein L\t1/g' final_gene_annotaion.info


# 排序 perl /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/annotation/src/change_annotation.v2.pl -i final_gene_annotaion.info2 -o final_gene_annotaion.info			
# 检查 perl /share/nas6/xul/program/chloroplast/bin/cp_Find_fewer_genes.pl -i final_gene_annotaion.info

2.4 第二遍注释

perl /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/annotation/bin/chloroplast_annotaion.pl.v2.1.pl -i ann.cfg -d analysis/assembly/

/share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/annotation/src/to_genbank.pl # 默认
perl /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/annotation/src/to_genbank_for_juelei.pl # 蕨类生成gbk

未解之谜

---20240605 第一行解析错误,可能是样品名中间有空格,因为通常第一行用的登录号,不会出现空格

bp    DNA     circular     13-Jun-2023

     misc_feature    1..84350
                     /note="LSC"

     misc_feature    84351..114079
                     /note="IRb"

     misc_feature    114080..133964
                     /note="SSC"

     misc_feature    133965..163693
                     /note="IRa"

三.整理结果

perl /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/cp_pip_dir.pl -i analysis  -o complete_dir

:set nu  vim显示行数

四.生成报告

1.修改报告配置文件
cp /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/html_report/cp.report.cfg report.cfg

2.标准+高级分析
perl /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/html_report/report2xml.yelvti.pl -id complete_dir -cfg report.cfg

3.三代辅助组装
## 先单独生成3代统计
perl /share/nas6/xul/program/mt2/tgs_reads_statistic/multi_tgs_qc.pl -i analysis/samples.reads.txt -o complete_dir/01Rawdata/
# samples.reads.txt第四列填上3代路径

perl /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/html_report/report2xml.yelvti_sandai.pl -id complete_dir -cfg report.cfg

五.补充

注释
$ perl /share/nas6/xul/program/chloroplast/annotation/Annotation.pl -i1 /share/nas1/yuj/project/GP-20221018-5070_20221031/analysis/annotation/Corynandra_viscosa/gene_anno/Corynandra_viscosa_FULLCP.fsa -i2 /share/nas1/yuj/project/GP-20221018-5070_20221031/analysis/annotation/Corynandra_viscosa/gene_anno/NC_053524.1.gbk -i3 /share/nas1/yuj/project/GP-20221018-5070_20221031/analysis/annotation/Corynandra_viscosa/gene_anno/gene_annotaion.info.bak -o /share/nas1/yuj/project/GP-20221018-5070_20221031/analysis/annotation/Corynandra_viscosa/gene_anno -no  1>/share/nas1/yuj/project/GP-20221018-5070_20221031/analysis/annotation/Corynandra_viscosa/gene_anno/anno.log 2>/share/nas1/yuj/project/GP-20221018-5070_20221031/analysis/annotation/Corynandra_viscosa/gene_anno/anno.log2
# 会生成 _gene_annotaion.info2  _gene_annotaion.info 俩文件

5.0 ir区查找

for i in *.fasta;do perl /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/annotation/src/IR_SC_RegionFinder.pl  -i $i  -p ${i%.fasta}   -o ./  & done

5.1 rscu

5.1.1 rscu出问题

有可能是基因名字有更改,product没有

5.1.2 单样品绘图(有数字折线)

cd rscu_analysis
perl /share/nas6/xul/program/chloroplast/rscu/draw.rscu.nu.pl -i rscu.stat.xls  -o rscu.svg
svg2xxx -t png -dpi 600 rscu.svg && svg2xxx -t pdf -dpi 600 rscu.svg

5.1.3 单样品绘图(流程图无折线)

cd rscu_analysis
perl /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/annotation/src/plot_rscu_table.pl -i  rscu.stat.xls -o  rscu.plot.circle.svg
perl /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/annotation//src/plot.rscu.bar.pl -i  rscu.stat.xls -o rscu.plot.bar.svg

5.1.4 多样品热图

cd ref_adv/gbk
for i in *.gbk;do perl /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/annotation/bin/../src/extrac_cds.pl -i ${i%.gbk}.gbk -o features -p ${i%.gbk} &&  perl /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/annotation/bin/../src/rscu.pl -i1  features/${i%.gbk}.cds -i2  features/${i%.gbk}.pep -o  ${i%.gbk}.rscu.stat.xls  ;done

 python "E:\OneDrive\jshy信息部\Script\chloroplast\annotation\rscu_plot\4558_all_rscu_stat.py"   ./  #当前文件夹中的rscu统计合并成一个rscu统计
 tbtools导入数据,导入配置"E:\OneDrive\jshy信息部\Script\heatmap.cfg"绘图
#"E:\OneDrive\jshy信息部\Script\chloroplast\annotation\rscu_plot\heatmap.py" 热图
#"E:\OneDrive\jshy信息部\Script\chloroplast\annotation\rscu_plot\plot.py"折线图
#"E:\OneDrive\jshy信息部\Script\chloroplast\annotation\rscu_plot\plot_line.py"条形图

5.2 圈图

5.2.1 ogdraw

## 主程序
perl /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/annotation/bin/../etc/plot_ogdraw_cpgenome_genbank/ogdraw.pl -i Medicago_sativa.gbk -p Medicago_sativa -o ogdraw

## 圈图 png格式 dpi600
perl /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/annotation/etc/plot_ogdraw_cpgenome_genbank/etc/GeneMap-1.1.1/bin/drawgenemap --infile gbk/MT712154.1.gbk --format png --density 600 --outfile ogdraw/test.dpi600.circular.png --useconfig /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/annotation/etc/plot_ogdraw_cpgenome_genbank/conf/OGDraw_plastid_set.xml --ir_region 85073-111624,130260-156811

5.2.2 Chloroplot

1.linux分步骤(得先手动创建ogdraw目录)
$ Rscript /share/nas1/yuj/script/chloroplast/R/Chloroplot.R $outdir/$sample/ogdraw $outdir/$sample/gene_anno/$sample.gbk $sample && convert -density 300 $outdir/$sample/ogdraw/$sample.circular.pdf  $outdir/$sample/ogdraw/$sample.dpi300.circular.png
# 图片转换单拎出来
convert -density 300 *.circular.pdf  $sample.dpi300.circular.png

2.windows分步骤
# 以线粒体为例,cp/mt步骤一样
# Rscript E:\OneDrive\jshy信息部\Script\chloroplast\R\Win_Chloroplot.R $outdir/$sample/ogdraw $outdir/$sample/gene_anno/$sample.gbk $sample cp/mt
$ Rscript E:\OneDrive\jshy信息部\Script\chloroplast\R\Win_Chloroplot.R F:\  F:\4939_202301\Ustilago_esculenta_MT10.gbk Ustilago_esculenta_MT10 mt
$ 复制pdf到服务器
$ convert -density 600 *.circular.pdf  Ustilago_esculenta_MT10.dpi300.circular.png

3.圈图见4086项目txt(暂时不用)
圈图20220506
$ ir xxx.gbk
$ python3 /share/nas1/yuj/script/chloroplast/annotation/ogdraw.py -h
dpi300那张

4.流程里对这部分的文字描述
#-----4.3.2 叶绿体基因组图谱
&EMPTY_TAG('h3','4.3.2 叶绿体基因组图谱','','type1');
&EMPTY_TAG('p','使用Chloroplot(https://irscope.shinyapps.io/Chloroplot/)制作叶绿体基因组图谱,如下图所示:','type1');
if(glob "$indir/04Annotation/*/ogdraw/*dpi300.circular.png"){
	&piclist("图$pid 叶绿体基因组图谱",'注:正向编码的基因位于圈内侧,反向编码的基因位于圈外侧。内部的灰色圈代表GC含量。',"$indir/04Annotation/*/ogdraw/*dpi300.circular.png");    
}else{
	&piclist("图$pid 叶绿体基因组图谱",'注:正向编码的基因位于圈内侧,反向编码的基因位于圈外侧。内部的灰色圈代表GC含量。',"$indir/04Annotation/*/ogdraw/*.circular.png");
}

5.3 asmqc

5.3.1 主程序

# 主程序
$ perl /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/asmqc/cpDNA_asmqc.pl -i 组装结果  -1 clean*1 -2 clean*2 -r 参考.gbk -p sample -q 物种.gbk  -o asmqc_dir
---- 20231010 修改samtools路径


Bin="/share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/asmqc"

# cmd2 perl $Bin/src/gbk2fa.pl -i $refgbk -o $outdir/refseq.fa
$ perl $Bin/src/gbk2fa.pl -i $refgbk -o refseq.fa
# 生成refsseq.fa 命令3要用

# cmd3 perl $Bin/src/dotplot.pl -r $outdir/refseq.fa -q $infile -o $outdir -p cmp
$ perl $Bin/src/dotplot.pl -r refseq.fa -q $infile -o . -p cmp
# 生成共线性图

# cmd4 perl $Bin/src/annotation_cpDNA_genbank.pl -i $infile -r $refgbk -e sample -o $outdir/cmp.gbk
$ perl $Bin/src/annotation_cpDNA_genbank.pl -i $infile -r $refgbk -e sample -o cmp.gbk
# 生成cmp.gbk 和组装的gbk相比里面名称变成了sample

# cmd5 perl $Bin/src/cmp_genebank.pl -r $refgbk -q $querygbk -o $outdir/cmp.genecov_stat.xls
$ perl $Bin/src/cmp_genebank.pl -r $refgbk -q $querygbk -o cmp.genecov_stat.xls

# cmd8 perl $Bin/src/cp2circos.pl -g $outdir/cmp.gbk -b $outdir/mapping/$prefix.sort.bam -o $outdir/coverage_plot
$ perl $Bin/src/cp2circos.pl -g cmp.gbk -b mapping/*.sort.bam -o coverage_plot
# 如果只生成了mapping,运行这个程序,生成覆盖度图片  核心程序

# cmd9 perl $Bin/src/coverage_stat.pl -i $outdir -o $outdir -p $prefix
$ perl $Bin/src/coverage_stat.pl -i . -o . -p 拉丁名(同这部分主程序的$prefix)
# 生成一些统计文件 如果提示缺少文件,就是缺该组装物种的gbk

5.3.2 命令3 共线性图

cd asmqc_dir/

1.主程序
Bin="/share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/asmqc"
perl $Bin/src/dotplot.pl -r refseq.fa -q *.fsa -o . -p cmp

2.修改共线图颜色
sed -i '3s/repetitive$/unique/' cmp.coords.csv
Rscript /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/asmqc/src/coor_dotplot.R      cmp.coords.csv     cmp
## 批量
for i in `ls analysis/assembly/`;do echo $i;cd analysis/annotation/$i/asmqc_dir && sed -i '3s/repetitive$/unique/' cmp.coords.csv && Rscript /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/asmqc/src/coor_dotplot.R      cmp.coords.csv     cmp && cd - ;done

5.3.3 命令8 覆盖深度

cmd8 perl $Bin/src/cp2circos.pl -g $outdir/cmp.gbk -b outdir/mapping/prefix.sort.bam -o $outdir/coverage_plot
cd asmqc_dir
perl $Bin/src/cp2circos.pl -g cmp.gbk -b mapping/*.sort.bam -o coverage_plot # 会生成覆盖度的svg png
# cmp.gbk里的样品名是sample,可以把第一行相应位置修改为拉丁名	也可以使用gene_info文件夹内的gbk

FAQ:

1.图片中间物种名过长
把asmqc_dir/coverage_plot/sample.karyotype.txt中的名字改短,用下划线连接,不要有空格!
cd coverage_plot
circos="/share/nas6/zhouxy/biosoft/circos/current/bin/circos" && $circos -conf *.circos.conf && svg2xxx -t png -dpi 600 *.svg

2.图形外侧有一整圈,是因为trna跨首尾,需要修改一下
建议直接使用cmp.gbk来绘图
分步运行
1.bowtie
cp_bowtie.align.pl -i Lanmaoa_macrocarpa_FULLMT.fsa  -1  /share/nas6/xul/tmp/liulei/mogu/1444.cleaned_1.fastq.gz  -2 /share/nas6/xul/tmp/liulei/mogu/1444.cleaned_2.fastq.gz   -o analysis/annotation/Lanmaoa_macrocarpa/asmqc/mapping

2.筛选
samtools view -h -F 4 sample.sam > sample.map.sam

3.排序
samtools sort -@ 8 -o sample.sort.bam -T sample.sss -O bam sample.map.sam

4.深度统计
samtools depth sample.sort.bam |awk '{print "mt1",$2,$2,$3}' > sample.depth.txt # 最大值默认8000
samtools depth -m 100000 -a sample.sort.bam |awk '{print "mt1",$2,$2,$3}' > sample.depth.txt # 输出包括0的位置,最大值100000

5.找出深度最大值
awk '{max = ($NF > max) ? $NF : max} END {print max}' sample.depth.txt

6.修改配置文件 
深度最大值/深度绝对路径/输出文件路径
原来文件中的长度
$ cp XXX/J5.circos.conf ./sample.circos.conf

7.画图
## 默认圈图
/share/nas6/zhouxy/biosoft/circos/current/bin/circos -conf sample.circos.conf

## 线性图1
perl /share/nas6/xul/program/chloroplast/assembly/draw_line_depth.pl *.depth.txt depth.svg && svg2xxx -t png -dpi 600 depth.svg && svg2xxx -t pdf depth.svg
# 需要手动输入显示的最大值

for i in `ls analysis/assembly/`;do echo $i-------------------------------;cd analysis/annotation/$i/asmqc_dir/coverage_plot/ && perl /share/nas1/yuj/program/chloroplast/assembly/draw_line_depth.pl *.depth.txt $i.depth.svg && svg2xxx -t png -dpi 600 $i.depth.svg && svg2xxx -t pdf $i.depth.svg && cd - & done
# 自动设置y轴最大值为最大深度+500

## 线性图2
python3 /share/nas1/yuj/script/chloroplast/annotation/draw_line_depth_v2.py *.depth.txt  depth2.png

for i in `ls analysis/assembly/`;do echo $i-------------------------------;cd analysis/annotation/$i/asmqc_dir/coverage_plot/ && python3 /share/nas1/yuj/script/chloroplast/annotation/draw_line_depth_v2.py *.depth.txt  $i.depth2.png && cd - & done 

5.4 GC统计/基因统计

5.4.1 GC

1.有IR
perl /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/annotation/src/genome.gc.stat.pl -i $genome -o $outdir/$sample/$sample.genome.gc.stat.xls
## 批量
for i in *.fasta;do perl /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/annotation/src/genome.gc.stat.pl -i $i  -o ./${i%.fasta}.gc.stat.xls;done

2.没有IR
perl /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/asmqc/src/genome_gc.pl -i $genome -o $outdir/$sample/$sample.genome.gc.stat.xls

5.4.2 基因统计

1.有IR
## 批量
for i in *.gbk;do perl /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/annotation/src/gb2tbl.pl -i $i -o ${i%.gbk}/ && perl /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/annotation/src/tbl_stat.ir.pl -i ${i%.gbk}/*.tbl -f ${i%.gbk}.fasta -o ${i%.gbk}.gene_annotation_stat.xls;done

2.没有IR
perl /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/annotation/src/tbl_stat.pl -i $outdir/$sample/tbl/$sample.tbl -o $outdir/$sample/$sample.gene_annotation_stat.xls

5.5 重复序列

5.5.1 散在重复序列柱状图

image.png

## 这一步的程序是在注释流程里该步调用的程序中出现的
## /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/reputer 程序路径
perl /share/nas6/xul/program/chloroplast/reputer/draw_reputer_len_and_tye.pl -i $outdir/$prefix.dispered.repeat.stat.xls -o $outdir/repeat.bar.svg

perl /share/nas1/yuj/program/chloroplast/reputer/draw_reputer_len_and_tye.pl -i *.dispered.repeat.stat.xls -o repeat.bar.svg # 去除图例的斜体

5.5.2 SSR 简单重复序列(流程图)

有ir: 
perl $Bin/../etc/misa/src/stat.ssr.pl -i $outdir/$sample/ssr_analysis/cpSSR.misa -t $outdir/$sample/tbl/$sample.tbl  -ir $ir_string -o1 $outdir/$sample/ssr_analysis/ssr.info.xls -o2 $outdir/$sample/ssr_analysis/ssr.stat.xls 
perl $Bin/../etc/misa/src/plot.ssr.stat.pl -i $outdir/$sample/ssr_analysis/ssr.stat.xls -o $outdir/$sample/ssr_analysis/ 
perl /share/nas6/xul/program/chloroplast/ssr/draw_ssr.pl -i $outdir/$sample/ssr_analysis/number.tmp -o $outdir/$sample/ssr_analysis/ssr.number.bar 

无ir:
 perl $Bin/../etc/misa/src/stat.ssr.pl -i $outdir/$sample/ssr_analysis/cpSSR.misa -t $outdir/$sample/tbl/$sample.tbl  -o1 $outdir/$sample/ssr_analysis/ssr.info.xls -o2 $outdir/$sample/ssr_analysis/ssr.stat.xls  
 perl $Bin/../etc/misa/src/plot.ssr.stat.noir.pl -i $outdir/$sample/ssr_analysis/ssr.stat.xls -o $outdir/$sample/ssr_analysis/ 
 perl /share/nas6/xul/program/chloroplast/ssr/draw_ssr.pl -i $outdir/$sample/ssr_analysis/number.tmp -o $outdir/$sample/ssr_analysis/ssr.number.bar 

5.5.3 SSR 简单重复序列(非流程图)

多样品
1.
for i in *.gbk;do  mkdir ssr_analysis/${i%.gbk}  -p;done

2.
for i in *.gbk;do perl /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/annotation/etc/misa/misa.pl ${i%.gbk}.fasta ssr_analysis/${i%.gbk}/cpSSR ;done

3.
for i in *.gbk;do perl /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/annotation/src/gb2tbl.pl -i $i -o ssr_analysis/${i%.gbk}/;done

4.
有IR:
perl /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/annotation/etc/misa/src/stat.ssr.pl -i ssr_analysis/Lagerstroemia_caudata_NC_060355.1/cpSSR.misa -t ssr_analysis/Lagerstroemia_caudata_NC_060355.1/*.tbl -ir 84026-109650,126570-152194 -o1 ssr_analysis/Lagerstroemia_caudata_NC_060355.1/Lagerstroemia_caudata_NC_060355.1_ssr.info.xls -o2 ssr_analysis/Lagerstroemia_caudata_NC_060355.1/Lagerstroemia_caudata_NC_060355.1_ssr.stat.xls
无IR:
perl /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/annotation/etc/misa/src/stat.ssr.pl -i ssr_analysis/${i%.gbk}/cpSSR.misa -t ssr_analysis/${i%.gbk}/*.tbl -o1 ssr_analysis/${i%.gbk}/ssr.info.xls -o2 ssr_analysis/${i%.gbk}/ssr.stat.xls

5.生成几个物种的ssr类型统计表
cd ssr_analysis && for i in *;do cp $i/ssr.stat.xls $i.ssr.stat.xls;done && cd -
python3 /share/nas1/yuj/script/chloroplast/annotation/repeats_ssr_lsr_plot/4558_only_ssr_type_stat.py 输入目录
# cd ssr_analysis && for i in *;do cp $i/ssr.info.xls $i.ssr.info.xls;done && cd -
# "E:\OneDrive\jshy信息部\Script\chloroplast\annotation\repeats_ssr_lsr_plot\4558_only_ssr_type_stat.py"

6.excel绘制三维柱状图
单样品
只针对某个物种绘制
cp ssr.stat.xls ./
python3 /share/nas1/yuj/script/chloroplast/annotation/repeats_ssr_lsr_plot/4558_only_ssr_type_stat.py    ./ # 获得only_ssr_type_stat.txt
python3 /share/nas1/yuj/script/chloroplast/annotation/repeats_ssr_lsr_plot/ssr_plot.py -i  only_ssr_type_stat.txt -o ssr.bar.png

5.5.4 LSR 长散在重复序列

1.生成dispered.repeat.stat.xls
for i in *.fasta;do perl /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/reputer/find.dispered.repeat.pl -i $i -p ${i%.fassta} -o reputer;done

2.整理到一块

3.绘图
E:\OneDrive\jshy信息部\Script\chloroplast\annotation\repeats_ssr_lsr_plot\lsr_plot.py

5.5.5 合并信息绘图

需要前置文件$sample.dispered.repeat.region.xls

把散在重复文件下的分类统计表以区域拆分
"E:\OneDrive\jshy信息部\Script\chloroplast\annotation\repeats_ssr_lsr_plot\4558_repeat_region_stat.py"

生成ssr lsr柱状堆叠统计图片
"E:\OneDrive\jshy信息部\Script\chloroplast\annotation\repeats_ssr_lsr_plot\4558_all_repeat_plot.py"

5.5.6 重复序列圈图(线粒体流程)

/share/nas6/xul/program/mt2/annotation_for_multi_seq/pipline/repeat/mt2_repeat.pl 原版 线粒体参数
cd analysis/annotation/样品
mkdir repeat_circos && cd repeat_circos
perl /share/nas1/yuj/program/mt2/annotation_for_multi_seq/pipline/repeat/mt2_repeat.pl -mt *.gbk -o ./ # 与叶绿体ssr参数一致

## 仅绘图
circos="/share/nas6/xul/soft/circos/circos-0.69-6/bin/circos"
$circos  -conf circos.conf
svg2xxx -t pdf repeat.svg

5.6 RSCU_PLOT

1.生成rscu表格
mkdir rscu_analysis && for i in *.gbk;do perl /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/annotation/src/extrac_cds.pl -i $i -o features -p ${i%.gbk} && touch rscu_analysis/${i%.gbk}.rscu.stat.xls && perl /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/annotation/src/rscu.pl -i1 features/${i%.gbk}.cds -i2 features/${i%.gbk}.pep -o rscu_analysis/${i%.gbk}.rscu.stat.xls;done

2.合并xls
# "E:\OneDrive\jshy信息部\Script\chloroplast\annotation\repeats_ssr_lsr_plot\4558_rscu_stat..py"
python3 /share/nas1/yuj/script/chloroplast/annotation/rscu_plot/4558_all_rscu_stat.py rscu_analysis

示例:
codon type	Ormosia_henryi_OP450827.1	Ormosia_hosiei_OP450825.1	Ormosia_semicastrata_OP450824.1	Ormosia_xylocarpa_OP450826.1
UAA	1.7628	1.7628	1.8246	1.7628
UAG	0.6495	0.6495	0.6495	0.6495

3.绘图
tbtools绘制热图

六.批量操作

# 批量显示测序数据量
for i in $( ls /share/nas1/yuj/project/20211229/GP-20211206-3811-1_longyanxueyuan_9samples_dongwu_xianliti/analysis/assembly/) ;do echo $i; ll /share/nas1/yuj/project/20211229/GP-20211206-3811-1_longyanxueyuan_9samples_dongwu_xianliti/analysis/assembly/$i/pseudo/$i/1_Trimmed_Reads/$i.trimmed_P1.fq && ll /share/nas1/yuj/project/20211229/GP-20211206-3811-1_longyanxueyuan_9samples_dongwu_xianliti/analysis/assembly/$i/pseudo/$i/1_Trimmed_Reads/$i.trimmed_P2.fq ;done

# 批量显示比对结果
for i in $( ls /share/nas1/yuj/project/20211229/GP-20211206-3811-1_longyanxueyuan_9samples_dongwu_xianliti/analysis/assembly/) ;do echo $i; blastn -query /share/nas1/yuj/project/20211229/GP-20211206-3811-1_longyanxueyuan_9samples_dongwu_xianliti/ref_ass/all_ref.fa  -subject /share/nas1/yuj/project/20211229/GP-20211206-3811-1_longyanxueyuan_9samples_dongwu_xianliti/analysis/assembly/$i/pseudo/$i/1_Trimmed_Reads/uni/assembly.fasta -outfmt 6 ;done

# 测序数据质控
perl /share/nas6/pub/pipline/genome-assembly-seq/mitochondrial-genome-seq/v2.0/annotation/src/fastq_qc.pl -i analysis/samples.reads.txt -o complete_dir/01Rawdata 

# 批量质控 线粒体
for i in $( ls /share/nas1/yuj/project/20211229/GP-20211206-3811-1_longyanxueyuan_9samples_dongwu_xianliti/analysis/assembly/) ;do echo $i; perl /share/nas6/pub/pipline/genome-assembly-seq/mitochondrial-genome-seq/v2.0/asmqc/mtDNA_asmqc.pl -i /share/nas1/yuj/project/20211229/GP-20211206-3811-1_longyanxueyuan_9samples_dongwu_xianliti/ref_ass/fasta/HQ857211.1.fasta  -1 /share/nas1/seqloader/xianliti/GP-20211206-3811-1_longyanxueyuan_9samples_dongwu_xianliti/2.clean_data/$i/$i_*_R1_001.fastq.gz -2 /share/nas1/seqloader/xianliti/GP-20211206-3811-1_longyanxueyuan_9samples_dongwu_xianliti/2.clean_data/$i/$i_*_R2_001.fastq.gz -p Gallus_gallus -q /share/nas1/yuj/project/20211229/GP-20211206-3811-1_longyanxueyuan_9samples_dongwu_xianliti/ref_ass/gbk/HQ857211.1.gbk  -o /share/nas1/yuj/project/20211229/GP-20211206-3811-1_longyanxueyuan_9samples_dongwu_xianliti/analysis/annotation/$i/asmqc -g 2 -r /share/nas1/yuj/project/20211229/GP-20211206-3811-1_longyanxueyuan_9samples_dongwu_xianliti/ref_ass/gbk/MT800385.1.gbk;done

# 自动换行查看
perl /share/nas6/pub/pipline/genome-assembly-seq/mitochondrial-genome-seq/v2.0/asmqc/mtDNA_asmqc.pl -i ref_ass/fasta/HQ857211.1.fasta -1 /share/nas1/seqloader/xianliti/GP-20211206-3811-1_longyanxueyuan_9samples_dongwu_xianliti/2.clean_data/$i/$i_*_R1_001.fastq.gz -2 /share/nas1/seqloader/xianliti/GP-20211206-3811-1_longyanxueyuan_9samples_dongwu_xianliti/2.clean_data/$i/$i_*_R2_001.fastq.gz -p Gallus_gallus -q /share/nas1/yuj/project/20211229/GP-20211206-3811-1_longyanxueyuan_9samples_dongwu_xianliti/ref_ass/gbk/HQ857211.1.gbk -o /share/nas1/yuj/project/20211229/GP-20211206-3811-1_longyanxueyuan_9samples_dongwu_xianliti/analysis/annotation/$i/asmqc -g 2 -r /share/nas1/yuj/project/20211229/GP-20211206-3811-1_longyanxueyuan_9samples_dongwu_xianliti/ref_ass/gbk/MT800385.1.gbk

# 批量质控 叶绿体
for i in $(ls /share/nas1/yuj/project/GP-20210924-3463-1_20220413/analysis/assembly/1/pseudo/1/Final_Assembly/ref_ass/fasta);do echo $i;perl /share/nas6/pub/pipline/genome-assembly-seq/chloroplast-genome-seq/v1.2/asmqc/cpDNA_asmqc.pl -i /share/nas1/yuj/project/GP-20210924-3463-1_20220413/analysis/assembly/1/pseudo/1/Final_Assembly/ref_ass/fasta/$i -1 /share/nas1/seqloader/yelvti/GP-20210924-3463-1_qinghaiminda_zhoulaosh_5samples_yelvti/data/2.clean_data/1/1_S0_L000_R1_000.fastq.gz -2 /share/nas1/seqloader/yelvti/GP-20210924-3463-1_qinghaiminda_zhoulaosh_5samples_yelvti/data/2.clean_data/1/1_S0_L000_R2_000.fastq.gz -p ${i%.1.fasta} -q /share/nas1/yuj/project/GP-20210924-3463-1_20220413/analysis/assembly/1/pseudo/1/Final_Assembly/ref_ass/gbk/${i%.fasta}.gbk  -o /share/nas1/yuj/project/GP-20210924-3463-1_20220413/analysis/assembly/1/pseudo/1/Final_Assembly/${i%.1.fasta}asmqc -r /share/nas1/yuj/project/GP-20210924-3463-1_20220413/analysis/assembly/1/pseudo/1/Final_Assembly/ref_ass/gbk/NC_053552.1.gbk;done

# 批量复制
for i in $(ls /share/nas1/yuj/project/GP-20210924-3463-1_20220413/analysis/assembly/1/pseudo/1/Final_Assembly/ref_ass/fasta);do echo $i;cp /share/nas1/yuj/project/GP-20210924-3463-1_20220413/analysis/assembly/1/pseudo/1/Final_Assembly/${i%.1.fasta}asmqc/coverage_plot/sample.png /share/nas1/yuj/project/GP-20210924-3463-1_20220413/analysis/assembly/1/pseudo/1/Final_Assembly/png/${i%.1.fasta}.png;done

七.基因分类

基因分类 基因分组 基因名称 基因数量
表达相关基因 核糖体 RNA 基因 rrn16(×2)、23(×2)、4.5(×2)、5(×2) 4
转运 RNA 基因 trnA-CAU、trnA-UGC(×2)、trnC-GCA、trnD-GUC、trnE-UUC、trnF-GAA、trnG-GCC、trnG-GUG、trnH-CAU、trnI-CAU(×2)、trnI-GAU(×2)、trnK-UUU、trnL-CAA(×2)、trnL-UAG、trnL-UAA、trnM-CAU(×2)、trnN-GUU(×2)、trnP-UGG、trnQ-UUG、trnR-ACG(×2)、trnR-UCU、trnS-UGA、trnS-GCU、trnS-GGA、trnT-UGU、trnT-GGU、trnV-UAC、trnW-CCA、trnY-GUA 29
核糖体小亚基基因(SSU) rps2、3、4、7(×2)、8、11、12(×2)、14、15(×2)、16、18、19 12
核糖体大亚基基因(LSU) rpl2(×2)、14、16、20、22、23(×2)、32、33、36 9
RNA 聚合酶亚基基因 rpoA、B、C1、C2 4
光合作用相关基因 光系统Ⅰ基因 psaA、B、C、I、J 5
光系统Ⅱ基因 psbA、B、C、D、E、F、H、I、J、K、L、M、T、Z 14
细胞色素复合物基因 petA、B、D、G、L、N 6
ATP 合酶基因 atpA、B、E、F、H、I 6
依赖 ATP 的蛋白酶单元 p 基因 clpP 1
二磷酸核酮糖醛化酶大亚基基因 rbcL 1
NADH 脱氢酶基因 ndhA、B(×2)、C、D、E、F、G、H、I、J、K 11
其他基因 成熟酶基因 matK 2
包裹膜蛋白基因 cemA 1
乙酰辅酶 A 羧化酶亚基基因 accD 1
c 型细胞色素合成基因 ccsA 1
转录起始因子基因 infA 1
未知功能基因 保守开放阅读框 ycf1(×2)、ycf2(×2)、ycf15(×2) 3