06T-DNA（外源基因）插入位点

一.T-LOC软件

Cite

文章：T-LOC: A comprehensive tool to localize and characterize T-DNA integration sites | Plant Physiology | Oxford Academic
公众号：微信公众平台
Github：https://github.com/sfli001/T-LOC

1.1 安装

github下载解压就行
注意一下先决条件
几个软件包

1.2 使用

/share/nas1/yuj/software/T-LOC/T-LOC.py

for i in *;do echo $i----------------------------------;python  /share/nas1/yuj/software/T-LOC/T-LOC.py  --fastq  /share/nas2/seqloader/reseq/GP-20240708-8721_xinjiangnongkeyuan_yuanyizuowusuo_4samples_reseq/2.clean_data/$i/$i*R1.fastq.gz,/share/nas2/seqloader/reseq/GP-20240708-8721_xinjiangnongkeyuan_yuanyizuowusuo_4samples_reseq/2.clean_data/$i/$i*R2.fastq.gz  --genome   /share/nas6/pub/genome/tomato/ITAG4.0_release/S_lycopersicum_chromosomes.4.00.fa  --TDNA  /share/nas2/yuj/project/2024/re-sequencing/GP-20240708-8721_20240726/T-DNA/target.fa  --output  $i  & done

## python2.7
1.带T-DNA 插入的测序 fastq 文件、基因组 fasta 文件、TDNA fasta 文件
python  /share/nas1/yuj/software/T-LOC/T-LOC.py  --fastq samples_1.fq.gz,samples_2.fq.gz --genome Reference.fa  --TDNA TDNA.fa --output outputfolder

2.带T-DNA的reads比参考的bam、带T-DNA的reads比TDNA的bam、基因组 fasta 文件、TDNA fasta 文件
python T-LOC.py  --bamR samples_Ref.bam --bamT samples_TDNA.bam --genome Reference.fa  --TDNA TDNA.fa --ouput outputfolder

3.带T-DNA插入的测序 fastq 文件、不带T-DNA插入的测序 fastq 文件、基因组 fasta 文件、TDNA fasta 文件
python T-LOC.py  --fastq samples_1.fq.gz,samples_2.fq.gz --genome Reference.fa  --TDNA TDNA.fa --control_fastq control_s1_1.fq.gz,control_s1_2.fq.gz:control_s2_1.fq.gz,control_s2_2.fq.gz --ouput outputfolder

4.带T-DNA的fastq 文件、不带T-DNA的reads比对ref的bam、基因组 fasta 文件、TDNA fasta 文件
python T-LOC.py  --fastq samples_1.fq.gz,samples_2.fq.gz --genome Reference.fa  --TDNA TDNA.fa --control_bamR control_s1.sam,control_s2.sam --ouput outputfolder

1.3 输出文件

1.Result
MosaicFromTDNA.pdf
   - 包含每个 T-DNA 插入位点的 T-DNA 插入图和两侧的镶嵌读数，这些读数是从 T-DNA 剪切对齐读数计算得出的。
MosaicFromREF.pdf
   - 包含每个 T-DNA 插入位点的 T-DNA 插入图和两侧的镶嵌读数，这些读数是从参考剪切对齐读数计算得出的。
MosaicFromTDNA.fasta
   - 包含每个 T-DNA 插入位点的侧翼参考序列以及 T-DNA 序列，这些序列是从 T-DNA 剪切对齐读数计算得出的。
MosaicFromREF.fasta
   - 包含每个 T-DNA 插入位点的侧翼参考序列以及 T-DNA 序列，这些序列是从参考剪切对齐读数计算得出的。
CovREF.pdf
   - T-DNA 插入位点的每个侧翼区域的参考基因组覆盖率。

2.log.T-LOC
   - 运行 T-LOC 管道的日志文件。

3.Bwa_genome
   - 包含参考基因组的 bwa 索引文件的文件夹。

4.TDNA_genome
   - 包含 T-DNA 序列的 bwa 索引文件的文件夹。

5.REF
   - 包含使用参考剪切读数生成的中间文件的文件夹。

6.TDNA
   - 包含使用 T-DNA 剪切读数生成的中间文件的文件夹。

7.Control
   - 包含从没有 T-DNA 插入的植物生成控制剪切 ID 的中间文件的文件夹。

1.4 参数

必需参数：
--genome:
	The fasta files of the reference genome
--TDNA:
	The fasta files of the TDNA  genome
--fastq: 
	s1_1.fq.gz[,s1_2.fq.gz]. The raw sequencing reads in fastq format from plant with TDNA insertions that are used to align to the reference genome and TDNA genomes. It can be omitted if both bamR and bamT were provided.
--bamR:
	Sorted bam file alignment to the reference genome from sequencing reads on plant with TDNA insertions. It can be omitted if fastq files  were provided.
--bamT:
    Sorted Bam file alignment to the TDNA genome from sequencing reads on plant with TDNA insertions. It can be omitted if fastq files  were provided.

可选参数：
--control_fastq:
	c1_1.fq.gz[,c1_2.fq.gz][:c2_1.fq.gz[,c2_2.fq.gz],...]. The raw sequencing reads in fastq format from plant without TDNA insertions that are used to align to the reference genome to produce control clipped read ID
--control_bamR:
	c1.bam[,c2.bam,...]. Bam/sam file alignment to the reference genome from sequencing reads on plant without TDNA insertions that are used to produce control clipped read ID
--control_ID:
	A txt file that are clipped read ID from plant without TDNA insertions. If control_fastq, control_bamR and control_ID were all missed, TDNA insertion sites were localized through TDNA clipped reads only.
--anchor:
	The minimal anchor length required for the clipped reads. The default value is 20.
--insert:
	The maximal insertion sequence length at the TDNA insertion sites. The default value is 30.
--genome_Bwa: 
	bwa index of given reference genome
--TDNA_Bwa:    
    bwa index of given TDNA sequence
--read_min_TDNA:
	The minimal number of mosaic reads required for each TDNA insertion sites as well as mosaic reads between two TDNA positions. The default value is 2. It can be set to any integer larger or equal to 2. 
--read_min_REF:
	The minimal number of mosaic reads bwteen two reference positions. The default value is 4. It can be set to any integer larger or equal to 2. If the program run slow, please increase the number.
--Sample_name:
	sample names will be used to label output files,defalut is ss
--resume:
	Whether to resume previous run. The default is true
--Mosaic_length: 
	the length of output sequences of each mosaic side
--output, -o
    The output folder. The default is current directory

二.外源基因插入位点流程

2.1 流程

"7.从比对结果中找到候选的插入位点。要求是S的部分要和比对上基因组部分的长度相等（+-9bp）"，默认误差是9bp，改大一点就是放宽筛选

## target.fa为包含目标基因的载体序列
## ref_genome.fa为参考基因

bowtie2-build target.fa target && bowtie2 -x target -1 *R1*.f*q.gz -2 *R2*.f*q.gz -S sample.sam -p 10 --very-sensitive-local && samtools view -h -F 4 sample.sam > map.sam && samtools sort map.sam > 01_map.sort.bam && samtools depth -a  01_map.sort.bam  > depth.txt && perl /share/nas1/yuj/program/chloroplast/assembly/draw_line_depth.pl depth.txt depth.svg && convert depth.svg  02_depth.png && perl /share/nas6/xul/program/reseq/find_junction_reads_from_sam.pl -i 01_map.sort.bam && blastn -query junction_reads.fa -subject  ref_genome.fa -outfmt 6 > 04_junction_reads_map2genome.xls  && perl /share/nas6/xul/program/reseq/get_insert_site_from_blast.pl -i 04_junction_reads_map2genome.xls  -m 9 &

1.使用bowtie2将测序片段比对外源基因，注意一定要用 --very-sensitive-local 比对
bowtie2-build target.fa target
bowtie2 -x target -1 *R1*.f*q.gz -2 *R2*.f*q.gz -S sample.sam -p 10 --very-sensitive-local

2.使用samtools view -h -F 得到比对后的sam文件
samtools view -h -F 4 sample.sam > map.sam

3.对sam文件进行排序
samtools sort map.sam > 01_map.sort.bam # 默认bam
# samtools view -h 01_map.sort.bam -o 01_map.sort.sam # 转sam

4.
## 统计下覆盖情况
samtools depth -a 01_map.sort.bam > depth.txt
## 可视化覆盖度
# perl /share/nas6/xul/program/chloroplast/assembly/draw_line_depth.pl depth.txt && convert test.svg  depth.png 需要手动输入Y轴最大值
perl /share/nas1/yuj/program/chloroplast/assembly/draw_line_depth.pl depth.txt depth.svg && convert depth.svg  depth.png # 自动+500

## 多载体检测
samtools depth -m 100000 -a 01_map.sort.bam |awk '{print $1,$2,$2,$3}' > sample.depth.txt # 输出包括0的位置，最大值100000
python3 /share/nas1/yuj/script/ddradANDreseq/T-DNA/filter_zero_coverage.py  sample.depth.txt filter_zero_coverage.txt | sort -k 2 -V  -r > coverage.stat.txt # 输出平均覆盖度
for i in `grep -v "#"  coverage.stat.txt | head -n 4 | cut -f 1`;do echo $i;grep $i sample.depth.txt >> filtered_ids_depth.txt;done # 根据平均覆盖从sample.depth.txt提取高覆盖度的信息到filtered_ids_depth.txt
python3 /share/nas1/yuj/script/chloroplast/annotation/draw_line_depth_v3.py filtered_ids_depth.txt  filtered_ids_depth.png # v3多个子图，效果好
# python3 /share/nas1/yuj/script/chloroplast/annotation/draw_line_depth_v2.py  sample.depth.txt  depth2.png # v2画在一张图上，差异过大时，看不清覆盖度低的

5.对map.bam文件进行筛选,得到junction_reads
perl /share/nas6/xul/program/reseq/find_junction_reads_from_sam.pl -i 01_map.sort.bam

6.将junction_reads和基因组比对
blastn -query junction_reads.fa -subject  ref_genome.fa -outfmt 6 > junction_reads_map2genome.xls
# cut -f 1,2,3,4,7,8,9,10  junction_reads_map2genome.xls > 04_junction_reads_map2genome.xls

7.从比对结果中找到候选的插入位点。要求是S的部分要和比对上基因组部分的长度相等（+-2bp）
perl /share/nas6/xul/program/reseq/get_insert_site_from_blast.pl -i junction_reads_map2genome.xls  -m 9

8.获取插入位置附近的基因
python3 /share/nas1/yuj/script/ddradANDreseq/BLAST_Read_and_Gene_Neighborhood_Reporter.py msu.gff3 --blast_file 05_final_junction_blast.xls > 06gene_chr_pos.txt
python3 /share/nas1/yuj/script/ddradANDreseq/BLAST_Read_and_Gene_Neighborhood_Reporter.py msu.gff3 --blast_file 05_final_junction_blast.xls --query "Chr2:300000-300500"> gene_chr_pos.txt

2.2 word报告

客户单位：江苏省农科院                          
报告单位：南京集思慧远生物科技有限公司          
联 系 人：                                      
联系电话：                                      
报告日期：2023年11月14日                        
 
目  录
1 信息分析方法	3
2 分析结果	4
3 结果文件说明	11


1 信息分析方法
外源载体插入位点是指转基因植物中载体序列插入到染色体的位置。在植物功能基因组学研究中，明确插入位点的信息对相关研究的顺利开展非常重要。
使用重测序的手段可以获得外源载体的插入位点，生物信息学分析具体步骤如下：

(1)	原始数据质控：
使用fastp（version 0.20.0，https://github.com/OpenGene/fastp）软件对原始数据进行过滤，过滤标准如下：
	截除Reads中的测序接头以及引物序列
	过滤掉平均质量值小于Q5的reads
	过滤掉N个数大于5的reads

(2)	原始数据比对外源载体：
使用bowtie2(v2.2.4,http://bowtie-bio.sourceforge.net/bowtie2/index.shtml）very-sensitive-local模式比对给定的载体序列。

(3)	获取交界处的序列：
对生成的sam文件进行过滤，获取比对结果为：()M()S和()S()M的序列。（）中为数值，M表示比对上的部分，S表示未比对上的部分。满足条件的序列及为交界处的序列。

(4)	将交界处的序列比对基因组：
使用BLAST 2.10.1+软件将上述得到的序列比对到基因组。

(5)	筛选合适的交界处序列，确定插入位点：
筛选S前对应的数值与比对上基因组的长度相近的序列，这些序列即为合适的交界处序列。结合交界处序列出现的次数确定最终的插入位点。
 
2 分析结果

(1) 使用bowtie2将原始数据比对到载体序列上，统计比对上的reads情况，结果如下：
表1 原始数据比对载体序列结果
序列	条数
Total	1340
使用samtools统计覆盖的深度，对覆盖深度进行可视化，结果如下：
图1 测序数据在质粒上的覆盖度及深度情况
由覆盖度可知，质粒左端的插入位点应该在3000bp附近，右端位点应该在9600bp附近。

(2) 筛选交界处的reads，共得到87条reads，每条reads以如下方式命名：
A00133:885:HN52MDSX7:4:1218:19741:5995_1(1)_61S89M
该名称以下划线分隔，共三段。第一段表示该reads的ID；第二段表示该reads比对到载体上的位置，括号中的位置是交界处的位点；第三段表示该reads和载体序列的具体比对情况，S表示比对不上，M表示比对的上，前面的数值代表长度。

(3) 交界处reads比对参考基因组，结果如下（部分）：
表2 交界处reads比对参考基因组结果
A00133:885:HN52MDSX7:4:1140:13449:14544_5(5)_57S93M	A04	100.000	43	1	43	87221801	87221843
注：第一列是测序序列ID，第二列是比对上参考基因组的染色体编号，第三列是比对的相似度，第四列是比对上的长度，第五、六列是测序序列比对的起止位置，第七、八列是比对上染色体的起止位置。
 
(4) 交界处reads比对参考基因组，存在长度不匹配的情况，这种序列不属于插入位点，需要过滤掉，最终过滤结果如下：
表3 交界处reads比对参考基因组最终结果
A00133:885:HN52MDSX7:4:2236:3025:18286_223(327)_104M46S	A04	100.000	39	112	150	87221290	87221252
注：第一列是测序序列ID，第二列是比对上参考基因组的染色体编号，第三列是比对的相似度，第四列是比对上的长度，第五、六列是测序序列比对的起止位置，第七、八列是比对上染色体的起止位置。

3 结果文件说明
01_map.sort.bam：原始数据比对载体序列的文件
02_depth.png：原始数据在载体上的覆盖情况图
03_junction_reads.fa：交界处的reads
04_junction_reads_map2genome.xls：交界处的reads与基因组的比对结果
05_final_junction_blast.xls：交界处的reads与基因组的比对结果（过滤后）

2.3 txt文本说明

3.结果文件说明
01_map.sort.bam：原始数据比对载体序列的文件
02_depth.png：原始数据在载体上的覆盖情况图
03_junction_reads.fa：交界处的reads
04_junction_reads_map2genome.xls：交界处的reads与基因组的比对结果
05_final_junction_blast.xls：交界处的reads与基因组的比对结果（过滤后）

-----------------------------------------------------------------------------------------------------------------------------------
1.信息分析方法
外源载体插入位点是指转基因植物中载体序列插入到染色体的位置。在植物功能基因组学研究中，明确插入位点的信息对相关研究的顺利开展非常重要。
使用重测序的手段可以获得外源载体的插入位点，生物信息学分析具体步骤如下：

(1)	原始数据质控：
使用fastp（version 0.20.0，https://github.com/OpenGene/fastp）软件对原始数据进行过滤，过滤标准如下：
	截除Reads中的测序接头以及引物序列
	过滤掉平均质量值小于Q5的reads
	过滤掉N个数大于5的reads

(2)	原始数据比对外源载体：
使用bowtie2(v2.2.4,http://bowtie-bio.sourceforge.net/bowtie2/index.shtml）very-sensitive-local模式比对给定的载体序列。

(3)	获取交界处的序列：
对生成的sam文件进行过滤，获取比对结果为：()M()S和()S()M的序列。（）中为数值，M表示比对上的部分，S表示未比对上的部分。
满足条件的序列及为交界处的序列。

(4)	将交界处的序列比对基因组：
使用BLAST 2.10.1+软件将上述得到的序列比对到基因组。

(5)	筛选合适的交界处序列，确定插入位点：
筛选S前对应的数值与比对上基因组的长度相近的序列，这些序列即为合适的交界处序列。结合交界处序列出现的次数确定最终的插入位点。

-----------------------------------------------------------------------------------------------------------------------------------
2.分析结果

(1) 使用bowtie2将原始数据比对到载体序列上，统计比对上的reads情况。
使用samtools统计覆盖的深度，对覆盖深度进行可视化，结果见”02_depth.png“（测序数据在质粒上的覆盖度及深度情况）。


(2) 筛选交界处的reads，每条reads以如下方式命名，结果见”03_junction_reads.fa“。
LH00270:300:22CNCKLT4:6:2404:42254:7724_424(424)_71S79M
该名称以下划线分隔，共三段。
第一段表示该reads的ID；
第二段表示该reads比对到载体上的位置，括号中的位置是交界处的位点；
第三段表示该reads和载体序列的具体比对情况，S表示比对不上，M表示比对的上，前面的数值代表长度。


(3) 交界处reads比对参考基因组，结果见”04_junction_reads_map2genome.xls“。
LH00270:300:22CNCKLT4:6:2376:36897:2849_1678(1678)_105S45M	SL4.0ch05	100.000	104	0	0	1	104	40593984	40594087	6.86e-48	193
注：
第1列是测序序列ID，第2列是比对上参考基因组的染色体编号，第3列是比对的相似度，第4列是比对上的长度，
第7、8列是测序序列比对的起止位置，第9、10列是比对上染色体的起止位置。


(4) 交界处reads比对参考基因组，存在长度不匹配的情况，这种序列不属于插入位点，需要过滤掉，最终过滤结果见“05_final_junction_blast.xls”。
LH00270:300:22CNCKLT4:6:2376:36897:2849_1678(1678)_105S45M	SL4.0ch05	100.000	104	1	104	40593984	40594087	6.86e-48
注：
第1列是测序序列ID，第2列是比对上参考基因组的染色体编号，第3列是比对的相似度，第4列是比对上的长度，
第5、6列是测序序列比对的起止位置，第7、8列是比对上染色体的起止位置。