GSE47626 Processing Pipeline
Publication
Differential L1 regulation in pluripotent stem cells of humans and apes.Nature (2013) — PMID 24153179
Dataset
GSE47626Differential LINE-1 retrotransposition in induced pluripotent stem cells between humans and great apes
Processing Steps
Generate Jupyter Notebook-
1
Paired end reads from all libraries were mapped to both the human (hg19, GRCh37) and chimpanzee (panTro3, CGSC 2.1.3) genomes using STAR (v2.2.0c).
$ Bash example
# Install STAR (example using conda) # conda install -c bioconda star # --- Reference Genome Preparation (Conceptual) --- # For mapping to "both" human (hg19) and chimpanzee (panTro3) genomes, # a common strategy is to concatenate their FASTA files and build a single STAR index. # This allows reads to map to either genome within the same alignment run. # Placeholder paths for reference genomes and annotations. # Replace these with actual paths to your hg19 and panTro3 files. # HUMAN_GENOME_FASTA="path/to/hg19.fa" # CHIMP_GENOME_FASTA="path/to/panTro3.fa" # HUMAN_GTF="path/to/hg19.gtf" # Optional: For splice junction annotation # CHIMP_GTF="path/to/panTro3.gtf" # Optional # Define directory for the combined genome index # COMBINED_GENOME_DIR="combined_hg19_panTro3_index" # mkdir -p "${COMBINED_GENOME_DIR}" # Concatenate FASTA files to create a combined reference # cat "${HUMAN_GENOME_FASTA}" "${CHIMP_GENOME_FASTA}" > "${COMBINED_GENOME_DIR}/combined_genome.fa" # Concatenate GTF files (if available and desired for splice-aware alignment) # If using GTF, ensure chromosome names in GTF match those in FASTA. # If GTF is not used, remove the --sjdbGTFfile parameter from genomeGenerate. # cat "${HUMAN_GTF}" "${CHIMP_GTF}" > "${COMBINED_GENOME_DIR}/combined_annotations.gtf" # GENOME_ANNOTATION_PARAM="--sjdbGTFfile ${COMBINED_GENOME_DIR}/combined_annotations.gtf" # Build STAR index for the combined genome (example command) # This step needs to be run once before alignment. # STAR --runMode genomeGenerate \ # --genomeDir "${COMBINED_GENOME_DIR}" \ # --genomeFastaFiles "${COMBINED_GENOME_DIR}/combined_genome.fa" \ # --runThreadN 8 \ # ${GENOME_ANNOTATION_PARAM} # Uncomment if using GTF # --- STAR Alignment (v2.2.0c) --- # Assuming the combined genome index has been pre-built at 'combined_hg19_panTro3_index'. # Define input files and output directory. # Replace with your actual paired-end FASTQ.gz files. INPUT_READ1="sample_R1.fastq.gz" INPUT_READ2="sample_R2.fastq.gz" GENOME_INDEX_DIR="combined_hg19_panTro3_index" # Path to the pre-built combined genome index OUTPUT_DIR="star_output" OUTPUT_PREFIX="sample_aligned" # Prefix for output files mkdir -p "${OUTPUT_DIR}" STAR --genomeDir "${GENOME_INDEX_DIR}" \ --readFilesIn "${INPUT_READ1}" "${INPUT_READ2}" \ --readFilesCommand zcat \ --runThreadN 8 \ --outFileNamePrefix "${OUTPUT_DIR}/${OUTPUT_PREFIX}_" \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes All \ --outFilterMultimapNmax 20 \ --outFilterType BySJout \ --outFilterMismatchNmax 999 \ --outFilterMismatchNoverLmax 0.1 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 -
2
Due to the lack of annotation in the chimpanzee genome, human gene models (RefSeq) were used to quantify gene expression.
RefSeq v2.0.6$ Bash example
# Install featureCounts (part of Subread package) # conda install -c bioconda subread # Define reference human RefSeq gene models (e.g., for hg38) # This GTF file contains gene annotations derived from RefSeq for the human genome. # Source: UCSC Genome Browser, refGene track for hg38 REFSEQ_GTF="https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.gtf.gz" # Placeholder for input BAM file containing chimpanzee reads aligned to the human genome. # The description implies chimpanzee reads are quantified against human gene models. INPUT_BAM="chimpanzee_sample_aligned_to_human_genome.bam" # Output file for gene expression counts OUTPUT_COUNTS="gene_expression_counts.txt" # Quantify gene expression using featureCounts # -a: Annotation file (GTF/GFF) # -F GTF: Specify annotation file format as GTF # -t exon: Count reads mapping to 'exon' features # -g gene_id: Aggregate counts by 'gene_id' attribute in the GTF # -o: Output file for counts # Assuming single-end reads for this example; add -p for paired-end reads if applicable. featureCounts -a "${REFSEQ_GTF}" \ -F GTF \ -t exon \ -g gene_id \ -o "${OUTPUT_COUNTS}" \ "${INPUT_BAM}" -
3
To avoid bias introduced by genome insertions and deletions, only reads mapping to both the human and chimpanzee genomes uniquely were used from each sample when comparing gene expression values (~4% of reads mapped to only one genome per sample).
$ Bash example
# This pipeline assumes that raw reads have already been aligned to both the human (e.g., hg38) and chimpanzee (e.g., panTro6) genomes using a suitable aligner like STAR, and the resulting BAM files are sorted and indexed. # The goal is to identify reads that map uniquely to *both* genomes and then filter one of the alignments (e.g., human) to retain only these common uniquely mapped reads for downstream gene expression analysis. # Assume the following input files exist from prior alignment steps: # human_alignment/Aligned.sortedByCoordinate.out.bam (full alignment to human genome) # chimp_alignment/Aligned.sortedByCoordinate.out.bam (full alignment to chimpanzee genome) # --- Installation (commented out) --- # # conda install -c bioconda samtools # --- Step 1: Extract read IDs that map uniquely to the human genome --- # A read is considered uniquely mapped if it has a high mapping quality (implicitly handled by STAR's NH:i:1 tag) # and is not a secondary (0x100) or unmapped (0x4) alignment. For STAR, NH:i:1 indicates a unique alignment. # We extract the query name (read ID) for these reads. samtools view -F 0x4 -F 0x100 human_alignment/Aligned.sortedByCoordinate.out.bam | \ awk '($0 ~ /NH:i:1/) {print $1}' | sort | uniq > human_unique_read_ids.txt # --- Step 2: Extract read IDs that map uniquely to the chimpanzee genome --- samtools view -F 0x4 -F 0x100 chimp_alignment/Aligned.sortedByCoordinate.out.bam | \ awk '($0 ~ /NH:i:1/) {print $1}' | sort | uniq > chimp_unique_read_ids.txt # --- Step 3: Find common read IDs that are present in both unique lists --- # These are the reads that mapped uniquely to *both* the human and chimpanzee genomes. comm -12 human_unique_read_ids.txt chimp_unique_read_ids.txt > common_unique_read_ids.txt # --- Step 4: Filter the human alignment BAM to keep only the common uniquely mapped reads --- # This creates the final BAM file containing only the reads that satisfy the criteria, # aligned to the human genome, ready for gene expression quantification. samtools view -N common_unique_read_ids.txt -b human_alignment/Aligned.sortedByCoordinate.out.bam > final_filtered_reads_human_alignment.bam # --- Step 5: Index the final filtered BAM file --- samtools index final_filtered_reads_human_alignment.bam # --- Step 6: Clean up intermediate files --- rm human_unique_read_ids.txt chimp_unique_read_ids.txt common_unique_read_ids.txt # Reference genomes used for initial alignment (not directly in this script, but implied): # Human: GRCh38 (hg38) - Source: UCSC Genome Browser, NCBI # Chimpanzee: panTro6 - Source: UCSC Genome Browser, NCBI -
4
To calculate gene expression, read counts in the exons of RefSeq transcripts where calculated using HOMER
$ Bash example
# Install HOMER (if not already installed) # conda install -c bioconda homer # Define variables INPUT_BAM="input.bam" # Replace with your input BAM file (e.g., alignment file) GENOME_ASSEMBLY="hg38" # Replace with your genome assembly (e.g., hg19, mm10, hg38) OUTPUT_TAG_DIRECTORY="homer_tags_for_expression" OUTPUT_FILE="gene_expression_counts.txt" NUM_CPUS=8 # Adjust the number of CPUs as needed for parallel processing # 1. Create a HOMER Tag Directory from the BAM file # This step processes the BAM file into a HOMER-specific format for efficient downstream analysis. # It's crucial for HOMER's quantification functions. makeTagDirectory "${OUTPUT_TAG_DIRECTORY}" "${INPUT_BAM}" -cpu "${NUM_CPUS}" # 2. Calculate gene expression (read counts in RefSeq transcripts/genes) using annotatePeaks.pl # The -gene option quantifies reads over gene bodies (transcription start site to transcription end site). # The -refseq option instructs HOMER to use its built-in RefSeq annotations for the specified genome. # The output will be a tab-separated file containing gene-level read counts and other annotations. annotatePeaks.pl "${OUTPUT_TAG_DIRECTORY}" "${GENOME_ASSEMBLY}" -gene -refseq -cpu "${NUM_CPUS}" > "${OUTPUT_FILE}"
Raw Source Text
Paired end reads from all libraries were mapped to both the human (hg19, GRCh37) and chimpanzee (panTro3, CGSC 2.1.3) genomes using STAR (v2.2.0c). Due to the lack of annotation in the chimpanzee genome, human gene models (RefSeq) were used to quantify gene expression. To avoid bias introduced by genome insertions and deletions, only reads mapping to both the human and chimpanzee genomes uniquely were used from each sample when comparing gene expression values (~4% of reads mapped to only one genome per sample). To calculate gene expression, read counts in the exons of RefSeq transcripts where calculated using HOMER Genome_build: GRCh37, panTro3 Genome_build: Pan_troglodytes-2.1.3 Supplementary_files_format_and_content: File "rpkm.txt.gz" (linked as a supplementary file to the Series record) is a tab-delimited text file with RPKM values from all experiments and gene annotation information for RefSeq protein-coding genes (hg19 assembly)