GSE138197 Processing Pipeline
RNA-Seq
code_examples
6 steps
Publication
Skipper analysis of eCLIP datasets enables sensitive detection of constrained translation factor binding sites.Cell genomics (2023) — PMID 37388912
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
map to HG19 with WASP output mode.
$ Bash example
# Install STAR if not already available # conda install -c bioconda star # Define variables # Replace with actual paths to your HG19 STAR index and input FASTQ files GENOME_DIR="/path/to/hg19_star_index" # Pre-built STAR index for HG19 READ1="input_R1.fastq.gz" READ2="input_R2.fastq.gz" # Adjust if single-end reads, remove READ2 and use only READ1 OUTPUT_PREFIX="aligned_hg19_wasp_" THREADS=8 # Adjust as needed, typically based on available CPU cores # Note: The STAR genome index for HG19 must be pre-built. # Example command to build index (requires HG19 FASTA and GTF/GFF annotation): # STAR --runThreadN ${THREADS} --runMode genomeGenerate --genomeDir ${GENOME_DIR} \ # --genomeFastaFiles /path/to/hg19.fa --sjdbGTFfile /path/to/hg19.gtf # Run STAR alignment with WASP-compatible output parameters # The '--outSAMattrIHstart 0' parameter is crucial for WASP compatibility. STAR --runThreadN ${THREADS} \ --genomeDir ${GENOME_DIR} \ --readFilesIn ${READ1} ${READ2} \ --readFilesCommand zcat \ --outFileNamePrefix ${OUTPUT_PREFIX} \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes All \ --outSAMattrIHstart 0 \ --outFilterMultimapNmax 20 \ --outFilterMismatchNmax 999 \ --outFilterMismatchNoverLmax 0.04 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --alignSJoverhangMin 8 \ --alignSJDBoverhangMin 1 \ --sjdbScore 1 \ --limitBAMsortRAM 30000000000 # Adjust RAM as needed (e.g., 30GB) # The output BAM file (${OUTPUT_PREFIX}Aligned.sortedByCoordinate.out.bam) # can then be used as input for the WASP tool for allele-specific analysis. # WASP GitHub: https://github.com/bmvdgeijn/WASP -
2
Remove reads not passing WASP multimapping.
WASP vlatest$ Bash example
# Install WASP (if not already installed) # git clone https://github.com/bmv45/WASP.git # cd WASP # pip install . # Define input and output files # INPUT_BAM: Aligned BAM file (e.g., from STAR, HISAT2) that needs multimapping reads filtered by WASP # OUTPUT_BAM: Filtered BAM file with multimapping reads resolved or removed by WASP # SNPS_VCF: VCF file containing known SNPs for the reference genome (e.g., dbSNP for hg38) # REFERENCE_FASTA: Reference genome FASTA file (e.g., GRCh38.p14.genome.fa) INPUT_BAM="path/to/aligned_reads.bam" OUTPUT_BAM="path/to/wasp_filtered_reads.bam" SNPS_VCF="path/to/dbSNP_GRCh38.vcf.gz" # Placeholder: Use an appropriate VCF for your organism/assembly REFERENCE_FASTA="path/to/GRCh38.p14.genome.fa" # Placeholder: Use your reference genome FASTA # Ensure WASP scripts are accessible (e.g., in PATH or specify full path) # This command uses WASP's filter_multimapping_reads.py script to remove reads # that do not pass WASP's criteria for allele-specific multimapping resolution. # It requires a sorted and indexed BAM file and a VCF file with known SNPs. python WASP/WASP_scripts/filter_multimapping_reads.py \ --input_bam "${INPUT_BAM}" \ --output_bam "${OUTPUT_BAM}" \ --snp_vcf "${SNPS_VCF}" \ --reference_fasta "${REFERENCE_FASTA}" \ --log_file "wasp_filter.log" -
3
Clean for mispriming by removing reads with 6 A's upstream of 7 of 10 As in upstream sequence.
$ Bash example
# Clone the eclip repository to get the script. This script is part of the Yeo lab eCLIP workflow. # git clone https://github.com/yeolab/eclip.git # cd eclip/scripts # Define input and output file paths INPUT_R1="input_R1.fastq.gz" INPUT_R2="input_R2.fastq.gz" # Assuming paired-end reads based on eCLIP workflow OUTPUT_R1="output_R1.mispriming_filtered.fastq.gz" OUTPUT_R2="output_R2.mispriming_filtered.fastq.gz" # Path to the filter_mispriming.py script # Replace '/path/to/eclip/scripts' with the actual path to the scripts directory within your cloned eclip repository. ECLIP_SCRIPTS_DIR="/path/to/eclip/scripts" # Execute the mispriming filtering script # The description "7 of 10 As" directly maps to --polyA_threshold 0.7 and --polyA_window 10. # The phrase "6 A's upstream" is interpreted as a descriptive characteristic of the mispriming event # targeted by the filter, rather than a distinct, directly mappable parameter in the script. # The script checks for poly-A stretches within an 'upstream_window' from the 5' end of the read. # The default 'upstream_window' of 20 is commonly used in eCLIP pipelines. python "${ECLIP_SCRIPTS_DIR}/filter_mispriming.py" \ -r1 "${INPUT_R1}" \ -r2 "${INPUT_R2}" \ -o1 "${OUTPUT_R1}" \ -o2 "${OUTPUT_R2}" \ --polyA_threshold 0.7 \ --polyA_window 10 \ --upstream_window 20 -
4
Merge all BAM files and call peaks with inhouse scripts (avaialbe on github)
clipper (Inferred with models/gemini-2.5-flash) vlatest (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install samtools if not already installed # conda install -c bioconda samtools # Install clipper (assuming it's a Python script and dependencies are met) # git clone https://github.com/yeolab/clipper.git # cd clipper # pip install -r requirements.txt # if any # Define input BAM files (replace with actual paths to your BAM files) INPUT_BAMS="sample_1.bam sample_2.bam sample_3.bam" # Define output merged BAM file MERGED_BAM="merged_input.bam" # Define reference genome (e.g., hg38, mm10). This is a placeholder. # clipper expects a genome name that it can resolve internally or via configuration. GENOME_NAME="hg38" # Define output prefix for clipper peak files OUTPUT_PREFIX="clipper_peaks" # 1. Merge all BAM files echo "Merging BAM files: $INPUT_BAMS into $MERGED_BAM" samtools merge -@ 8 "$MERGED_BAM" $INPUT_BAMS # Index the merged BAM file (recommended for downstream tools) samtools index "$MERGED_BAM" # 2. Call peaks with clipper echo "Calling peaks with clipper on $MERGED_BAM for genome $GENOME_NAME" # Assuming clipper.py is in your PATH or specified with its full path # Common parameters: -b for input BAM, -s for genome name, -o for output prefix. # Other parameters like -f (FDR), -w (window size) can be added if specified. python /path/to/clipper/clipper.py -b "$MERGED_BAM" -s "$GENOME_NAME" -o "$OUTPUT_PREFIX"
-
5
Assign PAS to genes using RefSeq annoations
$ Bash example
# Install bedtools if not already installed # conda install -c bioconda bedtools # Placeholder for identified Polyadenylation Sites (PAS) in BED format. # This file would typically be generated by an upstream step (e.g., from RNA-seq or 3' end sequencing data). PAS_FILE="identified_pas.bed" # --- Reference RefSeq gene annotations for hg38 --- # Download RefSeq gene annotations from UCSC Table Browser (refGene table). # This file contains gene models including exons, introns, UTRs. # We will extract gene body coordinates from it. REFSEQ_GENES_TXT="refGene.txt" REFSEQ_GENES_URL="http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz" # Download and decompress refGene.txt if not present. # if [ ! -f "${REFSEQ_GENES_TXT}" ]; then # wget -O "${REFSEQ_GENES_TXT}.gz" "${REFSEQ_GENES_URL}" # gunzip "${REFSEQ_GENES_TXT}.gz" # fi # Convert refGene.txt to a BED file containing gene body coordinates. # Columns in refGene.txt: bin, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, id, name2, cdsStartStat, cdsEndStat, exonFrames # We extract: chrom ($3), txStart ($5), txEnd ($6), name2 (gene symbol, $13), score (0), strand ($4). # The 'txStart' and 'txEnd' define the transcript (gene) boundaries. REFSEQ_GENES_BED="refseq_genes_hg38_gene_bodies.bed" # awk '{print $3"\t"$5"\t"$6"\t"$13"\t0\t"$4}' "${REFSEQ_GENES_TXT}" | sort -k1,1 -k2,2n > "${REFSEQ_GENES_BED}" # Ensure the reference genome assembly is specified (e.g., hg38). GENOME_ASSEMBLY="hg38" # Assign PAS to genes using bedtools intersect. # -a: input file A (PAS). # -b: input file B (RefSeq gene bodies). # -wao: Write the original A and B entries plus the number of overlapping bases. # If no overlap, B fields are set to NULL and overlap is 0. # -s: Force strandedness. Requires A and B to be on the same strand. # Output will contain PAS coordinates and overlapping gene information. bedtools intersect -a "${PAS_FILE}" -b "${REFSEQ_GENES_BED}" -wao -s > pas_assigned_to_genes.bed # Further processing might be needed to determine specific assignment logic # (e.g., within 3' UTR, upstream, downstream) based on the raw intersection output. -
6
Remove PAS used on average <5% in total or nuclear individuals
Custom Python script (Inferred with models/gemini-2.5-flash) v1.0 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# This is a placeholder for a custom Python script that would perform the filtering. # The actual script would depend on the format of the input PAS usage file, which is assumed to be a TSV. # Example Python script (filter_pas.py): # import pandas as pd # # def filter_pas_by_usage(input_file, output_file, threshold=5.0, total_col='total_usage_percent', nuclear_col='nuclear_usage_percent'): # df = pd.read_csv(input_file, sep='\t') # # Filter rows where either total_usage_percent or nuclear_usage_percent is >= threshold # # The description states "Remove PAS used on average <5%", meaning we keep PAS used >=5%. # filtered_df = df[(df[total_col] >= threshold) | (df[nuclear_col] >= threshold)] # filtered_df.to_csv(output_file, sep='\t', index=False) # # if __name__ == "__main__": # import argparse # parser = argparse.ArgumentParser(description="Filter PAS based on total or nuclear usage percentage.") # parser.add_argument("-i", "--input", required=True, help="Input TSV file with PAS usage data.") # parser.add_argument("-o", "--output", required=True, help="Output TSV file for filtered PAS data.") # parser.add_argument("-t", "--threshold", type=float, default=5.0, help="Minimum usage percentage (total or nuclear) to keep a PAS.") # parser.add_argument("--total_col", default="total_usage_percent", help="Column name for total usage percentage.") # parser.add_argument("--nuclear_col", default="nuclear_usage_percent", help="Column name for nuclear usage percentage.") # args = parser.parse_args() # filter_pas_by_usage(args.input, args.output, args.threshold, args.total_col, args.nuclear_col) # Execution command: # Assuming 'pas_usage_data.tsv' is the input file containing PAS usage percentages # and 'filtered_pas_data.tsv' will be the output file. python filter_pas.py -i pas_usage_data.tsv -o filtered_pas_data.tsv -t 5.0 --total_col total_usage_percent --nuclear_col nuclear_usage_percent
Raw Source Text
map to HG19 with WASP output mode. Remove reads not passing WASP multimapping. Clean for mispriming by removing reads with 6 A's upstream of 7 of 10 As in upstream sequence. Merge all BAM files and call peaks with inhouse scripts (avaialbe on github) Assign PAS to genes using RefSeq annoations Remove PAS used on average <5% in total or nuclear individuals Genome_build: GRCh37 Supplementary_files_format_and_content: APAPAS_GeneLocAnno.5perc.bed: bed file with location of each PAS used at 5% in the total or nuclear fraction. Name column has peak unique number, gene, and PAS genic location. End corresponds to 5kb downstream of annotated 3' UTR Supplementary_files_format_and_content: MetaDataSequencing.txt: file with relevant cell collection and sequencing metadata