GSE157467 Processing Pipeline

RNA-Seq code_examples 6 steps

Publication

Persistent mRNA localization defects and cell death in ALS neurons caused by transient cellular stress.

Cell reports (2021) — PMID 34496257

Dataset

Persistent mRNA localization defects and cell death in ALS neurons caused by transient cellular stress

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Fastq files were trimmed for adapters using cutadapt

cutadapt v4.0 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install cutadapt (example using conda)
# conda install -c bioconda cutadapt

# Define input and output file names
INPUT_R1="input_R1.fastq.gz"
INPUT_R2="input_R2.fastq.gz"
OUTPUT_R1="output_R1_trimmed.fastq.gz"
OUTPUT_R2="output_R2_trimmed.fastq.gz"
REPORT="cutadapt_report.txt"

# Define adapter sequences (replace with actual sequences if known)
# Example: Illumina universal adapter for Read 1 (3' end)
ADAPTER_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
# Example: Illumina universal adapter for Read 2 (3' end)
ADAPTER_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" # Or reverse complement of R1 adapter, depending on library prep

# Execute cutadapt for paired-end trimming
# -a ADAPTER_R1: 3' adapter for Read 1
# -A ADAPTER_R2: 3' adapter for Read 2
# -q 20,20: Trim low-quality bases from 5' and 3' ends (quality threshold 20)
# --minimum-length 20: Discard reads shorter than 20 bp after trimming
# -o: Output file for Read 1
# -p: Output file for Read 2
# --cores: Number of CPU cores to use (if available in version)
# --report=file: Write a detailed report to a file
cutadapt -a "${ADAPTER_R1}" -A "${ADAPTER_R2}" \
         -q 20,20 --minimum-length 20 \
         -o "${OUTPUT_R1}" -p "${OUTPUT_R2}" \
         "${INPUT_R1}" "${INPUT_R2}" > "${REPORT}" 2>&1

View on GitHub

Trimmed reads were aligned to the human genome build hg19 using STAR aligner

STAR v2.7.10a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
GENOME_DIR="/path/to/star_index/hg19" # Path to the pre-built STAR genome index for hg19
READS_R1="trimmed_reads_R1.fastq.gz" # Placeholder for trimmed forward reads
READS_R2="trimmed_reads_R2.fastq.gz" # Placeholder for trimmed reverse reads (if paired-end)
OUTPUT_PREFIX="star_aligned" # Prefix for output files
THREADS=8 # Number of threads to use

# --- Genome Index Generation (if not already done) ---
# To build the hg19 STAR index, you would typically need:
# 1. The hg19 primary assembly FASTA file (e.g., from UCSC: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz)
# 2. The hg19 GTF annotation file (e.g., from GENCODE or UCSC: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/genes/hg19.ncbiRefSeq.gtf.gz)
#
# Example command to build index (uncomment and modify if needed):
# mkdir -p ${GENOME_DIR}
# STAR --runMode genomeGenerate \
#      --genomeDir ${GENOME_DIR} \
#      --genomeFastaFiles /path/to/hg19.fa \
#      --sjdbGTFfile /path/to/hg19.gtf \
#      --sjdbOverhang 100 \
#      --runThreadN ${THREADS}

# Perform STAR alignment
# The description implies trimmed reads, assuming paired-end for typical RNA-seq.
STAR --genomeDir ${GENOME_DIR} \
     --readFilesIn ${READS_R1} ${READS_R2} \
     --runThreadN ${THREADS} \
     --outFileNamePrefix ${OUTPUT_PREFIX}_ \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --outSAMattributes Standard \
     --quantMode GeneCounts # Optional: output gene counts with RSEM-like quantification

View on GitHub

seCLIP : Binding region peaks were identified based on read pileup densities (Clipper)

CLIPper vlatest (from source) GitHub

$ Bash example

# Clone the CLIPper repository
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# CLIPper is a Python script, ensure Python 3 is available.
# Common Python dependencies include numpy, scipy, pysam.
# pip install numpy scipy pysam

# Placeholder for input BAM files and chromosome sizes
IP_BAM="path/to/your/ip_sample.bam"
CONTROL_BAM="path/to/your/control_sample.bam"
CHROM_SIZES="path/to/your/genome/hg38.chrom.sizes" # Example: from UCSC goldenPath (e.g., http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes)
OUTPUT_PREFIX="sample_binding_regions"

# Execute CLIPper to identify binding region peaks
# -b: IP BAM file
# -c: Control BAM file
# -s: Chromosome sizes file
# -o: Output prefix (CLIPper will append .bed)
# -p: P-value threshold (default is 0.01)
# -f: FDR threshold (default is 0.05)
# -v: Verbose output
# -t: Number of threads
python clipper.py \
    -b "${IP_BAM}" \
    -c "${CONTROL_BAM}" \
    -s "${CHROM_SIZES}" \
    -o "${OUTPUT_PREFIX}" \
    -p 0.01 \
    -f 0.05 \
    -v \
    -t 8

View on GitHub

seCLIP : Reproducible peaks were identified by comparing replicate read pileups along binding regions

merge_peaks (Inferred with models/gemini-2.5-flash) vlatest (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install idr (Irreproducible Discovery Rate tool), a dependency for merge_peaks
# conda install -c bioconda idr

# Clone the merge_peaks repository
# git clone https://github.com/yeolab/merge_peaks.git
# cd merge_peaks

# Example: Prepare a genome size file (e.g., for hg38). This file lists chromosome names and their lengths.
# You might need to generate this from a reference genome .fasta file or download it.
# For example, using samtools faidx:
# samtools faidx hg38.fa
# cut -f1,2 hg38.fa.fai > hg38.chrom.sizes

# Run the IDR analysis using run_idr.py from the merge_peaks pipeline.
# This command assumes you have already called peaks for two replicates (rep1_peaks.bed, rep2_peaks.bed)
# and a pooled sample (pooled_peaks.bed), typically in narrowPeak format.
python merge_peaks/run_idr.py \
  --peak_file_rep1 rep1_peaks.bed \
  --peak_file_rep2 rep2_peaks.bed \
  --peak_file_pooled pooled_peaks.bed \
  --output_prefix reproducible_peaks \
  --genome_size hg38.chrom.sizes \
  --idr_threshold 0.05 \
  --peak_type narrowPeak

View on GitHub

wiggle files

deepTools (Inferred with models/gemini-2.5-flash) v3.5.1 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install deepTools if not available
# conda install -c bioconda deeptools

# Example: Generate bigWig coverage file from a BAM file
# This command generates a bigWig file normalized by RPKM, with a bin size of 10 bp.
# Replace 'aligned_reads.bam' with your input BAM file.
# Replace 'coverage.bw' with your desired output bigWig file name.
# For effective genome size, refer to deepTools documentation or ENCODE guidelines for your specific organism/assembly.
# Example effective genome size for hg38 (from ENCODE): 2913022398

bamCoverage \
    -b aligned_reads.bam \
    -o coverage.bw \
    --normalizeUsingRPKM \
    --binSize 10 \
    --effectiveGenomeSize 2913022398 \
    --numberOfProcessors auto

View on GitHub

peak text files

clipper (Inferred with models/gemini-2.5-flash) v0.0.1 GitHub

$ Bash example

# Install clipper using Docker
# docker pull yeolab/clipper:0.0.1

# Define input files and parameters
IP_BAM="ip.bam" # Placeholder for IP BAM file (e.g., from eCLIP IP sample)
INPUT_BAM="input.bam" # Placeholder for input/control BAM file (e.g., from eCLIP input sample)
SPECIES="hg38" # Placeholder for species (e.g., hg38, mm10). Reference genome source: UCSC/Ensembl
OUTPUT_DIR="." # Output directory
OUTPUT_PREFIX="peaks" # Prefix for output peak files

# Run clipper for peak calling
docker run --rm -v "$(pwd):/data" yeolab/clipper:0.0.1 \
    --species "${SPECIES}" \
    --output-dir "/data/${OUTPUT_DIR}" \
    --output-prefix "${OUTPUT_PREFIX}" \
    "/data/${IP_BAM}" \
    "/data/${INPUT_BAM}"

View on GitHub

Tools Used

STAR

Raw Source Text

Fastq files were trimmed for adapters using cutadapt
Trimmed reads were aligned to the human genome build hg19 using STAR aligner
seCLIP : Binding region peaks were identified based on read pileup densities (Clipper)
seCLIP : Reproducible peaks were identified by comparing replicate read pileups along binding regions
Genome_build: hg19
Supplementary_files_format_and_content: normalized read counts text file
wiggle files
peak text files

← Back to Analysis