GSE178259 Processing Pipeline

RNA-Seq code_examples 6 steps

Publication

Non-microRNA binding competitively inhibits LIN28 regulation.

Cell reports (2021) — PMID 34380031

Dataset

Non-microRNA Binding Competitively Inhibits LIN28 Regulation

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Raw sequencing reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt.

cutadapt v1.18 GitHub

$ Bash example

# Install cutadapt (e.g., using conda)
# conda install -c bioconda cutadapt=1.18

# Define input and output files
INPUT_FASTQ="raw_reads.fastq.gz"
OUTPUT_FASTQ="trimmed_reads.fastq.gz"
REPORT_FILE="cutadapt_report.txt"

# Define adapter sequences
# Illumina TruSeq 3' adapter (common for eCLIP)
ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
# Placeholder for 5' barcode (e.g., 10 random Ns for UMI/randomer in eCLIP)
# Adjust this based on the specific eCLIP library preparation protocol
ADAPTER_5PRIME_BARCODE="N{10}" 

# Execute cutadapt
cutadapt \
  -a "${ADAPTER_3PRIME}" \
  -g "${ADAPTER_5PRIME_BARCODE}" \
  -o "${OUTPUT_FASTQ}" \
  --minimum-length 18 \
  --quality-cutoff 20 \
  --error-rate 0.1 \
  --cores 8 \
  --discard-untrimmed \
  --max-n 0 \
  --trim-n \
  "${INPUT_FASTQ}" > "${REPORT_FILE}" 2>&1

View on GitHub

Trimmed reads were mapped against RepBase to remove reads mapping to repetitive sequences.

STAR (Inferred with models/gemini-2.5-flash) v2.5.2a GitHub

$ Bash example

# Install STAR and samtools if not available
# conda install -c bioconda star samtools

# Placeholder for RepBase STAR index directory.
# This directory should contain the Genome.fa, SA, SAindex, etc. files for RepBase.
REPBASE_STAR_INDEX="/path/to/repbase_star_index"

# Placeholder for input trimmed reads (e.g., single-end or R1 for paired-end)
TRIMMED_READS_R1="trimmed_reads_R1.fastq.gz"
# If paired-end, also define TRIMMED_READS_R2
# TRIMMED_READS_R2="trimmed_reads_R2.fastq.gz"

# Placeholder for output prefix for STAR alignment files
STAR_OUTPUT_PREFIX="repbase_alignment"

# 1. Align trimmed reads to RepBase using STAR.
# Parameters are chosen to be permissive for repetitive sequences and consistent with eCLIP workflows.
STAR --genomeDir "${REPBASE_STAR_INDEX}" \
     --readFilesIn "${TRIMMED_READS_R1}" \
     # If paired-end, uncomment the line below and add R2
     # --readFilesIn "${TRIMMED_READS_R1}" "${TRIMMED_READS_R2}" \
     --outFileNamePrefix "${STAR_OUTPUT_PREFIX}" \
     --outSAMtype BAM Unsorted \
     --outSAMunmapped Within \
     --outFilterMultimapNmax 100 \
     --outFilterMismatchNmax 10 \
     --outFilterScoreMinOverLread 0.66 \
     --outFilterMatchNminOverLread 0.66 \
     --runThreadN 8 # Adjust number of threads as appropriate for your system

# 2. Filter out reads that mapped to RepBase (keep only unmapped reads).
# The flag '-f 4' selects reads where the 'read unmapped' flag is set.
# The output 'non_repetitive_reads.bam' will contain reads that did not map to RepBase.
samtools view -f 4 -b "${STAR_OUTPUT_PREFIX}Aligned.out.bam" > non_repetitive_reads.bam

# Optional: Convert the filtered BAM back to FASTQ if subsequent steps require FASTQ input
# samtools fastq -n -F 4 "${STAR_OUTPUT_PREFIX}Aligned.out.bam" > non_repetitive_reads.fastq.gz

View on GitHub

Remaining reads were mapped to the appropriate genome build using STAR aligner

STAR v2.7.10a GitHub

$ Bash example

# Install STAR (example using conda)
# conda create -n star_env star=2.7.10a -c bioconda -c conda-forge
# conda activate star_env

# Define variables
READS="remaining_reads.fastq.gz" # Placeholder for input reads
GENOME_DIR="STAR_genome_index/hg38" # Placeholder for STAR genome index (e.g., for human hg38)
OUTPUT_PREFIX="aligned_STAR_"
THREADS=8 # Number of threads to use

# --- Reference Data Setup (run once per genome) ---
# To generate the STAR genome index, you would typically use:
# STAR --runThreadN ${THREADS} --runMode genomeGenerate \
#      --genomeDir ${GENOME_DIR} \
#      --genomeFastaFiles /path/to/hg38.fa \
#      --sjdbGTFfile /path/to/gencode.v38.annotation.gtf \
#      --sjdbOverhang 100 # Recommended: (ReadLength - 1)
# Reference genome (e.g., hg38.fa) can be obtained from UCSC, Ensembl, or NCBI.
# Annotation GTF (e.g., gencode.v38.annotation.gtf) can be obtained from GENCODE.

# Map reads to the genome
STAR --runThreadN ${THREADS} \
     --genomeDir ${GENOME_DIR} \
     --readFilesIn ${READS} \
     --readFilesCommand zcat \
     --outFileNamePrefix ${OUTPUT_PREFIX} \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes Standard \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverLmax 0.1 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --sjdbScore 1

View on GitHub

RPKM values were calculated for all polysome samples.

RSEM (Inferred with models/gemini-2.5-flash) v1.3.3 GitHub

$ Bash example

# Install RSEM (if not already installed)
# conda install -c bioconda rsem

# --- Reference Preparation (run once) ---
# This step builds the RSEM reference index from a genome FASTA and gene annotation GTF file.
# Replace 'genome.fa' with your reference genome FASTA file (e.g., hg38.fa).
# Replace 'genes.gtf' with your gene annotation GTF file (e.g., gencode.v38.annotation.gtf).
# 'rsem_reference_index' will be the prefix for the generated index files.
# rsem-prepare-reference --gtf genes.gtf genome.fa rsem_reference_index

# --- RPKM Calculation for a single polysome sample ---
# This command calculates RPKM (among other metrics) for a single sample.
# For multiple polysome samples, this command would typically be run in a loop or a pipeline.
# Replace 'aligned_reads.bam' with the path to your input BAM file (e.g., from STAR alignment).
# Replace 'rsem_reference_index' with the path to your prepared RSEM reference index (e.g., 'rsem_reference_index').
# Replace 'sample_output_prefix' with a desired prefix for output files (e.g., 'polysome_sample1').
# The '--strandedness reverse' parameter is common for many RNA-seq library preparations (e.g., Illumina TruSeq stranded).
# Adjust to 'forward' or 'none' based on your library preparation protocol.

rsem-calculate-expression \
--bam \
--no-qualities \
-p 8 \
--strandedness reverse \
aligned_reads.bam \
rsem_reference_index \
sample_output_prefix

# The RPKM values will be found in the 'sample_output_prefix.genes.results' file (e.g., 'polysome_sample1.genes.results').

View on GitHub

For eCLIP samples, read densities were calculated to identify eCLIP peaks.

eCLIP vfrom source GitHub

$ Bash example

# Install dependencies for clipper
# pip install numpy scipy pysam

# Placeholder for genome sizes file (e.g., hg38.chrom.sizes)
# This file can be generated from a FASTA file using samtools faidx and awk,
# or downloaded from UCSC (e.g., http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes)
# Example:
# samtools faidx hg38.fa
# cut -f1,2 hg38.fa.fai > hg38.chrom.sizes

# Run clipper to identify eCLIP peaks
# -b: Input BAM file (aligned reads)
# -s: Genome sizes file
# -o: Output prefix for peak files
python clipper.py -b sample_eclip.bam -s hg38.chrom.sizes -o sample_eclip_peaks

View on GitHub

eclip data processing pipeline can be requested from the following link : https://github.com/YeoLab/eclip

eCLIP vLatest from GitHub repository (https://github.com/YeoLab/eclip) GitHub

$ Bash example

# Install cwltool (if not already installed)
# pip install cwltool

# Clone the YeoLab eCLIP CWL workflow repository
git clone https://github.com/YeoLab/eclip.git
cd eclip # Navigate into the cloned repository directory

# --- Placeholder Input Data ---
# Replace with actual FASTQ files for your experiment
# Assuming single-end reads for demonstration, as paired-end is optional in the CWL
mkdir -p data/fastq
echo "fake_read1_content" > data/fastq/sample_R1.fastq.gz

# --- Placeholder Reference Data (hg38) ---
# Download or specify paths to your reference genome files
# These are examples for human hg38, replace with actual paths
mkdir -p reference/hg38
# Example: Download from UCSC/GENCODE
# wget -P reference/hg38 https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
# gunzip reference/hg38/hg38.fa.gz
# mv reference/hg38/hg38.fa reference/hg38/GRCh38.p14.genome.fa
# wget -P reference/hg38 http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/gencode.v45.annotation.gtf.gz
# gunzip reference/hg38/gencode.v45.annotation.gtf.gz
# wget -P reference/hg38 https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
# wget -P reference/hg38 https://raw.githubusercontent.com/Boyle-Lab/Blacklist/master/lists/hg38-blacklist.v2.bed

# Create a dummy STAR index directory
mkdir -p reference/hg38/STAR_index
# In a real scenario, you would generate this using STAR --runMode genomeGenerate
# STAR --runMode genomeGenerate --genomeDir reference/hg38/STAR_index --genomeFastaFiles reference/hg38/GRCh38.p14.genome.fa --sjdbGTFfile reference/hg38/gencode.v45.annotation.gtf --sjdbOverhang 99 --runThreadN 8

# --- Create CWL input YAML file ---
# This file specifies all inputs for the eCLIP CWL workflow
cat << EOF > eclip_inputs.yaml
fastq_file_r1:
  class: File
  path: data/fastq/sample_R1.fastq.gz
# fastq_file_r2: # Uncomment and provide path for paired-end reads
#   class: File
#   path: data/fastq/sample_R2.fastq.gz
genome_fasta:
  class: File
  path: reference/hg38/GRCh38.p14.genome.fa
genome_gtf:
  class: File
  path: reference/hg38/gencode.v45.annotation.gtf
star_index:
  class: Directory
  path: reference/hg38/STAR_index
chrom_sizes:
  class: File
  path: reference/hg38/hg38.chrom.sizes
blacklist_bed:
  class: File
  path: reference/hg38/hg38-blacklist.v2.bed
# Optional parameters for CLIPper (using default values from the CWL or common practice)
min_read_count: 10
min_peak_length: 10
max_peak_length: 500
p_threshold: 0.01
# Add other parameters as needed, e.g., for adapter trimming, quality filtering, etc.
EOF

# --- Execute the eCLIP CWL workflow ---
# The eclip.cwl file is in the current directory after 'cd eclip'
cwltool eclip.cwl eclip_inputs.yaml --outdir eclip_output

View on GitHub

Tools Used

STAR eCLIP

Raw Source Text

Raw sequencing reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt. Trimmed reads were mapped against RepBase to remove reads mapping to repetitive sequences. Remaining reads were mapped to the appropriate genome build using STAR aligner
RPKM values were calculated for all polysome samples. For eCLIP samples, read densities were calculated to identify eCLIP peaks.
eclip data processing pipeline can be requested from the following link : https://github.com/YeoLab/eclip
Genome_build: mm10 and hg19
Supplementary_files_format_and_content: RPKM files for all polysome samples. eCLIP input normalized peaks in BED format. eCLIP reproduclible peaks in BED format. eCLIP normalized read densitied in BIGWIG format.

← Back to Analysis