GSE78508 Processing Pipeline

RNA-Seq code_examples 5 steps

Publication

Enhanced CLIP Uncovers IMP Protein-RNA Targets in Human Pluripotent Stem Cells Important for Cell Adhesion and Survival.

Cell reports (2016) — PMID 27068461

Dataset

GSE78508

Enhanced CLIP uncovers IMP protein-RNA targets in human pluripotent stem cells important for cell adhesion and survival [RNA-Seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps


RNA-seq libraries were first trimmed of polyA tails, adapters, and low-quality ends using cutadapt with parameters --match-read-wildcards --times 2 -e 0 -O 5 --quality-cutoff 6 -m 18 -b TCGTATGCCGTCTTCTGCTTG -b ATCTCGTATGCCGTCTTCTGCTTG -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC -b TGGAATTCTCGGGTGCCAAGG -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT.

cutadapt v4.1

$ Bash example

# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt

# Define input and output file names (placeholders)
# For paired-end reads, you would typically run cutadapt on both R1 and R2, 
# or use the -p option for paired-end trimming.
# Assuming single-end for this example based on the description.
INPUT_FASTQ="input.fastq.gz"
OUTPUT_FASTQ="output.trimmed.fastq.gz"

# Run cutadapt to trim adapters, polyA/T tails, and low-quality ends
cutadapt \
  --match-read-wildcards \
  --times 2 \
  -e 0 \
  -O 5 \
  --quality-cutoff 6 \
  -m 18 \
  -b TCGTATGCCGTCTTCTGCTTG \
  -b ATCTCGTATGCCGTCTTCTGCTTG \
  -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC \
  -b TGGAATTCTCGGGTGCCAAGG \
  -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA \
  -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT \
  -o "${OUTPUT_FASTQ}" \
  "${INPUT_FASTQ}"
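
Because -m 18 discards reads shorter than 18 nt after trimming, one quick sanity check is to confirm the shortest read left in the trimmed output. The following is a minimal sketch, not part of the published pipeline: the helper name min_read_length is hypothetical, and it assumes strict four-line FASTQ records, with gzip input detected by the .gz suffix.

```shell
# Hypothetical helper: print the length of the shortest read in a FASTQ file.
# Assumes four-line FASTQ records; prints 0 if the file contains no reads.
min_read_length() {
  fastq="$1"
  case "${fastq}" in
    *.gz) gzip -dc "${fastq}" ;;   # decompress gzipped input on the fly
    *)    cat "${fastq}" ;;
  esac | awk '
    NR % 4 == 2 {                  # sequence lines are line 2 of each record
      n = length($0)
      if (!seen || n < min) { min = n; seen = 1 }
    }
    END { if (seen) print min; else print 0 }
  '
}
```

For example, min_read_length output.trimmed.fastq.gz should report a value of at least 18 if the length filter was applied as described.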


Reads were then mapped against a database of repetitive elements derived from RepBase18.05.

bowtie2 v2.5.0 (tool and version inferred with models/gemini-2.5-flash; the source text names Bowtie 1.0.0 for this mapping step)

$ Bash example

# Install bowtie2 (if not already installed)
# conda install -c bioconda bowtie2

# Placeholder for RepBase 18.05 sequences
# In a real scenario, you would download or generate a FASTA file
# containing the repetitive elements from RepBase 18.05.
# For example, from a source like UCSC genome browser's repeat masker track,
# or directly from RepBase (which might require a license).
# Let's assume 'repbase_18.05.fasta' is available in the working directory.
# Example: wget -O repbase_18.05.fasta "http://some_url_to_repbase_18.05.fasta"

# Build bowtie2 index for repetitive elements
# The index prefix will be 'repbase_18.05_index'
bowtie2-build repbase_18.05.fasta repbase_18.05_index

# Define input reads (replace with actual file names)
# Assuming single-end reads for simplicity, adjust for paired-end if necessary.
INPUT_READS="input_reads.fastq.gz"

# Map reads against the repetitive elements database
# -x: index prefix
# -U: unpaired (single-end) input reads
# -S: output SAM file
# --very-sensitive-local: sensitive local-alignment preset (an assumption; the source does not specify bowtie2 options)
bowtie2 -x repbase_18.05_index \
        -U "${INPUT_READS}" \
        --very-sensitive-local \
        -S mapped_to_repbase_18.05.sam

# Optional: Convert SAM to BAM and sort for downstream analysis
# samtools view -bS mapped_to_repbase_18.05.sam | samtools sort -o mapped_to_repbase_18.05.bam
# samtools index mapped_to_repbase_18.05.bam


Bowtie version 1.0.0 with parameters -S -q -p 16 -e 100 -l 20 was used to align reads against an index generated from Repbase sequences (Langmead et al., 2009).

Bowtie v1.0.0

$ Bash example

# Install Bowtie (if not already installed)
# conda install -c bioconda bowtie

# Placeholder for the Repbase index prefix (e.g., generated by bowtie-build)
# Replace 'repbase_index' with the actual path and prefix of your Bowtie index
REPBASE_INDEX="repbase_index"

# Placeholder for the input reads file (e.g., FASTQ format)
# Replace 'reads.fastq' with the actual path to your input reads file
INPUT_READS="reads.fastq"

# Placeholder for the output SAM file
# Replace 'output.sam' with the desired path for the output SAM file
OUTPUT_SAM="output.sam"

# Align reads using Bowtie version 1.0.0 with specified parameters
bowtie -S -q -p 16 -e 100 -l 20 "${REPBASE_INDEX}" "${INPUT_READS}" > "${OUTPUT_SAM}"
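
The next step aligns only the reads that did not map to Repbase, so those reads have to be pulled out of the Bowtie SAM output first. The source does not say how this was done. The sketch below is one possible approach using only awk: it assumes single-end reads, and the function name sam_unmapped_to_fastq is hypothetical. Bowtie's --un option or samtools could serve the same purpose.

```shell
# Hypothetical helper: write unmapped reads from a SAM file to stdout as FASTQ.
# Assumes single-end reads; a read is unmapped when FLAG bit 0x4 is set.
sam_unmapped_to_fastq() {
  awk -F '\t' '
    /^@/ { next }                  # skip SAM header lines
    int($2 / 4) % 2 == 1 {         # test FLAG bit 0x4 without bitwise operators
      printf "@%s\n%s\n+\n%s\n", $1, $10, $11
    }
  ' "$1"
}
```

A possible invocation, using the placeholder file names from the surrounding examples, would be sam_unmapped_to_fastq output.sam > reads_not_mapped_to_repbase.fastq.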


Reads not mapped to Repbase sequences were aligned to the hg19 human genome (UCSC assembly) using STAR (Dobin et al., 2013) version 2.3.0e with parameters --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1.

STAR v2.3.0e

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Placeholder for STAR genome index directory for hg19
# To generate the index, you would typically run:
# STAR --runMode genomeGenerate --genomeDir /path/to/STAR_index/hg19 --genomeFastaFiles hg19.fa --sjdbGTFfile genes.gtf --runThreadN <num_threads>
genome_index_dir="/path/to/STAR_index/hg19"

# Placeholder for input reads (e.g., FASTQ file of reads not mapped to Repbase)
# This assumes the input reads have already been filtered as described.
input_reads_fastq="reads_not_mapped_to_repbase.fastq"

# Placeholder for output file prefix
output_prefix="aligned_to_hg19_"

# Align reads to the hg19 human genome using STAR
STAR \
  --genomeDir "${genome_index_dir}" \
  --readFilesIn "${input_reads_fastq}" \
  --outFileNamePrefix "${output_prefix}" \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 1 \
  --outFilterMultimapScoreRange 1 \
  --runThreadN 8 # Adjust number of threads as appropriate for your system
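
Because --outSAMunmapped Within keeps unmapped reads in the main SAM output, the split between mapped and unmapped reads can be tallied directly from the FLAG column. This is a minimal sketch for checking the result, not a step described in the source. The function name sam_mapping_summary is hypothetical, and it assumes single-end reads.

```shell
# Hypothetical helper: count mapped and unmapped reads in a SAM file.
# Secondary alignments (FLAG bit 0x100) are skipped so each read counts once.
sam_mapping_summary() {
  awk -F '\t' '
    /^@/ { next }                              # skip SAM header lines
    int($2 / 256) % 2 == 1 { next }            # skip secondary alignments
    int($2 / 4) % 2 == 1 { unmapped++; next }  # FLAG bit 0x4: unmapped
    { mapped++ }
    END { printf "mapped\t%d\nunmapped\t%d\n", mapped, unmapped }
  ' "$1"
}
```

With the output prefix used above, STAR writes its alignments to aligned_to_hg19_Aligned.out.sam, so sam_mapping_summary aligned_to_hg19_Aligned.out.sam would print the two counts.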


Counts of reads for each gene annotated in GENCODE v17 were calculated with featureCounts.

featureCounts v2.0.3 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install featureCounts (part of Subread package)
# conda install -c bioconda subread

# Download and decompress the GENCODE v17 GTF file (GRCh37/hg19, UCSC-style "chr" chromosome names)
# The URL follows the GENCODE FTP layout; verify it before use.
# mkdir -p reference
# wget -O reference/gencode.v17.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_17/gencode.v17.annotation.gtf.gz
# gunzip reference/gencode.v17.annotation.gtf.gz

# Placeholder for input alignment files (SAM or BAM)
# Replace with actual file paths, e.g., "sample1.bam sample2.bam"
INPUT_BAM_FILES="input_aligned_reads_1.bam input_aligned_reads_2.bam"

# Path to the GTF annotation file
GTF_FILE="reference/gencode.v17.annotation.gtf"

# Output file for gene counts
OUTPUT_FILE="gene_counts.txt"

# Number of threads to use for parallel processing
NUM_THREADS=8

# Execute featureCounts
# -a: Annotation file (GTF/GFF)
# -o: Output file
# -F GTF: Specify GTF format for annotation file
# -t exon: Count features of type 'exon' (common for RNA-seq gene counting)
# -g gene_id: Group features by 'gene_id' attribute to summarize counts per gene
# -s 0: Unstranded (0), use -s 1 for forward stranded, -s 2 for reverse stranded
# -T: Number of threads
# Add -p if reads are paired-end (e.g., featureCounts ... -p ...)
featureCounts -a "${GTF_FILE}" -o "${OUTPUT_FILE}" -F GTF -t exon -g gene_id -s 0 -T "${NUM_THREADS}" ${INPUT_BAM_FILES}
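
The GEO record describes the supplementary file as a CSV count file, while featureCounts writes a tab-separated table. The sketch below shows one way to convert between the two. It is an assumption about how such a file could be produced, not the authors' documented method. The function name featurecounts_to_csv is hypothetical, and it assumes the usual featureCounts layout: a leading comment line, then a header with Geneid, Chr, Start, End, Strand, Length, followed by one count column per input file.

```shell
# Hypothetical helper: reduce featureCounts output to a CSV of gene IDs and counts.
# Keeps column 1 (Geneid) and columns 7 onward (one count column per sample).
featurecounts_to_csv() {
  awk -F '\t' -v OFS=',' '
    /^#/ { next }                  # skip the leading program/command comment line
    {
      line = $1
      for (i = 7; i <= NF; i++) line = line OFS $i
      print line
    }
  ' "$1"
}
```

For example, featurecounts_to_csv gene_counts.txt > gene_counts.csv would produce a header row followed by one row per gene.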

Tools Used

cutadapt, Bowtie, STAR, featureCounts

Raw Source Text

RNA-seq libraries were first trimmed of polyA tails, adapters, and low quality ends using cutadapt with parameters --match-read-wildcards --times 2 -e 0 -O 5 --quality-cutoff 6 -m 18 -b TCGTATGCCGTCTTCTGCTTG -b ATCTCGTATGCCGTCTTCTGCTTG -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC -b TGGAATTCTCGGGTGCCAAGG -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT.
Reads were then mapped against a database of repetitive elements derived from RepBase18.05. Bowtie version 1.0.0 with parameters -S -q -p 16 -e 100 -l 20 was used to align reads against an index generated from Repbase sequences (Langmead et al., 2009).
Reads not mapped to Repbase sequences were aligned to the hg19 human genome (UCSC assembly) using STAR (Dobin et al., 2013) version 2.3.0e with parameters --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1.
counts of reads for each gene annotated in gencode v17 were calculated from featureCounts
Genome_build: hg19
Supplementary_files_format_and_content: csv count file, contains counts of reads for each sample
