GSE78508 Processing Pipeline
Publication
Enhanced CLIP Uncovers IMP Protein-RNA Targets in Human Pluripotent Stem Cells Important for Cell Adhesion and Survival. Cell Reports (2016), PMID 27068461
Dataset
GSE78508: Enhanced CLIP uncovers IMP protein-RNA targets in human pluripotent stem cells important for cell adhesion and survival [RNA-Seq]
Processing Steps
1
RNA-seq libraries were first trimmed of polyA tails, adapters, and low-quality ends using cutadapt with parameters --match-read-wildcards --times 2 -e 0 -O 5 --quality-cutoff 6 -m 18 -b TCGTATGCCGTCTTCTGCTTG -b ATCTCGTATGCCGTCTTCTGCTTG -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC -b TGGAATTCTCGGGTGCCAAGG -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT.
$ Bash example
# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt

# Define input and output file names (placeholders)
# For paired-end reads, you would typically run cutadapt on both R1 and R2,
# or use the -p option for paired-end trimming.
# Assuming single-end for this example based on the description.
INPUT_FASTQ="input.fastq.gz"
OUTPUT_FASTQ="output.trimmed.fastq.gz"

# Run cutadapt to trim adapters, polyA/T tails, and low-quality ends
cutadapt \
  --match-read-wildcards \
  --times 2 \
  -e 0 \
  -O 5 \
  --quality-cutoff 6 \
  -m 18 \
  -b TCGTATGCCGTCTTCTGCTTG \
  -b ATCTCGTATGCCGTCTTCTGCTTG \
  -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC \
  -b TGGAATTCTCGGGTGCCAAGG \
  -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA \
  -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT \
  -o "${OUTPUT_FASTQ}" \
  "${INPUT_FASTQ}"
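Because trimming is run with -m 18, no read in the trimmed output should be shorter than 18 nt. The following is a minimal sanity-check sketch, not part of the published pipeline: `fastq_min_length` is a hypothetical helper name, and it assumes an uncompressed FASTQ stream with exactly four lines per record.

```shell
# Hypothetical helper (not from the source): report the number of reads and
# the shortest read length in a 4-line-per-record FASTQ stream on stdin.
fastq_min_length() {
  awk 'NR % 4 == 2 {                      # sequence line of each record
         n++
         len = length($0)
         if (n == 1 || len < min) min = len
       }
       END { printf "reads=%d min_len=%d\n", n, min }'
}

# Tiny inline demo with two made-up reads (20 nt and 18 nt)
printf '@r1\nACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIII\n@r2\nACGTACGTACGTACGTAC\n+\nIIIIIIIIIIIIIIIIII\n' \
  | fastq_min_length
# prints: reads=2 min_len=18

# On real data (file name is a placeholder):
# gzip -dc output.trimmed.fastq.gz | fastq_min_length
```

A reported min_len below 18 would suggest the -m 18 filter was not applied to the file being inspected.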
2
Reads were then mapped against a database of repetitive elements derived from RepBase 18.05.
Bowtie v1.0.0 (as stated in step 3)
$ Bash example
# Install Bowtie 1 (if not already installed)
# conda install -c bioconda bowtie

# The source text specifies Bowtie 1.0.0 for this mapping (see step 3),
# so this example builds a Bowtie 1 index; the alignment command itself
# is shown in step 3.

# Placeholder for the RepBase 18.05 sequences.
# RepBase may require a license; obtain a FASTA file of the repetitive
# elements yourself and save it as 'repbase_18.05.fasta' in the working directory.
REPBASE_FASTA="repbase_18.05.fasta"

# Placeholder index prefix (matches the prefix used in step 3)
REPBASE_INDEX="repbase_index"

# Build the Bowtie 1 index for the repetitive elements
bowtie-build "${REPBASE_FASTA}" "${REPBASE_INDEX}"
3
Bowtie version 1.0.0 with parameters -S -q -p 16 -e 100 -l 20 was used to align reads against an index generated from Repbase sequences (Langmead et al., 2009).
$ Bash example
# Install Bowtie (if not already installed)
# conda install -c bioconda bowtie

# Placeholder for the Repbase index prefix (e.g., generated by bowtie-build)
# Replace 'repbase_index' with the actual path and prefix of your Bowtie index
REPBASE_INDEX="repbase_index"

# Placeholder for the input reads file (e.g., FASTQ format)
# Replace 'reads.fastq' with the actual path to your input reads file
INPUT_READS="reads.fastq"

# Placeholder for the output SAM file
# Replace 'output.sam' with the desired path for the output SAM file
OUTPUT_SAM="output.sam"

# Align reads using Bowtie version 1.0.0 with specified parameters
bowtie -S -q -p 16 -e 100 -l 20 "${REPBASE_INDEX}" "${INPUT_READS}" > "${OUTPUT_SAM}"
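The next step aligns only the reads that did not map to Repbase, so those reads have to be recovered from the Bowtie output. The source does not say how this was done. One possible sketch, assuming the SAM file keeps unaligned reads as records with FLAG bit 0x4 set, is shown below; `sam_unmapped_to_fastq` is a hypothetical helper name.

```shell
# Hypothetical helper (not from the source): convert unmapped SAM records
# (FLAG bit 0x4 set) read from stdin back into FASTQ on stdout.
sam_unmapped_to_fastq() {
  awk -F'\t' '
    /^@/ { next }                 # skip SAM header lines
    int($2 / 4) % 2 == 1 {        # FLAG bit 0x4: read is unmapped
      printf "@%s\n%s\n+\n%s\n", $1, $10, $11
    }'
}

# Tiny inline demo: r1 is aligned (FLAG 0), r2 is unmapped (FLAG 4)
printf '@HD\tVN:1.0\tSO:unsorted\nr1\t0\tAluY\t5\t255\t4M\t*\t0\t0\tACGT\tIIII\nr2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tHHHH\n' \
  | sam_unmapped_to_fastq
# prints the single unmapped read r2 as a FASTQ record

# On real data (file names are placeholders):
# sam_unmapped_to_fastq < output.sam > reads_not_mapped_to_repbase.fastq
```

Bowtie 1 also provides a `--un <filename>` option that writes unaligned reads to a file during alignment, which may be simpler than post-processing the SAM output.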
4
Reads not mapped to Repbase sequences were aligned to the hg19 human genome (UCSC assembly) using STAR (Dobin et al., 2013) version 2.3.0e with parameters --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1.
$ Bash example
# Install STAR (if not already installed)
# conda install -c bioconda star

# Placeholder for STAR genome index directory for hg19
# To generate the index, you would typically run:
# STAR --runMode genomeGenerate --genomeDir /path/to/STAR_index/hg19 --genomeFastaFiles hg19.fa --sjdbGTFfile genes.gtf --runThreadN <num_threads>
genome_index_dir="/path/to/STAR_index/hg19"

# Placeholder for input reads (e.g., FASTQ file of reads not mapped to Repbase)
# This assumes the input reads have already been filtered as described.
input_reads_fastq="reads_not_mapped_to_repbase.fastq"

# Placeholder for output file prefix
output_prefix="aligned_to_hg19_"

# Align reads to the hg19 human genome using STAR
# Adjust --runThreadN as appropriate for your system
STAR \
  --genomeDir "${genome_index_dir}" \
  --readFilesIn "${input_reads_fastq}" \
  --outFileNamePrefix "${output_prefix}" \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 1 \
  --outFilterMultimapScoreRange 1 \
  --runThreadN 8
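With --outSAMunmapped Within, unmapped reads stay in the STAR output alongside the aligned ones, so a quick tally of the two groups is possible from the SAM file alone. The sketch below is illustrative only: `sam_mapping_summary` is a hypothetical helper name, and it counts SAM records rather than reads, which is equivalent only for single-end data with at most one alignment per read (as --outFilterMultimapNmax 1 implies).

```shell
# Hypothetical helper (not from the source): count mapped and unmapped
# SAM records read from stdin, using FLAG bit 0x4 to identify unmapped reads.
sam_mapping_summary() {
  awk -F'\t' '
    /^@/ { next }                                   # skip SAM header lines
    { if (int($2 / 4) % 2 == 1) unmapped++; else mapped++ }
    END { printf "mapped=%d unmapped=%d\n", mapped, unmapped }'
}

# Tiny inline demo with three made-up records (two aligned, one unmapped)
printf '@HD\tVN:1.0\nr1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\nr2\t16\tchr2\t200\t255\t4M\t*\t0\t0\tTTGA\tIIII\nr3\t4\t*\t0\t0\t*\t*\t0\t0\tGGCC\tIIII\n' \
  | sam_mapping_summary
# prints: mapped=2 unmapped=1

# On real data (file name assumes STAR's default output naming with the prefix above):
# sam_mapping_summary < aligned_to_hg19_Aligned.out.sam
```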
5
Counts of reads for each gene annotated in GENCODE v17 were calculated with featureCounts.
featureCounts v2.0.3 (Inferred with models/gemini-2.5-flash)
$ Bash example
# Install featureCounts (part of Subread package)
# conda install -c bioconda subread

# Download and decompress the GENCODE v17 GTF file (GRCh37/hg19).
# The GENCODE annotation uses UCSC-style chromosome names (chr1, chr2, ...),
# matching the hg19 alignments from the previous step.
# The URL below follows the GENCODE FTP layout; verify it before use.
# mkdir -p reference
# wget -O reference/gencode.v17.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_17/gencode.v17.annotation.gtf.gz
# gunzip reference/gencode.v17.annotation.gtf.gz

# Placeholder for input alignment files (SAM or BAM)
# Replace with actual file paths, e.g., ("sample1.bam" "sample2.bam")
INPUT_BAM_FILES=("<input_aligned_reads_1.bam>" "<input_aligned_reads_2.bam>")

# Path to the GTF annotation file
GTF_FILE="reference/gencode.v17.annotation.gtf"

# Output file for gene counts
OUTPUT_FILE="gene_counts.txt"

# Number of threads to use for parallel processing
NUM_THREADS=8

# Execute featureCounts
# -a: Annotation file (GTF/GFF)
# -o: Output file
# -F GTF: Specify GTF format for annotation file
# -t exon: Count features of type 'exon' (common for RNA-seq gene counting)
# -g gene_id: Group features by 'gene_id' attribute to summarize counts per gene
# -s 0: Unstranded (0), use -s 1 for forward stranded, -s 2 for reverse stranded
#       (strandedness is not stated in the source; 0 is an assumption)
# -T: Number of threads
# Add -p if reads are paired-end (e.g., featureCounts ... -p ...)
featureCounts -a "${GTF_FILE}" -o "${OUTPUT_FILE}" -F GTF -t exon -g gene_id -s 0 -T "${NUM_THREADS}" "${INPUT_BAM_FILES[@]}"
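The raw source text describes the supplementary files as CSV count files, whereas featureCounts writes a tab-delimited table with a leading comment line and five annotation columns (Chr, Start, End, Strand, Length) between the gene ID and the per-sample counts. How the authors produced their CSV is not stated. One possible conversion is sketched below, with `featurecounts_to_csv` as a hypothetical helper name.

```shell
# Hypothetical helper (not from the source): reduce a featureCounts table on
# stdin to a CSV with the gene ID followed by one count column per sample.
featurecounts_to_csv() {
  awk -F'\t' '
    /^#/ { next }                         # skip the program/command comment line
    {
      line = $1                           # Geneid column
      for (i = 7; i <= NF; i++)           # columns 7+ hold per-sample counts
        line = line "," $i
      print line
    }'
}

# Tiny inline demo with one made-up gene and two made-up samples
printf '# Program:featureCounts\nGeneid\tChr\tStart\tEnd\tStrand\tLength\ts1.bam\ts2.bam\ngeneA\tchr1\t1\t100\t+\t100\t12\t7\n' \
  | featurecounts_to_csv
# prints:
# Geneid,s1.bam,s2.bam
# geneA,12,7

# On real data (file names are placeholders):
# featurecounts_to_csv < gene_counts.txt > gene_counts.csv
```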
Tools Used
Raw Source Text
RNA-seq libraries were first trimmed of polyA tails, adapters, and low quality ends using cutadapt with parameters --match-read-wildcards --times 2 -e 0 -O 5 --quality-cutoff' 6 -m 18 -b TCGTATGCCGTCTTCTGCTTG -b ATCTCGTATGCCGTCTTCTGCTTG -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC -b TGGAATTCTCGGGTGCCAAGG -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT. Reads were then mapped against a database of repetitive elements derived from RepBase18.05. Bowtie version 1.0.0 with parameters -S -q -p 16 -e 100 -l 20 was used to align reads against an index generated from Repbase sequences (Langmead et al., 2009). Reads not mapped to Repbase sequences were aligned to the hg19 human genome (UCSC assembly) using STAR (Dobin et al., 2013) version 2.3.0e with parameters --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1. counts of reads for each gene annotated in gencode v17 were calculated from featureCounts
Genome_build: hg19
Supplementary_files_format_and_content: csv count file, containts counts of reads for each sample