GSE152684 Processing Pipeline
Publication
Evaluation of Engineered CRISPR-Cas-Mediated Systems for Site-Specific RNA Editing.Cell reports (2020) — PMID 33147453
Dataset
GSE152684Discovery and evaluation of CRISPR/Cas-mediated systems for site-specific RNA editing
Processing Steps
Generate Jupyter Notebook-
1
Raw reads were adapter-trimmed using Cutadapt (version 1.14)
$ Bash example
# Install Cutadapt if not already installed # conda install -c bioconda cutadapt=1.14 # Define input and output file paths INPUT_READS="raw_reads.fastq.gz" OUTPUT_TRIMMED_READS="trimmed_reads.fastq.gz" # Define adapter sequence. This must be replaced with the actual adapter sequence used. # Common Illumina universal adapter sequence: AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC # If trimming 3' adapters, use -a. If trimming 5' adapters, use -g. # If trimming unknown adapters, consider using --trim-n. ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC" # Placeholder: Replace with actual adapter # Execute Cutadapt for adapter trimming # -a ADAPTER: Trims 3' adapters # -o: Output file for trimmed reads cutadapt -a "${ADAPTER_SEQUENCE}" \ -o "${OUTPUT_TRIMMED_READS}" \ "${INPUT_READS}" -
2
Sequenced reads were aligned to hg19 genome using STAR (version 2.5.2b) and sorted using samtools (version 1.5)
$ Bash example
# Install STAR # conda install -c bioconda star=2.5.2b # Install samtools # conda install -c bioconda samtools=1.5 # Placeholder for STAR genome index directory for hg19 # Replace with the actual path to your indexed hg19 genome # Example: /path/to/STAR_indices/hg19 GENOME_DIR="/path/to/hg19_STAR_index" # Placeholder for input FASTQ file(s) # Replace with your actual input FASTQ file(s) INPUT_FASTQ="reads.fastq.gz" # Placeholder for output prefix OUTPUT_PREFIX="aligned_reads" # Align reads using STAR # --genomeDir: path to the STAR genome index # --readFilesIn: input FASTQ file(s) # --outFileNamePrefix: prefix for output files # --outSAMtype BAM Unsorted: output unsorted BAM file, to be sorted by samtools # --runThreadN: number of threads to use (example: 8) STAR --genomeDir "${GENOME_DIR}" \ --readFilesIn "${INPUT_FASTQ}" \ --outFileNamePrefix "${OUTPUT_PREFIX}_" \ --outSAMtype BAM Unsorted \ --runThreadN 8 # Sort the aligned BAM file using samtools # -o: output file # Input file is the unsorted BAM generated by STAR (e.g., aligned_reads_Aligned.out.bam) samtools sort -o "${OUTPUT_PREFIX}.sorted.bam" "${OUTPUT_PREFIX}_Aligned.out.bam" -
3
Gene counts were quantified from sorted BAM files using featureCounts (version 1.5.0) of the Subread (Liao et al., Nucleic Acids Research 2013) package using "gene" as a metafeature.
featureCounts v1.5.0$ Bash example
# Install featureCounts (part of the Subread package) # conda install -c bioconda subread # Quantify gene counts from sorted BAM files using featureCounts # -a: Specify the annotation file (GTF/GFF) for gene features (e.g., hg38.gtf) # -o: Specify the output file for gene counts # -t gene: Specify "gene" as the feature type to count reads for (metafeature) # input.bam: Placeholder for the sorted BAM input file(s) featureCounts -a hg38.gtf -o gene_counts.txt -t gene input.bam
-
4
Annotated gene length and gene counts were subsequently used to calculate transcript per-million (TPM) for each gene in each library.
$ Bash example
# Install RSEM (example using conda) # conda install -c bioconda rsem # --- Placeholder for reference data --- # Replace with actual paths to your genome FASTA and GTF annotation. # For eCLIP, this would typically be the reference genome and annotation for the species (e.g., GRCh38). GENOME_FASTA="path/to/your/genome.fa" GTF_ANNOTATION="path/to/your/annotation.gtf" RSEM_REFERENCE_DIR="rsem_reference_index" # Directory to store RSEM index # --- Build RSEM reference index (if not already built) --- # This step needs to be run once for a given reference genome and annotation. # The GTF provides gene/transcript lengths and annotations necessary for RSEM. # rsem-prepare-reference --gtf "${GTF_ANNOTATION}" "${GENOME_FASTA}" "${RSEM_REFERENCE_DIR}" # --- Input data --- # Replace with the actual path to your aligned BAM file (e.g., from STAR alignment in an eCLIP pipeline). INPUT_ALIGNED_BAM="path/to/your/aligned_eclip_reads.bam" SAMPLE_NAME="eclip_sample" # Prefix for output files # --- Run RSEM to calculate expression and TPM --- # RSEM uses the provided GTF (for gene lengths) and the aligned reads to estimate # gene counts (expected counts) and subsequently calculate TPM values. # The '--bam' flag indicates input is a BAM file. # '--no-qualities' can speed up processing if quality scores are not needed for quantification. # '--paired-end' if your reads are paired-end (common in eCLIP). Remove if single-end. # '--num-threads' for parallel processing. rsem-calculate-expression \ --bam \ --no-qualities \ --paired-end \ --num-threads 8 \ "${INPUT_ALIGNED_BAM}" \ "${RSEM_REFERENCE_DIR}" \ "${SAMPLE_NAME}" # Output files in the current directory will include: # ${SAMPLE_NAME}.genes.results (contains TPM values for genes, along with expected counts, FPKM, etc.) # ${SAMPLE_NAME}.isoforms.results (contains TPM values for isoforms) # ${SAMPLE_NAME}.stat (statistics about the run) -
5
For downsampling prior to downstream editing analysis, samtools view -bs INT.FRAC was used for proportional downsampling of read pairs to achieve sampling depth between 15-35 million aligned reads
$ Bash example
# Install samtools (example using conda) # conda install -c bioconda samtools # Define input and output BAM files INPUT_BAM="path/to/your/input.bam" OUTPUT_BAM="path/to/your/downsampled.bam" # --- The following variables (SAMPLING_SEED, SAMPLING_FRACTION) would be calculated # --- based on the current read count and the desired target range (15-35 million aligned reads). # --- Example calculation (replace with actual logic based on your pipeline): # # Get current read count (excluding unmapped reads, primary alignments) # CURRENT_READS=$(samtools view -c -F 4 -F 256 "$INPUT_BAM") # # Define target (e.g., 25 million, midpoint of 15-35M) # TARGET_READS=25000000 # # Calculate fraction, ensuring it doesn't exceed 1 if current reads are already below target # SAMPLING_FRACTION=$(echo "scale=4; if ($CURRENT_READS > $TARGET_READS) $TARGET_READS / $CURRENT_READS else 1" | bc) # # Generate a random seed for reproducibility # SAMPLING_SEED=$(shuf -i 1-10000 -n 1) # Placeholder values for demonstration SAMPLING_SEED=12345 SAMPLING_FRACTION=0.5 # Example: downsample to 50% if current reads are 50M and target is 25M # Perform proportional downsampling of read pairs # The -b flag outputs in BAM format. # The -s flag takes a seed and a fraction (e.g., 12345.0.5 for 50% sampling with seed 12345). # When used with paired-end reads, samtools view -s ensures both reads in a pair are kept or discarded together. samtools view -b -s "${SAMPLING_SEED}.${SAMPLING_FRACTION}" "$INPUT_BAM" > "$OUTPUT_BAM" -
6
Aligned BAM files were run through SAILOR (v1.0.4) using the following parameters-- min_variant_coverage: 5, alpha: 0, beta: 0, edit_fraction: 0.01
$ Bash example
# Install SAILOR (example, adjust based on specific environment) # git clone https://github.com/yeolab/SAILOR.git # cd SAILOR # pip install . # Define input and output files INPUT_BAM="aligned.bam" # Placeholder for aligned BAM file OUTPUT_FILE="sailor_variants.tsv" # Placeholder for output file (e.g., TSV or VCF) REFERENCE_FASTA="genome.fa" # Placeholder for reference genome FASTA (e.g., hg38.fa) # Execute SAILOR python sailor.py \ --input_bam "${INPUT_BAM}" \ --reference_fasta "${REFERENCE_FASTA}" \ --output_file "${OUTPUT_FILE}" \ --min_variant_coverage 5 \ --alpha 0 \ --beta 0 \ --edit_fraction 0.01 -
7
BED files that are output corresponding to positive and negative strands of the genome from SAILOR are combined into one BED file and filtered based upon confidence score (Washburn et al., Cell Reports 2014)
$ Bash example
# Define input and output file names POSITIVE_STRAND_BED="sailor_output_positive.bed" NEGATIVE_STRAND_BED="sailor_output_negative.bed" COMBINED_BED="sailor_combined.bed" FILTERED_BED="sailor_filtered.bed" # Define confidence score threshold (placeholder - adjust as needed) # The specific threshold and filtering logic (e.g., >= or <=) depends on # how SAILOR defines its confidence score (e.g., raw score, p-value, FDR). # For demonstration, assuming a higher score is better. CONFIDENCE_THRESHOLD=5 # Example threshold, adjust based on SAILOR's score range # Combine positive and negative strand BED files cat "${POSITIVE_STRAND_BED}" "${NEGATIVE_STRAND_BED}" > "${COMBINED_BED}" # Filter the combined BED file based on confidence score (5th column) # Assuming the 5th column is the confidence score and higher is better. awk -v threshold="${CONFIDENCE_THRESHOLD}" '$5 >= threshold' "${COMBINED_BED}" > "${FILTERED_BED}" -
8
Recurrent sites between replicates were determined using pybedtools (version 0.8.0) intersect with the following options: wo=True, s=True
$ Bash example
# Install pybedtools and bedtools if not already installed # conda install -c bioconda pybedtools bedtools # Define input and output files # Replace 'replicate1.bed' and 'replicate2.bed' with your actual input BED files. # These files should contain genomic regions (e.g., peaks, binding sites). REPLICATE1_BED="replicate1.bed" REPLICATE2_BED="replicate2.bed" # Define the output file name. # The 'wo=True' option means the output will include columns from both input files # plus an additional column for the overlap length, making it a tab-separated file (TSV). OUTPUT_TSV="recurrent_sites_overlap_details.tsv" # Execute the pybedtools intersect operation using a Python one-liner # This command finds regions that overlap between replicate1.bed and replicate2.bed. # - 'wo=True' ensures that the output includes the original entries from both files # and the length of the overlap. # - 's=True' requires that overlapping features must be on the same strand. python -c " import pybedtools import sys # Check if correct number of arguments are provided if len(sys.argv) != 4: print('Usage: python -c \"...\" <replicate1.bed> <replicate2.bed> <output.tsv>', file=sys.stderr) sys.exit(1) # Load bed files from command-line arguments try: a = pybedtools.BedTool(sys.argv[1]) b = pybedtools.BedTool(sys.argv[2]) except Exception as e: print(f'Error loading BedTool files: {e}', file=sys.stderr) sys.exit(1) # Perform intersect with specified options # wo=True: write original A and B entries plus the number of base pairs of overlap # s=True: require same strand result = a.intersect(b, wo=True, s=True) # Save the result to a file try: result.saveas(sys.argv[3]) print(f'Recurrent sites (with overlap details) saved to {sys.argv[3]}', file=sys.stderr) except Exception as e: print(f'Error saving output file: {e}', file=sys.stderr) sys.exit(1) " "${REPLICATE1_BED}" "${REPLICATE2_BED}" "${OUTPUT_TSV}"
Raw Source Text
Raw reads were adapter-trimmed using Cutadapt (version 1.14) Sequenced reads were aligned to hg19 genome using STAR (version 2.5.2b) and sorted using samtools (version 1.5) Gene counts were quantified from sorted BAM files using featureCounts (version 1.5.0) of the Subread (Liao et al., Nucleic Acids Research 2013) package using "gene" as a metafeature. Annotated gene length and gene counts were subsequently used to calculate transcript per-million (TPM) for each gene in each library. For downsampling prior to downstream editing analysis, samtools view -bs INT.FRAC was used for proportional downsampling of read pairs to achieve sampling depth between 15-35 million aligned reads Aligned BAM files were run through SAILOR (v1.0.4) using the following parameters-- min_variant_coverage: 5, alpha: 0, beta: 0, edit_fraction: 0.01 BED files that are output corresponding to positive and negative strands of the genome from SAILOR are combined into one BED file and filtered based upon confidence score (Washburn et al., Cell Reports 2014) Recurrent sites between replicates were determined using pybedtools (version 0.8.0) intersect with the following options: wo=True, s=True Genome_build: hg19 Supplementary_files_format_and_content: Matrix tables containing normalized transcripts per-million (TPM) data for every annotated gene; BED files containing editing events per library