GSE195495 Processing Pipeline
Publication
Nuclear and cytoplasmic poly(A) binding proteins (PABPs) favor distinct transcripts and isoforms.Nucleic acids research (2022) — PMID 35438785
Dataset
GSE195495Identification of different transcripts that favor the nuclear or cytoplasmic Poly(A) Binding Proteins (PABPs) through RNA immunoprecipitation
Processing Steps
Generate Jupyter Notebook-
1
All reads were aligned to the human genome hg38 primary assembly using STAR.
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star=2.7.10a # Define paths for reference genome and GTF # For hg38 primary assembly, we typically use GENCODE or Ensembl annotations. # Example Ensembl GRCh38 release 110: # GENOME_FASTA="ftp://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz" # GTF_FILE="ftp://ftp.ensembl.org/pub/release-110/gtf/homo_sapiens/Homo_sapiens.GRCh38.110.gtf.gz" # Placeholder paths for local files GENOME_FASTA="/path/to/reference/hg38/Homo_sapiens.GRCh38.dna.primary_assembly.fa" # e.g., from Ensembl or UCSC GTF_FILE="/path/to/reference/hg38/Homo_sapiens.GRCh38.110.gtf" # e.g., from Ensembl or GENCODE STAR_INDEX_DIR="/path/to/STAR_index/hg38_ensembl_110" # Directory to store the STAR genome index # Create STAR genome index (run this command once to build the index) # mkdir -p ${STAR_INDEX_DIR} # STAR --runMode genomeGenerate \ # --genomeDir ${STAR_INDEX_DIR} \ # --genomeFastaFiles ${GENOME_FASTA} \ # --sjdbGTFfile ${GTF_FILE} \ # --sjdbOverhang 100 \ # --runThreadN 8 # Adjust threads based on available resources # Define input and output files READS_R1="sample_R1.fastq.gz" # Placeholder for forward reads READS_R2="sample_R2.fastq.gz" # Placeholder for reverse reads (if paired-end, otherwise remove) OUTPUT_PREFIX="sample_aligned" # Prefix for output files (e.g., sample_aligned.Aligned.sortedByCoord.out.bam) THREADS=8 # Number of threads to use for alignment # Align reads to hg38 primary assembly using STAR STAR --genomeDir ${STAR_INDEX_DIR} \ --readFilesIn ${READS_R1} ${READS_R2} \ --readFilesCommand zcat \ --runThreadN ${THREADS} \ --outFileNamePrefix ${OUTPUT_PREFIX}. \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes All \ --outSAMunmapped Within \ --twopassMode Basic -
2
featureCounts version 2.0.2 was used to annotate reads, using the flag --minOverlap 20 and a custom GTF file derived from gencode v34
featureCounts v2.0.2$ Bash example
# featureCounts is part of the Subread package. # Installation via Bioconda: # conda install -c bioconda subread # Placeholder for the custom GTF file derived from gencode v34. # Ensure this GTF file is available in your working directory or provide its full path. # Example: gencode.v34.annotation.gtf CUSTOM_GTF="custom_gencode_v34.gtf" # Placeholder for input BAM file(s). # featureCounts can take multiple BAM files as input. INPUT_BAM="input.bam" # Placeholder for output counts file. OUTPUT_COUNTS="featureCounts_counts.txt" # Execute featureCounts # -a: Annotation file (GTF/GFF) # -o: Output file # --minOverlap: Minimum number of overlapping bases in a read that is required for read assignment. # Input BAM files are provided as positional arguments. featureCounts -a "${CUSTOM_GTF}" \ -o "${OUTPUT_COUNTS}" \ --minOverlap 20 \ "${INPUT_BAM}" -
3
Differential expression was calculated using DESeq2.
$ Bash example
# Install R and Bioconductor if not already present # conda create -n deseq2_env r-base=4.3.2 bioconductor-deseq2 -c bioconda -c conda-forge -y # conda activate deseq2_env # Create dummy input files for demonstration purposes # In a real scenario, 'counts.tsv' would be your gene/transcript count matrix # and 'sample_info.tsv' would contain metadata for your samples. cat << EOF > counts.tsv gene1 gene2 gene3 sampleA1 100 50 20 sampleA2 120 60 25 sampleB1 50 100 15 sampleB2 60 110 18 EOF cat << EOF > sample_info.tsv sample condition sampleA1 control sampleA2 control sampleB1 treated sampleB2 treated EOF # R script for DESeq2 analysis cat << 'EOF' > run_deseq2.R library(DESeq2) # Load count data count_data <- read.table("counts.tsv", header=TRUE, row.names=1, sep="\t") # Load sample information sample_info <- read.table("sample_info.tsv", header=TRUE, row.names=1, sep="\t") # Ensure sample order matches between count data and sample info sample_info <- sample_info[colnames(count_data), , drop=FALSE] # Create DESeqDataSet object dds <- DESeqDataSetFromMatrix(countData = round(count_data), # DESeq2 expects integer counts colData = sample_info, design = ~ condition) # Run DESeq2 analysis dds <- DESeq(dds) # Get results for the 'treated' vs 'control' comparison res <- results(dds, contrast=c("condition", "treated", "control")) # Order results by adjusted p-value res <- res[order(res$padj), ] # Write results to a CSV file write.csv(as.data.frame(res), file="deseq2_results.csv") # Optional: Save normalized counts norm_counts <- counts(dds, normalized=TRUE) write.csv(as.data.frame(norm_counts), file="deseq2_normalized_counts.csv") message("DESeq2 analysis complete. Results saved to deseq2_results.csv") EOF # Execute the R script Rscript run_deseq2.R -
4
Cut-offs used to determine significant differential expression were baseMean > 50 and padj <=0.01
$ Bash example
# This script assumes DESeq2 has already been run and its results are in 'deseq2_results.tsv'. # The following R script filters the results based on the specified cut-offs. # If R and DESeq2 are not installed: # # conda install -c conda-forge r-base # # conda install -c bioconda bioconductor-deseq2 Rscript -e ' results_df <- read.delim("deseq2_results.tsv", sep="\t", header=TRUE) significant_results <- subset(results_df, baseMean > 50 & padj <= 0.01) write.table(significant_results, "significant_differential_expression.tsv", sep="\t", quote=FALSE, row.names=FALSE) ' -
5
Genes which had substantial upstream intergenic reads (similar to DoGs) were removed from DE tables.
Custom script (Inferred with models/gemini-2.5-flash) vN/A$ Bash example
# This script filters a Differential Expression (DE) results table # by removing genes identified as having "substantial upstream intergenic reads" (similar to DoGs). # The identification of such genes is assumed to be performed by a prior custom script. # --- Conceptual Step: Identification of genes with substantial upstream intergenic reads --- # This part is highly dependent on how "substantial upstream intergenic reads" is defined. # It would typically involve: # 1. Defining upstream intergenic regions for each gene using a genome annotation file (e.g., GTF/GFF). # Reference dataset: Genome annotation (e.g., Homo_sapiens.GRCh38.109.gtf) # 2. Quantifying reads in these defined regions from aligned sequencing data (e.g., BAM files). # Reference dataset: Aligned reads (e.g., sample.bam) # 3. Applying a threshold or statistical method to identify genes with "substantial" read counts # in these upstream intergenic regions. # # For this filtering step, we assume a file named 'genes_to_remove.txt' has already been generated, # containing one gene identifier per line for genes that meet the removal criteria. # Example (conceptual) command that might generate 'genes_to_to_remove.txt': # python /path/to/custom_script/identify_dogs_like_genes.py \ # --bam_file /path/to/aligned_reads/sample.bam \ # --gtf_file /path/to/genome_annotation/Homo_sapiens.GRCh38.109.gtf \ # --output_file genes_to_remove.txt \ # --upstream_distance 1000 \ # --min_read_count 50 \ # --intergenic_only # --- Actual Filtering Step --- # Input: Differential Expression results table # This file is expected to have gene identifiers in the first column and a header. DE_TABLE="differential_expression_results.tsv" # Input: List of gene identifiers to be removed # Each line should contain one gene ID. GENES_TO_REMOVE="genes_to_remove.txt" # Output: Filtered Differential Expression results table FILTERED_DE_TABLE="differential_expression_results_filtered.tsv" # Check if the DE table exists if [ ! -f "$DE_TABLE" ]; then echo "Error: Differential expression table '$DE_TABLE' not found." exit 1 fi # Check if the list of genes to remove exists and is not empty if [ ! -s "$GENES_TO_REMOVE" ]; then echo "Warning: List of genes to remove '$GENES_TO_REMOVE' is empty or not found. No genes will be filtered." # If the file is empty, we can just copy the original DE table or proceed without filtering. cp "$DE_TABLE" "$FILTERED_DE_TABLE" echo "Original DE table copied to $FILTERED_DE_TABLE as no genes were specified for removal." exit 0 fi # Filter the DE table: # 1. Read the list of genes to remove into an associative array 'a'. # 2. Process the DE table: # - Print the header line (NR==1). # - For subsequent lines, print the line only if the gene ID (first column, $1) is NOT in the 'a' array. awk 'FNR==NR{a[$1]=1; next} NR==1 || !($1 in a)' "$GENES_TO_REMOVE" "$DE_TABLE" > "$FILTERED_DE_TABLE" echo "Filtering complete. Filtered DE table saved to: $FILTERED_DE_TABLE" echo "Genes removed from DE table (listed in $GENES_TO_REMOVE):" cat "$GENES_TO_REMOVE" -
6
Upstream intergenic regions were extracted using bedtools flank -l 2000 -r 0 -s.
$ Bash example
# Install bedtools if not already installed # conda install -c bioconda bedtools # Placeholder for input intergenic regions file. # This file would typically be generated in a previous step, # e.g., by subtracting gene annotations from the whole genome. # For demonstration, let's assume 'intergenic_regions.bed' exists. # Example (not part of this specific step): # bedtools complement -i genes.bed -g hg38.chrom.sizes > intergenic_regions.bed # Download a chrom.sizes file for the desired genome assembly (e.g., hg38) # wget -O hg38.chrom.sizes http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes # Extract upstream intergenic regions bedtools flank -i intergenic_regions.bed -g hg38.chrom.sizes -l 2000 -r 0 -s > upstream_intergenic_regions.bed
-
7
Intervening upstream genes that had any overlap with this region were removed with bedtools subtract.
$ Bash example
# Install bedtools if not already installed # conda install -c bioconda bedtools # Define input and output files (placeholders) # input_regions.bed: BED file containing the regions of interest (e.g., original regions) # upstream_genes.bed: BED file containing the upstream gene regions to be removed # filtered_regions.bed: Output BED file with upstream genes subtracted bedtools subtract -a input_regions.bed -b upstream_genes.bed > filtered_regions.bed
-
8
Reads that did not align to intronic or exonic regions were extracted from bam files with fgrep.
fgrep (Inferred with models/gemini-2.5-flash) vSystem Utility (Inferred with models/gemini-2.5-flash)$ Bash example
# This step describes extracting reads that *did not* align to intronic or exonic regions. # Directly using `fgrep` on a BAM file for genomic region filtering is not standard as BAM is binary. # A common approach involves first identifying reads that *do* align to these regions (e.g., using bedtools intersect), # extracting their read IDs, and then using `fgrep -v` on a list of all read IDs to find the complement. # Finally, `samtools` can filter the original BAM based on these non-aligning read IDs. # Alternatively, if 'fgrep' is used directly, it implies the BAM was converted to SAM/FASTQ and then filtered for specific string patterns, which is less likely for genomic region filtering. # The following code block demonstrates a plausible interpretation using standard tools to achieve the described outcome, assuming 'fgrep' was used on a list of read IDs. # Define input and output files INPUT_BAM="input.bam" INTRONIC_EXONIC_BED="intronic_exonic.bed" # Placeholder for a BED file defining intronic/exonic regions OUTPUT_BAM="reads_not_in_intronic_exonic.bam" # --- Installation (commented out) --- # conda install -c bioconda samtools bedtools # 1. Extract all read IDs from the input BAM file samtools view "${INPUT_BAM}" | cut -f1 > all_read_ids.txt # 2. Identify reads that *do* align to intronic or exonic regions # This step uses bedtools intersect to find reads overlapping the defined regions. # The output is a BAM file containing only reads that align to intronic/exonic regions. bedtools intersect -abam "${INPUT_BAM}" -b "${INTRONIC_EXONIC_BED}" > aligned_to_regions.bam # 3. Extract read IDs from the reads that *do* align to intronic/exonic regions samtools view aligned_to_regions.bam | cut -f1 | sort -u > aligned_read_ids.txt # 4. Use fgrep to find read IDs that are *not* in the 'aligned_read_ids.txt' list # This effectively identifies reads that did not align to intronic or exonic regions. fgrep -v -f aligned_read_ids.txt all_read_ids.txt > non_intronic_exonic_read_ids.txt # 5. Filter the original BAM file to extract reads corresponding to the non-intronic/exonic read IDs # Note: Filtering BAM by a list of read names is not directly supported by `samtools view -N`. # A common workaround is to convert to SAM, filter with `grep`, and convert back. # However, `samtools view -N` is a common misconception. A more efficient way is to use `samtools view -L` for regions or `samtools view -f` for flags. # Given the explicit mention of `fgrep` and the need to filter by read names, converting to SAM and back is the most direct interpretation of using `fgrep` for this purpose. samtools view -h "${INPUT_BAM}" | fgrep -f non_intronic_exonic_read_ids.txt - > temp_filtered.sam samtools view -bS temp_filtered.sam > "${OUTPUT_BAM}" # Clean up intermediate files rm all_read_ids.txt aligned_to_regions.bam aligned_read_ids.txt non_intronic_exonic_read_ids.txt temp_filtered.sam -
9
Coverage across the 2000 bp upstream region was determined with bedtools coverage -S -split.
$ Bash example
# Install bedtools (if not already installed) # conda install -c bioconda bedtools # Placeholder variables (replace with actual file paths) INPUT_BAM="aligned_reads.bam" # e.g., a STAR-aligned BAM file from eCLIP data UPSTREAM_REGIONS_BED="upstream_regions_2000bp.bed" # BED file defining the 2000 bp upstream regions OUTPUT_COVERAGE_FILE="upstream_coverage.tsv" # Determine coverage across the 2000 bp upstream region bedtools coverage -a "${UPSTREAM_REGIONS_BED}" -b "${INPUT_BAM}" -S -split > "${OUTPUT_COVERAGE_FILE}" -
10
A TPM value was determined for this upstream region and compared to the TPM of the adjacent downstream gene.
$ Bash example
# Install Salmon (if not already installed) # conda install -c bioconda salmon # Define variables GENOME_VERSION="GRCh38" # Example: Human genome version TRANSCRIPTOME_FASTA="gencode.v38.transcripts.fa.gz" # Example: GENCODE v38 transcriptome FASTA SALMON_INDEX_DIR="salmon_index_${GENOME_VERSION}" READ1="sample_R1.fastq.gz" READ2="sample_R2.fastq.gz" # Optional, if paired-end OUTPUT_DIR="salmon_quant_output" # Download reference transcriptome (example commands - replace with actual paths if already available) # mkdir -p references # cd references # wget -c ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.transcripts.fa.gz # cd .. # 1. Build Salmon index (run once for a given transcriptome) salmon index -t "${TRANSCRIPTOME_FASTA}" -i "${SALMON_INDEX_DIR}" --gencode # 2. Quantify expression (determine TPM values) # For paired-end reads: salmon quant -i "${SALMON_INDEX_DIR}" -l A \ -1 "${READ1}" -2 "${READ2}" \ -p 8 --validateMappings -o "${OUTPUT_DIR}" # For single-end reads (uncomment and use if applicable): # salmon quant -i "${SALMON_INDEX_DIR}" -l A \ # -r "${READ1}" \ # -p 8 --validateMappings -o "${OUTPUT_DIR}" # The output directory will contain 'quant.sf' which includes TPM values for transcripts. # Further analysis (e.g., aggregating transcript TPMs to gene TPMs, # identifying "upstream regions" and "downstream genes" based on genomic coordinates # from an annotation file like GTF, and comparing their TPMs) would be performed # using custom scripts (e.g., R or Python) parsing the 'quant.sf' file and the annotation. -
11
If a ratio of 20% of the signal (determined by TPM) was present in the upstream region, and a breadth of 70% of that region was covered, that gene was determined to have significant enough upstream transcription so as to not be a reliable functional coding product, and was removed from subsequent analysis.
Custom Script (Inferred with models/gemini-2.5-flash) vN/A (Inferred with models/gemini-2.5-flash)$ Bash example
#!/bin/bash # --- Configuration --- TPM_RATIO_THRESHOLD=0.20 # 20% of signal (TPM) in upstream region COVERAGE_BREADTH_THRESHOLD=0.70 # 70% breadth of coverage in upstream region # --- Input Files (placeholders - replace with actual paths) --- # These files would typically be generated by previous steps in the pipeline. # For example: # - gene_body_tpm.tsv: Gene-level TPM quantification (e.g., from Salmon, Kallisto, RSEM) # Format: gene_id<tab>TPM_value # - upstream_region_tpm.tsv: TPM quantification specifically for the upstream region of each gene. # This would require defining upstream regions (e.g., using bedtools flank on gene annotations), # quantifying reads in these regions (e.g., featureCounts), and then normalizing to TPM. # Format: gene_id<tab>TPM_value # - upstream_coverage_breadth.tsv: Fraction of the upstream region covered by transcription. # This could be calculated using bedtools coverage with aligned RNA-seq reads (BAM) and upstream region BED files. # Format: gene_id<tab>fraction_covered # --- Output Files --- FILTERED_GENES_LIST="genes_removed_due_to_upstream_transcription.tsv" RELIABLE_GENES_LIST="reliable_functional_coding_products.tsv" # --- Placeholder for generating dummy input files for demonstration --- # In a real pipeline, these would be actual output from previous steps. echo -e "geneA\t100.0" > gene_body_tpm.tsv echo -e "geneB\t20.0" >> gene_body_tpm.tsv echo -e "geneC\t50.0" >> gene_body_tpm.tsv echo -e "geneD\t10.0" >> gene_body_tpm.tsv echo -e "geneE\t75.0" >> gene_body_tpm.tsv echo -e "geneA\t30.0" > upstream_region_tpm.tsv # Ratio 0.30 echo -e "geneB\t5.0" >> upstream_region_tpm.tsv # Ratio 0.25 echo -e "geneC\t8.0" >> upstream_region_tpm.tsv # Ratio 0.16 echo -e "geneD\t3.0" >> upstream_region_tpm.tsv # Ratio 0.30 echo -e "geneE\t10.0" >> upstream_region_tpm.tsv # Ratio 0.13 echo -e "geneA\t0.85" > upstream_coverage_breadth.tsv # Covered echo -e "geneB\t0.60" >> upstream_coverage_breadth.tsv # Not covered echo -e "geneC\t0.90" >> upstream_coverage_breadth.tsv # Covered echo -e "geneD\t0.75" >> upstream_coverage_breadth.tsv # Covered echo -e "geneE\t0.65" >> upstream_coverage_breadth.tsv # Not covered # --- Core Filtering Logic (using awk for custom script) --- echo "Applying filtering logic..." awk -v tpm_ratio_thresh="${TPM_RATIO_THRESHOLD}" -v cov_breadth_thresh="${COVERAGE_BREADTH_THRESHOLD}" ' BEGIN { OFS="\t" } FILENAME=="gene_body_tpm.tsv" { gene_body_tpm[$1] = $2; } FILENAME=="upstream_region_tpm.tsv" { upstream_tpm[$1] = $2; } FILENAME=="upstream_coverage_breadth.tsv" { coverage_breadth[$1] = $2; } END { for (gene_id in gene_body_tpm) { # Check if upstream data exists for this gene if (gene_id in upstream_tpm && gene_id in coverage_breadth) { # Calculate TPM ratio: (TPM in upstream region) / (TPM in gene body) ratio = upstream_tpm[gene_id] / gene_body_tpm[gene_id]; # Apply filtering conditions: # 1. Upstream TPM ratio >= threshold (20%) # 2. Upstream coverage breadth >= threshold (70%) if (ratio >= tpm_ratio_thresh && coverage_breadth[gene_id] >= cov_breadth_thresh) { print gene_id, "REMOVED", "UpstreamTPMRatio="ratio, "CoverageBreadth="coverage_breadth[gene_id] >> "'"${FILTERED_GENES_LIST}"'" ; } else { print gene_id, "KEPT", "UpstreamTPMRatio="ratio, "CoverageBreadth="coverage_breadth[gene_id] >> "'"${RELIABLE_GENES_LIST}"'" ; } } else { # If upstream data is missing, assume it is not problematic and keep the gene print gene_id, "KEPT", "NoUpstreamData" >> "'"${RELIABLE_GENES_LIST}"'" ; } } }' gene_body_tpm.tsv upstream_region_tpm.tsv upstream_coverage_breadth.tsv echo "Filtering complete. Genes to remove are in ${FILTERED_GENES_LIST}" echo "Reliable genes are in ${RELIABLE_GENES_LIST}" # Clean up dummy files (remove this in a real pipeline) rm gene_body_tpm.tsv upstream_region_tpm.tsv upstream_coverage_breadth.tsv -
12
Genes that showed this pattern in either the input or IP condition were both removed from that IP analysis.
Custom Filtering Script (Inferred with models/gemini-2.5-flash) vN/A (Custom) (Inferred with models/gemini-2.5-flash)$ Bash example
# This script conceptually demonstrates the removal of genes based on a defined pattern # in either the input or IP condition, as described. The exact "pattern" (e.g., high input signal, # low IP signal, specific statistical criteria) and the input data format would be specific # to the experimental design and upstream analysis. # For demonstration, let's assume we have a file 'gene_quantification_data.tsv' # that contains gene identifiers along with their quantified values (e.g., counts, RPKM, TPM) # for both input and IP conditions. # Example format (tab-separated): # GeneID Input_Value IP_Value # Let's create a dummy file for demonstration purposes: echo -e "GeneA\t1000\t5000" > gene_quantification_data.tsv echo -e "GeneB\t5000\t100" >> gene_quantification_data.tsv # High input, low IP - candidate for removal echo -e "GeneC\t50\t20" >> gene_quantification_data.tsv # Low IP - candidate for removal echo -e "GeneD\t200\t8000" >> gene_quantification_data.tsv echo -e "GeneE\t10000\t12000" >> gene_quantification_data.tsv # High input, but also high IP - might be kept or removed based on specific criteria # Define example thresholds for the "pattern" to be removed: # - Remove if Input_Value is above a certain threshold (e.g., highly abundant background) # - OR Remove if IP_Value is below a certain threshold (e.g., not enriched enough) # These thresholds are illustrative and would be determined empirically based on the specific assay. INPUT_VALUE_THRESHOLD_FOR_REMOVAL=4000 # Genes with input value > 4000 might be considered problematic background IP_VALUE_THRESHOLD_FOR_REMOVAL=50 # Genes with IP value < 50 might be considered not enriched # Perform filtering using awk. # Assuming columns are: 1:GeneID, 2:Input_Value, 3:IP_Value # We want to KEEP genes where: # (Input_Value <= INPUT_VALUE_THRESHOLD_FOR_REMOVAL) AND (IP_Value >= IP_VALUE_THRESHOLD_FOR_REMOVAL) # The following awk command filters out genes matching the removal pattern. # It prints lines where the input value is NOT above the input threshold AND the IP value is NOT below the IP threshold. awk -v input_thresh_remove="$INPUT_VALUE_THRESHOLD_FOR_REMOVAL" \ -v ip_thresh_remove="$IP_VALUE_THRESHOLD_FOR_REMOVAL" \ 'NR==1 {print; next} # Assuming a header line, print it and skip to next record $2 <= input_thresh_remove && $3 >= ip_thresh_remove {print}' \ gene_quantification_data.tsv > filtered_gene_analysis.tsv echo "Filtered genes saved to filtered_gene_analysis.tsv" # Reference datasets: The gene quantification data would typically be derived from # alignment to a reference genome (e.g., hg38, mm10) and annotation (e.g., GENCODE, Ensembl). # The specific reference used would depend on the species and experimental design # of the upstream quantification step.
Raw Source Text
All reads were aligned to the human genome hg38 primary assembly using STAR. featureCounts version 2.0.2 was used to annotate reads, using the flag --minOverlap 20 and a custom GTF file derived from gencode v34 Differential expression was calculated using DESeq2. Cut-offs used to determine significant differential expression were baseMean > 50 and padj <=0.01 Genes which had substantial upstream intergenic reads (similar to DoGs) were removed from DE tables. Upstream intergenic regions were extracted using bedtools flank -l 2000 -r 0 -s. Intervening upstream genes that had any overlap with this region were removed with bedtools subtract. Reads that did not align to intronic or exonic regions were extracted from bam files with fgrep. Coverage across the 2000 bp upstream region was determined with bedtools coverage -S -split. A TPM value was determined for this upstream region and compared to the TPM of the adjacent downstream gene. If a ratio of 20% of the signal (determined by TPM) was present in the upstream region, and a breadth of 70% of that region was covered, that gene was determined to have significant enough upstream transcription so as to not be a reliable functional coding product, and was removed from subsequent analysis. Genes that showed this pattern in either the input or IP condition were both removed from that IP analysis. Genome_build: hg38 Supplementary_files_format_and_content: comma separated files with output of DESeq2; normalized abundance measurements for all genes when comparing RIP condition to total input condition