GSE51264 Processing Pipeline
Publication
Widespread RNA editing dysregulation in brains from autistic individuals.Nature neuroscience (2019) — PMID 30559470
Dataset
GSE51264Conserved expression of lincRNA during human and macaque prefrontal cortex development and maturation
Processing Steps
Generate Jupyter Notebook-
1
For transcript assembly, TopHat2 was used with default parameters to align all sequenced reads to the human genome (hg19).
$ Bash example
# Install TopHat2 (if not already installed) # conda install -c bioconda tophat2 # Ensure Bowtie2 is also installed as TopHat2 depends on it # conda install -c bioconda bowtie2 # Download or create Bowtie2 index for hg19 # This is a placeholder. In a real scenario, you would use an existing index. # For example, to build from UCSC hg19 reference FASTA: # wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz # gunzip hg19.fa.gz # bowtie2-build hg19.fa /path/to/bowtie2_indexes/hg19 # Define variables BOWTIE2_INDEX="/path/to/bowtie2_indexes/hg19" # Placeholder for hg19 Bowtie2 index READS_FASTQ="reads.fastq.gz" # Placeholder for input FASTQ file (e.g., single-end reads) OUTPUT_DIR="tophat_out" # Run TopHat2 with default parameters # TopHat2 automatically uses Bowtie2 for alignment. # The output (e.g., accepted_hits.bam) will be in the specified output directory. tophat2 -o "${OUTPUT_DIR}" "${BOWTIE2_INDEX}" "${READS_FASTQ}" -
2
Using the GENCODE v16 annotation, all sequeced transcripts overlapping annotated transcripts were removed.
$ Bash example
# Install bedtools if not already installed # conda install -c bioconda bedtools # --- Reference Data Setup --- # Download GENCODE v16 annotation (example for human, GRCh37/hg19) # Note: GENCODE v16 is based on GRCh37/hg19. # wget -O gencode.v16.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_16/gencode.v16.annotation.gtf.gz # gunzip gencode.v16.annotation.gtf.gz # Convert GTF to BED format for bedtools (e.g., using gtf2bed or awk) # Example using gtf2bed (install with: conda install -c bioconda gtf2bed) # gtf2bed < gencode.v16.annotation.gtf > gencode.v16.annotation.bed # Or a simplified awk command for transcript features: # awk -F'\t' '($3 == "transcript") {print $1"\t"($4-1)"\t"$5"\t"$10"\t"$6"\t"$7}' gencode.v16.annotation.gtf | sed 's/;//g' | sed 's/"//g' | sort -k1,1 -k2,2n > gencode.v16.annotation.bed # --- Main Execution --- # Assuming 'sequenced_transcripts.bed' contains the transcripts to be filtered # and 'gencode.v16.annotation.bed' contains the GENCODE v16 annotated transcripts. # Remove sequenced transcripts that overlap with GENCODE v16 annotated transcripts bedtools subtract -a sequenced_transcripts.bed -b gencode.v16.annotation.bed > filtered_transcripts.bed -
3
To estimate gene expression including both protein-coding genes and lincRNAs in human and macaque, a human-macaque consensus genome based on a pairwise whole-genome alignment of the human (hg19) and the rhesus macaque (rheMac3) genomes was constructed.
(Inferred with models/gemini-2.5-flash) vN/A$ Bash example
# Install necessary tools (example using conda) # conda install -c bioconda lastz ucsc-axtchain ucsc-chainnet ucsc-nettoaxt # Download reference genomes # mkdir -p genomes # wget -P genomes http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz # wget -P genomes http://hgdownload.soe.ucsc.edu/goldenPath/rheMac3/bigZips/rheMac3.fa.gz # gunzip genomes/hg19.fa.gz # gunzip genomes/rheMac3.fa.gz # Step 1: Perform pairwise whole-genome alignment using LASTZ # This generates an alignment file (e.g., AXT format) which is the basis for consensus construction. # The 'multiple' option is for aligning multiple sequences from the target to the query. lastz genomes/hg19.fa[multiple] genomes/rheMac3.fa --format=axt > human_macaque.axt # Step 2: Chain and Net the alignments (UCSC tools) # This creates a chain file for coordinate mapping and a net file for higher-level organization. axtChain -linearGap=loose human_macaque.axt genomes/hg19.fa genomes/rheMac3.fa stdout | chainAntiRepeat > human_macaque.chain chainNet human_macaque.chain genomes/hg19.fa genomes/rheMac3.fa human_macaque.net /dev/null # Step 3: Convert net to axt (optional, for detailed alignment in net regions) netToAxt human_macaque.net human_macaque.chain genomes/hg19.fa genomes/rheMac3.fa stdout > human_macaque_net.axt # Step 4: Construct the consensus genome # The construction of a "consensus genome" from whole-genome alignments is highly custom # and depends on the specific definition (e.g., a merged reference, a new sequence resolving differences, # or a comprehensive coordinate mapping system). # This step typically involves custom scripts or further processing of the generated alignment files # (e.g., human_macaque.chain, human_macaque_net.axt) to derive a unified reference sequence or coordinate system. # Example placeholder for a custom script: # python custom_consensus_builder.py --human genomes/hg19.fa --macaque genomes/rheMac3.fa --alignment human_macaque_net.axt --output consensus_genome.fa echo "Consensus genome construction typically involves custom scripts or further processing of alignment files (e.g., human_macaque.chain, human_macaque_net.axt) to derive a unified reference sequence or coordinate system."
-
4
Reads in both, human and macaque samples were mapped to the consensus genome using STAR with default parameters.
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Placeholder for consensus genome index directory # This directory should contain the STAR genome index files (SA, SAindex, genome, etc.) # The consensus genome would have been built previously using STAR --runMode genomeGenerate CONSENSUS_GENOME_INDEX="./consensus_genome_index" # Placeholder for input FASTQ files for human sample HUMAN_R1="./human_sample_R1.fastq.gz" HUMAN_R2="./human_sample_R2.fastq.gz" # Placeholder for input FASTQ files for macaque sample MACAQUE_R1="./macaque_sample_R1.fastq.gz" MACAQUE_R2="./macaque_sample_R2.fastq.gz" # Placeholder for number of threads to use NUM_THREADS=8 # Create output directories mkdir -p star_output_human mkdir -p star_output_macaque # Map human samples to the consensus genome using STAR with default parameters STAR \ --genomeDir "${CONSENSUS_GENOME_INDEX}" \ --readFilesIn "${HUMAN_R1}" "${HUMAN_R2}" \ --runThreadN "${NUM_THREADS}" \ --outFileNamePrefix "star_output_human/human_sample_" \ --outSAMtype BAM SortedByCoordinate \ --outBAMcompression 6 \ --readFilesCommand zcat # Map macaque samples to the consensus genome using STAR with default parameters STAR \ --genomeDir "${CONSENSUS_GENOME_INDEX}" \ --readFilesIn "${MACAQUE_R1}" "${MACAQUE_R2}" \ --runThreadN "${NUM_THREADS}" \ --outFileNamePrefix "star_output_macaque/macaque_sample_" \ --outSAMtype BAM SortedByCoordinate \ --outBAMcompression 6 \ --readFilesCommand zcat -
5
Gene expression was quantified using RPKM = n*1000*100/L/N, where n is the number of reads aligned to exon regions, and L is the length of exons regions, N is the total number of reads in the sample.
featureCounts (Inferred with models/gemini-2.5-flash) vSubread 2.0.6$ Bash example
# --- Installation (commented out) --- # conda install -c bioconda star # conda install -c bioconda subread # conda install -c bioconda samtools # --- Reference Data (placeholders) --- # Replace with actual paths to your reference genome and annotation GENOME_FASTA="GRCh38.primary_assembly.genome.fa" GTF_FILE="gencode.v44.annotation.gtf" STAR_INDEX_DIR="star_index_GRCh38_gencode_v44" # --- Input Files --- # Replace with your actual FASTQ file and desired sample name READS_FASTQ="sample.fastq.gz" SAMPLE_NAME="sample" OUTPUT_DIR="quantification_output" mkdir -p "${OUTPUT_DIR}" # --- Step 1: Alignment (using STAR) --- # This step assumes STAR index is already built. # Uncomment and run if alignment has not been performed. # STAR --runThreadN 8 --genomeDir "${STAR_INDEX_DIR}" \ # --readFilesIn "${READS_FASTQ}" --readFilesCommand zcat \ # --outFileNamePrefix "${OUTPUT_DIR}/${SAMPLE_NAME}." \ # --outSAMtype BAM SortedByCoordinate \ # --outSAMunmapped None \ # --outFilterType BySJout \ # --outFilterMultimapNmax 20 \ # --outFilterMismatchNmax 999 \ # --outFilterMismatchNoverLmax 0.1 \ # --alignIntronMin 20 --alignIntronMax 1000000 \ # --alignMatesGapMax 1000000 \ # --limitBAMsortRAM 30000000000 # Adjust based on available RAM # Assuming the aligned BAM file exists from a previous step ALIGNED_BAM="${OUTPUT_DIR}/${SAMPLE_NAME}.Aligned.sortedByCoord.out.bam" # --- Step 2: Count reads aligned to exon regions (n) and get exon lengths (L) using featureCounts --- # -p: paired-end reads (remove if single-end) # -t exon: count features of type 'exon' # -g gene_id: group features by 'gene_id' attribute in GTF # -a: annotation file # -o: output file FEATURE_COUNTS_OUTPUT="${OUTPUT_DIR}/${SAMPLE_NAME}.featureCounts.txt" featureCounts -t exon -g gene_id -a "${GTF_FILE}" -o "${FEATURE_COUNTS_OUTPUT}" "${ALIGNED_BAM}" # --- Step 3: Get total number of reads in the sample (N) --- # This counts all mapped reads in the BAM file. TOTAL_READS=$(samtools flagstat "${ALIGNED_BAM}" | grep "mapped (" | awk '{print $1}') # --- Step 4: Calculate RPKM using the formula: RPKM = n*1000*100/L/N --- # This uses awk to process the featureCounts output and apply the given formula. # n = count for the gene (column 7 in featureCounts output for a single sample) # L = length of the gene (sum of exon lengths, column 6 in featureCounts output) # N = total mapped reads in the sample (calculated above) echo -e "GeneID\tRPKM" > "${OUTPUT_DIR}/${SAMPLE_NAME}.rpkm.tsv" awk -v N_total="${TOTAL_READS}" ' BEGIN { OFS="\t" } NR > 1 { # Skip header line from featureCounts output gene_id = $1; n = $7; # Count for the gene (adjust column if multiple samples in featureCounts output) L = $6; # Length of the gene if (L > 0 && N_total > 0) { rpkm = (n * 1000 * 100) / (L * N_total); print gene_id, rpkm; } else { print gene_id, 0; # Handle cases with zero length or zero total reads } }' "${FEATURE_COUNTS_OUTPUT}" >> "${OUTPUT_DIR}/${SAMPLE_NAME}.rpkm.tsv"
Raw Source Text
For transcript assembly, TopHat2 was used with default parameters to align all sequenced reads to the human genome (hg19). Using the GENCODE v16 annotation, all sequeced transcripts overlapping annotated transcripts were removed. To estimate gene expression including both protein-coding genes and lincRNAs in human and macaque, a human-macaque consensus genome based on a pairwise whole-genome alignment of the human (hg19) and the rhesus macaque (rheMac3) genomes was constructed. Reads in both, human and macaque samples were mapped to the consensus genome using STAR with default parameters. Gene expression was quantified using RPKM = n*1000*100/L/N, where n is the number of reads aligned to exon regions, and L is the length of exons regions, N is the total number of reads in the sample. Genome_build: GENCODE 16; hg19; GCA_000230795.1; consensus genome Supplementary_files_format_and_content: rpkm_human.tsv is a table showing gene expression (in RPKM) of each gene in each human PFC sample. rpkm_macaque.tsv is a similar table but showing information in macaque PFC samples. Supplementary_files_format_and_content: All the ENSG* IDs in the GENE_ID column are correspond to GENCODE v16 gene IDs (not transcript IDs). The "CUFF.*" IDs are novel lincRNAs we identified. Please see the file named "novel_lincrna.gtf", which involves the information of them.