BCH394P BCH364C 2025: Difference between revisions
Line 10: | Line 10: | ||
== Lectures & Handouts == | == Lectures & Handouts == | ||
<!-- | |||
'''Apr 17 - 24, 2025 - Final Project Presentations''' | '''Apr 17 - 24, 2025 - Final Project Presentations''' | ||
* Welcome to the end of the course! You made it! The last 3 days will be presentations of your class projects. | * Welcome to the end of the course! You made it! The last 3 days will be presentations of your class projects. | ||
Line 26: | Line 27: | ||
* [https://sites.google.com/view/a-synucleindms?usp=sharing Predictive Model of Deep Mutation Scanning Fitness Landscape, by Nathan Strominger, Luiz Vieira] | * [https://sites.google.com/view/a-synucleindms?usp=sharing Predictive Model of Deep Mutation Scanning Fitness Landscape, by Nathan Strominger, Luiz Vieira] | ||
* [https://sites.google.com/utexas.edu/bch394p-2024/home?authuser=0 Computational Investigation of Point Mutations in the Tumor Suppressor Protein p53, by Wes Wolfe, Mikayla Horvath] | * [https://sites.google.com/utexas.edu/bch394p-2024/home?authuser=0 Computational Investigation of Point Mutations in the Tumor Suppressor Protein p53, by Wes Wolfe, Mikayla Horvath] | ||
--> | |||
<!-- | |||
'''April 15, 2025 - Synthetic Biology, highly compressed''' | '''April 15, 2025 - Synthetic Biology, highly compressed''' | ||
* '''Reminder: All projects are due by 10PM, April 16'''. Turn them in as a URL to the web site you created, sent by email to the TA AND PROFESSOR. | * '''Reminder: All projects are due by 10PM, April 16'''. Turn them in as a URL to the web site you created, sent by email to the TA AND PROFESSOR. | ||
Line 43: | Line 44: | ||
* [http://www.marcottelab.org/users/BCH394P_364C_2025/EdgeDetector.pdf Edge detector] | * [http://www.marcottelab.org/users/BCH394P_364C_2025/EdgeDetector.pdf Edge detector] | ||
[https://colossal.com/ Food for thought] | [https://colossal.com/ Food for thought] | ||
--> | |||
<!-- | |||
'''April 10, 2025 - Orthologs, Paralogs, and Phenologs''' | '''April 10, 2025 - Orthologs, Paralogs, and Phenologs''' | ||
* '''Remember: The final project web page is due by 10PM April 16, 2025, turned in as a URL emailed to the TA+Professor. Please indicate in the email if you are willing to let us post the project to the course web site. Also, note that ''late days can't be used for the final project'' ''' | * '''Remember: The final project web page is due by 10PM April 16, 2025, turned in as a URL emailed to the TA+Professor. Please indicate in the email if you are willing to let us post the project to the course web site. Also, note that ''late days can't be used for the final project'' ''' | ||
Line 55: | Line 56: | ||
* All (well, at least some) of [http://www.marcottelab.org/users/BCH394P_364C_2025/Sonnhammer2002TiG.pdf your ortholog definition questions answered!] | * All (well, at least some) of [http://www.marcottelab.org/users/BCH394P_364C_2025/Sonnhammer2002TiG.pdf your ortholog definition questions answered!] | ||
* A nice paper about selecting good research problems. [http://www.marcottelab.org/users/BCH394P_364C_2025/ChoosingAProblemInScienceAndEngineering.pdf Here's the pdf]. | * A nice paper about selecting good research problems. [http://www.marcottelab.org/users/BCH394P_364C_2025/ChoosingAProblemInScienceAndEngineering.pdf Here's the pdf]. | ||
--> | |||
<!-- | |||
'''Apr 8, 2025 - Computational Protein Design''' | '''Apr 8, 2025 - Computational Protein Design''' | ||
* [http://www.marcottelab.org/users/BCH394P_364C_2025/ComputationalProteinDesign_Spring2025.pdf Today's slides] | * [http://www.marcottelab.org/users/BCH394P_364C_2025/ComputationalProteinDesign_Spring2025.pdf Today's slides] | ||
Line 62: | Line 63: | ||
* Try it yourself! Here's the [https://colab.research.google.com/github/sokrypton/ColabDesign/blob/main/rf/examples/diffusion.ipynb RFDiffusion + ProteinMPNN colab notebook for protein backbone + sequence design], [https://huggingface.co/spaces/simonduerr/ProteinMPNN a site for running ProteinMPNN in isolation for protein sequence redesign], and [https://colab.research.google.com/github/ullahsamee/ligandMPNN_Colab/blob/main/LigandMPNN_Colab.ipynb the LigandMPNN colab notebook for small-molecule aware protein sequence redesign]. | * Try it yourself! Here's the [https://colab.research.google.com/github/sokrypton/ColabDesign/blob/main/rf/examples/diffusion.ipynb RFDiffusion + ProteinMPNN colab notebook for protein backbone + sequence design], [https://huggingface.co/spaces/simonduerr/ProteinMPNN a site for running ProteinMPNN in isolation for protein sequence redesign], and [https://colab.research.google.com/github/ullahsamee/ligandMPNN_Colab/blob/main/LigandMPNN_Colab.ipynb the LigandMPNN colab notebook for small-molecule aware protein sequence redesign]. | ||
<!-- | |||
'''Apr 3, 2025 - Large Language Models in Biology''' | '''Apr 3, 2025 - Large Language Models in Biology''' | ||
* [http://www.marcottelab.org/users/BCH394P_364C_2025/AaronFeller_2024-04-04_TeachingLanguageModelstoSpeakBiology.pdf Today's slides] | * [http://www.marcottelab.org/users/BCH394P_364C_2025/AaronFeller_2024-04-04_TeachingLanguageModelstoSpeakBiology.pdf Today's slides] | ||
* [https://www.linkedin.com/in/aaronleefeller/ Aaron Feller]. Aaron is a computational biologist and PhD student in our program specializing in large language models. (His LinkedIn Bio is literally "Ask me about biological language models.") He will be providing a conceptual basis for how LLMs work and are applied to biological datasets. | * [https://www.linkedin.com/in/aaronleefeller/ Aaron Feller]. Aaron is a computational biologist and PhD student in our program specializing in large language models. (His LinkedIn Bio is literally "Ask me about biological language models.") He will be providing a conceptual basis for how LLMs work and are applied to biological datasets. | ||
--> | |||
<!-- | |||
'''Apr 1, 2025 - 3D Protein Structure Modeling with AlphaFold & ChimeraX''' | '''Apr 1, 2025 - 3D Protein Structure Modeling with AlphaFold & ChimeraX''' | ||
* [http://www.marcottelab.org/users/BCH394P_364C_2025/DarylBarth_ProteinStructPredict_Bioinformatics_Apr2_2024.pdf Today's slides] | * [http://www.marcottelab.org/users/BCH394P_364C_2025/DarylBarth_ProteinStructPredict_Bioinformatics_Apr2_2024.pdf Today's slides] | ||
Line 74: | Line 75: | ||
* & a few other useful 3D structure tools: The [http://www.rcsb.org/ Protein Data Bank], [https://salilab.org/modeller/ MODELLER], and [http://www.pymol.org/ Pymol] | * & a few other useful 3D structure tools: The [http://www.rcsb.org/ Protein Data Bank], [https://salilab.org/modeller/ MODELLER], and [http://www.pymol.org/ Pymol] | ||
* There is a nice tutorial on using AlphaFold and ChimeraX from EMBL/DFG (Kosinski group) available [https://docs.google.com/document/d/1_g1_M-I40CqOQc5obwAt08YntC5D2Z_WNz6mYuUQtyc/edit#heading=h.m7ei2f72v2ig here]. | * There is a nice tutorial on using AlphaFold and ChimeraX from EMBL/DFG (Kosinski group) available [https://docs.google.com/document/d/1_g1_M-I40CqOQc5obwAt08YntC5D2Z_WNz6mYuUQtyc/edit#heading=h.m7ei2f72v2ig here]. | ||
--> | |||
<!-- | |||
'''Mar 27, 2025 - Principal Component Analysis (& the curious case of European genotypes)''' | '''Mar 27, 2025 - Principal Component Analysis (& the curious case of European genotypes)''' | ||
* [http://www.marcottelab.org/users/BCH394P_364C_2025/BCH394P-364C_PCA_Spring2025.pdf Today's slides] | * [http://www.marcottelab.org/users/BCH394P_364C_2025/BCH394P-364C_PCA_Spring2025.pdf Today's slides] | ||
Line 86: | Line 87: | ||
* Science Signaling (more specifically, Neil R. Clark and Avi Ma’ayan!) had a nice introduction to PCA that I've reposted [http://www.marcottelab.org/users/BCH394P_364C_2025/IntroToPCA.pdf here] (with [http://www.marcottelab.org/users/BCH394P_364C_2025/2001967Slides-FINAL.ppt slides]) | * Science Signaling (more specifically, Neil R. Clark and Avi Ma’ayan!) had a nice introduction to PCA that I've reposted [http://www.marcottelab.org/users/BCH394P_364C_2025/IntroToPCA.pdf here] (with [http://www.marcottelab.org/users/BCH394P_364C_2025/2001967Slides-FINAL.ppt slides]) | ||
* Python code for [http://sebastianraschka.com/Articles/2015_pca_in_3_steps.html performing PCA yourself]. This example gives a great intro to several important numerical/statistical/data mining packages in Python, including pandas and numpy. | * Python code for [http://sebastianraschka.com/Articles/2015_pca_in_3_steps.html performing PCA yourself]. This example gives a great intro to several important numerical/statistical/data mining packages in Python, including pandas and numpy. | ||
--> | |||
<!-- | |||
'''Mar 25, 2025 - Classifiers''' | '''Mar 25, 2025 - Classifiers''' | ||
* Science news of the day: [https://www.nytimes.com/2024/03/21/health/pig-kidney-organ-transplant.html Surgeons Transplant Pig Kidney Into a Patient A Medical Milestone] ([http://www.marcottelab.org/users/BCH394P_364C_2025/SurgeonsTransplantPigKidneyIntoaPatientAMedicalMilestone-TheNewYorkTimes.pdf pdf version]) | * Science news of the day: [https://www.nytimes.com/2024/03/21/health/pig-kidney-organ-transplant.html Surgeons Transplant Pig Kidney Into a Patient A Medical Milestone] ([http://www.marcottelab.org/users/BCH394P_364C_2025/SurgeonsTransplantPigKidneyIntoaPatientAMedicalMilestone-TheNewYorkTimes.pdf pdf version]) | ||
Line 96: | Line 97: | ||
* For those of you interested in trying out classifiers on your own, here's the best stand-alone open software for do-it-yourself classifiers and data mining: [http://www.cs.waikato.ac.nz/ml/weka/ Weka]. There is a great introduction to using Weka in this book chapter [http://link.springer.com/protocol/10.1007/978-1-4939-3578-9_17 Introducing Machine Learning Concepts with WEKA], as well as the very accessible Weka-produced book [http://www.cs.waikato.ac.nz/ml/weka/book.html Data Mining: Practical Machine Learning Tools and Techniques]. | * For those of you interested in trying out classifiers on your own, here's the best stand-alone open software for do-it-yourself classifiers and data mining: [http://www.cs.waikato.ac.nz/ml/weka/ Weka]. There is a great introduction to using Weka in this book chapter [http://link.springer.com/protocol/10.1007/978-1-4939-3578-9_17 Introducing Machine Learning Concepts with WEKA], as well as the very accessible Weka-produced book [http://www.cs.waikato.ac.nz/ml/weka/book.html Data Mining: Practical Machine Learning Tools and Techniques]. | ||
* & to do this directly in Python, there's a really excellent library of simple, easy-to-use, classification, regression, machine learning and data mining tools called [https://scikit-learn.org/stable/ scikit-learn]. I highly recommend using scikit-learn in combination with the [https://pandas.pydata.org/ pandas library], which makes it easy to work with large, tabular datasets. Here's [https://www.youtube.com/watch?v=PcvsOaixUh8 a helpful pandas tutorial] to get you started. | * & to do this directly in Python, there's a really excellent library of simple, easy-to-use, classification, regression, machine learning and data mining tools called [https://scikit-learn.org/stable/ scikit-learn]. I highly recommend using scikit-learn in combination with the [https://pandas.pydata.org/ pandas library], which makes it easy to work with large, tabular datasets. Here's [https://www.youtube.com/watch?v=PcvsOaixUh8 a helpful pandas tutorial] to get you started. | ||
--> | |||
<!-- | |||
'''Mar 18,20, 2025 - SPRING BREAK''' | '''Mar 18,20, 2025 - SPRING BREAK''' | ||
--> | |||
<!-- | |||
'''Mar 13, 2025 - Clustering II''' | '''Mar 13, 2025 - Clustering II''' | ||
* We'll be continuing the slides from last time | * We'll be continuing the slides from last time | ||
Line 111: | Line 112: | ||
* [http://www.marcottelab.org/users/BCH394P_364C_2025/nature_review_2000.pdf Review of phylogenetic profiles] | * [http://www.marcottelab.org/users/BCH394P_364C_2025/nature_review_2000.pdf Review of phylogenetic profiles] | ||
* [http://www.marcottelab.org/users/BCH394P_364C_2025/FuzzyK-Means.pdf Fuzzy k-means] | * [http://www.marcottelab.org/users/BCH394P_364C_2025/FuzzyK-Means.pdf Fuzzy k-means] | ||
--> | |||
<!-- | |||
'''Mar 11, 2025 - Functional Genomics & Data Mining - Clustering I''' | '''Mar 11, 2025 - Functional Genomics & Data Mining - Clustering I''' | ||
* Don't forget to turn in the proposal for your course project by '''March 12'''. | * Don't forget to turn in the proposal for your course project by '''March 12'''. | ||
Line 129: | Line 130: | ||
* [http://www.marcottelab.org/users/BCH394P_364C_2025/Bcelllymphoma.pdf B cell lymphomas] | * [http://www.marcottelab.org/users/BCH394P_364C_2025/Bcelllymphoma.pdf B cell lymphomas] | ||
* [http://en.wikipedia.org/wiki/RNA-Seq RNA-Seq] | * [http://en.wikipedia.org/wiki/RNA-Seq RNA-Seq] | ||
--> | |||
<!-- | |||
'''Mar 6, 2025 - Intro to Proteomics''' | '''Mar 6, 2025 - Intro to Proteomics''' | ||
* Guest speaker: Vy Dang, who earned her B.S. and subsequently worked in genomics at the University of Washington, Seattle, where she was a major contributor to [https://www.science.org/doi/full/10.1126/science.aax2083 the sequencing of the Melanesian genome] before joining us at UT. Here, she has performed >2,000 mass spectrometry proteomics experiments to map brain protein-protein interactions conserved across vertebrates. | * Guest speaker: Vy Dang, who earned her B.S. and subsequently worked in genomics at the University of Washington, Seattle, where she was a major contributor to [https://www.science.org/doi/full/10.1126/science.aax2083 the sequencing of the Melanesian genome] before joining us at UT. Here, she has performed >2,000 mass spectrometry proteomics experiments to map brain protein-protein interactions conserved across vertebrates. | ||
--> | |||
<!-- | <!-- | ||
* [http://www.marcottelab.org/users/BCH394P_364C_2025/IntroToProteomics2-03-24-2025.pdf Today's slides] | * [http://www.marcottelab.org/users/BCH394P_364C_2025/IntroToProteomics2-03-24-2025.pdf Today's slides] | ||
--> | --> | ||
<!-- | |||
'''Mar 4, 2025 - NGS analysis best practices''' | '''Mar 4, 2025 - NGS analysis best practices''' | ||
* Guest speaker: [https://www.linkedin.com/in/anna-battenhouse-abba1/ Anna Battenhouse] from the [https://research.utexas.edu/cbrs/ Center for Biomedical Research Support], where she maintains the [https://wikis.utexas.edu/display/RCTFusers Biomedical Research Computing Facility]. | * Guest speaker: [https://www.linkedin.com/in/anna-battenhouse-abba1/ Anna Battenhouse] from the [https://research.utexas.edu/cbrs/ Center for Biomedical Research Support], where she maintains the [https://wikis.utexas.edu/display/RCTFusers Biomedical Research Computing Facility]. | ||
--> | |||
<!-- | <!-- | ||
* [http://www.marcottelab.org/users/BCH394P_364C_2025/2024-02-NGS_IntroForEdM.pdf Today's slides] | * [http://www.marcottelab.org/users/BCH394P_364C_2025/2024-02-NGS_IntroForEdM.pdf Today's slides] | ||
--> | --> | ||
<!-- | |||
'''Feb 27, 2025 - Genome Assembly/Mapping II'''<br> | '''Feb 27, 2025 - Genome Assembly/Mapping II'''<br> | ||
* We're finishing up the slides from last time. Note that we give short shrift to read mapping/alignment algorithms, of which there are now [https://en.wikipedia.org/wiki/List_of_sequence_alignment_software#Short-Read_Sequence_Alignment a very long list]. Here's an interesting discussion by Lior Pachter of the [https://liorpachter.wordpress.com/2015/11/01/what-is-a-read-mapping/ major developments in that field.] | * We're finishing up the slides from last time. Note that we give short shrift to read mapping/alignment algorithms, of which there are now [https://en.wikipedia.org/wiki/List_of_sequence_alignment_software#Short-Read_Sequence_Alignment a very long list]. Here's an interesting discussion by Lior Pachter of the [https://liorpachter.wordpress.com/2015/11/01/what-is-a-read-mapping/ major developments in that field.] | ||
Line 153: | Line 156: | ||
* Two notable advances in genome assembly: [http://www.marcottelab.org/users/BCH394P_364C_2025/StringGraphAssembly.pdf String Graphs] and more recently, [http://www.marcottelab.org/users/BCH394P_364C_2025/MultiplexDeBruijnGraphs.pdf multiplexed De Bruijn graphs]. Both have been used to assemble a [http://www.marcottelab.org/users/BCH394P_364C_2025/CompleteHumanGenomeSequence.pdf fully complete human genome sequence] (check out the [https://www.biorxiv.org/content/biorxiv/early/2021/05/27/2021.05.26.445798/F2.large.jpg?width=800&height=600&carousel=1 beautiful string graph visualizations] of the final assemblies, which capture gapless telomere-to-telomere assemblies for all 22 human autosomes and Chromosome X) | * Two notable advances in genome assembly: [http://www.marcottelab.org/users/BCH394P_364C_2025/StringGraphAssembly.pdf String Graphs] and more recently, [http://www.marcottelab.org/users/BCH394P_364C_2025/MultiplexDeBruijnGraphs.pdf multiplexed De Bruijn graphs]. Both have been used to assemble a [http://www.marcottelab.org/users/BCH394P_364C_2025/CompleteHumanGenomeSequence.pdf fully complete human genome sequence] (check out the [https://www.biorxiv.org/content/biorxiv/early/2021/05/27/2021.05.26.445798/F2.large.jpg?width=800&height=600&carousel=1 beautiful string graph visualizations] of the final assemblies, which capture gapless telomere-to-telomere assemblies for all 22 human autosomes and Chromosome X) | ||
* k-mer-based RNA quantification offers [https://www.nature.com/articles/nbt.3519 near-optimal probabilistic RNA-seq quantification]. Here's [https://bioinfo.iric.ca/understanding-how-kallisto-works/ how the program kallisto works] | * k-mer-based RNA quantification offers [https://www.nature.com/articles/nbt.3519 near-optimal probabilistic RNA-seq quantification]. Here's [https://bioinfo.iric.ca/understanding-how-kallisto-works/ how the program kallisto works] | ||
--> | |||
<!-- | |||
'''Feb 25, 2025 - Genome Assembly - I''' | '''Feb 25, 2025 - Genome Assembly - I''' | ||
* Homework #3 (worth 10% of your final course grade) has been assigned on Rosalind and is '''due by 10:00PM March 5'''. In past years, we've run into problems with Rosalind timing out before Meme completes although it usually runs eventually, so be warned you may have to try it a couple of times. Meme also runs faster using the "zero to one" or "one" occurrence per sequence option, rather than the "any number of repeats" option. Ultimately, if you can't get it to work, you can paste the input sequences + Meme output into a single file and submit that through Canvas, and we'll give you credit for it. | * Homework #3 (worth 10% of your final course grade) has been assigned on Rosalind and is '''due by 10:00PM March 5'''. In past years, we've run into problems with Rosalind timing out before Meme completes although it usually runs eventually, so be warned you may have to try it a couple of times. Meme also runs faster using the "zero to one" or "one" occurrence per sequence option, rather than the "any number of repeats" option. Ultimately, if you can't get it to work, you can paste the input sequences + Meme output into a single file and submit that through Canvas, and we'll give you credit for it. | ||
Line 165: | Line 168: | ||
* [http://www.marcottelab.org/users/BCH394P_364C_2025/DeBruijnPrimer.pdf DeBruijn Primer] and [http://www.marcottelab.org/users/BCH394P_364C_2025/DeBruijnSupplement.pdf Supplement] | * [http://www.marcottelab.org/users/BCH394P_364C_2025/DeBruijnPrimer.pdf DeBruijn Primer] and [http://www.marcottelab.org/users/BCH394P_364C_2025/DeBruijnSupplement.pdf Supplement] | ||
--> | |||
<!-- | |||
'''Feb 20, 2025 - Motifs''' | '''Feb 20, 2025 - Motifs''' | ||
* We'll talk about motif finding today. | * We'll talk about motif finding today. | ||
Line 177: | Line 180: | ||
* [http://www.rcsb.org/pdb/explore/explore.do?structureId=1L1M The biochemical basis of a particular motif] | * [http://www.rcsb.org/pdb/explore/explore.do?structureId=1L1M The biochemical basis of a particular motif] | ||
* [http://www.marcottelab.org/users/BCH394P_364C_2025/GibbsSampling.pdf Gibbs Sampling] | * [http://www.marcottelab.org/users/BCH394P_364C_2025/GibbsSampling.pdf Gibbs Sampling] | ||
--> | |||
<!-- | |||
'''Feb 18, 2025 - Gene finding II''' | '''Feb 18, 2025 - Gene finding II''' | ||
* [https://research.utexas.edu/cbrs/classes/short-courses/spring-2025-semester/ Short classes at UT] will be offered starting in March in programming, bioinformatics, genome sequencing, and cryoEM | * [https://research.utexas.edu/cbrs/classes/short-courses/spring-2025-semester/ Short classes at UT] will be offered starting in March in programming, bioinformatics, genome sequencing, and cryoEM | ||
Line 185: | Line 188: | ||
Reading:<br> | Reading:<br> | ||
* Re-posting this so it doesn't fall through the cracks: [http://www.marcottelab.org/users/BCH394P_364C_2025/2019StateOfGeneAnnotation.pdf The current state of gene annotation] | * Re-posting this so it doesn't fall through the cracks: [http://www.marcottelab.org/users/BCH394P_364C_2025/2019StateOfGeneAnnotation.pdf The current state of gene annotation] | ||
--> | |||
<!-- | |||
'''Feb 13, 2025 - Gene finding''' | '''Feb 13, 2025 - Gene finding''' | ||
* [http://www.marcottelab.org/users/BCH394P_364C_2025/BCH394P-364C-GeneFinding-Spring2025.pdf Today's slides on gene finding] | * [http://www.marcottelab.org/users/BCH394P_364C_2025/BCH394P-364C-GeneFinding-Spring2025.pdf Today's slides on gene finding] | ||
Line 199: | Line 202: | ||
* [http://www.marcottelab.org/users/BCH394P_364C_2025/EukGeneAnnotation.pdf Eukaryotic gene finding], [http://www.marcottelab.org/users/BCH394P_364C_2025/GeneMark.hmm.pdf GeneMark.hmm], and [http://www.marcottelab.org/users/BCH394P_364C_2025/BurgeKarlin-main.pdf GENSCAN] | * [http://www.marcottelab.org/users/BCH394P_364C_2025/EukGeneAnnotation.pdf Eukaryotic gene finding], [http://www.marcottelab.org/users/BCH394P_364C_2025/GeneMark.hmm.pdf GeneMark.hmm], and [http://www.marcottelab.org/users/BCH394P_364C_2025/BurgeKarlin-main.pdf GENSCAN] | ||
* [http://www.marcottelab.org/users/BCH394P_364C_2025/SplicingAI-jaganathan2019.pdf Deep learning for splice site identification] | * [http://www.marcottelab.org/users/BCH394P_364C_2025/SplicingAI-jaganathan2019.pdf Deep learning for splice site identification] | ||
--> | |||
<!-- | |||
'''Feb 11, 2025 - HMMs II''' | '''Feb 11, 2025 - HMMs II''' | ||
* We'll be finishing up slides from last time. | * We'll be finishing up slides from last time. | ||
Line 206: | Line 209: | ||
* Link to [http://setosa.io/blog/2014/07/26/markov-chains/ a great interactive visualization of Markov chains], by Victor Powell & Lewis Lehe. It's worth checking out to build some intuition. They correctly point out that [https://en.wikipedia.org/wiki/PageRank Google's PageRank algorithm] is based on Markov chains. There, the ranking of pages in a web search relates to how random walks across linked web pages spend more time on some pages than on others. | * Link to [http://setosa.io/blog/2014/07/26/markov-chains/ a great interactive visualization of Markov chains], by Victor Powell & Lewis Lehe. It's worth checking out to build some intuition. They correctly point out that [https://en.wikipedia.org/wiki/PageRank Google's PageRank algorithm] is based on Markov chains. There, the ranking of pages in a web search relates to how random walks across linked web pages spend more time on some pages than on others. | ||
* A non-biological example of using log odds ratios & Bayesian stats [https://priceonomics.com/how-statistics-solved-a-175-year-old-mystery-about/ to learn the authors of the Federalist Papers]. In a related example, [https://arstechnica.com/science/2024/02/lost-and-found-code-breakers-decipher-50-letters-of-mary-queen-of-scots/ researchers just decoded >50 coded letters from a French archive] and discovered they were lost correspondence from Mary, Queen of Scots, before she was executed in 1587 for treason against Elizabeth I. The researchers used an approach closely related to computing log odds ratios of 5-mer frequencies between putative decoded texts and known free text to figure out the correct ciphers. If you're curious, you can read about it in [https://www.tandfonline.com/doi/full/10.1080/01611194.2022.2160677 Appendix A of their paper] | * A non-biological example of using log odds ratios & Bayesian stats [https://priceonomics.com/how-statistics-solved-a-175-year-old-mystery-about/ to learn the authors of the Federalist Papers]. In a related example, [https://arstechnica.com/science/2024/02/lost-and-found-code-breakers-decipher-50-letters-of-mary-queen-of-scots/ researchers just decoded >50 coded letters from a French archive] and discovered they were lost correspondence from Mary, Queen of Scots, before she was executed in 1587 for treason against Elizabeth I. The researchers used an approach closely related to computing log odds ratios of 5-mer frequencies between putative decoded texts and known free text to figure out the correct ciphers. If you're curious, you can read about it in [https://www.tandfonline.com/doi/full/10.1080/01611194.2022.2160677 Appendix A of their paper] | ||
--> | |||
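To build a little more intuition for the random-walk idea behind PageRank, here is a minimal sketch in plain Python. The three-page link graph (<code>toy_links</code>) and the function name <code>stationary_distribution</code> are made up for illustration, and the sketch assumes every page has at least one outgoing link. It is not Google's actual algorithm, which among other things adds a damping factor; it only shows how repeatedly pushing probability along links settles on the fraction of time a random walker spends on each page.

```python
def stationary_distribution(links, steps=200):
    """Approximate the long-run fraction of time a random walker spends on each page.

    links maps each page to the list of pages it links to (hypothetical toy input;
    every page is assumed to have at least one outgoing link).
    """
    pages = sorted(links)
    # Start with the walker equally likely to be on any page.
    prob = {page: 1.0 / len(pages) for page in pages}
    for _ in range(steps):
        nxt = {page: 0.0 for page in pages}
        for page, outgoing in links.items():
            # The walker follows each outgoing link with equal probability.
            share = prob[page] / len(outgoing)
            for target in outgoing:
                nxt[target] += share
        prob = nxt
    return prob


# Hypothetical toy web: A links to B and C, B links to C, C links back to A.
toy_links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

ranks = stationary_distribution(toy_links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph the walker ends up spending about 40% of its time on each of A and C and about 20% on B, so A and C would outrank B.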
<!-- | |||
'''Feb 6, 2025 - Hidden Markov Models''' | '''Feb 6, 2025 - Hidden Markov Models''' | ||
* Don't forget: Rosalind Homework #2 (worth 10% of your final course grade) is '''due by 10 PM February 12'''. | * Don't forget: Rosalind Homework #2 (worth 10% of your final course grade) is '''due by 10 PM February 12'''. | ||
Line 215: | Line 218: | ||
* [http://www.marcottelab.org/users/BCH394P_364C_2025/NBTPrimer-HMMs.pdf HMM primer] and [http://www.marcottelab.org/users/BCH394P_364C_2025/NBTPrimer-Bayes.pdf Bayesian statistics primer #1], [http://www.marcottelab.org/users/BCH394P_364C_2025/BayesPrimer-NatMethods.pdf Bayesian statistics primer #2], [http://en.wikipedia.org/wiki/Bayes'_theorem Wiki Bayes] | * [http://www.marcottelab.org/users/BCH394P_364C_2025/NBTPrimer-HMMs.pdf HMM primer] and [http://www.marcottelab.org/users/BCH394P_364C_2025/NBTPrimer-Bayes.pdf Bayesian statistics primer #1], [http://www.marcottelab.org/users/BCH394P_364C_2025/BayesPrimer-NatMethods.pdf Bayesian statistics primer #2], [http://en.wikipedia.org/wiki/Bayes'_theorem Wiki Bayes] | ||
* Care to practice your [http://en.wikipedia.org/wiki/Regular_expression regular expressions]? (In [https://www.tutorialspoint.com/python3/python_reg_expressions.htm python?] & a [https://www.pcwdld.com/python-regex-cheat-sheet Python regexp cheat sheet]) | * Care to practice your [http://en.wikipedia.org/wiki/Regular_expression regular expressions]? (In [https://www.tutorialspoint.com/python3/python_reg_expressions.htm python?] & a [https://www.pcwdld.com/python-regex-cheat-sheet Python regexp cheat sheet]) | ||
--> | |||
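If you want something concrete to practice regular expressions on, here is a small sketch using Python's built-in <code>re</code> module. It searches a protein sequence for the N-glycosylation motif, written in PROSITE style as N-{P}-[ST]-{P}. The example sequence and the helper name <code>find_motif</code> are invented for illustration. The lookahead <code>(?=...)</code> is there so that overlapping matches are all reported.

```python
import re

# N-glycosylation motif N-{P}-[ST]-{P}: N, then anything but P, then S or T,
# then anything but P. Wrapping it in a lookahead lets overlapping hits be found.
MOTIF = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(sequence):
    """Return (1-based start position, matched text) for every motif occurrence."""
    return [(m.start() + 1, m.group(1)) for m in MOTIF.finditer(sequence)]


# Hypothetical toy sequence, not a real protein.
hits = find_motif("MNGSANPTNNTSA")
for position, text in hits:
    print(position, text)
```

Note that the N followed by P in the toy sequence is correctly skipped, while the two overlapping hits near the end are both found.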
<!-- | |||
'''Another reminder:''' ''Lectures will generally be about algorithms and concepts; the coding help hours (or my office hours) are for you to get individual coding help and feedback. Please plan to go regularly to coding help hours if you need that support!'' | '''Another reminder:''' ''Lectures will generally be about algorithms and concepts; the coding help hours (or my office hours) are for you to get individual coding help and feedback. Please plan to go regularly to coding help hours if you need that support!'' | ||
--> | |||
<!-- | |||
'''Feb 4, 2025 - Biological databases''' | '''Feb 4, 2025 - Biological databases''' | ||
* [http://www.marcottelab.org/users/BCH394P_364C_2025/BCH394P-364C-BiologicalDatabases-Spring2025.pdf Today's slides]<br> | * [http://www.marcottelab.org/users/BCH394P_364C_2025/BCH394P-364C-BiologicalDatabases-Spring2025.pdf Today's slides]<br> | ||
Line 229: | Line 232: | ||
* Just a note that we'll be seeing ever more statistics as we go on. Here's a [http://www.marcottelab.org/users/BCH394P_364C_2025/StatisticsPrimer.pdf good primer] from [https://integrativebio.utexas.edu/component/cobalt/item/7-integrative-biology/226-meyers-lauren-a?Itemid=1224 Prof. Lauren Ancel Meyers] (who leads the [https://covid-19.tacc.utexas.edu/ UT Austin COVID-19 Modeling Consortium]) to refresh/explain basic concepts. | * Just a note that we'll be seeing ever more statistics as we go on. Here's a [http://www.marcottelab.org/users/BCH394P_364C_2025/StatisticsPrimer.pdf good primer] from [https://integrativebio.utexas.edu/component/cobalt/item/7-integrative-biology/226-meyers-lauren-a?Itemid=1224 Prof. Lauren Ancel Meyers] (who leads the [https://covid-19.tacc.utexas.edu/ UT Austin COVID-19 Modeling Consortium]) to refresh/explain basic concepts. | ||
* Finally, here's a great opportunity to hone your Python skills a bit more: The UT CBRS cores will offer [https://research.utexas.edu/cbrs/classes/short-courses/ short courses] in Python, Unix, and Python for Data Sciences starting in March. | * Finally, here's a great opportunity to hone your Python skills a bit more: The UT CBRS cores will offer [https://research.utexas.edu/cbrs/classes/short-courses/ short courses] in Python, Unix, and Python for Data Sciences starting in March. | ||
--> | |||
<!-- | |||
'''Jan 30, 2025 - Speeding up your searches: BLAST, MMSeqs2, and Foldseek''' | '''Jan 30, 2025 - Speeding up your searches: BLAST, MMSeqs2, and Foldseek''' | ||
* [http://www.marcottelab.org/users/BCH394P_364C_2025/BCH394P-364C-BLAST-Spring2025.pdf Our slides today] are modified from a paper on [http://dx.doi.org/10.1371/journal.pbio.1001014 Teaching BLAST] by Cheryl Kerfeld & Kathleen Scott. | * [http://www.marcottelab.org/users/BCH394P_364C_2025/BCH394P-364C-BLAST-Spring2025.pdf Our slides today] are modified from a paper on [http://dx.doi.org/10.1371/journal.pbio.1001014 Teaching BLAST] by Cheryl Kerfeld & Kathleen Scott. | ||
Line 238: | Line 241: | ||
* The [http://www.marcottelab.org/users/BCH394P_364C_2025/MMSeqs2_NBT_2017.pdf MMSeqs2 paper] | * The [http://www.marcottelab.org/users/BCH394P_364C_2025/MMSeqs2_NBT_2017.pdf MMSeqs2 paper] | ||
* The [http://www.marcottelab.org/users/BCH394P_364C_2025/FoldSeek_NBT_2023.pdf FoldSeek paper] and a link to the [https://search.foldseek.com/search FoldSeek server] if you want to try it out | * The [http://www.marcottelab.org/users/BCH394P_364C_2025/FoldSeek_NBT_2023.pdf FoldSeek paper] and a link to the [https://search.foldseek.com/search FoldSeek server] if you want to try it out | ||
--> | |||
<!-- | |||
'''Jan 28, 2025 - Sequence Alignment II''' | '''Jan 28, 2025 - Sequence Alignment II''' | ||
* We'll be finishing up slides from last time. | * We'll be finishing up slides from last time. | ||
--> | |||
<!-- | <!-- | ||
* For those of you who could use more tips on programming, '''the weekly peer-led open coding hour is starting up again'''! Every Monday, 3:30-4:30, in the MBB 2.232 lounge. It's a very informal setting where you can work and ask questions of more experienced programmers. | * For those of you who could use more tips on programming, '''the weekly peer-led open coding hour is starting up again'''! Every Monday, 3:30-4:30, in the MBB 2.232 lounge. It's a very informal setting where you can work and ask questions of more experienced programmers. | ||
--> | --> | ||
<!-- | |||
* [http://www.marcottelab.org/users/BCH394P_364C_2025/FactAndFictionInAlignment.png Fact and Fiction in Sequence Alignments] | * [http://www.marcottelab.org/users/BCH394P_364C_2025/FactAndFictionInAlignment.png Fact and Fiction in Sequence Alignments] | ||
* [http://www.marcottelab.org/users/BCH394P_364C_2025/NBTPrimer-DynamicProgramming.pdf Dynamic programming primer] | * [http://www.marcottelab.org/users/BCH394P_364C_2025/NBTPrimer-DynamicProgramming.pdf Dynamic programming primer] | ||
* [http://www.marcottelab.org/users/BCH394P_364C_2025/GALPAS.xls An example of dynamic programming using Excel], created by [https://hoffmanlab.org/ Michael Hoffman] (a former U Texas undergraduate, now U Toronto professor, who took a prior incarnation of this class) | * [http://www.marcottelab.org/users/BCH394P_364C_2025/GALPAS.xls An example of dynamic programming using Excel], created by [https://hoffmanlab.org/ Michael Hoffman] (a former U Texas undergraduate, now U Toronto professor, who took a prior incarnation of this class) | ||
* A few examples of proteins with internally repetitive sequences: [http://www.pdb.org/pdb/explore/explore.do?structureId=1QYY 1], [http://www.pdb.org/pdb/explore/explore.do?structureId=2BEX 2], [http://www.pdb.org/pdb/explore/explore.do?structureId=1BKV 3] | * A few examples of proteins with internally repetitive sequences: [http://www.pdb.org/pdb/explore/explore.do?structureId=1QYY 1], [http://www.pdb.org/pdb/explore/explore.do?structureId=2BEX 2], [http://www.pdb.org/pdb/explore/explore.do?structureId=1BKV 3] | ||
--> | |||
<!-- | |||
'''Jan 23, 2025 - Sequence Alignment I''' | '''Jan 23, 2025 - Sequence Alignment I''' | ||
* Reminder relevant to our discussion of ChatGPT last class: CNET & other news sources used it to write articles; [https://gizmodo.com/cnet-ai-chatgpt-news-robot-1849996151 this Gizmodo story] found that "the AI-program fabricates information and bungles facts like nobody’s business" and CNET was "forced to issue multiple, major corrections". So, if you do opt to try ChatGPT to help with Python, be sure to check (and then double-check) everything. | * Reminder relevant to our discussion of ChatGPT last class: CNET & other news sources used it to write articles; [https://gizmodo.com/cnet-ai-chatgpt-news-robot-1849996151 this Gizmodo story] found that "the AI-program fabricates information and bungles facts like nobody’s business" and CNET was "forced to issue multiple, major corrections". So, if you do opt to try ChatGPT to help with Python, be sure to check (and then double-check) everything. | ||
Line 264: | Line 269: | ||
* [http://www.marcottelab.org/users/BCH394P_364C_2025/BLOSUM_paper.pdf The original BLOSUM paper] (hot off the presses from 1992!) | * [http://www.marcottelab.org/users/BCH394P_364C_2025/BLOSUM_paper.pdf The original BLOSUM paper] (hot off the presses from 1992!) | ||
* [http://www.marcottelab.org/users/BCH394P_364C_2025/BLOSUM62Miscalculations.pdf BLOSUM miscalculations improve performance] | * [http://www.marcottelab.org/users/BCH394P_364C_2025/BLOSUM62Miscalculations.pdf BLOSUM miscalculations improve performance] | ||
--> | |||
<!-- | |||
'''Jan 21, 2025 - Intro to Python II''' | '''Jan 21, 2025 - Intro to Python II''' | ||
* Reminder that today will be part 2 of the "Python boot camp" for those of you with little to no previous Python coding experience. We'll be finishing the slides from last time, plus Rosalind help & programming Q/A. | * Reminder that today will be part 2 of the "Python boot camp" for those of you with little to no previous Python coding experience. We'll be finishing the slides from last time, plus Rosalind help & programming Q/A. | ||
Line 271: | Line 276: | ||
* We'll talk a bit about [https://chat.openai.com/ ChatGPT] today for co-programming | * We'll talk a bit about [https://chat.openai.com/ ChatGPT] today for co-programming | ||
* Another strong recommendation (really) to the Python newbies to download Eric Matthes's GREAT, free [https://github.com/ehmatthes/pcc/releases/download/v1.0.0/beginners_python_cheat_sheet_pcc_all.pdf Python command cheat sheets] that he provides to accompany his [https://nostarch.com/pythoncrashcourse2e Python Crash Course] book. | * Another strong recommendation (really) to the Python newbies to download Eric Matthes's GREAT, free [https://github.com/ehmatthes/pcc/releases/download/v1.0.0/beginners_python_cheat_sheet_pcc_all.pdf Python command cheat sheets] that he provides to accompany his [https://nostarch.com/pythoncrashcourse2e Python Crash Course] book. | ||
--> | |||
<!-- | |||
'''Jan 16, 2025 - Intro to Python''' | '''Jan 16, 2025 - Intro to Python''' | ||
* '''Remember that today and the next lecture are dedicated to the Python Boot Camp to start getting those of you with minimal coding skills up to speed on the basics. Advanced programmers can skip class!'''<br> | * '''Remember that today and the next lecture are dedicated to the Python Boot Camp to start getting those of you with minimal coding skills up to speed on the basics. Advanced programmers can skip class!'''<br> | ||
Line 281: | Line 286: | ||
* Don't forget that the Rosalind assignments are due by 10 PM January 22. Please do start if you haven't already, or you won't have time to get help if you have any issues installing Python. | * Don't forget that the Rosalind assignments are due by 10 PM January 22. Please do start if you haven't already, or you won't have time to get help if you have any issues installing Python. | ||
* We'll use Python version 3 (any version after 3.0 should be fine; just get the latest version in Anaconda), but Rosalind and some older materials are only available in Python 2.7, so we'll generally try to be version agnostic for compatibility. For beginners, the [http://www.practicepython.org/blog/2017/02/09/python2-and-3.html differences are quite minimal] and are [https://www.guru99.com/python-2-vs-python-3.html summarized in a table here]. There's also a great [https://python-future.org/compatible_idioms.html cheat sheet here] for writing code compatible with both versions. | * We'll use Python version 3 (any version after 3.0 should be fine; just get the latest version in Anaconda), but Rosalind and some older materials are only available in Python 2.7, so we'll generally try to be version agnostic for compatibility. For beginners, the [http://www.practicepython.org/blog/2017/02/09/python2-and-3.html differences are quite minimal] and are [https://www.guru99.com/python-2-vs-python-3.html summarized in a table here]. There's also a great [https://python-future.org/compatible_idioms.html cheat sheet here] for writing code compatible with both versions. | ||
--> | |||
'''Jan 14, 2025 - Introduction''' | '''Jan 14, 2025 - Introduction''' |
Revision as of 19:08, 8 January 2025
BCH394P/BCH364C Systems Biology & Bioinformatics
Course unique #: 54960/54860
Lectures: Tues/Thurs 9:30 – 11:00 AM WEL 2.246
Instructor: Edward Marcotte, marcotte @ utexas.edu
- Office hours: Mon 4 – 5 PM on the class Zoom channel (available on Canvas)
TA: Zoya Ansari, zansari @ utexas.edu
- TA Office hours: Tues 1 - 2 PM / Fri 1 - 2 PM in MBB 3.304 or by appointment on Zoom
Class Canvas site: https://utexas.instructure.com/courses/1407802
Lectures & Handouts
Jan 14, 2025 - Introduction
- Today's slides
- We'll be conducting homework using the online environment Rosalind. Go ahead and register on the site, and enroll specifically for BCH394P/364C (Spring 2025) Systems Biology/Bioinformatics using this link. Homework #1 (worth 10% of your final course grade) has already been assigned on Rosalind and is due by 10:00PM January 22.
- We'll be using the free Anaconda distribution of Python and Jupyter (download here). Note that there are many other options out there, such as Google colab. You're welcome to use those, but we'll restrict our teaching and TA help sessions to Jupyter/Anaconda for simplicity.
Here are some online Python resources that you might find useful:
- First and foremost, and very, very useful if you're a complete Python newbie: Eric Matthes's Python Crash Course book. He made some GREAT, free Python command cheat sheets to support the book.
- Practical Python, worth checking out!
- If you have any basic experience at all in other programming languages, Google offered an extremely good, 2-day intro course to Python (albeit version 2) that is now available on YouTube.
- Khan Academy has archived their older intro videos on Python here (again, version 2)
Syllabus & course outline
An introduction to systems biology and bioinformatics, emphasizing quantitative analysis of high-throughput biological data, and covering typical data, data analysis, and computer algorithms. Topics will include introductory probability and statistics, basics of Python programming, protein and nucleic acid sequence analysis, genome sequencing and assembly, proteomics, analysis of large-scale gene expression data, data clustering & classification, biological pattern recognition, gene and protein networks, AI/machine learning, and protein 3D structure prediction/design.
Open to graduate students and upper division undergrads (with permission) in natural sciences and engineering.
Prerequisites: Basic familiarity with molecular biology, statistics & computing, but realistically, it is expected that students will have extremely varied backgrounds. Undergraduates have additional prerequisites, as listed in the catalog.
Note that this is not a course on practical sequence analysis or using web-based tools. Although we will use a number of these to help illustrate points, the focus of the course will be on learning the underlying algorithms, exploratory data analyses, and their applications, esp. in high-throughput biology. By the end of the course, students will know the fundamentals of important algorithms in bioinformatics and systems biology, will be able to design and implement computational studies in biology, and will have performed an element of original computational biology research.
Most of the lectures will be from research articles and slides posted online, with some material from the...
Optional text (for sequence analysis): Biological sequence analysis, by R. Durbin, S. Eddy, A. Krogh, G. Mitchison (Cambridge University Press).
For biologists rusty on their stats, The Cartoon Guide to Statistics (Gonick/Smith) is very good. A reasonable online resource for beginners is Statistics Done Wrong. A truly excellent stats book, available as a free download, is An Introduction to Statistical Learning, by James, Witten, Hastie, Tibshirani, and Taylor; it is accompanied by many supporting Python examples and applications.
Two other online probability & stats references: #1, #2 (which has some lovely visualizations)
No exams will be given. Grades will be based on online homework (counting 30% of the grade), 3 problem sets (given every 2-3 weeks and counting 15% each towards the final grade) and an independent course project (25% of the final grade), which can be collaborative (1-3 students/project). The course project will consist of a research project on a bioinformatics topic chosen by the student (with approval by the instructor) containing an element of independent computational biology research (e.g. calculation, programming, database analysis, etc.). This will be turned in as a link to a web page. The final project is due by 10 PM, April 16, 2025. The last 3 classes will be spent presenting your projects to each other. (The presentation will account for 5/25 points of the project grade.)
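Since this is a Python course, here is a minimal sketch of how the grade components above add up. It assumes each component is scored as a fraction of its possible points; the names `WEIGHTS` and `course_total` are made up for illustration and are not part of any official grading tool.

```python
# Points each component contributes to the 100-point course total,
# taken from the percentages stated in the syllabus.
WEIGHTS = {
    "homework": 30,       # online Rosalind homework
    "problem_set_1": 15,
    "problem_set_2": 15,
    "problem_set_3": 15,
    "project": 25,        # includes 5 points for the presentation
}


def course_total(scores):
    """Return the course total (0-100) given the fraction earned (0.0-1.0)
    on each component. Hypothetical helper, for illustration only."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Example: full marks everywhere except half credit on one problem set.
example = {name: 1.0 for name in WEIGHTS}
example["problem_set_2"] = 0.5
print(course_total(example))
```

The weights sum to 100, so the total can be read directly as a percentage.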
If, at some point, we have to go into coronavirus lockdown, that portion of the class will be web-based. We will hold lectures by Zoom during the normally scheduled class time. Log in to the UT Canvas class page for the link, or, if you are auditing, email the TA and we will send the link by return email. Slides will be posted before class so you can follow along with the material. We'll record the lectures & post the recordings afterward on Canvas so that any of you who are in other time zones or otherwise unable to make class will have the opportunity to watch them. Note that the recordings will be available only on Canvas, are reserved for students in this class for educational purposes, and are protected under FERPA. The recordings should not be shared outside the class in any form. Violation of this restriction could lead to Student Misconduct proceedings.
Online homework will be assigned and evaluated using the free bioinformatics web resource Rosalind.
All projects and homework will be turned in electronically and time-stamped. No makeup work will be given. Instead, all students have 5 days of free “late time” (for the entire semester, NOT per project, and counting weekends/holidays). For projects turned in late, the number of days late (in 1-day increments, rounding up, i.e. 10 minutes late = 1 day deducted) will be deducted from the 5-day total (or what remains of it). Once the full 5 days have been used up, assignments will be penalized 10 percent per day late (rounding up), e.g., a 50-point assignment turned in 1.5 days late would be penalized 20%, or 10 points.
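The late-time arithmetic can be sketched in a few lines of Python. This is only an illustration of the rule as written, not an official calculator. The function name `apply_late_policy` is hypothetical, and two details are assumptions rather than stated policy: that remaining free days are applied first when an assignment is later than the free time left, and that the penalty is capped at 100%.

```python
import math

FREE_LATE_DAYS = 5            # semester-wide allowance of free late days
PENALTY_PERCENT_PER_DAY = 10  # applied once the free days are used up


def apply_late_policy(hours_late, free_days_left=FREE_LATE_DAYS):
    """Return (penalty_percent, free_days_remaining) for one submission.

    Lateness is rounded up to whole days, so 10 minutes late counts as 1 day.
    ASSUMPTION: remaining free days are consumed before any penalty applies,
    and the penalty is capped at 100 percent.
    """
    days_late = math.ceil(hours_late / 24) if hours_late > 0 else 0
    days_covered = min(days_late, free_days_left)
    penalized_days = days_late - days_covered
    penalty_percent = min(100, penalized_days * PENALTY_PERCENT_PER_DAY)
    return penalty_percent, free_days_left - days_covered


# The syllabus example: 1.5 days (36 hours) late with no free days left.
penalty, remaining = apply_late_policy(36, free_days_left=0)
points_lost = 50 * penalty / 100  # for a 50-point assignment
print(penalty, remaining, points_lost)
```

With no free days left, a submission 36 hours late rounds up to 2 days, giving a 20% penalty, which is 10 points on a 50-point assignment, matching the example in the policy.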
Homework, problem sets, and the project total to a possible 100 points. There will be no curving of grades, nor will grades be rounded up. We’ll use the plus/minus grading system, so: A = 92 and above, A- = 90 to 91.99, etc. Just for clarity's sake, here are the cutoffs for the grades: A ≥ 92%, 90% ≤ A- < 92%, 88% ≤ B+ < 90%, 82% ≤ B < 88%, 80% ≤ B- < 82%, 78% ≤ C+ < 80%, 72% ≤ C < 78%, 70% ≤ C- < 72%, 68% ≤ D+ < 70%, 62% ≤ D < 68%, 60% ≤ D- < 62%, F < 60%.
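For anyone who wants to check where a score lands, the cutoffs above can be written as a small lookup. This is a sketch for illustration; the names `GRADE_CUTOFFS` and `letter_grade` are hypothetical. Note that there is no rounding step, in keeping with the rule that grades are not rounded up.

```python
# (minimum percent, letter) pairs, highest cutoff first,
# copied from the cutoffs listed in the syllabus.
GRADE_CUTOFFS = [
    (92, "A"), (90, "A-"),
    (88, "B+"), (82, "B"), (80, "B-"),
    (78, "C+"), (72, "C"), (70, "C-"),
    (68, "D+"), (62, "D"), (60, "D-"),
]


def letter_grade(percent):
    """Map a course percentage to a letter grade, with no rounding up."""
    for cutoff, letter in GRADE_CUTOFFS:
        if percent >= cutoff:
            return letter
    return "F"


print(letter_grade(91.99))  # just under the A cutoff
```

Because the list is ordered from the highest cutoff down, the first cutoff a score meets decides the grade.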
Students are welcome to discuss ideas and problems with each other, but all programs, Rosalind homework, problem sets, and written solutions should be performed independently (except for the final collaborative project). Students are expected to follow the UT honor code. Cheating, plagiarism, copying, & reuse of prior homework, projects, or programs from CourseHero, Github, or any other sources are all strictly forbidden and constitute breaches of academic integrity and cause for dismissal with a failing grade, possibly expulsion (UT's academic integrity policy). In particular, no materials used in this class, including, but not limited to, lecture hand-outs, videos, assessments (papers, projects, homework assignments), in-class materials, review sheets, and additional problem sets, may be shared online or with anyone outside of the class unless you have the instructor’s explicit, written permission. Any materials found online (e.g. in CourseHero) that are associated with you, or any suspected unauthorized sharing of materials, will be reported to Student Conduct and Academic Integrity in the Office of the Dean of Students. These reports can result in sanctions, including failure in the course.
The use of artificial intelligence tools (such as ChatGPT or GitHub Copilot) in this class shall be permitted on a limited basis for programming assignments. You are also welcome to seek my prior approval to use AI writing tools on any assignment. In either instance, AI writing tools should be used with caution, and their use must be properly cited and attributed. Using AI writing tools without my permission or authorization, or failing to properly cite AI even where permitted, shall constitute a violation of UT Austin’s Institutional Rules on academic integrity.
Students with disabilities may request appropriate academic accommodations from Disability and Access.
The final project website is due by 10 PM April 16, 2025
- How to make a website for the final project
- Google Site: https://sites.google.com/new
- You might also consider streamlit, which lets you generate websites on the fly direct from Python