Applied biostatistics
Weekly overview
-
-
To contact me (24/24):
email: darlene.goldstein@epfl.ch
tel/sms/whatsapp/signal: 079 427 2501
skype: darlenegoldstein
Course format:
Although we are currently allowed to attend the course on campus, ALL LECTURES WILL ALSO BE PRERECORDED, since some of you may be unable to attend in person from time to time. Please follow the lecture before the 'office hours' or lab time so that you can ask any additional questions then.
Office hours: I will be available for your questions in my office MA B1 477 by appointment.
The lab time is Tuesday 16.15-18.00 in CO3. I understand that there will be some class conflicts, so I will try to find a new room for those of you who are unable to attend at the assigned time. For the first week, in any case, please try to attend either my office hours or the lab time, and we will figure out a solution to any conflict.
Course language:
This course is given in English, but feel free to speak in either English or French.
Resources:
A useful book for both statistics and R:
- A Handbook of Statistical Analyses Using R, 3rd edition. Torsten Hothorn and Brian S. Everitt. CRC Press.
Some resources to get you started with R, RStudio and R Markdown:
-
Repository of R packages, you can download R from here. Also a good source of documentation (see 'Contributed' under the Documentation heading).
-
RStudio makes R easier to use. It includes a code editor, debugging & visualization tools. Choose the free desktop version corresponding to your computer and operating system.
-
Tutorials and examples for reproducible research using R:
-
Forum for students to find group members. Once you have formed a group, please send me 1 email containing the names of all group members. As a reminder, your group can contain 1-4 persons.
-
-
Week 1: Course organization, reproducible research, confidence interval / hypothesis testing review
Organization: you will write a short group report (~ 5-7 pages; a 'group' can be 1-4 persons), a short group article critique (1 page, it can be in question/answer format), and a longer individual report (up to ~ 7-10 pages). The 2 reports will be about data analyses you carry out. The group data set will be assigned to you. For the individual report, you can choose a topic from a list that I will provide once we have covered all the eligible topics in lecture. I will announce when you can email me your choice, so please do not send me an email earlier than that. Once you email me your choice, I will assign you a data set on that topic.
The purpose of this course is to help you to learn something without too much stress!! That is why you can do each of the 2 reports twice: a preliminary version, which will be commented according to posted criteria, then a final version, where you can incorporate the comments, due at the end of the semester. Only the final version will count towards your course note. The deadlines will be posted on the course moodle page.
For the article critique, you will get the 1/2 point (full credit) as long as you submit it by the deadline - you don't need to do a preliminary version.
In order to give you time to work on your reports, there will be no in-person lectures toward the end of the course, and the topics in those weeks are mainly optional. These 'extra' topics are NOT required; there are slides (and possibly videos) in case you are interested. There is no penalty associated with not following them.
NOTE: this first week's LECTURE is ONLINE ONLY, there will be NO in-class lecture. Please come to the lab meeting in CO3 on Tuesday afternoon for the EDA presentation and to get started with R and RStudio.
Grading
- 1/2 point: short report 1 (either regression or anova, will be assigned to you), can be in a group of up to 4 people
- 1/2 point: short critique on a scientific article (will be assigned to you), can be in a group of up to 4 people
- 5 points: individual analysis report (your choice among a number of topics)
-
(with some extra slides on Central Limit Theorem, Confidence Intervals)
-
Practice with R - first download R and RStudio, then work through the exercises.
-
Week 2: Linear regression modeling
You can already email me your groups (1 email per group); remember, each group can contain 1-4 persons. Each group will be assigned to analyze EITHER a regression data set OR an anova data set.
-
Week 3: Experimental design, Analysis of variance (anova)
Report 1: (initial/preliminary deadline Friday 11 April - any time)
The purpose of this assignment is to give you practice writing a scientific report. Report writing is an extremely important skill, regardless of whether you continue in an academic career, in government or in industry.
You should analyze your data in an appropriate manner (either like lab week 2 for regression or lab week 3 for anova, or a combination if you have both factor and continuous explanatory variables) and write a short report, ~ 5 pages (7 pages max).
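As a rough illustration of the "combination" case, here is a minimal sketch of a linear model with one factor and one continuous explanatory variable. It uses the built-in ToothGrowth data purely as a stand-in (an assumption for illustration; it is not one of the assigned data sets), and the object names are arbitrary.

```r
# Illustration only: ToothGrowth ships with R and is NOT a course data set.
data("ToothGrowth")

# supp is a factor (delivery method), dose is continuous
str(ToothGrowth)

# Additive model: one factor plus one continuous explanatory variable
fit_add <- lm(len ~ supp + dose, data = ToothGrowth)

# Same terms plus their interaction (a separate dose slope per supplement)
fit_int <- lm(len ~ supp * dose, data = ToothGrowth)

summary(fit_add)          # coefficient estimates and standard errors
anova(fit_add, fit_int)   # F test: does the interaction improve the fit?

# Standard residual diagnostic plots for the larger model
op <- par(mfrow = c(2, 2))
plot(fit_int)
par(op)
```

Which terms to keep, and which diagnostics matter, depends on your own data; this only shows the mechanics.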
The goal is NOT to replicate the analysis presented in the paper corresponding to the data set, so don't worry if you do something different, or obtain results that are different from the paper when you are doing the same thing that the paper seems to describe. YOU are in charge of the analyses you carry out !!
Please submit your report as a .pdf file, (NOT .DOC, etc.) in the moodle assignment space, 1 per group. The spaces will be labeled R1, R2, A1, A2, for regression problems 1-2 and anova problems 1-2. Your file name should be labeled as XX-##.pdf, etc., where XX is your assigned problem (either R1, R2, A1, or A2) and ## is your group number.
Your report should contain a short background/intro to the problem (including the aim of the original study), a presentation of the results of your statistical analyses, including exploratory data analysis, model fitting and final model, along with a short discussion of any shortcomings of the final model, and your conclusions. Include relevant graphics and tables, but DO NOT include any raw R code or output (you will be penalized for this if you do). Your graphs should be 'pretty'. If you copy/paste a graph from the screen, it will most likely appear blurry (png file) and you will be penalized for this. It is easiest to include nice-looking graphs if you save a pdf version and use R Markdown, but this is not the only way.
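One way to save a pdf version of a graph is the base R pdf() graphics device. The sketch below assumes a hypothetical file name (figure1.pdf in a temporary directory) and uses the built-in cars data only as a placeholder plot.

```r
# Hypothetical output file; tempdir() keeps the example self-contained
out_file <- file.path(tempdir(), "figure1.pdf")

# Open a PDF graphics device (width and height are in inches)
pdf(out_file, width = 6, height = 4)

# Any base-graphics plot goes here; cars is a built-in example data set
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     pch = 19, col = "steelblue")
abline(lm(dist ~ speed, data = cars), lwd = 2)

# Close the device so the file is actually written to disk
dev.off()
```

Because a pdf is a vector format, the figure stays sharp when scaled in the report. In R Markdown, the knitr chunk option dev = "pdf" requests the same device for the figures in a chunk.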
Please use 12 point size and margins of 2.5 cm. Please remember to number each page at the bottom (including page 1). Inside the top margin of each page, please include the surnames of each group member (separated by commas).
Do not include a cover page, abstract, table of contents, or EPFL logo, and do not exceed 7 pages (not including any references) or you will be penalized.
Your report will also be graded based on language use and overall presentation. (It can be in either English or French.)
As a reminder, this report counts for 1/2 point (out of 6) of your course note.
The initial deadline (11 April, any time) is for your preliminary report. The final version is due by 30 June (any time).
If you turn in your report before the initial deadline then we will be able to comment on your report and you can re-do it before the final deadline. If you need to turn it in later, that's ok, I should still have enough time to comment on it for you, so.... NO STRESS !!!!!
When you email me with the names of your group members, I will send you the dataset (after Lab 3).
-
UPDATED: 22.55 Tuesday 11 March
-
Your report will be assessed according to these criteria.
-
-
-
Week 4: Model selection
-
Carry out this R tutorial with an example environmental dataset. (It is ok to skip the part about partial correlation analysis, section 7.1.2.)
-
Week 5: Generalized linear modeling, logistic regression, Poisson regression
-
Poisson Regression (OPTIONAL)
-
Week 6: Survival analysis
Second assignment
This assignment is a statistical critique of a published paper. Your report can be written either as a full review or in question/answer format, simply responding to each question. Your report should not be more than 1 page.
You can turn in this report any time before the final deadline of 30 June (any time). You will get full credit (i.e. 1/2 point toward your course note) for turning in a reasonable effort.
There is a deposit slot near the bottom of the course moodle page for you to submit your report.
Groups who worked on regression problems:
L1: http://www.jcancer.org/v09p1421.htm
Groups who worked on anova problems:
L2: https://www.sciencedirect.com/science/article/pii/S1743919118307337
A guide sheet (study assessment questions) is uploaded to help you to address statistical issues. The file contains a longer list of questions to consider when evaluating a study in your future career. As a guide for your 2nd assignment report, please make sure that you respond particularly to the following (numbers in parentheses represent points out of 6):
(1) 1. Briefly give the biomedical background for the paper. What question/hypothesis is being investigated?
(1) 2. What data are collected (include how many individuals, what variables, inclusion / exclusion criteria for the study)?
(1) 3. What analyses were carried out? Are these analyses appropriate for the problem?
(1) 4. What other analyses should have been done (or might have been done but not shown)? Explain.
(1) 5. Is there any mention of power of the analyses? How would you go about trying to estimate power? (NOTE: you do NOT have to actually give power estimates, just say how you might go about it.)
(1) 6. What conclusions do the authors draw? Are these conclusions substantiated by the results? Explain.
-
This package has stepwise selection functions for linear, generalized linear and Cox models. It should be helpful for those of you with Poisson or logistic data (both are GLMs) or survival data. You can download this package from CRAN:
https://cran.r-project.org/web/packages/My.stepwise/index.html
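To show what a stepwise search looks like in general, here is a minimal sketch using base R's step() function, which is AIC-based. This is a different tool from My.stepwise, so its selection criteria and the models it picks may not match. The built-in warpbreaks data and the Poisson family are assumptions chosen only for illustration.

```r
# Illustration only: warpbreaks is a built-in data set, not a course data set
full <- glm(breaks ~ wool * tension, family = poisson, data = warpbreaks)
null <- glm(breaks ~ 1, family = poisson, data = warpbreaks)

# Stepwise search in both directions between the null and full models,
# adding or dropping one term at a time according to AIC
sel <- step(null,
            scope = list(lower = formula(null), upper = formula(full)),
            direction = "both", trace = 0)

formula(sel)   # terms retained by the search
summary(sel)   # coefficients of the selected model
```

Automatic selection is only a starting point; the selected model still needs the usual diagnostics and a sanity check against the scientific question.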
-
Week 7: Discrete data analysis, contingency tables, 2x2 tables; data visualization; asymptotic and exact tests
-
Work on manipulating tables and carrying out tests (sections 2.1-2.5, 3.1-3.5 only).
Before starting, you will need to load the vcd and vcdExtra packages using the R function library().
NOTE: The web address for the article by Richard Darlington (section 3.5) is incorrect.
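As a warm-up for the asymptotic and exact tests, here is a minimal sketch on a 2x2 table using only base R (no vcd needed). The built-in Titanic table is an assumption chosen for illustration and is not part of the lab material.

```r
# Illustration only: Titanic is a built-in 4-way table (Class, Sex, Age, Survived)
# Collapse it to a 2x2 table of Sex by Survived
tab <- margin.table(Titanic, margin = c(2, 4))
tab

# Row proportions: survival rate within each sex
prop.table(tab, margin = 1)

# Asymptotic test: Pearson chi-squared
# (for 2x2 tables, Yates' continuity correction is applied by default)
asym <- chisq.test(tab)

# Exact test: Fisher's exact test, which also reports an odds-ratio estimate
exact <- fisher.test(tab)

asym$p.value
exact$p.value
exact$estimate
```

With large counts like these, the two tests agree closely; the exact test matters most when some expected cell counts are small.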
Explore making mosaic plots
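## Example R code for Arthritis mosaic plot:
library(vcd)  ## provides mosaic() and the Arthritis data
data("Arthritis", package = "vcd")
## 2-way table of Treatment by Improved, females only
(art <- xtabs(~ Treatment + Improved, data = Arthritis, subset = Sex == "Female"))
set.seed(1071)  ## shading_max is based on a permutation test, so fix the seed
mosaic(art, gp = shading_max, gp_args = list(n = 5000), split_vertical = TRUE)
## OR, in base R: mosaicplot(art)
-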
For more information about mosaic plots in the vcd package, see the 2 vignettes:
-
Week 8: Genetic association studies, genome-wide association studies (GWAS); principal components analysis, multiple hypothesis testing
NOTE: There will be NO CLASS today and NO LAB tomorrow.
This week's labs are OPTIONAL and there will be NO LAB MEETING; you might want to have a look at them though if you choose to do a GWAS as your individual report.
NOTE: The GWAS tutorial uses biocLite to install Bioconductor packages; this is the older method. The newer method is to use BiocManager.
First install BiocManager:
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
After that is installed, you can install any Bioconductor package (e.g. GWASTools) as follows:
BiocManager::install("GWASTools")
-
This folder contains functions from the GenABEL package (no longer available) and data that will help with your GWAS analysis. For instructions and code, please follow the tutorial available at (the paper):
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5019244
OR
https://github.com/AAlhendi1707/GWAS
OR the updated version at
https://pbreheny.github.io/adv-gwas-tutorial/index.html
In the genome-wide association analysis section of the tutorial, you will get either a warning or an error about the GenABEL package. Instead of loading GenABEL, ignore the error/warning and source the files ztransform.R, rntransform.R, estlambda.R and GWAA.R (assuming those .R files are in your R working directory):
# Phenotype data preparation
# library(GenABEL)
source("ztransform.R")
source("rntransform.R")
source("estlambda.R")
source("GWAA.R")
NEW: If you are using R 4.0.x on a Mac, you may get an error when you execute the GWAA function. If that happens, you should downgrade R to 3.6.0 (for example) and see if that works. If you still have problems, please contact me and we will try to work it out.
NOTE: You do NOT have to do the very last part (Regional Association).
-
NOTE: should be princiPAL components, not princiPLE
-
Choose your individual project topic from one of the following:
- survival analysis
- logistic regression
- generalized linear model (other than logistic, e.g. Poisson)
- discrete data / contingency table analysis
- genome-wide association study (GWAS)
and EMAIL ME your choice (please follow the email instructions in the announcement). I will then send you a dataset for analysis (or you can start working on the GWAS tutorial if you are doing a GWAS, just let me know).
Your final report should be ~7-10 pages (absolute maximum, not including references; fewer pages is better if you can be concise).
The preliminary deadline is Friday 16 May (any time), then I should be able to give you feedback in 1-2 weeks. You should then have a few more weeks to work on it before the final deadline of Monday 30 June (any time).
NOTE: As a reminder, you MUST work on this individual analysis and report ALONE. Your analysis and report should represent YOUR OWN WORK. DO NOT COMMUNICATE WITH ANYONE in ANY WAY about this project. If you have ANY question or problem, please ask ONLY ME and NOT anyone else.
I will consider ANY violation of this policy as PLAGIARISM (PLAGIAT) and will report any suspicion of plagiarism/plagiat to the Vice-présidence académique – Affaires juridiques. I have reported previous students who have been sanctioned for violating this rule, including getting a course note of 1, so please DO NOT TEST ME ON THIS.
If you have ANY questions, please don't hesitate to ask ME and ONLY ME. Do not risk your course note or your EPFL career by asking or communicating with any student.
-
Please deposit only 1 report per group.
-
Please deposit your first group assignment here if you did Regression problem 1, as a pdf file named R1-## , where ## is your group number. The preliminary due date is 11 April (any time).
-
Please deposit your first group assignment here if you did Regression problem 2, as a pdf file named R2-## , where ## is your group number. The preliminary due date is 11 April (any time).
-
Please deposit your first group assignment here if you did Anova problem 1, as a pdf file named A1-## , where ## is your group number. The preliminary due date is 11 April (any time).
-
Please deposit your first group assignment here if you did Anova problem 2, as a pdf file named A2-## , where ## is your group number. The preliminary due date is 11 April (any time).
-
-
Week 9: Clinical trials (OPTIONAL)
NO CLASS OR LAB - time to work on reports; if you have any questions, please visit me during office hours or make an appointment with me.
-
Might also be of interest
-
Tutorial instructions:
You can start reading at page 5, and do exercises 9-12 but ONLY for the t-test (not the tests listed).
Next, read the section about power curves, then make a graph of power curves like the one in the tutorial, but with deltas varying from 0.1-0.9 by 0.1. Do exercise 13.
Work through the section on Cox regression and do exercise 14.
If you have time and interest, you can work through the section on Power Simulation. You can also do exercise 16 if you want.
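For the power curve step, here is a minimal sketch using base R's power.t.test(). The settings are assumptions, not taken from the tutorial: a two-sample t-test, SD = 1, a two-sided 5% level, and group sizes from 5 to 100. Adjust them to match the tutorial's graph.

```r
# Assumed settings (not from the tutorial): two-sample t-test, sd = 1,
# two-sided 5% level, group sizes 5, 10, ..., 100
deltas <- seq(0.1, 0.9, by = 0.1)
n_per_group <- seq(5, 100, by = 5)

# power_mat[i, j] = power with n_per_group[i] per group and effect deltas[j]
power_mat <- sapply(deltas, function(d) {
  sapply(n_per_group, function(n) {
    power.t.test(n = n, delta = d, sd = 1, sig.level = 0.05)$power
  })
})

# One curve per delta
cols <- rainbow(length(deltas))
matplot(n_per_group, power_mat, type = "l", lty = 1, col = cols,
        xlab = "Sample size per group", ylab = "Power", ylim = c(0, 1))
abline(h = 0.8, lty = 2)   # conventional 80% power reference line
legend("bottomright", legend = paste("delta =", deltas),
       col = cols, lty = 1, cex = 0.7)
```

Each curve rises with the sample size, and larger deltas reach the reference line with fewer observations per group.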
-
NO CLASS OR LAB - PÂQUES / EASTER
-
Week 10: Meta-analysis (OPTIONAL)
NO CLASS OR LAB - time to work on reports; if you have any questions, please visit me during office hours or make an appointment with me.
-
Week 11: Introduction to mixed-effects models (OPTIONAL)
NO CLASS OR LAB - time to work on reports; if you have any questions, please visit me during office hours or make an appointment with me.
-
Week 12: Cluster Analysis (OPTIONAL)
NO CLASS OR LAB - time to work on reports; if you have any questions, please visit me during office hours or make an appointment with me.
-
-
If your group worked on a regression problem. Due 30 June (any time).
-
If your group worked on an anova problem. Due 30 June (any time).
-