• 3. Reproducibility summary

Links to: Summary, Chatbot Tutor, Practice Questions, Glossary, R functions and R packages, and Additional resources.

Chapter Summary

Doing science is hard work. It is therefore important to keep your hard-won data in a stable form that is unlikely to be corrupted, and to provide clean, reproducible code to analyze these data. Following the best practices for collecting, storing, documenting, and analyzing data laid out in this chapter will make your work more reproducible and trustworthy.

Chatbot tutor

Please interact with this custom chatbot (ChatGPT link here, Gemini link here) that I made to help you with this chapter. I suggest at least ten back-and-forth exchanges to ramp up, then stopping when you feel you have gotten what you needed from it.

Practice Questions

Try these questions!

Q1) What is the biggest mistake in the table below?

ID should be lower case.
It's perfect, change nothing.
The column name, weight, is not sufficiently descriptive; it should include the units.
date_collected_empty_means_same_as_above is too wordy; replace it with date.
Values for date_collected_empty_means_same_as_above are implied.
date is in Year-Month-Day format, while Month-Day-Year format is preferred.

ID	weight	date_collected_empty_means_same_as_above
1-A1	104	2024-03-01
1-1B	210
3-7	150
2-B	176	2024-03-15
1-A5	110

While some of these (like the long name for date) are clearly shortcomings, spreadsheets should never leave values implied.
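As a rough illustration of making implied values explicit, the sketch below rebuilds the Q1 table in R and fills each blank date with the last recorded one. It assumes the tibble and tidyr packages are installed; fill() is used here only as one possible tool and is not a function introduced in this chapter.

```r
library(tibble)
library(tidyr)

# The table from Q1; blank date cells are entered as missing values (NA).
samples <- tribble(
  ~ID,    ~weight, ~date_collected_empty_means_same_as_above,
  "1-A1", 104,     "2024-03-01",
  "1-1B", 210,     NA_character_,
  "3-7",  150,     NA_character_,
  "2-B",  176,     "2024-03-15",
  "1-A5", 110,     NA_character_
)

# Carry the last recorded date downward so that no value is left implied.
samples_explicit <- fill(
  samples,
  date_collected_empty_means_same_as_above,
  .direction = "down"
)

samples_explicit
```

Better still is to enter every date explicitly when the data are first recorded, so that no such repair step is needed.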

Q2) What would you expect in a data dictionary accompanying the table above? (select all correct)

The units for weight.
A statement that date is in Year-Month-Day format.
A statement explaining that in the date collected column, empty means same as above.

Q3) How do you read data from an Excel sheet called raw_data in an Excel file named bird_data.xlsx located inside the R project you are working in?

You cannot load Excel files into R. You must save the file as a CSV and read it in with read_csv().
Assuming the readxl package is installed and loaded, type read_xlsx(path = "bird_data.xlsx", sheet = "raw_data").
While you can read Excel files into R, you cannot specify the sheet.

Q4) What should you do to make code reproducible? (pick the best answer)

Specify the working directory with setwd().
Show the packages installed with install.packages().
Restart R once you're done, and rerun your script to see if it works.

Q5) As we saw in this chapter, R has a built-in dataset called iris. You can look at it or give it to functions by typing iris. Which variable type is the Species in the iris dataset?

numeric
logical
character
factor
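If you would like to check your answer yourself, one option is to ask R directly. This small sketch uses only base R and the built-in iris dataset.

```r
# Print the class (variable type) of the Species column.
class(iris$Species)

# str() lists every column in the dataset alongside its type.
str(iris)
```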

Q6) Consider the plot generated by the code in the previous section. The plot consists of “small multiples” (or, in ggplot language, “facets”). The facet on the far right is pink. What is the facet on the far right?

For the following question consider the diabetes dataset available at: https://raw.githubusercontent.com/ybrandvain/datasets/refs/heads/master/diabetes.csv

Q7) What is the first column in the diabetes dataset that will change its name after going through the clean_names() function?

For the next questions, consider the R script below, which is (sadly) not far from bad code I often have to handle as a data editor.

# My R script

setwd("~/ybrandva/Desktop/silphium_project")

#### load data 
silphium_data <- read_csv("mydata.csv")
View(silphium_data)


ggplot(silphium_data, aes(x = location, y = yield))+
  geom_point()

Q8) The script above worked fine on my computer yesterday, but today I opened up a new R session on the very same computer and it failed. What went wrong?

Q9) What about the script would make it work on my computer (once the thing above is fixed) but not on another computer?

Q10) What about the script is annoying for someone using this code (once it works) and should be removed, but doesn't stop our code from working?

Glossary of Terms

Absolute Path – A file location specified from the root directory (e.g., /Users/username/Documents/data.csv), which can cause issues when sharing code across different computers. Using relative paths instead is recommended.

Data Dictionary – A structured document that defines each variable in a dataset, including its name, description, units, and expected values. It helps ensure data clarity and consistency.
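A data dictionary can itself be kept as a small table. The sketch below shows one possible layout, assuming the tibble package is installed. The variable names, descriptions, and units are hypothetical placeholders rather than values from a real dataset.

```r
library(tibble)

# One row per variable in the dataset being described.
# All entries below are hypothetical and for illustration only.
data_dictionary <- tribble(
  ~variable,        ~description,                        ~units,        ~expected_values,
  "id",             "Unique identifier for each sample", NA_character_, "Text, one per sample",
  "weight_g",       "Weight of the sample",              "grams",       "Positive numbers",
  "date_collected", "Date the sample was collected",     NA_character_, "Year-Month-Day (YYYY-MM-DD)"
)

data_dictionary
```

Saving a table like this as a CSV alongside the raw data keeps the documentation and the data together.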

Data Validation – A method for reducing errors in data entry by restricting input values (e.g., dropdown lists for categorical variables, ranges for numerical values).

Field Sheet – A structured data collection form used in the field or lab, designed for clarity and ease of data entry.

Metadata – Additional information describing a dataset, such as when, where, and how data were collected, the units of measurement, and details about the variables.

R Project – A self-contained environment in RStudio that organizes files, code, and data in a structured way, making analysis more reproducible.

Raw Data – The original, unmodified data collected from an experiment or survey. It should always be preserved in its original form, with any modifications performed in separate scripts.

README File – A text file that provides an overview of a dataset, including project details, data sources, file descriptions, and instructions for use.

Reproducibility – The ability to re-run an analysis and obtain the same results using the same data and code. This requires careful documentation, structured data storage, and clear coding practices.

Relative Path – A file path that specifies a location relative to the current working directory (e.g., data/my_file.csv), making it easier to share and reproduce analyses.
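The difference between the two kinds of path can be sketched as follows, assuming the readr package is installed. The folder my_project and its contents are hypothetical and are created in a temporary directory only so that the example runs anywhere. The setwd() call merely simulates opening an R Project, which sets the working directory for you; it should not appear in a script you share.

```r
library(readr)

# Hypothetical stand-in for an R Project folder containing a data/ subfolder.
project_dir <- file.path(tempdir(), "my_project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
write_csv(
  data.frame(id = 1:3, weight_g = c(104, 210, 150)),
  file.path(project_dir, "data", "my_file.csv")
)

# Simulate opening the project (do NOT do this in shared scripts).
old_wd <- setwd(project_dir)

# Relative path: resolved from the working directory, so it travels with the project.
relative_path <- file.path("data", "my_file.csv")
my_data <- read_csv(relative_path, show_col_types = FALSE)

# Absolute path: the same file, spelled out from the root of this particular machine.
absolute_path <- normalizePath(relative_path)

# Restore the original working directory.
setwd(old_wd)

relative_path
absolute_path
```

Only the relative path would still work if the whole project folder were copied to another computer.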

Tidy Data – A dataset format where each variable has its own column, each observation has its own row, and each value is in its own cell.

Key R functions

📥 Data import

read_csv("file.csv") – Reads a CSV file into R as a tibble (from the readr package).
read_xlsx("file.xlsx", sheet = "sheetname") – Reads an Excel sheet into R as a tibble (from the readxl package).
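A minimal sketch of both import functions, assuming readr and readxl are installed. To stay self-contained it writes a small made-up CSV to a temporary file and reads the example workbook datasets.xlsx that ships with readxl, rather than any file from this chapter.

```r
library(readr)
library(readxl)

# Write a tiny, made-up CSV to a temporary file, then read it back in.
csv_path <- tempfile(fileext = ".csv")
write_csv(data.frame(id = c("a", "b"), weight_g = c(104, 210)), csv_path)
from_csv <- read_csv(csv_path, show_col_types = FALSE)

# readxl ships with example workbooks; readxl_example() returns the path to one.
xlsx_path <- readxl_example("datasets.xlsx")

# List the sheet names so you know what can be passed to `sheet`.
excel_sheets(xlsx_path)

# Read one named sheet from the workbook.
from_xlsx <- read_xlsx(xlsx_path, sheet = "iris")

from_csv
from_xlsx
```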

🔍 Inspecting data

glimpse(data) – Shows the structure of your data (from the dplyr package).
View(data) – Opens the data viewer.
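A short sketch of both inspection functions on the built-in iris dataset, assuming dplyr is installed. View() is wrapped in interactive() here as a precaution, because it opens a viewer window and is best kept out of scripts that other people will run.

```r
library(dplyr)

# Print one line per column: its name, its type, and its first few values.
glimpse(iris)

# Open the data viewer only when R is being used interactively.
if (interactive()) {
  View(iris)
}
```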

🧹 Cleaning & renaming

clean_names(data) – Standardizes column names (from the janitor package).
rename(data, new_name = old_name) – Renames columns in a dataset (from the dplyr package).
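The two functions can be combined as in the sketch below, assuming tibble, janitor, and dplyr are installed. The messy column names are made up for illustration.

```r
library(tibble)
library(janitor)
library(dplyr)

# Hypothetical data with spreadsheet-style column names.
messy <- tibble(
  `Sample ID`       = c("1-A1", "2-B"),
  `Body Weight (g)` = c(104, 176)
)

# clean_names() converts the names to lower-case snake_case.
cleaned <- clean_names(messy)
names(cleaned)

# rename() takes new_name = old_name.
renamed <- rename(cleaned, weight_g = body_weight_g)
names(renamed)
```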

🔧 Data wrangling

select(data, columns) – Chooses which columns to keep (from the dplyr package).
pivot_longer(data, cols, names_to, values_to) – Converts wide-format data to long format (from the tidyr package).
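A small sketch of both wrangling functions, assuming tibble, dplyr, and tidyr are installed and R is version 4.1 or later (for the |> pipe). The wide dataset and its column names are hypothetical.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Hypothetical wide data: one weight column per sampling year.
wide <- tibble(
  id          = c("a", "b"),
  site        = c("north", "south"),
  weight_2023 = c(104, 176),
  weight_2024 = c(110, 180)
)

long <- wide |>
  # Keep only the identifier and the two weight columns.
  select(id, weight_2023, weight_2024) |>
  # Stack the weight columns: old column names go to `year`, values to `weight_g`.
  pivot_longer(
    cols         = c(weight_2023, weight_2024),
    names_to     = "year",
    values_to    = "weight_g",
    names_prefix = "weight_"
  )

long
```

After pivoting, each row holds one observation (one individual in one year), which matches the tidy data format defined in the glossary.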

R Packages Introduced

readr – Provides fast and flexible functions for reading tabular data (here we revisited read_csv() for CSV files).
dplyr – A grammar for data manipulation. Here we introduced the rename(data, new_name = old_name) function to give columns better names.
tidyr – Helps tidy messy data. Here we introduced pivot_longer() to make wide data long.
janitor – Cleans and standardizes data, including clean_names() for formatting column names.

Additional resources

R Recipes:

Read a .csv: Learn how to read a CSV file into R as a tibble.
Read an Excel file: Learn how to read an Excel file into R as a tibble.
Obey R’s naming rules: You want to give a valid name to an object in R.
Rename columns in a table: You want to rename one or more columns in a data frame.

Other web resources:

Data Organization in Spreadsheets (Broman & Woo, 2018).
Tidy Data: (Wickham, 2014).
Ten Simple Rules for Reproducible Computational Research: (Sandve, 2013).
NYT article: For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights.
Style guide: Chapter 9 of Data management in large-scale education research by Lewis (2024). Includes sections on general good practices, file naming, and variable naming.
Data Storage and security: Chapter 13 of Data management in large-scale education research by Lewis (2024).

Videos:

Data integrity: (By Kate Laskowski who was the victim of data fabrication by her collaborator (and my former roommate) Jonathan Pruitt).
Tidying data with pivot_longer (From Stat454)