EDI Data Uploading | kimvincentdesigns

Creating metadata with EML Assemblyline to upload to Environmental Data Initiative (EDI)

RMarkdown Script

Below is a step-by-step guide, written as an RMarkdown script, for creating the XML file needed to upload data to EDI. You can download the RMarkdown file by clicking on the button below.

Download RMarkdown file from Github

Basic steps for compiling the XML document with all necessary metadata

1. Obtain a user ID (necessary to upload). Request from info@environmentaldatainitiative.org.
user_ID <- "enter user_id here"

2. What is the title of the dataset? (be descriptive, must be between 7 and 20 words)
dataset_title <- "Enter title here"
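The word-count rule can be checked before going any further. The following is a minimal sketch in base R. `count_title_words()` is a hypothetical helper (not part of EMLassemblyline), the example title is made up for illustration, and the 7 and 20 limits are assumed to be inclusive.

```r
# Hypothetical helper: count whitespace-separated words in a title.
count_title_words <- function(title) {
  words <- strsplit(trimws(title), "\\s+")[[1]]
  sum(nzchar(words))
}

# Illustrative title only; substitute your own dataset_title.
example_title <- "Water chemistry and bacterial communities of alpine and subalpine lakes in Colorado and Wyoming"

n_words <- count_title_words(example_title)

# Assumes the 7 and 20 word limits are inclusive.
title_ok <- n_words >= 7 && n_words <= 20
```

If `title_ok` is `FALSE`, lengthen or shorten the title before building the metadata.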

3. What are the start and end dates of the dataset (not the project)?
Use the format YYYY-MM-DD with two-digit months and days, e.g. July is 07, not 7.
start_date <- "2016-07-11"
end_date <- "2016-08-23"
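A malformed date is easier to catch here than later in the workflow. This is a minimal sketch in base R, where `is_iso_date()` is a hypothetical helper introduced only for illustration. It checks that each value is a real calendar date in YYYY-MM-DD form and that the start date does not fall after the end date.

```r
# Hypothetical helper: TRUE only for a real calendar date written as YYYY-MM-DD.
is_iso_date <- function(x) {
  grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x) &&
    !is.na(as.Date(x, format = "%Y-%m-%d"))
}

start_date <- "2016-07-11"
end_date <- "2016-08-23"

# Both values must be valid before they can be compared.
dates_valid <- is_iso_date(start_date) && is_iso_date(end_date)
dates_in_order <- dates_valid && as.Date(start_date) <= as.Date(end_date)
```

The pattern check is what rejects a single-digit month such as "2016-7-11", which `as.Date()` alone would accept.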

4. What is the status of the project? Ongoing or complete? You can update datasets with more data from ongoing projects.
status <- "complete" # or "ongoing"

5. Specify the geographic region where this data was collected.
verbal_description <- "Rocky Mountain National Park and Snowy Range, Colorado and Wyoming, USA"
N <- "41.3793" # North bounding coordinate (decimal degrees)
S <- "40.1726" # South bounding coordinate (decimal degrees)
E <- "-105.5936" # East bounding coordinate (decimal degrees)
W <- "-106.2747" # West bounding coordinate (decimal degrees)
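Swapped edges or a missing minus sign on a western longitude are common slips when typing bounding coordinates. This is a minimal sketch of a sanity check in base R. `check_bounding_box()` is a hypothetical helper, and it assumes the box does not cross the antimeridian.

```r
# Hypothetical helper: basic sanity checks on bounding coordinates (decimal degrees).
check_bounding_box <- function(N, S, E, W) {
  coords <- as.numeric(c(N, S, E, W))
  all(!is.na(coords)) &&
    coords[1] >= coords[2] &&        # north edge at or above south edge
    coords[3] >= coords[4] &&        # east edge at or east of west edge
    all(abs(coords[1:2]) <= 90) &&   # latitudes within -90..90
    all(abs(coords[3:4]) <= 180)     # longitudes within -180..180
}

# The coordinates used in this guide pass the check.
box_ok <- check_bounding_box("41.3793", "40.1726", "-105.5936", "-106.2747")

# Dropping the minus signs on the longitudes makes "east" lie west of "west".
box_bad <- check_bounding_box("41.3793", "40.1726", "105.5936", "106.2747")
```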

6. Intellectual Rights: What license would you like to use?
license <- "CCBY"

7. Are you uploading LTER data? If so, specify the user domain as "LTER". If not, specify the user domain as "EDI".
user_domain <- "EDI" # or "LTER"

8. Create a package id: scope + dataset number + version

a. Identify data scope. Search here: https://portal.edirepository.org/nis/scopebrowse
scope <- "edi" # edi/ ecotrends...

b. Find the next available number in the scope you chose above. This will be the unique number identifying your dataset.
dataset_number <- 123 #"next available number"

c. Identify the version. i.e. the first upload is 1, the first revision is 2, the second revision is 3, etc.

version <- "1"

package_id <- paste0(scope, ".", dataset_number, ".", version)
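A quick pattern check shows whether the assembled identifier has the scope.number.version shape described above. This is a minimal sketch in base R. `is_package_id()` is a hypothetical helper, and the pattern (lowercase letters and hyphens for the scope) is an assumption about how scopes are written, not an official EDI rule.

```r
# Hypothetical helper: does the identifier look like scope.number.version?
# Assumes scopes use lowercase letters and hyphens only.
is_package_id <- function(id) {
  grepl("^[a-z][a-z-]*\\.[0-9]+\\.[0-9]+$", id)
}

# Built the same way as package_id in this step.
example_id <- paste0("edi", ".", 123, ".", "1")
id_ok <- is_package_id(example_id)
```

An identifier that is missing its version, such as "edi.123", fails the check.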

9. Install EMLassemblyline package
remotes::install_github("EDIorg/EMLassemblyline") # Install EML assemblyline from GitHub

library(EMLassemblyline) # load the library

10. Name the folder (directory) where you'd like to put your files.

directory <- "DirectoryName_EML"

11. Set your working directory or save your R Project to your working directory.

12. Create a folder to house your original datasets.

# dir.create() is not recursive by default, so the parent folder must already exist.
dir.create(file.path(paste0("./", directory, "/original_data")))

13. Make sure your datasets (.csv) are clean and upload them to the original data folder you just created.

Notes on cleaning: Data should be in CSV text files. If you are starting from an Excel spreadsheet, make sure it does not contain any formulas or cell comments; if you need comments, put them in their own column. If the data were stored in a database and major table linking is needed to analyze them, de-normalize them into a flat file rather than simply exporting the database tables.
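Some of these problems can be spotted automatically. This is a minimal sketch in base R. `check_csv()` is a hypothetical helper, and the small table is made up and written to a temporary file for demonstration. It flags blank or duplicated column names and text cells that still begin with "=", which usually means a spreadsheet formula was exported unevaluated.

```r
# Hypothetical helper: flag common problems in a CSV exported from a spreadsheet.
check_csv <- function(file) {
  dat <- read.csv(file, check.names = FALSE, stringsAsFactors = FALSE)
  col_names <- names(dat)
  is_text <- vapply(dat, is.character, logical(1))
  # Text cells that start with "=" are probably leftover formulas.
  has_formula <- vapply(
    dat[is_text],
    function(col) any(grepl("^=", col)),
    logical(1)
  )
  list(
    blank_names = sum(is.na(col_names) | !nzchar(trimws(col_names))),
    duplicated_names = col_names[duplicated(col_names)],
    formula_columns = names(has_formula)[has_formula]
  )
}

# Made-up table with one leftover formula in the "total" column.
tmp <- tempfile(fileext = ".csv")
writeLines(c("lake,ph,total", "A,7.1,=SUM(B2:B3)", "B,6.8,13.9"), tmp)

report <- check_csv(tmp)
```

Any column listed in `report$formula_columns` should be replaced with plain values before the file goes into the data package.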

14. Template a data package directory.
template_directories(
path = ".", # navigates to the working directory
dir.name = directory) # defined above

# View directory contents (folders to hold the EML objects)
dir(paste0("./", directory))

# Confirm that the templates directory is empty, i.e. the result is "character(0)"
dir(paste0("./", directory, '/metadata_templates'))

15. Create templates for elements to be added to the XML file.
template_core_metadata(
path = paste0("./", directory, "/metadata_templates"), # This creates a folder with several template items.
license = license) # Specified above.

# Confirm that the templates core metadata files are now present
dir(paste0("./", directory, "/metadata_templates"))

16. Specify the file path of the folder that holds your cleaned data files.
og_data.fp <- "/Users/name/Documents/Data_filepath"

17. Copy the cleaned datasets from the Data folder to the directory.

Enter the name of each CSV file to upload, along with a description of it. Repeat for all datasets to be included in this data package.
csv1 <- "Dataset1_cleaned.csv"
csv1_description <- "Water chemistry and other measured characteristics of lakes"

csv2 <- "Seq_Table_Bac_wtaxa.csv"
csv2_description <- "Relative abundance of bacterial taxa by sample for alpine and subalpine lakes"

file.copy(from = paste0(og_data.fp, "/", csv1),
to = paste0("./", directory, "/data_objects"))

file.copy(from = paste0(og_data.fp, "/", csv2),
to = paste0("./", directory, "/data_objects"))
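With more than a couple of tables, the copy step can be wrapped so that a misspelled file name is reported immediately. This is a minimal sketch in base R. `copy_data_files()` is a hypothetical helper, and the demo uses made-up files in a temporary folder rather than the real `og_data.fp` and `data_objects` paths.

```r
# Hypothetical helper: copy files into a folder, stopping if anything is missing.
copy_data_files <- function(files, from_dir, to_dir) {
  if (!dir.exists(to_dir)) {
    stop("Destination folder does not exist: ", to_dir)
  }
  src <- file.path(from_dir, files)
  missing_files <- files[!file.exists(src)]
  if (length(missing_files) > 0) {
    stop("Missing source file(s): ", paste(missing_files, collapse = ", "))
  }
  copied <- file.copy(from = src, to = to_dir, overwrite = TRUE)
  stats::setNames(copied, files)
}

# Demo with temporary folders and made-up files.
from_dir <- file.path(tempdir(), "demo_source")
to_dir <- file.path(tempdir(), "demo_data_objects")
dir.create(from_dir, showWarnings = FALSE)
dir.create(to_dir, showWarnings = FALSE)
writeLines("a,b", file.path(from_dir, "table1.csv"))
writeLines("c,d", file.path(from_dir, "table2.csv"))

result <- copy_data_files(c("table1.csv", "table2.csv"), from_dir, to_dir)
```

In the workflow above, the equivalent call would pass `c(csv1, csv2)`, `og_data.fp`, and the `data_objects` folder.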

18. Confirm that the files are present in the EML directory
dir(paste0("./", directory, "/data_objects"))

19. Create the template attribute table for each table you are uploading.
# Create template attribute tables for each table
template_table_attributes(
path = paste0("./", directory, "/metadata_templates"),
data.path = paste0("./", directory, "/data_objects"),
data.table = c(csv1, csv2))

# Check that attribute table templates are now present
dir(paste0("./", directory, "/metadata_templates"))
paste("Stop and manually edit the variable types in the attribute files: attributes...")

At this point, you will need to manually edit the variable types in the attribute files: "attributes..."
The easiest way to do this is to open the attribute tables in Excel for editing. Check especially for categorical and datetime variables as sometimes these are attributed as numeric or factors. Correctly identifying the categorical variables is very important before proceeding to the next step.
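Before opening the attribute files, it can help to list which columns are likely to need attention. This is a minimal sketch in base R. `guess_type()` is a hypothetical helper that makes rough guesses only, the table is made up, and the labels it returns are informal hints rather than the values the attribute templates expect.

```r
# Made-up example table; in practice, read one of your CSVs from data_objects.
dat <- data.frame(
  lake = c("A", "B", "C", "A"),
  sample_date = c("2016-07-11", "2016-07-12", "2016-08-01", "2016-08-23"),
  ph = c(7.1, 6.8, 7.4, 7.0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: suggest a rough type for one column.
guess_type <- function(col) {
  if (is.numeric(col)) return("numeric")
  values <- as.character(col[!is.na(col)])
  looks_like_date <- length(values) > 0 &&
    all(grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", values))
  if (looks_like_date) return("date")
  # Repeated text values suggest a categorical variable.
  if (length(unique(values)) < length(values)) return("categorical")
  "character"
}

guesses <- vapply(dat, guess_type, character(1))
```

Columns guessed as "date" or "categorical" are the ones to double-check in the attribute files.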

# These two lines open a searchable window in RStudio with the library of units
standardUnits <- EML::get_unitList()
View(standardUnits$units)

Download documentation regarding attributes here: https://environmentaldatainitiative.files.wordpress.com/2017/11/emlbestpractices-v3.pdf

After populating the attributes, you will create templates to define the categorical variables.

20. Create the template attribute table for categorical variables.
# Create template categorical variables
template_categorical_variables(
path = paste0("./", directory, "/metadata_templates"),
data.path = paste0("./", directory, "/data_objects"))
paste("Stop and manually edit the categorical variables in the categorical template files: catvars...")

21. Populate all the remaining template files

Before moving on, you will need to populate the template files in the metadata_templates folder. The code in the next step compiles all of these files, so proceed only once they are fully populated.

a. Edit categorical variables in categorical template files for each dataset you are uploading (catvars...).

b. Edit the abstract.txt file by opening and pasting the abstract information into the text file.

c. Edit the additional_info.txt file. Leave blank if no additional info.

d. Edit the custom_units.txt file with any custom units not found in the library of units.

e. Edit the keywords.txt file by opening in Excel.
For keywords, see the LTER Controlled Vocabulary Library:

https://vocab.lternet.edu/vocab/vocab/index.php

f. Edit the personnel.txt file by opening in Excel. If a person plays multiple roles, repeat that person on multiple lines, one per role. Required roles: PI, creator (this refers to the EDI repository), contact. Grant information only needs to be listed on one line.

22. Construct the EML document.

This code does not need to be altered; all arguments have been specified.

make_eml(
path = paste0("./", directory, "/metadata_templates"),
data.path = paste0("./", directory, "/data_objects"),
eml.path = paste0("./", directory, "/eml"),
dataset.title = dataset_title,
temporal.coverage = c(start_date, end_date),
geographic.description = verbal_description,
geographic.coordinates = c(N, E, S, W),
maintenance.description = status,
data.table = c(csv1, csv2),
data.table.description = c(csv1_description, csv2_description),
user.id = user_ID,
user.domain = user_domain,
package.id = package_id)
warnings()

# View directory
dir(paste0("./", directory, "/eml"))
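It is worth confirming that the EML file was actually written before uploading anything. This is a minimal sketch in base R. `find_eml_file()` is a hypothetical helper, and it assumes the output file is named after the package id with an `.xml` extension. The demo writes a placeholder file to a temporary folder instead of running `make_eml()`.

```r
# Hypothetical helper: return the path of the expected EML file, or stop.
# Assumes the file is named <package_id>.xml inside the eml folder.
find_eml_file <- function(eml_dir, package_id) {
  expected <- file.path(eml_dir, paste0(package_id, ".xml"))
  if (!file.exists(expected)) {
    stop("No EML file found at ", expected)
  }
  if (file.size(expected) == 0) {
    stop("EML file is empty: ", expected)
  }
  expected
}

# Demo with a placeholder file in a temporary folder.
demo_dir <- file.path(tempdir(), "demo_eml")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("<eml/>", file.path(demo_dir, "edi.123.1.xml"))

eml_file <- find_eml_file(demo_dir, "edi.123.1")
```

In the workflow above, the equivalent call would be `find_eml_file(paste0("./", directory, "/eml"), package_id)`. If the EML package is installed, `EML::eml_validate()` can then be used to check the file against the schema.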

Kim Vincent