This README.txt file was generated on 20220427 (YYYYMMDD) by Ramya Ramadoss

-------------------
GENERAL INFORMATION
-------------------

1. Title of Dataset: Supplemental Material for the Manuscript "Comparative Computational Study to Augment Substrate Binding Affinity of 4-hydroxybenzoate octaprenyltransferase (ubiA) enzymes Encoded in the Genomes of Multiple Environmental Purple Photosynthetic Bacteria."

2. Author Information

   Author Contact Information
      Name: Ramya Ramadoss
      Institution: Carnegie Mellon University Qatar
      Address: Biological Sciences, Carnegie Mellon University Qatar, PO Box 24866, Doha, Qatar
      Email: rramado2@andrew.cmu.edu
      Office Phone Number: (+974) 4484852

   Author Contact Information
      Name: Drishya M. George
      Institution: Hamad bin Khalifa University, Qatar.
      Address: College of Health and Life Sciences, Hamad bin Khalifa University, Qatar Foundation, Doha, Qatar.
      Email: dgeorge@hbku.edu.qa

   Author Contact Information
      Name: Hamish R. Mackey
      Institution: Hamad bin Khalifa University, Qatar.
      Address: Division of Sustainable Development, College of Science and Engineering, Hamad bin Khalifa University, Qatar Foundation, Doha, Qatar.
      Email: hmackey@hbku.edu.qa

   Corresponding Author Contact Information
      Name: Annette S. Vincent
      Institution: Carnegie Mellon University Qatar
      Address: Biological Sciences, Carnegie Mellon University Qatar, PO Box 24866, Doha, Qatar
      Email: annettev@andrew.cmu.edu
      Office Phone Number: (+974) 4484852

---------------------
DATA & FILE OVERVIEW
---------------------

Directory of Files:

   A. Filename: Supplemental Table 1.xlsx
      Short description: Supplemental Material for the Manuscript "Comparative Computational Study to Augment Substrate Binding Affinity of 4-hydroxybenzoate octaprenyltransferase (ubiA) enzymes Encoded in the Genomes of Multiple Environmental Purple Photosynthetic Bacteria."

   B. Filename: Supplemental Table 2.xlsx
      Short description: Supplemental Material for the Manuscript "Comparative Computational Study to Augment Substrate Binding Affinity of 4-hydroxybenzoate octaprenyltransferase (ubiA) enzymes Encoded in the Genomes of Multiple Environmental Purple Photosynthetic Bacteria."

   C. Filename: Supplemental Table 3.xlsx
      Short description: Supplemental Material for the Manuscript "Comparative Computational Study to Augment Substrate Binding Affinity of 4-hydroxybenzoate octaprenyltransferase (ubiA) enzymes Encoded in the Genomes of Multiple Environmental Purple Photosynthetic Bacteria."

Additional Notes on File Relationships, Context, or Content (for example, if a user wants to reuse and/or cite your data, what information would you want them to know?):

   These files are the Supplemental Material for the manuscript titled "Comparative Computational Study to Augment Substrate Binding Affinity of 4-hydroxybenzoate octaprenyltransferase (ubiA) enzymes Encoded in the Genomes of Multiple Environmental Purple Photosynthetic Bacteria.", to be published in a journal. Some peer-reviewed journals do not host supplemental material, and incorporating the dataset into the manuscript text would make it unreadable. The public KiltHub repository DOI of this dataset is therefore cited in the manuscript text.

   Sheet1 of Supplemental Table 1.xlsx is the dataset of protein sequence entries of 4-Hydroxybenzoate octaprenyl transferase retrieved from the UniProt database. This dataset was the input to the MMseqs2 tool for sensitive sequence search and clustering analysis.

   Sheet1 of Supplemental Table 2.xlsx is the dataset of the largest cluster, Cluster-19, identified during the first step of clustering using the MMseqs2 tool. The protein sequences share 30% sequence identity and 50% minimum coverage.

   Sheet1 of Supplemental Table 3.xlsx is the dataset of the largest cluster, Cluster-35, identified during the second step of clustering using the MMseqs2 tool. The protein sequences share 40% sequence identity and 80% minimum coverage.

File Naming Convention: Objectiveoffile.xlsx

-----------------------------------------
DATA DESCRIPTION FOR: Supplemental Table 1.xlsx - Sheet "Sheet1"
-----------------------------------------

1. Number of variables: 2

2. Number of cases/rows: 18123

3. Missing data codes: The dataset has no missing data; if any value were missing, it would be denoted by "NA".

4. Variable List

   A. Name: UniProt Entry
      Description: Accession numbers of the UniProt entries of 4-Hydroxybenzoate octaprenyl transferase retrieved from the UniProt database.

   B. Name: Protein Sequence
      Description: Protein sequences of the UniProt entries of 4-Hydroxybenzoate octaprenyl transferase retrieved from the UniProt database.

-----------------------------------------
DATA DESCRIPTION FOR: Supplemental Table 2.xlsx - Sheet "Sheet1"
-----------------------------------------

1. Number of variables: 2

2. Number of cases/rows: 17873

3. Missing data codes: The dataset has no missing data; if any value were missing, it would be denoted by "NA".

4. Variable List

   A. Name: UniProt Entry
      Description: Accession numbers of the UniProt entries of 4-Hydroxybenzoate octaprenyl transferase in the largest cluster, Cluster-19, identified during the first step of clustering using the MMseqs2 tool. The protein sequences share 30% sequence identity and 50% minimum coverage.

   B. Name: Protein Sequence
      Description: Protein sequences of the UniProt entries of 4-Hydroxybenzoate octaprenyl transferase in the largest cluster, Cluster-19, identified during the first step of clustering using the MMseqs2 tool. The protein sequences share 30% sequence identity and 50% minimum coverage.

-----------------------------------------
DATA DESCRIPTION FOR: Supplemental Table 3.xlsx - Sheet "Sheet1"
-----------------------------------------

1. Number of variables: 2

2. Number of cases/rows: 4040

3. Missing data codes: The dataset has no missing data; if any value were missing, it would be denoted by "NA".

4. Variable List

   A. Name: UniProt Entry
      Description: Accession numbers of the UniProt entries of 4-Hydroxybenzoate octaprenyl transferase in the largest cluster, Cluster-35, identified during the second step of clustering using the MMseqs2 tool. The protein sequences share 40% sequence identity and 80% minimum coverage.

   B. Name: Protein Sequence
      Description: Protein sequences of the UniProt entries of 4-Hydroxybenzoate octaprenyl transferase in the largest cluster, Cluster-35, identified during the second step of clustering using the MMseqs2 tool. The protein sequences share 40% sequence identity and 80% minimum coverage.

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

1. Software-specific information:

   Name: Microsoft Excel
   Version: 2019
   System Requirements: Windows or macOS
   Open Source? (Y/N): N
   Additional Notes: The data were initially entered into Microsoft Excel and can be opened in Excel.

2. Equipment-specific information:

   Manufacturer: Dell
   Model: Inspiron 3668
   (if applicable) Embedded Software / Firmware Name: Ubuntu
   Embedded Software / Firmware Version: 14.04.6 LTS
   Additional Notes: The data were entered, cleaned, and formatted on this Dell computer.

3. Date of data collection: 20220120 - 20220205 (YYYYMMDD)

------------------------
NOTES ON REPRODUCIBILITY
------------------------

Similar data can be recreated by following the Methodology section "Extended Motif Discovery by Protein Sequence Cluster Analysis of 4-Hydroxybenzoate octaprenyl transferase (ubiA)" of the manuscript "Comparative Computational Study to Augment Substrate Binding Affinity of 4-hydroxybenzoate octaprenyltransferase (ubiA) enzymes Encoded in the Genomes of Multiple Environmental Purple Photosynthetic Bacteria." The data from the output tsv file can then be copy-pasted into a blank Excel file and the non-essential variables deleted, retaining only the variables "Entry" and "Sequence". These are then renamed to "UniProt Entry" and "Protein Sequence", respectively. Any discrepancy could result from updated versions of the UniProt database and the MMseqs2 tool.
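
For users who prefer a scripted alternative to the manual copy-paste step described in the reproducibility notes, the column selection and renaming could be sketched in Python roughly as follows. This is not the procedure the authors used (they worked in Excel). It assumes the tsv file has a header row containing "Entry" and "Sequence", and the function name reduce_columns is hypothetical. The sketch writes a two-column tsv rather than an .xlsx file, so the result would still need to be opened in Excel and saved as .xlsx to match the supplemental tables.

```python
import csv
import io

# Old header -> new header, as described in the reproducibility notes.
# Assumption: the tsv export carries "Entry" and "Sequence" among its columns.
RENAME = {"Entry": "UniProt Entry", "Sequence": "Protein Sequence"}


def reduce_columns(src, dst):
    """Copy only the Entry and Sequence columns from src to dst, renamed.

    src and dst are open text streams; returns the number of data rows written.
    """
    reader = csv.DictReader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    writer.writerow(RENAME.values())
    rows = 0
    for record in reader:
        # "NA" is the missing-data code named in this README.
        writer.writerow([record.get(old) or "NA" for old in RENAME])
        rows += 1
    return rows


if __name__ == "__main__":
    # Tiny made-up input (not real UniProt records) to show the call.
    demo = io.StringIO("Entry\tLength\tSequence\nX00001\t5\tMKLAV\n")
    out = io.StringIO()
    print(reduce_columns(demo, out), "row(s) written")
    print(out.getvalue(), end="")
```

With real files, the two streams would be opened with open(path, newline="", encoding="utf-8"); the paths depend on where the tsv output was saved.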