Loading Tabular Data

Introduction

This tutorial was prepared using Genome Workbench version 2.7.0

All the illustrations in this tutorial are fully applicable for the MS Windows 7 users. Unix/Linux users or users of the other MS Windows versions might experience minor variations in default settings, column order, window size, and other insignificant differences.

This tutorial will take you through the steps necessary to import tables into Genome Workbench. Table can be any file that has a fixed number of fields in each row and has well-defined separators, such as tabs, whitespaces or commas, between those fields. This can be very useful for user-created files, such as data exported from Microsoft Excel (tm). When a file is in a format already supported by Genome Workbench, such as a BED file, the file can be opened directly by selecting File on the open dialog with the File Format set to "Bed Files" or "Autodetect format".

Select and Preview the Table
Select Field Delimiters
Identify Field Types
Transform and Save Results
View Transformed Tables

The following file will be used as the sample dataset for this tutorial:

sample_table.txt

Step 1: Select and Preview the Table

To start, Select Open command from the File drop-down menu. Select File Import on the left side of the pop-up dialog.

This opens the file-import wizard. On the first page under File Format: select "Table files". You may also select "Autodetect format"but that may not work consistently for all tabel files.

Under Filenames click the file selection button (...) and use the file browser to select sample_table.txt.

Click the Next> button and you will see the contents of the file under Table Preview at the bottom of the page (you might have to increase the size of the pop-up dialog to see the content of the Table Preview section).

Notice the option Table Data Format on this screen. It is telling you that the system recognizes this is a delimited table. This option lets you decide how to import the table, as a Delimited table or a Fixed Width table. Most of the tables we deal with should be imported as delimited.

The next option, Import data starting with row lets you skip the first rows of the table because, for example, they are comment rows. Just set the number of the first row you want to import here. The following field, Comment Character can be set to a single character that precedes all of the header rows. It may be filled in automatically after you select the file, but you can change it if you wish.

The last option Parse Column Names from Selected Row lets you click on a row in the table from which the importer will then extract column header names. Column names will be parsed from that row with the same rules used for parsing other rows. Header names can also be entered manually.

Step 2: Select Field Delimiters

After selecting Next > on the preview screen you will be able to select the delimiters for separating the table rows into fields. When the file was read in the table loader made a "best guess" on the delimiters and on what will be shown on this page. To change them, click on the combination of delimiters and watch the results under Table Data to see which combination separates the fields correctly.

In addition to choosing the delimiters you can choose whether to Treat adjacent delimiters as a single delimiter. If your delimiter is a tab, for example, and you have two adjacent tabs with no text in-between, that can create a blank field or it can be treated as a single break between fields. You can also choose whether there is a quote-character for defining strings. If you choose this option and your delimiter is a comma (for example) than any commas that appear between quotes will not be treated as field separators.

If on the first screen you indicated that the Table Data Format was a Fixed Width format, then you will be taken to a different screen for dividing rows into fields. On this screen you split existing columns in two by clicking on Insert Column and then clicking on the column to be split. To remove a column, click on Remove Column and then click on the column to be removed. After you add and remove columns you can move the column dividers around until the fields are properly divided.

Step 3: Identify Field Types

After leaving the delimiter or fixed width pages that divided the rows into fields, the table importer takes its best guess as to the underlying data and semantic types for each column. On this page you can review and update the column names, data type and properties.

To update the name and properties for a column, click on the column name in the table at the bottom of the page. When you then pick a Data Type such as Sequence ID or Integer the Properties list will change to show options for that type. If you select an Integer type you need to indicate whether the value is One-Based or not. The standard convention in biological data is for data to be one-based, e.g. the first nucleotide in a sequence is position one. If a field is 0-based, the more common programming convention, then the first element of the set would be position 0. Lastly under types, if you do not want to import a column, select Skipped.

If one or more ID columns in the table represent a chromosome number or letter, then select an assembly for the column so that an accession can be found for the ID in the next import step. To choose an assembly for a column, click on the column header, then enter a search term to find the correct genome and then click Find Assembly.

Find Assembly brings up the Select Assembly selection dialog.

Select the correct assembly here for the current column. Note that if you have chromosome identifiers in more than one column, you can assign a different assembly to each of these ID columns.

After the column names, types and, if needed, assemblies have been specified for the columns - hit Next > to go to the transformations page.

Step 4: Transform and Save Results

The transformation page allows you to save your table and, if the proper fields are available, to specify locations or features in order to link your table entries to other data. The three transformation and save options are:

Convert to Feature Table
Create Locations
Keep Table Unchanged

The last transformation option, Keep Table Unchanged saves your table directly as an ASN file with the column names and types you specified on the previous Identify Field Types page. This is the only option that you will be able to use if your table does not include the ID, start position and stop position columns needed to create features or locations.

The Convert to Feature Table transformation option adds and updates fields as needed so that the rows can be viewed as region features in Genome Workbench. If none of the columns were identified as a region field, a region field is added automatically. An ID, start, and stop column are also required. If there is no strand column, one is added based on whether stop is greater than or equal to start (positive) or start is greater than stop (negative). Lastly, if more than one ID column has been identified, you will not be able to create region features since these features can only refer to one region per row. To fix this, go back to the previous page and set the type of the column(s) you do not want to transform to be Text or Integer rather than Sequence ID.

Summary of rules for transforming rows into region features:

Strand not provided: Add strand field, setting value to be positive if start>=stop, and negative otherwise. Swap start and stop values for negative strands.
Strand field provided as string: Add numeric strand field using standard NCBI mapping
Start is one-based: subtract 1 from start
Stop is one-based: subtract 1 from stop
Region field not provided: Add region field with generated region names (region + row#)
Stop field not provided but a length field is: Add a stop field where stop == start + length

Error conditions while converting region features:
If any rows have errors, an error column will be appended to the table. Each row will have its own error column value.

Start or stop are negative (this is an error whether nor not strand is provided)
Start or stop fields are 0 when they are specified as being 1-based by the user
Length is provided instead of stop and value is less than 0
A strand is provided and stop<start
A strand is provided but its value is not recognized, causing strand to be saved as "unknown".
Seq-id not recognized
Required field missing - Seq-id, start, and stop (or length) are required

The Create Locations option does not change any of the existing columns but it does add a new location column to the table for each set of ID, start and stop fields in the table. When you open the resulting ASN file in Genome Workbench you will be able to click on the locations and jump to the corresponding positions in the Graphical View.

Summary of rules for adding locations (multiple locations per row are allowed if data supports it)

Strand not provided: Specify + strand in seq-loc if stop>=start, and – otherwise
Strand field provided as string: Add numeric strand field to seq-loc using ENa_strand enum values
Strand not provided: Set value of strand in seq-loc based on whether start > stop
Strand not provided and stop < start: Swap start and stop fields in seq-loc, but do not update values in table
Start is one-based: Subtract 1 from start and from start position in the generated seq-loc
Stop is one-based: Subtract 1 from stop and from stop position in the generated seq-loc
Stop field not provided but length is: Set stop value in seq-loc to start + length

Possible errors when adding locations

The start or stop field is negative
Start or stop fields are 0 when they are specified as being 1-based by the user
Length field provided instead of start and its value < 0
A strand is provided and stop<start
Strand is provided but its value is not recognized, so an error is logged and value in seq-loc is set based on whether stop>start
Seq-id not provided or provided but not recognized
Missing or non-numeric number field for start, stop, or, if stop not provided, length

This window also gives you the option to save your transformation parameters for the table by selecting Save Import Parameters. This option is provided for debugging and for future use.

Once you have chosen a Transformation Type click Next > to select project.

Click Finish to save the resulting ASN file which you will then be able to open as a table in Genome Workbench.

Step 5: View Transformed Tables

Once you have saved the transformed table you will see it in your project tree and you will be able to open it as a Generic Table View or as a Text View.

If the table was saved without changes then the fields should appear just as they did in the original file, minus any fields that you chose not to import.

If you added locations to your table, you will see the location field(s) in the Generic Table View on the right hand side of the table. Double-clicking on the locations will cause the corresponding sequence to open at the specified location. If you have not previously selected a default view you will be presented with the Open View intermediate dialog.

If you specified a feature table in your transformation step, then you can also open the the corresponding sequence in the same folder of the project tree and see the features on the sequence. Note that to see region features in the Graphical View, you also need to select the display option Other features as shown in the image.

Download

Current Version is 2.7.6 (released June 3, 2013)

Release Notes
FTP site for all downloads
Windows
- Windows 7 (64-bit)
- Windows 2000/XP/Vista
Mac OS X
- Mac OS X 10.6+ 64-Bit (Snow Leopard, Lion, Mountain Lion)
Linux (Ubuntu 12.04 (Precise Pangolin))
- Linux x86 (64-bit DEB)
- Linux x86 (64-bit tarball)
Linux (Fedora 18)
- Linux x86 (32-bit RPM)
- Linux x86_64 (64-bit RPM)
Source
Older Versions

Genome Workbench

Genome Workbench

Loading Tabular Data

Introduction

Step 1: Select and Preview the Table

Step 2: Select Field Delimiters

Step 3: Identify Field Types

Step 4: Transform and Save Results

Step 5: View Transformed Tables

Download

Tutorials

Simple NCBI Directory

Getting Started

Resources

Popular

Featured

NCBI Information