www.fgks.org   »   [go: up one dir, main page]

Warning: The NCBI web site requires JavaScript to function. more...

Loading Tabular Data

Introduction

This tutorial was prepared using Genome Workbench version 2.7.0

All the illustrations in this tutorial are fully applicable for the MS Windows 7 users. Unix/Linux users or users of the other MS Windows versions might experience minor variations in default settings, column order, window size, and other insignificant differences.

This tutorial will take you through the steps necessary to import tables into Genome Workbench. Table can be any file that has a fixed number of fields in each row and has well-defined separators, such as tabs, whitespaces or commas, between those fields. This can be very useful for user-created files, such as data exported from Microsoft Excel (tm). When a file is in a format already supported by Genome Workbench, such as a BED file, the file can be opened directly by selecting File on the open dialog with the File Format set to "Bed Files" or "Autodetect format".

  • Select and Preview the Table
  • Select Field Delimiters
  • Identify Field Types
  • Transform and Save Results
  • View Transformed Tables

The following file will be used as the sample dataset for this tutorial:

sample_table.txt

Step 1: Select and Preview the Table

To start, Select Open command from the File drop-down menu. Select File Import on the left side of the pop-up dialog.

Select files open dialog

This opens the file-import wizard. On the first page under File Format: select "Table files". You may also select "Autodetect format"but that may not work consistently for all tabel files.

Under Filenames click the file selection button (...) and use the file browser to select sample_table.txt.

Click the Next> button and you will see the contents of the file under Table Preview at the bottom of the page (you might have to increase the size of the pop-up dialog to see the content of the Table Preview section).

Preview sample table

Notice the option Table Data Format on this screen. It is telling you that the system recognizes this is a delimited table. This option lets you decide how to import the table, as a Delimited table or a Fixed Width table. Most of the tables we deal with should be imported as delimited.

The next option, Import data starting with row lets you skip the first rows of the table because, for example, they are comment rows. Just set the number of the first row you want to import here. The following field, Comment Character can be set to a single character that precedes all of the header rows. It may be filled in automatically after you select the file, but you can change it if you wish.

The last option Parse Column Names from Selected Row lets you click on a row in the table from which the importer will then extract column header names. Column names will be parsed from that row with the same rules used for parsing other rows. Header names can also be entered manually.

Step 2: Select Field Delimiters

After selecting Next > on the preview screen you will be able to select the delimiters for separating the table rows into fields. When the file was read in the table loader made a "best guess" on the delimiters and on what will be shown on this page. To change them, click on the combination of delimiters and watch the results under Table Data to see which combination separates the fields correctly.

Select sample table delimiiters

In addition to choosing the delimiters you can choose whether to Treat adjacent delimiters as a single delimiter. If your delimiter is a tab, for example, and you have two adjacent tabs with no text in-between, that can create a blank field or it can be treated as a single break between fields. You can also choose whether there is a quote-character for defining strings. If you choose this option and your delimiter is a comma (for example) than any commas that appear between quotes will not be treated as field separators.

If on the first screen you indicated that the Table Data Format was a Fixed Width format, then you will be taken to a different screen for dividing rows into fields. On this screen you split existing columns in two by clicking on Insert Column and then clicking on the column to be split. To remove a column, click on Remove Column and then click on the column to be removed. After you add and remove columns you can move the column dividers around until the fields are properly divided.

Define sample table fixed width columns

Step 3: Identify Field Types

After leaving the delimiter or fixed width pages that divided the rows into fields, the table importer takes its best guess as to the underlying data and semantic types for each column. On this page you can review and update the column names, data type and properties.

Set sample table column properties

To update the name and properties for a column, click on the column name in the table at the bottom of the page. When you then pick a Data Type such as Sequence ID or Integer the Properties list will change to show options for that type. If you select an Integer type you need to indicate whether the value is One-Based or not. The standard convention in biological data is for data to be one-based, e.g. the first nucleotide in a sequence is position one. If a field is 0-based, the more common programming convention, then the first element of the set would be position 0. Lastly under types, if you do not want to import a column, select Skipped.

If one or more ID columns in the table represent a chromosome number or letter, then select an assembly for the column so that an accession can be found for the ID in the next import step. To choose an assembly for a column, click on the column header, then enter a search term to find the correct genome and then click Find Assembly.

Find sample table assembly 1

Find Assembly brings up the Select Assembly selection dialog.

Find sample table assembly 2

Select the correct assembly here for the current column. Note that if you have chromosome identifiers in more than one column, you can assign a different assembly to each of these ID columns.

Find sample table assembly 3

After the column names, types and, if needed, assemblies have been specified for the columns - hit Next > to go to the transformations page.

Step 4: Transform and Save Results

The transformation page allows you to save your table and, if the proper fields are available, to specify locations or features in order to link your table entries to other data. The three transformation and save options are:

  • Convert to Feature Table
  • Create Locations
  • Keep Table Unchanged

The last transformation option, Keep Table Unchanged saves your table directly as an ASN file with the column names and types you specified on the previous Identify Field Types page. This is the only option that you will be able to use if your table does not include the ID, start position and stop position columns needed to create features or locations.

The Convert to Feature Table transformation option adds and updates fields as needed so that the rows can be viewed as region features in Genome Workbench. If none of the columns were identified as a region field, a region field is added automatically. An ID, start, and stop column are also required. If there is no strand column, one is added based on whether stop is greater than or equal to start (positive) or start is greater than stop (negative). Lastly, if more than one ID column has been identified, you will not be able to create region features since these features can only refer to one region per row. To fix this, go back to the previous page and set the type of the column(s) you do not want to transform to be Text or Integer rather than Sequence ID.

Transform sample table

Summary of rules for transforming rows into region features:

  • Strand not provided: Add strand field, setting value to be positive if start>=stop, and negative otherwise. Swap start and stop values for negative strands.
  • Strand field provided as string: Add numeric strand field using standard NCBI mapping
  • Start is one-based: subtract 1 from start
  • Stop is one-based: subtract 1 from stop
  • Region field not provided: Add region field with generated region names (region + row#)
  • Stop field not provided but a length field is: Add a stop field where stop == start + length

Error conditions while converting region features:
If any rows have errors, an error column will be appended to the table. Each row will have its own error column value.

  • Start or stop are negative (this is an error whether nor not strand is provided)
  • Start or stop fields are 0 when they are specified as being 1-based by the user
  • Length is provided instead of stop and value is less than 0
  • A strand is provided and stop<start
  • A strand is provided but its value is not recognized, causing strand to be saved as "unknown".
  • Seq-id not recognized
  • Required field missing - Seq-id, start, and stop (or length) are required

The Create Locations option does not change any of the existing columns but it does add a new location column to the table for each set of ID, start and stop fields in the table. When you open the resulting ASN file in Genome Workbench you will be able to click on the locations and jump to the corresponding positions in the Graphical View.

Summary of rules for adding locations (multiple locations per row are allowed if data supports it)

  • Strand not provided: Specify + strand in seq-loc if stop>=start, and – otherwise
  • Strand field provided as string: Add numeric strand field to seq-loc using ENa_strand enum values
  • Strand not provided: Set value of strand in seq-loc based on whether start > stop
  • Strand not provided and stop < start: Swap start and stop fields in seq-loc, but do not update values in table
  • Start is one-based: Subtract 1 from start and from start position in the generated seq-loc
  • Stop is one-based: Subtract 1 from stop and from stop position in the generated seq-loc
  • Stop field not provided but length is: Set stop value in seq-loc to start + length

Possible errors when adding locations

  • The start or stop field is negative
  • Start or stop fields are 0 when they are specified as being 1-based by the user
  • Length field provided instead of start and its value < 0
  • A strand is provided and stop<start
  • Strand is provided but its value is not recognized, so an error is logged and value in seq-loc is set based on whether stop>start
  • Seq-id not provided or provided but not recognized
  • Missing or non-numeric number field for start, stop, or, if stop not provided, length

This window also gives you the option to save your transformation parameters for the table by selecting Save Import Parameters. This option is provided for debugging and for future use.

Once you have chosen a Transformation Type click Next > to select project.

Select project

Click Finish to save the resulting ASN file which you will then be able to open as a table in Genome Workbench.

Step 5: View Transformed Tables

Once you have saved the transformed table you will see it in your project tree and you will be able to open it as a Generic Table View or as a Text View.

If the table was saved without changes then the fields should appear just as they did in the original file, minus any fields that you chose not to import.

Imported sample table unchanged

If you added locations to your table, you will see the location field(s) in the Generic Table View on the right hand side of the table. Double-clicking on the locations will cause the corresponding sequence to open at the specified location. If you have not previously selected a default view you will be presented with the Open View intermediate dialog.

Imported sample table locations

If you specified a feature table in your transformation step, then you can also open the the corresponding sequence in the same folder of the project tree and see the features on the sequence. Note that to see region features in the Graphical View, you also need to select the display option Other features as shown in the image.

Imported sample table features

Last updated: 2013-04-05T12:37:10-04:00