General Data Import Helpful Hints
These instructions were written by Evan Williams on 2 July 2008 after importing the
BxH? _ApoE null F2 data into the
GeneNetwork mirror at UCLA. Arthur Centeno at UT Memphis will know more about importing the data if this tutorial is insufficient. There was no standard error taken into account when this data was imported, since there was none available for my dataset, but if you have standard error and want to include it, talk to Rob Williams or Arthur. Edits by EGW 4 Aug 2008.
Instructions:
-- Genotype --
1. Acquire Genotype information from database, then organize in format identical to other genotype files, which can
either be found in the Search -> Database Information tab of GeneNetwork, or in the /webqtl/genotypes/ folder on
the actual server itself. Format in text format with UNIX line breaks and tabs between columns, then place the file in
/webqtl/genotypes/.
-- Expression data --
2. Download information for each database and format in Excel. The ROWS should be the Marker/Probe, and the
COLUMNS should be the Strains or F2 animals. The column names must match the genotype column names. Delete any
strain or F2 animal data here that does not have Genotype information, as it will not be useable. Export again in text
format with UNIX line breaks and tabs between columns.
3. Run QTL reaper (Python program, can be downloaded on Sourceforge. You can compile it on Windows using cygwin
or on Linux without anything special. I have had no luck compiling it on Mactel machines.) on these files. This should
be done on your local computer; put the .geno file in the same directory as the file(s), modify the example.py script to
use your genotype data and trait information, script included in appendix of this document. This might take a long
time, up to a week, so you'll probably want to run it on a server of some sort and not your PC. Hold on to those files.
-- Setting up GN --
4. Log into mySQL as root, and use db_webqtl. You'll have to add data to several tables to get the databases to show
up and be accepted into the search.html page. In this order,
4.1: table "GeneChip": select * from GeneChip; See if your chip is in the list. If it's not, INSERT INTO to add your
chip to the list.
4.2: table "InbredSet": select * from InbredSet; if your InbredSet (or F2) is not in the list, add it here.
4.3: table "AvgMethod": select * from AvgMethod; if your averaging method is not in the list, add it here. This is
unimportant for calculations, but it displays your method to users.
4.4: table "Strain": Add here all the Strains or F2 animals here that are listed in your genotype file. Make sure that
you are choosing a unique name when inputting your animal, if it is not the same as the previous animals input (as
may be the case with F2s). Make sure they are named the same in this Strain table as in the genotype file.
4.5: table "Tissue": Make sure the tissue you're listing is included in this table; if not, add it.
4.6: table ProbeFreeze: Insert some information for the new data set you're adding in. Here you have to include
the ChipId from GeneChip, the TissueId from Tissue, the InbredSetId from InbredSet, as well as some identifying
information you create here.
4.7: table ProbeSetFreeze: Similar to ProbeFreeze, but you have to do ProbeFreeze before since you have to list the
ProbeFreezeId. This should be pretty straightforward to input, just follow the format of previously input sets.
4.8: table GenoFreeze: Add your genotype name to this database (if it doesn't already exist; e.g. if you're inputting
an F2. Check if you're inputting a RI set to see if it's already there.)
4.9: table Geno**: Add your genotype file's information into the mySQL database (file still required in
/gnshare/gn/web/webqtl/genotypes directory)
The naming in this area isnt too critical, as long as it's unique, -consistent-, and descriptive. Try to follow the format
used though. At this point, if you look at the search.html page, your database should show up, although the search
feature won't work (since you don't have any actual data input).
5. Acquire the Gene Annotation information from your database that you want to import into GN. This should have
information corresponding to the markers/probes that are listed as the ROWS in the data you have (e.g. the gene name
corresponding to a specific marker).
5.1: Look at the ProbeSet table, and match your data accordingly. If some columns are missing, that's fine and to be
expected. Just make sure you have ChipId (from the GeneChip table), Name, TargetId, symbol, Chr, Mb_mm6, and BlatSeq.
Other things are nice to have, but not necessary to run. At -minimum-, the ChipId and Name columns are required; certain
features of GeneNetwork may not work without the other columns, however. You'll probably want to create this file in Excel
then export it as a tab-delimited file. After this export, open the file up in a plain text editor and remove all quotation "
marks, so that they don't show up in the "description" field when you load it into GN. For this reason, I would recommend
NOT using CSV format (or you'll have to remove the " marks in MySQL later via an UPDATE script).
5.2: Load this into GeneNetwork by using the LOAD DATA LOCAL INFILE command. Make sure you declare the columns
you're loading, so that the Id column will auto-increment. (If you just straight load it in, without declaring the
columns being loaded, it will not work properly).
6. Now to load the data into the Data table. The script to do this is in appendix.
7. Hopefully the data imported properly. If you get KeyError: 'something', then it means that your data is not listed in
the ProbeSet table. This could either be because the annotation was either done or imported improperly, or because
you have a row in the data that for some reason is not listed in the annotation file, and you did import it properly.
If the latter is the case, delete that row from the original column and re-upload it to the data table.
To reupload, you'll have to delete what you just inserted, which went into Data and ProbeSetXRef. Be careful when
deleting these! (It will likely take about 10 minutes to delete out of Data, and about 1 second to delete out of
ProbeSetXRef. Be careful! Make sure you're deleting the right things, and that you got all of it!)
8. Once you're here, the data should show up fine in GeneNetwork and be searchable for most basic features.
9. Now you'll have to make the other, non-basic features work properly -- so far the core features should be
working. For example, you'll have to align the SNP tracks to make the SNPs show up in the Interval Analyst,
and edit the Analyst to properly There are probably more features that you will have to fix after following these
directions, but this should point you in the right direction. Hongqiang or Arthur can help fill this in, maybe.
APPENDIX ---
1.
qtlReaper.py, used to calculate the peak LRS of each trait and p-value.
2.
DataLoader: Load data into Data and
ProbeSetXRef? columns
3.
XrefLoader: Load the LRS computations from step 3 (above) into
ProbeSetXRef? .
END EVAN'S TUTORIAL---
The NCBI Entrez Taxonomy
Id: Name
9606: Home sapiens (human)
10090: Mus musculus (mouse)
10116: Rattus norvegicus (rat)
3702: Arabidopsis thaliana
SpeciesId:
Id:
GeneSymbol: NCBI.gene_info.symbol
GeneDescription: NCBI.gene_info.description
Chromosome: UCSC.mrnaOrientInfo.chrom
TxStart: UCSC.mrnaOrientInfo.chromStart
TxEnd: UCSC.mrnaOrientInfo.chromEnd
Strand: NCBI.gene2refseq.orientation
GeneID: NCBI.gene_info.GeneID
NM_ID: NCBI.gene2accession.RNA nucleotide accessionversion (where status != '-')
kgID
GenBankID:
UnigenID: NCBI.gene2unigene.UniGene cluster
ProteinID: NCBI.gene2refseq.protein accession.version
AlignID: UCSC.knownGene.AlignID
exonCount: UCSC.refFlat.exonCount
exonStarts: UCSC.refFlat.exonStarts
exonEnds: UCSC.refFlat.exonEnds
cdsStart: UCSC.refFlat.cdsStart
cdsEnd: UCSC.refFlat.cdsEnd
GeneList.NM_ID <--> mrnaOrientInfo.name <--> refFlat.name
--
HongqiangLi - 04 Dec 2007