General Data Import Helpful Hints

These instructions were written by Evan Williams on 2 July 2008 after importing the BxH? _ApoE null F2 data into the GeneNetwork mirror at UCLA. Arthur Centeno at UT Memphis will know more about importing the data if this tutorial is insufficient. There was no standard error taken into account when this data was imported, since there was none available for my dataset, but if you have standard error and want to include it, talk to Rob Williams or Arthur. Edits by EGW 4 Aug 2008.


Instructions:

-- Genotype --
1. Acquire Genotype information from database, then organize in format identical to other genotype files, which can
 either be found in the Search -> Database Information tab of GeneNetwork, or in the /webqtl/genotypes/ folder on 
the actual server itself. Format in text format with UNIX line breaks and tabs between columns, then place the file in 
/webqtl/genotypes/.

-- Expression data -- 
2. Download information for each database and format in Excel. The ROWS should be the Marker/Probe, and the 
COLUMNS should be the Strains or F2 animals. The column names must match the genotype column names. Delete any
 strain or F2 animal data here that does not have Genotype information, as it will not be useable. Export again in text 
format with UNIX line breaks and tabs between columns.

3. Run QTL reaper (Python program, can be downloaded on Sourceforge. You can compile it on Windows using cygwin
 or on Linux without anything special. I have had no luck compiling it on Mactel machines.) on these files. This should 
be done on your local computer; put the .geno file in the same directory as the file(s), modify the example.py script to
 use your genotype data and trait information, script included in appendix of this document. This might take a long 
time, up to a week, so you'll probably want to run it on a server of some sort and not your PC. Hold on to those files.


-- Setting up GN --
4. Log into mySQL as root, and use db_webqtl. You'll have to add data to several tables to get the databases to show 
up and be accepted into the search.html page. In this order,
   
   4.1: table "GeneChip": select * from GeneChip; See if your chip is in the list. If it's not, INSERT INTO to add your 
chip to the list.
   4.2: table "InbredSet": select * from InbredSet; if your InbredSet (or F2) is not in the list, add it here.
   4.3: table "AvgMethod": select * from AvgMethod; if your averaging method is not in the list, add it here. This is 
unimportant for calculations, but it displays your method to users.
   4.4: table "Strain": Add here all the Strains or F2 animals here that are listed in your genotype file. Make sure that
 you are choosing a unique name when inputting your animal, if it is not the same as the previous animals input (as 
may be the case with F2s). Make sure they are named the same in this Strain table as in the genotype file.
   4.5: table "Tissue": Make sure the tissue you're listing is included in this table; if not, add it.
   4.6: table ProbeFreeze: Insert some information for the new data set you're adding in. Here you have to include 
the ChipId from GeneChip, the TissueId from Tissue, the InbredSetId from InbredSet, as well as some identifying 
information you create here.
   4.7: table ProbeSetFreeze: Similar to ProbeFreeze, but you have to do ProbeFreeze before since you have to list the
 ProbeFreezeId. This should be pretty straightforward to input, just follow the format of previously input sets.
   4.8: table GenoFreeze: Add your genotype name to this database (if it doesn't already exist; e.g. if you're inputting 
an F2. Check if you're inputting a RI set to see if it's already there.)
   4.9: table Geno**: Add your genotype file's information into the mySQL database (file still required in 
/gnshare/gn/web/webqtl/genotypes directory)

The naming in this area isnt too critical, as long as it's unique, -consistent-, and descriptive. Try to follow the format 
used though. At this point, if you look at the search.html page, your database should show up, although the search 
feature won't work (since you don't have any actual data input).
   
5. Acquire the Gene Annotation information from your database that you want to import into GN. This should have 
information corresponding to the markers/probes that are listed as the ROWS in the data you have (e.g. the gene name
corresponding to a specific marker).

   5.1: Look at the ProbeSet table, and match your data accordingly. If some columns are missing, that's fine and to be
 expected. Just make sure you have ChipId (from the GeneChip table), Name, TargetId, symbol, Chr, Mb_mm6, and BlatSeq. 
Other things are nice to have, but not necessary to run. At -minimum-, the ChipId and Name columns are required; certain
features of GeneNetwork may not work without the other columns, however. You'll probably want to create this file in Excel
then export it as a tab-delimited file. After this export, open the file up in a plain text editor and remove all quotation " 
marks, so that they don't show up in the "description" field when you load it into GN. For this reason, I would recommend 
NOT using CSV format (or you'll have to remove the " marks in MySQL later via an UPDATE script).
   5.2: Load this into GeneNetwork by using the LOAD DATA LOCAL INFILE command. Make sure you declare the columns
 you're loading, so that the Id column will auto-increment. (If you just straight load it in, without declaring the 
columns being loaded, it will not work properly).

6. Now to load the data into the Data table. The script to do this is in appendix.
   
7. Hopefully the data imported properly. If you get KeyError: 'something', then it means that your data is not listed in
 the ProbeSet table. This could either be because the annotation was either done or imported improperly, or because
 you have a row in the data that for some reason is not listed in the annotation file, and you did import it properly.
 If the latter is the case, delete that row from the original column and re-upload it to the data table.

To reupload, you'll have to delete what you just inserted, which went into Data and ProbeSetXRef. Be careful when
 deleting these! (It will likely take about 10 minutes to delete out of Data, and about 1 second to delete out of 
ProbeSetXRef. Be careful! Make sure you're deleting the right things, and that you got all of it!)

   
8. Once you're here, the data should show up fine in GeneNetwork and be searchable for most basic features.

9. Now you'll have to make the other, non-basic features work properly -- so far the core features should be 
working. For example, you'll have to align the SNP tracks to make the SNPs show up in the Interval Analyst, 
and edit the Analyst to properly There are probably more features that you will have to fix after following these 
directions, but this should point you in the right direction. Hongqiang or Arthur can help fill this in, maybe.


APPENDIX ---

1. qtlReaper.py, used to calculate the peak LRS of each trait and p-value.

2. DataLoader: Load data into Data and ProbeSetXRef? columns

3. XrefLoader: Load the LRS computations from step 3 (above) into ProbeSetXRef? .


END EVAN'S TUTORIAL---

The NCBI Entrez Taxonomy

Id:         Name
9606: Home sapiens (human)
10090: Mus musculus (mouse)
10116: Rattus norvegicus (rat)
3702: Arabidopsis thaliana

Fields of GeneList table

SpeciesId:
Id:
GeneSymbol: NCBI.gene_info.symbol
GeneDescription: NCBI.gene_info.description
Chromosome: UCSC.mrnaOrientInfo.chrom
TxStart: UCSC.mrnaOrientInfo.chromStart
TxEnd: UCSC.mrnaOrientInfo.chromEnd
Strand: NCBI.gene2refseq.orientation
GeneID: NCBI.gene_info.GeneID
NM_ID: NCBI.gene2accession.RNA nucleotide accessionversion (where status != '-')
kgID
GenBankID: 
UnigenID: NCBI.gene2unigene.UniGene cluster
ProteinID: NCBI.gene2refseq.protein accession.version
AlignID: UCSC.knownGene.AlignID
exonCount: UCSC.refFlat.exonCount
exonStarts: UCSC.refFlat.exonStarts
exonEnds: UCSC.refFlat.exonEnds
cdsStart: UCSC.refFlat.cdsStart
cdsEnd: UCSC.refFlat.cdsEnd

GeneList.NM_ID <--> mrnaOrientInfo.name <--> refFlat.name

-- HongqiangLi - 04 Dec 2007

Topic revision: r10 - 02 Jan 2009 - 16:53:13 - EvanW
 
This site is powered by the TWiki collaboration platformCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback