Updating Assemblies (This was written specifically for mm8 to mm9, but should be the same across all assemblies and species.)
Since the last assembly update (mm6->mm8), this has gotten easier to perform, and should take no more than about a week to do. There are five tables that need to be updated; Chr_Length,
GeneList, Geno,
ProbeSet,
BXDSnpPosition? , and
SnpAll. In every case, when updating the
MySQL, BE SURE TO KEEP A BACKUP OF THE POSITION. For example, add a column "Mb_mmX" and then SET "Mb=Mb_mmX" before updating the Mb column with your new data. This way, if you do make a mistake, it is easily reversible, without having to resort to backups on different machines. Instructions are given for users with basic Excel knowledge (under #2:
GeneList) as well as minor script knowledge (the scripts I used are under section 3: Geno).
1. Chr_Length: Simple and straightforward. The length of the chromosome changes slightly between each build, and can be determined by going to the UCSC
GenomeBrowser? , selecting the latest assembly, searching for each chromosome, navigating to the end it, and recording what that last position is. Then put that data into the Chr_Length table.
2.
GeneList: You will want to use liftOver (available from UCSC) to update the position of the genes. REMEMBER TO BACK UP ANY DATA COLUMNS YOU ARE CHANGING.
- First, I suggest selecting Chromosome, TxStart? , TxEnd? , and GeneSymbol? into an outfile ONLY FOR YOUR SPECIES OF INTEREST for this. e.g. " mysql>SELECT Chromosome, TxStart? , TxEnd? , GeneSymbol? FROM GeneList WHERE SpeciesId? =1 INTO OUTFILE /tmp/GeneListUpdate_Mouse.txt".
- From here, you will have to add "chr" (lowercase!) in front of every chromosome (required for liftOver). You will also have to multiply the TxStart? and TxEnd? by 1 million to convert it from Mb into Bp. If you do these steps in Excel, be careful of genes like "March1" and "Sept15" that get turned into "1-Mar" and "15-Sep" in addition to a few genes that have E's in them that get turned into scientific notation.
- Now run liftOver on the file. Look at the unMapped output file and make sure that there aren't very many genes that didn't get mapped over. (If they don't get mapped over, they are deleted!)
- Take the liftOver'd output file and change the Bp back to Mb and remove the "chr" from each chromosome.
- Now you have to add the updated strand, cdsStart, cdsEnd, and exon location information to the file. To do this, download the refFlat_[assembly].zip from UCSC (it's in the "Annotation database" directory). Do a cross-reference using the "GeneSymbol" and add in the strand, exon, and cdsStart/Stop data. (You do NOT want to do this in place of liftOver. It's true it has TxStart? and TxEnd? info in refFlat, but you will miss many hundreds of genes if you skip the liftOver part, and it doesn't save you very much time). If you do this in Excel, again you will have to watch out for March1, Sept15, etc.
- Now format the file in such a way that you can update the MySQL database with it. The way I do this is to make each line its own UPDATE statement. So a row called "1 15.193103 15.19601 Targ1 15.194100 15.19559 + 1 15194105, 1519490," will become "UPDATE GeneList SET Chromosome="1", TxStart? =15.194100, TxEnd? =15.19559, cdsStart=15.194100, cdsEnd=15.19559, Strand="+", exonCount=1, exonStarts="15194105,", exonEnds="1519490," WHERE GeneSymbol? ="Targ1" AND SpeciesId? =1.
- Make sure your formatting script is working properly (I use Excel), then upload it to the MySQL server computer, open up mySQL, and type "mysql> source /tmp/GeneListUpdate_Mouse.txt"
3. Geno: This is a simple process requiring liftOver.
• Select the Chromosome, Mb, and the Marker Name into an outfile for all of the markers IN SPECIES 1 ONLY. Run a script (like
ThisOne) to sort the file so it can be used for liftOver (requiring columns without headers in this order: Chromosome,
MbStart? ,
MbEnd? ,
MarkerName? ).
MbEnd? must be at least 1 base pair in front of
MbStart? , so you should make it so.
• Run liftOver for mm9 to whatever build you're updating to.
• Take that data, delete the "MbEnd" column, and upload it into
GeneNetwork using a script like
ThisOtherOne.
4.
ProbeSet: This is the same process as
GeneList, but without the refFlat_.txt step, so it's much more streamlined. REMEMBER TO BACKUP DATA COLUMNS BEING USED.
- Select the Chromosome, Mb, Probe_set_Blat_Mb_end, and Id into an oufile using only the ChipIds? for the species of interest. (SKIP Probe_set_Blat_Mb_start!!) Select * from GeneChip? to figure out which ChipIds? you want to grab. If you are going to use Excel to modify the files for liftOver/reimport into MySQL, I suggest doing seven chromosomes at a time, for a total of three batches (adding ChrM/Y/Un at the end).
- Again, add "chr" to all the chromosomes and turn all the Mb into Bp. Then run liftOver. If you have a problem where "Mb_end is before Mb", then just replace "Mb_end" with Mb+1 for all of those instances. Also be wary of instances where Mb_end is like 100 Mb from Mb_start. I run a comparison column in mySQL and I 'flatten out' the edges, moving all negative differences and all differences greater than 100 kb to a distance of 1 bp.
- A few hundred/few thousand will be unMapped after liftOver, but it shouldn't be enough to really worry about. Take the liftOver'd files and remove the "chr" and convert the Bp back to Mb.
- Create the update MySQL file again, as you did for GeneList. e.g. "5 5.163139 5.163140 12431" becomes "UPDATE ProbeSet SET Chr="5", Mb=5.163139, Probe_set_Blat_Mb_start=5.163139, Probe_set_Blat_Mb_end=5.163140 WHERE Id=123431."
- Do this for all the chromosomes, and then you're done. You don't have to worry about the 'Probe' table.
5.
BXDSnpPosition? /SnpAll: Hopefully the
BXDSnpPosition? table has been deleted or been replaced by something better by now. For
SnpAll, you need to update the positions of every marker, though.
•
SnpAll: By this point you should be a master. Export the Chromosome, Mb, and the Id number for all Species1 snps. Then make a text file for liftOver with the Chr, Mb, Mb+1, Name, and lift it over. Then take that file and put it back into GN.
OUTDATED INFORMATION -- Updating Assemblies (from mm6 to mm8)
The update from mm6 to mm8 took two steps: Data preparation and
MySQL table update. The first step was done by Evan Williams. In this step, Evan used the mm8 data from UCSC and BLAT program to compute the position data and store these data in some temporary tables or fields. In the second step, Zhaohui Sun used a
MySQL script to update the tables that are used by the python codes for genenetwork based on Evan's data. Zhaohui also modified the python code so it will read the data from mm8. The first step was not fully automated, but it was fully described by Evan. The second step was automated.
Data preparation
Every year or so, NCBI (Deanna Church and colleagues) and the group at
the University of California at Santa Cruz update assemblies and nucleotide coordinates for the genomes of mouse, rat, human, etc. Because
GeneNetwork uses this information (but in a different way) we have to reprocess this information when we want to update our own data correspondingly, and add it to our own
MySQL database. Unfortunately, this is quite a bit of manual labor. Fortunately, there is now some documentation on how to do this. Unfortunately, there are a few problems with updating the data that you will come across, and either fix with a quick hack (as I mostly did) or come up with a better way to incorporate the new data (as was done with one section, as you will see).
Without documentation, this process took me around two months to complete. With documentation, hopefully the process can be sped up to around two weeks, perhaps slightly faster.
In the table
Mm6ToMm8 I provide instructions on where the mm8 data that is already loaded is located, so that the data can be moved over. Once this transition is complete, this text may be removed.
There are five parts of the
MySQL database that need to be updated before a transition can take place, as well as corresponding code changes. They are:
For
Chr_Length/
GeneList, it updates the gene names, as well as overall chromosome length update. This update can easily be done at the same time. This is all very hacked together in our database, and
needs to be fixed at some point in the way Geno2 etc has been done, so that multiple assemblies can be stored. Instructions for updating these tables are located at the
GeneListUpdate and
ChrLengthUpdate entries.
SnpAll is one table that needs to be updated, containing about 6 million records at the time of writing.
At present, it is very hacked together, and there is a "Mb" column that is for mm6 and a "Mb_mm8" column that is for mm8. Ideally this should also be the same was Geno2 etc is organized. Instructions for updating the SNP locations is located at the
SnpAllUpdate entry.
For
Geno2 /
GenoAssembly /
GenoPos, this is the ideal layout.
Geno2 itself will not have to be updated unless the user wants to add more genetic markers -- if an assembly update that is all that's needed, it will not be modified.
GenoAssembly is a database that simply says where the input information is from, with fairly obvious nomenclature. A new row will have to be added for each additional assembly. The actual new assembly data will only have to be loaded into
GenoPos. Instructions for updating the genetic marker information is in the
GeneticMarkerUpdate entry.
For updating the probe set, this is done exactly the same as the
GeneListUpdate. Despite this, there are
a few database changes that need to be made. Some of the data in ProbeSet2 needs to be moved to ProbeSetPos. I provide a few brief pointers for updating in the
ProbeSetUpdate entry.
GenoAssembly is easy to update; simply add a new Id, the proper speciesId, name, and comment for your new assembly. Currently, both
ProbeSet2 and
Geno2 reference this; in the future the gene list, chromosome length, and SNP list should as well.
Once
all of these sections of the
MySQL database are updated, the code can be modified to look for the more recenty assemblies. Be sure to allow users to switch to the old assemblies as well. The intial time this is added, it will take a lot of coding, but subsequent assembly updates should be trivial after that.
Update the tables using the data
A SQL script has been written to update the tables in our db_webqtl database (
GeneList, Chr_Length,
SnpAll, Geno, and
ProbeSet), . This script also re-builds an index in
SnpAll. It is attached below. After the data preparation is done following the above instruction, we run this script and all the tables will be updated.
The next thing we need to do is to change some mm6 specific python codes.
- dataPage.py and dataPageDB.py (cgi-bin). It read data from GeneList_mm6, GeneList_mm7, and GeneList_rn33. Zhaohui has changed both GeneList_mm6 and GeneList_mm7 to GeneList. In the future we don't need to change anything here if we update mm data. However, GeneList_rn33 will need to be changed at some point.
- snpBrowser.py. It generates some URI in UCSC genome browser that has some mm assembly specific arguments. Zhaohui has changed the mm6 string in the URI to mm8. In the future it needs to be changed again when mm assembly is updated.
- intervalPage.py and intervalPage2.py (both in cgi-bin). They generates some URI with mm assembly specific arguments. This is handled as is done for snpBrowser.py.
- AJAX_pref.py. Zhaohui has added mm8 to the option menu. In the future this may also need to be modified.
Correct Geno Files
The current procedure is:
- 1. Zhaohui updates geno files using the position data in mysql Geno table (updateGeno.py)
- 2. Zhaohui checks the order problems in the new geno files and produce reports files - e.g. BXD.geno.csv and log files - e.g. recal.log, using reCalGeno.py.
- 3. Rob used the csv report files or the log files to correct the Mb values of trouble-making markers excluding SNPs by BLAT, and the corrected Mb are entered into mysql Geno table from the GN annotation interface. Rob picks one problamatic geno files (e.g. BXD.geno) with the largest cross to fix at the time.
- 4. Again Zhaohui updates geno files using the position data in mysql Geno table, checks the order problems in the new geno files (repeat step 1-2)
- 5. Zhaohui also re-order the markers based on the Mb values for the repaired geno file (BXD.geno in this example) and produce a report (e.g. BXD.geno.3.reorder.csv).
- 6. Now this geno file should be fixed. We check the physical and genetic interval mapping on web2qtl. Rob may want to look at the re-ordering report (step 5) to see if there are something we don't like to just reorder.
- 7. We pick another large geno file and repeat step 3-6
CorrectGenoFileProgress
Notes:
- Evan's procedure can possibly be automated. That way we will have a pipeline to handle the update and reduce our manual work.
- ProbeSet has a field named Mb_mm6 although it has mm8 data after we update to mm8. So it is a misleading name. We keep this name because this name is used in too many places in the code. We may want to change this field name as well as the codes that use it in the future.
- It seems that we have two assemblies for Rat: rn3 and rn33. We may want to update it some time.
- Celera_mm6 table is still being used in some code (cmdSnpTrack() in TextUI? .py). It is likely that the code is obosolete.
- Data irregularity was found in tables ProbeSet, Geno, and SnpAll. The files recording the irregularity are attached below
--
EvanW - 08 Aug 2006 --
ZhaohuiSun - 28 Nov 2006 --
EvanW - 08 Jul 2009