The
GeneNetwork database runs on the opteron machine in the
GnArchitecture? . Here are some notes on the
database schema. (We should migrate these notes to the Wiki). See the attachments for pictures of the database and a copy of the current SQL schema.
Note June07: There is an interesting article in
AnandTech? about the use of Quad Core Xeon microprocessors with
MySQL databases. "To avoid the scaling problems of
MySQL, we compiled version 5.0.26 with Peter Zaitsev's Mutex patch. This patch gives much better scaling and performance using up to four cores. Eight cores and more give variable results." see
http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=2897&p=7
An interesting site regarding
MySQL issues:
http://www.mysqlperformanceblog.com/
Database Schema
A list of all tables in our mySQL database (a more descriptive version of the database schema) on this wiki is as follows:
AccessLog?
AvgMethod
CeleraINFO? _mm6
Chr_Length
Chr_Length_Evan
DBList
DBType
Data?
Genbank?
GeneCategory?
GeneChip?
GeneList
GeneList_Evan
GeneList_hg17
GeneList_mm6
GeneList_mm7
GeneList_mm8
GeneList_rn3
GeneList_rn33
GeneRIF?
GeneRIFXRef?
GeneRIF? _BASIC
Geno
Geno2
GenoAssembly
GenoFreeze?
GenoPos
GenoXRef?
H2
HumanGene?
InbredSet?
LCorr
LCorrRamin?
MachineAccessLog?
NStrain
Phenotype
Probe
ProbeFreeze?
ProbeH2?
ProbeSet
ProbeSet2
ProbeSetFreeze?
ProbeSetPos
ProbeSetXRef?
ProbeSetXRef_TEMP?
ProbeXRef?
Publication
PublishFreeze?
PublishXRef?
SE?
Sample?
SampleXRef?
SnpAll
SnpAllele?
SnpSource?
SnpXRef?
Species
Strain
StrainXRef?
TableComments?
TableFieldAnnotation?
Temp?
TempData?
Tissue?
User
UserPrivilege?
Unfortunately this list must be maintained by hand, whereas the link to database schema above is automatically generated every few hours, and quickly shows any change.
Known Issues
- A known issue with mysqldump and SELECT INTO OUTFILE statements is that they must be run as root@localhost (not root@opteron) because 192.168.1.% has less permissions than localhost. The workaround is to log in as root@localhost.
- Our diagrams below are manually-generated and so they are out of date. I would like to generate new diagrams with MySQL Workbench but it has problems as of 8/1/2006 reverse-engineering large databases (search MySQL bugs).
Tips
- If the database is hung but you don't know why, try SHOW PROCESSLIST. See MySQL documentation for more info.
Schema diagrams and autogeneration of DDL SQL
Bill Bug has worked up
Dia schema diagrams based on the Word files and DDL SQL for the GeneNetwork entities & relations provide below by Jintao. The initial pass of the diagram was created using the
AutoDia Tool created by Aaron James Trevena avialable in it's latest version from the the PERL CPAN site
v2.03 from 27 April 2006. Dia can be installed on Mac OS X using the Fink Open Source software package management system (I've added some info how to install Fink at
this page on the Wiki). You can also get supporting info on installing Dia on other platforms such as WindowsXP from the
Dia FAQ.
Below I describe the several files I created and then hand edited. There are a few issues to keep in mind:
- There are several messy aspects of the model as reconstructed here that should be fixed.
- There were not complete descriptions amongst the Word doc diagrams & SQL file of all the relevant relations. In particular, I had to intuite the foreign key relations based on attributes names, and the somewhat criptic arrows in the Word doc diagrams. These should all be checked.
- Jintao used different colors in his diagrams. I didn't really know what these were meant to indicate, though it was possible to make an educated guess on some of it. As a general rule, one should avoid representing semantic aspects of the model in a manner that doesn't directly translate into DDL SQL. Anything just meant as supporting documentation should be included in text label objects.
- There is a need to clean the naming conventions used across the schema. There is no accepted convention, but I'd recommend using all upper case for entity names and CamelCasing for attribute names, unless someone has a problem with this. So for instance, a table name would be 'PUBLISH_XREF' and an attribute name would be 'fullName'. This can greatly simplify the nature of the code you create to manipulate the data in the database.
- Some of the information in the Word doc diagrams either conflicts or includes addtional information which I tried to reconcile as best I could.
- I believe it would behoove use to review table structures from the Generic Model Organism Database Project, the FuGE OM, and the strain/subject tables Rob, Maryann Martone, and Bill Bug have been working on for the Mouse BIRN project (link pending - BB will be posting this info to the BIRN Wiki).
- Introducing significant changes in the fundemental data model that could not only make maintenance and expansion of the model much easier, it could also help significantly with the changes we will likely need to make to help on both the issues of performance and repository replication.
- It is completely feasible to effect such changes without breaking any of the existing code via use of SQL Views (Materialized or Virtual).
- Using distinct Subschemas - something supported by most RDBMS such as Oracle and PostgreSQL? - is more than just a documentation convience. Maintenance and performance issues can be much more effectively addressed when you compartmentalize the model in a manner similar to what I've done in the these diagrams.
- As Stephen indicates above, really all these RDBMS objects - the tables, their attributes, the relations, etc. - should have clear, concise comment statements. Both the Dia diagram objects and the DDL SQL allows for associating comments with table objects. The autogen tools below support taking the Dia comments and adding them to the DDL SQL. The RDBMS should have Data Dictionary tables maintaining these comments, so there should be no need to create tables of our own to do this, as I believe Jintao has done (see the Admin Subschema). Each RDBMS calls the Data Dictionary tables by a different name, but they all serve the same purpose - to store the DDL SQL elements describing the database schema structure.
- I've made use of Dia "layers" to separate out the different subschemas, since this helps not only give you an easier way to parse and understand the entirety of the schema, it also can help provide a functional view of the model - which tables support which Python models/user functionality.
- The DDL SQL autogeneration tools such as those listed on the link given above (the most full-featured being [[http://tedia2sql.tigris.org/][tedia2sql] - the one I've been porting to Ruby to handle more of PostgreSQL? idiosyncratic entities) weren't able to use Dia layers last time I check, though the Ruby port I'm still working on does. This means to use tedia2sql as is in order to turn a UML-based Dia diagram of a RDBMS E-R diagram into actuall DDL SQL you can run against MySQL to build the physical database you will have to place all these objects from the final Dia diagram I provide below into a single layer. My recommendation would be to maintain the layered Dia file below. When you need to turn out a physical model, duplicate that file and cut-n-paste all the objects to a single layer - then run tedia2sql.
- You will still probably have to review the DDL SQL turned out by tedia2sql - or any of the other tools list on that page above such as dia2sql. Despite this fact, it still saves you many hours of hand editing SQL - and is immensely less error prone - to use these SQL code autogeneration tools. The tools will continue to improve, and the sort of integrated, round-trip tool Stephen described - one supporting an Open format such as XMI - may become avialable soon. In the interim, this set of tools - Dia, AutoDia? , and tedia2sql - is actually quite a functional alternative and doesn't take all that much time or technical acumen to use effectively.
- The process described below took a bit of time (~ 2 - 3 hours), but this sort of major import of the schema into Dia we'll hopefully only have to do once.
With this in mind, the following files have been attached below (as of 2006-08-14):
- GeneNetwork-Schema-2006-08-01a.dia: The original Dia file autogenerated by AutoDia? .
- GeneNetwork-Schema-2006-08-01b.dia:My first pass edit just to spread the tables out, so you could view them all.
- GeneNetwork-Schema-2006-08-01c.dia: A quick edit in a text editor to remove all the '`' from the table and attribute names. There's really no need for them in the Dia file, and I don't think Dia liked having these diacritical characters around.
- GeneNetwork-Schema-2006-08-01d.dia: My final edit using layers and including labels.
- GN-Schema-PNGs.zip: A zip file containing PNG figures for the separate sub schemas and one for the entire GN schema. These can be easily viewed in any graphics program (e.g., Preview or GraphicConverter? on Mac OS X) or web browser. Preview provides an easy to use zoom that helps navigating some of the larger schemas.