BillBug provided very insightful advice on current issues with the GeneNetwork hardware, codebase, and database during his visit to Memphis on Aug. 28-29. Rob, Hongqiang, Arthur, and Zhaohui discussed GeneNetwork with him in depth. Below is a summary of some of the thoughts we talked about with him.
Hardware/Codebase
1. After reviewing the current structure of the GN hardware, we all agreed on our current plan for the computer setup. WebqtlMachine will be used as a support unit and Web2qtlMachine as our test unit, while headmaster, node0 - node8, and opteron will keep the same roles as before. Bill will draw a topology of our current functional units and hardware setup. Hongqiang proposed that the GN beta site be moved from WebqtlMachine to Web2qtlMachine.
2. Bill suggested that we upgrade the CPU or motherboard. ArthurCenteno and Bill will work out more details about the upgrade. Bill helped us confirm that we have a 32 x 100 Mbps switch for the GN cluster, which should be good enough.
3. Regarding the GN code repository, Bill suggested that the Apache conf file and .htaccess files be added to the repository, but that binary data be kept out unless it is a necessary component of GN. We discussed the idea of freezing our website so that users can retrieve historical data. Bill pointed out that, with a Subversion repository at hand, it is trivial to freeze the GN website (both code and data) as a release at any time.
Testing, debugging, and performance tuning
1. We reviewed our progress in constructing and debugging the beta website. ZhaohuiSun mentioned that he had to roll back to revision 15 of the GN repository that StephenPitts constructed because the latest revision did not work. Based on revision 15, Zhaohui was able to recover most of the functions of GN. He is going to fully debug and test this version. Bill suggested that the working functions of this test version should be a superset of the working functions of the current production site, and that our repository should be based on this fully tested, fully functional test version.
2. Concerning the testing of this beta site, Hongqiang suggested that Web2qtlMachine is ideal for testing the new beta site because it is isolated from other resources in the production site. Rob is going to work with Zhaohui to identify all test cases for GN. Bill recommended a tool called Curl that can do website-based testing. Curl can be used to automate our testing process. Zhaohui will spend some time writing our automatic testing scripts using Curl. Bill also suggested that we use Ant to run the tests systematically (see Apache's Ant site). With Curl, Ant, and automatic testing scripts, we should be able to run regression tests and unit tests all the time, not only to maintain the accuracy of our codebase, but also to monitor the main GN web sites.
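A Curl-driven regression check of the kind described in item 2 could look roughly like the following. This is only a minimal sketch under stated assumptions, not the testing scripts themselves: the function names, the example URL, and the expected page text are hypothetical, and the sketch assumes the curl command-line tool is installed. The fetch function is passed in as a parameter so that the comparison logic can be tried without touching a live site.

```python
import subprocess


def fetch_with_curl(url, timeout=30):
    """Fetch a page body by shelling out to curl (assumes curl is on PATH)."""
    result = subprocess.run(
        ["curl", "--silent", "--show-error", "--fail",
         "--max-time", str(timeout), url],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError("curl failed for %s: %s" % (url, result.stderr.strip()))
    return result.stdout


def run_checks(cases, fetch=fetch_with_curl):
    """Check each (url, expected_text) pair; return a list of (url, reason) failures."""
    failures = []
    for url, expected in cases:
        try:
            body = fetch(url)
        except Exception as exc:
            failures.append((url, "fetch error: %s" % exc))
            continue
        if expected not in body:
            failures.append((url, "missing text: %r" % expected))
    return failures


# Hypothetical usage against a beta site (URL and text are placeholders):
#   cases = [("http://beta.example.org/search?term=abc", "Search Results")]
#   for url, reason in run_checks(cases):
#       print("FAIL", url, reason)
```

A script like this returns an empty list when every page still contains its expected text, so a build tool such as Ant, or a scheduled job, can treat a non-empty list as a failed regression run.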
3. Bill suggested that we use profiling to identify the code, functions, and queries underlying the applications on the GN website. With profiling we can find out the CPU and memory usage of those functions and queries. For tracing MySQL queries, Bill suggested that we use EXPLAIN. These tools will not only help us debug our code and tune the performance of our queries, but also help document the workflow of our codebase at the object level when combined with UML. For monitoring network traffic, Bill suggested the tool NTOP. He also suggested that we use TOP and PS (probably in scripts) to monitor the CPU and memory use of the GN cluster.
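Since the GN codebase is written in Python, one way to start profiling is with the standard library's cProfile module. The following is a minimal sketch, not a description of what was actually set up: profile_call and slow_sum are hypothetical names, and slow_sum merely stands in for a GN function worth measuring. Note that cProfile reports call counts and time only; memory usage would need a separate tool.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return (result, text report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Show the ten entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


def slow_sum(n):
    """Hypothetical stand-in for a GN function that needs tuning."""
    return sum(i * i for i in range(n))


result, report = profile_call(slow_sum, 100000)
print(report)
```

The report lists each function with its number of calls and time spent, which is the information needed to decide which functions or queries to tune first.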
Database
1. We reviewed our database schema. To make our schema more flexible in response to updates of genomic information, Bill suggested that the genomic position information be taken out of the Geno table and the ProbeSet table and placed in new tables. For example, we can create a table called ProbeSetPos that has the fields start_position, stop_position, and strand. Jintao had started to work on this issue, but he had not finished. The work-in-progress tables Jintao created for this were put in the same subschema as the microarray data subschema currently in use, and therefore created some confusion. They should really be placed in a separate new subschema or database. The temporary tables Evan created for updating genomic information should be placed in a separate subschema too.
- See the [[http://www.gmod.org/node/101][GMOD CHADO Sequence sub-schema]]. Hopefully we can re-use this very common model to capture locus/location in a flexible, easy-to-reuse manner. More detailed documentation on how this sub-schema is intended to be used can be found here.
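The idea in item 1 of moving position information into its own table can be sketched as follows. This is an illustration under stated assumptions rather than the actual GN schema: it uses Python's built-in sqlite3 module as a stand-in for MySQL, only start_position, stop_position, and strand come from the discussion, and the remaining column names (Id, Name, ProbeSetId, genome_build, chromosome) and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ProbeSet (
    Id   INTEGER PRIMARY KEY,
    Name TEXT NOT NULL
);
-- Positions live in their own table, one row per probe set per genome build,
-- so a new build adds rows instead of rewriting ProbeSet.
CREATE TABLE ProbeSetPos (
    ProbeSetId     INTEGER NOT NULL REFERENCES ProbeSet(Id),
    genome_build   TEXT    NOT NULL,   -- hypothetical column
    chromosome     TEXT    NOT NULL,   -- hypothetical column
    start_position INTEGER NOT NULL,
    stop_position  INTEGER NOT NULL,
    strand         TEXT    NOT NULL CHECK (strand IN ('+', '-')),
    PRIMARY KEY (ProbeSetId, genome_build)
);
""")

# Made-up sample data: the same probe set positioned on two builds.
conn.execute("INSERT INTO ProbeSet (Id, Name) VALUES (1, 'probeset_A')")
conn.executemany(
    "INSERT INTO ProbeSetPos VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "build_old", "1", 1000, 1500, "+"),
        (1, "build_new", "1", 1200, 1700, "+"),
    ],
)


def position(conn, name, build):
    """Return (chromosome, start, stop, strand) for a probe set on one build."""
    return conn.execute(
        """SELECT p.chromosome, p.start_position, p.stop_position, p.strand
           FROM ProbeSetPos p JOIN ProbeSet s ON s.Id = p.ProbeSetId
           WHERE s.Name = ? AND p.genome_build = ?""",
        (name, build),
    ).fetchone()


print(position(conn, "probeset_A", "build_new"))
```

With this layout, an update of genomic information becomes a matter of loading rows for a new build, and the older positions remain available for comparison.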
Comment on GMOD CHADO Sequence schema
This schema has 17 tables. As far as I can tell, fifteen of them would not be useful to GN in the foreseeable future. The two that would be useful represent (1) sequence features and (2) positions of sequence features. However, the schema has, as far as I can tell, no explicit representation of versions for a sequence feature. It has a table to allow attachment of any "property" to a sequence feature, but that seems like a clumsy mechanism for information that is critical to our use of sequence features. I think that the three tables that Jintao outlined for markers, SNPs, and probes are more appropriate for GN than GMOD CHADO. However, Jintao's separate tables for markers, SNPs, and probes could be combined, with addition of a sequence feature type to distinguish the three kinds of entries.
GMOD CHADO has an interesting convention for expressing location. Numbers are attached to the spaces between nucleotides, with the 5'-end of the sequence labelled 0. A sequence feature in an N-nucleotide sequence is said to extend from the 5'-end label (range 0 to N-1 for a feature of at least one nucleotide) to the 3'-end label (range 1 to N) of the feature. With this scheme, the length of a feature is the difference between the two end labels. In contrast, if the nucleotides of a sequence are simply numbered 1 through N, the length of a feature is 1 greater than the difference of the last and first nucleotide numbers. Do we want to use this scheme? It is different from the Sanger Institute's GFF convention and the convention used by BioPerl.
-- KenManly - 15 Sep 2006
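The difference between the two coordinate conventions discussed in the comment above can be made concrete with a small worked example. This is an illustrative sketch only; the function names are hypothetical and are not taken from GMOD CHADO, GFF, or BioPerl.

```python
def one_based_to_interbase(first, last):
    """Convert 1-based inclusive nucleotide numbers to interbase boundary labels."""
    return first - 1, last


def interbase_to_one_based(start, end):
    """Convert interbase boundary labels to 1-based inclusive nucleotide numbers."""
    if end <= start:
        raise ValueError("a zero-length feature has no 1-based inclusive form")
    return start + 1, end


def interbase_length(start, end):
    """Length under the interbase convention: difference of the two labels."""
    return end - start


def one_based_length(first, last):
    """Length under 1-based inclusive numbering: difference plus one."""
    return last - first + 1


# A feature covering nucleotides 3, 4, and 5 of a sequence:
start, end = one_based_to_interbase(3, 5)
print(start, end)                    # interbase labels 2 and 5
print(interbase_length(start, end))  # 3
print(one_based_length(3, 5))        # 3
```

Both conventions describe the same three nucleotides; only the start label shifts by one, which is exactly where off-by-one errors tend to appear when data from different sources are merged.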
2. Bill also commented on the high-level philosophy that we should apply to our database design. When we design a database, we should focus on making storage and access efficient and on keeping the tables normalized. We should not design tables on the basis of our applications. Instead, we should create views that fit the data models of our applications. In some cases, we need to use materialized views or stored procedures to increase efficiency.
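The approach in item 2, normalized tables underneath and application-shaped views on top, can be sketched as follows. This is a toy illustration, not the GN schema: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and every table, column, and view name here (Strain, Trait, Measurement, TraitValues) as well as the sample data is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized storage: each fact is stored once.
CREATE TABLE Strain (Id INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE Trait  (Id INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE Measurement (
    TraitId  INTEGER NOT NULL REFERENCES Trait(Id),
    StrainId INTEGER NOT NULL REFERENCES Strain(Id),
    value    REAL    NOT NULL,
    PRIMARY KEY (TraitId, StrainId)
);
-- Application-facing view: the shape the application wants to read.
CREATE VIEW TraitValues AS
    SELECT t.Name AS trait, s.Name AS strain, m.value AS value
    FROM Measurement m
    JOIN Trait  t ON t.Id = m.TraitId
    JOIN Strain s ON s.Id = m.StrainId;
""")

# Made-up sample data.
conn.executemany("INSERT INTO Strain VALUES (?, ?)", [(1, "strain_a"), (2, "strain_b")])
conn.execute("INSERT INTO Trait VALUES (1, 'trait_x')")
conn.executemany("INSERT INTO Measurement VALUES (?, ?, ?)", [(1, 1, 1.5), (1, 2, 2.5)])

rows = conn.execute(
    "SELECT strain, value FROM TraitValues WHERE trait = ? ORDER BY strain",
    ("trait_x",),
).fetchall()
print(rows)
```

The application reads only from the view, so the underlying tables can be reorganized later without changing application queries. Where a view is too slow and the database offers no native materialized views, a summary table that is refreshed on a schedule is one common substitute.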
3. We looked into the schema in detail. The Data table is huge and heterogeneous, so it should have a data type field, which will make queries much more efficient when appropriate indexing and partitioning are used. We also put question marks on some tables. For example, the Snpall table does not have location information, and snpxRef and snpAllele may be combined. Bill also mentioned that the GMOD and BIRN schemas are good references for us. Regarding the backup of our GN database, Bill suggested that our current single-copy backup is not good enough, and that we could use a weekly dump plus the transaction log to keep our data safe.
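The benefit of a data type field combined with an index, as mentioned in item 3, can be checked by asking the database for its query plan. The following is a rough sketch under stated assumptions: it uses Python's built-in sqlite3 module and SQLite's EXPLAIN QUERY PLAN as stand-ins for MySQL and its EXPLAIN statement, whose output format differs, and the column names, the index name, and the query_plan helper are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Data (
    Id        INTEGER NOT NULL,
    StrainId  INTEGER NOT NULL,
    data_type TEXT    NOT NULL,   -- hypothetical field naming the kind of value
    value     REAL    NOT NULL
);
CREATE INDEX idx_data_type ON Data (data_type);
""")


def query_plan(conn, sql):
    """Return the human-readable steps of SQLite's plan for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The description is the last column of each plan row.
    return [row[-1] for row in rows]


indexed = query_plan(conn, "SELECT value FROM Data WHERE data_type = 'expression'")
unindexed = query_plan(conn, "SELECT value FROM Data WHERE StrainId = 7")
print(indexed)    # expected to mention idx_data_type
print(unindexed)  # expected to be a full scan of Data
```

A query that filters on the indexed data type field is answered through the index, while a filter on an unindexed column forces a scan of the whole table, which is what makes a huge heterogeneous table slow.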
4. We should also look to the ArrayExpress MaxD data model for insight on how to handle array-level data.
5. In looking to gather insight from these other widely used models, we should:
- review their structure with an eye toward how well they cover the GN requirements
- determine whether additional re-organization/normalization is applicable to the portions of these models that would be useful.
- if there is, be certain to perform that re-organization/normalization in such a way that these models could be supported as views.
- apply the application-specific view procedure cited above if required to support GN functionality.
GN programmatic interface
1. We should be looking to incorporate Web service functionality that accommodates the BIRN collaboration in a way that would be valuable to the field at large (e.g., BioMOBY).
2. The same holds for being able to provide a Semantic Web tech (RDF and/or OWL) programmatic view of the GN data (e.g., Semantic Moby).
Collaboration with BIRN
1. We reviewed the idea of virtualization using Rocks. Rob talked about making use of the computing power and expertise at BIRNCC and virtualizing the GN website on remote servers. Bill commented that this is reasonable, and that one easy way to do it is to make a disk image of our GN installation and use that image to install the remote hosts. Zhaohui mentioned that he is going to make the beta site he is testing a self-sufficient package that has no dependencies on its environment and can be easily distributed to other sites. The grid computing facility that Rocks currently has will be great for computation-intensive jobs such as those QTLReaper runs. However, Rocks might not have a database cluster at this point.
2. We discussed how GN will interface with BIRN in the future. Bill introduced the history and the future of the BIRN project and the mediator technology. He suggested that the future interface between the BIRN mediator and GN could be Web Services. Under this strategy, GN may use materialized views to match the common data model (i.e., MAGE), and may convert some GN Python objects into Ruby on Rails objects that are provided as Web Service RPC to the mediator. We also examined the current database schema to find out whether we can capture all the data needed by the MAGE data model. Regarding Zhaohui's question about whether we have all the necessary data, Bill mentioned that many of the fields in the MAGE data model are optional. Bill also commented that we may want to include first-hand raw data such as the scanned microarray images.
3. This week at BIRN, we worked out what looks to be an effective strategy for accepting probe-level and probe set-level data, whether given as spot-derived values or some other type of value (e.g., arithmetic mean). This process has been informed by similar work being done across the field, especially the MAGE work from the MGED consortium and the various useful tools created by the ArrayExpress group at EBI. When this is written up, we should review it as an option for loading array-level data into GN in a way that is commensurate with how you currently handle the means used to drive QTL Reaper.
-- ZhaohuiSun - 18 Aug 2006