GeneNetwork Web Portal Application Architecture

The following lists the known applications running on the various hardware nodes in the GeneNetwork Hardware Architecture. The document cited at the end of the page, written by Jintao Wang (with some edits by Bill Bug and Stephen Pitts), provides more detail on how to install and configure some of these components.

The GeneNetwork Web Portal Application employs a LAMP (Linux/Apache/MySQL/Python)-style architecture (see also the O'Reilly OnLamp Portal) with several added support services that promote load balancing, ease of deployment, and security. The GeneNetwork Web Portal Application Functional Architecture diagram, attached to this page, provides a high-level view of how these components interact. As for physical location, the GeneNetwork Web App in its entirety can reside on any live production machine to which the GeneNetwork Linux Virtual Server (LVS) Application Router can forward requests (e.g., node1 - node8). It can also reside in its entirety on any machine running a development version of GN (e.g., web2qtl). In that case, all interactions are the same except that LVS is not involved and a web browser interacts directly with the Apache HTTPd component of the GeneNetwork Web App.

LAMP Service Architecture

Some of the service components cited above are installed on a central server, while others are replicated on each node running a version of the GeneNetwork Web Portal Application. The components that make up the GN Web Portal Application may each have configuration files that would be best managed with Subversion (SVN) and distributed along with the code. The centralized components break into two separate categories - those services critical to the runtime behavior of the GeneNetwork Web Portal Application and those important for off-line management, logging, and administration. This is all depicted in the organization given in the Table of Contents provided at the top of this page. Functional interactions among GN Web Portal App components and all of the centralized runtime service components are shown in the attached diagram. The GeneNetwork Application Hardware Layout diagram provides information on how these applications are currently deployed on the GeneNetwork hardware.

The following provides a description of the various software components and services that make up the overall GeneNetwork functional software architecture.

Centralized Server Runtime Components

  1. Network Address Translation (NAT)
  2. Linux Virtual Server (LVS)
  3. Network File Service (NFS)
  4. Lightweight Directory Access Protocol (LDAP)
  5. MySQL Relational Database Management System

GeneNetwork Web Portal Application Components

  1. Apache Web Server (HTTPD)
  2. Python comm components
    1. mod_python Apache module
    2. Python DB-API RDBMS Interface library
    3. JSON lightweight HTTP/JS data-interchange module
  3. Python page generation components
    1. HTMLGen HTML page generation library
    2. PYX PS/PDF generation module
    3. pyXLWriter Excel spreadsheet generation module
  4. Python graphics rendering components
    1. Python Imaging (PIL) rendering library
    2. NetworkGraph graph rendering library
    3. SVG XML-based Scalable Vector Graphics module
    4. piddle 2D graphics module
  5. GeneNetwork Python codebase
    1. Compare Correlates

Centralized Server Admin/Management Components

  1. Subversion Source Version Control system (SVN)
  2. GeneNetwork Wiki Server
  3. NTOP Network Packet Logging
  4. Analog Web Server Activity Logging
  5. Roundup Issue Tracker

Other Components

  1. Secure SHell Server (SSHd)



Architecture Components Versions

NOTE: GN is currently running on the Fedora Core 5 Linux distro (BB: 2007-05-20).

| Component | Version | Dependencies (including version) |
| Core System Components | | |
| Network File Service (NFS) | ? | ? |
| Secure SHell Server (SSHd) | ? | ? |
| Standard Unix Server Components | | |
| Lightweight Directory Access Protocol (LDAP) | ? | ? |
| MySQL Relational Database Management System | ? | ? |
| Apache Web Server (HTTPD) | ? | ? |
| Network Address Translation (NAT) | ? | ? |
| Subversion Source Version Control system (SVN) | ? | ? |
| Non-Standard Unix Server Components | | |
| Linux Virtual Server (LVS) | ? | ? |
| TWiki Wiki Server | ? | ? |
| NTOP Network Packet Logging | ? | ? |
| Analog Web Server Activity Logging | ? | ? |
| Roundup Issue Tracker | ? | ? |
| Python Extension Components | | |
| mod_python Apache module | ? | ? |
| Python DB-API RDBMS Interface library | ? | ? |
| JSON lightweight HTTP/JS data-interchange module | ? | Xerces? |
| HTMLGen HTML page generation library | ? | ? |
| Python Imaging (PIL) rendering library | ? | ? |
| NetworkGraph graph rendering library | ? | ? |
| SVG XML-based Scalable Vector Graphics module | ? | ? |
| piddle 2D graphics module | ? | ? |
| PYX PS/PDF generation module | ? | Xerces? |
| pyXLWriter Excel spreadsheet generation module | ? | ? |
| Numarray array manipulation module | ? | ? |
| Custom GN Python Components | | |
| Compare Correlates | ? | ? |
| GeneNetwork Python codebase | ? | ? |
| Custom GN C/C++ Components | | |
| qtlreaper.so | ? | ? |
| direct.so | ? | ? |



Network Address Translation (NAT)

Linux-based NAT is used to hide node1 - node8 from the public Internet behind the headmaster. NAT-ing is a strategy employed in many hardware and software environments. For instance, most consumer router/switches are in fact Ethernet switches that include NAT firmware for providing internal-only IP numbers to computers on a home LAN. See this page for more information on Linux-based NAT.

Linux Virtual Server (LVS)

See the Linux Virtual Server project.

LVS Direct Routing

LVS-based access to a worker node that does not use NAT.

LVS NAT

LVS-based access to a worker node using NAT to pass-through a task to a worker node.

Network File Service (NFS)

NFS provides a means for a machine on the network to share directories on its local file system with other nodes on the network. Some portions of the GeneNetwork Python codebase were shared in this manner up through early 2007. While NFS is an excellent means to share data among nodes, it is not a good way to share components of a complex codebase such as GeneNetwork. (NFS was used initially as an expedient to avoid copying code to separate LVS worker nodes and then having to make certain that these copies stayed in sync.) The better way to solve this problem is to use the Subversion (SVN) Source Version Control system (see below). Zhaohui Sun successfully migrated the Python, JavaScript, and HTML codebase to SVN on a test basis in late 2006 and into production in March 2007. This migration now makes it much easier to collaborate with other groups on the development of GN code modules. Each development group (e.g., NeuroBSIK-Amsterdam, HZI-Braunschweig, UWA-Perth) can work in its own Subversion fork, and when new code modules are mature they can be moved from development to open beta testing and then from beta testing to production.

RWilliams NOTE [2007-05-05]: It would be good to have Hongqiang review the paragraph above for accuracy.

Lightweight Directory Access Protocol (LDAP)

The Lightweight Directory Access Protocol is used to provide uniform, network-based authenticated access to named resources. If LDAP is set up properly, there is no need to add a new authenticated user for central resources such as the MySQL database or the NFS server each time a new GN node is brought online. Instead, access to these resources is mediated by a single, LDAP-accessible account that the new node is configured to use. This process consolidates, centralizes, and simplifies the nature of the user account bookkeeping.
  • LDAP is equally important both for the cluster of GN Web Application nodes and for the set of QTLReaper compute nodes.
  • When LDAP is being used to support authentication for a complex application it is often advisable to have both a live/production user and a development user. You may want these users to have different read/write privileges and access to distinct resources. For instance, you may want to limit certain GN Web Portal Application development instances to accessing a development version of the GeneNetwork MySQL database (see the proposed changes to the GN data model to more flexibly accommodate organism genome assembly updates), so as to ensure coding changes don't impact the production data. This can all be done easily from the LDAP server, in which a given network-authenticated user can be assigned specific access rights to specific resources.
  • See this recent Linux LDAP HowTo for more information.
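
The production/development split described above can be pictured with a small sketch. It only models the policy in plain Python and does not call any LDAP API; the account names, resource names, and permissions are all hypothetical, and in practice these rights would be defined on the LDAP server rather than in application code.

```python
# Hypothetical accounts and access rights; in practice these would be
# defined on the LDAP server, not in application code.
ACCESS_RIGHTS = {
    "gn_production": {("production_db", "read"), ("production_db", "write")},
    "gn_development": {("production_db", "read"),
                       ("development_db", "read"),
                       ("development_db", "write")},
}


def is_allowed(user, resource, permission):
    """Return True if the named user holds the permission on the resource."""
    return (resource, permission) in ACCESS_RIGHTS.get(user, set())


print(is_allowed("gn_development", "development_db", "write"))
print(is_allowed("gn_development", "production_db", "write"))
```

In this sketch the development account can write only to the development database, which is the kind of restriction that keeps coding changes away from production data.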

Bill Bug NOTE [2006-09-03]: I'm not certain at this writing exactly how the LDAP service is configured and how it is being used.

MySQL Relational Database Management System

There is a separate Wiki page solely dedicated to the GeneNetwork MySQL data model and the development of database-related Python code used in the GeneNetwork Web Portal Application.



Apache

The Apache Software Foundation httpd web server is a critical component of each functioning GeneNetwork node, whether directly routed or NATed.

mod_python

The mod_python Apache httpd + Python integration module is used to integrate the GeneNetwork code with the Apache httpd server. Originally, the more generic Common Gateway Interface (CGI) mechanism was used to run the GeneNetwork Python code. We converted to mod_python in 2006 and early 2007.
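
The difference from CGI can be sketched with a minimal mod_python handler. This is only an illustration, not GeneNetwork's actual handler: the response text is made up, and the `ImportError` fallback exists solely so the sketch can be exercised outside Apache. Under mod_python, Apache calls `handler(req)` in-process for each request instead of starting a new Python interpreter as CGI does.

```python
# Minimal mod_python-style handler (illustrative; not the GN handler).
try:
    from mod_python import apache  # only importable inside Apache + mod_python
    OK = apache.OK
except ImportError:
    OK = 0  # assumption: stand-in so the sketch runs without Apache


def handler(req):
    """Called by mod_python once per request with a request object."""
    req.content_type = "text/plain"
    req.write("Hello from a mod_python-style handler\n")
    return OK
```

A module like this is typically wired in through the Apache configuration with the `SetHandler mod_python` and `PythonHandler` directives; the module name given to `PythonHandler` would depend on the local installation.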

GeneNetwork Python-based web application

This is composed of the GeneNetwork Python codebase written by members of Ken Manly and Rob Williams's research group. The original author was Jintao Wang from Ken Manly's group. This Python code (roughly 24,000 lines) is supplemented by several additional libraries performing specialized functions. These are listed below. This codebase has a separate Wiki page solely dedicated to the development plan and detailed functional description of this code.

Compare Correlates (Statistics)

Code written by Stephen Pitts in 2004 and 2006, used to find overlap among lists of correlated traits. The "Compare Correlates" function is available in GN Trait Collection windows. While functional, this library still needs to be enhanced and error-checked as of May 2007.
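
The core idea, finding traits shared among several lists of correlates, can be sketched with set intersection. This is an assumption about the general approach, not Stephen Pitts's code; the function name and trait identifiers below are made up.

```python
def shared_traits(correlate_lists):
    """Return the traits that appear in every list, sorted for stable output."""
    if not correlate_lists:
        return []
    common = set(correlate_lists[0])
    for traits in correlate_lists[1:]:
        common &= set(traits)
    return sorted(common)


# Hypothetical trait identifiers, for illustration only.
lists = [
    ["trait_a", "trait_b", "trait_c"],
    ["trait_b", "trait_c", "trait_d"],
    ["trait_c", "trait_b", "trait_e"],
]
print(shared_traits(lists))  # ['trait_b', 'trait_c']
```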

Numarray

An array manipulation package used by Stephen's Compare Correlates code. See more information on Numarray.

HTMLGen HTML page generation library

The Python HTML generation library

Graphics rendering lib

A Python package called Python Imaging (or PIL)

NetworkGraph & GraphViz graph rendering library (used by Stephen Pitts)

This also requires installation of the open-source GraphViz network graph visualization package. This library is used to generate Network Graphs from GN Trait Collections.
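
As a rough sketch of what feeds GraphViz, the snippet below writes an undirected graph in GraphViz's DOT language from a list of trait pairs. It does not use the NetworkGraph library, whose API is not described here; the function name, trait names, and correlation values are hypothetical.

```python
def to_dot(edges, name="traits"):
    """Build DOT text for an undirected graph from (a, b, weight) tuples."""
    lines = ["graph %s {" % name]
    for a, b, weight in edges:
        lines.append('  "%s" -- "%s" [label="%.2f"];' % (a, b, weight))
    lines.append("}")
    return "\n".join(lines)


# Hypothetical trait pairs with made-up correlation values.
edges = [("trait_a", "trait_b", 0.81), ("trait_b", "trait_c", -0.64)]
print(to_dot(edges))
```

The resulting text can be saved to a file and rendered with GraphViz's `dot` command, for example `dot -Tpng traits.dot -o traits.png`.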

piddle

A python module for creating 2-D graphics

JSON

A Python module for lightweight data interchange; in GN it is used for AJAX (https://sourceforge.net/projects/json-py/).
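
A minimal sketch of the kind of exchange this enables, assuming a server-side handler that returns data to browser JavaScript. It uses Python's standard `json` module as a stand-in for json-py, whose API may differ; the payload fields are hypothetical.

```python
import json

# Hypothetical payload a server-side handler might return to an AJAX call.
payload = {"trait": "trait_a", "correlations": [0.81, -0.64]}

text = json.dumps(payload, sort_keys=True)  # Python object -> JSON text
print(text)

restored = json.loads(text)  # JSON text -> Python object
print(restored["correlations"][0])
```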

PYX

A python module for creating postscript and PDF files

pyXLWriter

A python module for generating Excel Spreadsheets

SVG

A python module for generating SVG graphics

Python MySQL API package

The MySQL-python (or MySQLdb) module, a Python DB-API database interface, is currently used by the GN code to access the MySQL database.

The Python DB-API database interface is used to run all SQL statements from within the GeneNetwork Python code against the GeneNetwork MySQL data store. It must be properly configured on each machine in order to access the GeneNetwork MySQL server.
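
The DB-API pattern (connect, obtain a cursor, execute a parameterized statement, fetch rows) can be sketched as follows. To keep the example self-contained it uses the standard-library `sqlite3` driver with an in-memory database as a stand-in for MySQLdb; the table and column names are hypothetical and do not reflect the GN schema. Note that `sqlite3` uses `?` placeholders, whereas MySQLdb uses `%s`.

```python
import sqlite3

# Stand-in for a MySQLdb connection; both drivers follow the Python DB-API.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table, for illustration only.
cur.execute("CREATE TABLE trait (id INTEGER PRIMARY KEY, name TEXT)")
cur.executemany("INSERT INTO trait (name) VALUES (?)",
                [("trait_a",), ("trait_b",)])
conn.commit()

# Parameterized query: the driver handles quoting of the value.
cur.execute("SELECT id, name FROM trait WHERE name = ?", ("trait_b",))
rows = cur.fetchall()
print(rows)

conn.close()
```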

qtlreaper.so

A package developed in C++ by Ken and Jintao. Its source code is available here. It needs to be compiled.

direct.so

A package developed in C++ by Jintao. Its source code is kept in GN Subversion. It needs to be compiled into a Python module.



Subversion (SVN) Source Version Control system

SVN is now the canonical store for all versions of the GeneNetwork codebase. The current production release can be exported from SVN to any node requiring a working copy of the code. Any GeneNetwork version currently under development can also be managed within SVN. A working copy of any development version can be checked out to a machine where development is being carried out. As a developer makes changes to a development version of GeneNetwork, he/she will proceed as follows:
  • After the code compiles/interprets properly and passes basic debugging unit tests, it is checked in to the SVN server.
  • These unit tests will need to be updated over time to be certain they cover alterations and additions to the code.
  • On the SVN server, using the ANT automated build system, integration and general regression tests will then be run on the entire version of the codebase. The compendium of unit tests will also be run to ensure that any deeply hidden dependencies among modules have not been adversely affected.
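
A unit test of the kind described above might look like the following sketch, written with Python's standard `unittest` module. The function under test, `parse_trait_ids`, is a hypothetical helper invented for this example and is not part of the GN codebase.

```python
import unittest


def parse_trait_ids(text):
    """Hypothetical helper: split a comma-separated string of trait IDs."""
    return [part.strip() for part in text.split(",") if part.strip()]


class ParseTraitIdsTest(unittest.TestCase):
    def test_splits_and_trims(self):
        self.assertEqual(parse_trait_ids("a, b ,c"), ["a", "b", "c"])

    def test_empty_string_gives_empty_list(self):
        self.assertEqual(parse_trait_ids(""), [])


def run_tests():
    """Run the test case and report whether every test passed."""
    suite = unittest.TestLoader().loadTestsFromTestCase(ParseTraitIdsTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

Calling `run_tests()` before a check-in gives a simple pass/fail signal; a build system could call the same function across all test modules.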

(R Williams note: The current (May 2007) Python code is not as modular as would be optimal. Trivial changes to MySQL table structure will often result in an unpleasant series of apparently unrelated bugs. It will be important over the next year or two to rewrite key parts of the Python code to buffer the code from as many details of the MySQL implementation as practical. This will improve the ability of research groups to work with our group's code, to add their own modules and improvements, and to port the GN system to other relational database systems.)

Secure SHell server (sshd)

Secured, encrypted access can be provided to any of the GeneNetwork nodes using sshd. For a person to log in via ssh, the sshd configuration file needs to either allow connections from the IP domain from which the login attempt originates or specifically designate each user who is allowed to log in via SSH. See the document below for an example.

Activity Logs, System documentation, and community feedback

It would probably be a good idea to host these services, along with the backend stores that hold the log information, on a single machine. This would help make it clear that these services play a different logical role in the system (i.e., they could all fail and the GeneNetwork Web Portal Apps would keep running just fine) and would simplify the process of reviewing and backing up this critical information. I'd suggest this collection of services be used both for GeneNetwork and for QTLReaper.

(RWilliams 2007-05-05: This is a good idea. We currently have one critical fileserver (webqtl = HP Proliant DL360 with 2x300 GB SCSI HD) that handles Subversion, Python, JavaScript, and HTML code and that serves all the GN compute nodes. The Wiki server is also on this machine. As Stephen advises, we should ideally move Roundup, the Wiki Server, activity logs, system documentation, and community feedback to a new machine. Let's call it "GNMeta".)

Wiki Server

These Wiki pages are served up from this server which is running on a single machine (as of May 2007: webqtl = HP Proliant DL360).

NTOP Network Packet Logging

NTOP is a powerful network traffic analysis suite. It is relatively easy to configure, can monitor traffic between any clearly defined network nodes, and provides detailed HTML-based summary reports. This tool could be used to determine the nature and magnitude of traffic between specific GeneNetwork Web Portal Application nodes and the Headmaster or between a GeneNetwork web app node and the GeneNetwork MySQL data repository. With such information, you can more effectively design SQL interactions so as to better balance the load between the GeneNetwork Python code and the MySQL database engine. At this time (2006-09-03 and 2007-05-05), this is just a proposal, and NTOP has not in fact been installed.

Analog Web Server Activity Logging

We are currently running version 6.0 of the Analog web log analysis suite, written by Stephen Turner, University of Cambridge. As of May 5, 2007, we have 697 days of log files (re-start date June 6, 2005). Statistics are at webqtllog. Development of Analog appears to have stopped as of 2005. Indiana University has a good summary of the uses of Analog data at webmaster.iu.edu/analog/index.shtml#webworks.

Roundup Issue Tracker

The Roundup Issue Tracker implements a flexible content storage model in an RDBMS for tracking software design and implementation issues. It is linked to an email and web server for dissemination and content editing, and it can connect via automation scripts to several semi-automated tasks such as automated software builds. Currently Roundup is being used as a means to coordinate requests and feedback on GN both from the user community and from the GN content contributors and software developers.

(RWilliams, 2007-05-05: Since Jintao Wang left we have unfortunately not been able to maintain Roundup functions. This has put a crimp in Rob's style of interacting with co-developers. In summer 2007 we should try to re-implement Roundup so that it functions well. The site is still nominally running on http://www.genenetwork.org:8888/webqtl/index but there is now almost no activity (only four new entries in all of 2007).)



NOTES

Central Service deployment

In the attached application functional architecture diagram, the LVS, NAT, and LDAP services would all be expected to run on one machine (e.g., Headmaster). Both the MySQL GN database and the NFSd GN file repository could run on the machine with the RAID storage system (opteron). If demand were sufficient and the amount of data transferred by either NFS or MySQL began to degrade performance, these two persistent stores would be split across two different machines, each with its own RAID system. If incoming demand on the front end of the architecture - e.g., the LVS service - reaches limiting capacity, it might be advisable to move all services not essential to the production runtime environment - e.g., SVN, NTOP, Analog, Wiki Server - to a separate machine. That machine would not require much disk space, nor would it require very hefty computational power.

Currently, the webqtl computer (an HP Proliant DL360) serves as a single-role support machine and holds many of the centralized services, including NFS, TWiki, and SVN. Zhaohui Sun recommends that we install NTOP on this machine or its replacement. webqtl has been taken out of LVS, so it is not serving as a web server. Headmaster (an older 1U P4 machine bought from ASA Computers in 2002 by Michael Connolly, http://www.nervenet.org/main/p2002.html) hosts LVS, LDAP, Analog, and Roundup, all of which are part of the essential web app components. web2qtl (our second HP Proliant DL360) currently (May 2007) holds a nearly complete backup of the MySQL database, several GN development sites, and a "public" beta version of GN that Rob and others often use for testing. web2qtl is crowded and does not have adequate RAM for all activity. Zhaohui has suggested that the MySQL development database be placed on a separate machine and that RAM be upgraded to 4 GB or more. Arthur Centeno and Hongqiang plan to implement this suggestion in May 2007.

Use of the Common Gateway Interface (CGI)

Some of the interactions between Apache and Python are still implemented via CGI. This is being gradually ported over to mod_python. As of May 2007, this transition is essentially complete.

Python Input/Output (I/O) - local vs. network-based (NFS)

As of early 2007, there is still a moderate amount of file I/O being done against locally stored files. This is an impediment to making the architecture mobile - i.e., if specific data sets are required on the local file system, these may not be in Subversion and therefore need to be separately synchronized. Instead, all shared static data sets (static in the sense that they are not the results of a dynamic query against MySQL) should be stored on a network volume, exported by NFSd, and then accessed by any node running the GN Web App framework using an NFS entry in the local fstab file. This local configuration (mounting the NFS volume) can itself be kept in Subversion along with other important system config files (e.g., httpd.conf). A very simple script - also stored in Subversion - can be created to ensure that, as the GN framework is installed on a given node, the config files are also processed as required (and the machine rebooted if necessary).
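
Such an installation script could be as simple as the sketch below, which copies configuration files from a Subversion working copy into place. The function name, the mapping, and all paths are hypothetical; a real script would also need to handle permissions, backups, and service restarts.

```python
import os
import shutil


def install_configs(checkout_dir, mapping):
    """Copy each config file from the working copy to its destination.

    mapping: {relative path in the working copy: absolute destination path}
    Returns the list of destination paths written.
    """
    written = []
    for rel_path, dest in mapping.items():
        src = os.path.join(checkout_dir, rel_path)
        dest_dir = os.path.dirname(dest)
        if dest_dir:
            os.makedirs(dest_dir, exist_ok=True)
        shutil.copy2(src, dest)  # copy2 also preserves timestamps
        written.append(dest)
    return written
```

For example, a node setup step might call `install_configs("/path/to/checkout", {"conf/httpd.conf": "/etc/httpd/conf/httpd.conf"})`, where both paths are placeholders for whatever layout the working copy and the distro actually use.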

Distribution of shared GN Python code

The GN code modules, on the other hand, ought not be distributed using NFS. This should all be done using Subversion. It is probably sensible to house everything within the GeneNetwork Python Code Framework - including the current versions of the support libraries such as NetworkGraph and HTMLGen - in Subversion. That way they can be correctly deployed when a copy of the framework is exported or checked-out from Subversion.

Distribution of whole open-source LAMP components

The other GN Web Portal App components, such as Apache, mod_python, and the Python environment itself, should be included in the Linux OS distro (Fedora Core 5 or CentOS). This may also be true for the GraphViz library that underlies the NetworkGraph Python module and for other libraries that support Python functionality. The only potential liability is that some aspect of the Python code may depend on a specific version of these components and could "break" should the distro update that component outside of your control. This is why I recommend, in the note on distribution of shared GN Python code above, that NetworkGraph and HTMLGen be included in SVN, so you can lock down the version you are using.

Distribution of shared GN LAMP component configuration files

As mentioned elsewhere, for config files that are identical across all GN Web App installations, it would be wise to include these in SVN. This is a common way to synchronize and roll-out identical configurations to a large collection of identically configured nodes. As mentioned below, a simple script - also contained in SVN - would be run upon installation to make certain these configurations are properly distributed throughout the new system. A similar mechanism can be used to synchronously upgrade configurations across all nodes.

Upgrade from Fedora Core 3

It is now necessary to upgrade the Linux distro in use from Fedora Core 3 to a distro using more recent versions of the various system and LAMP components. Bill Bug strongly recommends moving to CentOS v4.x as opposed to upgrading to Fedora Core 5, for the reasons listed below; either migration is likely to take approximately the same amount of effort and expertise to complete:

  • It would put GN in sync with the Linux distro being employed going forward at the BIRN-CC, thus making it much easier to transplant GN Web App and/or QTLReaper nodes onto a BIRN Rack running their ROCKS Linux appliance configuration. ROCKS previously used Red Hat (possibly the Fedora distro), but as of last year BIRN-CC is moving to the more enterprise-oriented, open-source CentOS distro for several of the reasons listed below.
  • CentOS includes the following enterprise-level capabilities that would be of particular value to GN:
    • Cluster Suite: this is essentially built on the Linux Virtual Server (LVS) package currently in use on GN, but it is included in the distro, and tools have been added to greatly simplify configuration, integration with other components (such as NAT), and system management.
    • Red Hat Global File System: this is not typically a part of the standard Fedora Core 5 release and requires a specialized kernel. It can simplify access to network-based storage systems and preclude having to locally manage file mount settings such as SAMBA or NFS mounts.
    • The YUM software management & deployment system: this system can greatly reduce the burden on a system administrator who must support deploying fully configured systems across a collection of network computing nodes. Though available for Fedora Core, it comes pre-configured on CentOS and includes a management GUI, the Yum Extender, that can simplify the process of creating and deploying configurations across the nodes.

Python package/module management system

This also brings up the issue of which Python package management system (i.e., the equivalent of CPAN in Perl or RubyGems in Ruby - possibly PyPPM or pypan?) is being used to provision the Python installation with the additional modules required beyond the base set. These package/module management tools are all command-line executables, so you can collect all the calls needed to download your modules and run them as a script (which would again be kept in SVN). The only caveat is that these apps need to access sites on the Internet that are sometimes not available, so the scripts may not always run without error. Typically these apps have a way to fall back to mirror sites, so you just need to read the documentation on how to automate a Python module/package management system effectively. If this solution can be implemented successfully, it may not be necessary to store accessory modules/libraries/packages in SVN as recommended above. The critical issue, though, is that the package management system would need to let you specify a version of the module - the version coded to and tested in the GN Python code. I'm not certain they all support this. Usually, it is assumed you will want to use the latest version, which SHOULD be backward compatible with any code written to previous versions, but that may not always be the case. Having extensive integration and regression testing in place can help ensure all is well. These tests should always be run after installing or updating the GN framework on a given node.
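
One way to enforce the version requirement, whichever package manager is used, is to check installed versions against a pinned list after installation. The sketch below assumes a modern Python (3.8 or later) where `importlib.metadata` is available, which postdates the setup described here; the function name and any pinned versions passed to it are hypothetical.

```python
from importlib import metadata


def find_version_mismatches(pins, get_version=metadata.version):
    """Compare installed versions with pinned ones.

    pins: {distribution name: required version string}
    Returns {name: (required, found)} for every mismatch; found is None
    when the distribution is not installed.
    """
    mismatches = {}
    for name, required in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != required:
            mismatches[name] = (required, found)
    return mismatches
```

An empty result means every pinned module is present at the tested version; anything else is a reason to stop before running the integration and regression tests.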



Jintao Administrative Notes

Jintao Wang left the following administration documents. They provide critical installation, configuration, and execution notes on some of the services described above.



-- BillBug - 04 Sep 2006

Topic attachments
| Attachment | Size | Date | Who | Comment |
| GN-App-Arch-001.dia | 61.6 K | 05 Jun 2008 - 16:31 | BillBug | GeneNetwork Web Portal Application Functional Architecture (Dia format) |
| GN-App-Arch-001.png | 71.6 K | 05 Jun 2008 - 16:31 | BillBug | GeneNetwork Web Portal Application Functional Architecture (PNG format) |
| GN-Ntwrk-And-App-Arch-001.dia | 105.0 K | 05 Jun 2008 - 16:31 | BillBug | GeneNetwork Application Hardware Layout (Dia format) |
| GN-Ntwrk-And-App-Arch-001.png | 119.1 K | 05 Jun 2008 - 16:31 | BillBug | GeneNetwork Application Hardware Layout (PNG format) |
Topic revision: r14 - 02 Sep 2008 - 20:20:43 - RobWilliams
 