MARVIN, MULTI-AGENT RETRIEVAL VAGABOND ON INFORMATION NETWORKS
Health On the Net Foundation (HON) and the Swiss Institute of Bioinformatics (SIB) at Geneva University Hospital have developed MARVIN -- Multi-Agent Retrieval Vagabond on Information Networks -- a robot that searches sites and documents. Robots like this are already in use for health and medicine as well as other domains such as molecular biology.
MEDHUNT, HONSELECT, HON CODE OF CONDUCT, HONCODE
MedHunt, HONselect and the HON Code of Conduct (HONcode) are among HON services that actively promote effective Internet use and seek to demonstrate best-in-class implementation and application. The first initiative, MedHunt, consists of an intelligent and specialized search engine designed to locate Internet information related to a given medical and health domain. The second, HONselect, is the first assisted-search facility that integrates heterogeneous databases to offer users a full assortment of healthcare information and resources available on the Web. The third one is an authoritative set of voluntary guidelines designed to raise the quality of Web-based medical and health information. The HONcode is today the most widely endorsed set of ethical guidelines for site developers in this domain.
With the number of World Wide Web sites growing every day, the problem is not just to find information, but to locate the right piece of information. The current solutions for structuring information, namely subject hierarchies and general search engines, each have advantages and drawbacks. Subject hierarchies are precise because documents are classified manually. However, the number of results returned for an average query is usually low because few documents are indexed. General World Wide Web search engines, which index most of the Web, return a long list of documents, but often to the detriment of precision. The search result is then barely usable because of the large number of answers from different domains and topics. Only complex queries may, in a given situation, produce a limited number of potentially relevant documents. To make searches more efficient and useful to ordinary users, we need intelligent and specialized search engines on the Net.
The primary objective of the MARVIN (Multi-Agent Retrieval Vagabond on Information Networks) project, started in January 1996, was to reduce the search space by filtering Web pages and indexing only those belonging to a given field, and to support the multilinguality of the Web. MARVIN, HON's own Web spider, was first applied to the medical domain. Armed with a dictionary of medical terms, MARVIN tirelessly skims the Web for new sources of medical information. MARVIN feeds and constantly updates MedHunt, HON's medical and health search engine. On 16 November 2000, MedHunt recorded 2,000 visits (from different computers) and 8,000 accesses, which illustrates the effectiveness and utility of the complementary MARVIN-MedHunt pair.
MEDICAL SEARCH ENGINE, MULTI AGENT SOFTBOT
MARVIN (Multi-Agent Retrieval Vagabond on Information Networks) searches the Web and selects only documents that are relevant to a specific, chosen domain. Document relevance is computed according to a formula that takes into consideration the number of words from a glossary of significant terms that MARVIN finds in the document, as well as their place in the document. MARVIN was first applied to healthcare. It stores the selected documents in a database that users can then query through a search engine such as MedHunt, HON's own medical search engine. MARVIN is also applied to a variety of scientific domains, such as molecular biology and 2-D electrophoresis, constantly feeding and updating the different databases.
MARVIN was designed as a multi-agent softbot. Each agent possesses filtering capabilities: it downloads Web pages and computes the medical "score" of each page, using a glossary of medical terms to count how often glossary words appear in the page. The score determines whether a Web page is medical or health-related. It is obtained by adding up the medical terms found in the document, taking into account their different translations and the weight of each medical term as defined by the built-in glossary.
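The scoring step described above can be illustrated with a minimal sketch. It is not MARVIN's actual formula: the glossary entries, the weights, the threshold and the function names are all hypothetical, and the position of a term in the document is ignored for simplicity.

```python
import re

# Hypothetical glossary mapping each term (in any language) to a weight.
# The real glossary and its weights are not given in the text.
GLOSSARY = {
    "diabetes": 2.0,
    "diabète": 2.0,   # French translation, given the same weight
    "insulin": 1.5,
    "patient": 0.5,
}

# Assumed cut-off above which a page is kept as medical.
THRESHOLD = 3.0


def tokenize(text):
    """Split a page into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def medical_score(text, glossary=GLOSSARY):
    """Add up the weights of all glossary terms occurring in the text."""
    return sum(glossary.get(token, 0.0) for token in tokenize(text))


def is_medical(text, threshold=THRESHOLD):
    """Decide whether a page is medical or health-related."""
    return medical_score(text) >= threshold
```

With these assumed weights, a page containing "insulin", "diabetes" and "patient" would pass the filter, while a page without any glossary term would score zero and be discarded.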
In the medical domain, many thesauruses and glossaries already existed, such as MeSH (Medical Subject Headings) from the National Library of Medicine (NLM) and the glossary in nine European languages developed at the Heymans Institute of Pharmacology, University of Ghent, Belgium, within the framework of a European project. For our application, HON built its own thesaurus by compiling several of these sources. Starting with 12,000 bilingual (English/French) medical terms, the thesaurus was expanded with Danish, Dutch, German, Italian, Portuguese and Spanish, resulting in a thesaurus of 20,000 multilingual medical terms (not counting the 33,000 MeSH terms).
Studies were undertaken to estimate the relative importance of a term in a document and in a collection of documents, allowing us to weight each medical term included in our medical glossary. We analyzed 1,000 documents known to be related to medical and health topics and 1,000 documents related to other domains. The medical terms included in each Web page were then evaluated. This study, associated with other techniques such as the formula of Wilbur and Yang (An analysis of statistical term strength and its use in the indexing and retrieval of molecular biology texts, Comp. Bio. Med. 26.3 p. 209-222, 1996), allowed us to define a threshold for each term contained in our medical glossary.
Using our multilingual medical thesaurus of 50,000 terms, MARVIN downloads Web pages, calculates a score for each page according to its content, and generates a classical inverted index in which each word is associated with the list of documents containing it. Matching the requested terms is then a simple and efficient task.
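As an illustration of how two reference collections can yield a weight per term, the sketch below compares how often a term appears in medical documents and in other documents. It is an assumption made for illustration only: it uses a smoothed log ratio of document frequencies, which is neither the Wilbur and Yang formula nor the weighting actually used for the glossary, and the function names are hypothetical.

```python
import math
import re


def document_frequency(term, documents):
    """Count the documents in which the term occurs at least once."""
    return sum(
        1 for text in documents
        if term in set(re.findall(r"\w+", text.lower()))
    )


def term_weight(term, medical_docs, other_docs):
    """Smoothed log ratio of the term's document frequency in each collection.

    Positive values mean the term is more typical of medical documents,
    negative values mean it is more typical of the other collection.
    """
    p_medical = (document_frequency(term, medical_docs) + 1) / (len(medical_docs) + 2)
    p_other = (document_frequency(term, other_docs) + 1) / (len(other_docs) + 2)
    return math.log(p_medical / p_other)
```

A term found equally often in both collections receives a weight of zero under this scheme, so common words contribute nothing to a page's score.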
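A classical inverted index of this kind can be sketched in a few lines. The function names and the document identifiers are hypothetical, and the choice to return only documents containing every query word is an assumption; the text does not say how MedHunt combines several terms.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the set of identifiers of the documents containing it.

    `documents` is a dict of document identifier -> text.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the documents that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

Because each query word is resolved by a single dictionary lookup followed by set intersections, matching stays cheap even when the index covers many documents.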
2DHUNT, BIOHUNT, PACSHUNT, HON INDEX, EXPASY INDEX
The Hunt projects (MedHunt, 2DHunt, BioHunt, PACSHunt, HON index, ExPASy index) are based on a flexible search engine adapted to a specific domain, using the databases created by MARVIN. Six applications are already in use: a medical version (MedHunt), a version for searching 2-D electrophoresis documents (2DHunt), a molecular biology version (BioHunt), a version for searching documents related to PACS (Picture Archiving and Communication System), and two versions for searching Web site content (the HON Foundation and ExPASy, the proteomics server). With these six applications, MARVIN offers a way to customize domain-specific searches of Internet-based information.