1.1 Definition of Information Retrieval System
1.2 Objectives of Information Retrieval Systems
1.3 Functional Overview
1.4 Relationship to Database Management Systems
1.5 Digital Libraries and Data Warehouses
1.6 Summary
This chapter defines an Information Storage and Retrieval System (called an Information Retrieval System for brevity) and differentiates between information retrieval and database management systems. Tied closely to the definition of an Information Retrieval System are the system objectives. It is satisfaction of the objectives that drives those areas that receive the most attention in development. For example, academia pursues all aspects of information systems, investigating new theories, algorithms and heuristics to advance the knowledge base. Academia does not worry about response time, the resources required to implement a system that supports thousands of users, or the operations and maintenance costs associated with system delivery. On the other hand, commercial institutions are not always concerned with the optimum theoretical approach, but with the approach that minimizes development costs and increases the salability of their product. This text considers both viewpoints and technology states. Throughout this text, information retrieval is viewed from both the theoretical and the practical viewpoint.
The functional view of an Information Retrieval System is introduced to put into perspective the technical areas discussed in later chapters. As detailed algorithms and architectures are discussed, they are viewed as subfunctions within a total system. They are also correlated to the major objective of an Information Retrieval System, which is minimization of the human resources required in the finding of needed information to accomplish a task. As with any discipline, standard measures are identified to compare the value of different algorithms. In information systems, precision and recall are the key metrics used in evaluations. Early introduction of these concepts in this chapter will help the reader in understanding the utility of the detailed algorithms and theory introduced throughout this text.
Introduction of Information Retrieval Systems
Information Storage and Retrieval Systems [Type the company name] | Course Overview 4
There is a potential for confusion in the understanding of the differences between Database Management Systems (DBMS) and Information Retrieval Systems. It is easy to confuse the software that optimizes functional support of each type of system with actual information or structured data that is being stored and manipulated. The importance of the differences lies in the inability of a database management system to provide the functions needed to process “information.” The opposite, an information system containing structured data, also suffers major functional deficiencies. These differences are discussed in detail in Section 1.4.
1.1. Definition of Information Retrieval System
An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information. Information in this context can be composed of text (including numeric and date data), images, audio, video and other multi-media objects. Although the form of an object in an Information Retrieval System is diverse, the text aspect has been the only data type that lent itself to full functional processing. The other data types have been treated as highly informative sources, but are primarily linked for retrieval based upon search of the text. Techniques are beginning to emerge to search these other media types (e.g., EXCALIBUR’s Visual RetrievalWare, VIRAGE video indexer). The focus of this book is on research and implementation of search, retrieval and representation of textual and multimedia sources. Commercial development of pattern matching against other data types is starting to be a common function integrated within the total information system. In some systems the text may only be an identifier to display another associated data type that holds the substantive information desired by the system’s users (e.g., using closed captioning to locate video of interest). The term “user” in this book represents an end user of the information system who has minimal knowledge of computers and technical fields in general.
The term “item” is used to represent the smallest complete unit that is processed and manipulated by the system. The definition of item varies by how a specific source treats information. A complete document, such as a book, newspaper or magazine, could be an item. At other times each chapter or article may be defined as an item. As sources vary and systems include more complex processing, an item may address even lower levels of abstraction, such as a contiguous passage of text or a paragraph. For readability, throughout this book the terms “item” and “document” are not held to this rigorous definition but are used interchangeably. Whichever is used, they represent the concept of an item. For most of the book it is best to consider an item as text. In reality, however, an item may be a combination of many modes of information. For example, a video news program could be considered an item. It is composed of text in the form of closed captioning, audio text provided by the speakers, and the video images being displayed. There are multiple "tracks" of information possible in a single item, and they are typically correlated by time. Where the text discusses multimedia information retrieval, keep this expanded model in mind.
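The expanded model of an item as several time-correlated tracks can be pictured with a small data structure. The following is only an illustrative sketch: the class names `Item` and `Segment`, the track names, and the use of seconds as the time unit are assumptions made for this example and do not come from the text.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Segment:
    """One time-stamped piece of content on a track (times in seconds)."""
    start: float
    end: float
    content: str


@dataclass
class Item:
    """Hypothetical multi-track item, e.g. a video news program."""
    item_id: str
    tracks: dict[str, list[Segment]] = field(default_factory=dict)

    def add(self, track: str, start: float, end: float, content: str) -> None:
        """Append a segment to the named track, creating the track if needed."""
        self.tracks.setdefault(track, []).append(Segment(start, end, content))

    def at_time(self, t: float) -> dict[str, list[str]]:
        """Return the content of every track that is active at time t."""
        return {
            name: [s.content for s in segs if s.start <= t < s.end]
            for name, segs in self.tracks.items()
        }


news = Item("evening-news")
news.add("closed_caption", 0.0, 5.0, "Good evening, here are the headlines.")
news.add("audio_transcript", 0.0, 5.0, "good evening here are the headlines")
news.add("video", 0.0, 12.0, "<frames: anchor at desk>")
print(news.at_time(3.0))
```

Because every track shares the same time axis, a match found in one track (for example, a phrase in the closed captioning) can be used to locate the corresponding audio or video at the same moment.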
An Information Retrieval System consists of a software program that facilitates a user in finding the information the user needs. The system may use standard computer hardware or specialized hardware to support the search subfunction and to convert non-textual sources to a searchable media (e.g., transcription of audio to text). The gauge of success of an information system is how well it can minimize the overhead for a user to find the needed information. Overhead from a user’s perspective is the time required to find the information needed, excluding the time for actually reading the relevant data. Thus search composition, search execution, and reading non-relevant items are all aspects of information retrieval overhead.
Information Storage and Retrieval Systems 5 | General Administration for Technical Health Education
The first Information Retrieval Systems originated with the need to organize information in central repositories (e.g., libraries) (Hyman-82). Catalogues were created to facilitate the identification and retrieval of items. Chapter 3 reviews the history of cataloging and indexing. Original definitions focused on “documents” for information retrieval (or their surrogates) rather than the multi-media integrated information that is now available (Minker-77).
As computers became commercially available, they were obvious candidates for the storage and retrieval of text. Early introduction of Database Management Systems provided an ideal platform for electronic manipulation of the indexes to information (Rather-77). Libraries followed the paradigm of their catalogs and references by migrating the format and organization of their hardcopy information references into structured databases. These remain a primary mechanism for researching sources of needed information and play a major role in available Information Retrieval Systems. Academic research that was pursued through the 1980s was constrained by the paradigm of the indexed structure associated with libraries and the lack of computer power to handle large (gigabyte) text databases. The Military and other Government entities have always had a requirement to store and search large textual databases. As a result they began many independent developments of textual Information Retrieval Systems. Given the large quantities of data they needed to process, they pursued both research and development of specialized hardware and unique software solutions, incorporating Commercial Off The Shelf (COTS) products where possible. The Government has been the major funding source of research into Information Retrieval Systems. With the advent of inexpensive, powerful personal computer processing systems and high speed, large capacity secondary storage products, it has become commercially feasible to provide large textual information databases for the average user. The introduction and exponential growth of the Internet, along with its initial WAIS (Wide Area Information Servers) capability and more recently advanced search servers (e.g., INFOSEEK, EXCITE), has provided a new avenue for access to terabytes of information (over 800 million indexable pages, Lawrence-99).
The algorithms and techniques to optimize the processing and access of large quantities of textual data were once the sole domain of segments of the Government, a few industries, and academics. They have now become a needed capability for large segments of the population, with significant research and development being done by the private sector. Additionally, the volumes of non-textual information are also becoming searchable using specialized search capabilities. Images across the Internet are searchable from many web sites such as WEBSEEK, DITTO.COM and ALTAVISTA/IMAGES. News organizations such as the BBC are processing the audio news they have produced and are making historical audio news searchable via the transcribed versions of the audio. Major video organizations such as Disney are using video indexing to assist in finding specific images in their previously produced videos to use in future videos or incorporate in advertising. With exponential growth of multi-media on the Internet, capabilities such as these are becoming commonplace. Information Retrieval exploitation of multi-media is still in its infancy, with significant theoretical and practical knowledge missing.
1.2. Objectives of Information Retrieval Systems
The general objective of an Information Retrieval System is to minimize the overhead of a user locating needed information. Overhead can be expressed as the time a user spends in all of the steps leading to reading an item containing the needed information (e.g., query generation, query execution, scanning results of query to select items to read, reading non-relevant items). The success of an information system is very subjective, based upon what information is needed and the willingness of a user to accept overhead. Under some circumstances, needed information can be defined as all information in the system that relates to a user’s need. In other cases it may be defined as sufficient information in the system to complete a task, allowing for missed data. For example, a financial advisor recommending a billion dollar purchase of another company needs to be sure that all relevant, significant information on the target company has been located and reviewed in writing the recommendation. In contrast, a student only requires sufficient references in a research paper to satisfy the expectations of the teacher, which is never all-inclusive. A system that supports reasonable retrieval requires fewer features than one that requires comprehensive retrieval. In many cases comprehensive retrieval is a negative feature because it overloads the user with more information than is needed. This makes it more difficult for the user to filter the relevant but non-useful information from the critical items. In information retrieval the term “relevant” item is used to represent an item containing the needed information. In reality the definition of relevance is not a binary classification but a continuous function. From a user’s perspective “relevant” and “needed” are synonymous.
From a system perspective, information could be relevant to a search statement (i.e., matching the criteria of the search statement) even though it is not needed by or relevant to the user (e.g., the user already knew the information). A discussion on relevance and the natural redundancy of relevant information is presented in Chapter 11.
The two major measures commonly associated with information systems are precision and recall. When a user decides to issue a search looking for information on a topic, the total database is logically divided into four segments shown in Figure 1.1. Relevant items are those documents that contain information that helps the searcher in answering his question. Non-relevant items are those items that do not provide any directly useful information. There are two possibilities with respect to each item: it can be retrieved or not retrieved by the user’s query. Precision and recall are defined as:

$$\text{Precision} = \frac{\text{Number\_Retrieved\_Relevant}}{\text{Number\_Total\_Retrieved}}$$

$$\text{Recall} = \frac{\text{Number\_Retrieved\_Relevant}}{\text{Number\_Possible\_Relevant}}$$
where Number_Possible_Relevant is the number of relevant items in the database, Number_Total_Retrieved is the total number of items retrieved by the query, and Number_Retrieved_Relevant is the number of items retrieved that are relevant to the user’s search need. Precision measures one aspect of information retrieval overhead for a user associated with a particular search. If a search has an 85 per cent precision, then 15 per cent of the user effort is overhead reviewing non-relevant items. Recall gauges how well a system processing a particular query is able to retrieve the relevant items that the user is interested in seeing. Recall is a very useful concept but, because of its denominator, is non-calculable in operational systems. If the system knew the total set of relevant items in the database, it would have retrieved them. Figure 1.2a shows the values of precision and recall as the number of items retrieved increases, under an optimum query where every returned item is relevant. There are “N” relevant items in the database. Figures 1.2b and 1.2c show the optimal and currently achievable relationships between Precision and Recall (Harman-95). In Figure 1.2a the basic properties of precision (solid line) and recall (dashed line) can be observed. Precision starts off at 100 per cent and maintains that value as long as relevant items are retrieved. Recall starts off close to zero and increases as long as relevant items are retrieved, until all possible relevant items have been retrieved. Once all “N” relevant items have been retrieved, the only items being retrieved are non-relevant. Precision is directly affected by retrieval of non-relevant items and drops to a number close to zero. Recall is not affected by retrieval of non-relevant items and thus remains at 100 per cent once achieved.
Figure 1.2a Ideal Precision and Recall
Figure 1.2b Ideal Precision/Recall Graph
Figure 1.2c Achievable Precision/Recall Graph
Precision/Recall graphs show how values for precision and recall change within a search results file (Hit file) as viewed from the most relevant to the least relevant item. As with Figure 1.2a, in the ideal case every item retrieved is relevant. Thus precision stays at 100 per cent (1.0). Recall continues to increase by moving to the right on the x-axis until it also reaches the 100 per cent (1.0) point. Although Figure 1.2b stops here, continuation stays at the same x-axis location (recall never changes) while precision decreases down the y-axis, getting close to the x-axis as more non-relevant items are retrieved. Figure 1.2c is from the latest TREC conference (see Chapter 11) and is representative of current capabilities.
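The behaviour described for Figure 1.2a can be reproduced with a few lines of code. This is a minimal sketch, assuming the Hit file is available as a ranked list of relevance judgements; the function name `precision_recall_curve` and its parameters are hypothetical, chosen for this example only. As the text notes, the recall denominator is normally unknown in an operational system, so it is supplied here as a known value, as it would be in a controlled evaluation.

```python
def precision_recall_curve(ranked_relevance, number_possible_relevant):
    """Return a (precision, recall) pair after each retrieved item.

    ranked_relevance: list of booleans in Hit-file order, True where the
    retrieved item is relevant.
    number_possible_relevant: total relevant items in the database (assumed known).
    """
    if number_possible_relevant <= 0:
        raise ValueError("number_possible_relevant must be positive")
    points = []
    retrieved_relevant = 0
    for total_retrieved, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            retrieved_relevant += 1
        precision = retrieved_relevant / total_retrieved
        recall = retrieved_relevant / number_possible_relevant
        points.append((precision, recall))
    return points


# Optimum query with N = 3 relevant items, followed by 3 non-relevant items.
hits = [True, True, True, False, False, False]
for n, (p, r) in enumerate(precision_recall_curve(hits, 3), start=1):
    print(f"retrieved={n}  precision={p:.2f}  recall={r:.2f}")
```

In this toy run precision holds at 1.0 for the first three items and then falls as non-relevant items arrive, while recall climbs to 1.0 and stays there, which is the pattern the figure illustrates.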
To understand the implications of Figure 1.2c, it is useful to describe the implications of a particular point on the precision/recall graph. Assume that there are 100 relevant items in the database and that, from the graph, a precision of 0.3 (i.e., 30 per cent) has an associated recall of 0.5 (i.e., 50 per cent). The recall value means there would be 50 relevant items in the Hit file. A precision of 30 per cent means the user would likely review 167 items to find those 50 relevant items.
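The arithmetic behind this example can be written out using the standard definitions (precision is relevant items retrieved over total items retrieved; recall is relevant items retrieved over possible relevant items). The variable names follow those used in this section; the derivation is a worked illustration, not part of the original figure.

```latex
\begin{align*}
\text{Number\_Retrieved\_Relevant}
  &= \text{Recall} \times \text{Number\_Possible\_Relevant}
   = 0.5 \times 100 = 50 \\
\text{Number\_Total\_Retrieved}
  &= \frac{\text{Number\_Retrieved\_Relevant}}{\text{Precision}}
   = \frac{50}{0.3} \approx 167
\end{align*}
```

Of the roughly 167 items reviewed, about 117 are non-relevant, which is the overhead the user pays at this point on the curve.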
The first objective of an Information Retrieval System is support of user search generation. There are natural obstacles to specification of the information a user needs that come from ambiguities inherent in languages, limits to the user’s ability to express what information is needed and differences between the user’s vocabulary corpus and that of the authors of the items in the database. Natural languages suffer from word ambiguities such as homographs and use of acronyms that allow the same word to have multiple meanings (e.g., the word “field” or the acronym “U.S.”). Disambiguation techniques exist but introduce significant system overhead in processing power and extended search times and often require interaction with the user.
Many users have trouble generating a good search statement. The typical user has neither significant experience with nor even the aptitude for Boolean logic statements. The use of Boolean logic is a legacy from the evolution of database management systems and implementation constraints. Until recently, commercial systems were based upon databases. It is only with the introduction of Information Retrieval Systems such as RetrievalWare, TOPIC, AltaVista, INFOSEEK and INQUERY that the idea of accepting natural language queries is becoming a standard system feature. This allows users to state in natural language what they are interested in finding. But the completeness of the user specification is limited by the user’s willingness to construct long natural language queries. Most users on the Internet enter one or two search terms.
Multi-media adds an additional level of complexity in search specification. Where the mode has been converted to text (e.g., audio transcription, OCR), the normal text techniques are still applicable. But query specification when searching for an image, unique sound, or video segment lacks any proven best interface approaches. Typically such searches are achieved by having presorted examples of known objects in the media and letting the user select them for the search (e.g., images of leaders allowing for searches on "Tony Blair"). This type of specification becomes more complex when coupled with Boolean or natural language textual specifications.
In addition to the complexities in generating a query, quite often the user is not an expert in the area that is being searched and lacks the domain-specific vocabulary unique to that particular subject area. The user starts the search process with a general concept of the information required, but does not have a focused definition of exactly what is needed. A limited knowledge of the vocabulary associated with a particular area, along with lack of focus on exactly what information is needed, leads to use of inaccurate and in some cases misleading search terms. Even when the user is an expert in the area being searched, the ability to select the proper search terms is constrained by lack of knowledge of the author’s vocabulary. All writers have a vocabulary limited by their life experiences, the environment where they were raised and their ability to express themselves. Other than in very technical, restricted information domains, the user’s search vocabulary does not match the author’s vocabulary. Users usually start with simple queries that suffer from failure rates approaching 50% (Nordlie-99).
Thus, an Information Retrieval System must provide tools to help overcome the search specification problems discussed above. In particular, the search tools must assist the user, both automatically and through system interaction, in developing a search specification that represents the need of the user and the writing style of diverse authors (see Figure 1.3), as well as in multi-media specification.