
Thursday, 27 October 2011

DITA coursework blog - Web 1.0 (the internet and WWW, databases and information retrieval)


Title: Language and access in Digital Information Technologies and Architecture, with a focus on law libraries

1. Introduction

An underlying principle of digital information is that it is data which must be written in a specific language so that it can be stored in sources, communicated by systems and retrieved by users. Once this is achieved, access to data must be managed using appropriate technologies. I will consider this statement in the context of modern law libraries to assess the present and future impact on the provision of digital resources to their users.

2. Evaluating

Digital technologies must take into account the information needs of library users, who, in today's digital age, most commonly seek information from online subscription databases and web resources. Sources of information in law libraries are typically law reports, journal articles or legislation: predominantly accessed as either printed or digital text-based information. The latter must be in a specified format in order to be read: it is data given a form capable of precise meaning through logical coding and sequencing – in essence a 'language'.

Computers are system linguists that communicate data over connected networks (the internet) via a service (the World Wide Web). Computers read and interpret data in binary form: bits are assigned characters and form words as ASCII text; collected together, they create files which make up documents, such as database records or web pages. Human users can only subjectively evaluate text for meaning and relevance in a form they understand. Computers do not understand "human" language, and so evaluate the language within the data: metadata. Hypertext is a language used to inter-link data within one document, or link data between documents. Web pages are written in Hypertext Markup Language (HTML) so the data can be read by internet browsers, which interpret metatags (ordered ASCII text relaying strict instructions on layout and structure) as distinct from standard ASCII text.
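The character-to-bits relationship described above can be made concrete with a few lines of Python – a minimal sketch showing how each character of text is stored as a numeric code, which the computer holds as binary digits:

```python
# Each character is assigned a numeric code (ASCII for the basic
# Latin characters), and that code is what the computer stores as bits.
text = "HTML"

for ch in text:
    code = ord(ch)              # the character's ASCII/Unicode code point
    bits = format(code, "08b")  # the same value written as 8 binary digits
    print(ch, code, bits)       # e.g. H 72 01001000
```

Running this shows, for instance, that the letter "H" is stored as the number 72, i.e. the bit pattern 01001000.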

The advent of e-books has seen a shift towards digital readership, where books translated into ASCII text can enjoy wider distribution to library users over the internet. This indicates the future of how libraries will provide materials to their users; but issues of cost, reliability and user misgivings on rapid technological advancement still impact on access.

3. Managing

Managing data at core is concerned with providing users with access points. There are two sources of digital information available to library users: internal (databases) and external (the internet). 

Databases organise and order available data in accordance with the user's information needs, a primary example being a library's OPAC (online public access catalogue) of its holdings. Language is the control. Structured Query Language (SQL) instructs relational databases to perform queries that retrieve selective data from a number of interrelated data tables.
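As a sketch of how SQL retrieves selective data from interrelated tables, the following uses Python's built-in sqlite3 module with a hypothetical two-table catalogue (the table and column names are purely illustrative, not those of any real OPAC):

```python
import sqlite3

# A hypothetical miniature catalogue: books linked to authors by an id.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books   (id INTEGER PRIMARY KEY, title TEXT,
                          author_id INTEGER REFERENCES authors(id));
    INSERT INTO authors VALUES (1, 'Brophy'), (2, 'Limb');
    INSERT INTO books   VALUES
        (1, 'The library in the twenty-first century', 1),
        (2, 'Digital Dilemmas and Solutions', 2);
""")

# An SQL query joining the two interrelated tables to retrieve
# only the selective data the user asked for.
rows = db.execute("""
    SELECT b.title, a.name
    FROM books b JOIN authors a ON b.author_id = a.id
    WHERE a.name = 'Brophy'
""").fetchall()
print(rows)
```

The JOIN clause is what relates the two tables, so a single query can answer a question ("which titles do we hold by this author?") that spans both.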
Databases permit searches by two methods: natural language and controlled vocabularies. If the natural language search terms are not clear, or irrelevant search results are returned, the user may deploy query modification to adjust the language used and yield better results. Controlled vocabularies such as indexes and thesauri may signpost users in context to data that may or may not be relevant. We should expect more relevant search results from a database than from, say, an internet search engine, provided that the data is there to be retrieved.

Libraries can combine access to both databases and the web concurrently to permit wider scope for information retrieval. Brophy (2007, pp.113-4) sees use as the purpose behind the access and retrieval process, thus directly linking users to resources. He also implies that use involves the creation of "information objects of various kinds". A library portal, such as that created by the Inner Temple Library[1], is a good example of this – it is an online access point to a number of databases, together with hyperlinks to web resources including a subject index and current awareness blog. Maloney and Bracke (2005, p.87) emphasise that this "is not a single technology. Rather it is a combination of several systems, standards and protocols that inter-operate to create a unified experience for the user". This means of federated searching[2] is emerging as a possible solution to remove the complexities of cross-searching multiple databases.

Information retrieval over the web is a double-edged sword: on one hand there is a wealth of dedicated resources available online; however, an inexpert user will only ever retrieve a small percentage of relevant data due to the "invisible web"[3]: a detrimental consequence of a global resource that is dynamically evolving, but where authenticity and permanence are compromised as more and more information goes online. Limb (2004, p.60) believes this could be combated by building federated repositories to harvest a wealth of relevant cyber resources, but the task may appear onerous and unmanageable.

4. Conclusion

The communication chain between users, systems and sources depends on the efficient and concise use of language in order to access and retrieve data. A break in the chain, such as incomplete HTML code or a broken hyperlink, can shut down access to information, leaving the information seeker locked out. The architects of computer systems dictate the choice and methods by which data is represented, but as non-subject specialists, they may not appreciate that the information to which they give access may not fulfil the user's needs. A compromise perhaps should be reached.[4]

Recent developments such as cloud sourcing[5] look set to change how society stores and accesses digital information, in that information users can retrieve documents via the internet without prior knowledge of where the source document is physically rooted. It appears cloud sourcing makes the service the source.[6]

I cannot see how law libraries could happily subscribe to these developments: information retrieval is too deeply rooted in specialist knowledge and language, coupled with the need for reasonable proximity between users and their sources. As technologies make information cheaper to produce and maintain, it is ever more eagerly consumed by non-experts who lack the skill and knowledge to access and evaluate relevant information.

The legal information professional, acting as the bridge between users, systems and sources, therefore remains crucial to the information access and retrieval processes.

Bibliography

Brophy, P. (2007). The library in the twenty-first century. 2nd ed. London: Facet Publishing.

The Inner Temple Library Catalogue: http://www.innertemplelibrary.org/external.html (accessed: 25th October 2011).

Maloney, K. & Bracke, P.J. (2005). Library portal technologies. In: Michalak, S.C., ed. 2005. Portals and libraries. New York: The Haworth Information Press. Ch.6.

Limb, P. (2004). Digital Dilemmas and Solutions. Oxford: Chandos Publishing.

Pedley, P. (2001). The invisible web: searching the hidden parts of the internet. London: Aslib-IMI.

Harvey, T. (2003). The role of the legal information officer. Oxford: Chandos Publishing.

Géczy, P., Izumi, N. and Hasida, K. (2012). Cloudsourcing: managing cloud adoption. Global Journal of Business Research, 6(2), 57-71. (accessed: EBSCOhost - 25th October 2011.)

References


[1] The Inner Temple Library Catalogue: http://www.innertemplelibrary.org/external.html (accessed: 25th October 2011)
[2] See Limb (2004, p.59).
[3] For further discussion, see: Pedley (2001) The Invisible Web: Searching the hidden parts of the internet. London: Aslib-IMI.
[4] See Harvey (2003, p.143-6) for a persuasive discussion on the ‘librarian vs lawyer’ in terms of information retrieval within the legal profession.
[5] For detailed discussion of the concerns and benefits of cloud sourcing, see Géczy, Izumi and Hasida (2012) in Global Journal of Business Research, 6(2), 57-71.
[6] i.e. the internet becomes the storage and service provider of digital documents, which are no longer anchored to a physical location.

Tuesday, 18 October 2011

DITA - Understanding blog No. 4: Information Retrieval

After last week's session on retrieving structured data from a database management system, this week's task of retrieving unstructured data from the wide expanse of the Internet seems, on paper, a staggeringly insurmountable one. But is it really? I argue not. We do this kind of thing on a daily basis and we don't really give it much thought. The next time you want to use Google to search for tickets for an upcoming gig or theatre show, think carefully about what you are actually doing ... retrieving specific information from a whole mass of information deposited on the net. It has some order (websites, webpages) but we don't know exactly where we are going to find what we want, or even whether we will actually find anything relevant.

Information retrieval has three definitions depending on your viewpoint as either a user, a system or a source. A user typically has inadequate knowledge of the subject they are searching for, and hence seeks to retrieve information through a search request to enlighten them. A system stores information, processes it and makes it available for retrieval through software and hardware: it is the technology that allows the user to search how they want to. A source is the document that contains the information we wish to retrieve; it has an intended purpose and audience. Information is a valuable commodity which is ripe for exploitation: it can be bought and sold as a service.

Information retrieval on the internet occurs whenever we make a web search (we want to find some information online). Broder (2000) conceived a taxonomy for web searching by looking at the different types of query we make:
  • Navigational queries (e.g. finding the home page for a company when you don't know the precise URL)
  • Transactional queries (e.g. a mediated activity, such as purchasing a box of fudge)
  • Informational queries (e.g. finding information on a particular subject, such as what is available or how to do something)
All the above queries are textual based (i.e. we are seeking a written record of the information). The web is home to a selection of different non-textual media, such as images and videos, and therefore the scope of our searching can be expanded to the following categories:
  • Known-item retrieval i.e. the user knows the exact item necessary to satisfy their informational need (e.g. a particular movie or video hosted online)
  • Fact retrieval i.e. the user knows what they want but does not have the precise information in order to fulfil their need (e.g. which actor played a certain part in a particular movie)
  • Subject retrieval i.e. the user is looking into a subject, which is not precisely defined (e.g. the most memorable deaths in horror films)
  • Exploratory retrieval i.e. checking out what data is available for a provided selection (e.g. searching for classical music on iTunes)
Before information can be searched, it needs to be in a specific format in order to be retrieved (e.g. HTML, XML, MPEG). Media needs to be processed in the correct way before it can be indexed correctly. To assist the indexing process, several steps should be followed with the text descriptors of the media to be retrieved:

  1. identify the fields you wish to make searchable in the index (e.g. the parts of the document most commonly searched for, such as title, author, year etc. This allows highly accurate, focused searching to be carried out)
  2. identify the words that will act as keywords for a search procedure, which will be those terms or phrases that are likely to be searched for by the user. Consider whether digits and non A-Z characters will be included or excluded. Folding keywords to lowercase will yield more consistent matching.
  3. remove stop words such as "and" and "the".
  4. stem words, by cutting off the suffix to allow for wider searching of a concept or term e.g. act! would bring up results for acting, actors, actions etc.
  5. define synonyms, i.e. different words that have the same meaning.
Once the information has been prepared for indexing, it needs to be formatted into a structure. This can be in the form of a surrogate record (i.e. a record within the database which acts as a 'list of records' for all the information contained in the database that you are interested in) or an inverted file (i.e. we look at words to find documents, rather than the other way around ... looking from the inside out!)
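The preparation steps above can be sketched in a few lines of Python: a toy inverted file that lowercases the text, drops stop words, and maps each remaining word to the documents it appears in (stemming and synonyms are omitted for brevity, and the documents are made up for illustration):

```python
# A toy inverted index: word -> set of document ids containing it.
STOP_WORDS = {"and", "the", "of", "a", "in", "with"}

documents = {
    1: "The invisible web and the hidden internet",
    2: "Searching the web with search engines",
}

index = {}
for doc_id, text in documents.items():
    for word in text.lower().split():   # fold to lowercase
        if word in STOP_WORDS:          # drop stop words
            continue
        index.setdefault(word, set()).add(doc_id)

print(index["web"])        # appears in both documents
print(index["invisible"])  # appears only in document 1
```

Looking up a word in `index` goes straight from word to documents, which is exactly the "inside out" view an inverted file provides.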

Index structure in place ... we can now search! Search models for information retrieval include Boolean connectors (AND, OR, NOT), proximity searching (within the same sentence, paragraph or phrase; word adjacency), best-match results generated through ranking systems built into search engines such as Google, and simply browsing the internet (which bypasses any indexes in place).
Should the preliminary search fail, we can then try a manual query modification (by adding or removing terms from the initial search query) or try an automatic query modification such as a 'show me more on this topic' which is provided for you by the search engine.
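The Boolean connectors above map directly onto set operations over an inverted index – a minimal sketch, where the postings (document ids per word) are made up for illustration:

```python
# Hypothetical postings: each word maps to the set of documents
# (by id) in which it occurs.
postings = {
    "law":     {1, 2, 4},
    "library": {2, 3, 4},
    "journal": {1, 3},
}

# law AND library: documents containing both words (intersection)
print(postings["law"] & postings["library"])

# law OR journal: documents containing either word (union)
print(postings["law"] | postings["journal"])

# law NOT journal: documents with "law" but not "journal" (difference)
print(postings["law"] - postings["journal"])
```

AND narrows a search, OR widens it, and NOT excludes – which is why query modification often amounts to swapping one connector for another.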

Once you have conducted a search, how do you determine how relevant the results are? You need to evaluate it.

It can be done qualitatively, from a user viewpoint (was the user satisfied with the search results?) or a source's viewpoint (how much should the user be charged for search services providing relevant results?).

It can be done quantitatively from a systems viewpoint, by which we can evaluate retrieval effectiveness by calculating precision and recall:

Precision = the proportion of retrieved documents that are relevant
   = relevant documents retrieved / total documents retrieved

Recall = the proportion of relevant documents that are retrieved
   = relevant documents retrieved / total relevant documents in the database

The practical lab session allowed us to explore information retrieval by using two internet search engines, Google and Bing, to search for a variety of information by making search queries, then calculating the precision and recall of each engine. Because we are already well versed in searching the internet, and because I already use advanced search models such as Boolean connectors for online searching, I was able to find relevant results efficiently. The session as a whole, however, reinforced the need for well-structured indexes and precise search models to be in place if we are to retrieve information that is relevant to our needs at the time we need to access it.