Thursday 27 October 2011

DITA coursework blog - Web 1.0 (the internet and WWW, databases and information retrieval)


Title: Language and access in Digital Information Technologies and Architecture, with a focus on law libraries

1. Introduction

An underlying principle of digital information is that it is data which must be written in a specific language so that it can be stored in sources, communicated by systems and retrieved by users. Once this is achieved, access to data must be managed using appropriate technologies. I will consider this statement in the context of modern law libraries to assess the present and future impact on the provision of digital resources to their users.

2. Evaluating

Digital technologies must take into account the information needs of library users, who, in today’s digital age, most commonly seek information from online subscription databases and web resources. Sources of information in law libraries are typically law reports, journal articles or legislation: predominantly accessed as either printed or digital text-based information. The latter must be in a specified format in order to be read: it is data attributed a form capable of precise meaning through logical coding and sequencing – in essence a ‘language’.

Computers are system linguists which communicate data over connected networks (the internet) via a service (the World Wide Web). Computers read and interpret data in binary form: groups of bits are assigned characters and form words as ASCII text; collected together, they create files which make up documents, such as database records or web pages. Human users can only subjectively evaluate text for meaning and relevance in a form they understand. Computers do not understand “human” language, and so evaluate data about the data: metadata. Hypertext is a language used to inter-link data within one document, or to link data between documents. Web pages are written in Hypertext Mark-up Language (HTML) so the data can be read by internet browsers, which interpret markup tags (ordered ASCII text relaying strict instructions on layout and structure) as distinct from the standard ASCII text they surround.

The advent of e-books has seen a shift towards digital readership, where books translated into ASCII text can enjoy wider distribution to library users over the internet. This indicates the future of how libraries will provide materials to their users; but issues of cost, reliability and user misgivings about rapid technological advancement still affect access.

3. Managing

Managing data at core is concerned with providing users with access points. There are two sources of digital information available to library users: internal (databases) and external (the internet). 

Databases organise and order available data in accordance with the user’s information needs, a primary example being an OPAC (online public access catalogue) of a library’s holdings. Language is the control. Structured Query Language (SQL) instructs relational databases to perform queries that retrieve selected data from a number of interrelated data tables.
Databases permit searches by two methods: natural language and controlled vocabularies. If the natural language search terms are not clear, or irrelevant search results are returned, the user may deploy query modification to adjust the language used and yield better results. Controlled vocabularies such as indexes and thesauri may signpost users in context to data that may or may not be relevant. We should expect more relevant results from a database search than from, say, an internet search engine, provided that the data is there to be retrieved.
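
As a minimal sketch of the point (hypothetical table and column names, and invented sample rows, written here in Python with its built-in sqlite3 module rather than a commercial library system), a single SQL query can retrieve selected data from two interrelated catalogue tables:

import sqlite3

catalogue = sqlite3.connect(":memory:")   # throwaway in-memory stand-in for a catalogue database
catalogue.executescript("""
    create table Series ( Series_No integer primary key, Series_Name text );
    create table Report ( Report_No integer primary key, Title text, Year integer, Series integer );
    insert into Series values ( 1, 'Appeal Cases' );
    insert into Report values ( 1, 'Donoghue v Stevenson', 1932, 1 );
    insert into Report values ( 2, 'Caparo Industries plc v Dickman', 1990, 1 );
""")

# One SQL query retrieving selected data from two interrelated tables
for row in catalogue.execute(
        "select Title, Year, Series_Name from Report, Series "
        "where Report.Series = Series.Series_No and Report.Year > 1950"):
    print(row)   # ('Caparo Industries plc v Dickman', 1990, 'Appeal Cases')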

Libraries can combine access to both databases and the web concurrently to permit wider scope for information retrieval. Brophy (2007, pp.113-4) sees the use made of resources as central to the access and retrieval process, directly linking users to resources. He also implies that use involves the creation of “information objects of various kinds”. A library portal, such as that created by the Inner Temple Library[1], is a good example of this – it is an online access point to a number of databases, together with hyperlinks to web resources including a subject index and current awareness blog. Maloney and Bracke (2005, p.87) emphasise that this “is not a single technology. Rather it is a combination of several systems, standards and protocols that inter-operate to create a unified experience for the user”. This means of federated searching[2] is emerging as a possible solution to remove the complexities of cross-searching multiple databases.

Information retrieval over the web is a double-edged sword: on one hand there is a wealth of dedicated resources available online; on the other, an inexpert user will only ever retrieve a small percentage of the relevant data because of the “invisible web”[3]: a detrimental consequence of a global resource that is dynamically evolving, but where authenticity and permanence are compromised as more and more information goes online. Limb (2004, p.60) believes this could be combated by building federated repositories to harvest a wealth of relevant cyber resources, but the task may appear onerous and unmanageable.

4. Conclusion

The communication chain between users, systems and sources is dependent on the efficient and concise use of language in order to access and retrieve data. A break in the chain, such as incomplete HTML code or a broken hyperlink, can shut down access to information, leaving the information seeker locked out. The architects of computer systems dictate the choice and methods by which data is represented but, as non-subject specialists, they may not appreciate that the information they provide access to may not fulfil the user’s needs. A compromise perhaps should be reached.[4]

Recent developments such as cloud sourcing[5] look set to change how society stores and accesses digital information, in that information users can retrieve documents via the internet without prior knowledge of where the source document is physically located. It appears cloud sourcing makes the service the source.[6]

I cannot see how law libraries could happily subscribe to these developments: information retrieval is too deeply rooted in specialist knowledge and language, coupled with the need for reasonable proximity between the user and their sources. As technologies make information cheaper to produce and maintain, it is ever more eagerly consumed by non-experts who lack the skill and knowledge to access and evaluate relevant information.

The legal information professional, acting as the bridge between users, systems and sources, therefore remains crucial to the information access and retrieval processes.

Bibliography

Brophy, P. (2007). The library in the twenty-first century. 2nd ed. London: Facet Publishing.

The Inner Temple Library Catalogue: http://www.innertemplelibrary.org/external.html (accessed: 25th October 2011).

Maloney, K. & Bracke, P.J. (2005). Library portal technologies. In: Michalak, S.C., ed. 2005. Portals and libraries. New York: The Haworth Information Press. Ch.6.

Limb, P. (2004). Digital Dilemmas and Solutions. Oxford: Chandos Publishing.

Pedley, P. (2001). The invisible web: searching the hidden parts of the internet. London: Aslib-IMI.

Harvey, T. (2003). The role of the legal information officer. Oxford: Chandos Publishing.

Géczy, P., Izumi, N. and Hasida, K. (2012). Cloudsourcing: managing cloud adoption. Global Journal of Business Research, 6(2), 57-71. (accessed via EBSCOhost: 25th October 2011).

References


[1] The Inner Temple Library Catalogue: http://www.innertemplelibrary.org/external.html (accessed: 25th October 2011)
[2] See Limb (2004, p.59).
[3] For further discussion, see: Pedley (2001) The Invisible Web: Searching the hidden parts of the internet. London: Aslib-IMI.
[4] See Harvey (2003, p.143-6) for a persuasive discussion on the ‘librarian vs lawyer’ in terms of information retrieval within the legal profession.
[5] For detailed discussion of the concerns and benefits of cloud sourcing, see Géczy, Izumi and Hasida (2012) in Global Journal of Business Research, 6(2), 57-71.
[6] i.e. the internet becomes the storage and service provider of digital documents, which are no longer anchored to a physical location.

Tuesday 18 October 2011

DITA - Understanding blog No. 4: Information Retrieval

After last week's session on retrieving structured data from a database management system, this week's task of retrieving unstructured data from the wide expanse of the Internet seems, on paper, a staggering, insurmountable one. But is it really? I argue not. We do this kind of thing on a daily basis and we don't really give it much thought. The next time you want to use Google to search for tickets for an upcoming gig or theatre show, think carefully about what you are actually doing ... retrieving specific information from a whole mass of information deposited on the net. It has some order (websites, webpages) but we don't know exactly where we are going to find it, or even whether we will actually find anything relevant.

Information retrieval has three definitions depending on your viewpoint as either a user, a system or a source. A user typically has inadequate knowledge of the subject they are searching for, and hence seeks to retrieve information through a search request to enlighten them. A system stores information, processes it and makes it available for retrieval through software and hardware; it is the technology that allows the user to search how they want to. A source is the document that contains the information we wish to retrieve; it has an intended purpose and audience. Information is a valuable commodity which is ripe for exploitation: it can be bought and sold as a service.

Information retrieval on the internet occurs whenever we make a web search (we want to find some information online). Broder (2000) conceived a taxonomy for web searching by looking at the different types of query we make:
  • Navigational queries (e.g. finding the home page for a company when you don't know the precise URL)
  • Transactional queries (e.g. a mediated activity, such as purchasing a box of fudge)
  • Informational queries (e.g. finding information on a particular subject, such as what is available or how to do something)
All the above queries are text-based (i.e. we are seeking a written record of the information). The web is home to a selection of different non-textual media, such as images and videos, and therefore the scope of our searching can be expanded to the following categories:
  • Known-item retrieval i.e. the user knows the exact item necessary to satisfy their informational need (e.g. a particular movie or video hosted online)
  • Fact retrieval i.e. the user knows what they want but does not have the precise information in order to fulfil their need (e.g. which actor played a certain part in a particular movie)
  • Subject retrieval i.e. the user is looking into a subject, which is not precisely defined (e.g. the most memorable deaths in horror films)
  • Exploratory retrieval i.e. checking out what data is available for a provided selection (e.g. searching for classical music on iTunes)
Before information can be searched, it needs to be in a specific format in order to be retrieved (e.g. HTML, XML, MPEG). Media needs to be processed in the correct way before it can be indexed correctly. To assist the indexing process, a number of steps should be followed when preparing the text descriptors for the media to be retrieved:

  1. identify the fields you wish to make searchable in the index (e.g. the parts of the document which are most typically searched for, such as title, author, year etc.; this allows for highly accurate, focused searching to be carried out)
  2. identify words that will act as keywords for a search procedure, which will be those terms or phrases that are likely to be searched for by the user. A decision on whether digits and non A-Z characters will be included or excluded needs to be made. Keeping the keywords in lowercase will yield more accurate search results.
  3. remove stop words such as 'and' and 'the'.
  4. stem words, by cutting off the suffix to allow for wider searching of a concept or term, e.g. act! would bring up results for acting, actors, actions etc.
  5. define synonyms, i.e. different words that have the same meaning.
Once the information has been prepared for indexing, it needs to be formatted into a structure. This can be in the form of a surrogate record (i.e. a record within the database which acts as a 'list of records' for all the information contained in the database that you are interested in) or as an inverted file (i.e. we look at words to find documents, rather than the other way around ... looking from the inside out!). A rough sketch of the idea follows below.
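
As a rough sketch of those preparation steps (invented documents, a deliberately crude stemmer, and no claim to match the lecture's exact process), this little Python snippet lowercases text, strips stop words, trims a common suffix and builds an inverted file mapping keywords to documents:

# Hypothetical mini-collection: document id -> text
docs = {
    1: "The actors were acting in a memorable horror film",
    2: "Classical music for acting exercises",
}

stop_words = {"the", "a", "in", "for", "were"}

def keywords(text):
    words = [w.lower() for w in text.split()]                    # lowercase, split on whitespace
    words = [w for w in words if w not in stop_words]            # remove stop words
    return [w[:-3] if w.endswith("ing") else w for w in words]   # very crude stemming

# Inverted file: look at words to find documents, not the other way around
index = {}
for doc_id, text in docs.items():
    for word in keywords(text):
        index.setdefault(word, set()).add(doc_id)

print(index.get("act"))   # {1, 2} - both documents contain an 'act...' word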

Index structure in place ... we can now search! Search models for information retrieval include boolean connectors (AND, OR, NOT), proximity searching (within the same sentence, paragraph or phrase; word adjacency), best-match results generated through ranking systems built into search engines such as Google, and simply browsing the internet (which bypasses any indexes in place).
Should the preliminary search fail, we can then try a manual query modification (adding or removing terms from the initial search query) or an automatic query modification, such as a 'show me more on this topic' option provided by the search engine.
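
Purely as an illustration (a toy, hand-made index rather than a real search engine), boolean connectors map neatly onto set operations over an inverted file like the one sketched above:

# Toy inverted file: keyword -> set of document ids (made-up data)
index = {
    "horror":  {1, 3, 5},
    "film":    {1, 2, 5},
    "musical": {2, 4},
}

horror_and_film = index["horror"] & index["film"]       # AND: both terms present -> {1, 5}
horror_or_musical = index["horror"] | index["musical"]  # OR: either term present -> {1, 2, 3, 4, 5}
film_not_musical = index["film"] - index["musical"]     # NOT: first term without the second -> {1, 5}

print(horror_and_film, horror_or_musical, film_not_musical)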

Once you have conducted a search, how do you determine how relevant the results are? You need to evaluate them.

It can be done qualitatively, from a user's viewpoint (was the user satisfied with the search results?) or a source's viewpoint (how much should the user be charged for search services providing relevant results?).

It can be done quantitatively from a systems viewpoint, by which we can evaluate the retrieval effectiveness and efficiency by calculating precision and recall respectively:

Precision = the proportion of retrieved documents that are relevant
   = relevant documents retrieved ÷ total documents returned

Recall = the proportion of relevant documents that are retrieved
   = relevant documents retrieved ÷ total number of relevant documents in the database
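
A worked toy example (all numbers invented, Python used simply as a calculator) shows how the two measures can come out differently for the same search:

# Suppose the engine returns 8 documents, of which 4 are actually relevant,
# and the collection contains 10 relevant documents in total (invented figures).
retrieved = {1, 2, 3, 4, 5, 6, 7, 8}
relevant = {2, 4, 6, 8, 10, 12, 14, 16, 18, 20}

relevant_retrieved = retrieved & relevant             # {2, 4, 6, 8}

precision = len(relevant_retrieved) / len(retrieved)  # 4 / 8  = 0.5
recall = len(relevant_retrieved) / len(relevant)      # 4 / 10 = 0.4

print(precision, recall)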

The practical lab session allowed us to explore information retrieval by using two internet search engines, Google and Bing, to search for a variety of information by making search queries, then calculating the precision and recall of each engine. Because we are already well versed in searching the internet, and because I already use advanced search models such as boolean connectors for online searching, I was able to find relevant results efficiently. The session as a whole, however, reinforced the need for well-structured indexes and precise search models to be in place if we are to retrieve information that is relevant to our needs at the time we need to access it.

Marathon update. It happened. I rocked.

In-between a heavy dose of DITA today (almost caught up!), I totally forgot to follow up on something important.

I ran the Abingdon Marathon on Sunday and it didn't kill me!!!

Debut marathon time of 3 hours 18 minutes and 31 seconds. Finished 206th out of 777 runners. So, so proud of this achievement. I've spent the last quarter of a year training for that one race and I'm relieved that I gave it a bloody good go!

Loved the experience, but won't be repeating it in a hurry! Having a week off from running but I'll be back out there before you know it ;-)

DITA - Understanding Blog No. 3 - Relational Databases

To start off, I will confess now that databases fascinate me. In every job I have worked, from being on the customer service desk in a busy supermarket to managing the IP portfolio for a multinational drinks producer, I have used a Database Management System (DBMS) to assist me in my vocation. The three main components of a database are the data-set, the users requiring access to that data, and the systems, applications and processes which permit the user to access that data. A balance of the three is required, although I would suggest that a database without users loses its sense of identity and purpose and reverts back to being simply information. If users are the valuers of the data, the database is the facilitator of data access.

A DBMS manages 'structured' data, that is, data which we have carefully selected and stored in a specific form, to be accessed for a particular purpose, on any number of occasions, potentially by a number of different users. It acts as the user interface to access this information, and imposes security controls to restrict access according to the data/user type. The efficient management of large amounts of data is crucial, because we are only likely to need small amounts of data subjectively relevant to our needs at any particular time, such as when we make a query on an individual component of a data-set (i.e. we may only want to find out a few pieces of data, such as the name, location and salary of a specified employee – not of every employee in the same building or department). Every database user is likely to have a different informational need that such a query seeks to satisfy, and the ability of a DBMS to sift through and filter data in accordance with our individual requirements is fundamental in achieving this.

A DBMS facilitates and permits access to a core set of data. This eliminates the need for duplicate entries, and thus promotes user efficiency and improves data integrity. In order to provide better access to the data, relationships need to be established between pieces of data which draw upon the logical process of user enquiry. In its simplest form, a database can be represented as a table consisting of rows and columns: each row stands as a single entry (i.e. a person, company, item) whose attributes sit under the data-fields (columns), and each row has a unique identifier, a 'primary key', which distinguishes the data in that row from all other rows. Where data in the table is duplicated or diluted (i.e. it is not focused on the user), that data-field is removed from the first table and replaced by a 'foreign key', which links to a second, separate table containing the removed data-field. Each row in the second table relates to the specified 'foreign key' for data contained in the first table, thus creating links between the two tables. This linking is the essence of database construction. With the links in place, we can search for data across the available tables to create a 'database'.

Creating a simple database requires the use of a precise, uniform language to retrieve data from a number of tables. This language is SQL (Structured Query Language). I'll refer to the basic examples given in the lecture notes, as follows:

"To create a table, we can insert the following SQL commands:
create table tablename ( column1, column2, ... columnN );
...where column is the column name, followed by the column data type,
possibly followed by modifiers like 'primary key'.

To populate a table you use the insert into command as follows...
insert into Department values ( 1, 'Sales', 'London' );"

Once we have created and populated the table, we can now query it.

To query a database, we need to SELECT a data field (i.e. name, location, salary) FROM a specified table or tables (i.e. user table, location table) WHERE certain conditions are met (i.e. = equals a "precise item", > is greater than, etc.) AND where a second condition is required.

An example of an advanced SQL command is:

SELECT Fname, Lname, Dept_Name
FROM Employee, Department
WHERE Dept = 2
AND Dept = Dept_No
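
The notes don't show the Employee table itself, so, guessing at its columns from the query above and inventing a few rows, here is a little Python/sqlite3 sketch of my own that builds both tables and runs that exact query end to end:

import sqlite3

db = sqlite3.connect(":memory:")   # throwaway in-memory database
db.executescript("""
    create table Department ( Dept_No integer primary key, Dept_Name text, Location text );
    create table Employee ( Emp_No integer primary key, Fname text, Lname text,
                            Dept integer references Department(Dept_No) );
    insert into Department values ( 1, 'Sales', 'London' );
    insert into Department values ( 2, 'Accounts', 'Leeds' );
    insert into Employee values ( 1, 'Ada', 'Lovelace', 2 );
    insert into Employee values ( 2, 'Alan', 'Turing', 1 );
""")

# The 'advanced' query from the notes: employees in department 2, joined to its name
for row in db.execute("""select Fname, Lname, Dept_Name
                         from Employee, Department
                         where Dept = 2 and Dept = Dept_No"""):
    print(row)    # ('Ada', 'Lovelace', 'Accounts')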


Our practical lab session involved interrogating a database containing bibliographic data for a number of publications, using a variety of increasingly complex SQL commands to retrieve specific information from the database. Getting the SQL commands correct and retrieving something resembling 'useful' information was an uphill struggle to begin with, but improved as I became more fluent in the language. Like HTML, it is essential that the instructions you give are fully realised and executed with precision, as you are given no leeway if a single character is wrong or out of place!
The clear and correct use of commands and connectors is imperative for effective querying.
 
I have never thought about the "science" behind databases, and this session gave me a great insight into, and appreciation of, the DNA of a simple relational database. One day (technical ability abiding), I would love to be able to write a database, but until I get the hang of SQL and querying other users' databases, it may be a long day coming!!!

Saturday 15 October 2011

Getting up to speed ... and a loooooong run

Finally getting back up to speed with my blog entries. Obviously a little behind (I need to get a shift on and write about relational databases!), but other 'life' things are happening and are causing a distraction.

The major distraction this weekend is my first ever marathon. Arrrrrrgh!

I'm running the Abingdon Marathon in Oxfordshire. The race starts at 8:45am (majorly early!) and I hope it'll only take me a few hours or so to run it!

It feels like I'm about to take my final exam, or graduate or something: 17 long weeks of intensive training, and it all comes down to one day!

Once that's out of my system, I'm head-down back into study! Wish me luck and I'll update you next week as to how it all went!

DITA - Understanding Blog No. 2B: HTML and the Internet (Practical)

The practical lab exercise essentially asked us to explore HTML (hypertext mark-up language) and create some documents that we would be able to publish on the web through the University's webspace (too kind City, too kind!).

HTML, like any language, needs to be a clearly defined set of instructions in terms which must be followed and understood by the end user. The document is the mouthpiece of the creator (here, for example, our instructions are set out as ASCII text in a simple wordpad format), and the listener is the world wide web (it reads the HTML code from the wordpad document, translates it and reproduces the "ideas" in a visual form which it publishes on the designated medium i.e. as a webpage on the internet). It must be universal in application, otherwise there would be inconsistencies and misunderstandings in the content, structure and meaning of the information which we wish to communicate. It is therefore crucial that we understand how to communicate fluently in HTML, otherwise the information we wish to share will become "lost in translation".

The 'instructions' of HTML are known as tags. Examples are <p> for paragraph, which specifies that a new paragraph is to be inserted; <hr> for horizontal rule, which specifies that a horizontal line is inserted at that place in the document (presumably to act as a divider); and <ol type=""><li></li></ol> for an ordered list, which specifies that you are making a list of items which are to run in a specific order (i.e. they are numbered or lettered).
If you've ever posted on an internet forum, you might already have a flavour of what the basic tags are and how to use them (I am an absolute stickler for making things <b>bold</b>, <u>underlined</u> and using lots of pretty colours to grab your attention when reading this). The essence of tags is that they must consist of clear instructions, which fundamentally tell the WWW where and when the requested formatting of the ASCII text is to start and where it is to stop on the webpage. A start tag is the instruction in brackets <p>; the end tag is a forward slash preceding the instruction, again in brackets </p>. Tags work in pairs; if you only have one, the instruction will not be carried out as intended.

Soooo, with the basics in place, we can now confidently write a basic webpage in HTML. The example used in the lecture being:

A Simple HTML Page With Hyperlink
<HTML>
  <HEAD>
    <TITLE>A Simple HTML Page</TITLE>
  </HEAD>
  <BODY>
    A web page using HTML to produce
    a hyperlink to
    <a href="http://www.city.ac.uk/">
    City University</a>.
  </BODY>
</HTML>

The HTML page opens with a <HTML> start tag and closes with a </HTML> stop tag. This tells the receiver that we will be writing HTML code to say what we want to appear on our page. Every webpage has a HEAD, and a TITLE is contained within that. The BODY is the content that appears in the main browser window, which can include ASCII text, images and hyperlinks.

By creating more HTML webpages, you can effectively create a website by linking them together.

Here is my self-made webpage, as published on the City webspace! Liam's webpage
(note how basic it is ... I have included a few links to other webpages, an ordered and unordered list. I did create subsequent pages and an index page to link them all, but clearly I forgot to publish them. D'oh!)

Cascading style sheets (CSS) can additionally be applied to the internet browser you are using to view the HTML code as a webpage, which applies different stylistic qualities to the format, font size, background colours etc.

So if we master the language, create some content and apply a little creativity (and remember to publish it!!!) ... we can all make our thoughts accessible through HTML and the internet!



Thursday 6 October 2011

DITA - Understanding Blog No. 2A: All things Internet and World Wide Web

Our lecture opened with an analogy: if the Internet is the road infrastructure, then the WWW is the car driving down it. I like analogies :-)

The Internet is a large infrastructure connecting computers across networks. This allows us to share and access information remotely. It forms the building blocks of all online communications.

The World Wide Web (WWW) is the service or the vehicle designed to enable us to use and manage information across the global network we refer to as the Internet.

The Internet facilitates the operation of the WWW: the latter being dependent on the former. In essence, client computers (such as the everyday PCs or laptops we use to surf the web, check emails etc.) send requests for information to all-powerful server computers (which store the masses of data and documents we want to access) whenever we attempt to access an online resource such as a webpage. The server computer listens out for these requests and, by way of acknowledging them, sends back the requested information to the client computer. The lines along which these electronic communications travel are the networks, this global network of networks being the Internet.

Everything you see and touch in the online world is anchored: the resource file containing that information will be saved on a hard disk somewhere i.e. it has a physical location. In order to access that file, we need to ask for it. If we know the precise location, it becomes easy to find. We can do this using a Uniform Resource Locator (URL). A typical URL contains the name of the server, domain and the folder and/or sub-folders containing the file on the server computer.

In the lecture notes, a URL is represented using the following formula:

<protocol>://<server dns name >/<local file path in relation to server folder>

http://www.fvspartans.org.uk/clubchamps.shtml
can be broken up into
http://    www.    fvspartans   .org.uk/     clubchamps.shtml

The first two bits of information tell us that we are seeking a world wide web document and that it is to be transferred to us through the hypertext transfer protocol (HTTP). The file we seek is therefore a hypertext mark-up language (HTML) document called 'clubchamps', stored on the server machine named 'www' at 'fvspartans', which is part of the domain 'org' in the United Kingdom, or 'uk'. HTML uses a special type of language which only exists in the digital world, which links sections of documents or documents to other documents. Text marked up with links is referred to as hypertext.
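
Python's standard urllib.parse module will happily pull a URL apart along the same lines (a quick sketch of my own, not part of the lecture materials):

from urllib.parse import urlparse

url = "http://www.fvspartans.org.uk/clubchamps.shtml"
parts = urlparse(url)

print(parts.scheme)   # 'http' - the protocol used to transfer the document
print(parts.netloc)   # 'www.fvspartans.org.uk' - the server's DNS name
print(parts.path)     # '/clubchamps.shtml' - the file path on that server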

The practical side of this topic, explored fully in the lab tutorial, looked at the composition of HTML, which is largely a series of content (such as text and images) surrounded by mark-up codes (tags which define style and format).

We have been asked to generate a simple HTML document and publish it on the City University web server. Due to time constraints, I am only 60% of the way there and hence will be revisiting this topic in the concluding part of this DITA understanding blog 2B.

Sunday 2 October 2011

A Standalone statistic ...

Okay, so aside from my desire and urges to learn *everything* informatics and science (looking forward to another dose of DITA and a load of LISF tomorrow!), I took part in a race this morning. As in a running race ... not the race to get in the shower, get my clothes on and slouch in the garden/on the sofa etc etc.

Standalone 10K was that race. Set in Letchworth, Hertfordshire (a 10-minute drive away from me), it's a popular local race where my running club always has a formidable presence. Seriously, we looked like a mob of troublemakers standing on the street corner, all in our identical stripey blue club vests.

My stat is this:

I finished 52nd out of 1062 finishers. That's in the top 5% of all people who turned up and decided that running up and down the unkindest hills in the county, on the hottest Sunday in October I've ever had the pleasure of getting sweaty on, would actually be a good, *fun* idea. Categorically we're labelled "runners" which means we have a high pain threshold and do not garner any sympathy from other normal folk, who believe we are actually bonkers. Those normal folk are sadly correct in their beliefs.

Oh, and it was a personal best time for me too - 39 minutes and 39 seconds! I've wanted to go under 40 minutes over 10K for a long while now so it felt really, really great ... once I had got my breath back and recovered of course!

If I can source a picture of me being a loon on Sunday morning, I'll post it here.

Now ... back to the fun part of my day. Study!