Bibliographic databases | Deduplication

A World of
Knowledge
&
Value Creation

Our Services & Products

Digital conversion & epublishing services

Search Engine Development

NES offers a premier service for the analysis of bibliographic databases to improve their customers value and appeal. ***** Standardize your metadata with look-ups and de-duplication. ***** Organize your digital assets with appropriate classification & controlled vocabulary, and excellent search & retrieval tools available from NES. ***** NES... your best choice for the evaluation and conversion of from one format to another.

Content QA & QC

ANALYZING BIBLIOGRAPHIC DATABASES

A NES Blueprint

Types of Problems in Bibliographic Databases

Databases filled with bibliographic records accompanied by links to fulltext provide an un-paralleled service to students, scientists and researchers. These databases, however, often are plagued by errors—both common and rare—that hamper effective search and retrieval of relevant items or accurate linking. It is imperative to have clean and consistently presented data. From a retrieval standpoint, an erroneous or wrongly entered record is a lost entity. Enriched with years of expertise, NES is equipped to tackle these problems through: 1. Bibliographic database analysis; and 2. Rectification of errors and normalization of variations A database can arrive in any proprietary or other format such as MARC, CDF (comma de-limited format), TDF (tab de-limited format), TAGGED, DUBLIN Core, MS Access, HTML, and XML. NES reviews the data and prepares a specification for cleanup. Problems in databases usually are two-fold:

a) Format errors

b) Data errors

a) Format errors

One of the major errors in MARC format is the improper/wrong positioning of the data elements, which results in disarray of fields. CDF & TDF could miss some fields or position irrelevant data in fields or skip field de-limiters. Tagged format data at times misses the record de-limiters and could storm one field into another. Commonly, all these formats could have empty/null records. With decades of experience, NES is equipped to handle such situations. NES' utilities report the instances mentioned above and suggest the corrective measures to be taken.

b) Data errors

These types of errors constitute a major portion that affects the actual content of the bibliographic records. Some of the errors are listed below:

i) Missing fields

Essential fields go missing resulting in instances such as

Journal article records without journal title/title/volume/issue information
Book records without title/publisher
Records without an individual/institutional author
Records without collation/imprint
Dissertations without the institution names
Conference records without conference details

ii) Wrong representation/misplaced data in various fields

Just alphabets in year field/Year data beginning with anything other than 1 or 2
Numerals only in the author field
ISBN/ISSN with lesser/more characters than permitted
Records without title
Repetition of fields such as author repeated in title or vice-versa
Repetition of terms in index terms/keywords

iii) Messed up fields

Instances of large strings of stars, question marks, slash, etc.
Too many space/null characters within the data
Field/Record de-limiters occurring within the running text of an abstract or
full-text

iv) Bad records

Records that are less than 50K in size could have missing fields/data
Records with phrases such as Test record, check & testing, etc.

v) Sundry checks performed

Check all entities, foreign-language character sets & non-ASCII characters used in the dataset.
Examine if large text data fields (abstracts, notes, full-text, etc.) containing tables, paragraphs, indented text are properly formatted.
Evaluate journal titles/publisher names/places for consistency and construct look-up tables to standardize them
Run utilities to report non-functioning URLs in a database

vi) To analyze and rectify XML files, NES can

Assist to design, create and modify XSD schema
Provide solutions to counter discrepancies in the XML data
Validate XML files using different schemas

De-duping Bibliographic Records

A common and un-avoidable feature in database files is the occurrence of duplicate records. These records look exactly similar with or without slight variations.

Why do they occur?

Unknown duplication done at the same division due to accidental copying
of records in the input tool.
Improper bibliographic control—when two divisions of the same
organization are distributed in different geographic locations.
When two or more bibliographic databases are integrated

NES' COMPARE & COMPOSITE TECHNOLOGY

This a dynamic semi-automated process that unifies SIMILAR bibliographic records irrespective of the source of arrival. A multitude of rules have been laid down for the software to compare bibliographic elements pertaining to various publication types.

At times, minor variations exist between records from different databases because of typing or other errors. NES goes an extra mile in handling these subtle variations while de-duping them. Some examples follow:

Author Variations

Title Variations

We execute several Quality Checks on the customers’ publications.
  We have experience in performing Quality Assurance tests on
   databases/publications using specifications, support files and other
   specially developed utilities.
With nearly ten years of experience in the field, we have developed scores of program utilities to check quality and generate reports on erroneous or
misplaced contentin databases and publications.

Title Variations