Brief description: 

A large-scale document search service that automatically categorizes and relates texts by content, regardless of the language in which they are written. Texts are categorized against around 500 categories drawn from the EUROVOC thesaurus. The service is available through an HTTP RESTful API and a web interface. It is built on top of librAIry, a document management framework that combines natural language processing techniques and machine learning algorithms to process texts efficiently. Within TheyBuyForYou, the European project in which it was developed, the search service relates contracts and tenders published throughout Europe.

 

Main Features: 

Through an HTTP RESTful API, large document collections can be explored by retrieving the documents related to a given text or document. Users can filter results by language and narrow the search set by document tags. Bulk operations on documents (add, delete, list) can also be performed through this interface. A minimalist view of the document collection is also available through a web-based portal, where users can build increasingly complex filters with mouse selections.
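As an illustration of the kind of request described above, the following minimal Python sketch builds a query for documents related to a given text, restricted by language and tags. The base URL, the `/documents/search` path and the `text`, `langs`, `tags` and `size` field names are all assumptions made for this example; the real endpoint and schema should be taken from the service's Swagger documentation.

```python
import json
import urllib.request

# Hypothetical base URL of a local deployment (assumption, not from the service docs).
BASE_URL = "http://localhost:8080"


def build_similar_docs_request(text, langs=None, tags=None, size=10):
    """Build (but do not send) a POST request asking for documents related to `text`.

    The endpoint path and JSON field names are illustrative assumptions.
    """
    payload = {"text": text, "size": size}
    if langs:
        payload["langs"] = list(langs)  # filter results by language
    if tags:
        payload["tags"] = list(tags)    # narrow the search set by document tags
    return urllib.request.Request(
        url=f"{BASE_URL}/documents/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


req = build_similar_docs_request(
    "Supply of medical equipment", langs=["es", "en"], tags=["tender"]
)
```

Passing `req` to `urllib.request.urlopen` would send the query to a running instance. Building the request separately from sending it keeps the sketch checkable without network access.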

 

Areas of Application: 
  • Public Procurement Data
  • Any dataset related to EUROVOC categories
Market Trends and Opportunities: 

Searching for similar documents and exploring the major themes covered across groups of documents are common actions when browsing collections of multilingual texts. This manual, knowledge-intensive task may become less tedious, and may even lead to unforeseen relevant findings, if unsupervised techniques are applied to help users. The Cross-lingual Document Search Engine uses unsupervised learning techniques to process multilingual texts, establishing associations between them based on their content. Each document is classified by thematic levels according to a reference model, which serves as the context from which relationships emerge dynamically.

 

Customer Benefits: 
  • Easily integrated into complex business processes through its REST API
  • Relates texts written in different languages without the need for translation
  • Remote or local deployment based on virtual images
  • Based on web standards (Swagger, JSON, REST principles)

Technological novelty: 
  • This solution provides a value-added service to end users and domain experts interested in exploring large collections of public procurement documents.
  • Based on probabilistic topic models, it does not need to translate texts written in different languages to relate them.
  • Ready-to-use API based on virtual deployment
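
To make the topic-model idea concrete: if every document, whatever its language, is described as a probability distribution over the same set of topics, two documents can be compared directly through those distributions, with no translation step. The Python sketch below assumes Jensen-Shannon divergence as the comparison measure; the listing does not state which measure the service actually uses, and the topic vectors are made-up values for illustration only.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions.

    With base-2 logarithms the result lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    # Mixture distribution halfway between p and q.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x_i == 0 contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def topic_similarity(p, q):
    """Similarity in [0, 1]: 1 for identical topic mixes, 0 for disjoint ones."""
    return 1.0 - js_divergence(p, q)


# Hypothetical topic distributions over three shared topics (illustrative values).
tender_es = [0.70, 0.20, 0.10]    # a Spanish tender
contract_en = [0.65, 0.25, 0.10]  # an English contract on a similar theme
notice_fr = [0.05, 0.15, 0.80]    # a French notice on a different theme

sim_related = topic_similarity(tender_es, contract_en)
sim_unrelated = topic_similarity(tender_es, notice_fr)
```

In this sketch the Spanish tender scores as closer to the English contract than to the French notice, because only the topic proportions are compared and the surface language plays no role.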

 

Workflow: 
Published
Component / Service / App
Cross-lingual API
cross-lingual-explorer

Owner

Carlos Badenes-Olmedo
Spain
Type: 
BDVA member
Contact: 
BDV Reference categories: 
Data Management
Data Visualisation and Interaction
Markets: 
Information Service activities
Public administration (eGovernment)
Tourism
Readiness Level: 
Licensing: 
Apache 2.0
Keywords: 
document similarity
multilingual
categorization