Lucene Revolution 2011 Presentations


Day 1 Presentation Abstracts


The Once and Future History of Enterprise Search and Open Source

Presented by Marc Krellenstein, Lucid Imagination

While it remains challenging to build best practice search applications, core search technology has become commoditized. Open source Lucene/Solr represents the best form of that commodity. It is as good as or better than any commercial search technology while also providing the cost, control and flexibility advantages of open source. In this talk, we'll look at how past challenges in search were met and new ones evolved, and the place of Lucene/Solr in that evolution.

Watch session video.

From Publisher To Platform: How The Guardian Embraced the Internet using Content, Search, and Open Source

Presented by Stephen Dunn, Guardian News and Media UK

In 2009 The Guardian launched The Open Platform, a suite of services and tools that enable content partners and developers to build applications with The Guardian's rich content. The content API, hosted on Solr instances on EC2, contains JSON representations of all Guardian articles back to 1999 (over 1 million articles) and is an increasingly complete representation of the output of the organisation. The DataStore contains curated data sets for use in applications and visualizations.

This talk will cover how The Guardian opened up its business, enriched it, and reached new markets with its Open Platform strategy. Stephen will cover the technical architecture, the implementation of Solr (the key technology powering the platform), and how The Guardian has used it to embrace disruption in the media space while finding new sources of revenue and innovation. Two years after the launch, Stephen will cover some of the lessons learned and explain how The Guardian complements its use of Solr with other open-source non-relational technology as its platform evolves.

Watch session video.

Finite State Automata in Lucene: Internals and Applications

Presented by Dawid Weiss, Poznan University of Technology, Poland

Finite state automata and transducers made it into Lucene fairly recently, but already show a very promising impact on search performance. This data structure is rarely exploited because it is commonly (and unfairly) associated with high complexity. During the talk, I will try to show that automata and transducers are in fact very simple, that their construction can be very efficient (memory and time-wise), and that their field of applications is very broad. This will be backed by an introduction to how FSTs are implemented in Lucene (construction and traversals) and practical use cases of where FSTs have been useful so far. If you'd like to see how to squeeze 150MB of text data into a compact 1.8MB data structure, this talk is for you.
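As a rough illustration of why automata can be so compact, the sketch below builds a plain trie and then merges structurally identical subtrees, which is the basic idea behind a minimal acyclic automaton. It is a toy under stated assumptions and not Lucene's FST construction; the helper names (`build_trie`, `minimize`, `count_nodes`) are made up for this example.

```python
def new_node():
    # A state records whether a word ends here, plus outgoing labelled edges.
    return {"final": False, "edges": {}}

def build_trie(words):
    """Build an ordinary (unminimized) trie over the given words."""
    root = new_node()
    for word in words:
        node = root
        for ch in word:
            if ch not in node["edges"]:
                node["edges"][ch] = new_node()
            node = node["edges"][ch]
        node["final"] = True
    return root

def count_nodes(node):
    """Count the states of the trie."""
    return 1 + sum(count_nodes(child) for child in node["edges"].values())

def minimize(node, registry):
    """Assign one id per distinct subtree shape.

    Two states are equivalent when they agree on finality and on the
    (label, target id) pairs of their edges, so they can share one id.
    """
    signature = (
        node["final"],
        tuple(sorted((ch, minimize(child, registry))
                     for ch, child in node["edges"].items())),
    )
    return registry.setdefault(signature, len(registry))

words = ["tap", "taps", "top", "tops"]
trie = build_trie(words)
registry = {}
minimize(trie, registry)
trie_states = count_nodes(trie)       # states before sharing suffixes
minimal_states = len(registry)        # states after sharing suffixes
```

For these four words the trie needs 8 states while the merged automaton needs 5, because the shared suffixes "p" and "ps" are stored once. Savings of this kind grow with the amount of repetition in the input.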


Boosting Documents in Solr by Recency, Popularity and Personal Preferences

Presented by Timothy Potter

Attendees will come away from this presentation with a good understanding of, and access to source code for, boosting and/or filtering documents by recency, popularity, and personal preferences. My solution improves upon the common "recip" based solution for boosting by document age. The framework also supports boosting documents by a popularity score, which is calculated and managed outside the index. I will present a few different ways to calculate popularity in a scalable manner. Lastly, my solution supports the concept of a personal document collection, where each user is only interested in a subset of the total number of documents in the index. My presentation will provide a good example of how to filter and/or boost results based on user preferences, which is a very common requirement of many Web applications.
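For readers unfamiliar with the "recip" baseline mentioned above, Solr's `recip(x,m,a,b)` function query computes `a / (m*x + b)`, and it is commonly applied to a document's age in milliseconds. The sketch below only reproduces that arithmetic in Python to show the shape of the curve; it is not the speaker's improved solution, and the constant `3.16e-11` (roughly one over the number of milliseconds in a year) is the usual illustrative choice rather than something taken from the talk.

```python
MS_PER_YEAR = 3.16e10  # approximate number of milliseconds in one year

def recip(x, m, a, b):
    """Same formula as Solr's recip function query: a / (m*x + b)."""
    return a / (m * x + b)

def recency_boost(age_ms, m=3.16e-11, a=1.0, b=1.0):
    """Boost that starts at 1.0 for a brand-new document and decays with age."""
    return recip(age_ms, m, a, b)

boost_new = recency_boost(0)                   # document indexed right now
boost_one_year = recency_boost(MS_PER_YEAR)    # document about one year old
boost_ten_years = recency_boost(10 * MS_PER_YEAR)
```

With these parameters a new document gets a boost of 1.0, a one-year-old document roughly half that, and older documents progressively less, without the boost ever reaching zero.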

Watch session video.

Real-time Search at Yammer

Presented by Boris Aleksandrovsky, Yammer, Inc

This talk will focus on the architecture, scalability concerns, performance bottlenecks, operational characteristics and lessons learned while designing and implementing Yammer's distributed real-time search system. Yammer is an enterprise social network SaaS offering with over 100,000 networks (including 85% of the Fortune 100) and nearly 2 million users. The search system we developed scales well up to 1B messages and serves as a foundation for the knowledge base analysis services Yammer is developing.

Watch session video.

Case Study - Panasonic Europe Powered by Apache Solr

Presented by Daniel Potzinger, AOE media GmbH

In 2010 Panasonic made the decision to replace their legacy enterprise search tool and switched the search for all their European websites to an Apache Solr-based solution.

Now their customers benefit from an incredibly fast and feature-rich solution that is much more than just a search and has become a valuable sales-driving tool for Panasonic. Features like relevancy manipulation, autosuggest, and contextual filtering for properties like color or product category were implemented under less than ideal circumstances, mainly because there was no access to structured data. The search has been rolled out in close to 30 countries so far, which has also put Solr's multi-lingual handling to the test.


Searching The United States Code with Solr/Lucene

Presented by Ronald Matamoros, Search Technologies

What are the challenges in searching an 85-year-old document? The United States Code was published by the United States Congress in 1926 as a single bound volume containing all of the general and permanent laws of the United States Government. It has been updated every year since and has grown into a 30 volume set of some 40,000 pages divided into 50 titles.

The talk will cover the challenges of searching this collection and the specific Solr and Lucene solutions and plug-ins implemented at each point, including hierarchical browsing of the TOC, searching and highlighting sub-sections of documents, custom query features, and search user interface components. The implementation required custom token filters, query parsers, document parsing and processing, and Span operators.


Jazzed about Solr: People as a Search Problem

Presented by Joshua Tuberville, eHarmony

Search oriented architectures are obvious approaches for web pages, emails, documents, and other text based entities. Often with traditional structured data, text searching is "added on" to the traditional Boolean queries in relational stores. When Jazzed was initiated we wanted search to be front and center. When we evaluated Solr we realized we could take the opposite approach and "add on" Boolean components to textual searches. This hybrid query approach makes transitioning to flexible ranking easy and straightforward. In this talk we will cover:

  • How we model semi-structured user data in Solr
  • Indexing strategies and their tradeoffs
  • Where in the Jazzed architecture Solr does and doesn't fit
  • What aspects of Solr we are using
  • Future considerations

Watch session video.

Heavy Committing: DocValues aka. Column Stride Fields in Lucene 4.0

Presented by Simon Willnauer, Apache Lucene PMC

Lucene 4.0 is on its way to deliver a tremendous amount of new features and improvements. Besides Real-Time Search and Flexible Indexing, DocValues (aka Column Stride Fields) is one of the "next generation" features. DocValues enable Lucene to efficiently store and retrieve type-safe Document & Value pairs in a column stride fashion, either entirely memory resident with random access or disk resident and iterator based, without the need to un-invert fields. Its final goal is to provide independently updatable per-document storage for scoring, sorting or even filtering. This talk will introduce the current state of development, implementation details, its features and how DocValues have been integrated into Lucene's Codec API for full extensibility.
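To make the "column stride" idea concrete, the sketch below contrasts an inverted view of a numeric field (value to document ids) with a per-document column addressed by document id. It is a conceptual toy in Python, not Lucene's DocValues API; the names `postings`, `doc_values`, `uninvert` and `sort_by_value` are invented for the example.

```python
from array import array

def uninvert(postings, num_docs):
    """Rebuild a per-document column from value -> [doc ids] postings.

    This is the extra pass a sorter needs when only the inverted
    structure is available.
    """
    column = [0] * num_docs
    for term, doc_ids in postings.items():
        for doc_id in doc_ids:
            column[doc_id] = int(term)
    return column

def sort_by_value(doc_ids, column):
    """Sort document ids by their per-document value (stable)."""
    return sorted(doc_ids, key=lambda doc_id: column[doc_id])

# Inverted view: each distinct value points at the documents containing it.
postings = {"10": [0, 3], "25": [1], "7": [2]}

# Column-stride view: one typed 64-bit slot per document, indexed by doc id.
doc_values = array("q", [10, 25, 7, 10])

rebuilt = uninvert(postings, 4)
ranking = sort_by_value([0, 1, 2, 3], doc_values)
```

Both views hold the same information, but the column can be read directly by document id when sorting or scoring, whereas the inverted view has to be walked term by term first.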

Watch session video.

Integrating Advanced Text Analytics into Solr

Presented by Steve Kearns, Basis Technology

Text analytics provides a number of interesting analytic capabilities that can enhance enterprise search applications, though in practice it is not always obvious how these can be integrated effectively into Solr. This presentation will describe some of the practical ways that leading organizations are using text analytics by integrating them directly into Solr and their user interface to improve relevance, navigate results, and discover new information. The combination of Solr and quality text analytics can improve existing keyword search solutions, and enable new ways of discovering knowledge hidden in existing data.


Search, APIs, capability management and the Sensis journey

Presented by Craig Rees, Sensis

Earlier this year, Sensis launched its Business Search API, which allows publishers to develop local search propositions powered by the two million business listings contained in the Australian Yellow Pages® and White Pages® directories.

This case study will explore Sensis' strategic direction for search and explain how the framework and metrics by which search is managed at Sensis were used to define our search roadmap. Key architectural decisions including our use of Solr and MongoDB will be discussed as well as our approach to real-time search tuning and quality management.

Watch session video.

Four Pillars of Designing the Search Experience

Presented by Tyler Tate, Twigkit

Lucene and Solr provide many excellent tools for presenting information to users, but what makes some search user interfaces better than others? Should you aim for a rich, advanced UI or should you "just make it look like Google"?

Through his work at TwigKit with blue-chip corporations, scientific institutes, and governments, Tyler has identified four guiding pillars of the search experience:

  • User Expertise - Novices orienteer, experts teleport
  • User Behaviour - Lookup, learn, and investigate
  • Information Diversity - homogeneous vs. heterogeneous data
  • Situational Context - factors from the surrounding environment

We'll delve deep into each dimension and discuss how to achieve useful, useable, and beautiful search interfaces using design patterns including: autocomplete, faceted navigation, breadcrumbs, best bets, related searches, spelling suggestions, clickable metadata, result clustering, saved searches, data visualisation, and more.

Watch session video.

A Study of I/O and Virtualization Performance with a Search Engine based on an XML database and Lucene

Presented by Ed Bueche, EMC

Documentum xPlore provides an integrated search facility for the Documentum Content Server. The standalone search engine is based on EMC's xDB (a native XML database) and Lucene. In this talk we will introduce xPlore and some of its key components and capabilities. These include aspects of a tight integration of Lucene with the XML database: XQuery translation and optimization into Lucene queries/APIs, as well as transactional updates of Lucene. In addition, xPlore is being deployed aggressively into virtualized environments (both disk I/O and VM). We cover some performance results and tuning tips in these areas.


Using Solr in Online Travel Shopping to Improve User Experience

Presented by Esteban Donato, Sudhakara Karegowdra, Travelocity

In this talk we will present three different use cases of Solr in the travel industry. First, we will describe how we implemented faceted navigation for hotel shopping. Then, we will introduce how we implemented destination search functionality such as auto-complete and misspelling correction. Lastly, we will show how we integrated Solr to provide better experiences to mobile users.


Solr @ eBay Kleinanzeigen

Presented by Olaf Zschiedrich,

Attendees will learn how eBay Germany has implemented Solr, why Solr was selected, which Solr features are utilized, and how Solr is configured and used in production. Recommended best practices will be profiled along with eBay Kleinanzeigen's plans for future deployment of Solr.

Watch session video.

Rapid Prototyping with Solr

Presented by Erik Hatcher, Lucid Imagination

Got data? Let's make it searchable! This interactive presentation will demonstrate getting documents into Solr quickly, will provide some tips on adjusting Solr's schema to match your needs better, and finally will discuss how to showcase your data in a flexible search user interface. We'll see how to rapidly leverage faceting, highlighting, spell checking, and debugging. Even after all that, there will be enough time left to outline the next steps in developing your search application and taking it to production.

Watch session video.

Search Analytics: What? Why? How?

Presented by Otis Gospodnetic, Sematext

You've indexed your data and people are searching it. But how do you know if they are happy with the results? How do you know if they are finding what they need? With search increasingly becoming the primary information access mechanism, knowing how your search is doing is not just a matter of mere curiosity, but often has direct business impact.

In this talk we'll discuss Search Analytics and how it can be used to answer questions like:

  1. Are too many users getting the dreaded "no matches" results?
  2. How deep into search results do people dig?
  3. Which hits are they clicking on, or what percentage of them don't click on any hits?
  4. How much do they use the Did You Mean or Auto-Complete suggestions?

We'll explore what specific Search Analytics reports tell us and what specific actions you should take based on those reports.
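The questions above map naturally onto a few aggregate metrics over a query log. The sketch below shows one possible way to compute them; the log format (a list of dicts with `hits` and `click_rank` keys) and the function name `search_report` are assumptions made for illustration and are not taken from the talk or from any particular analytics product.

```python
def search_report(log):
    """Summarise search events.

    Each event is a dict with two hypothetical keys:
      'hits'       - number of results returned for the query
      'click_rank' - 1-based rank of the clicked hit, or None if no click
    """
    total = len(log)
    if total == 0:
        return {"zero_result_rate": None, "no_click_rate": None,
                "mean_click_rank": None}
    zero_results = sum(1 for event in log if event["hits"] == 0)
    click_ranks = [event["click_rank"] for event in log
                   if event["click_rank"] is not None]
    return {
        # Share of queries that returned "no matches".
        "zero_result_rate": zero_results / total,
        # Share of queries where the user clicked nothing.
        "no_click_rate": (total - len(click_ranks)) / total,
        # How deep into the results people click, on average.
        "mean_click_rank": (sum(click_ranks) / len(click_ranks)
                            if click_ranks else None),
    }

log = [
    {"hits": 12, "click_rank": 1},
    {"hits": 0, "click_rank": None},
    {"hits": 40, "click_rank": 3},
    {"hits": 5, "click_rank": None},
]
report = search_report(log)
```

In this toy log a quarter of the queries return nothing and half end without a click, which are the kinds of signals that would prompt a closer look at synonyms, spelling suggestions or ranking.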


"Stump The Chump": Get On The Spot Solutions To Your Real Life Solr/Lucene Challenges

 Presented by Grant Ingersoll, Lucid Imagination

Got a tough problem with your Solr or Lucene application? Facing challenges that you'd like some advice on? Looking for new approaches to overcome a Lucene/Solr issue? Not sure how to get the results you expected? Don't know where to get started? Then this session is for you.

Now, you can get your questions answered live, in front of an audience of hundreds of Lucene Revolution attendees! Back again by popular demand, "Stump the Chump" at Lucene Revolution 2011 is hosted by PMC chairman and Lucid Imagination co-founder Grant Ingersoll. Grant's going on the hot seat in front of your peers, to tackle questions live.

Our MC will read the questions, and Grant will have to formulate a solution on the spot. A panel of judges will decide if he has provided an effective answer. Prizes will be awarded by the panel for the best question - and for those deemed to have "stumped the chump".

Watch session video.

Day 2 Presentation Abstracts



All Data Big and Small

Presented by Stephen O'Grady, Redmonk

The last twenty four months have seen a veritable explosion in discussion around what is commonly referred to as Big Data and the infrastructure technology employed to manage it. The wealth of available open source software means that businesses from any industry have easily accessible tools with which to tackle projects that would have been out of their reach just a few years prior. Less heralded, however, has been the fact that making data actually useful - whatever its size - remains a challenge. In this session we'll explore the role of search in putting data - big and small - to work answering the important questions for businesses and society by reducing the friction between question and answer.

Watch session video.

Highly Relevant Search Result Ranking for Large Law Enforcement Information Sharing Systems

Presented by Ronald Mayer, Forensic Logic

Law enforcement data has many interesting complexities for search. Cross-agency searches are even more challenging because each agency has its own shorthand. Many different types of similarity between search clauses and documents should influence the ranking of results. For example, a search clause mentioning a "tall suspect" might want to include results with "6 foot 4 suspect". Spatial clusters are important, as are temporal patterns. Different fields may be more or less important depending on the type of crime - for example, a victim's race may matter more than a vehicle's make in a sex crime, but less in an auto theft. Also, documents may be related to each other in various ways that may also affect their ideal search ranking.

Solr's great flexibility in its analyzers, filters, synonyms, and boosting makes it an excellent tool for such diverse requirements.

We've contributed a patch to Solr (#SOLR-2058) that helped us further improve search result ranking for cases where a search for a suspect with a 'red baseball cap, black leather jacket' is compared against many documents mentioning red caps, black caps, etc.

This presentation will describe how we addressed some domain-specific challenges of our data.


Intuit's Live Community

Presented by Floyd Morgan, Intuit

TurboTax Live Community is a large scale web application that uses user contribution and open source technology to help millions of TurboTax users complete their tax returns. Other benefits from Live Community include reduced support calls, highly effective advertising campaigns, usability engineering and, new for this year, conversion prediction analytics. I will present how Solr/Lucene powers the many facets of TurboTax Live Community now and in the future.

Watch session video.

Using Solr/Lucene/LWE for eCommerce

Presented by Grant Ingersoll, Lucid Imagination

If your user can't find it, they can't buy it, right? In this talk, Apache Lucene and Solr committer Grant Ingersoll will discuss architecture, techniques and tips for successfully deploying search tools like Lucene, Solr and LucidWorks Enterprise in eCommerce environments.

Watch session video.

Flexible Indexing in Lucene 4.0

Presented by Uwe Schindler, SD DataSolutions GmbH

Apache Lucene's next major release, 4.0, will introduce lots of flexibility into indexing, but also fundamental changes to the well-known APIs: It features a new and consistent, 4-dimensional iteration API on top of a low-level, pluggable codec API giving applications full control over the postings data. Terms are now arbitrary opaque bytes enabling users to store terms in any encoding, not necessarily UTF-8, natively in the index (e.g. numeric fields). Currently under development is a higher performance postings iteration API, enabling interesting codecs based on recent encoding algorithms to work effectively. Several codecs have already been created, including the default "standard" codec, which enables sizable RAM reduction for searchers, and a "pulsing" codec that inlines postings data directly into the terms dictionary, which provides a solid performance boost for primary key fields. A lot of new codecs are under development like "PFOR", "FOR", "AFOR", or "Simple64". In this talk, Uwe presents an overview of all of these exciting changes, as well as several concrete, real-world examples of how applications can tap into these new features.


Handy Installation Tool "Anuenue" for Solr Cluster, and Implementation of "Did you mean" Facility for Queries in Japanese.

Presented by Takahiko Ito, mixi

mixi is one of the largest social networking services in Japan, providing various communication services for over 14M monthly active users. The latest internal mixi project is to replace the in-house search engine with Apache Solr. This session covers two topics; a simple packaging system for Solr that eases the installation process and daily operations, and implementation of a "Did you mean" facility for Japanese queries using a log mining tool. These tools have been released as OSS projects.



Extending Solr: Behind CareerBuilder's Cloud-like Knowledge Discovery Platform

Presented by Trey Grainger, CareerBuilder

For CareerBuilder, a 1% deviance in search relevancy can mean millions of missed job opportunities for our users. When CareerBuilder moved to Solr from an expensive, proprietary search vendor, our top priorities were maintaining the quality of our search results and drastically improving our agility. This talk will describe how we addressed both needs. For search quality, we'll cover some of our internal studies and resulting methods for dealing with multi-lingual content across dozens of languages, as well as customizing and experimenting with relevancy calculations. For platform agility, we'll discuss CareerBuilder's cloud-like search API framework which seamlessly handles millions of searches an hour, processes hundreds of millions of documents, and is powered by hundreds of globally-distributed servers. Come hear the results of our studies and some best practices for quality and performance. Learn how our framework has led to staggering improvements in both maintainability and technology innovation, allowing us to learn from our content, not just find it.


Transforming the House Hunting Experience: How Solr is Helping Trulia Reshape the Real Estate Industry

Presented by Alexander Kanarsky, Trulia

Trulia is a real estate search company that helps customers find homes for sale or to rent and provides them with information to help them make better decisions in the process. It is also a hub for real estate professionals to market their listings, view real estate data and promote their services.

The presentation describes how Solr helped Trulia to transform the traditional real estate experience and make real estate data accessible and understandable to millions of users. It discusses approaches we took to achieve this by using custom-built distributed index management, indexing integration with Hadoop and geospatial search enhancements to Solr.

Watch session video.

Implementing Click-through Relevance Ranking in Solr and LucidWorks Enterprise

Presented by Andrzej Bialecki, Lucid Imagination

This talk will present what click-through events are and how to process them with LucidWorks Enterprise. This innovative technique puts powerful search and relevancy at your fingertips -- at a fraction of the time and effort required to program them yourself with native Apache Solr. Andrzej will discuss and present how you can use LucidWorks Enterprise for:

  1. Click Scoring to automatically configure relevance for most popular results
  2. Simplified implementation of auto-complete and "did-you-mean" functionality
  3. Unsupervised feedback to automatically provide relevance improvement on every query


Building specialized industry applications using Solr, and migration from FAST ESP

Presented by Rahul Agarwalla, Uchida Spectrum Inc

Uchida Spectrum, Inc. is a leader in the Japan search market. USI provides SMART/InSight, a search application used by many Fortune 500 companies for specialized industry applications such as R&D, quality assurance for manufacturing, and claims and customer management.

Originally SMART/InSight was based on Microsoft FAST. This talk will review how SMART/InSight has migrated from FAST ESP to LucidWorks Enterprise, and how SMART/InSight incorporates virtual data integration, enterprise search, and the ability for users to have a unified way to navigate diverse data sources, analyze data more easily, and personalize results. Several real-world use cases will be profiled with demonstrations.


Using SOLR For Enabling Highly Customized Sitewide Navigation

Presented by Shantanu Deo, AT&T

The organization needed to enable a very customizable form of Global Navigation for the various types of users (based on their profile and other factors). This would normally have involved complex logic to figure out the appropriate set of links to show for a customer, and would have been a maintenance nightmare. Instead we approached the problem as a search problem. Coupled with a novel encoding scheme, this let us solve the problem simply by using SOLR to index the data, searching on the customer's profile groups, and returning a coherent global navigation.

This has resulted in a solution that is very simple to understand and maintain, and that will stand us in good stead in the future.

The presentation is meant to be a description of using SOLR to implement a real-world application.
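The "navigation as a search problem" idea can be pictured with a tiny in-memory inverted index: each navigation link is a document tagged with the profile groups allowed to see it, and building a user's menu becomes a query on that user's groups. The sketch below is only an illustration under that assumption; it does not reproduce the encoding scheme from the talk, and the link labels, group names and function names are hypothetical.

```python
from collections import defaultdict

def build_index(links):
    """Map each profile group to the ids of the links it may see."""
    index = defaultdict(set)
    for link_id, (label, groups) in links.items():
        for group in groups:
            index[group].add(link_id)
    return index

def navigation_for(profile_groups, index, links):
    """Return the labels of all links matching any of the user's groups."""
    matched = set()
    for group in profile_groups:
        matched |= index.get(group, set())
    return sorted(links[link_id][0] for link_id in matched)

# Hypothetical navigation links: id -> (label, groups allowed to see it).
links = {
    1: ("Home", {"consumer", "business"}),
    2: ("Bulk Billing", {"business"}),
    3: ("Family Plans", {"consumer"}),
}
index = build_index(links)
business_nav = navigation_for({"business"}, index, links)
consumer_nav = navigation_for({"consumer"}, index, links)
```

Adding a new link or a new audience then means indexing one more record rather than editing branching logic, which is the maintenance benefit the abstract describes.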


Using Solr to find the Right Person for the Right Job

Presented by Laura Kang, TheLadders

In this talk, we'll describe how TheLadders uses Lucene/Solr to instantly recommend candidates to a recruiter when he/she posts a job on the recruiter site. Our matching algorithm scores candidates from our job seeker site based on the criteria and description of jobs and job seekers' resume and profile data. This helps recruiters quickly identify candidates that are right for the job and increases the chance of our job seekers getting hired.

The talk covers an overview of our Solr architecture and a description of our matching algorithm. We'll also discuss criteria for evaluating the algorithm, including an overview of our testing sessions and their format. Finally, we'll also demo the feature so you can see how it works in practice.

Watch session video.

The Seven Deadly Sins of Solr

Presented by Jay Hill, Lucid Imagination



Advanced Search and Analytics in 20 Minutes

Presented by Mark Davis, Kitenga

Kitenga's ZettaVox and ZettaSearch products support SOLR and Lucene ecosystems at both the ingestion point and for the search user. In this talk, I will show how ZettaVox, our professional content mining platform on Hadoop, can be used to index content and rich metadata into a LucidWorks Enterprise installation. Being built on Hadoop, ZettaVox scales up by scaling out. I will then create an end-user search and analytics experience using our ZettaSearch solution that leverages the faceted metadata to enhance information discovery and analysis. All in about 20 minutes.


Solr and Lucene at Etsy

Presented by Gregg Donovan, Etsy

Etsy is using Solr and Lucene to serve queries at a rate of more than 8 billion per year (and growing). In this case study, we will describe how Etsy has integrated Solr/Lucene into our continuous deployment infrastructure, allowing for Solr configuration, Java-based indexers, and query parsing logic to go from passing tests to production code in minutes. We'll also discuss how we're leveraging Solr's new Geo-search to power both local item search and GeoIP-personalized location autosuggest.

We'll also share how we've extended Solr, adding personalized faceting and filtering as well as multi-currency sorting and filtering that accounts for realtime currency fluctuation (contributed in SOLR-2202). [Note that code will be open-sourced/contributed for both of these features.] We will share our real-time monitoring techniques, including how we track Solr replication, query, and GC times in Ganglia. Finally, we'll discuss how we've used Hadoop-based user analytics to improve relevance and power data-driven spelling corrections, autocomplete suggestions, and related searches.
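The core of multi-currency sorting is converting each listed price into one display currency at the current rate before comparing. The sketch below shows that step only; it is not the SOLR-2202 implementation, the exchange rates are made-up round numbers, and the item data and function names are hypothetical.

```python
# Hypothetical rates: units of each currency per 1 USD (illustrative only).
RATES_PER_USD = {"USD": 1.0, "EUR": 0.5, "GBP": 0.25}

def convert(amount, from_currency, to_currency, rates):
    """Convert an amount between currencies via USD."""
    usd = amount / rates[from_currency]
    return usd * rates[to_currency]

def sort_by_price(items, display_currency, rates):
    """Sort items by price as seen in the shopper's display currency."""
    return sorted(
        items,
        key=lambda item: convert(item["price"], item["currency"],
                                 display_currency, rates),
    )

items = [
    {"name": "mug", "price": 10, "currency": "EUR"},
    {"name": "scarf", "price": 15, "currency": "USD"},
    {"name": "ring", "price": 4, "currency": "GBP"},
]
ordered = [item["name"] for item in sort_by_price(items, "USD", RATES_PER_USD)]
```

Because the rates table is consulted at sort time, refreshing it changes the ordering without touching the stored prices, which is one way to account for currency fluctuation.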

Watch session video.

Solr Performance: Key Innovations

Presented by Yonik Seeley, Lucid Imagination

Recent developments in Solr/Lucene have made significant contributions to distributed search processing, scalability, and throughput. In this talk, Yonik Seeley, creator of Solr, will survey key performance strategies for building search applications with Solr, and review innovations included in Solr 3.1, as well as forthcoming development work in Solr 4.0 and beyond.

Watch session video.

Building SaaS Solutions for Online Media Using Apache Solr

Presented by Alberto Mijares, Canoo Engineering AG

In recent years, the idea of building applications that can be used remotely by means of the Web has given rise to a new concept called "Software as a Service". Such applications have two advantages: a remote web deployment that can be used instantly by potentially any consumer on the internet, and the cost reduction that a Web-based deployment provides.

In this talk, the speaker explains the architecture of an innovative "Software as a Service" solution built for the Axel Springer media group in Switzerland.

This application is capable of remotely extracting the content of multiple online newspaper articles, analyzing them, and classifying them to determine which articles are the most similar to a given one. This information is then integrated back into the article to provide the user with a "related articles" feature. The key points that made this "SaaS" application successful are the low integration effort, the minimal TCO, the superior quality of the results, and the capability to integrate information across different websites with a pragmatic approach.

The core components of the analysis process are: language-specific tools (used to filter the superfluous language terms) and semantic knowledge bases (like Wikipedia, used to enrich the indexed information with new context specific terms, or to disambiguate the extracted terms).
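One simple way to picture "most similar article" matching after superfluous terms are filtered out is cosine similarity over term-frequency vectors. The sketch below is a generic illustration of that technique, not the pipeline described in the talk; the stop-word list is a tiny placeholder, there is no knowledge-base enrichment, and the function names are invented.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list; a real one would be language-specific.
STOPWORDS = {"the", "a", "of", "in", "and", "to"}

def term_vector(text):
    """Lower-case, split on whitespace, and drop stop words."""
    return Counter(t for t in text.lower().split() if t not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def most_similar(target, candidates):
    """Return the candidate text closest to the target article."""
    target_vector = term_vector(target)
    return max(candidates,
               key=lambda text: cosine(target_vector, term_vector(text)))

article = "election results in the city"
candidates = [
    "city election results announced",
    "football match ends in a draw",
]
related = most_similar(article, candidates)
```

A search engine performs this kind of comparison far more efficiently over an index, but the ranking idea is the same: articles that share more of the informative terms score closer to the given one.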

On a more technical level, the speaker will explain the criteria used to select the emerging enterprise search framework Apache Solr as the platform, and how it drastically reduced the development effort required.

As a summary, a list of the key achievements and conclusions will be presented to the public, pointing out the maturity and robustness of Apache Solr as a flexible and open-source based enterprise search platform.

Using Solr Cloud to Tame an Index Explosion

Presented by Jon Gifford, Loggly

We have hundreds of customers, each of whom may have dozens of shards. To manage this explosion of indexes, I'll describe how we're using Solr Cloud to manage every index - from creation, through migration from box to box, and finally destruction. I'll describe some of the performance issues we had to deal with, especially with ZooKeeper.

Lucene @ Yelp

Presented by Sudarshan Gaikaiwari, Yelp

This talk describes how Yelp uses Lucene to provide search services. It includes:

  • Statistics of Yelp search usage
  • Overview of Yelp search architecture: Yelp uses different services to provide searches for different types of data. Some are based on Lucene and some on SOLR
  • Deeper dive into business and review search. This is the most important search service at Yelp.

We will cover:

  1. Yelp's implementation of a micro sharded architecture and differences with Katta.
  2. Yelp's extensions to Lucene to implement features such as filters, and a performance comparison with Solr/Bobo
  3. Yelp's implementation of index replication.
  4. Various tricks used at Yelp to make the service faster.


Watch session video.

CPython Embedded in Solr - Search Solution for Python Lovers With the Speed of Native Java

Presented by Roman Chyla, CERN

SPIRES is the biggest bibliographic database for High Energy Physics, ArXiv is the biggest fulltext repository for fulltext papers in High Energy Physics, and INSPIRE is the biggest digital library that merges the two. We must work with result sets bigger than 1 million for citation-related queries, and our partners from Astrophysics work with sets of 6 million; however, INSPIRE is written in Python. So how do we move several million result sets between the two systems fast? How do we take advantage of our special NLP processing pipeline written in Python? How do we join them? We do not use Jython. We do not use pipes. We do not embed Solr inside INSPIRE. We embed INSPIRE into Solr! The talk shows benefits and challenges of this surprisingly elegant solution.

Watch session video.

Lots of Facets, Fast

Presented by Anne Veling, BeyondTrees

We created a web application for a well-known US newspaper: a maps-like zooming interface on top of the 60,000 newspapers published since 1850, using Solr over the 28,000,000 articles to create an interactive heatmap. Using domain knowledge, we optimized the out-of-the-box faceting solution by an order of magnitude, which allowed us to create a great visual way of exploring trends in historical newspapers.