Presentation Abstracts - Day 2
Integration of Solr With Web Crawling
Andrzej Bialecki, Lucid Imagination
This talk will describe issues involved in scalable web crawling and web search, and explain how to integrate Apache Solr as a search platform with web crawling functionality, using existing web crawling platforms: Nutch, Aperture and Lucene Connectors Framework.
Cloud-scale Search at Salesforce.com
Bill Press, Software Development Manager, Salesforce.com
How do you deploy Lucene to support millions of searches per day, by hundreds of thousands of users (each with distinct privacy settings), over tens of thousands of document sets containing both structured and unstructured data, all the while indexing hundreds of millions of document updates per day? In this talk, we will discuss the scalability challenges of search at salesforce.com, our current architecture, and the new challenges posed by new product lines, including Chatter, our new collaboration and social networking application for the enterprise.
Solr 3.1 and Beyond
Yonik Seeley, Lucid Imagination
The Solr/Lucene community is hard at work designing and developing a range of new features and fixes for Apache Solr, advancing the frontiers of search. Solr creator Yonik Seeley will provide a preview survey of these developments, and talk about how one can leverage new functionality. Topics will include new faceting functionality, new function queries, increased scalability, field collapsing, and spatial search. The talk will span features already included in trunk, features slated for the next release, as well as incomplete features under consideration for future releases.
How Open Source Leads Infrastructure Innovation
Marten Mickos, CEO, Eucalyptus Systems
Everything is online or is going online, with mobile computing accelerating the transition. A set of highly innovative infrastructure software products and technologies, mainly open source, is driving that shift. But how will information flow across this new cloud and mobile infrastructure, and how will it be found and accessed? In his presentation, Marten looks at how open source technology and business models are driving change across all the infrastructure layers - and how it affects enterprise computing built on this new foundation.
Lucene/Solr Roadmap Panel: What's Coming
Yonik Seeley, Grant Ingersoll, Chris Hostetter, Lucid Imagination; Michael McCandless, IBM; Michael Busch, Twitter
The Solr/Lucene community is hard at work designing and developing a range of new features and improvements for Apache Lucene and Solr, advancing the frontiers of search. This panel of ASF committers will provide a preview of these developments, and will talk about how you can leverage this new functionality. Topics will include increased scalability, improved performance, and new features. Come hear what's on the horizon in this interactive session hosted by Grant Ingersoll.
How Cisco’s Pulse Leverages Solr/Lucene To Put Social Networks To Work
Presented by Sonali Sambhus, Senior Search Architect & Engineering Manager, Cisco Systems, Inc.
Cisco's new Pulse(TM) is a powerful platform that uses embedded Lucene/Solr search technology to tag and index key terms and topics from a broad range of media -- from email to video -- in real time. Tapping into internal communications traffic, it helps find expertise from within the enterprise's internal social network. Cutting-edge enterprise search techniques were developed at Cisco with the help of Lucid Imagination.
This in-depth technical workshop covers how the Cisco team designed and optimized Pulse with Lucene and Solr, on topics including:
- How Cisco’s Pulse leverages Solr/Lucene
- Optimizing stored field retrieval performance for real time search
- Operational optimization with full index hot backups
- Performance efficient methods for highlighting text
Fun with Flex: an Introduction to Flexible Indexing, coming in Lucene 4.0
Presented by Michael McCandless, Senior Software Engineer, IBM
Flexible indexing is one of the new features in Lucene's next major release, 4.0. It includes big changes to a number of places in Lucene: a new, higher performance postings iteration API; terms as arbitrary opaque bytes (not chars); direct visibility and control of deleted documents; a low-level, pluggable codec API giving applications full control over the postings data. Several interesting codecs have already been created, including the default "standard" codec, which enables sizable RAM reduction for searchers, and a "pulsing" codec that inlines postings data directly into the terms dictionary, which provides a solid performance boost for primary key fields. In this talk Michael presents an overview of all of these exciting changes, as well as several concrete, real-world examples of how applications can tap into these new features.
Streaming into Solr: Nearly Real Time Indexing and Search of Log Data
Presented by Jon Gifford, Co-Founder, Loggly
This talk will describe a streaming log file search system based on Solr that indexes data in real time and makes it searchable at most 10 seconds later. We describe the use of 0MQ to move data around the system, and the distributed shard management system based on SolrCloud/ZooKeeper that gives the system its elasticity. We take advantage of a number of non-traditional features of both the data and the expected search behavior to minimize the overall system size, while still allowing for very large indices and input rates (all going well) in excess of 100,000 events/second.
Power to the People: How We Listened to Users and Made Them Want to Hang Around
Presented by David Oliver, Manage My Life/Sears
Manage My Life is a going-on-four-years-long experiment in a community-centered experience by Sears Holdings Company. The site offers expert advice, articles and projects, and owner manuals, among other attractive content. But month after month, users would find their way to our site-wide search page, find nothing useful, and promptly leave. Our site is implemented in Ruby on Rails, but our search implementation was a home-grown search engine and API (using Lucene libraries under the hood) written in Java, coupled with a crawling and indexing scheme that was also hand-rolled and also in Java. The search experience on the site, which should have been a first-class citizen, was in reality the ugly step-child. We observed a steady “hard bounce rate” at or near 100% once users landed on our search page. The “exit rate” was also at or near 100%. We took user feedback--raw user behavior from Omniture as well as survey results--and combined a revamped user experience with the power of Solr to give users what they are actually looking for on the site. Find out the challenges we faced (both technical and otherwise), how we overcame them, and what our stakeholders and users are saying now.
Solr Powered: Revamping Search, Reshaping the Publishing Industry
Presented by Paul Oakes, Lulu
Lulu is creating a new model in publishing — open publishing — that empowers more creators to sell more content to more readers more profitably than ever before. We have more than a million creators registered from more than 200 countries and territories, and each month they add approximately 20,000 new works to our catalog.
Therein lies one of Lulu's big challenges: How to sort through all that content to quickly, effectively, and efficiently meet the needs of buyers. We needed a best-of-breed search and discovery platform; we chose Lucene and Solr.
Lulu's experience with Lucene and Solr has evolved over the past few years, and our index has grown many times over its earlier size. We've had our share of growing pains, and we've learned a lot from the challenges of integrating internal services into indexing.
With our modern implementation, Lulu has achieved remarkably faster and more meaningful search results, indexing times have been reduced by orders of magnitude, and because this project is open source, we have executed these improvements with minimal costs. Lulu has great plans for its future search and discovery experience, and we look forward to the benefits Lucene and Solr will continue to bring.
Realtime Search With Lucene
Presented by Michael Busch, Search Engineer, Twitter
Lucene has for some time had a nice feature that we call "Near-realtime search" (NRT). The approach works well for a lot of applications, but we're currently working on an even better real-time solution in Lucene: directly searching IndexWriter's RAM buffer while documents are being added! This will dramatically improve indexing performance compared to NRT, and the search latency (the time it takes for a document to become searchable) will shrink to a minimum - hence we will scratch the N in NRT!
This talk will discuss this new approach and give an overview of the current status of Lucene's realtime-search branch.
Size and Type Matters: Scale, Integration and Search
Presenters:
Jason Eiseman, Yale Law School
Daniel Lovins, Yale University
Jeffrey Barnett
Tom Burton-West, Retrieval Programmer, University of Michigan
Our panel of experts has practical experience with implementing and scaling Solr within the context of major university library systems. Jason Eiseman will highlight how Drupal and Solr were integrated to improve the search functionality and usability of Yale University Law School’s library website. His colleagues across campus, Jeffrey Barnett and Daniel Lovins, will discuss the use of the Lucene/Solr platform, integrated with ICU and language detection, and why it is the best way for the Yale Library to provide the same high standard of relevancy ranking and faceting with non-Roman scripts. Rounding out this discussion is Tom Burton-West of the University of Michigan Library, who will explain his experience with scaling Solr to provide full-text access to millions of books at a reasonable cost for the HathiTrust Large Scale Search project.
How Search Enables Cloud Applications
Presented by Mitch Stewart, Boomi
As more applications are moving into the Cloud, the need to organize and locate relevant data becomes a critical part of any Cloud Application. Cloud Users have become more tech-savvy and expect the application to respond quickly to search requests as well as allow for easy customizations. This talk covers how Boomi utilizes Solr to monitor its Cloud Integration solution, how search can be used to audit the data flowing between applications and the challenges and benefits of implementing Solr in a multi-tenant fashion.
Migration from a Commercial Search Platform (specifically FAST ESP) to Lucene/Solr
Presented by Michael McIntosh, VP, Enterprise Search Technologies, TNR Global
There are many reasons that an IT department with a large-scale search installation would want to move from a proprietary platform to Lucene/Solr. In the case of FAST Search, the company’s purchase by Microsoft and discontinuation of the Linux platform has created an urgency for FAST users.
This presentation will compare Lucene/Solr to FAST ESP on a feature basis, and as applied to an enterprise search installation. We will further explore how various advanced features of commercial enterprise search platforms can be implemented as added functions for Lucene/Solr. Actual cases will be presented describing how to map the various functions between systems.
Spatial Search Using Geohash Prefixes
Presented by David Smiley, Senior Software Developer, MITRE
Spatial search is a growing area in the Lucene/Solr community that is steadily progressing. One method used to accomplish spatial search involves geohashes. A geohash is a latitude/longitude geocode system in the public domain, described in detail on Wikipedia. Geohashes are strings that further narrow a latitude-longitude box on the earth with each added character. Given this property, Lucene’s inverted index is well suited as the basis for a geohash-based search filter. There are two challenges to such an implementation. First, two points that are near each other do not necessarily share a prefix, because of inevitable box boundary conditions. Second, an ideal implementation should be optimized to handle searches spanning a great number of points. This presentation will discuss such an implementation with closing thoughts on addressing performance for the latter case.
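To make the prefix property and the boundary problem concrete, here is a minimal Python sketch of the public-domain geohash encoding. It is an illustration only, not the implementation discussed in the talk; the function name `geohash_encode` and its `precision` parameter are assumptions made for this example.

```python
# Geohash base-32 alphabet (digits plus letters, without a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=8):
    """Encode a latitude/longitude pair as a geohash string.

    Each bit halves the current box, alternating between longitude and
    latitude (longitude first). Every 5 bits become one base-32 character,
    so each added character narrows the box further.
    """
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits = bits << 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits = bits << 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


# A longer hash of the same point always starts with the shorter hash.
short_hash = geohash_encode(42.6, -5.6, precision=5)
long_hash = geohash_encode(42.6, -5.6, precision=8)

# Two points a few metres apart, on opposite sides of the equator and
# prime meridian, fall into boxes with no common prefix at all.
north_east = geohash_encode(0.0001, 0.0001, precision=5)
south_west = geohash_encode(-0.0001, -0.0001, precision=5)
```

Because a term in an inverted index can be matched by prefix, all points inside a box can be found by looking up the box's geohash as a prefix. The last two values show why a prefix match alone is not sufficient: a filter also has to consider the boxes adjacent to the query area.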
LinkedIn's Distributed Realtime Faceting Search System Based on Lucene
Presented by John Wang, Search Architect, LinkedIn
LinkedIn is a high traffic consumer internet site and LinkedIn search, built on Lucene, is serving millions of queries on a daily basis.
The search problem at LinkedIn is unique in the following ways:
1) High volume with both unstructured and rich structured data
2) Different types of structured data, e.g. social graph
3) Data is changing in realtime
4) Constant expansion of the underlying corpus
This presentation will cover the challenges we have faced and the solutions we have come up with. It will also cover our future plans and the next steps we are taking to enhance our search system using Lucene.
PANGAEA - Providing Access to Geoscientific Data Using Apache Lucene Java
Presented by Uwe Schindler, Schindlers Software/PANGAEA
PANGAEA (Publishing Network for Geoscientific and Environmental Data, www.pangaea.de) is a data library for georeferenced data from earth system research operated in Open Access. Scientific primary data are long-term archived with related meta-information using a relational database. On top of this database, which is used for maintaining and curating the data in the backend, all data citations and corresponding documentation are searchable using Apache Lucene Java. Users are able to use conventional scoring Lucene queries as well as geographical filters to retrieve archived data sets. In this talk, Uwe Schindler presents the use of NumericRangeQueries in combination with custom scoring to create a map-based search and dynamic results display (possibly with live demo). Lucene is also used to quickly look up relations based on Digital Object Identifiers (DOIs) between these data citations and conventional research papers hosted by scientific publishers. Uwe will also present the XML-based workflow used for indexing content from the underlying relational database.
Closing Panel: Data Crossroads - At The Intersection Of Search And Open Source
As open source exercises greater and greater influence on the next generation of IT infrastructure, data, application, and deployment models, who's getting disrupted by open source Solr/Lucene search?
Now that open source has a firm foothold in the mainstream, is disruption of commercial legacy search companies - or their demise - inevitable?
What barriers is open source up against in going beyond its base of developer fans and enthusiasts to conquer new frontiers? Does it have what it takes to transform data and change enterprise search beyond recognition? Join noted blogger and industry guru Stephen E. Arnold as he puts these and other tough questions to veteran CEOs in the search and content processing industry: Eric Gries of Lucid Imagination, Charlie Hull of Lemur Consulting and Paul Doscher of Exalead USA. Don't miss this spirited discussion and debate.
Apache Lucene, Lucene, Apache Solr, Solr, Apache Hadoop and Hadoop are trademarks of The Apache Software Foundation.

