|
|||||||||||||||||||||
|
|||||||||||||||||||||
ABSTRACT
Text, speech, images, video, DNA sequences provide information about entities that people can recognize when looking at a particular instance. But those entities and their attributes and relationships are not directly accessible to queries that join across types of sources. Information extraction methods based on supervised machine learning recognize mentions of entities and relationships of predefined types in different kinds of sources, which can then be used to answer some useful types of queries. However, supervised learning relies on hand-annotated training sets that are difficult to create and limit what types of entities and relationships can be joined for new applications. These limitations have prompted research into unsupervised extraction methods that rely on correlations among sources rather than hand-annotated training sets. While these methods are not yet as accurate as those based on supervised learning, they have the potential for a new query-by-example approach to information integration in which seed sets of query answers are expanded into ranked lists of potential answers by learning occurrence patterns from the seed answers. I will give examples of both types of methods from our research on biomedical information extraction, leading to some ideas on a possible convergence of search and databases through machine learning. |
|||||||||||||||||||||