|Topic:||Numerical Entity Extraction from Text|
|Supervisor:||Mgr. Petr Baudiš, Ing. Jan Šedivý CSc.|
|Description:|| In our group, we are building the "brmson" system for question
answering. The answers to many questions, like "What is the distance of
the Earth from the Sun?", "What is the maximum height of a Japanese train?",
"How wide are the train rails?" or "What is the critical mass of
plutonium?", are numerical quantities. Sometimes they are stored in
a structured database that we can simply query, but all too often they
are embedded in free text such as Wikipedia articles.
The task here is to build an NLP system that scans massive amounts
of unstructured text (e.g. English Wikipedia) and extracts numerical
entities that specify such relations, like "Shinkansen network employs
standard gauge and maximum width of 3.40 m (11 ft 2 in) and maximum
height of 4.50 m (14 ft 9 in)"; the system should generate Shinkansen.width
and Shinkansen.height values from such text. Often (not always) the
values will be accompanied by units that can help infer the type of
relation (but they can come from different measurement systems).
However, notice that while kilograms will typically represent mass,
meters can represent distance, height, width and more; the
system will need to use other cues to distinguish these.
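To make the task concrete, here is a minimal Python sketch of the candidate-generation step only: finding number-plus-unit mentions, converting them to SI, and collecting the preceding words as cues. It is not part of brmson; the names `UNITS`, `Quantity` and `extract_quantities`, the small unit table and the three-word context window are all assumptions made for illustration. It is deliberately a hand-written baseline, so it shows what a learned system would have to improve on rather than how the thesis should be solved.

```python
import re
from typing import List, NamedTuple, Tuple

# Hypothetical unit table: unit symbol -> (dimension, factor to the SI base unit).
# A real system would need far more units and spelled-out forms ("meters").
UNITS = {
    "m": ("length", 1.0),
    "km": ("length", 1000.0),
    "ft": ("length", 0.3048),
    "in": ("length", 0.0254),
    "kg": ("mass", 1.0),
    "lb": ("mass", 0.45359237),
}


class Quantity(NamedTuple):
    value: float              # number as written in the text
    unit: str                 # unit symbol as written
    dimension: str            # what the unit alone tells us
    si_value: float           # value converted to the SI base unit
    context: Tuple[str, ...]  # lowercased words just before the mention


# Longest symbols first so that "km" is tried before "m".
_UNIT_ALT = "|".join(sorted(map(re.escape, UNITS), key=len, reverse=True))
# The lookbehind stops a match from starting inside a number such as "1,000".
QUANTITY_RE = re.compile(r"(?<![\d.,])(\d+(?:\.\d+)?)\s*(" + _UNIT_ALT + r")\b")


def extract_quantities(text: str, window: int = 3) -> List[Quantity]:
    """Return every number+unit mention with its SI value and preceding words."""
    found = []
    for match in QUANTITY_RE.finditer(text):
        value = float(match.group(1))
        unit = match.group(2)
        dimension, factor = UNITS[unit]
        # Alphabetic tokens before the mention; the last `window` are the cues.
        preceding = re.findall(r"[A-Za-z]+", text[: match.start()])
        context = tuple(word.lower() for word in preceding[-window:])
        found.append(Quantity(value, unit, dimension, value * factor, context))
    return found


if __name__ == "__main__":
    sentence = (
        "Shinkansen network employs standard gauge and maximum width of "
        "3.40 m (11 ft 2 in) and maximum height of 4.50 m (14 ft 9 in)"
    )
    for quantity in extract_quantities(sentence):
        print(quantity)
```

On the Shinkansen sentence, both metric values come back with the dimension "length"; only the context words ("width" versus "height") tell them apart, which is exactly the cue a learned model would have to pick up. The sketch also shows why fixed rules fall short: it treats "14 ft 9 in" as two separate mentions, and it would wrongly accept a phrase like "3 in the morning" as a length.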
The goal is not to build an extensive set of hard-coded heuristics,
but rather a flexible machine learning system that will infer the
extraction rules automatically; we will help you focus on the right
state-of-the-art algorithms and get up to speed on them. This is a hard
problem, but we do not have to fully solve it - some good progress is
more than enough for an excellent thesis!