# Detail of the student project

Topic: Numerical Entity Extraction from Text

Katedra kybernetiky

Mgr. Petr Baudiš, Ing. Jan Šedivý CSc.

DP

In our group, we are building the "brmson" system for question answering. The answers to many questions, such as "What is the distance of Earth from Sun?", "What is the maximum height of a Japanese train?", "How wide are the train rails?" or "What is the critical mass of plutonium?", are numerical quantities. Sometimes they are stored in a structured database that we can simply query, but all too often they are embedded in free text, such as Wikipedia articles.

The task here is to build an NLP system that scans massive amounts of unstructured text (e.g. English Wikipedia) and extracts numerical entities that specify such relations. For example, given "Shinkansen network employs standard gauge and maximum width of 3.40 m (11 ft 2 in) and maximum height of 4.50 m (14 ft 9 in)", the system should generate Shinkansen.width and Shinkansen.height values. Often (but not always) the values will be accompanied by units that can help infer the type of relation, though the units can come from different measurement systems. Note, however, that while kilograms will typically represent mass, meters can represent distance, height, width, and more; the system will need to use other cues to distinguish these.

The goal is not to build an extensive set of hard-coded heuristics, but rather a flexible machine learning system that infers the extraction rules automatically; we will help you focus on the right state-of-the-art algorithms and get up to speed on them. This is a hard problem, but we do not have to solve it fully: some good progress is more than enough for an excellent thesis!

18.11.2014
Responsible person: Petr Pošík
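
To make the target output of the task concrete, here is a minimal Python sketch of a hand-written baseline. It is not the learned system the project aims for, only an illustration of the input and output on the Shinkansen sentence from the description. The unit table, the cue-word list, the `MAX_GAP` threshold, and the function name `extract_relations` are all assumptions made for this example. The entity name is passed in by hand, because linking a sentence to its entity is outside the scope of the sketch.

```python
import re

# Hypothetical unit table: unit token -> (dimension, factor to the SI base unit).
UNITS = {
    "m": ("length", 1.0),
    "km": ("length", 1000.0),
    "ft": ("length", 0.3048),
    "in": ("length", 0.0254),
    "kg": ("mass", 1.0),
}

# Hypothetical cue words and the dimension each one is expected to carry.
# Several cues share the dimension "length", which is why the unit alone
# cannot decide between width, height, and distance.
CUE_DIMENSION = {
    "width": "length",
    "height": "length",
    "distance": "length",
    "mass": "mass",
}

# Assumed limit on the number of characters between a cue word and its value.
MAX_GAP = 20

CUE_RE = re.compile(r"\b(" + "|".join(CUE_DIMENSION) + r")\b", re.IGNORECASE)
NUMBER_UNIT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(km|kg|m|ft|in)\b")


def extract_relations(entity, text):
    """Return {"<entity>.<attribute>": value in SI base units} found in text."""
    relations = {}
    for cue in CUE_RE.finditer(text):
        attribute = cue.group(1).lower()
        # Take the first number-with-unit mention that follows the cue word.
        match = NUMBER_UNIT_RE.search(text, cue.end())
        if match is None or match.start() - cue.end() > MAX_GAP:
            continue
        dimension, factor = UNITS[match.group(2)]
        # Use the unit as a sanity check on the relation type.
        if dimension != CUE_DIMENSION[attribute]:
            continue
        relations[f"{entity}.{attribute}"] = float(match.group(1)) * factor
    return relations


sentence = (
    "Shinkansen network employs standard gauge and maximum width of "
    "3.40 m (11 ft 2 in) and maximum height of 4.50 m (14 ft 9 in)"
)
print(extract_relations("Shinkansen", sentence))
# {'Shinkansen.width': 3.4, 'Shinkansen.height': 4.5}
```

Rules like these break quickly on real text (different word order, missing cue words, values far from their cues), which is the motivation for learning the extraction rules instead. A baseline of this kind might still be useful for generating candidate mentions or features for a learned model.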