NoDoSE

NoDoSE – a tool for semi-automatically extracting structured and semistructured data from text documents. Often, interesting structured or semistructured data is stored not in database systems but in HTML pages, text files, or on paper. Data in these formats is not usable by standard query processing engines, so users need a way of extracting data from these sources into a DBMS or of writing wrappers around the sources. This paper describes NoDoSE, the Northwestern Document Structure Extractor, an interactive tool for semi-automatically determining the structure of such documents and then extracting their data. Using a GUI, the user hierarchically decomposes the file, outlining its interesting regions and then describing their semantics. This task is expedited by a mining component that attempts to infer the grammar of the file from the information the user has input so far. Once the format of a document has been determined, its data can be extracted into a number of useful forms. This paper describes both the NoDoSE architecture, which can be used as a test bed for structure mining algorithms in general, and the mining algorithms that have been developed by the author. The prototype, which is written in Java, is described, and experiences with parsing a variety of documents are reported.
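
The hierarchical decomposition described in the abstract can be pictured as a tree of labelled text spans. The following is a minimal sketch under stated assumptions, not NoDoSE's actual code or API: the names `Region`, `RegionDemo`, `add`, `extract` and `demo` are hypothetical, the sample document is made up for illustration, and the loop that marks regions merely stands in for what a user would outline in the GUI. The grammar-mining component is not modelled here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical region tree: each node labels a span of the document text. */
final class Region {
    final String label;
    final int start; // inclusive character offset
    final int end;   // exclusive character offset
    final List<Region> children = new ArrayList<>();

    Region(String label, int start, int end) {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("invalid span");
        }
        this.label = label;
        this.start = start;
        this.end = end;
    }

    /** Adds a sub-region; it must lie inside this region's span. */
    Region add(Region child) {
        if (child.start < start || child.end > end) {
            throw new IllegalArgumentException("child outside parent span");
        }
        children.add(child);
        return this;
    }

    /** Renders the tree as nested text; leaves carry the document text they cover. */
    String extract(String doc) {
        if (children.isEmpty()) {
            return label + "=" + doc.substring(start, end);
        }
        StringBuilder sb = new StringBuilder(label).append("{");
        for (int i = 0; i < children.size(); i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(children.get(i).extract(doc));
        }
        return sb.append("}").toString();
    }
}

final class RegionDemo {
    /** Builds a region tree over a tiny made-up document and extracts it. */
    static String demo() {
        String doc = "Smith, 1998\nJones, 2001\n";
        Region root = new Region("records", 0, doc.length());
        int pos = 0;
        // Assumption: one record per line, "name, year". A user would mark
        // these regions interactively; the loop only imitates that step.
        for (String line : doc.split("\n")) {
            int comma = line.indexOf(", ");
            Region rec = new Region("record", pos, pos + line.length());
            rec.add(new Region("name", pos, pos + comma));
            rec.add(new Region("year", pos + comma + 2, pos + line.length()));
            root.add(rec);
            pos += line.length() + 1; // skip the newline
        }
        return root.extract(doc);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Storing character offsets rather than copied strings keeps every region tied to the original file, so the same tree could in principle be rendered into different output forms by swapping out `extract`.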


References in zbMATH (referenced in 13 articles)


  1. Doermann, David (ed.); Tombre, Karl (ed.): Handbook of document image processing and recognition (2014)
  2. Qureshi, Pir Abdul Rasool; Memon, Nasrullah: Hybrid model of content extraction (2012)
  3. Fazzinga, Bettina; Flesca, Sergio; Tagarelli, Andrea: Schema-based Web wrapping (2011)
  4. Liu, Wei; Yan, Hualiang; Xiao, Jianguo: Automatically extracting user reviews from forum sites (2011)
  5. Li, Qing; Chen, Jing; Wu, Yipu: Algorithm for extracting loosely structured data records through digging strict patterns (2009)
  6. Cesario, Eugenio; Folino, Francesco; Locane, Antonio; Manco, Giuseppe; Ortale, Riccardo: Boosting text segmentation via progressive classification (2008)
  7. Becker, Simon M.; Haase, Thomas; Westfechtel, Bernhard: Model-based A-posteriori integration of engineering tools for incremental development processes (2005)
  8. Deng, Xu-Bin; Zhu, Yang-Yong: L-Tree match: a new data extraction model and algorithm for huge text stream with noises (2005)
  9. Corradini, Flavio; Mariani, Leonardo; Merelli, Emanuela: An agent-based approach to tool integration (2004)
  10. Corradini, Flavio; Mariani, Leonardo; Merelli, Emanuela: An agent-based approach to tool integration (2004)
  11. Corradini, Flavio; Mariani, Leonardo; Merelli, Emanuela: An agent-based approach to tool integration (2004)
  12. Rajaraman, Anand; Ullman, Jeffrey D.: Querying websites using compact skeletons (2003)
  13. Kushmerick, N.: Wrapper induction: Efficiency and expressiveness (2000)