Pattern-Based Extraction of Addresses from Web Page Content

Asadi, S., Yang, G., Zhou, X., Shi, Y., Zhai, B. and Jiang, W.W-R. (2008). Pattern-Based Extraction of Addresses from Web Page Content. In: Zhang, Y., Yu, G., Bertino, E. and Xu, G., Progress in WWW Research and Development: 10th Asia Pacific Web Conference proceedings. 10th Asia Pacific Web Conference (APWeb 2008), Shenyang, China, (407-418). 26-28 April 2008. doi:10.1007/978-3-540-78849-2_41


Author Asadi, S.
Yang, G.
Zhou, X.
Shi, Y.
Zhai, B.
Jiang, W.W-R.
Title of paper Pattern-Based Extraction of Addresses from Web Page Content
Conference name 10th Asia Pacific Web Conference (APWeb 2008)
Conference location Shenyang, China
Conference dates 26-28 April 2008
Proceedings title Progress in WWW Research and Development: 10th Asia Pacific Web Conference proceedings
Journal name Progress in WWW Research and Development, Proceedings
Series Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Place of Publication Heidelberg, Germany
Publisher Springer
Publication Year 2008
Sub-type Fully published paper
DOI 10.1007/978-3-540-78849-2_41
Open Access Status DOI
ISBN 9783540788485
ISSN 0302-9743
1611-3349
Editor Zhang, Y.
Yu, G.
Bertino, E.
Xu, G.
Volume 4976
Start page 407
End page 418
Total pages 12
Language eng
Abstract/Summary Extraction of addresses and location names from Web pages is a challenging task for search engines. Traditional information extraction and natural language processing models remain unsuccessful in the context of the Web because of the uncontrolled, heterogeneous nature of Web resources as well as the effects of HTML and other markup tags. We describe a new pattern-based approach for extraction of addresses from Web pages. Both HTML and vision-based segmentations are used to increase the quality of address extraction. The proposed system uses several address patterns and a small table of geographic knowledge to hit addresses and then itemize them into smaller components. The experiments show that this model can extract and itemize different addresses effectively without large gazetteers or human supervision.
Subjects 2614 Theoretical Computer Science
1700 Computer Science
Keyword Address extraction
Web page analysis
Q-Index Code E1
Q-Index Status Confirmed Code
Institutional Status UQ
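
The idea summarised in the abstract, matching address patterns against page text and itemizing each hit with the help of a small geographic table, can be illustrated with a minimal sketch. This is not the authors' system: the regular expression, the `STATES` and `STREET_TYPES` tables, the Australian-style address layout, and the function name `extract_addresses` are all assumptions made for illustration, and the crude tag stripping stands in for the HTML and vision-based segmentation the paper describes.

```python
import re

# Hypothetical "small table of geographic knowledge" (illustrative only).
STATES = {"QLD": "Queensland", "NSW": "New South Wales", "VIC": "Victoria"}
STREET_TYPES = ["Street", "St", "Road", "Rd", "Avenue", "Ave", "Drive", "Dr"]

# One assumed address pattern: number, street name, street type,
# suburb, state abbreviation, four-digit postcode.
ADDRESS_RE = re.compile(
    r"(?P<number>\d+[A-Za-z]?)\s+"
    r"(?P<street>[A-Z][A-Za-z]*(?:\s+[A-Z][A-Za-z]*)*?)\s+"
    r"(?P<street_type>" + "|".join(STREET_TYPES) + r")\b\.?,?\s+"
    r"(?P<suburb>[A-Z][A-Za-z]*(?:\s+[A-Z][A-Za-z]*)*?),?\s+"
    r"(?P<state>" + "|".join(STATES) + r")\s+"
    r"(?P<postcode>\d{4})\b"
)


def strip_tags(html: str) -> str:
    """Replace markup tags with spaces and collapse whitespace.

    A deliberately crude stand-in for page segmentation.
    """
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()


def extract_addresses(html: str) -> list[dict[str, str]]:
    """Find addresses matching the pattern and itemize their components."""
    results = []
    for match in ADDRESS_RE.finditer(strip_tags(html)):
        parts = match.groupdict()
        # Expand the state abbreviation using the small lookup table.
        parts["state_name"] = STATES[parts["state"]]
        results.append(parts)
    return results


if __name__ == "__main__":
    page = "<p>Visit us at 12 Stanley Street,<br>South Brisbane QLD 4101</p>"
    for address in extract_addresses(page):
        print(address)
```

A single hand-written pattern like this covers only one address layout; a practical extractor would need several patterns and a segmentation step so that unrelated page blocks are not joined into one candidate string.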

 
Citation counts: TR Web of Science Citation Count Cited 3 times in Thomson Reuters Web of Science
Scopus Citation Count Cited 7 times in Scopus
Created: Fri, 17 Apr 2009, 21:35:05 EST by Ms Kimberley Nunes on behalf of School of Information Technol and Elec Engineering