Extract data from Wikipedia as cleanly as possible using Rails 3
I'm developing a Rails 3 application and want to be able to extract data (title and short text) for a topic from Wikipedia.
I need the info "clean", in other words free of HTML, wiki tags, and irrelevant data such as reference lists.
Is it possible to get just the title and text for a topic?
I'm using a gem, but the data it returns is ugly:
{{for|the television series|solsidan (tv series)}} {{infobox settlement |official_name = solsidan |image_skyline = |image_caption = |pushpin_map = sweden |pushpin_label_position = |coordinates_region = se |subdivision_type = [[country]] |subdivision_name = [[sweden]] |subdivision_type3 = [[municipalities of sweden|municipality]] |subdivision_name3 = [[nacka municipality]] |subdivision_type2 = [[counties of sweden|county]] |subdivision_name2 = [[stockholm county]] |subdivision_type1 = [[provinces of sweden|province]] |subdivision_name1 = [[uppland]] |area_footnotes = {{cite web | title=tätorternas landareal, folkmängd och invånare per km2 2000 och 2005 | publisher=[[statistics sweden]] | url=http://www.scb.se/statistik/mi/mi0810/2005a01b/t%c3%a4torternami0810tab1.xls | format=xls | language=swedish | accessdate=2009-05-08}} |area_total_km2 = 0.23 |population_as_of = 2005-12-31 |population_footnotes = |population_total = 209 |population_density_km2 = 895 |timezone = [[central european time|cet]] |utc_offset = +1 |timezone_dst = [[central european summer time|cest]] |utc_offset_dst = +2 |coordinates_display = display=inline,title |latd=59 |latm=17 |lats= |latns=n |longd=17 |longm=51 |longs= |longew=e |website = }} '''solsidan''' [[urban areas in sweden|locality]] situated in [[nacka municipality]], [[stockholm county]], [[sweden]] == references == {{reflist}} {{stockholm-geo-stub}} {{localities in nacka municipality}} [[category:populated places in stockholm county]] [[no:solsidan]] [[sv:solsidan, nacka kommun]]
Wikipedia provides regular dumps at Wikipedia:Database download, both as MySQL dumps in the schema used by MediaWiki and in an XML interchange format. You can load these onto your own server (~6 GiB download, ~30 GB uncompressed for the current text of English Wikipedia articles) and query or process them however you wish. The content is not pre-processed into HTML, so you can parse the wiki markup yourself and emit whatever you want around it. That page has lots of links to libraries in various languages for processing these dumps, though I don't see a Ruby one, so you might have to write that part yourself.
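Processing the wiki markup yourself can be roughed out with a few regexes. This is a naive sketch, not a real wikitext parser (nested templates, tables, and refs have many corner cases a proper parser handles), but it is enough to turn the kind of dump shown in the question into readable text:

```ruby
# Naive wikitext stripper -- a sketch, not a full parser.
def strip_wikitext(text)
  out = text.dup
  # Remove {{...}} templates, innermost first, until none remain
  # (handles the nested {{cite web ...}} inside the infobox).
  out.gsub!(/\{\{[^{}]*\}\}/, '') while out =~ /\{\{[^{}]*\}\}/
  # [[target|label]] -> label, then [[target]] -> target
  out.gsub!(/\[\[([^\[\]|]*)\|([^\[\]]*)\]\]/) { $2 }
  out.gsub!(/\[\[([^\[\]]*)\]\]/) { $1 }
  # Drop bold/italic quote markers ('''...''', ''...'')
  out.gsub!(/'{2,}/, '')
  # == Section heading == -> plain text
  out.gsub!(/^=+\s*(.*?)\s*=+\s*$/) { $1 }
  out.squeeze(' ').strip
end
```

For the example in the question this leaves the readable sentence about Solsidan plus some leftover fragments (category names, section titles), which you would filter further depending on how clean you need the result.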
There are various subsets provided. abstract.xml contains just titles and abstracts, which sounds like exactly what you want, and is only about 3 GB.
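A sketch of pulling titles and abstracts out of that file with Ruby's bundled REXML library (for a 3 GB file you would want a streaming parser instead of loading the whole document, and note the abstract dump prefixes titles with "Wikipedia: "; the `<doc><title>/<abstract>` layout shown is my understanding of the dump format, so verify against the actual file):

```ruby
require 'rexml/document'

# Extract {title:, abstract:} pairs from an abstract.xml-style document,
# assuming <feed> containing <doc> elements with <title> and <abstract>.
def titles_and_abstracts(xml)
  doc = REXML::Document.new(xml)
  doc.get_elements('//doc').map do |d|
    { title:    d.elements['title']&.text,
      abstract: d.elements['abstract']&.text }
  end
end
```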
See Wikipedia:Mirrors_and_forks for a discussion of the licensing requirements involved in reusing Wikipedia content.
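If downloading dumps is overkill, the MediaWiki API can also return a plain-text intro for a single page (`action=query&prop=extracts&exintro&explaintext`, provided by the TextExtracts extension). A minimal sketch; the exact JSON shape shown in the test is an assumption based on the standard API response format, so check it against a live response:

```ruby
require 'json'
require 'cgi'

# Build the API URL for a plain-text intro extract of one page.
def extract_url(title)
  "https://en.wikipedia.org/w/api.php?action=query&prop=extracts" \
    "&exintro=1&explaintext=1&format=json&titles=#{CGI.escape(title)}"
end

# Pull title and extract text out of the API's JSON response body.
def parse_extract(json_body)
  page = JSON.parse(json_body)['query']['pages'].values.first
  { title: page['title'], text: page['extract'] }
end
```

You would fetch `extract_url(...)` with Net::HTTP (or any HTTP client gem) and feed the body to `parse_extract`.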