Extract data from Wikipedia as cleanly as possible using Rails 3


I'm developing a Rails 3 application and want to be able to extract data (title and a short text) for a topic from Wikipedia.

I need the info to be "clean", in other words free from HTML, wiki tags, and irrelevant data such as reference lists.

Is it possible to get just the title and text for a topic?

I'm using a gem, but the data it returns is ugly:

{{for|the television series|solsidan (tv series)}} {{infobox settlement |official_name = solsidan |image_skyline = |image_caption = |pushpin_map = sweden |pushpin_label_position = |coordinates_region = se |subdivision_type = [[country]] |subdivision_name = [[sweden]] |subdivision_type3 = [[municipalities of sweden|municipality]] |subdivision_name3 = [[nacka municipality]] |subdivision_type2 = [[counties of sweden|county]] |subdivision_name2 = [[stockholm county]] |subdivision_type1 = [[provinces of sweden|province]] |subdivision_name1 = [[uppland]] |area_footnotes = {{cite web | title=tätorternas landareal, folkmängd och invånare per km2 2000 och 2005 | publisher=[[statistics sweden]] | url=http://www.scb.se/statistik/mi/mi0810/2005a01b/t%c3%a4torternami0810tab1.xls | format=xls | language=swedish | accessdate=2009-05-08}} |area_total_km2 = 0.23 |population_as_of = 2005-12-31 |population_footnotes = |population_total = 209 |population_density_km2 = 895 |timezone = [[central european time|cet]] |utc_offset = +1 |timezone_dst = [[central european summer time|cest]] |utc_offset_dst = +2 |coordinates_display = display=inline,title |latd=59 |latm=17 |lats= |latns=n |longd=17 |longm=51 |longs= |longew=e |website = }} '''solsidan''' [[urban areas in sweden|locality]] situated in [[nacka municipality]], [[stockholm county]], [[sweden]] == references == {{reflist}} {{stockholm-geo-stub}} {{localities in nacka municipality}} [[category:populated places in stockholm county]] [[no:solsidan]] [[sv:solsidan, nacka kommun]] 
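A rough first pass at cleaning output like the above can be done with regexes. This is my own sketch, not part of the question or answer, and wiki markup is not a regular language, so these patterns are approximations rather than a real parser:

```ruby
# Rough wikitext cleaner. Wiki markup is not a regular language,
# so these regexes are approximations, not a full parser.
def clean_wikitext(text)
  out = text.dup
  # Strip {{...}} templates, innermost first, looping to handle nesting.
  out.gsub!(/\{\{[^{}]*\}\}/, '') while out =~ /\{\{[^{}]*\}\}/
  # [[target|label]] -> label ; [[target]] -> target
  out.gsub!(/\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]/, '\1')
  # Bold/italic quote runs and section headings.
  out.gsub!(/'''?/, '')
  out.gsub!(/==+[^=]*==+/, '')
  out.squeeze(' ').strip
end
```

For anything beyond a quick-and-dirty pass, a dedicated wikitext parser will hold up better than regexes.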

Wikipedia provides regular data dumps at Wikipedia:Database download, both as MySQL dumps in the schema used by MediaWiki and in an XML interchange format. You can load these onto your own server (~6 GiB download, ~30 GB uncompressed for the current text of English Wikipedia articles) and query/process them however you wish. The content is not pre-processed into HTML, so you can parse the wiki markup yourself and emit whatever you want around it. That page has lots of links to libraries in various languages for processing these dumps, though I don't see a Ruby one; you might have to write that part yourself.
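Since there doesn't seem to be a ready-made Ruby library, here is a minimal sketch of my own for streaming pages out of such an XML dump with Ruby's stdlib REXML. The element names (`<page>`, `<title>`, `<revision><text>`) are assumed from the MediaWiki export schema; verify them against your actual dump:

```ruby
require 'rexml/streamlistener'
require 'rexml/parsers/streamparser'

# Streaming listener for a MediaWiki XML dump: collects each page's
# title and raw wikitext without loading the whole file into memory.
class PageListener
  include REXML::StreamListener
  attr_reader :pages

  def initialize
    @pages  = []
    @title  = nil
    @buffer = nil
  end

  def tag_start(name, _attrs)
    # Start capturing character data for elements we care about.
    @buffer = +'' if name == 'title' || name == 'text'
  end

  def text(data)
    @buffer << data if @buffer
  end

  def tag_end(name)
    @title = @buffer if name == 'title'
    @pages << { title: @title, text: @buffer } if name == 'text'
    @buffer = nil
  end
end
```

Usage: `listener = PageListener.new; REXML::Parsers::StreamParser.new(File.open("dump.xml"), listener).parse` and then read `listener.pages` (for a real 30 GB dump you would process each page as it arrives rather than accumulate them all).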

There are various subsets provided. abstract.xml contains titles and abstracts, which sounds like exactly what you want, and it's only about 3 GB.
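A sketch of pulling title/abstract pairs from that file, assuming the `<feed><doc><title/><abstract/></doc>` layout the abstract dumps use (worth verifying against the actual file):

```ruby
require 'rexml/document'

# Extract {title, abstract} pairs from a Wikipedia abstract dump.
# Layout assumed: <feed><doc><title/><url/><abstract/>...</doc></feed>.
def titles_and_abstracts(xml)
  REXML::Document.new(xml).get_elements('//doc').map do |d|
    { title: d.text('title').to_s, abstract: d.text('abstract').to_s }
  end
end
```

This builds a DOM, which is fine for experimenting on a slice of the file; for the full 3 GB dump you'd switch to a streaming parse along the same lines as above.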

See Wikipedia:Mirrors_and_forks for a discussion of the licensing requirements involved in reusing Wikipedia content.

