- 09 Oct, 2014 1 commit
-
-
matjaz authored
-
- 07 Oct, 2014 2 commits
-
-
matjaz authored
Included OLD_HTML_OUTPUT in the main build, as this functionality is still required for LatinoInterfaces/ClowdFlows/TextFlows, which fails to build otherwise.
-
matjaz authored
Fixed the LatinoWorkflows2010 project so that it compiles with the latest VS2010 releases of the Latino, SemWeb and OpenNLP projects. Included the file config.cs and a reference to System.Web.Extensions.
-
- 19 May, 2014 1 commit
-
-
Miha authored
Separated code page detection and language detection
-
- 16 May, 2014 1 commit
-
-
Miha authored
-
- 15 May, 2014 1 commit
-
-
Miha authored
-
- 12 May, 2014 1 commit
-
-
Miha authored
Fixed links in README.md
-
- 06 May, 2014 2 commits
- 05 May, 2014 21 commits
-
-
Miha authored
-
saso authored
Added Avro persistence module to LatinoWorkflows2010.csproj. It provides reader and writer classes for streaming lwf documents to Avro files. Extensible classes are provided to add support for arbitrary schemas (e.g. HTML docs).
-
Miha authored
-
Miha authored
Moved boilerplate tag (skip tag) assessment from HtmlTokenizerComponent to UrlTreeBoilerplateRemoverComponent
-
Miha authored
HtmlTokenizerComponent: added additional skip tags (forms, embedded content, header, footer, nav, aside...)
-
Miha authored
-
Miha authored
UrlTreeBoilerplateRemoverComponent now takes linkToTextRatio into account (simple heuristic)
-
Miha authored
-
Miha authored
Fixed a bug in DocumentWriterComponent
-
Miha authored
First step towards improved boilerplate remover: HTML tokenizer now assigns isLink feature to text blocks
-
Miha authored
-
Miha authored
-
Miha authored
-
Miha authored
-
Miha authored
-
Miha authored
-
Miha authored
-
Miha authored
-
Miha authored
-
Miha authored
-
Miha authored
-
- 13 Oct, 2013 1 commit
-
-
matjaz authored
Added overloads of Document's WriteXmlCompressed and ReadXmlCompressed functions that accept stream parameters (previously the only option was a file).
-
- 17 Jun, 2013 3 commits
- 07 Jun, 2013 1 commit
-
-
Miha authored
DocumentWriterComponent: implemented retry-on-deadlock; bug fix (html -> xml in file names written to DB). Added .gitignore.
-
- 05 Jun, 2013 1 commit
-
-
Miha authored
-
- 01 Jun, 2013 1 commit
-
-
Miha authored
-
- 31 May, 2013 1 commit
-
-
Miha authored
-
- 21 May, 2013 1 commit
-
-
Miha authored
DocumentCorpusWriterComponent.cs: no longer writes HTMLs. DocumentWriterComponent.cs: now writes old corpus and document ID into DB.
-
- 09 May, 2013 1 commit
-
-
Miha authored
-