A platform for community discussion. Free, open, simple.
Régis Hanol 501b19b6e0
FIX: server-side HtmlToMarkdown improvements (#9586)
TLDR; this commit vastly improves how whitespaces are handled when converting from HTML to Markdown.
It also adds support for converting HTML <table> elements to Markdown tables.

The previous 'remove_whitespaces!' method traversed the whole HTML tree and used a heuristic to remove
leading and trailing whitespaces whenever it was appropriate (ie. mostly before and after HTML block elements).

It was a good idea, but it was very limited and led to bad conversions when, for example, the HTML had leading whitespaces on several lines.
One such example can be found [here](https://meta.discourse.org/t/86782).

For various reasons, most of the whitespace in an HTML file is ignored when the page is displayed in a browser.
The rules that browsers follow are the [CSS White Space Processing Rules](https://www.w3.org/TR/css-text-3/#white-space-rules).
They can be quite complicated when you take into account RTL languages and various other tidbits, but they boil down to the following:

- Collapse whitespaces down to one space (0x20) inside an inline context (ie. nodes/tags that are displayed on the same line)
- Remove any leading/trailing whitespaces inside an inline context

One quick & dirty way of getting this 90% solved would be to do 'HTML.gsub!(/[[:space:]]+/, " ")'.
We would also need to hoist <pre> elements in order to not mess with their whitespaces.
Unfortunately, this solution lets some whitespaces creep in around HTML tags, which leads to more '.strip!' calls than I can bear.
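
To illustrate the quick & dirty approach and its shortcoming, here is a minimal sketch. The sample HTML string is made up for the example, and <pre> hoisting is left out entirely.

```ruby
# Collapse every run of whitespace into a single space, ignoring the HTML structure.
html = "<p>  Foo \n<i> Bar </i>\t42</p>"
collapsed = html.gsub(/[[:space:]]+/, " ")

puts collapsed
# Spaces survive right after <p>, right after <i> and right before </i>,
# which is where the extra '.strip!' calls would be needed.
```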

I decided to "emulate" the browser's handling of whitespaces and came up with a solution in 4 parts:

1. remove_not_allowed!

The HtmlToMarkdown library recursively "visits" all the nodes in the HTML in order to convert them to Markdown.
All the nodes that aren't handled by the library (eg. <script>, <style> or any non-textual HTML tags) are "swallowed".
In order to reduce the number of nodes visited, the 'remove_not_allowed!' method automatically deletes all the nodes
that have no "visitor" (ie. a 'visit_<tag>' method) defined.
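
The idea can be sketched on a toy tree. This is not the actual implementation: 'VNode' and 'ToyConverter' are hypothetical stand-ins for the parser's nodes and for the library class, and only two visitors are defined for the sake of the example.

```ruby
# Toy stand-in for a parsed HTML node (hypothetical, not a real parser class).
VNode = Struct.new(:name, :children)

class ToyConverter
  # Only tags with a visit_<tag> method are handled by this toy converter.
  def visit_p(node); end
  def visit_i(node); end

  # Drop every child that is neither a text node nor a tag with a visitor,
  # then recurse into the children that are left.
  def remove_not_allowed!(node)
    node.children.select! do |child|
      child.name == "text" || respond_to?("visit_#{child.name}")
    end
    node.children.each { |child| remove_not_allowed!(child) }
    node
  end
end

root = VNode.new("p", [
  VNode.new("text", []),
  VNode.new("script", [VNode.new("text", [])]),
  VNode.new("i", [VNode.new("text", [])]),
])

ToyConverter.new.remove_not_allowed!(root)
puts root.children.map(&:name).inspect
```

The <script> node is gone along with its subtree, so it is never visited during the conversion.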

2. remove_hidden!

This serves a similar purpose to the previous method (ie. reducing the number of nodes visited): there's no point trying to convert something that is hidden.
The 'remove_hidden!' method removes any node that was hidden using the "hidden" HTML attribute, some CSS, or a width or height equal to 0.
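
A rough sketch of that check on a toy tree follows. 'ANode' and 'hidden?' are hypothetical names, and the exact CSS declarations tested here (display: none, visibility: hidden) are an assumption about what "some CSS" covers.

```ruby
# Toy stand-in for an HTML element with attributes (hypothetical).
ANode = Struct.new(:name, :attrs, :children)

# A node counts as hidden if it carries the "hidden" attribute, an inline
# style that hides it, or a width/height attribute equal to 0.
def hidden?(node)
  style = node.attrs["style"].to_s
  node.attrs.key?("hidden") ||
    style.match?(/display\s*:\s*none|visibility\s*:\s*hidden/i) ||
    node.attrs["width"].to_s == "0" ||
    node.attrs["height"].to_s == "0"
end

# Remove hidden children, then recurse into the visible ones.
def remove_hidden!(node)
  node.children.reject! { |child| hidden?(child) }
  node.children.each { |child| remove_hidden!(child) }
  node
end

body = ANode.new("body", {}, [
  ANode.new("span", { "hidden" => "" }, []),
  ANode.new("div", { "style" => "display: none" }, []),
  ANode.new("img", { "width" => "0" }, []),
  ANode.new("p", {}, []),
])

remove_hidden!(body)
puts body.children.map(&:name).inspect
```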

3. hoist_line_breaks!

The 'hoist_line_breaks!' method is there to handle <br> tags. I know those tiny <br> don't do much but they can be quite annoying.
The <br> tags are inline elements but they visually work like a block element (ie. they create a new line).
If you have the following HTML "<i>Foo<br>Bar</i>", it ends up visually similar to "<i>Foo</i><br><i>Bar</i>".
The latter is much easier to process than the former, so that's what this method does.
The 'hoist_line_breaks!' method hoists <br> tags out of inline tags until their parent is a block element.
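
Here is a minimal sketch of the hoisting on a toy tree, assuming a small made-up list of inline tags. 'HNode', 'hoist' and 'to_html' are hypothetical helpers written for this example, not the library's code.

```ruby
# Toy stand-in for a parsed HTML node (hypothetical).
HNode = Struct.new(:name, :children, :text)

# Assumed list of inline tags, for the example only.
HOIST_INLINE = %w[a b i em strong span].freeze

# Returns the list of nodes that should replace `node` in its parent.
def hoist(node)
  return [node] if node.name == "text" || node.name == "br"

  kids = node.children.flat_map { |child| hoist(child) }
  node.children = kids

  # Block elements, and inline elements without any <br>, stay as they are.
  return [node] unless HOIST_INLINE.include?(node.name)
  return [node] if kids.none? { |kid| kid.name == "br" }

  # Split the inline element around each <br>, wrapping every run of
  # non-<br> children in a fresh copy of the inline tag.
  kids.slice_when { |a, b| a.name == "br" || b.name == "br" }.map do |group|
    group.first.name == "br" ? group.first : HNode.new(node.name, group, nil)
  end
end

def to_html(node)
  return node.text if node.name == "text"
  return "<br>" if node.name == "br"
  "<#{node.name}>#{node.children.map { |c| to_html(c) }.join}</#{node.name}>"
end

tree = HNode.new("p", [
  HNode.new("i", [
    HNode.new("text", [], "Foo"),
    HNode.new("br", [], nil),
    HNode.new("text", [], "Bar"),
  ], nil),
], nil)

hoisted = to_html(hoist(tree).first)
puts hoisted
```

Because each inline parent splits itself in turn, a <br> nested several inline tags deep keeps moving up until it sits directly under a block element.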

4. remove_whitespaces!

The "remove_whitespaces!" method is where all the whitespace removal happens. It's broken down into 4 methods as well:

- remove_whitespaces!
- is_inline?
- collapse_spaces!
- remove_trailing_space!

The 'remove_whitespaces!' method recursively walks the HTML tree (skipping <pre> tags).
If a node has any children, they will be chunked into groups of inline elements vs block elements.
For each chunk of inline elements, it will call the "collapse_spaces!" and "remove_trailing_space!" methods.
For each chunk of block elements, it will call "remove_whitespaces!" to keep walking the HTML tree recursively.

The "is_inline?" method determines whether a node is part of an inline context.
A node is inline iff it's a text node or it's an inline tag, but not <br>, and all its children are also inline.
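
The inline test and the chunking of children can be sketched as follows. The list of inline tags is an assumption made for the example, and 'CNode' is a hypothetical toy node, not the parser's class.

```ruby
# Toy stand-in for a parsed HTML node (hypothetical).
CNode = Struct.new(:name, :children)

# Assumed list of inline tags, for the example only.
CHUNK_INLINE = %w[a b i em strong span br].freeze

# A node is inline if it is a text node, or an inline tag other than <br>
# whose children are all inline too.
def is_inline?(node)
  return true if node.name == "text"
  CHUNK_INLINE.include?(node.name) &&
    node.name != "br" &&
    node.children.all? { |child| is_inline?(child) }
end

children = [
  CNode.new("text", []),
  CNode.new("i", [CNode.new("text", [])]),
  CNode.new("p", []),
  CNode.new("text", []),
  CNode.new("br", []),
  CNode.new("b", []),
]

# Group consecutive children by whether they belong to an inline context.
chunks = children.chunk { |child| is_inline?(child) }
                 .map { |inline, nodes| [inline, nodes.map(&:name)] }

chunks.each { |inline, names| puts "#{inline ? 'inline' : 'block '} #{names.inspect}" }
```

Treating <br> as non-inline is what makes it act as a boundary between two inline chunks.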

The "collapse_spaces!" method will collapse any kind of (white) space into a single space (" ") character, even across tags.
For example, if we have "  Foo \n<i> Bar </i>\t42", it will return "Foo <i>Bar </i>42".

Finally, the "remove_trailing_space!" method is there to remove any trailing space that might creep in at the end of the inline chunk.
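
The two steps can be sketched on a toy inline chunk like this. It is a simplified illustration, not the actual implementation: 'TNode' and 'render' are hypothetical, and the trailing-space removal only follows the last node of the chunk.

```ruby
# Toy stand-in for a parsed HTML node (hypothetical).
TNode = Struct.new(:name, :children, :text)

# Collapse whitespace runs to one space across a whole inline chunk.
# `was_space` tracks whether the previous text ended with a space (it starts
# as true so that leading whitespace is dropped) and is carried across tags.
def collapse_spaces!(nodes, was_space = true)
  nodes.each do |node|
    if node.name == "text"
      text = node.text.gsub(/[[:space:]]+/, " ")
      text = text.sub(/\A /, "") if was_space
      node.text = text
      was_space = text.end_with?(" ") unless text.empty?
    else
      was_space = collapse_spaces!(node.children, was_space)
    end
  end
  was_space
end

# Remove a single trailing space from the last text node of the chunk.
def remove_trailing_space!(nodes)
  last = nodes.last
  return if last.nil?
  if last.name == "text"
    last.text = last.text.sub(/ \z/, "")
  else
    remove_trailing_space!(last.children)
  end
end

def render(nodes)
  nodes.map do |n|
    n.name == "text" ? n.text : "<#{n.name}>#{render(n.children)}</#{n.name}>"
  end.join
end

chunk = [
  TNode.new("text", [], "  Foo \n"),
  TNode.new("i", [TNode.new("text", [], " Bar ")], nil),
  TNode.new("text", [], "\t42"),
]
collapse_spaces!(chunk)
remove_trailing_space!(chunk)
collapsed_chunk = render(chunk)
puts collapsed_chunk

trailing = [
  TNode.new("text", [], "Foo "),
  TNode.new("i", [TNode.new("text", [], " Bar ")], nil),
]
collapse_spaces!(trailing)
remove_trailing_space!(trailing)
trimmed_chunk = render(trailing)
puts trimmed_chunk
```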

This solution is not 100% bullet-proof.
It does not support RTL languages at all and has some caveats that I felt were not worth the work to get properly fixed.

FIX: better detection of hidden elements when converting HTML to Markdown
FIX: take into account the 'allowed_href_schemes' site setting when converting HTML <a> to Markdown
FIX: added support for 'mailto:' scheme when converting <a> from HTML to Markdown
FIX: added support for <img> dimensions when converting from HTML to Markdown
FIX: added support for <dl>, <dd> and <dt> when converting from HTML to Markdown
FIX: added support for multilines emphases, strongs and strikes when converting from HTML to Markdown
FIX: added support for <acronym> when converting from HTML to Markdown
DEV: remove unused 'sanitize' gem

Wow, did you just read all that?! Congratz, here's a cookie: 🍪.
2020-04-30 12:21:25 +02:00
.github/workflows Rename many .js.es6 files to .js 2020-03-12 13:29:55 -04:00
.tx Update translations 2020-04-20 11:37:59 +02:00
app FEATURE: More improvements to crawler and old browsers view 2020-04-30 12:07:51 +03:00
bin DEV: Add docker cleanup script to d/ folder 2020-03-01 12:09:07 -08:00
config FEATURE: More improvements to crawler and old browsers view 2020-04-30 12:07:51 +03:00
db DEV: correct drop logic for columns in post table 2020-04-30 12:58:09 +10:00
docs DOCS: Update DEVELOPER-ADVANCED.md (#9313) 2020-03-31 09:52:23 +11:00
images fix image location 2014-09-11 17:56:29 +10:00
lib FIX: server-side HtmlToMarkdown improvements (#9586) 2020-04-30 12:21:25 +02:00
log Initial release of Discourse 2013-02-05 14:16:51 -05:00
plugins Revert "HACK: Add dummy plugin folder" 2020-04-30 11:02:15 +03:00
public Revert "FIX: lower case URLs before comparing for embedding comments" 2020-01-23 20:36:05 +05:30
script DEV: stop freezing frozen strings 2020-04-30 16:48:53 +10:00
spec FIX: server-side HtmlToMarkdown improvements (#9586) 2020-04-30 12:21:25 +02:00
test DEV: Add acceptance tests for bookmarks with reminders (#9592) 2020-04-30 14:58:26 +10:00
vendor Improve support for old browsers (#9515) 2020-04-29 21:40:21 +03:00
.editorconfig Set trim_trailing_whitespace false for markdown 2016-06-25 22:29:01 +04:30
.eslintignore Support for transpiling .js files (#9160) 2020-03-11 09:43:55 -04:00
.eslintrc DEV: Ember linting - disallow Ember.* variable usage (#8782) 2020-02-05 10:14:42 -06:00
.git-blame-ignore-revs DEV: Make discourse-common an Ember addon. (#9578) 2020-04-29 12:18:21 -04:00
.gitattributes Use proper encoding for email fixtures. 2018-02-21 17:06:35 +08:00
.gitignore Revert "HACK: Add dummy plugin folder" 2020-04-30 11:02:15 +03:00
.prettierignore DEV: Prettify *.en_US.yml files 2019-05-20 23:21:43 +02:00
.rspec DEV: Use --profile and --fail-fast in CI only 2019-03-11 22:04:47 -04:00
.rspec_parallel DEV: Introduce parallel rspec testing 2019-04-01 11:06:47 -04:00
.rubocop.yml DEV: Add rswag to aid in api documention (#9546) 2020-04-27 16:40:07 -06:00
.ruby-gemset.sample rvm has offically depreicated .rvmrc and recommends using .ruby-version and .ruby-gemset instead. 2013-05-23 09:16:11 -07:00
.ruby-version.sample Make version the same as install docs (#8713) 2020-01-14 12:33:37 +11:00
.template-lintrc.js FIX: template-lint uses strict rel-noopener rule which requires noreferrer (#9449) 2020-04-16 22:38:10 +02:00
.travis.yml Fix frontend tests on Travis (#8089) 2019-09-12 10:31:51 +10:00
adminjs Initial release of Discourse 2013-02-05 14:16:51 -05:00
Brewfile DEV: enable frozen string literal on all files 2019-05-13 09:31:32 +08:00
config.ru DEV: enable frozen string literal on all files 2019-05-13 09:31:32 +08:00
CONTRIBUTING.md Proper long form for CLA 2015-09-10 20:49:03 +02:00
COPYRIGHT.txt fix trademark 2013-06-27 09:38:15 +10:00
crowdin.yml DEV: Add configuration file for Crowdin 2020-02-12 22:45:17 +01:00
d add wrappers for mailcatcher and sidekiq 2016-12-13 09:05:45 +11:00
Dangerfile Rename many .js.es6 files to .js 2020-03-12 13:29:55 -04:00
discourse.sublime-project DEV: Exclude i18n .yml files from Sublime Text project. (#6473) 2018-10-10 20:21:24 +08:00
Gemfile FIX: server-side HtmlToMarkdown improvements (#9586) 2020-04-30 12:21:25 +02:00
Gemfile.lock FIX: server-side HtmlToMarkdown improvements (#9586) 2020-04-30 12:21:25 +02:00
jsapp Initial release of Discourse 2013-02-05 14:16:51 -05:00
lefthook.yml DEV: Run rubocop in parallel in pre-commit hook. 2020-04-29 13:48:59 +08:00
LICENSE.txt Initial release of Discourse 2013-02-05 14:16:51 -05:00
package.json DEV: Make discourse-common an Ember addon. (#9578) 2020-04-29 12:18:21 -04:00
Rakefile DEV: enable frozen string literal on all files 2019-05-13 09:31:32 +08:00
README.md Replace Travis build status with Github Actions status 2020-04-23 18:34:25 +02:00
yarn.lock DEV: Update jquery.fileupload and dependencies (#9466) 2020-04-28 10:39:29 -04:00

Discourse is the 100% open source discussion platform built for the next decade of the Internet. Use it as a:

  • mailing list
  • discussion forum
  • long-form chat room

To learn more about the philosophy and goals of the project, visit discourse.org.

Screenshots

Boing Boing

Mobile

Browse lots more notable Discourse instances.

Development

To get your environment set up, follow the community setup guide for your operating system.

  1. If you're on macOS, try the macOS development guide.
  2. If you're on Ubuntu, try the Ubuntu development guide.
  3. If you're on Windows, try the Windows 10 development guide.

If you're familiar with how Rails works and are comfortable setting up your own environment, you can also try out the Discourse Advanced Developer Guide, which is aimed primarily at Ubuntu and macOS environments.

Before you get started, ensure you have the following minimum versions: Ruby 2.6+, PostgreSQL 10+, Redis 4.0+. If you're having trouble, please see our TROUBLESHOOTING GUIDE first!

Setting up Discourse

If you want to set up a Discourse forum for production use, see our Discourse Install Guide.

If you're looking for business class hosting, see discourse.org/buy.

Requirements

Discourse is built for the next 10 years of the Internet, so our requirements are high.

Discourse supports the latest, stable releases of all major browsers and platforms:

Browsers         Tablets   Phones
Apple Safari     iPadOS    iOS
Google Chrome    Android   Android
Microsoft Edge
Mozilla Firefox

Built With

  • Ruby on Rails — Our back end API is a Rails app. It responds to requests RESTfully in JSON.
  • Ember.js — Our front end is an Ember.js app that communicates with the Rails API.
  • PostgreSQL — Our main data store is in Postgres.
  • Redis — We use Redis as a cache and for transient data.

Plus lots of Ruby Gems, a complete list of which is at /master/Gemfile.

Contributing

Discourse is 100% free and open source. We encourage and support an active, healthy community that accepts contributions from the public including you!

Before contributing to Discourse:

  1. Please read the complete mission statements on discourse.org. Yes, we actually believe this stuff; you should too.
  2. Read and sign the Electronic Discourse Forums Contribution License Agreement.
  3. Dig into CONTRIBUTING.md, which covers submitting bugs, requesting new features, preparing your code for a pull request, etc.
  4. Always strive to collaborate with mutual respect.
  5. Not sure what to work on? We've got some ideas.

We look forward to seeing your pull requests!

Security

We take security very seriously at Discourse; all our code is 100% open source and peer reviewed. Please read our security guide for an overview of security measures in Discourse, or if you wish to report a security issue.

The Discourse Team

The original Discourse code contributors can be found in AUTHORS.MD. For a complete list of the many individuals that contributed to the design and implementation of Discourse, please refer to the official Discourse blog and GitHub's list of contributors.

Copyright 2014 - 2020 Civilized Discourse Construction Kit, Inc.

Licensed under the GNU General Public License Version 2.0 (or later); you may not use this work except in compliance with the License. You may obtain a copy of the License in the LICENSE file, or at:

https://www.gnu.org/licenses/old-licenses/gpl-2.0.txt

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Discourse logo and “Discourse Forum” ®, Civilized Discourse Construction Kit, Inc.

Dedication

Discourse is built with love, Internet style.